An MCP pentest server with a 6-layer safety architecture

Published 2026-05-03 · 11 min read

The setup and origin

I wanted a safe-to-use AI assistant designed for pentesting and built around local AI models, so scan data and findings stay on the operator's machine. That matters in a European context: GDPR, client confidentiality clauses, and Swedish data-handling expectations make uploading raw scan output to a US-hosted LLM a non-starter for serious work. Its job is to process and triage the output of the pentesting tools it wraps, not to drive an engagement on its own.

Instead, I wanted it to read raw data (nmap XML), recognise that a scan result such as "ssh-2.0-OpenSSH_7.4 with CVE-2018-15473" is something of interest, and autonomously write a markdown finding for the user to act upon.

A Model Context Protocol (MCP) server seemed like the obvious shape for that, and I was especially curious how well it would work with modern local AI models running on my modest laptop. The idea was that these "smaller" models would get a restricted list of tools they could call, the tools would run as local subprocesses on the user's machine, and a structured response would come back as output to the user.

The problem was that almost every "MCP for security tools" project I looked at took one of three shapes:

Shell-out wrappers. They passed the AI's prompt directly to bash -c. That is exactly the kind of thing you do not want a language model anywhere near, especially in a professional pentest engagement.
Single-tool wrappers with no output discipline. One tool, a thin layer, and 60 KB of raw text dumped into context on every call. A local 32K-context model would forget what it was doing after two of those.
Kali-over-SSH bridges. Useful for the breadth of tools that come pre-installed, but heavyweight for most use cases and dependent on a working SSH connection just to function.

So I wrote my own. It's called Raven Nest. It's open source under Apache-2.0. Currently it wraps 22 security tools, including the Metasploit Framework, behind 43 MCP endpoints (16 tools and 34 endpoints when this post was first published). A few design decisions are worth talking about, plus three things that didn't go the way I planned.

What I'd been trying

Before settling on a from-scratch Rust implementation I tried three other things. None of them worked.

Manual copy-paste. Run a tool by hand in the terminal, paste the relevant lines into the chat, and ask the model to interpret them. Anyone who has used LLMs this way will likely agree that the workflow is fine for one or a few findings, but awful for a fuller, best-practice pentest. Consistency between sessions suffers. It is also more expensive, in both tokens and time, if you want both a list of findings and a proper security audit report at the end of the workday.
A thin Python wrapper that exposed run_command(cmd). This worked for about ten minutes. It also let the model decide to type sqlmap -u "http://example.com" --crawl=10 --level=5 --risk=3 --batch --tamper=... whenever it felt like it. I didn't trust that even with myself sitting at the keyboard, let alone unattended.
Existing MCP-for-security projects. Most of them have one of three problems: they pass user input straight to a shell, they expose every tool flag as a parameter, or they have no way to keep findings between calls. I wanted something where the AI could only pick from preset scan profiles, where the output was parsed and capped, and where findings persisted to disk in a structured form.

The thing I kept coming back to was that an LLM is not your enemy... but it's not exactly your friend either (surprise, surprise!). It has zero situational awareness about what subnet you're in or whose box you're allowed to touch. The defaults need to be tight enough that even an enthusiastic hallucination couldn't do any real, unwanted damage.

The six-layer safety design

Every tool call in Raven Nest passes through six checks. Any of them can refuse the call.

Allowlist. The config file lists tools by name. If sqlmap is not in allowed_tools, it simply isn't callable, full stop. Defaults to a small set; you opt in.
Input validation. Targets must parse as IP, hostname, CIDR, or URL. Shell metacharacters (;, |, `, $(), etc.) get rejected before anything else looks at the string.
Preset arguments. The AI never picks raw flags. Each tool exposes a small enum of scan types - for nmap that's quick (-T4 -F, the default), service (-sV), os (-O, root only), and vuln (-sV --script=vuln), and the wrapper assembles the actual argv. There is no extra_args: String parameter anywhere in the schema.
Execution containment. Every subprocess is spawned with a timeout and kill_on_drop. If the MCP client disconnects mid-scan, the child process dies with it. Concurrent scans are capped.
Output sanitisation. ANSI escape codes stripped, output truncated at a configurable limit, and the truncation is UTF-8 safe so a multi-byte character at the boundary doesn't corrupt the JSON response.
Quality assessment. A small classifier tags each result as Complete, Empty, Partial, or RateLimited (rate-limit and WAF indicators share the last variant). The AI sees the tag plus a one-line warning and can react instead of trying the same scan again with the same parameters.
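To make the preset-arguments layer concrete, here is a minimal sketch of how a scan-type enum can map to a fixed argv. The flags are the ones listed above; the names NmapPreset and nmap_argv are hypothetical and not taken from the Raven Nest source.

```rust
/// Hypothetical preset enum; the variants mirror the four nmap scan types
/// described in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NmapPreset {
    Quick,
    Service,
    Os,
    Vuln,
}

impl NmapPreset {
    /// Fixed flags per preset. Nothing in here comes from the model.
    fn flags(self) -> &'static [&'static str] {
        match self {
            NmapPreset::Quick => &["-T4", "-F"],
            NmapPreset::Service => &["-sV"],
            NmapPreset::Os => &["-O"],
            NmapPreset::Vuln => &["-sV", "--script=vuln"],
        }
    }
}

/// Assemble the argument vector for the subprocess.
/// Assumption: `target` has already passed target validation.
pub fn nmap_argv(preset: NmapPreset, target: &str) -> Vec<String> {
    let mut argv: Vec<String> = preset.flags().iter().map(|f| f.to_string()).collect();
    argv.push(target.to_string());
    argv
}
```

The resulting vector would be handed to something like std::process::Command::new("nmap").args(...), so no shell ever interprets the string. The model's only degree of freedom is which enum variant it names.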

None of this is subtle or novel. It is just defence in depth, applied to a class of input that happens to be a language model. The point is that you have to actually do all six layers. The path traversal check on wordlist arguments was one I almost didn't bother with because "well, the AI wouldn't ask for /etc/shadow", but it absolutely would! A 7B model on a bad day will happily ask for /etc/shadow after having seen a tutorial about it once.

Here's the shape of the metacharacter check, lifted from crates/raven-core/src/safety.rs:

pub fn validate_target(target: &str) -> Result<(), PentestError> {
    if target.is_empty() {
        return Err(PentestError::InvalidTarget("empty target".into()));
    }

    // URL form is parsed first - query strings may contain `&`, which is
    // banned in non-URL targets but safe inside a parsed URL.
    if (target.starts_with("http://") || target.starts_with("https://"))
        && let Ok(parsed) = url::Url::parse(target)
    {
        // Validate host + path only (query string is ignored).
        // ... reject if any BANNED char appears in host/path ...
        return Ok(());
    }

    // Plain target: full metacharacter check.
    const BANNED: &[char] = &[
        ';', '|', '&', '$', '`', '(', ')', '{', '}', '<', '>', '!', '\n',
    ];
    if let Some(c) = target.chars().find(|c| BANNED.contains(c)) {
        return Err(PentestError::InvalidTarget(format!(
            "forbidden character: '{c}'"
        )));
    }

    // Then IP, CIDR, hostname checks, in that order.
    Ok(())
}

The shape is boring, and compact on purpose. Boring is best here, and incomparably better than an alternative like Python's subprocess.run(..., shell=True).
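The output-sanitisation layer can be just as boring. As an illustration of the UTF-8-safe truncation mentioned in the layer list, here is a minimal sketch using only the standard library. The function name truncate_utf8 is hypothetical, and the real implementation may differ.

```rust
/// Truncate `s` to at most `max_bytes` bytes without splitting a
/// multi-byte UTF-8 character. Hypothetical helper, not project code.
pub fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Walk backwards until `end` sits on a character boundary.
    // Index 0 is always a boundary, so the loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Slicing a &str at a non-boundary index panics in Rust, so the backwards walk is what keeps a cap that lands in the middle of a character from taking the whole response down.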

The context budget tracker

This is the bit I'm happiest with, and it's the one that made the project stop being a toy.

Local models with 32K or 64K context windows hit a wall fast when you start running scans. A single nuclei scan against a moderately sized target produces several hundred lines of JSONL output. nmap XML is verbose. Run three scans in a row and your model has lost the thread.

Most "AI security tool" projects pretend this isn't a problem. Mine has a session-aware context budget tracker that scales every tool's output cap dynamically.

Full mode. All findings, all details, up to 8 KB per tool call. Active when context is below 40% consumed.
Compact mode. Top-N results, critical and high severity findings only. Triggers at 40% consumed.
Minimal mode. One-line summaries. Triggers at 70% consumed.

The trick is that this isn't a global flag. Each tool's output parser asks the budget tracker for its current cap on the way out. The nmap parser will return all open ports in full mode, just the ones with NSE-script CVE matches in compact mode, and a count plus top-3 services in minimal mode. nuclei drops everything below medium severity in compact mode.

When the budget is exhausted, the server returns a message telling the AI to stop scanning, save its findings, and call generate_report. This is what I mean by "session-aware": the wrapper knows what stage of the engagement it's in and can suggest the right next move instead of letting the model keep scanning into a brick wall.
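The mode thresholds described above can be sketched as a small pure function. This is an illustration under assumptions: the name mode_for is hypothetical, the unit of `used` and `total` is whatever the tracker counts, and exhaustion is modelled here as `None` rather than as an extra enum variant.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputMode {
    Full,
    Compact,
    Minimal,
}

/// Pick an output mode from the share of the context budget consumed.
/// Returns `None` once the budget is exhausted, which is the point where
/// the server would tell the model to stop scanning and write the report.
pub fn mode_for(used: usize, total: usize) -> Option<OutputMode> {
    if total == 0 || used >= total {
        return None;
    }
    // Integer percentage, rounded down: 39.9% still counts as "below 40%".
    let pct = used.saturating_mul(100) / total;
    Some(match pct {
        0..=39 => OutputMode::Full,
        40..=69 => OutputMode::Compact,
        _ => OutputMode::Minimal,
    })
}
```

Keeping the decision in one function means every parser asks the same question and gets the same answer, instead of each tool carrying its own idea of when to shrink.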

The Metasploit confirmation gate

Metasploit is in a separate trust tier. Disabled by default, you have to opt in via config, and even then there's a per-tool allowlist on top.

The interesting bit is the double-call confirmation gate. The AI calls msf_exploit with target details and module. The first call returns a "CONFIRM EXPLOIT" message that prints back the proposed module, target, port and payload, then asks the AI to call msf_exploit again with identical parameters to execute. The request is hashed; if the second call's hash doesn't match the pending one, no exploit fires. There is no confirm=true flag - the second call's parameters are themselves the confirmation.

This sounds a bit full-of-itself, but it actually solved a real problem in testing. Local models can hallucinate target IPs, especially when there are several hosts in the conversation context. Without the gate, a typo could end up firing an exploit at the wrong host. With the gate, the AI has to commit to the target twice in two separate tool calls, with the proposed action printed back to it in between. In every case where I deliberately mistyped a target, the gate caught it, because the model corrected itself when it saw the proposed command.
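A minimal sketch of the double-call idea, assuming a single pending slot and the standard library's DefaultHasher. All type and method names are hypothetical, the hash function actually used by the project is not stated in this post, and the behaviour on a mismatch (the new request replaces the pending one) is my assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical request shape: the four fields echoed in the confirmation.
#[derive(Debug, Clone, Hash, PartialEq, Eq)]
pub struct ExploitRequest {
    pub module: String,
    pub target: String,
    pub port: u16,
    pub payload: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GateOutcome {
    /// Nothing was executed; the proposal is echoed back to the model.
    ConfirmRequired(String),
    /// Second call with identical parameters: the caller may execute.
    Confirmed,
}

#[derive(Default)]
pub struct ConfirmGate {
    pending: Option<u64>,
}

impl ConfirmGate {
    fn fingerprint(req: &ExploitRequest) -> u64 {
        let mut hasher = DefaultHasher::new();
        req.hash(&mut hasher);
        hasher.finish()
    }

    /// The first call stores the fingerprint; only a second call with the
    /// same fingerprint is confirmed. Any mismatch restarts the handshake.
    pub fn submit(&mut self, req: &ExploitRequest) -> GateOutcome {
        let fp = Self::fingerprint(req);
        if self.pending.take() == Some(fp) {
            return GateOutcome::Confirmed;
        }
        self.pending = Some(fp);
        GateOutcome::ConfirmRequired(format!(
            "CONFIRM EXPLOIT: module={} target={} port={} payload={}. \
             Call msf_exploit again with identical parameters to execute.",
            req.module, req.target, req.port, req.payload
        ))
    }
}
```

Because the pending fingerprint is consumed on every call, a confirmed request cannot be replayed without going through the echo step again.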

Other Metasploit hardening:

Per-tool allowlist on top of the framework allowlist (you can enable msf_search without enabling msf_exploit).
Path-boundary blocklist on module names - rejects anything outside the standard module tree.
Session command filter on msf_sessions interact: a blocklist that rejects destructive prefixes (rm , del , format , mkfs , dd , shutdown, reboot, halt, poweroff, upload ) before any string is written to a live session, so a typo or a hallucination cannot wipe a foothold by accident.
Passwords redacted from RPC error messages because Metasploit is famously "gossipy" in its errors.
TLS certificate bypass restricted to loopback only (127.0.0.1, localhost, or ::1).
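The session command filter from the list above can be sketched as a plain prefix check. The prefixes are copied from the list, including the trailing spaces; the function name, the leading-whitespace trim, and the case folding are assumptions of this sketch.

```rust
/// Destructive prefixes as listed in the text. Trailing spaces matter:
/// "rm " blocks `rm -rf x` but not `rmdir x`.
const BLOCKED_PREFIXES: &[&str] = &[
    "rm ", "del ", "format ", "mkfs ", "dd ", "shutdown", "reboot", "halt", "poweroff", "upload ",
];

/// Returns true if the command must not be written to a live session.
/// Assumption: leading whitespace and letter case are normalised first.
pub fn is_blocked_session_command(cmd: &str) -> bool {
    let normalised = cmd.trim_start().to_ascii_lowercase();
    BLOCKED_PREFIXES.iter().any(|p| normalised.starts_with(p))
}
```

A prefix blocklist like this is a guard against accidents, not an adversarial control: it stops a hallucinated `rm -rf` but would not stop a deliberate bypass, which fits the stated goal of not wiping a foothold by mistake.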

I am not interested in your enthusiasm for "agentic pentesting". The framework is here to help me handle, process, and output scans faster and automatically, and not to drive a full-on engagement on its own.

Output parsers, not raw output

Every one of the 16 tools has a structured parser:

nmap XML is parsed into ports, services, NSE script matches, and CVE references sorted by CVSS.
nuclei JSONL is parsed by severity.
testssl output is parsed into vulnerability and certificate findings.
sqlmap output produces an injection type and parameter list.

This was the single biggest cost in the project, and the single biggest payoff. Every parser returns Option<String> and falls back to raw output when parsing fails, so a tool version mismatch doesn't break the wrapper. But when parsing works, what flows back into the model's context is a few hundred bytes of structured findings instead of 30 KB of ANSI-coloured tool noise.
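The parse-or-fall-back contract can be illustrated with a deliberately tiny example. The toy parser below reads plain "PORT STATE SERVICE" text lines and is not one of the project's parsers (the real nmap parser works on XML); both function names are hypothetical.

```rust
/// Toy parser: keep only open ports from "PORT STATE SERVICE" lines.
/// Returns `None` when nothing recognisable was found.
pub fn parse_open_ports(raw: &str) -> Option<String> {
    let lines: Vec<String> = raw
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let port = parts.next()?;
            let state = parts.next()?;
            let service = parts.next()?;
            (port.contains('/') && state == "open").then(|| format!("{port} {service}"))
        })
        .collect();
    if lines.is_empty() {
        None
    } else {
        Some(lines.join("\n"))
    }
}

/// Prefer the structured form; fall back to the raw text when the parser
/// gives up, so an unexpected tool version degrades instead of failing.
pub fn structured_or_raw(raw: &str, parser: fn(&str) -> Option<String>) -> String {
    parser(raw).unwrap_or_else(|| raw.to_string())
}
```

In this sketch the fallback path returns the raw text untouched; in practice it would still pass through the sanitisation and truncation layer before reaching the model.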

The cap on each parser scales with the budget tracker via a single helper:

// Method on SessionBudget. Full = 100%, Compact = 50%, Minimal = 25%.
// Compact and Minimal never return less than 3; Full returns the cap unchanged.
pub fn scale_cap(&self, full: usize) -> usize {
    match self.current_mode() {
        OutputMode::Full    => full,
        OutputMode::Compact => (full / 2).max(3),
        OutputMode::Minimal => (full / 4).max(3),
    }
}

Findings persistence and the OWASP mapping

A pentest is a stream of small discoveries that have to compose into a report. The MCP wrapper has explicit save_finding, get_finding, list_findings, and generate_report endpoints. Findings are persisted to disk as one JSON file per finding, keyed by UUID. The schema includes severity, target, tool, CVSS score, optional CVE, optional OWASP Top 10 category, evidence, and remediation.

generate_report produces a structured markdown document with table of contents, executive summary, severity breakdown table, methodology section (PTES), tools used (deduplicated), and numbered findings sorted by severity. If the AI tagged a finding with an owasp_category, that category is rendered as a labelled field on the entry, so a reviewer can pivot the report against OWASP Top 10 by hand or by grep. This is the artefact that actually leaves the engagement.

There is one subtle hardening detail here. Markdown report generation escapes user-supplied finding fields, because otherwise an attacker who can poison your scan output could inject markdown into your report. I added that one after writing a deliberately malicious nuclei template that put [click here](javascript:alert(1)) into the matched-at field. It worked. It doesn't any longer.
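One way to do that escaping is to backslash-escape the punctuation that markdown treats as syntax. The sketch below is an assumption about the approach, not the project's code; the function name and the exact character set are hypothetical.

```rust
/// Backslash-escape markdown-significant punctuation in an untrusted
/// field and flatten newlines, so the value renders as literal text.
pub fn escape_markdown(field: &str) -> String {
    let mut out = String::with_capacity(field.len());
    for c in field.chars() {
        match c {
            '\\' | '`' | '*' | '_' | '[' | ']' | '(' | ')' | '<' | '>' | '#' | '|' | '!' => {
                out.push('\\');
                out.push(c);
            }
            // A finding field should never be able to start a new block.
            '\n' | '\r' => out.push(' '),
            _ => out.push(c),
        }
    }
    out
}
```

Applied to the malicious matched-at value from the test template, the link syntax comes out as \[click here\]\(javascript:alert\(1\)\), which a markdown renderer shows as plain text rather than a clickable link.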

The hard parts and the mistakes

This project had three things that didn't go cleanly.

Wordlist path validation took three iterations. First version compared the wordlist path string-prefix against /usr/share/. Second version was bypassed by /usr/share/../../etc/shadow. Third version (in validate_file_path) rejects any path containing .. outright before the prefix check, alongside a shell-metacharacter sweep, so the bypass cannot reach the prefix logic at all. Path traversal bugs in security tooling are the most embarrassing kind of bug because the whole thing exists to prevent that class of problem.
The context budget tracker had a race condition. Two parsers were reading the budget and updating it without coordination. Result: occasional double-counting and premature jumps to compact mode. Fixed by switching the cumulative counters to AtomicUsize inside the SessionBudget struct (held as Arc<SessionBudget> on the server) and centralising the truncation in a single wrap_result() function. Lock-free updates for a counter beat a mutex any day; that is good knowledge to bring into similar future projects from day one.
Cookie jar persistence across context clears. I wanted authenticated scanning to survive an MCP session clear. The cookie jar should get saved to disk and reloaded automatically. What I missed was that the jar is shared with http_request calls, but not with subprocess tools like nikto, feroxbuster, sqlmap, etc. Each of them needs cookies passed via its own cookie parameter. That's now in the docs and in the AI-facing tool descriptions, but it tripped me up when I expected sqlmap to inherit the cookies that I'd just collected with http_request.
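The third iteration of the wordlist check can be sketched as three ordered steps: metacharacter sweep, outright rejection of "..", then the prefix check. This is an illustration, not the project's validate_file_path; the function name and the allowed_root parameter are hypothetical, and using Path::starts_with (which compares whole path components) is my choice rather than something stated in the post.

```rust
use std::path::Path;

/// Hypothetical wordlist path check following the order described above.
pub fn check_wordlist_path(path: &str, allowed_root: &str) -> Result<(), String> {
    // 1. Shell-metacharacter sweep, same character set as the target check.
    const BANNED: &[char] = &[
        ';', '|', '&', '$', '`', '(', ')', '{', '}', '<', '>', '!', '\n',
    ];
    if let Some(c) = path.chars().find(|c| BANNED.contains(c)) {
        return Err(format!("forbidden character: '{c}'"));
    }
    // 2. Reject any ".." before the prefix logic ever runs.
    if path.contains("..") {
        return Err("path traversal sequence".to_string());
    }
    // 3. Component-wise prefix check against the allowed root.
    if !Path::new(path).starts_with(allowed_root) {
        return Err("path outside allowed root".to_string());
    }
    Ok(())
}
```

This sketch does not resolve symlinks; a symlink placed under the allowed root and pointing elsewhere would still pass, so it only covers the string-level bypass described in the first item.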

What I'd do differently

If I were starting again I would treat the context budget tracker as a first-class architectural concern from the beginning, not as a feature added halfway through. Half the friction in Phase 1 came from output sizing being a per-tool decision instead of a centralised one, and refactoring 16 parsers to take a budget mode was the kind of change I would rather have done from the very start.

I would also write the AI-facing tool descriptions before the implementations. The thing the model sees - the description field in the MCP tool schema - is what determines whether it picks the right tool, the right preset, and the right parameter combination. I rewrote those three times after launch because the model(s) kept misreading them. Doing it last meant doing it under pressure with worse signal.

What's next

The current focus is on report quality. Markdown reports work and the OWASP grouping is genuinely useful, but the executive-summary risk rating is still hand-rolled and naive, and it needs a sturdier base. I want to bring in CVSS-based scoring with explicit assumptions about exposure and exploitability. The other thing on the list is a --dry-run mode that exercises every parser end to end with golden fixtures - currently each tool has unit tests against captured output, but a single cargo test --features golden would catch regressions across the whole stack at once.

Code is at github.com/tidynest/raven-nest-mcp. 226 tests across the three crates, design docs in docs/, safety architecture documented file-by-file. If you find a way around any of the six layers, please tell me. Contributions are welcome.
