TapPass Model Strategy: Specialized Models & Finetuning
Current State
TapPass has three LLM-powered pipeline steps:
| Step | Purpose | Current model | Latency budget |
|---|---|---|---|
| classify_data (use_llm=true) | Semantic classification (PUBLIC/INTERNAL/CONFIDENTIAL/RESTRICTED) | gpt-4o-mini | 2-8s |
| smart_scan_output | Output scanning for leaked sensitive data | gpt-4o-mini | 2-8s |
| detect_injection | Prompt injection scoring | Regex only (no LLM) | <5ms |
These run inline on every request. Speed and cost are critical.
Part 1: Production-Ready Specialized Models
1.1 Data Classification (PUBLIC → RESTRICTED)
Best candidates (ranked by speed × accuracy):
| Model | Size | Latency (GPU) | Why |
|---|---|---|---|
| deberta-v3-large finetuned | 304M | 8-15ms | Microsoft’s best discriminative model. State of the art on NLI/classification. Finetuning on classification labels with ~2k examples gets >95% accuracy. Already used by Azure Purview / Microsoft Compliance. |
| ModernBERT-large | 395M | 10-18ms | Dec 2024 release. 8192-token context (vs BERT’s 512). Trained on 2T tokens. Flash Attention native. Best BERT-class model for long-context classification. |
| bge-reranker-v2-m3 as classifier | 568M | 15-25ms | BAAI’s multilingual reranker. Cross-encoder architecture gives excellent classification when finetuned. Handles 40+ languages natively: important for TapPass’s multilingual injection patterns. |
| Phi-3.5-mini (3.8B, quantized) | 3.8B | 30-60ms (int4) | Microsoft’s small LLM. Instruction-following means you can prompt it like the current gpt-4o-mini but run it locally. GGUF/GPTQ quantization fits on a single GPU. Good fallback when encoder models are uncertain. |
Recommendation: DeBERTa-v3-large, finetuned on your own classification labels.
Why:
- 8-15ms inference (vs 2-8s for gpt-4o-mini API call)
- Runs on CPU in production (no GPU required for this size)
- Deterministic: same input always gives same classification
- No API cost ($0 per call vs ~$0.0002 per gpt-4o-mini call)
- Privacy: data never leaves your infrastructure
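To make the cost difference concrete, a back-of-the-envelope comparison. The per-call price is the ~$0.0002 figure above; the request volume is an illustrative assumption, not a TapPass measurement:

```python
# Monthly cost of API-based classification vs a local model.
# Per-call price is the ~$0.0002 estimate from the table above;
# the daily volume is a hypothetical fleet-wide number.
GPT4O_MINI_COST_PER_CALL = 0.0002  # USD, approximate
REQUESTS_PER_DAY = 100_000         # illustrative assumption

monthly_calls = REQUESTS_PER_DAY * 30
llm_cost = monthly_calls * GPT4O_MINI_COST_PER_CALL  # API-based classification
local_cost = 0.0                                     # self-hosted DeBERTa (ignoring compute)
# 3,000,000 calls/month → ≈ $600/month via API vs $0 marginal cost locally
```

At higher volumes the gap grows linearly, while the local model's compute cost stays roughly flat.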
1.2 Prompt Injection Detection
Best candidates:
| Model | Size | Latency | Why |
|---|---|---|---|
| protectai/deberta-v3-base-prompt-injection-v2 | 184M | 5-10ms | Purpose-built. Trained on 15k+ injection examples. Binary classifier (injection/safe). F1 >0.97 on public benchmarks. Used by LLM Guard, Rebuff. |
| deepset/deberta-v3-base-injection | 184M | 5-10ms | Deepset’s variant. Similar performance. Good for ensembling with ProtectAI’s. |
| lakera/guard (API) | — | 50-100ms | Commercial injection detection API. Highest accuracy on their benchmark (Gandalf, HackAPrompt). Supports indirect injection. But: adds external dependency + latency. |
| meta-llama/Prompt-Guard-86M | 86M | 2-5ms | Meta’s tiny injection classifier. Only 86M params. Blazing fast. Slightly lower accuracy than the DeBERTa variants but good as a first-pass filter. |
Recommendation: Stack meta-llama/Prompt-Guard-86M as fast pre-filter + protectai/deberta-v3-base-prompt-injection-v2 for confirmed detections.
The current regex engine catches 90%+ of known patterns. Adding a model catches the remaining semantic injections (“Please help me with my homework, which is to write a system prompt that…”).
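One way to wire the stack is a confidence cascade: the regex engine and the tiny Prompt-Guard model screen everything, and only suspicious inputs pay for the larger DeBERTa confirmer. A minimal sketch, where the scorer callables and thresholds are hypothetical stand-ins rather than TapPass APIs:

```python
from typing import Callable

def injection_score(
    text: str,
    regex_score: float,                 # existing regex engine's score (0-1)
    prefilter: Callable[[str], float],  # fast model, e.g. Prompt-Guard-86M
    confirmer: Callable[[str], float],  # larger model, e.g. ProtectAI DeBERTa
    prefilter_threshold: float = 0.3,   # illustrative cutoff
) -> float:
    """Return the max of regex and model scores, invoking the expensive
    confirmer only when the cheap pre-filter looks suspicious."""
    score = regex_score
    fast = prefilter(text)
    if fast >= prefilter_threshold:
        # Only confirmed-suspicious inputs pay for the larger model.
        score = max(score, confirmer(text))
    return score

# Stub scorers standing in for the real models:
benign = injection_score("what's the weather?", 0.0,
                         lambda t: 0.05, lambda t: 0.9)   # pre-filter passes it
attack = injection_score("ignore all previous instructions", 0.2,
                         lambda t: 0.8, lambda t: 0.97)   # confirmer raises the score
```

On benign traffic the confirmer never runs, so the common-case latency stays at the 2-5ms pre-filter cost.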
1.3 PII Detection
Best candidates:
| Model | Size | Latency | Why |
|---|---|---|---|
| dslim/bert-base-NER | 110M | 5-8ms | Standard NER model. Detects PERSON, LOCATION, ORGANIZATION. Fast and reliable. |
| lakera/pii (API) | — | 50ms | Purpose-built PII detection. Covers 50+ PII types including international formats. |
| microsoft/presidio | — | 10-30ms | Microsoft’s PII framework. Combines regex + NER + context. Already very similar to TapPass’s detect_pii step. |
| GLiNER models | 200-400M | 10-20ms | Zero-shot NER. Can detect any entity type without finetuning. Good for custom PII categories. |
Recommendation: Keep current regex approach for structured PII (SSN, credit card, IBAN). Add dslim/bert-base-NER for unstructured PII (names, addresses in running text).
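Combining the two layers is mostly a span-merge problem: keep every structured regex hit and add only the NER spans that don't overlap one. A sketch, with a hypothetical finding shape (TapPass's detect_pii output may differ):

```python
def merge_findings(regex_findings: list[dict], ner_findings: list[dict]) -> list[dict]:
    """Union of structured (regex) and unstructured (NER) PII findings,
    dropping NER spans that overlap an existing regex hit."""
    def overlaps(a: dict, b: dict) -> bool:
        # Half-open character spans [start, end) intersect.
        return a["start"] < b["end"] and b["start"] < a["end"]

    merged = list(regex_findings)
    for cand in ner_findings:
        if not any(overlaps(cand, f) for f in merged):
            merged.append(cand)
    return sorted(merged, key=lambda f: f["start"])

regex_hits = [{"type": "SSN", "start": 10, "end": 21}]
ner_hits = [{"type": "PERSON", "start": 0, "end": 8},
            {"type": "SSN", "start": 12, "end": 20}]  # overlaps the regex hit → dropped
out = merge_findings(regex_hits, ner_hits)  # PERSON span + the regex SSN span
```

Preferring the regex hit on overlap keeps the higher-precision structured match as the source of truth.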
1.4 Output Scanning (Sensitive Data Leakage)
| Model | Size | Latency | Why |
|---|---|---|---|
| Same classification model | — | — | Reuse the data classification model on LLM output. If input was PUBLIC but output classifies as CONFIDENTIAL → leak detected. |
| sentence-transformers/all-MiniLM-L6-v2 | 23M | 2-5ms | Embedding similarity. Encode known secrets/PII patterns and check if output embeddings are close. Ultra-fast but approximate. |
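The "reuse the classifier" row reduces to an ordering check over the sensitivity levels: a leak is any output that classifies as more sensitive than its input. A sketch using the level names from the classification scheme above (the function itself is illustrative, not a TapPass API):

```python
# Sensitivity levels from least to most sensitive, per the scheme above.
LEVELS = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]

def is_leak(input_level: str, output_level: str) -> bool:
    """Flag when the LLM's output classifies as strictly more
    sensitive than the input that produced it."""
    return LEVELS.index(output_level) > LEVELS.index(input_level)

assert is_leak("PUBLIC", "CONFIDENTIAL")        # input PUBLIC, output CONFIDENTIAL → leak
assert not is_leak("CONFIDENTIAL", "INTERNAL")  # output less sensitive → fine
```

Running the same local classifier on both sides keeps input and output judgments consistent, which an LLM-based scanner cannot guarantee.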
Part 2: Finetuning Architecture
Section titled “Part 2: Finetuning Architecture”2.1 Data Collection from Feedback Loop
TapPass already has a detection_feedback audit event type. The flow:
User message → Pipeline detects → CISO reviews in UI → Marks as:
- true_positive (correct detection)
- false_positive (should not have flagged)
- false_negative (missed: CISO flags manually)
- escalate (needs investigation)

Schema for training data:
```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    text: str          # The input/output text
    label: str         # "PUBLIC" | "INTERNAL" | "CONFIDENTIAL" | "RESTRICTED"
    source: str        # "detect_pii" | "classify_data" | "detect_injection" | ...
    verdict: str       # "true_positive" | "false_positive" | "false_negative"
    confidence: float  # CISO's confidence in the label (0-1)
    context: dict      # Pipeline findings at time of detection
    created_at: str    # ISO timestamp
    org_id: str        # For org-specific models
```

2.2 Training Pipeline
```
                            ┌──────────────┐
Audit Trail ─► Feedback ─►  │   Export     │
  Events                    │   Training   │  (JSONL)
                            │   Data       │
                            └──────┬───────┘
                                   │
                      ┌────────────┴────────────┐
                      │                         │
                ┌─────▼─────┐             ┌─────▼─────┐
                │  Shared   │             │  Org-Spec │
                │   Base    │             │  Adapter  │
                │   Model   │             │  (LoRA)   │
                └─────┬─────┘             └─────┬─────┘
                      │                         │
                      └────────┬────────────────┘
                               │
                        ┌──────▼──────┐
                        │   Eval +    │
                        │  A/B Test   │
                        └──────┬──────┘
                               │
                        ┌──────▼──────┐
                        │   Deploy    │
                        │  (shadow    │
                        │   → prod)   │
                        └─────────────┘
```

2.3 Two-Tier Model Architecture
Tier 1: Shared base model (trained on aggregated, anonymized data from all orgs)
- DeBERTa-v3-large finetuned on data classification
- ProtectAI injection model as-is (already excellent)
- Updated quarterly with new patterns
Tier 2: Org-specific LoRA adapters (trained on each org’s feedback)
- LoRA rank 8-16 on top of the base model
- ~2-5MB per adapter (trivial to store/load)
- Captures org-specific patterns:
- “Project Aurora” is always RESTRICTED at Org A
- Internal server naming conventions at Org B
- Industry-specific jargon (HIPAA vs SOX vs PCI-DSS)
Why LoRA:
- 100-1000x less training data needed (50-200 examples per org)
- Training takes minutes, not hours
- Base model weights are frozen → no catastrophic forgetting
- Multiple adapters can coexist, loaded per-request based on org_id
- Adapter swapping is <1ms (just loading a small matrix)
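The per-request adapter loading above amounts to an LRU cache keyed by org and model. A sketch with the loader left as a hypothetical callback (real adapters would be fetched from storage and loaded via a library such as peft):

```python
from collections import OrderedDict
from typing import Any, Callable

class AdapterCache:
    """LRU cache of per-org LoRA adapters (~2-5MB each, so 100 in
    memory is cheap). The loader callback is a placeholder."""

    def __init__(self, loader: Callable[[str, str], Any], max_size: int = 100):
        self._loader = loader  # (org_id, model_name) → adapter weights
        self._cache: OrderedDict[str, Any] = OrderedDict()
        self._max = max_size

    def get(self, org_id: str, model_name: str) -> Any:
        key = f"{org_id}:{model_name}"
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        adapter = self._loader(org_id, model_name)
        self._cache[key] = adapter
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict least recently used
        return adapter

# Stub loader standing in for real adapter deserialization:
cache = AdapterCache(loader=lambda org, m: f"adapter:{org}:{m}", max_size=2)
cache.get("acme", "cls"); cache.get("beta", "cls"); cache.get("acme", "cls")
cache.get("gamma", "cls")  # cache full → evicts "beta" (least recently used)
```

A cache hit is a dict lookup, which is how the sub-millisecond adapter swap claim holds in the common case.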
2.4 Automated Retraining
```python
# Trigger conditions for retraining:
RETRAIN_TRIGGERS = {
    "feedback_count": 50,         # 50 new feedback events since last train
    "false_positive_rate": 0.15,  # FP rate exceeds 15%
    "max_age_days": 30,           # At least monthly
    "manual": True,               # CISO can trigger from UI
}
```

2.5 Shadow Mode Evaluation
Before promoting a new model to production:
- Deploy in shadow mode: runs alongside current model, results logged but not enforced
- Compare predictions on live traffic for 24-48 hours
- Generate comparison report:
- Accuracy vs current model
- Latency comparison
- Cases where new model disagrees with current
- CISO reviews and approves promotion
This maps directly to TapPass’s existing shadow_mode in enforcement_mode.
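The comparison report in step 3 is simple aggregation over the logged shadow predictions. A sketch, where the record shape is a hypothetical stand-in for TapPass audit events:

```python
def shadow_report(records: list[dict]) -> dict:
    """Summarize shadow-mode logs: agreement rate, latency delta,
    and the disagreements a CISO should review before promotion."""
    disagreements = [r for r in records if r["current"] != r["shadow"]]
    n = len(records)
    return {
        "total": n,
        "agreement_rate": (n - len(disagreements)) / n if n else 0.0,
        "avg_latency_delta_ms": (
            sum(r["shadow_ms"] - r["current_ms"] for r in records) / n if n else 0.0
        ),
        "disagreements": disagreements,  # cases where the new model differs
    }

# Two logged requests: one agreement, one disagreement to review.
logs = [
    {"current": "PUBLIC", "shadow": "PUBLIC",   "current_ms": 900,  "shadow_ms": 12},
    {"current": "PUBLIC", "shadow": "INTERNAL", "current_ms": 1100, "shadow_ms": 14},
]
report = shadow_report(logs)
```

A negative latency delta means the shadow model is faster, which is the expected result when replacing an API call with a local encoder.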
Part 3: Implementation Plan for TapPass
Section titled “Part 3: Implementation Plan for TapPass”Phase 1: Model Inference Service (Week 1-2)
Add a local model inference service that pipeline steps can call:
```
tappass/
  models/
    __init__.py
    inference.py         # ModelRegistry, load/predict interface
    adapters.py          # LoRA adapter management per org
    export_training.py   # Export feedback → training data
    config.py            # Model paths, GPU/CPU config
  pipeline/steps/
    detect_injection.py  # Add model scoring alongside regex
    classify_data.py     # Add local model alongside/replacing gpt-4o-mini
```

inference.py core design:
```python
import threading
from typing import Any

class ModelRegistry:
    """Singleton. Loads models lazily, manages LoRA adapters per org."""

    def __init__(self):
        self._models: dict[str, Any] = {}      # name → loaded model
        self._tokenizers: dict[str, Any] = {}  # name → tokenizer
        self._adapters: dict[str, Any] = {}    # "org:name" → LoRA adapter
        self._lock = threading.Lock()

    def predict(self, model_name: str, text: str, org_id: str = "default") -> dict:
        """Run inference. Returns {label, confidence, latency_ms}."""
        model = self._get_or_load(model_name)
        adapter = self._get_adapter(model_name, org_id)
        # ... tokenize, forward, softmax, return
```

Key design decisions:
- Models loaded lazily on first use (not at startup)
- Thread-safe with lock-free reads (model weights are immutable after load)
- ONNX Runtime for CPU inference (2-3x faster than PyTorch)
- Optional GPU via CUDA provider (auto-detected)
- Adapters loaded per-org, cached in LRU (max 100 adapters in memory)
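The "lazy load + lock-free reads" decision is double-checked locking: dict reads skip the lock entirely (loaded models are immutable), and the lock guards only the load path. A sketch, with the load function as a placeholder for the real ONNX session setup:

```python
import threading
from typing import Any

class LazyLoader:
    """Load each model exactly once, on first use; readers skip the lock."""

    def __init__(self, load_fn):
        self._load_fn = load_fn  # name → model; placeholder for ONNX session setup
        self._models: dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, name: str) -> Any:
        model = self._models.get(name)  # lock-free fast path
        if model is not None:
            return model
        with self._lock:
            # Re-check: another thread may have loaded it while we waited.
            if name not in self._models:
                self._models[name] = self._load_fn(name)
            return self._models[name]

loads = []  # record how many times the expensive load actually runs
loader = LazyLoader(lambda name: loads.append(name) or f"model:{name}")
loader.get("injection")
loader.get("injection")  # second call hits the cache; no reload
```

Startup stays fast because no weights are touched until a pipeline step actually needs that model.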
Phase 2: Feedback → Training Data Pipeline (Week 2-3)
```
# New API endpoint
POST /models/training-data/export
{
  "model_type": "classification",  # or "injection"
  "format": "huggingface",         # or "jsonl"
  "min_confidence": 0.8,
  "since": "2025-01-01"
}

# Returns downloadable dataset
{
  "examples": 1847,
  "label_distribution": {"PUBLIC": 892, "INTERNAL": 412, "CONFIDENTIAL": 398, "RESTRICTED": 145},
  "download_url": "/models/training-data/classification-20260227.jsonl"
}
```

Phase 3: LoRA Training + Deployment (Week 3-4)
```
# New API endpoint
POST /models/train
{
  "base_model": "deberta-v3-large-classification",
  "adapter_name": "org-acme-v1",
  "org_id": "acme",
  "epochs": 3,
  "lora_rank": 8,
  "eval_split": 0.15
}

# Returns training job status
{
  "job_id": "train_abc123",
  "status": "running",
  "progress": 0.45,
  "metrics": {"eval_accuracy": 0.94, "eval_f1": 0.92}
}
```

Phase 4: A/B Testing + Shadow Mode (Week 4-5)
Pipeline step config gets a new option:
{ "classify_data": { "model": "deberta-v3-large-classification", "adapter": "auto", "shadow_model": "gpt-4o-mini", "compare_mode": true }}When compare_mode is true, both models run and results are logged to audit for comparison.
Part 4: Scalable Inference Architecture
Section titled “Part 4: Scalable Inference Architecture”Single-node (< 100 agents)
```
TapPass process
└── ModelRegistry (in-process)
    ├── DeBERTa classifier (ONNX, CPU)  ~200MB RAM
    ├── Prompt-Guard-86M (ONNX, CPU)    ~100MB RAM
    ├── LoRA adapters (LRU cache)       ~50MB RAM
    └── Total: ~350MB additional RAM
```

All models run in-process. No external service needed. ONNX Runtime on CPU handles ~200 req/s for DeBERTa-class models on a 4-core machine.
Multi-node (100-1000 agents)
```
┌──────────────┐      ┌──────────────┐
│  TapPass 1   │      │  TapPass 2   │
│  (pipeline)  │      │  (pipeline)  │
└──────┬───────┘      └──────┬───────┘
       │                     │
       └──────┬──────────────┘
              │  gRPC / HTTP
       ┌──────▼──────┐
       │    Model    │
       │   Service   │  ← Dedicated inference service
       │  (GPU/CPU)  │    Runs models, manages adapters
       └─────────────┘    Horizontal scaling
```

Separate the model inference into a sidecar or dedicated service (a tappass-models container with GPU):
- gRPC for low latency (protobuf, binary)
- Batching: accumulate 5-10 requests over 5ms, run as batch → 3-5x throughput
- Model versions pinned per deployment
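The batching idea above (accumulate a few requests for a few milliseconds, run one forward pass) can be sketched with asyncio. The batch_infer callable stands in for the real model forward, and the sizes/timeouts are the illustrative numbers from the bullet:

```python
import asyncio
from typing import Callable

class MicroBatcher:
    """Accumulate requests up to max_batch or max_wait_ms, then run them
    as one batch: a few ms of added latency for 3-5x throughput."""

    def __init__(self, batch_infer: Callable[[list[str]], list[dict]],
                 max_batch: int = 8, max_wait_ms: float = 5.0):
        self._infer = batch_infer
        self._max_batch = max_batch
        self._max_wait = max_wait_ms / 1000
        self._pending: list[tuple[str, asyncio.Future]] = []
        self._timer: asyncio.Task | None = None

    async def predict(self, text: str) -> dict:
        fut = asyncio.get_running_loop().create_future()
        self._pending.append((text, fut))
        if len(self._pending) >= self._max_batch:
            if self._timer is not None:   # batch is full: flush immediately
                self._timer.cancel()
                self._timer = None
            self._flush()
        elif self._timer is None:         # first request starts the timer
            self._timer = asyncio.ensure_future(self._delayed_flush())
        return await fut

    async def _delayed_flush(self):
        await asyncio.sleep(self._max_wait)
        self._timer = None
        self._flush()

    def _flush(self):
        batch, self._pending = self._pending, []
        if not batch:
            return
        results = self._infer([text for text, _ in batch])  # one forward pass
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)

async def demo():
    # Stub "model" that just reports the batch size it saw.
    batcher = MicroBatcher(lambda texts: [{"n": len(texts)} for _ in texts],
                           max_batch=8, max_wait_ms=5)
    return await asyncio.gather(*(batcher.predict(f"req{i}") for i in range(3)))
```

In the demo, three concurrent requests arrive within the 5ms window and are served by a single batched forward pass.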
Large scale (1000+ agents)
```
              ┌─────────────────────────┐
              │     Model Registry      │
              │      (S3 / MinIO)       │
              │  base models + adapters │
              └────────────┬────────────┘
                           │
     ┌─────────────────────┼─────────────────────┐
     │                     │                     │
┌────▼────────┐     ┌──────▼──────┐     ┌───────▼─────┐
│  Inference  │     │  Inference  │     │  Inference  │
│   Node 1    │     │   Node 2    │     │   Node 3    │
│  GPU (T4)   │     │  GPU (T4)   │     │  CPU only   │
└─────────────┘     └─────────────┘     └─────────────┘
```

- Model weights stored in object storage (S3/MinIO)
- Inference nodes pull models on startup, cache locally
- LoRA adapters pulled on-demand per org (2-5MB each, cached LRU)
- Kubernetes HPA based on inference queue depth
- vLLM or TGI if using generative models (Phi-3.5)
Part 5: Model Comparison Matrix
| Metric | Current (gpt-4o-mini) | Finetuned DeBERTa | Finetuned + LoRA |
|---|---|---|---|
| Latency | 500-3000ms | 8-15ms | 10-18ms |
| Cost/call | ~$0.0002 | $0 | $0 |
| Privacy | Data sent to OpenAI | Local | Local |
| Accuracy (est.) | ~90% | ~94% | ~97% |
| Multilingual | ✅ Native | ✅ (DeBERTa-v3) | ✅ |
| Deterministic | ❌ | ✅ | ✅ |
| Offline capable | ❌ | ✅ | ✅ |
| Org-specific | ❌ | ❌ | ✅ |
| EU data residency | ❓ (depends on region) | ✅ guaranteed | ✅ guaranteed |
Part 6: Quick Win: Immediate Integration
The fastest path to integrate specialized models without changing the architecture:
Step 1: Add protectai/deberta-v3-base-prompt-injection-v2 to detect_injection
```python
# In detect_injection.py, after regex scoring:
if self.use_model:
    from tappass.models.inference import get_registry

    result = get_registry().predict("injection", text)
    if result["confidence"] > 0.8:
        model_score = result["confidence"]
        if model_score > max_score:
            max_score = model_score
        matches.append({
            "pattern": "model_injection_classifier",
            "score": model_score,
            "match": f"Model confidence: {model_score:.2f}",
            "source": "model",
        })
```

Step 2: Add local classification model to classify_data
Replace the _run_llm_classify method:
```python
async def _run_local_classify(self, ctx: PipelineContext) -> dict | None:
    from tappass.models.inference import get_registry

    text = extract_input_text(ctx)
    if not text or len(text) < 20:
        return None
    result = get_registry().predict(
        "classification", text, org_id=ctx.org_id
    )
    return {
        "classification": result["label"],
        "confidence": result["confidence"],
        "findings": [],
    }
```

This keeps the same interface: the rest of classify_data doesn’t change.
Summary
| Priority | Action | Impact | Effort |
|---|---|---|---|
| P1 | Add protectai/deberta-v3-base-prompt-injection-v2 alongside regex | Catches semantic injections regex misses. 5-10ms. | 2-3 days |
| P1 | Finetune DeBERTa-v3-large on classification labels | 200x faster than gpt-4o-mini, local, deterministic. | 1 week |
| P2 | Build feedback → training data export pipeline | Enables continuous improvement from CISO feedback. | 3-4 days |
| P2 | Add LoRA adapter system for org-specific models | Each org gets custom model trained on their data. | 1 week |
| P3 | Shadow mode model comparison | Safe rollout of new models with production traffic. | 3-4 days |
| P3 | Dedicated inference service (gRPC) | Required at scale (>100 agents). | 1-2 weeks |