
TapPass Model Strategy: Specialized Models & Finetuning

TapPass has three LLM-powered pipeline steps:

| Step | Purpose | Current model | Latency budget |
| --- | --- | --- | --- |
| classify_data (use_llm=true) | Semantic classification (PUBLIC/INTERNAL/CONFIDENTIAL/RESTRICTED) | gpt-4o-mini | 2-8s |
| smart_scan_output | Output scanning for leaked sensitive data | gpt-4o-mini | 2-8s |
| detect_injection | Prompt injection scoring | Regex only (no LLM) | <5ms |

These run inline on every request. Speed and cost are critical.


Part 1: Production-Ready Specialized Models

1.1 Data Classification (PUBLIC → RESTRICTED)

Best candidates (ranked by speed × accuracy):

| Model | Size | Latency (GPU) | Why |
| --- | --- | --- | --- |
| deberta-v3-large finetuned | 304M | 8-15ms | Microsoft’s best discriminative model. State-of-the-art on NLI/classification. Finetuning on classification labels with ~2k examples gets >95% accuracy. Already used by Azure Purview / Microsoft Compliance. |
| ModernBERT-large | 395M | 10-18ms | Dec 2024 release. 8192-token context (vs BERT’s 512). Trained on 2T tokens. Flash Attention native. Best BERT-class model for long-context classification. |
| bge-reranker-v2-m3 as classifier | 568M | 15-25ms | BAAI’s multilingual reranker. Cross-encoder architecture gives excellent classification when finetuned. Handles 40+ languages natively: important for TapPass’s multilingual injection patterns. |
| Phi-3.5-mini (quantized) | 3.8B | 30-60ms (int4) | Microsoft’s small LLM. Instruction-following means you can prompt it like the current gpt-4o-mini but run it locally. GGUF/GPTQ quantization fits on a single GPU. Good fallback when encoder models are uncertain. |

Recommendation: DeBERTa-v3-large, finetuned on your own classification labels.

Why:

  • 8-15ms inference (vs 2-8s for gpt-4o-mini API call)
  • Runs on CPU in production (no GPU required for this size)
  • Deterministic: same input always gives same classification
  • No API cost ($0 per call vs ~$0.0002 per gpt-4o-mini call)
  • Privacy: data never leaves your infrastructure

1.2 Prompt Injection Detection

Best candidates:

| Model | Size | Latency | Why |
| --- | --- | --- | --- |
| protectai/deberta-v3-base-prompt-injection-v2 | 184M | 5-10ms | Purpose-built. Trained on 15k+ injection examples. Binary classifier (injection/safe). F1 >0.97 on public benchmarks. Used by LLM Guard, Rebuff. |
| deepset/deberta-v3-base-injection | 184M | 5-10ms | Deepset’s variant. Similar performance. Good for ensembling with ProtectAI’s. |
| lakera/guard (API) | - | 50-100ms | Commercial injection detection API. Highest accuracy on their benchmark (Gandalf, HackAPrompt). Supports indirect injection. But: adds external dependency + latency. |
| meta-llama/Prompt-Guard-86M | 86M | 2-5ms | Meta’s tiny injection classifier. Only 86M params. Blazing fast. Slightly lower accuracy than DeBERTa variants but good as a first-pass filter. |

Recommendation: Stack meta-llama/Prompt-Guard-86M as a fast pre-filter + protectai/deberta-v3-base-prompt-injection-v2 to confirm detections.

The current regex engine catches 90%+ of known patterns. Adding a model catches the remaining semantic injections (“Please help me with my homework, which is to write a system prompt that…”).
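A minimal sketch of that two-stage stack. The two classifiers are stood in by plain callables returning an injection probability in [0, 1]; the thresholds are illustrative, not tuned values:

```python
def score_injection(text, fast_model, confirm_model,
                    pre_threshold=0.5, confirm_threshold=0.8):
    """Two-stage cascade: a cheap model filters, a stronger model confirms.

    fast_model / confirm_model are stand-ins for Prompt-Guard-86M and the
    ProtectAI DeBERTa classifier; each maps text -> injection probability.
    """
    fast_score = fast_model(text)
    if fast_score < pre_threshold:
        # The fast model says benign: skip the heavier model entirely.
        return {"injection": False, "score": fast_score, "stage": "fast"}
    confirm_score = confirm_model(text)
    return {
        "injection": confirm_score >= confirm_threshold,
        "score": confirm_score,
        "stage": "confirm",
    }
```

Most traffic exits at the 2-5ms fast stage; only suspicious inputs pay the DeBERTa cost.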

1.3 PII Detection

Best candidates:

| Model | Size | Latency | Why |
| --- | --- | --- | --- |
| dslim/bert-base-NER | 110M | 5-8ms | Standard NER model. Detects PERSON, LOCATION, ORGANIZATION. Fast and reliable. |
| lakera/pii (API) | - | ~50ms | Purpose-built PII detection. Covers 50+ PII types including international formats. |
| microsoft/presidio | - | 10-30ms | Microsoft’s PII framework. Combines regex + NER + context. Already very similar to TapPass’s detect_pii step. |
| GLiNER models | 200-400M | 10-20ms | Zero-shot NER. Can detect any entity type without finetuning. Good for custom PII categories. |

Recommendation: Keep current regex approach for structured PII (SSN, credit card, IBAN). Add dslim/bert-base-NER for unstructured PII (names, addresses in running text).
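A sketch of the hybrid merge, assuming both detectors emit (start, end, label) spans (the span format and the overlap policy are assumptions, not TapPass code):

```python
def merge_pii_findings(regex_hits, ner_hits):
    """Combine structured regex PII hits with NER hits on free text.

    Each hit is a (start, end, label) tuple. Regex wins on overlap,
    since it is exact for structured types (SSN, credit card, IBAN);
    NER fills in names and addresses the patterns cannot see.
    """
    merged = list(regex_hits)
    for start, end, label in ner_hits:
        overlaps = any(start < r_end and r_start < end
                       for r_start, r_end, _ in regex_hits)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)
```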

1.4 Output Scanning (Sensitive Data Leakage)

| Model | Size | Latency | Why |
| --- | --- | --- | --- |
| Same classification model (reused) | - | - | Reuse the data classification model on LLM output. If input was PUBLIC but output classifies as CONFIDENTIAL → leak detected. |
| sentence-transformers/all-MiniLM-L6-v2 | 23M | 2-5ms | Embedding similarity. Encode known secrets/PII patterns and check whether output embeddings are close. Ultra-fast but approximate. |
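The embedding-similarity check reduces to a dot product when vectors are L2-normalised (as all-MiniLM-L6-v2 embeddings typically are). A sketch with the encoder abstracted away; the threshold is illustrative:

```python
import numpy as np

def leaks_known_secret(output_vec, secret_vecs, threshold=0.85):
    """Flag an output whose embedding sits close to any known-secret embedding.

    output_vec: (d,) embedding of the LLM output.
    secret_vecs: (n, d) matrix of pre-computed secret embeddings.
    All vectors are assumed unit-norm, so cosine similarity == dot product.
    """
    sims = secret_vecs @ output_vec  # one cosine similarity per secret
    return bool(np.max(sims) >= threshold)
```

Secrets are embedded once at index time, so the per-request cost is a single matrix-vector product.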

Part 2: Finetuning

TapPass already has a detection_feedback audit event type. The flow:

User message → Pipeline detects → CISO reviews in UI → Marks as:
- true_positive (correct detection)
- false_positive (should not have flagged)
- false_negative (missed: CISO flags manually)
- escalate (needs investigation)

Schema for training data:

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    text: str          # The input/output text
    label: str         # "PUBLIC" | "INTERNAL" | "CONFIDENTIAL" | "RESTRICTED"
    source: str        # "detect_pii" | "classify_data" | "detect_injection" | ...
    verdict: str       # "true_positive" | "false_positive" | "false_negative"
    confidence: float  # CISO's confidence in the label (0-1)
    context: dict      # Pipeline findings at time of detection
    created_at: str    # ISO timestamp
    org_id: str        # For org-specific models
```

```
Audit Trail ──► Feedback ──► ┌──────────────┐
  (JSONL)       Events       │   Export     │
                             │   Training   │
                             │     Data     │
                             └──────┬───────┘
                     ┌──────────────┴──────────────┐
                     │                             │
               ┌─────▼─────┐                 ┌─────▼─────┐
               │  Shared   │                 │  Org-Spec │
               │   Base    │                 │  Adapter  │
               │   Model   │                 │  (LoRA)   │
               └─────┬─────┘                 └─────┬─────┘
                     │                             │
                     └──────────────┬──────────────┘
                             ┌──────▼──────┐
                             │   Eval +    │
                             │  A/B Test   │
                             └──────┬──────┘
                             ┌──────▼──────┐
                             │   Deploy    │
                             │  (shadow    │
                             │   → prod)   │
                             └─────────────┘
```

Tier 1: Shared base model (trained on aggregated, anonymized data from all orgs)

  • DeBERTa-v3-large finetuned on data classification
  • Protectai injection model as-is (already excellent)
  • Updated quarterly with new patterns

Tier 2: Org-specific LoRA adapters (trained on each org’s feedback)

  • LoRA rank 8-16 on top of the base model
  • ~2-5MB per adapter (trivial to store/load)
  • Captures org-specific patterns:
    • “Project Aurora” is always RESTRICTED at Org A
    • Internal server naming conventions at Org B
    • Industry-specific jargon (HIPAA vs SOX vs PCI-DSS)

Why LoRA:

  • 100-1000x less training data needed (50-200 examples per org)
  • Training takes minutes, not hours
  • Base model weights are frozen → no catastrophic forgetting
  • Multiple adapters can coexist, loaded per-request based on org_id
  • Adapter swapping is <1ms (just loading a small matrix)
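The per-request adapter path can be sketched as a small LRU cache keyed by org_id. The loader callable is a stand-in for reading LoRA weights from disk or object storage; the class name and sizes are illustrative:

```python
from collections import OrderedDict

class AdapterCache:
    """LRU cache of per-org LoRA adapters.

    Adapters are ~2-5MB each, so keeping ~100 resident is cheap;
    swapping one in is just a dict lookup on the hot path.
    """
    def __init__(self, loader, max_size=100):
        self._loader = loader
        self._max = max_size
        self._cache = OrderedDict()

    def get(self, org_id):
        if org_id in self._cache:
            self._cache.move_to_end(org_id)  # mark as recently used
            return self._cache[org_id]
        adapter = self._loader(org_id)
        self._cache[org_id] = adapter
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict least-recently used
        return adapter
```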
```python
# Trigger conditions for retraining:
RETRAIN_TRIGGERS = {
    "feedback_count": 50,         # 50 new feedback events since last train
    "false_positive_rate": 0.15,  # FP rate exceeds 15%
    "max_age_days": 30,           # At least monthly
    "manual": True,               # CISO can trigger from UI
}
```
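Evaluating the triggers is a simple any-of check. A sketch, where the `stats` fields are assumed counters aggregated from the audit trail since the last training run:

```python
def should_retrain(stats, triggers):
    """Return True if any retrain trigger fires.

    `stats` keys (feedback_count, false_positive_rate,
    days_since_last_train, manual_request) are assumptions about
    how the audit aggregation is shaped.
    """
    if stats.get("manual_request") and triggers["manual"]:
        return True
    if stats["feedback_count"] >= triggers["feedback_count"]:
        return True
    if stats["false_positive_rate"] > triggers["false_positive_rate"]:
        return True
    return stats["days_since_last_train"] >= triggers["max_age_days"]
```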

Before promoting a new model to production:

  1. Deploy in shadow mode: runs alongside current model, results logged but not enforced
  2. Compare predictions on live traffic for 24-48 hours
  3. Generate comparison report:
    • Accuracy vs current model
    • Latency comparison
    • Cases where new model disagrees with current
  4. CISO reviews and approves promotion

This maps directly to TapPass’s existing shadow_mode in enforcement_mode.
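The shadow run itself is small: both models predict, only the incumbent's verdict is enforced, and disagreements are logged for the comparison report. A sketch with hypothetical names; `log` stands in for the audit-trail writer:

```python
def shadow_compare(primary_fn, shadow_fn, text, log):
    """Run the challenger in shadow mode.

    primary_fn / shadow_fn each return {"label", "confidence"}.
    Only the primary result is enforced; the shadow result is
    recorded when it disagrees.
    """
    primary = primary_fn(text)
    shadow = shadow_fn(text)
    if shadow["label"] != primary["label"]:
        log({"event": "shadow_disagreement",
             "primary": primary, "shadow": shadow})
    return primary  # enforcement always follows the current model
```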


Phase 1: Model Inference Service (Week 1-2)

Add a local model inference service that pipeline steps can call:

```
tappass/
  models/
    __init__.py
    inference.py          # ModelRegistry, load/predict interface
    adapters.py           # LoRA adapter management per org
    export_training.py    # Export feedback → training data
    config.py             # Model paths, GPU/CPU config
  pipeline/steps/
    detect_injection.py   # Add model scoring alongside regex
    classify_data.py      # Add local model alongside/replacing gpt-4o-mini
```

inference.py core design:

```python
import threading
from typing import Any

class ModelRegistry:
    """Singleton. Loads models lazily, manages LoRA adapters per org."""

    def __init__(self):
        self._models: dict[str, Any] = {}      # name → loaded model
        self._tokenizers: dict[str, Any] = {}  # name → tokenizer
        self._adapters: dict[str, Any] = {}    # "org:name" → LoRA adapter
        self._lock = threading.Lock()

    def predict(self, model_name: str, text: str,
                org_id: str = "default") -> dict:
        """Run inference. Returns {label, confidence, latency_ms}."""
        model = self._get_or_load(model_name)
        adapter = self._get_adapter(model_name, org_id)
        # ... tokenize, forward, softmax, return
```

Key design decisions:

  • Models loaded lazily on first use (not at startup)
  • Thread-safe with lock-free reads (model weights are immutable after load)
  • ONNX Runtime for CPU inference (2-3x faster than PyTorch)
  • Optional GPU via CUDA provider (auto-detected)
  • Adapters loaded per-org, cached in LRU (max 100 adapters in memory)
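The lazy-load plus lock-free-read combination is double-checked locking: readers hit a plain dict lookup, and the lock is only taken on the load path. A minimal sketch, where `load_fn` stands in for ONNX session setup:

```python
import threading

class LazyLoader:
    """Lazy, thread-safe model loading with lock-free reads.

    Reads after load need no lock because each dict entry is written
    once and never mutated afterwards.
    """
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._models = {}
        self._lock = threading.Lock()

    def get(self, name):
        model = self._models.get(name)  # fast path, no lock
        if model is not None:
            return model
        with self._lock:
            if name not in self._models:  # double-check under the lock
                self._models[name] = self._load_fn(name)
            return self._models[name]
```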

Phase 2: Feedback → Training Data Pipeline (Week 2-3)

```
# New API endpoint
POST /models/training-data/export
{
  "model_type": "classification",  # or "injection"
  "format": "huggingface",         # or "jsonl"
  "min_confidence": 0.8,
  "since": "2025-01-01"
}

# Returns downloadable dataset
{
  "examples": 1847,
  "label_distribution": {"PUBLIC": 892, "INTERNAL": 412, "CONFIDENTIAL": 398, "RESTRICTED": 145},
  "download_url": "/models/training-data/classification-20260227.jsonl"
}
```
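Behind the endpoint, the export is a filter over reviewed feedback events. A sketch in which the event field names are assumptions about the audit-event shape, not the real schema:

```python
import json

def export_training_jsonl(events, min_confidence=0.8):
    """Filter reviewed feedback events into JSONL training rows.

    false_positive rows are kept on purpose: the corrected label is
    exactly the signal the model needs. 'escalate' rows are skipped
    because they still need human review.
    """
    keep = {"true_positive", "false_positive", "false_negative"}
    lines = []
    for ev in events:
        if ev["verdict"] not in keep or ev["confidence"] < min_confidence:
            continue
        lines.append(json.dumps({"text": ev["text"], "label": ev["label"]}))
    return "\n".join(lines)
```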

Phase 3: LoRA Training + Deployment (Week 3-4)

```
# New API endpoint
POST /models/train
{
  "base_model": "deberta-v3-large-classification",
  "adapter_name": "org-acme-v1",
  "org_id": "acme",
  "epochs": 3,
  "lora_rank": 8,
  "eval_split": 0.15
}

# Returns training job status
{
  "job_id": "train_abc123",
  "status": "running",
  "progress": 0.45,
  "metrics": {"eval_accuracy": 0.94, "eval_f1": 0.92}
}
```

Phase 4: A/B Testing + Shadow Mode (Week 4-5)

Pipeline step config gets a new option:

```json
{
  "classify_data": {
    "model": "deberta-v3-large-classification",
    "adapter": "auto",
    "shadow_model": "gpt-4o-mini",
    "compare_mode": true
  }
}
```

When compare_mode is true, both models run and results are logged to audit for comparison.


```
TapPass process
└── ModelRegistry (in-process)
    ├── DeBERTa classifier (ONNX, CPU)   ~200MB RAM
    ├── Prompt-Guard-86M (ONNX, CPU)     ~100MB RAM
    ├── LoRA adapters (LRU cache)        ~50MB RAM
    └── Total: ~350MB additional RAM
```

All models run in-process. No external service needed. ONNX Runtime on CPU handles ~200 req/s for DeBERTa-class models on a 4-core machine.

```
┌──────────────┐     ┌──────────────┐
│  TapPass 1   │     │  TapPass 2   │
│  (pipeline)  │     │  (pipeline)  │
└──────┬───────┘     └──────┬───────┘
       │                    │
       └─────────┬──────────┘
                 │ gRPC / HTTP
          ┌──────▼──────┐
          │   Model     │
          │   Service   │  ← Dedicated inference service
          │  (GPU/CPU)  │    Runs models, manages adapters
          └─────────────┘    Horizontal scaling
```

Separate the model inference into a sidecar or dedicated service:

  • tappass-models container with GPU
  • gRPC for low-latency (protobuf, binary)
  • Batching: accumulate 5-10 requests over 5ms, run as batch → 3-5x throughput
  • Model versions pinned per deployment
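The batching bullet can be sketched as a collect loop in front of the model: block for the first request, then gather more for a few milliseconds, capped at the batch size. Names and defaults are illustrative:

```python
import queue
import time

def collect_batch(q, max_batch=8, window_ms=5):
    """Micro-batching: gather up to max_batch requests within window_ms.

    Blocks for the first item, then drains the queue until either the
    window closes or the batch is full. The caller runs the returned
    list as one batched forward pass (the 3-5x throughput claim above).
    """
    batch = [q.get()]
    deadline = time.monotonic() + window_ms / 1000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The tradeoff is a few milliseconds of added latency per request in exchange for much better GPU utilisation.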
```
          ┌─────────────────────────┐
          │     Model Registry      │
          │      (S3 / MinIO)       │
          │  base models + adapters │
          └────────────┬────────────┘
       ┌───────────────┼───────────────┐
       │               │               │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│  Inference  │ │  Inference  │ │  Inference  │
│   Node 1    │ │   Node 2    │ │   Node 3    │
│  GPU (T4)   │ │  GPU (T4)   │ │  CPU only   │
└─────────────┘ └─────────────┘ └─────────────┘
```

  • Model weights stored in object storage (S3/MinIO)
  • Inference nodes pull models on startup, cache locally
  • LoRA adapters pulled on-demand per org (2-5MB each, cached LRU)
  • Kubernetes HPA based on inference queue depth
  • vLLM or TGI if using generative models (Phi-3.5)

| Metric | Current (gpt-4o-mini) | Finetuned DeBERTa | Finetuned + LoRA |
| --- | --- | --- | --- |
| Latency | 500-3000ms | 8-15ms | 10-18ms |
| Cost/call | ~$0.0002 | $0 | $0 |
| Privacy | Data sent to OpenAI | Local | Local |
| Accuracy (est.) | ~90% | ~94% | ~97% |
| Multilingual | ✅ Native | ✅ (DeBERTa-v3) | ✅ (DeBERTa-v3) |
| Deterministic | ❌ | ✅ | ✅ |
| Offline capable | ❌ | ✅ | ✅ |
| Org-specific | ❌ | ❌ | ✅ |
| EU data residency | ❓ (depends on region) | ✅ guaranteed | ✅ guaranteed |

The fastest path to integrate specialized models without changing architecture:

Step 1: Add protectai/deberta-v3-base-prompt-injection-v2 to detect_injection

```python
# In detect_injection.py, after regex scoring:
if self.use_model:
    from tappass.models.inference import get_registry

    result = get_registry().predict("injection", text)
    if result["confidence"] > 0.8:
        model_score = result["confidence"]
        if model_score > max_score:
            max_score = model_score
        matches.append({
            "pattern": "model_injection_classifier",
            "score": model_score,
            "match": f"Model confidence: {model_score:.2f}",
            "source": "model",
        })
```

Step 2: Add local classification model to classify_data

Replace the _run_llm_classify method:

```python
async def _run_local_classify(self, ctx: PipelineContext) -> dict | None:
    from tappass.models.inference import get_registry

    text = extract_input_text(ctx)
    if not text or len(text) < 20:
        return None
    result = get_registry().predict(
        "classification", text, org_id=ctx.org_id
    )
    return {
        "classification": result["label"],
        "confidence": result["confidence"],
        "findings": [],
    }
```

This keeps the same interface: the rest of classify_data doesn’t change.


| Priority | Action | Impact | Effort |
| --- | --- | --- | --- |
| P1 | Add protectai/deberta-v3-base-prompt-injection-v2 alongside regex | Catches semantic injections regex misses. 5-10ms. | 2-3 days |
| P1 | Finetune DeBERTa-v3-large on classification labels | 200x faster than gpt-4o-mini, local, deterministic. | 1 week |
| P2 | Build feedback → training data export pipeline | Enables continuous improvement from CISO feedback. | 3-4 days |
| P2 | Add LoRA adapter system for org-specific models | Each org gets a custom model trained on their data. | 1 week |
| P3 | Shadow mode model comparison | Safe rollout of new models with production traffic. | 3-4 days |
| P3 | Dedicated inference service (gRPC) | Required at scale (>100 agents). | 1-2 weeks |