
TapPass Model Strategy: Specialized Models & Finetuning

TapPass has three LLM-powered pipeline steps:

| Step | Purpose | Current model | Latency budget |
| --- | --- | --- | --- |
| classify_data (use_llm=true) | Semantic classification (PUBLIC/INTERNAL/CONFIDENTIAL/RESTRICTED) | gpt-4o-mini | 2-8s |
| smart_scan_output | Output scanning for leaked sensitive data | gpt-4o-mini | 2-8s |
| detect_injection | Prompt injection scoring | Regex only (no LLM) | <5ms |

These run inline on every request. Speed and cost are critical.


Part 1: Production-Ready Specialized Models

1.1 Data Classification (PUBLIC → RESTRICTED)

Best candidates (ranked by speed × accuracy):

| Model | Size | Latency (GPU) | Why |
| --- | --- | --- | --- |
| deberta-v3-large finetuned | 304M | 8-15ms | Microsoft’s best discriminative model. State-of-the-art on NLI/classification. Finetuning on classification labels with ~2k examples gets >95% accuracy. Already used by Azure Purview / Microsoft Compliance. |
| ModernBERT-large | 395M | 10-18ms | Dec 2024 release. 8192-token context (vs BERT’s 512). Trained on 2T tokens. Flash Attention native. Best BERT-class model for long-context classification. |
| bge-reranker-v2-m3 as classifier | 568M | 15-25ms | BAAI’s multilingual reranker. Cross-encoder architecture gives excellent classification when finetuned. Handles 40+ languages natively: important for TapPass’s multilingual injection patterns. |
| Phi-3.5-mini (quantized) | 3.8B | 30-60ms (int4) | Microsoft’s small LLM. Instruction-following means you can prompt it like the current gpt-4o-mini but run it locally. GGUF/GPTQ quantization fits on a single GPU. Good fallback when encoder models are uncertain. |

Recommendation: DeBERTa-v3-large, finetuned on your own classification labels.

Why:

  • 8-15ms inference (vs 2-8s for gpt-4o-mini API call)
  • Runs on CPU in production (no GPU required for this size)
  • Deterministic: same input always gives same classification
  • No API cost ($0 per call vs ~$0.0002 per gpt-4o-mini call)
  • Privacy: data never leaves your infrastructure

1.2 Prompt Injection Detection

Best candidates:

| Model | Size | Latency | Why |
| --- | --- | --- | --- |
| protectai/deberta-v3-base-prompt-injection-v2 | 184M | 5-10ms | Purpose-built. Trained on 15k+ injection examples. Binary classifier (injection/safe). F1 >0.97 on public benchmarks. Used by LLM Guard, Rebuff. |
| deepset/deberta-v3-base-injection | 184M | 5-10ms | Deepset’s variant. Similar performance. Good for ensembling with ProtectAI’s. |
| lakera/guard (API) | - | 50-100ms | Commercial injection detection API. Highest accuracy on their benchmark (Gandalf, HackAPrompt). Supports indirect injection. But: adds external dependency + latency. |
| meta-llama/Prompt-Guard-86M | 86M | 2-5ms | Meta’s tiny injection classifier. Only 86M params. Blazing fast. Slightly lower accuracy than DeBERTa variants but good as a first-pass filter. |

Recommendation: Stack meta-llama/Prompt-Guard-86M as a fast pre-filter + protectai/deberta-v3-base-prompt-injection-v2 to confirm detections.

The current regex engine catches 90%+ of known patterns. Adding a model catches the remaining semantic injections (“Please help me with my homework, which is to write a system prompt that…”).
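A minimal sketch of that two-stage stack. The two classifiers are stood in by plain callables returning an injection probability in [0, 1]; the thresholds are illustrative, not tuned values:

```python
def score_injection(text, fast_model, confirm_model,
                    pre_threshold=0.5, confirm_threshold=0.8):
    """Two-stage cascade: a cheap model filters, a stronger model confirms.

    fast_model / confirm_model are stand-ins for Prompt-Guard-86M and the
    ProtectAI DeBERTa classifier; each maps text -> injection probability.
    """
    fast_score = fast_model(text)
    if fast_score < pre_threshold:
        # The fast model says benign: skip the heavier model entirely.
        return {"injection": False, "score": fast_score, "stage": "fast"}
    confirm_score = confirm_model(text)
    return {
        "injection": confirm_score >= confirm_threshold,
        "score": confirm_score,
        "stage": "confirm",
    }
```

Most traffic exits at the 2-5ms fast stage; only suspicious inputs pay the DeBERTa cost.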

1.3 PII Detection

Best candidates:

| Model | Size | Latency | Why |
| --- | --- | --- | --- |
| dslim/bert-base-NER | 110M | 5-8ms | Standard NER model. Detects PERSON, LOCATION, ORGANIZATION. Fast and reliable. |
| lakera/pii (API) | - | ~50ms | Purpose-built PII detection. Covers 50+ PII types including international formats. |
| microsoft/presidio | - | 10-30ms | Microsoft’s PII framework. Combines regex + NER + context. Already very similar to TapPass’s detect_pii step. |
| GLiNER models | 200-400M | 10-20ms | Zero-shot NER. Can detect any entity type without finetuning. Good for custom PII categories. |

Recommendation: Keep current regex approach for structured PII (SSN, credit card, IBAN). Add dslim/bert-base-NER for unstructured PII (names, addresses in running text).
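A sketch of the hybrid merge, assuming both detectors emit (start, end, label) spans (the span format and the overlap policy are assumptions, not TapPass code):

```python
def merge_pii_findings(regex_hits, ner_hits):
    """Combine structured regex PII hits with NER hits on free text.

    Each hit is a (start, end, label) tuple. Regex wins on overlap,
    since it is exact for structured types (SSN, credit card, IBAN);
    NER fills in names and addresses the patterns cannot see.
    """
    merged = list(regex_hits)
    for start, end, label in ner_hits:
        overlaps = any(start < r_end and r_start < end
                       for r_start, r_end, _ in regex_hits)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)
```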

1.4 Output Scanning (Sensitive Data Leakage)

| Model | Size | Latency | Why |
| --- | --- | --- | --- |
| Same classification model (reused) | - | - | Reuse the data classification model on LLM output. If input was PUBLIC but output classifies as CONFIDENTIAL → leak detected. |
| sentence-transformers/all-MiniLM-L6-v2 | 23M | 2-5ms | Embedding similarity. Encode known secrets/PII patterns and check whether output embeddings are close. Ultra-fast but approximate. |
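The embedding-similarity check reduces to a dot product when vectors are L2-normalised (as all-MiniLM-L6-v2 embeddings typically are). A sketch with the encoder abstracted away; the threshold is illustrative:

```python
import numpy as np

def leaks_known_secret(output_vec, secret_vecs, threshold=0.85):
    """Flag an output whose embedding sits close to any known-secret embedding.

    output_vec: (d,) embedding of the LLM output.
    secret_vecs: (n, d) matrix of pre-computed secret embeddings.
    All vectors are assumed unit-norm, so cosine similarity == dot product.
    """
    sims = secret_vecs @ output_vec  # one cosine similarity per secret
    return bool(np.max(sims) >= threshold)
```

Secrets are embedded once at index time, so the per-request cost is a single matrix-vector product.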

Part 2: Finetuning

TapPass already has a detection_feedback audit event type. The flow:

User message → Pipeline detects → CISO reviews in UI → Marks as:
- true_positive (correct detection)
- false_positive (should not have flagged)
- false_negative (missed: CISO flags manually)
- escalate (needs investigation)

Schema for training data:

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    text: str          # The input/output text
    label: str         # "PUBLIC" | "INTERNAL" | "CONFIDENTIAL" | "RESTRICTED"
    source: str        # "detect_pii" | "classify_data" | "detect_injection" | ...
    verdict: str       # "true_positive" | "false_positive" | "false_negative"
    confidence: float  # CISO's confidence in the label (0-1)
    context: dict      # Pipeline findings at time of detection
    created_at: str    # ISO timestamp
    org_id: str        # For org-specific models
```

```
Audit Trail ──► Feedback ──► ┌──────────────┐
  (JSONL)       Events       │   Export     │
                             │   Training   │
                             │     Data     │
                             └──────┬───────┘
                     ┌──────────────┴──────────────┐
                     │                             │
               ┌─────▼─────┐                 ┌─────▼─────┐
               │  Shared   │                 │  Org-Spec │
               │   Base    │                 │  Adapter  │
               │   Model   │                 │  (LoRA)   │
               └─────┬─────┘                 └─────┬─────┘
                     │                             │
                     └──────────────┬──────────────┘
                             ┌──────▼──────┐
                             │   Eval +    │
                             │  A/B Test   │
                             └──────┬──────┘
                             ┌──────▼──────┐
                             │   Deploy    │
                             │  (shadow    │
                             │   → prod)   │
                             └─────────────┘
```

Tier 1: Shared base model (trained on aggregated, anonymized data from all orgs)

  • DeBERTa-v3-large finetuned on data classification
  • Protectai injection model as-is (already excellent)
  • Updated quarterly with new patterns

Tier 2: Org-specific LoRA adapters (trained on each org’s feedback)

  • LoRA rank 8-16 on top of the base model
  • ~2-5MB per adapter (trivial to store/load)
  • Captures org-specific patterns:
    • “Project Aurora” is always RESTRICTED at Org A
    • Internal server naming conventions at Org B
    • Industry-specific jargon (HIPAA vs SOX vs PCI-DSS)

Why LoRA:

  • 100-1000x less training data needed (50-200 examples per org)
  • Training takes minutes, not hours
  • Base model weights are frozen → no catastrophic forgetting
  • Multiple adapters can coexist, loaded per-request based on org_id
  • Adapter swapping is <1ms (just loading a small matrix)
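The per-request adapter path can be sketched as a small LRU cache keyed by org_id. The loader callable is a stand-in for reading LoRA weights from disk or object storage; the class name and sizes are illustrative:

```python
from collections import OrderedDict

class AdapterCache:
    """LRU cache of per-org LoRA adapters.

    Adapters are ~2-5MB each, so keeping ~100 resident is cheap;
    swapping one in is just a dict lookup on the hot path.
    """
    def __init__(self, loader, max_size=100):
        self._loader = loader
        self._max = max_size
        self._cache = OrderedDict()

    def get(self, org_id):
        if org_id in self._cache:
            self._cache.move_to_end(org_id)  # mark as recently used
            return self._cache[org_id]
        adapter = self._loader(org_id)
        self._cache[org_id] = adapter
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict least-recently used
        return adapter
```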
```python
# Trigger conditions for retraining:
RETRAIN_TRIGGERS = {
    "feedback_count": 50,         # 50 new feedback events since last train
    "false_positive_rate": 0.15,  # FP rate exceeds 15%
    "max_age_days": 30,           # At least monthly
    "manual": True,               # CISO can trigger from UI
}
```
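Evaluating the triggers is a simple any-of check. A sketch, where the `stats` fields are assumed counters aggregated from the audit trail since the last training run:

```python
def should_retrain(stats, triggers):
    """Return True if any retrain trigger fires.

    `stats` keys (feedback_count, false_positive_rate,
    days_since_last_train, manual_request) are assumptions about
    how the audit aggregation is shaped.
    """
    if stats.get("manual_request") and triggers["manual"]:
        return True
    if stats["feedback_count"] >= triggers["feedback_count"]:
        return True
    if stats["false_positive_rate"] > triggers["false_positive_rate"]:
        return True
    return stats["days_since_last_train"] >= triggers["max_age_days"]
```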

Before promoting a new model to production:

  1. Deploy in shadow mode: runs alongside current model, results logged but not enforced
  2. Compare predictions on live traffic for 24-48 hours
  3. Generate comparison report:
    • Accuracy vs current model
    • Latency comparison
    • Cases where new model disagrees with current
  4. CISO reviews and approves promotion

This maps directly to TapPass’s existing shadow_mode in enforcement_mode.
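The shadow run itself is small: both models predict, only the incumbent's verdict is enforced, and disagreements are logged for the comparison report. A sketch with hypothetical names; `log` stands in for the audit-trail writer:

```python
def shadow_compare(primary_fn, shadow_fn, text, log):
    """Run the challenger in shadow mode.

    primary_fn / shadow_fn each return {"label", "confidence"}.
    Only the primary result is enforced; the shadow result is
    recorded when it disagrees.
    """
    primary = primary_fn(text)
    shadow = shadow_fn(text)
    if shadow["label"] != primary["label"]:
        log({"event": "shadow_disagreement",
             "primary": primary, "shadow": shadow})
    return primary  # enforcement always follows the current model
```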


Phase 1: Model Inference Service (Week 1-2)

Add a local model inference service that pipeline steps can call:

```
tappass/
  models/
    __init__.py
    inference.py          # ModelRegistry, load/predict interface
    adapters.py           # LoRA adapter management per org
    export_training.py    # Export feedback → training data
    config.py             # Model paths, GPU/CPU config
  pipeline/steps/
    detect_injection.py   # Add model scoring alongside regex
    classify_data.py      # Add local model alongside/replacing gpt-4o-mini
```

inference.py core design:

```python
import threading
from typing import Any

class ModelRegistry:
    """Singleton. Loads models lazily, manages LoRA adapters per org."""

    def __init__(self):
        self._models: dict[str, Any] = {}      # name → loaded model
        self._tokenizers: dict[str, Any] = {}  # name → tokenizer
        self._adapters: dict[str, Any] = {}    # "org:name" → LoRA adapter
        self._lock = threading.Lock()

    def predict(self, model_name: str, text: str,
                org_id: str = "default") -> dict:
        """Run inference. Returns {label, confidence, latency_ms}."""
        model = self._get_or_load(model_name)
        adapter = self._get_adapter(model_name, org_id)
        # ... tokenize, forward, softmax, return
```

Key design decisions:

  • Models loaded lazily on first use (not at startup)
  • Thread-safe with lock-free reads (model weights are immutable after load)
  • ONNX Runtime for CPU inference (2-3x faster than PyTorch)
  • Optional GPU via CUDA provider (auto-detected)
  • Adapters loaded per-org, cached in LRU (max 100 adapters in memory)
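The lazy-load plus lock-free-read combination is double-checked locking: readers hit a plain dict lookup, and the lock is only taken on the load path. A minimal sketch, where `load_fn` stands in for ONNX session setup:

```python
import threading

class LazyLoader:
    """Lazy, thread-safe model loading with lock-free reads.

    Reads after load need no lock because each dict entry is written
    once and never mutated afterwards.
    """
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._models = {}
        self._lock = threading.Lock()

    def get(self, name):
        model = self._models.get(name)  # fast path, no lock
        if model is not None:
            return model
        with self._lock:
            if name not in self._models:  # double-check under the lock
                self._models[name] = self._load_fn(name)
            return self._models[name]
```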

Phase 2: Feedback → Training Data Pipeline (Week 2-3)

```
# New API endpoint
POST /models/training-data/export
{
  "model_type": "classification",  # or "injection"
  "format": "huggingface",         # or "jsonl"
  "min_confidence": 0.8,
  "since": "2025-01-01"
}

# Returns downloadable dataset
{
  "examples": 1847,
  "label_distribution": {"PUBLIC": 892, "INTERNAL": 412, "CONFIDENTIAL": 398, "RESTRICTED": 145},
  "download_url": "/models/training-data/classification-20260227.jsonl"
}
```
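Behind the endpoint, the export is a filter over reviewed feedback events. A sketch in which the event field names are assumptions about the audit-event shape, not the real schema:

```python
import json

def export_training_jsonl(events, min_confidence=0.8):
    """Filter reviewed feedback events into JSONL training rows.

    false_positive rows are kept on purpose: the corrected label is
    exactly the signal the model needs. 'escalate' rows are skipped
    because they still need human review.
    """
    keep = {"true_positive", "false_positive", "false_negative"}
    lines = []
    for ev in events:
        if ev["verdict"] not in keep or ev["confidence"] < min_confidence:
            continue
        lines.append(json.dumps({"text": ev["text"], "label": ev["label"]}))
    return "\n".join(lines)
```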

Phase 3: LoRA Training + Deployment (Week 3-4)

```
# New API endpoint
POST /models/train
{
  "base_model": "deberta-v3-large-classification",
  "adapter_name": "org-acme-v1",
  "org_id": "acme",
  "epochs": 3,
  "lora_rank": 8,
  "eval_split": 0.15
}

# Returns training job status
{
  "job_id": "train_abc123",
  "status": "running",
  "progress": 0.45,
  "metrics": {"eval_accuracy": 0.94, "eval_f1": 0.92}
}
```

Phase 4: A/B Testing + Shadow Mode (Week 4-5)

Pipeline step config gets a new option:

```json
{
  "classify_data": {
    "model": "deberta-v3-large-classification",
    "adapter": "auto",
    "shadow_model": "gpt-4o-mini",
    "compare_mode": true
  }
}
```

When compare_mode is true, both models run and results are logged to audit for comparison.


```
TapPass process
└── ModelRegistry (in-process)
    ├── DeBERTa classifier (ONNX, CPU)   ~200MB RAM
    ├── Prompt-Guard-86M (ONNX, CPU)     ~100MB RAM
    ├── LoRA adapters (LRU cache)        ~50MB RAM
    └── Total: ~350MB additional RAM
```

All models run in-process. No external service needed. ONNX Runtime on CPU handles ~200 req/s for DeBERTa-class models on a 4-core machine.

```
┌──────────────┐     ┌──────────────┐
│  TapPass 1   │     │  TapPass 2   │
│  (pipeline)  │     │  (pipeline)  │
└──────┬───────┘     └──────┬───────┘
       │                    │
       └─────────┬──────────┘
                 │ gRPC / HTTP
          ┌──────▼──────┐
          │   Model     │
          │   Service   │  ← Dedicated inference service
          │  (GPU/CPU)  │    Runs models, manages adapters
          └─────────────┘    Horizontal scaling
```

Separate the model inference into a sidecar or dedicated service:

  • tappass-models container with GPU
  • gRPC for low-latency (protobuf, binary)
  • Batching: accumulate 5-10 requests over 5ms, run as batch → 3-5x throughput
  • Model versions pinned per deployment
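The batching bullet can be sketched as a collect loop in front of the model: block for the first request, then gather more for a few milliseconds, capped at the batch size. Names and defaults are illustrative:

```python
import queue
import time

def collect_batch(q, max_batch=8, window_ms=5):
    """Micro-batching: gather up to max_batch requests within window_ms.

    Blocks for the first item, then drains the queue until either the
    window closes or the batch is full. The caller runs the returned
    list as one batched forward pass (the 3-5x throughput claim above).
    """
    batch = [q.get()]
    deadline = time.monotonic() + window_ms / 1000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The tradeoff is a few milliseconds of added latency per request in exchange for much better GPU utilisation.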
```
          ┌─────────────────────────┐
          │     Model Registry      │
          │      (S3 / MinIO)       │
          │  base models + adapters │
          └────────────┬────────────┘
       ┌───────────────┼───────────────┐
       │               │               │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│  Inference  │ │  Inference  │ │  Inference  │
│   Node 1    │ │   Node 2    │ │   Node 3    │
│  GPU (T4)   │ │  GPU (T4)   │ │  CPU only   │
└─────────────┘ └─────────────┘ └─────────────┘
```

  • Model weights stored in object storage (S3/MinIO)
  • Inference nodes pull models on startup, cache locally
  • LoRA adapters pulled on-demand per org (2-5MB each, cached LRU)
  • Kubernetes HPA based on inference queue depth
  • vLLM or TGI if using generative models (Phi-3.5)

| Metric | Current (gpt-4o-mini) | Finetuned DeBERTa | Finetuned + LoRA |
| --- | --- | --- | --- |
| Latency | 500-3000ms | 8-15ms | 10-18ms |
| Cost/call | ~$0.0002 | $0 | $0 |
| Privacy | Data sent to OpenAI | Local | Local |
| Accuracy (est.) | ~90% | ~94% | ~97% |
| Multilingual | ✅ Native | ✅ (DeBERTa-v3) | ✅ (DeBERTa-v3) |
| Deterministic | ❌ | ✅ | ✅ |
| Offline capable | ❌ | ✅ | ✅ |
| Org-specific | ❌ | ❌ | ✅ |
| EU data residency | ❓ (depends on region) | ✅ guaranteed | ✅ guaranteed |

The fastest path to integrate specialized models without changing architecture:

Step 1: Add protectai/deberta-v3-base-prompt-injection-v2 to detect_injection

```python
# In detect_injection.py, after regex scoring:
if self.use_model:
    from tappass.models.inference import get_registry

    result = get_registry().predict("injection", text)
    if result["confidence"] > 0.8:
        model_score = result["confidence"]
        if model_score > max_score:
            max_score = model_score
        matches.append({
            "pattern": "model_injection_classifier",
            "score": model_score,
            "match": f"Model confidence: {model_score:.2f}",
            "source": "model",
        })
```

Step 2: Add local classification model to classify_data

Replace the _run_llm_classify method:

```python
async def _run_local_classify(self, ctx: PipelineContext) -> dict | None:
    from tappass.models.inference import get_registry

    text = extract_input_text(ctx)
    if not text or len(text) < 20:
        return None
    result = get_registry().predict(
        "classification", text, org_id=ctx.org_id
    )
    return {
        "classification": result["label"],
        "confidence": result["confidence"],
        "findings": [],
    }
```

This keeps the same interface: the rest of classify_data doesn’t change.


| Priority | Action | Impact | Effort |
| --- | --- | --- | --- |
| P1 | Add protectai/deberta-v3-base-prompt-injection-v2 alongside regex | Catches semantic injections regex misses. 5-10ms. | 2-3 days |
| P1 | Finetune DeBERTa-v3-large on classification labels | 200x faster than gpt-4o-mini, local, deterministic. | 1 week |
| P2 | Build feedback → training data export pipeline | Enables continuous improvement from CISO feedback. | 3-4 days |
| P2 | Add LoRA adapter system for org-specific models | Each org gets a custom model trained on their data. | 1 week |
| P3 | Shadow mode model comparison | Safe rollout of new models with production traffic. | 3-4 days |
| P3 | Dedicated inference service (gRPC) | Required at scale (>100 agents). | 1-2 weeks |