Local LLM Support: Full Analysis
Goal: Enable tappass[ollama], tappass[mistral], tappass[kimi], etc. as SDK extras so users can run the entire TapPass governance stack on local LLMs with zero cloud dependency.
Date: 2026-02-27 Author: SDK Review
Table of contents
- Executive summary
- Current state
- What “local LLM support” actually means for TapPass
- Architecture: the two LLM roles
- Provider analysis: which local LLMs to support
- SDK extras design
- Server-side changes needed
- SDK-side changes needed
- Eval: can local models do governance?
- Competitive positioning
- Risks and mitigations
- Implementation roadmap
- Marketing angle
1. Executive summary
TapPass already supports local LLMs at the server level. `call_llm.py` uses `litellm.acompletion()`, which routes to Ollama, vLLM, LM Studio, or any OpenAI-compatible endpoint. The config already has `ollama_api_base` and `eu_allowed_judge_models` with Ollama entries. The model routing step already routes RESTRICTED → `ollama/llama3.2`.
What’s missing is the user-facing packaging and DX. The value proposition (“run everything on local LLMs, nothing leaves your machine”) is buried in config comments. There’s no `tappass[ollama]` extra, no one-command setup, no documentation, and no eval benchmarks for local models as governance judges.
This is a high-impact, medium-effort initiative. The plumbing exists. What’s needed:
- SDK extras (`tappass[ollama]`, `tappass[vllm]`, `tappass[mistral]`, `tappass[kimi]`) that install provider deps and configure the right env
- A `tappass up --local` mode that auto-detects Ollama/vLLM and configures everything
- Eval benchmarks: which local models meet the governance quality bar (PII, injection, classification)
- Marketing: “Air-gapped governance. Nothing leaves your network.”
2. Current state
What already works
| Layer | Local LLM support | Status |
|---|---|---|
| `call_llm.py`: agent LLM calls | ✅ Full | Routes via litellm to any provider. `model="ollama/llama3.2"` works today. |
| `classify_data.py`: LLM judge | ✅ Full | `"llm_model": "ollama/llama3.2"` in step config. |
| `llm_step.py`: custom LLM steps | ✅ Full | CISO can set `"model": "ollama/mistral"` per step. |
| `model_routing.py`: sensitivity routing | ✅ Full | Already routes RESTRICTED → `ollama/llama3.2`. |
| `config.py`: server config | ⚠️ Partial | Has `ollama_api_base`, `eu_allowed_judge_models` with Ollama entries. Missing vLLM, LM Studio, Kimi. |
| `circuit_breaker.py`: provider routing | ✅ Full | Handles `ollama/` prefix, fallback between providers. |
| `model_gateway.py`: model registry | ⚠️ Minimal | Only registers 3 cloud models. No local model auto-discovery. |
| SDK | ❌ None | No local-LLM extras, no provider deps, no auto-config. |
| CLI | ❌ None | No tappass up --local, no model auto-detection. |
| Docs | ❌ None | No local LLM guide. Buried in config comments. |
| Eval | ❌ None | No benchmarks for local models as governance judges. |
The litellm abstraction
The core insight: litellm is already the universal LLM adapter. It supports 100+ providers, including:
- Ollama: `ollama/llama3.2`, `ollama/mistral`, `ollama/qwen2.5`
- vLLM: `openai/model-name` with custom `api_base`
- LM Studio: `openai/model-name` with `http://localhost:1234/v1`
- Mistral (cloud): `mistral/mistral-small-latest`
- Mistral (local via Ollama): `ollama/mistral`
- Kimi / Moonshot: `openai/moonshot-v1-8k` with custom `api_base`
- DeepSeek (cloud): `deepseek/deepseek-chat`
- DeepSeek (local): `ollama/deepseek-r1`
- llamacpp: `openai/model` with llama.cpp server
- text-generation-inference (TGI): `huggingface/model`
TapPass doesn’t need to implement any provider adapters. It just needs to:
- Configure litellm correctly for each provider
- Package the dependencies
- Make the UX seamless
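The three steps above mostly reduce to a lookup: provider → litellm model string, plus an `api_base` for OpenAI-compatible servers. A minimal sketch; the helper name is hypothetical and the llamacpp port is an assumption, while the prefixes follow litellm’s documented conventions:

```python
# Hypothetical helper: map a local provider to the model string and extra
# kwargs that would be passed through to litellm.acompletion().
def litellm_params(provider: str, model: str) -> dict:
    """Return litellm kwargs for a given local/cloud provider."""
    if provider == "ollama":
        # litellm has a native ollama/ prefix (default port 11434)
        return {"model": f"ollama/{model}"}
    if provider in ("vllm", "lmstudio", "llamacpp"):
        # OpenAI-compatible servers: openai/ prefix + custom api_base
        bases = {
            "vllm": "http://localhost:8000/v1",
            "lmstudio": "http://localhost:1234/v1",
            "llamacpp": "http://localhost:8080/v1",  # assumed default port
        }
        return {"model": f"openai/{model}", "api_base": bases[provider]}
    if provider == "mistral":
        return {"model": f"mistral/{model}"}  # cloud API
    raise ValueError(f"unknown provider: {provider}")
```

This is the entire adapter surface TapPass would own; everything below it is litellm.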
3. What “local LLM support” actually means for TapPass
There are three distinct user stories, and they have different requirements:
Story A: “I want governed agents, using a local LLM for the agent work”
The agent’s LLM calls go to a local model instead of OpenAI. TapPass still sits in the middle and governs everything.
```
Agent → TapPass → Ollama (local) → TapPass → Agent
                     ↑
        llama3.2 running on localhost:11434
```

This already works. The user just sets `model="ollama/llama3.2"` in their chat call. No changes needed to TapPass; litellm handles it.
Story B: “I want the governance pipeline itself to run on local LLMs”
The pipeline’s LLM judge (for classification, injection scoring, custom LLM steps) runs on a local model instead of calling OpenAI.
```
Agent → TapPass pipeline → [classify: ollama/llama3.2] → [call_llm: ollama/llama3.2] → Agent
                                        ↑
                           LLM judge runs locally too
```

This already works at the config level (`TAPPASS_LLM_JUDGE_MODEL=ollama/llama3.2`), but:
- No one knows about it (not documented)
- No eval: does llama3.2 actually detect PII/injection well enough?
- No preset configs for common local setups
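A minimal `.env` sketch for this story, using the variable names that appear in this doc’s config presets (section 7.3):

```shell
# Governance judge runs on a local Ollama model; the agent model is unchanged
TAPPASS_LLM_JUDGE_MODEL=ollama/llama3.2
TAPPASS_LLM_JUDGE_FALLBACK_MODEL=ollama/mistral
TAPPASS_OLLAMA_API_BASE=http://localhost:11434
```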
Story C: “I want TapPass 100% air-gapped; nothing leaves my network”
The entire stack (server, pipeline, LLM calls, judge calls) runs locally. Zero egress. This is the nuclear option for defense, healthcare, and regulated industries.

```
┌────────────────── Air-gapped network ──────────────────┐
│                                                        │
│   Agent → TapPass → Ollama/vLLM (local GPU)            │
│                                                        │
│   [all pipeline steps use local models]                │
│   [no internet access required]                        │
│                                                        │
└────────────────────────────────────────────────────────┘
```

This is the killer feature. It requires:
- All deterministic pipeline steps work without any LLM (they already do)
- LLM judge steps configured to use local models
- NER/PII detection via spaCy (already supported, runs locally)
- No telemetry, no license check calling home
- Docker image with all deps baked in
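One way to keep the “zero egress” requirement honest in CI; a hypothetical smoke-test sketch (not a TapPass feature) that patches `socket.connect` to reject any non-loopback destination, so any stack exercised afterwards fails loudly if it phones home:

```python
# Hypothetical air-gap smoke test: block any socket connection that
# tries to leave localhost, then exercise the full pipeline.
import socket

ALLOWED_HOSTS = {"127.0.0.1", "::1", "localhost"}

_real_connect = socket.socket.connect

def guarded_connect(self, address):
    # AF_INET/AF_INET6 addresses are (host, port, ...) tuples
    host = address[0] if isinstance(address, tuple) else address
    if host not in ALLOWED_HOSTS:
        raise RuntimeError(f"egress attempt blocked: {host}")
    return _real_connect(self, address)

socket.socket.connect = guarded_connect

# Anything run after this point (e.g. a TapPass pipeline invocation)
# raises instead of silently reaching the internet.
```

License checks, telemetry, and cloud judge calls all surface immediately under this harness.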
4. Architecture: the two LLM roles
TapPass uses LLMs in exactly two roles, and they have different quality requirements:
Role 1: Agent LLM (the “work” model)
Section titled “Role 1: Agent LLM (the “work” model)”- What: The model that does the agent’s actual work (answering questions, generating code, etc.)
- Where: `call_llm.py` step
- Quality bar: Up to the user. If they want to use llama3.2 for their agent, that’s their choice.
- TapPass involvement: Just routes the call and governs it. Doesn’t care about model quality.
Role 2: Governance judge (the “safety” model)
- What: The model that evaluates requests/responses for PII, injection, classification
- Where: `classify_data.py` (when `use_llm=true`), `llm_step.py` (custom CISO steps), `scan_output.py`
- Quality bar: HIGH. A bad judge = false negatives = PII leaks, injection bypasses.
- TapPass involvement: This is the critical path. If the judge model is weak, governance is weak.
Key insight: Most of TapPass’s governance is deterministic (regex, pattern matching, heuristics). The LLM judge is an optional escalation layer. This means:
TapPass governance works without any LLM at all. The 28 deterministic steps (PII regex, secret patterns, injection heuristics, taint tracking, rate limits, budgets) run without an LLM. The LLM judge only fires for semantic classification that regex can’t catch.
This is a massive selling point for local/air-gapped deployments: you get 80% of the governance with zero LLM dependency, and can add a local LLM for the remaining 20%.
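The 80/20 split above is essentially a two-stage escalation: deterministic checks first, LLM judge only for what they can’t decide. A minimal sketch; the function names and regexes here are illustrative, not TapPass’s actual step implementations:

```python
import re

# Stage 1: deterministic checks (illustrative patterns only)
DETERMINISTIC_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "injection": re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
}

def deterministic_scan(text: str) -> list[str]:
    """Return the names of all deterministic rules that fired."""
    return [name for name, pat in DETERMINISTIC_PATTERNS.items() if pat.search(text)]

def govern(text: str, llm_judge=None) -> str:
    """Two-stage governance: regex first, optional LLM judge second."""
    hits = deterministic_scan(text)
    if hits:
        return f"BLOCK ({', '.join(hits)})"
    if llm_judge is not None:
        # Stage 2: escalate only the cases the regexes can't decide
        return llm_judge(text)
    return "PASS"  # deterministic-only mode: zero LLM dependency
```

With `llm_judge=None` this is the air-gapped, no-model configuration; plugging in a local model only upgrades the ambiguous cases.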
5. Provider analysis: which local LLMs to support
Tier 1: Must-have (highest user demand)
| Provider | Protocol | Local model | Why |
|---|---|---|---|
| Ollama | ollama/* | llama3.2, mistral, qwen2.5, phi3, deepseek-r1, gemma2 | #1 local LLM runner. 1M+ downloads. Zero config. |
| vLLM | openai/* + custom base | Any HF model | Production-grade. Used by most enterprises for local serving. |
| LM Studio | openai/* + localhost:1234 | Any GGUF model | Developer-friendly GUI. 10M+ downloads. |
Tier 2: High value (cloud providers with local options)
| Provider | Protocol | Local option | Why |
|---|---|---|---|
| Mistral | mistral/* (cloud) or ollama/mistral (local) | Yes via Ollama | EU-headquartered. GDPR-friendly. Strong small models (7B). |
| DeepSeek | deepseek/* (cloud) or ollama/deepseek-r1 (local) | Yes via Ollama | Best price/performance. DeepSeek-R1 is SOTA for reasoning. |
Tier 3: Niche but valuable
| Provider | Protocol | Why |
|---|---|---|
| Kimi / Moonshot | openai/* + custom base (api.moonshot.cn) | Strong Chinese LLM. Growing enterprise adoption in APAC. |
| llamacpp | openai/* + custom base | C++ inference. Fastest on CPU-only. |
| TGI (HuggingFace) | huggingface/* | Popular in MLOps teams. |
| Apple MLX | openai/* + custom base (mlx_lm.server) | Growing fast on Apple Silicon Macs. |
Recommendation
Ship Tier 1 + Tier 2 first. Ollama covers 90% of local LLM users. vLLM covers enterprise. LM Studio covers developers. Mistral gets the EU angle. Kimi and others can come later.
6. SDK extras design
Section titled “6. SDK extras design”The problem
Currently the SDK (`pip install tappass`) is a pure HTTP client. It talks to the TapPass server, which talks to the LLM. The SDK doesn’t need to know about LLM providers.
But for local LLM DX, we want `pip install tappass[ollama]` to:
- Validate Ollama is installed and reachable
- Auto-discover available local models
- Configure the TapPass server to use them
- Provide a `tappass up --local` command
Proposed extras
```toml
[project.optional-dependencies]
sandbox = ["nono-py>=0.1.0"]
ollama = ["ollama>=0.4.0"]         # Ollama Python client (for model discovery)
mistral = ["mistralai>=1.0.0"]     # Mistral client (for API key validation)
kimi = []                          # No extra deps: uses OpenAI-compatible API
local = ["ollama>=0.4.0"]          # Meta-extra: everything needed for local LLM
all = ["nono-py>=0.1.0", "ollama>=0.4.0", "mistralai>=1.0.0"]
```

Important: The extras are for DX tooling (model discovery, validation, health checks), not for inference. Inference always goes through the TapPass server, which uses litellm. The SDK extras just make setup easier.
Server extras (already exist, need expansion)
```toml
# pyproject.toml (server)
[project.optional-dependencies]
llm = ["litellm>=1.50.0"]            # Already exists
local = ["psycopg[binary]>=3.1.0"]   # Already exists; rename to "postgres"
ollama = ["litellm>=1.50.0"]         # Same as llm (litellm handles Ollama)
vllm = ["litellm>=1.50.0"]           # Same as llm (litellm handles vLLM)
mistral = ["litellm>=1.50.0"]        # Same as llm
airgap = ["litellm>=1.50.0", "spacy>=3.7.0", "presidio-analyzer>=2.2.0", "presidio-anonymizer>=2.2.0"]  # Everything for air-gapped
```

The server extras are all the same under the hood (litellm handles every provider). But having named extras is a marketing and DX signal: when someone searches “tappass ollama” they see it’s explicitly supported.
7. Server-side changes needed
7.1 Model auto-discovery in model_gateway.py
Currently, only 3 hardcoded cloud models are registered. Add auto-discovery:
```python
async def discover_local_models() -> list[RegisteredModel]:
    """Auto-discover locally running models (Ollama, vLLM, LM Studio)."""
    discovered = []

    # Ollama
    try:
        async with httpx.AsyncClient(timeout=3) as client:
            resp = await client.get(f"{settings.ollama_api_base}/api/tags")
            if resp.status_code == 200:
                for model in resp.json().get("models", []):
                    name = model.get("name", "").split(":")[0]
                    discovered.append(RegisteredModel(
                        name=f"ollama/{name}",
                        provider="ollama",
                        max_data_tier=DataClassification.RESTRICTED,  # local = safe for anything
                        region="local",
                        context_window=model.get("details", {}).get("parameter_size", 128_000),
                        cost_per_1k_input=0.0,
                        cost_per_1k_output=0.0,
                    ))
    except Exception:
        pass

    # vLLM (check common port 8000)
    # LM Studio (check port 1234)
    # ... similar pattern

    return discovered
```

7.2 tappass up --local mode in CLI
```python
# New CLI command flow:
# 1. Detect Ollama → list available models
# 2. User picks agent model + judge model
# 3. Auto-configure .env with local settings
# 4. Start server
```

7.3 Config presets
Section titled “7.3 Config presets”Add config presets for common local setups:
```python
PRESETS = {
    "local-ollama": {
        "TAPPASS_LLM_JUDGE_MODEL": "ollama/llama3.2",
        "TAPPASS_LLM_JUDGE_FALLBACK_MODEL": "ollama/mistral",
        "TAPPASS_OLLAMA_API_BASE": "http://localhost:11434",
        "TAPPASS_NER_ENABLED": "1",      # spaCy for PII (no cloud needed)
        "TAPPASS_PII_LLM_ENABLED": "0",  # Disable LLM PII to reduce latency
    },
    "local-vllm": {
        "TAPPASS_LLM_JUDGE_MODEL": "openai/meta-llama/Meta-Llama-3.1-8B-Instruct",
        "OPENAI_API_BASE": "http://localhost:8000/v1",
    },
    "air-gapped": {
        "TAPPASS_LLM_JUDGE_MODEL": "ollama/llama3.2",
        "TAPPASS_NER_ENABLED": "1",
        "TAPPASS_PII_LLM_ENABLED": "0",
        "TAPPASS_EU_DATA_RESIDENCY": "true",
    },
    "eu-sovereign": {
        "TAPPASS_LLM_JUDGE_MODEL": "mistral/mistral-small-latest",
        "TAPPASS_EU_DATA_RESIDENCY": "true",
    },
}
```

7.4 Local model health monitoring
Section titled “7.4 Local model health monitoring”Add health checks for local model endpoints:
```
# In health endpoint, add local model status
{
  "status": "healthy",
  "local_models": {
    "ollama": {"status": "connected", "models": ["llama3.2", "mistral"]},
    "vllm": {"status": "not_configured"}
  }
}
```

8. SDK-side changes needed
8.1 Model provider helpers in the SDK
Section titled “8.1 Model provider helpers in the SDK”def discover_ollama(base_url: str = "http://localhost:11434") -> list[str]: """Discover models available in a local Ollama instance.""" import httpx resp = httpx.get(f"{base_url}/api/tags", timeout=3) resp.raise_for_status() return [m["name"].split(":")[0] for m in resp.json().get("models", [])]
def discover_local_models() -> dict[str, list[str]]: """Discover all local LLM providers and their models.""" result = {} # Ollama (11434) try: result["ollama"] = discover_ollama() except Exception: pass # vLLM (8000) try: result["vllm"] = _discover_openai_compat("http://localhost:8000") except Exception: pass # LM Studio (1234) try: result["lmstudio"] = _discover_openai_compat("http://localhost:1234") except Exception: pass return result8.2 Agent with local model shortcut
```python
# In Agent class: convenience for local models
agent = Agent("http://localhost:9620", "tp_...", model="ollama/llama3.2")
```

This already works (`model` is just a string passed to the server). But document it prominently.
8.3 Validation on connect
Section titled “8.3 Validation on connect”When tappass[ollama] is installed, the SDK can validate the model exists:
```python
# In Agent.chat(), add optional pre-flight check
if model.startswith("ollama/") and _ollama_available:
    available = discover_ollama()
    if model_name not in available:
        raise TapPassConfigError(
            f"Model '{model_name}' not found in Ollama. "
            f"Available: {available}. Pull it with: ollama pull {model_name}"
        )
```

9. Eval: can local models do governance?
This is the critical question. TapPass’s governance judge needs to:
- Classify data sensitivity: PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED
- Score injection attempts: is this prompt injection? (0.0–1.0)
- Evaluate custom rules: BLOCK / PASS / WARN decisions
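All three tasks reduce to parsing a small model’s free-form reply into a strict label or score. A defensive-parsing sketch; the labels match this doc, but the parsing strategy and fail-closed defaults are assumptions:

```python
import re

CLASSES = ("PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED")

def parse_classification(reply: str) -> str:
    """Pick the sensitivity label out of a judge reply; fail closed."""
    found = [c for c in CLASSES if c in reply.upper()]
    # Small local models can ramble; if the reply is ambiguous, fall back
    # to the most restrictive label rather than guessing.
    return found[0] if len(found) == 1 else "RESTRICTED"

def parse_injection_score(reply: str) -> float:
    """Extract a 0.0-1.0 score; unparseable output counts as suspicious."""
    m = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", reply)
    return float(m.group(1)) if m else 1.0  # fail closed
```

Failing closed matters more with 3B judges than with gpt-4o-mini: a malformed reply must never silently become a PASS.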
What we need to benchmark
Run the existing 456-example eval corpus against local models:
| Model | Size | Classification F1 | Injection F1 | Custom rule accuracy | Latency (p50) | Verdict |
|---|---|---|---|---|---|---|
| gpt-4o-mini | Cloud | 100% (baseline) | 100% | 100% | ~300ms | ✅ Production |
| llama3.2 3B | 2GB | ? | ? | ? | ~50ms | ? |
| llama3.1 8B | 5GB | ? | ? | ? | ~100ms | ? |
| mistral 7B | 4GB | ? | ? | ? | ~80ms | ? |
| qwen2.5 7B | 4GB | ? | ? | ? | ~90ms | ? |
| phi3 3.8B | 2GB | ? | ? | ? | ~60ms | ? |
| deepseek-r1 7B | 4GB | ? | ? | ? | ~100ms | ? |
| gemma2 9B | 6GB | ? | ? | ? | ~120ms | ? |
Hypothesis (based on industry benchmarks)
- 8B+ models (llama3.1, mistral 7B, qwen2.5 7B): Should hit 85–95% on classification and injection detection. Good enough for most deployments.
- 3B models (llama3.2, phi3): Will likely hit 70–85%. Acceptable for classification, may miss subtle injection attacks.
- Reasoning models (deepseek-r1): Should be excellent at classification but slow.
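Filling in the table’s F1 columns only needs a tiny harness; a sketch of binary F1 over e.g. the injection-detection labels (the corpus format is a placeholder):

```python
def f1_score(y_true: list[bool], y_pred: list[bool]) -> float:
    """Binary F1, e.g. for 'is this prompt injection?' labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Running this per model over the 456-example corpus is enough to populate the benchmark page; latency can be measured in the same loop.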
Recommendation
- Run the eval and publish results as a public benchmark page (`docs/local-model-benchmarks.md`)
- Define quality tiers:
  - Recommended: Models that hit ≥90% on all evals
  - Acceptable: Models that hit ≥80% (with warning)
  - Not recommended: Models below 80% (with clear warning)
- The `tappass up --local` wizard should show these recommendations
The escape hatch: deterministic-only mode
For users who don’t trust local model quality:
```yaml
# Pipeline config: disable all LLM judge steps
classify_data:
  use_llm: false       # regex-only classification
detect_injection:
  use_llm: false       # heuristic-only injection detection
```

This gives zero LLM dependency for the governance pipeline. The deterministic steps alone (regex PII, pattern injection, taint tracking, etc.) already caught 95/95 red-team attacks with 0 bypasses. The LLM judge is a safety net, not the primary defense.
10. Competitive positioning
Why this matters for GitHub stars
| Signal | Impact on stars | Why |
|---|---|---|
| “Works with Ollama” | ⭐⭐⭐⭐⭐ | Ollama users are the most active open-source AI community. They star everything that works with Ollama. |
| “Air-gapped deployment” | ⭐⭐⭐⭐ | Enterprise security teams share these on LinkedIn. Defense, healthcare, gov. |
| “EU data sovereignty” | ⭐⭐⭐ | European developers specifically search for this. |
| “Zero cloud dependency” | ⭐⭐⭐⭐ | Privacy-conscious developers love this messaging. |
Competitors
| Competitor | Local LLM support | TapPass advantage |
|---|---|---|
| Guardrails AI | Partial: some validators run locally via HF models, but core needs OpenAI | TapPass: full-stack local; the multi-step governance pipeline works without any LLM. |
| LangSmith | No local LLM for tracing | TapPass: governance + audit, fully local |
| PromptGuard | No | TapPass: full local |
| NeMo Guardrails | Good: supports local models via LangChain | Comparable. TapPass differentiates on capability tokens + audit. |
| LlamaGuard | Yes: IS a local model | TapPass: orchestration layer. Can use LlamaGuard as one of its judge models. |
Key differentiator
TapPass is the only governance platform where the deterministic pipeline works without any LLM. All 28 regex/heuristic steps run locally with zero model dependency. The LLM is an optional enhancement, not a requirement.
This is unique. Every competitor requires an LLM for their core functionality. TapPass doesn’t.
11. Risks and mitigations
| Risk | Severity | Mitigation |
|---|---|---|
| Local model quality too low for governance judge | HIGH | Publish eval benchmarks. Recommend specific models. Offer deterministic-only mode. |
| Ollama not installed / wrong version | MEDIUM | tappass doctor detects and guides. SDK extras validate on import. |
| GPU memory exhaustion | MEDIUM | Guide: governance judge should use small model (3B). Agent model can be larger. Two separate models. |
| Users assume “local = secure” without understanding threat model | HIGH | Document clearly: local LLM prevents data exfiltration to cloud. It does NOT prevent the agent from misusing data locally. TapPass pipeline + sandbox together provide full coverage. |
| Latency: local models slower than cloud API on CPU | MEDIUM | Recommend GPU for production. For dev: 3B models are fast enough on CPU. LLM judge is optional. |
| Supporting too many providers | LOW | litellm handles the actual calls. SDK extras are just DX wrappers. Minimal maintenance burden. |
12. Implementation roadmap
Phase 1: Documentation + Eval (1 week)
Zero code changes. Maximum impact.
- Write `docs/local-llm-guide.md`: how to run TapPass with local models today
- Run eval corpus against top 5 local models via Ollama
- Publish `docs/local-model-benchmarks.md` with results
- Add “Local LLMs” section to main README
- Add local model examples to `examples/frameworks/` (e.g. `ollama_local.py`)
- Blog post: “Air-gapped AI governance with TapPass + Ollama”
Phase 2: DX improvements (1–2 weeks)
- Add `tappass up --local` CLI mode (detect Ollama, pick models, auto-configure)
- Add `tappass models` command (list available local + cloud models)
- Add model auto-discovery to health endpoint
- Add `tappass[ollama]` SDK extra with model discovery
- Add config presets (`local-ollama`, `air-gapped`, `eu-sovereign`)
- Add local model entries to `model_gateway.py` auto-discovery
- Update `.env.example` with local model sections
Phase 3: Provider-specific extras (1 week)
- `tappass[mistral]`: Mistral API key validation, EU compliance checks
- `tappass[kimi]`: Moonshot/Kimi API config
- `tappass[vllm]`: vLLM endpoint discovery
- `tappass[lmstudio]`: LM Studio model discovery
- `tappass[airgap]`: meta-extra with everything for air-gapped deployment
Phase 4: Advanced (2–3 weeks)
- Docker image variant: `tappass/tappass:local` with Ollama baked in
- Docker Compose template: `docker-compose.local.yml` (TapPass + Ollama + PostgreSQL)
- Helm chart variant for local deployment
- Governance judge auto-selection based on available local model quality
- LlamaGuard integration as a pipeline judge step
- Streaming from local models (already supported via litellm, just needs testing)
13. Marketing angle
Headline
TapPass: The only AI governance platform that works without cloud LLMs.
Sub-messages
- “Air-gapped governance”: for defense, healthcare, government. Nothing leaves your network.
- “EUR 0.00/month LLM costs”: run the entire governance stack on Ollama. Free forever.
- “EU data sovereignty by design”: Mistral (Paris) + Ollama (your server) = no US data transfers.
- “Works with Ollama in 60 seconds”: `tappass up --local` detects your models and configures everything.
README badge
Comparison page: docs/tappass-vs-guardrails-ai.md
“Guardrails AI requires OpenAI for its core validators. TapPass’s 28 deterministic steps work without any LLM. Add a local model for the extra 20%, or don’t.”
Summary
| Dimension | Current | After |
|---|---|---|
| Local LLM for agent calls | ✅ Works (via litellm) | ✅ Documented, easy to discover |
| Local LLM for governance judge | ✅ Works (config only) | ✅ Benchmarked, recommended models, presets |
| Air-gapped deployment | ⚠️ Possible but undocumented | ✅ First-class: tappass up --local, Docker image |
| SDK extras | ❌ None | ✅ tappass[ollama], tappass[mistral], tappass[kimi], tappass[airgap] |
| Eval benchmarks | ❌ None | ✅ Published benchmarks for 8+ local models |
| Marketing | ❌ Not mentioned | ✅ “Air-gapped governance” as key differentiator |
Bottom line: The hard engineering work is done (litellm abstraction, deterministic pipeline). What’s needed is packaging, documentation, benchmarks, and marketing. This is a 4–6 week initiative that could be the single biggest driver of GitHub stars and enterprise adoption.