
Local LLM Support: Full Analysis

Goal: Enable tappass[ollama], tappass[mistral], tappass[kimi], etc. as SDK extras so users can run the entire TapPass governance stack on local LLMs with zero cloud dependency.

Date: 2026-02-27 Author: SDK Review


  1. Executive summary
  2. Current state
  3. What “local LLM support” actually means for TapPass
  4. Architecture: the two LLM roles
  5. Provider analysis: which local LLMs to support
  6. SDK extras design
  7. Server-side changes needed
  8. SDK-side changes needed
  9. Eval: can local models do governance?
  10. Competitive positioning
  11. Risks and mitigations
  12. Implementation roadmap
  13. Marketing angle

1. Executive summary

TapPass already supports local LLMs at the server level. call_llm.py uses litellm.acompletion(), which routes to Ollama, vLLM, LM Studio, or any OpenAI-compatible endpoint. The config already has ollama_api_base and eu_allowed_judge_models with Ollama entries. The model routing step already routes RESTRICTED → ollama/llama3.2.

What’s missing is the user-facing packaging and DX. The value proposition (“run everything on local LLMs, nothing leaves your machine”) is buried in config comments. There’s no tappass[ollama] extra, no one-command setup, no documentation, and no eval benchmarks for local models as governance judges.

This is a high-impact, medium-effort initiative. The plumbing exists. What’s needed:

  1. SDK extras (tappass[ollama], tappass[vllm], tappass[mistral], tappass[kimi]) that install provider deps and configure the right env
  2. A tappass up --local mode that auto-detects Ollama/vLLM and configures everything
  3. Eval benchmarks: which local models meet the governance quality bar (PII, injection, classification)
  4. Marketing: “Air-gapped governance. Nothing leaves your network.”

2. Current state

| Layer | Local LLM support | Status |
| --- | --- | --- |
| call_llm.py: agent LLM calls | ✅ Full | Routes via litellm to any provider. model="ollama/llama3.2" works today. |
| classify_data.py: LLM judge | ✅ Full | "llm_model": "ollama/llama3.2" in step config. |
| llm_step.py: custom LLM steps | ✅ Full | CISO can set "model": "ollama/mistral" per step. |
| model_routing.py: sensitivity routing | ✅ Full | Already routes RESTRICTED → ollama/llama3.2. |
| config.py: server config | ⚠️ Partial | Has ollama_api_base, eu_allowed_judge_models with Ollama entries. Missing vLLM, LM Studio, Kimi. |
| circuit_breaker.py: provider routing | ✅ Full | Handles the ollama/ prefix, fallback between providers. |
| model_gateway.py: model registry | ⚠️ Minimal | Only registers 3 cloud models. No local model auto-discovery. |
| SDK | ❌ None | No local-LLM extras, no provider deps, no auto-config. |
| CLI | ❌ None | No tappass up --local, no model auto-detection. |
| Docs | ❌ None | No local LLM guide. Buried in config comments. |
| Eval | ❌ None | No benchmarks for local models as governance judges. |

The core insight: litellm is already the universal LLM adapter. It supports 100+ providers including:

  • Ollama: ollama/llama3.2, ollama/mistral, ollama/qwen2.5
  • vLLM: openai/model-name with a custom api_base
  • LM Studio: openai/model-name with http://localhost:1234/v1
  • Mistral (cloud): mistral/mistral-small-latest
  • Mistral (local via Ollama): ollama/mistral
  • Kimi / Moonshot: openai/moonshot-v1-8k with a custom api_base
  • DeepSeek (cloud): deepseek/deepseek-chat
  • DeepSeek (local): ollama/deepseek-r1
  • llama.cpp: openai/model with the llama.cpp server
  • text-generation-inference (TGI): huggingface/model
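Concretely, the list above boils down to a litellm model-string prefix plus an api_base per provider. A sketch (provider keys, default ports, and the helper are illustrative, not TapPass code):

```python
# Illustrative provider table: litellm model-string prefix + default api_base.
# Ports are assumptions based on each runner's conventions.
LOCAL_PROVIDERS = {
    "ollama":   {"prefix": "ollama", "api_base": "http://localhost:11434"},
    "vllm":     {"prefix": "openai", "api_base": "http://localhost:8000/v1"},
    "lmstudio": {"prefix": "openai", "api_base": "http://localhost:1234/v1"},
    "llamacpp": {"prefix": "openai", "api_base": "http://localhost:8080/v1"},
}

def litellm_kwargs(provider: str, model_name: str) -> dict:
    """Build the kwargs a litellm.acompletion() call would need."""
    entry = LOCAL_PROVIDERS[provider]
    return {"model": f"{entry['prefix']}/{model_name}", "api_base": entry["api_base"]}
```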

TapPass doesn’t need to implement any provider adapters. It just needs to:

  1. Configure litellm correctly for each provider
  2. Package the dependencies
  3. Make the UX seamless

3. What “local LLM support” actually means for TapPass


There are three distinct user stories and they have different requirements:

Story A: “I want governed agents, using a local LLM for the agent work”


The agent’s LLM calls go to a local model instead of OpenAI. TapPass still sits in the middle and governs everything.

Agent → TapPass → Ollama (local) → TapPass → Agent
llama3.2 running on localhost:11434

This already works. The user just sets model="ollama/llama3.2" in their chat call. No changes needed to TapPass. litellm handles it.

Story B: “I want the governance pipeline itself to run on local LLMs”


The pipeline’s LLM judge (for classification, injection scoring, custom LLM steps) runs on a local model instead of calling OpenAI.

Agent → TapPass pipeline → [classify: ollama/llama3.2] → [call_llm: ollama/llama3.2] → Agent
LLM judge runs locally too

This already works at the config level (TAPPASS_LLM_JUDGE_MODEL=ollama/llama3.2) but:

  • No one knows about it (not documented)
  • No eval: does llama3.2 actually detect PII/injection well enough?
  • No preset configs for common local setups
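For reference, Story B is a two-line config change. A sketch of the .env (variable names as they appear in the server config; values illustrative):

```
# .env: run the governance judge on a local Ollama model
TAPPASS_LLM_JUDGE_MODEL=ollama/llama3.2
TAPPASS_LLM_JUDGE_FALLBACK_MODEL=ollama/mistral
TAPPASS_OLLAMA_API_BASE=http://localhost:11434
```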

Story C: “I want TapPass 100% air-gapped: nothing leaves my network”


The entire stack (server, pipeline, LLM calls, judge calls) runs locally. Zero egress. This is the nuclear option for defense, healthcare, and regulated industries.

┌──────────────── Air-gapped network ────────────────┐
│                                                    │
│  Agent → TapPass → Ollama/vLLM (local GPU)         │
│                                                    │
│  [all pipeline steps use local models]             │
│  [no internet access required]                     │
│                                                    │
└────────────────────────────────────────────────────┘

This is the killer feature. It requires:

  • All deterministic pipeline steps work without any LLM (they already do)
  • LLM judge steps configured to use local models
  • NER/PII detection via spaCy (already supported, runs locally)
  • No telemetry, no license check calling home
  • Docker image with all deps baked in
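A minimal sketch of that topology as a compose file, assuming the tappass/tappass:local image and port 9620 mentioned elsewhere in this doc; service names and env values are illustrative:

```yaml
# docker-compose.local.yml (sketch): everything stays on the compose network.
services:
  ollama:
    image: ollama/ollama:latest
    volumes: ["ollama-data:/root/.ollama"]   # model weights persist locally
  tappass:
    image: tappass/tappass:local
    environment:
      TAPPASS_LLM_JUDGE_MODEL: ollama/llama3.2
      TAPPASS_OLLAMA_API_BASE: http://ollama:11434
      TAPPASS_NER_ENABLED: "1"               # spaCy PII detection, no cloud
    ports: ["9620:9620"]
    depends_on: [ollama]
volumes:
  ollama-data:
```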

4. Architecture: the two LLM roles

TapPass uses LLMs in exactly two roles, and they have different quality requirements:

Role 1: Agent model (the “work” model)

  • What: The model that does the agent’s actual work (answering questions, generating code, etc.)
  • Where: call_llm.py step
  • Quality bar: Up to the user. If they want to use llama3.2 for their agent, that’s their choice.
  • TapPass involvement: Just routes the call and governs it. Doesn’t care about model quality.

Role 2: Governance judge (the “safety” model)

  • What: The model that evaluates requests/responses for PII, injection, classification
  • Where: classify_data.py (when use_llm=true), llm_step.py (custom CISO steps), scan_output.py
  • Quality bar: HIGH. A bad judge = false negatives = PII leaks, injection bypasses.
  • TapPass involvement: This is the critical path. If the judge model is weak, governance is weak.

Key insight: Most of TapPass’s governance is deterministic (regex, pattern matching, heuristics). The LLM judge is an optional escalation layer. This means:

TapPass governance works without any LLM at all. The 28 deterministic steps (PII regex, secret patterns, injection heuristics, taint tracking, rate limits, budgets) run without an LLM. The LLM judge only fires for semantic classification that regex can’t catch.

This is a massive selling point for local/air-gapped deployments: you get 80% of the governance with zero LLM dependency, and can add a local LLM for the remaining 20%.


5. Provider analysis: which local LLMs to support

Tier 1: Must-have (local runners)

| Provider | Protocol | Local model | Why |
| --- | --- | --- | --- |
| Ollama | ollama/* | llama3.2, mistral, qwen2.5, phi3, deepseek-r1, gemma2 | #1 local LLM runner. 1M+ downloads. Zero config. |
| vLLM | openai/* + custom base | Any HF model | Production-grade. Used by most enterprises for local serving. |
| LM Studio | openai/* + localhost:1234 | Any GGUF model | Developer-friendly GUI. 10M+ downloads. |

Tier 2: High value (cloud providers with local options)

| Provider | Protocol | Local option | Why |
| --- | --- | --- | --- |
| Mistral | mistral/* (cloud) or ollama/mistral (local) | Yes, via Ollama | EU-headquartered. GDPR-friendly. Strong small models (7B). |
| DeepSeek | deepseek/* (cloud) or ollama/deepseek-r1 (local) | Yes, via Ollama | Best price/performance. DeepSeek-R1 is SOTA for reasoning. |

Tier 3: Nice-to-have

| Provider | Protocol | Why |
| --- | --- | --- |
| Kimi / Moonshot | openai/* + custom base (api.moonshot.cn) | Strong Chinese LLM. Growing enterprise adoption in APAC. |
| llama.cpp | openai/* + custom base | C++ inference. Fastest on CPU-only. |
| TGI (HuggingFace) | huggingface/* | Popular in MLOps teams. |
| Apple MLX | openai/* + custom base (mlx_lm.server) | Growing fast on Apple Silicon Macs. |

Ship Tier 1 + Tier 2 first. Ollama covers 90% of local LLM users. vLLM covers enterprise. LM Studio covers developers. Mistral gets the EU angle. Kimi and others can come later.


6. SDK extras design

Currently the SDK (pip install tappass) is a pure HTTP client. It talks to the TapPass server, which talks to the LLM. The SDK doesn’t need to know about LLM providers.

But for local LLM DX, we want pip install tappass[ollama] to:

  1. Validate Ollama is installed and reachable
  2. Auto-discover available local models
  3. Configure the TapPass server to use them
  4. Provide a tappass up --local command
python-sdk/pyproject.toml

```toml
[project.optional-dependencies]
sandbox = ["nono-py>=0.1.0"]
ollama = ["ollama>=0.4.0"]       # Ollama Python client (for model discovery)
mistral = ["mistralai>=1.0.0"]   # Mistral client (for API key validation)
kimi = []                        # No extra deps: uses OpenAI-compatible API
local = ["ollama>=0.4.0"]        # Meta-extra: everything needed for local LLM
all = ["nono-py>=0.1.0", "ollama>=0.4.0", "mistralai>=1.0.0"]
```

Important: The extras are for DX tooling (model discovery, validation, health checks), not for inference. Inference always goes through the TapPass server which uses litellm. The SDK extras just make setup easier.

Server extras (already exist, need expansion)

```toml
# pyproject.toml (server)
[project.optional-dependencies]
llm = ["litellm>=1.50.0"]            # Already exists
local = ["psycopg[binary]>=3.1.0"]   # Already exists; should be renamed to "postgres"
ollama = ["litellm>=1.50.0"]         # Same as llm (litellm handles Ollama)
vllm = ["litellm>=1.50.0"]           # Same as llm (litellm handles vLLM)
mistral = ["litellm>=1.50.0"]        # Same as llm
airgap = [                           # Everything for air-gapped deployment
    "litellm>=1.50.0",
    "spacy>=3.7.0",
    "presidio-analyzer>=2.2.0",
    "presidio-anonymizer>=2.2.0",
]
```

The server extras are all the same under the hood (litellm handles every provider). But having named extras is a marketing and DX signal: when someone searches “tappass ollama” they see it’s explicitly supported.


7. Server-side changes needed

7.1 Model auto-discovery in model_gateway.py


Currently, only 3 hardcoded cloud models are registered. Add auto-discovery:

tappass/services/model_gateway.py

```python
async def discover_local_models() -> list[RegisteredModel]:
    """Auto-discover locally running models (Ollama, vLLM, LM Studio)."""
    discovered = []
    # Ollama
    try:
        async with httpx.AsyncClient(timeout=3) as client:
            resp = await client.get(f"{settings.ollama_api_base}/api/tags")
        if resp.status_code == 200:
            for model in resp.json().get("models", []):
                name = model.get("name", "").split(":")[0]
                discovered.append(RegisteredModel(
                    name=f"ollama/{name}",
                    provider="ollama",
                    max_data_tier=DataClassification.RESTRICTED,  # local = safe for anything
                    region="local",
                    context_window=128_000,  # /api/tags doesn't report context size; assume a default
                    cost_per_1k_input=0.0,
                    cost_per_1k_output=0.0,
                ))
    except Exception:
        pass
    # vLLM (check common port 8000)
    # LM Studio (check port 1234)
    # ... similar pattern
    return discovered
```

New tappass up --local command flow:

  1. Detect Ollama → list available models
  2. User picks agent model + judge model
  3. Auto-configure .env with local settings
  4. Start server

Add config presets for common local setups:

tappass/config_presets.py

```python
PRESETS = {
    "local-ollama": {
        "TAPPASS_LLM_JUDGE_MODEL": "ollama/llama3.2",
        "TAPPASS_LLM_JUDGE_FALLBACK_MODEL": "ollama/mistral",
        "TAPPASS_OLLAMA_API_BASE": "http://localhost:11434",
        "TAPPASS_NER_ENABLED": "1",      # spaCy for PII (no cloud needed)
        "TAPPASS_PII_LLM_ENABLED": "0",  # Disable LLM PII to reduce latency
    },
    "local-vllm": {
        "TAPPASS_LLM_JUDGE_MODEL": "openai/meta-llama/Meta-Llama-3.1-8B-Instruct",
        "OPENAI_API_BASE": "http://localhost:8000/v1",
    },
    "air-gapped": {
        "TAPPASS_LLM_JUDGE_MODEL": "ollama/llama3.2",
        "TAPPASS_NER_ENABLED": "1",
        "TAPPASS_PII_LLM_ENABLED": "0",
        "TAPPASS_EU_DATA_RESIDENCY": "true",
    },
    "eu-sovereign": {
        "TAPPASS_LLM_JUDGE_MODEL": "mistral/mistral-small-latest",
        "TAPPASS_EU_DATA_RESIDENCY": "true",
    },
}
```
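One detail worth pinning down: applying a preset should not clobber values the operator set explicitly. A sketch of that merge rule (an assumption, not existing behavior):

```python
def apply_preset(preset: dict[str, str], environ: dict[str, str]) -> dict[str, str]:
    """Merge a preset into the environment; explicit env vars win."""
    merged = dict(preset)    # preset values as the baseline
    merged.update(environ)   # anything already set takes precedence
    return merged
```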

Add health checks for local model endpoints:

In the health endpoint, report local model status:

```json
{
  "status": "healthy",
  "local_models": {
    "ollama": {"status": "connected", "models": ["llama3.2", "mistral"]},
    "vllm": {"status": "not_configured"}
  }
}
```

8. SDK-side changes needed

python-sdk/tappass/providers.py

```python
def discover_ollama(base_url: str = "http://localhost:11434") -> list[str]:
    """Discover models available in a local Ollama instance."""
    import httpx
    resp = httpx.get(f"{base_url}/api/tags", timeout=3)
    resp.raise_for_status()
    return [m["name"].split(":")[0] for m in resp.json().get("models", [])]

def discover_local_models() -> dict[str, list[str]]:
    """Discover all local LLM providers and their models."""
    result = {}
    # Ollama (11434)
    try:
        result["ollama"] = discover_ollama()
    except Exception:
        pass
    # vLLM (8000)
    try:
        result["vllm"] = _discover_openai_compat("http://localhost:8000")
    except Exception:
        pass
    # LM Studio (1234)
    try:
        result["lmstudio"] = _discover_openai_compat("http://localhost:1234")
    except Exception:
        pass
    return result
```

In the Agent class, a local model is just a model string:

```python
agent = Agent("http://localhost:9620", "tp_...", model="ollama/llama3.2")
```

This already works (model is just a string passed to the server). But document it prominently.

When tappass[ollama] is installed, the SDK can validate the model exists:

```python
# In Agent.chat(), add an optional pre-flight check
if model.startswith("ollama/") and _ollama_available:
    model_name = model.removeprefix("ollama/")
    available = discover_ollama()
    if model_name not in available:
        raise TapPassConfigError(
            f"Model '{model_name}' not found in Ollama. "
            f"Available: {available}. Pull it with: ollama pull {model_name}"
        )
```

9. Eval: can local models do governance?

This is the critical question. TapPass’s governance judge needs to:

  1. Classify data sensitivity: PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED
  2. Score injection attempts: is this prompt injection? (0.0–1.0)
  3. Evaluate custom rules: BLOCK / PASS / WARN decisions
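Whichever model fills the judge role, its free-text verdict needs defensive parsing, since small local models drift from the requested output format. A fail-closed sketch (tier names from this doc; the parsing strategy is an assumption):

```python
import re

# Ordered least to most restrictive, matching the tiers above.
TIERS = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]

def parse_classification(raw: str) -> str:
    """Extract a sensitivity tier from judge output, failing closed."""
    found = [t for t in TIERS if re.search(rf"\b{t}\b", raw.upper())]
    # Several tiers mentioned -> take the most restrictive; none -> fail closed.
    return max(found, key=TIERS.index) if found else "RESTRICTED"
```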

Run the existing 456-example eval corpus against local models:

| Model | Size | Classification F1 | Injection F1 | Custom rule accuracy | Latency (p50) | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-4o-mini | Cloud | 100% (baseline) | 100% | 100% | ~300ms | ✅ Production |
| llama3.2 3B | 2GB | ? | ? | ? | ~50ms | ? |
| llama3.1 8B | 5GB | ? | ? | ? | ~100ms | ? |
| mistral 7B | 4GB | ? | ? | ? | ~80ms | ? |
| qwen2.5 7B | 4GB | ? | ? | ? | ~90ms | ? |
| phi3 3.8B | 2GB | ? | ? | ? | ~60ms | ? |
| deepseek-r1 7B | 4GB | ? | ? | ? | ~100ms | ? |
| gemma2 9B | 6GB | ? | ? | ? | ~120ms | ? |
Expected results (to be validated by the eval):

  • 8B+ models (llama3.1, mistral 7B, qwen2.5 7B): should hit 85–95% on classification and injection detection. Good enough for most deployments.
  • 3B models (llama3.2, phi3): will likely hit 70–85%. Acceptable for classification; may miss subtle injection attacks.
  • Reasoning models (deepseek-r1): should be excellent at classification but slow.

Recommendation:

  1. Run the eval and publish results as a public benchmark page (docs/local-model-benchmarks.md)
  2. Define quality tiers:
    • Recommended: Models that hit ≥90% on all evals
    • Acceptable: Models that hit ≥80% (with warning)
    • Not recommended: Models below 80% (with clear warning)
  3. The tappass up --local wizard should show these recommendations
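Those thresholds translate directly into a gating helper the wizard could call; a minimal sketch (scores as 0–1 fractions):

```python
def quality_tier(scores: dict[str, float]) -> str:
    """Map per-eval scores to the tiers above: >=0.90 on all evals is
    recommended, >=0.80 acceptable, anything lower not recommended."""
    worst = min(scores.values())
    if worst >= 0.90:
        return "recommended"
    if worst >= 0.80:
        return "acceptable"
    return "not recommended"
```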

For users who don’t trust local model quality:

```yaml
# Pipeline config: disable all LLM judge steps
classify_data:
  use_llm: false    # regex-only classification
detect_injection:
  use_llm: false    # heuristic-only injection detection
```

This gives zero LLM dependency for the governance pipeline. The deterministic steps alone (regex PII, pattern injection, taint tracking, etc.) already caught 95/95 red-team attacks with 0 bypasses. The LLM judge is a safety net, not the primary defense.
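For illustration, a deterministic step is just pattern matching, no model in the loop. A toy sketch (two simplified patterns, not TapPass's actual rule set):

```python
import re

# Simplified example patterns; real rule sets are far more extensive.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_pii(text: str) -> list[str]:
    """Return the PII categories found in text; fully deterministic."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```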


10. Competitive positioning

| Signal | Impact on stars | Why |
| --- | --- | --- |
| “Works with Ollama” | ⭐⭐⭐⭐⭐ | Ollama users are the most active open-source AI community. They star everything that works with Ollama. |
| “Air-gapped deployment” | ⭐⭐⭐⭐ | Enterprise security teams share these on LinkedIn. Defense, healthcare, gov. |
| “EU data sovereignty” | ⭐⭐⭐ | European developers specifically search for this. |
| “Zero cloud dependency” | ⭐⭐⭐⭐ | Privacy-conscious developers love this messaging. |
| Competitor | Local LLM support | TapPass advantage |
| --- | --- | --- |
| Guardrails AI | Partial: some validators run locally via HF models, but core needs OpenAI | TapPass: full stack local; the multi-step governance pipeline works without any LLM. |
| LangSmith | No local LLM for tracing | TapPass: governance + audit, fully local |
| PromptGuard | No | TapPass: fully local |
| NeMo Guardrails | Good: supports local models via LangChain | Comparable. TapPass differentiates on capability tokens + audit. |
| LlamaGuard | Yes: IS a local model | TapPass: orchestration layer. Can use LlamaGuard as one of its judge models. |

TapPass is the only governance platform where the deterministic pipeline works without any LLM. All 28 regex/heuristic steps run locally with zero model dependency. The LLM is optional enhancement, not a requirement.

This is unique. Every competitor requires an LLM for their core functionality. TapPass doesn’t.


11. Risks and mitigations

| Risk | Severity | Mitigation |
| --- | --- | --- |
| Local model quality too low for governance judge | HIGH | Publish eval benchmarks. Recommend specific models. Offer deterministic-only mode. |
| Ollama not installed / wrong version | MEDIUM | tappass doctor detects and guides. SDK extras validate on import. |
| GPU memory exhaustion | MEDIUM | Guide: governance judge should use a small model (3B). Agent model can be larger. Two separate models. |
| Users assume “local = secure” without understanding the threat model | HIGH | Document clearly: a local LLM prevents data exfiltration to the cloud. It does NOT prevent the agent from misusing data locally. TapPass pipeline + sandbox together provide full coverage. |
| Latency: local models slower than cloud API on CPU | MEDIUM | Recommend GPU for production. For dev: 3B models are fast enough on CPU. LLM judge is optional. |
| Supporting too many providers | LOW | litellm handles the actual calls. SDK extras are just DX wrappers. Minimal maintenance burden. |

12. Implementation roadmap

Phase 1: Documentation and eval

Zero code changes. Maximum impact.

  • Write docs/local-llm-guide.md: how to run TapPass with local models today
  • Run eval corpus against top 5 local models via Ollama
  • Publish docs/local-model-benchmarks.md with results
  • Add “Local LLMs” section to main README
  • Add local model examples to examples/frameworks/ (e.g. ollama_local.py)
  • Blog post: “Air-gapped AI governance with TapPass + Ollama”

Phase 2: CLI and config tooling

  • Add tappass up --local CLI mode (detect Ollama, pick models, auto-configure)
  • Add tappass models command (list available local + cloud models)
  • Add model auto-discovery to health endpoint
  • Add tappass[ollama] SDK extra with model discovery
  • Add config presets (local-ollama, air-gapped, eu-sovereign)
  • Add local model entries to model_gateway.py auto-discovery
  • Update .env.example with local model sections

Phase 3: Provider-specific extras (1 week)

  • tappass[mistral]: Mistral API key validation, EU compliance checks
  • tappass[kimi]: Moonshot/Kimi API config
  • tappass[vllm]: vLLM endpoint discovery
  • tappass[lmstudio]: LM Studio model discovery
  • tappass[airgap]: meta-extra with everything for air-gapped deployment

Phase 4: Deployment and advanced features

  • Docker image variant: tappass/tappass:local with Ollama baked in
  • Docker Compose template: docker-compose.local.yml (TapPass + Ollama + PostgreSQL)
  • Helm chart variant for local deployment
  • Governance judge auto-selection based on available local model quality
  • LlamaGuard integration as a pipeline judge step
  • Streaming from local models (already supported via litellm, just needs testing)

13. Marketing angle

TapPass: The only AI governance platform that works without cloud LLMs.

  1. “Air-gapped governance”: for defense, healthcare, government. Nothing leaves your network.
  2. “EUR 0.00/month LLM costs”: run the entire governance stack on Ollama. Free forever.
  3. “EU data sovereignty by design”: Mistral (Paris) + Ollama (your server) = no US data transfers.
  4. “Works with Ollama in 60 seconds”: tappass up --local detects your models and configures everything.
[![Works with Ollama](https://img.shields.io/badge/Ollama-supported-green)](#)
[![Air-gapped](https://img.shields.io/badge/deployment-air--gapped-blue)](#)

Comparison page: docs/tappass-vs-guardrails-ai.md


“Guardrails AI requires OpenAI for its core validators. TapPass’s 28 deterministic steps work without any LLM. Add a local model for the extra 20%, or don’t.”


| Dimension | Current | After |
| --- | --- | --- |
| Local LLM for agent calls | ✅ Works (via litellm) | ✅ Documented, easy to discover |
| Local LLM for governance judge | ✅ Works (config only) | ✅ Benchmarked, recommended models, presets |
| Air-gapped deployment | ⚠️ Possible but undocumented | ✅ First-class: tappass up --local, Docker image |
| SDK extras | ❌ None | tappass[ollama], tappass[mistral], tappass[kimi], tappass[airgap] |
| Eval benchmarks | ❌ None | ✅ Published benchmarks for 8+ local models |
| Marketing | ❌ Not mentioned | ✅ “Air-gapped governance” as key differentiator |

Bottom line: The hard engineering work is done (litellm abstraction, deterministic pipeline). What’s needed is packaging, documentation, benchmarks, and marketing. This is a 4–6 week initiative that could be the single biggest driver of GitHub stars and enterprise adoption.