Local LLM Support: Full Analysis
Goal: Enable tappass[ollama], tappass[mistral], tappass[kimi], etc. as SDK extras so users can run the entire TapPass governance stack on local LLMs with zero cloud dependency.
Date: 2026-02-27 Author: SDK Review
Table of contents
- Executive summary
- Current state
- What “local LLM support” actually means for TapPass
- Architecture: the two LLM roles
- Provider analysis: which local LLMs to support
- SDK extras design
- Server-side changes needed
- SDK-side changes needed
- Eval: can local models do governance?
- Competitive positioning
- Risks and mitigations
- Implementation roadmap
- Marketing angle
1. Executive summary
TapPass already supports local LLMs at the server level. `call_llm.py` uses `litellm.acompletion()`, which routes to Ollama, vLLM, LM Studio, or any OpenAI-compatible endpoint. The config already has `ollama_api_base` and `eu_allowed_judge_models` with Ollama entries. The model routing step already routes RESTRICTED → `ollama/llama3.2`.
What’s missing is the user-facing packaging and DX. The value proposition (“run everything on local LLMs, nothing leaves your machine”) is buried in config comments. There’s no `tappass[ollama]` extra, no one-command setup, no documentation, and no eval benchmarks for local models as governance judges.
This is a high-impact, medium-effort initiative. The plumbing exists. What’s needed:
- SDK extras (`tappass[ollama]`, `tappass[vllm]`, `tappass[mistral]`, `tappass[kimi]`) that install provider deps and configure the right env
- A `tappass up --local` mode that auto-detects Ollama/vLLM and configures everything
- Eval benchmarks: which local models meet the governance quality bar (PII, injection, classification)
- Marketing: “Air-gapped governance. Nothing leaves your network.”
2. Current state
What already works
| Layer | Local LLM support | Status |
|---|---|---|
| `call_llm.py`: agent LLM calls | ✅ Full | Routes via litellm to any provider. `model="ollama/llama3.2"` works today. |
| `classify_data.py`: LLM judge | ✅ Full | `"llm_model": "ollama/llama3.2"` in step config. |
| `llm_step.py`: custom LLM steps | ✅ Full | CISO can set `"model": "ollama/mistral"` per step. |
| `model_routing.py`: sensitivity routing | ✅ Full | Already routes RESTRICTED → `ollama/llama3.2`. |
| `config.py`: server config | ⚠️ Partial | Has `ollama_api_base`, `eu_allowed_judge_models` with Ollama entries. Missing vLLM, LM Studio, Kimi. |
| `circuit_breaker.py`: provider routing | ✅ Full | Handles `ollama/` prefix, fallback between providers. |
| `model_gateway.py`: model registry | ⚠️ Minimal | Only registers 3 cloud models. No local model auto-discovery. |
| SDK | ❌ None | No local-LLM extras, no provider deps, no auto-config. |
| CLI | ❌ None | No tappass up --local, no model auto-detection. |
| Docs | ❌ None | No local LLM guide. Buried in config comments. |
| Eval | ❌ None | No benchmarks for local models as governance judges. |
The litellm abstraction
The core insight: litellm is already the universal LLM adapter. It supports 100+ providers, including:
- Ollama: `ollama/llama3.2`, `ollama/mistral`, `ollama/qwen2.5`
- vLLM: `openai/model-name` with custom `api_base`
- LM Studio: `openai/model-name` with `http://localhost:1234/v1`
- Mistral (cloud): `mistral/mistral-small-latest`
- Mistral (local via Ollama): `ollama/mistral`
- Kimi / Moonshot: `openai/moonshot-v1-8k` with custom `api_base`
- DeepSeek (cloud): `deepseek/deepseek-chat`
- DeepSeek (local): `ollama/deepseek-r1`
- llamacpp: `openai/model` with llama.cpp server
- text-generation-inference (TGI): `huggingface/model`
TapPass doesn’t need to implement any provider adapters. It just needs to:
- Configure litellm correctly for each provider
- Package the dependencies
- Make the UX seamless
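The three steps above mostly reduce to a lookup: provider → litellm model string, plus an `api_base` for OpenAI-compatible servers. A minimal sketch; the helper name is hypothetical and the llamacpp port is an assumption, while the prefixes follow litellm’s documented conventions:

```python
# Hypothetical helper: map a local provider to the model string and extra
# kwargs that would be passed through to litellm.acompletion().
def litellm_params(provider: str, model: str) -> dict:
    """Return litellm kwargs for a given local/cloud provider."""
    if provider == "ollama":
        # litellm has a native ollama/ prefix (default port 11434)
        return {"model": f"ollama/{model}"}
    if provider in ("vllm", "lmstudio", "llamacpp"):
        # OpenAI-compatible servers: openai/ prefix + custom api_base
        bases = {
            "vllm": "http://localhost:8000/v1",
            "lmstudio": "http://localhost:1234/v1",
            "llamacpp": "http://localhost:8080/v1",  # assumed default port
        }
        return {"model": f"openai/{model}", "api_base": bases[provider]}
    if provider == "mistral":
        return {"model": f"mistral/{model}"}  # cloud API
    raise ValueError(f"unknown provider: {provider}")
```

This is the entire adapter surface TapPass would own; everything below it is litellm.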
3. What “local LLM support” actually means for TapPass
There are three distinct user stories, and they have different requirements:
Story A: “I want governed agents, using a local LLM for the agent work”
The agent’s LLM calls go to a local model instead of OpenAI. TapPass still sits in the middle and governs everything.
```
Agent → TapPass → Ollama (local) → TapPass → Agent
                     ↑
        llama3.2 running on localhost:11434
```

This already works. The user just sets `model="ollama/llama3.2"` in their chat call. No changes needed to TapPass; litellm handles it.
Story B: “I want the governance pipeline itself to run on local LLMs”
The pipeline’s LLM judge (for classification, injection scoring, custom LLM steps) runs on a local model instead of calling OpenAI.
```
Agent → TapPass pipeline → [classify: ollama/llama3.2] → [call_llm: ollama/llama3.2] → Agent
                                        ↑
                           LLM judge runs locally too
```

This already works at the config level (`TAPPASS_LLM_JUDGE_MODEL=ollama/llama3.2`), but:
- No one knows about it (not documented)
- No eval: does llama3.2 actually detect PII/injection well enough?
- No preset configs for common local setups
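A minimal `.env` sketch for this story, using the variable names that appear in this doc’s config presets (section 7.3):

```shell
# Governance judge runs on a local Ollama model; the agent model is unchanged
TAPPASS_LLM_JUDGE_MODEL=ollama/llama3.2
TAPPASS_LLM_JUDGE_FALLBACK_MODEL=ollama/mistral
TAPPASS_OLLAMA_API_BASE=http://localhost:11434
```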
Story C: “I want TapPass 100% air-gapped; nothing leaves my network”
The entire stack (server, pipeline, LLM calls, judge calls) runs locally. Zero egress. This is the nuclear option for defense, healthcare, and regulated industries.

```
┌────────────────── Air-gapped network ──────────────────┐
│                                                        │
│   Agent → TapPass → Ollama/vLLM (local GPU)            │
│                                                        │
│   [all pipeline steps use local models]                │
│   [no internet access required]                        │
│                                                        │
└────────────────────────────────────────────────────────┘
```

This is the killer feature. It requires:
- All deterministic pipeline steps work without any LLM (they already do)
- LLM judge steps configured to use local models
- NER/PII detection via spaCy (already supported, runs locally)
- No telemetry, no license check calling home
- Docker image with all deps baked in
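One way to keep the “zero egress” requirement honest in CI; a hypothetical smoke-test sketch (not a TapPass feature) that patches `socket.connect` to reject any non-loopback destination, so any stack exercised afterwards fails loudly if it phones home:

```python
# Hypothetical air-gap smoke test: block any socket connection that
# tries to leave localhost, then exercise the full pipeline.
import socket

ALLOWED_HOSTS = {"127.0.0.1", "::1", "localhost"}

_real_connect = socket.socket.connect

def guarded_connect(self, address):
    # AF_INET/AF_INET6 addresses are (host, port, ...) tuples
    host = address[0] if isinstance(address, tuple) else address
    if host not in ALLOWED_HOSTS:
        raise RuntimeError(f"egress attempt blocked: {host}")
    return _real_connect(self, address)

socket.socket.connect = guarded_connect

# Anything run after this point (e.g. a TapPass pipeline invocation)
# raises instead of silently reaching the internet.
```

License checks, telemetry, and cloud judge calls all surface immediately under this harness.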
4. Architecture: the two LLM roles
TapPass uses LLMs in exactly two roles, and they have different quality requirements:
Role 1: Agent LLM (the “work” model)
Section titled “Role 1: Agent LLM (the “work” model)”- What: The model that does the agent’s actual work (answering questions, generating code, etc.)
- Where: `call_llm.py` step
- Quality bar: Up to the user. If they want to use llama3.2 for their agent, that’s their choice.
- TapPass involvement: Just routes the call and governs it. Doesn’t care about model quality.
Role 2: Governance judge (the “safety” model)
- What: The model that evaluates requests/responses for PII, injection, classification
- Where: `classify_data.py` (when `use_llm=true`), `llm_step.py` (custom CISO steps), `scan_output.py`
- Quality bar: HIGH. A bad judge = false negatives = PII leaks, injection bypasses.
- TapPass involvement: This is the critical path. If the judge model is weak, governance is weak.
Key insight: Most of TapPass’s governance is deterministic (regex, pattern matching, heuristics). The LLM judge is an optional escalation layer. This means:
TapPass governance works without any LLM at all. The 28 deterministic steps (PII regex, secret patterns, injection heuristics, taint tracking, rate limits, budgets) run without an LLM. The LLM judge only fires for semantic classification that regex can’t catch.
This is a massive selling point for local/air-gapped deployments: you get 80% of the governance with zero LLM dependency, and can add a local LLM for the remaining 20%.
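The 80/20 split above is essentially a two-stage escalation: deterministic checks first, LLM judge only for what they can’t decide. A minimal sketch; the function names and regexes here are illustrative, not TapPass’s actual step implementations:

```python
import re

# Stage 1: deterministic checks (illustrative patterns only)
DETERMINISTIC_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "injection": re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
}

def deterministic_scan(text: str) -> list[str]:
    """Return the names of all deterministic rules that fired."""
    return [name for name, pat in DETERMINISTIC_PATTERNS.items() if pat.search(text)]

def govern(text: str, llm_judge=None) -> str:
    """Two-stage governance: regex first, optional LLM judge second."""
    hits = deterministic_scan(text)
    if hits:
        return f"BLOCK ({', '.join(hits)})"
    if llm_judge is not None:
        # Stage 2: escalate only the cases the regexes can't decide
        return llm_judge(text)
    return "PASS"  # deterministic-only mode: zero LLM dependency
```

With `llm_judge=None` this is the air-gapped, no-model configuration; plugging in a local model only upgrades the ambiguous cases.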
5. Provider analysis: which local LLMs to support
Tier 1: Must-have (highest user demand)
| Provider | Protocol | Local model | Why |
|---|---|---|---|
| Ollama | ollama/* | llama3.2, mistral, qwen2.5, phi3, deepseek-r1, gemma2 | #1 local LLM runner. 1M+ downloads. Zero config. |
| vLLM | openai/* + custom base | Any HF model | Production-grade. Used by most enterprises for local serving. |
| LM Studio | openai/* + localhost:1234 | Any GGUF model | Developer-friendly GUI. 10M+ downloads. |
Tier 2: High value (cloud providers with local options)
| Provider | Protocol | Local option | Why |
|---|---|---|---|
| Mistral | mistral/* (cloud) or ollama/mistral (local) | Yes via Ollama | EU-headquartered. GDPR-friendly. Strong small models (7B). |
| DeepSeek | deepseek/* (cloud) or ollama/deepseek-r1 (local) | Yes via Ollama | Best price/performance. DeepSeek-R1 is SOTA for reasoning. |
Tier 3: Niche but valuable
| Provider | Protocol | Why |
|---|---|---|
| Kimi / Moonshot | openai/* + custom base (api.moonshot.cn) | Strong Chinese LLM. Growing enterprise adoption in APAC. |
| llamacpp | openai/* + custom base | C++ inference. Fastest on CPU-only. |
| TGI (HuggingFace) | huggingface/* | Popular in MLOps teams. |
| Apple MLX | openai/* + custom base (mlx_lm.server) | Growing fast on Apple Silicon Macs. |
Recommendation
Ship Tier 1 + Tier 2 first. Ollama covers 90% of local LLM users. vLLM covers enterprise. LM Studio covers developers. Mistral gets the EU angle. Kimi and others can come later.
6. SDK extras design
Section titled “6. SDK extras design”The problem
Currently the SDK (`pip install tappass`) is a pure HTTP client. It talks to the TapPass server, which talks to the LLM. The SDK doesn’t need to know about LLM providers.
But for local LLM DX, we want `pip install tappass[ollama]` to:
- Validate Ollama is installed and reachable
- Auto-discover available local models
- Configure the TapPass server to use them
- Provide a `tappass up --local` command
Proposed extras
```toml
[project.optional-dependencies]
sandbox = ["nono-py>=0.1.0"]
ollama = ["ollama>=0.4.0"]         # Ollama Python client (for model discovery)
mistral = ["mistralai>=1.0.0"]     # Mistral client (for API key validation)
kimi = []                          # No extra deps: uses OpenAI-compatible API
local = ["ollama>=0.4.0"]          # Meta-extra: everything needed for local LLM
all = ["nono-py>=0.1.0", "ollama>=0.4.0", "mistralai>=1.0.0"]
```

Important: The extras are for DX tooling (model discovery, validation, health checks), not for inference. Inference always goes through the TapPass server, which uses litellm. The SDK extras just make setup easier.
Server extras (already exist, need expansion)
```toml
# pyproject.toml (server)
[project.optional-dependencies]
llm = ["litellm>=1.50.0"]            # Already exists
local = ["psycopg[binary]>=3.1.0"]   # Already exists; rename to "postgres"
ollama = ["litellm>=1.50.0"]         # Same as llm (litellm handles Ollama)
vllm = ["litellm>=1.50.0"]           # Same as llm (litellm handles vLLM)
mistral = ["litellm>=1.50.0"]        # Same as llm
airgap = ["litellm>=1.50.0", "spacy>=3.7.0", "presidio-analyzer>=2.2.0", "presidio-anonymizer>=2.2.0"]  # Everything for air-gapped
```

The server extras are all the same under the hood (litellm handles every provider). But having named extras is a marketing and DX signal: when someone searches “tappass ollama” they see it’s explicitly supported.
7. Server-side changes needed
7.1 Model auto-discovery in model_gateway.py
Currently, only 3 hardcoded cloud models are registered. Add auto-discovery:
```python
async def discover_local_models() -> list[RegisteredModel]:
    """Auto-discover locally running models (Ollama, vLLM, LM Studio)."""
    discovered = []

    # Ollama
    try:
        async with httpx.AsyncClient(timeout=3) as client:
            resp = await client.get(f"{settings.ollama_api_base}/api/tags")
            if resp.status_code == 200:
                for model in resp.json().get("models", []):
                    name = model.get("name", "").split(":")[0]
                    discovered.append(RegisteredModel(
                        name=f"ollama/{name}",
                        provider="ollama",
                        max_data_tier=DataClassification.RESTRICTED,  # local = safe for anything
                        region="local",
                        context_window=model.get("details", {}).get("parameter_size", 128_000),
                        cost_per_1k_input=0.0,
                        cost_per_1k_output=0.0,
                    ))
    except Exception:
        pass

    # vLLM (check common port 8000)
    # LM Studio (check port 1234)
    # ... similar pattern

    return discovered
```

7.2 tappass up --local mode in CLI
```python
# New CLI command flow:
# 1. Detect Ollama → list available models
# 2. User picks agent model + judge model
# 3. Auto-configure .env with local settings
# 4. Start server
```

7.3 Config presets
Section titled “7.3 Config presets”Add config presets for common local setups:
```python
PRESETS = {
    "local-ollama": {
        "TAPPASS_LLM_JUDGE_MODEL": "ollama/llama3.2",
        "TAPPASS_LLM_JUDGE_FALLBACK_MODEL": "ollama/mistral",
        "TAPPASS_OLLAMA_API_BASE": "http://localhost:11434",
        "TAPPASS_NER_ENABLED": "1",      # spaCy for PII (no cloud needed)
        "TAPPASS_PII_LLM_ENABLED": "0",  # Disable LLM PII to reduce latency
    },
    "local-vllm": {
        "TAPPASS_LLM_JUDGE_MODEL": "openai/meta-llama/Meta-Llama-3.1-8B-Instruct",
        "OPENAI_API_BASE": "http://localhost:8000/v1",
    },
    "air-gapped": {
        "TAPPASS_LLM_JUDGE_MODEL": "ollama/llama3.2",
        "TAPPASS_NER_ENABLED": "1",
        "TAPPASS_PII_LLM_ENABLED": "0",
        "TAPPASS_EU_DATA_RESIDENCY": "true",
    },
    "eu-sovereign": {
        "TAPPASS_LLM_JUDGE_MODEL": "mistral/mistral-small-latest",
        "TAPPASS_EU_DATA_RESIDENCY": "true",
    },
}
```

7.4 Local model health monitoring
Section titled “7.4 Local model health monitoring”Add health checks for local model endpoints:
```
# In health endpoint, add local model status
{
  "status": "healthy",
  "local_models": {
    "ollama": {"status": "connected", "models": ["llama3.2", "mistral"]},
    "vllm": {"status": "not_configured"}
  }
}
```

8. SDK-side changes needed
8.1 Model provider helpers in the SDK
Section titled “8.1 Model provider helpers in the SDK”def discover_ollama(base_url: str = "http://localhost:11434") -> list[str]: """Discover models available in a local Ollama instance.""" import httpx resp = httpx.get(f"{base_url}/api/tags", timeout=3) resp.raise_for_status() return [m["name"].split(":")[0] for m in resp.json().get("models", [])]
def discover_local_models() -> dict[str, list[str]]: """Discover all local LLM providers and their models.""" result = {} # Ollama (11434) try: result["ollama"] = discover_ollama() except Exception: pass # vLLM (8000) try: result["vllm"] = _discover_openai_compat("http://localhost:8000") except Exception: pass # LM Studio (1234) try: result["lmstudio"] = _discover_openai_compat("http://localhost:1234") except Exception: pass return result8.2 Agent with local model shortcut
```python
# In Agent class: convenience for local models
agent = Agent("http://localhost:9620", "tp_...", model="ollama/llama3.2")
```

This already works (`model` is just a string passed to the server). But document it prominently.
8.3 Validation on connect
Section titled “8.3 Validation on connect”When tappass[ollama] is installed, the SDK can validate the model exists:
```python
# In Agent.chat(), add optional pre-flight check
if model.startswith("ollama/") and _ollama_available:
    available = discover_ollama()
    if model_name not in available:
        raise TapPassConfigError(
            f"Model '{model_name}' not found in Ollama. "
            f"Available: {available}. Pull it with: ollama pull {model_name}"
        )
```

9. Eval: can local models do governance?
This is the critical question. TapPass’s governance judge needs to:
- Classify data sensitivity: PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED
- Score injection attempts: is this prompt injection? (0.0–1.0)
- Evaluate custom rules: BLOCK / PASS / WARN decisions
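All three tasks reduce to parsing a small model’s free-form reply into a strict label or score. A defensive-parsing sketch; the labels match this doc, but the parsing strategy and fail-closed defaults are assumptions:

```python
import re

CLASSES = ("PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED")

def parse_classification(reply: str) -> str:
    """Pick the sensitivity label out of a judge reply; fail closed."""
    found = [c for c in CLASSES if c in reply.upper()]
    # Small local models can ramble; if the reply is ambiguous, fall back
    # to the most restrictive label rather than guessing.
    return found[0] if len(found) == 1 else "RESTRICTED"

def parse_injection_score(reply: str) -> float:
    """Extract a 0.0-1.0 score; unparseable output counts as suspicious."""
    m = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", reply)
    return float(m.group(1)) if m else 1.0  # fail closed
```

Failing closed matters more with 3B judges than with gpt-4o-mini: a malformed reply must never silently become a PASS.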
What we need to benchmark
Run the existing 456-example eval corpus against local models:
| Model | Size | Classification F1 | Injection F1 | Custom rule accuracy | Latency (p50) | Verdict |
|---|---|---|---|---|---|---|
| gpt-4o-mini | Cloud | 100% (baseline) | 100% | 100% | ~300ms | ✅ Production |
| llama3.2 3B | 2GB | ? | ? | ? | ~50ms | ? |
| llama3.1 8B | 5GB | ? | ? | ? | ~100ms | ? |
| mistral 7B | 4GB | ? | ? | ? | ~80ms | ? |
| qwen2.5 7B | 4GB | ? | ? | ? | ~90ms | ? |
| phi3 3.8B | 2GB | ? | ? | ? | ~60ms | ? |
| deepseek-r1 7B | 4GB | ? | ? | ? | ~100ms | ? |
| gemma2 9B | 6GB | ? | ? | ? | ~120ms | ? |
Hypothesis (based on industry benchmarks)
- 8B+ models (llama3.1, mistral 7B, qwen2.5 7B): Should hit 85–95% on classification and injection detection. Good enough for most deployments.
- 3B models (llama3.2, phi3): Will likely hit 70–85%. Acceptable for classification, may miss subtle injection attacks.
- Reasoning models (deepseek-r1): Should be excellent at classification but slow.
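Filling in the table’s F1 columns only needs a tiny harness; a sketch of binary F1 over e.g. the injection-detection labels (the corpus format is a placeholder):

```python
def f1_score(y_true: list[bool], y_pred: list[bool]) -> float:
    """Binary F1, e.g. for 'is this prompt injection?' labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Running this per model over the 456-example corpus is enough to populate the benchmark page; latency can be measured in the same loop.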
Recommendation
- Run the eval and publish results as a public benchmark page (`docs/local-model-benchmarks.md`)
- Define quality tiers:
  - Recommended: Models that hit ≥90% on all evals
  - Acceptable: Models that hit ≥80% (with warning)
  - Not recommended: Models below 80% (with clear warning)
- The `tappass up --local` wizard should show these recommendations
The escape hatch: deterministic-only mode
For users who don’t trust local model quality:
```yaml
# Pipeline config: disable all LLM judge steps
classify_data:
  use_llm: false       # regex-only classification
detect_injection:
  use_llm: false       # heuristic-only injection detection
```

This gives zero LLM dependency for the governance pipeline. The deterministic steps alone (regex PII, pattern injection, taint tracking, etc.) already caught 95/95 red-team attacks with 0 bypasses. The LLM judge is a safety net, not the primary defense.
10. Competitive positioning
Why this matters for GitHub stars
| Signal | Impact on stars | Why |
|---|---|---|
| “Works with Ollama” | ⭐⭐⭐⭐⭐ | Ollama users are the most active open-source AI community. They star everything that works with Ollama. |
| “Air-gapped deployment” | ⭐⭐⭐⭐ | Enterprise security teams share these on LinkedIn. Defense, healthcare, gov. |
| “EU data sovereignty” | ⭐⭐⭐ | European developers specifically search for this. |
| “Zero cloud dependency” | ⭐⭐⭐⭐ | Privacy-conscious developers love this messaging. |
Competitors
| Competitor | Local LLM support | TapPass advantage |
|---|---|---|
| Guardrails AI | Partial: some validators run locally via HF models, but core needs OpenAI | TapPass: full-stack local; the multi-step governance pipeline works without any LLM. |
| LangSmith | No local LLM for tracing | TapPass: governance + audit, fully local |
| PromptGuard | No | TapPass: full local |
| NeMo Guardrails | Good: supports local models via LangChain | Comparable. TapPass differentiates on capability tokens + audit. |
| LlamaGuard | Yes: IS a local model | TapPass: orchestration layer. Can use LlamaGuard as one of its judge models. |
Key differentiator
TapPass is the only governance platform where the deterministic pipeline works without any LLM. All 28 regex/heuristic steps run locally with zero model dependency. The LLM is an optional enhancement, not a requirement.
This is unique. Every competitor requires an LLM for their core functionality. TapPass doesn’t.
11. Risks and mitigations
| Risk | Severity | Mitigation |
|---|---|---|
| Local model quality too low for governance judge | HIGH | Publish eval benchmarks. Recommend specific models. Offer deterministic-only mode. |
| Ollama not installed / wrong version | MEDIUM | tappass doctor detects and guides. SDK extras validate on import. |
| GPU memory exhaustion | MEDIUM | Guide: governance judge should use small model (3B). Agent model can be larger. Two separate models. |
| Users assume “local = secure” without understanding threat model | HIGH | Document clearly: local LLM prevents data exfiltration to cloud. It does NOT prevent the agent from misusing data locally. TapPass pipeline + sandbox together provide full coverage. |
| Latency: local models slower than cloud API on CPU | MEDIUM | Recommend GPU for production. For dev: 3B models are fast enough on CPU. LLM judge is optional. |
| Supporting too many providers | LOW | litellm handles the actual calls. SDK extras are just DX wrappers. Minimal maintenance burden. |
12. Implementation roadmap
Phase 1: Documentation + Eval (1 week)
Zero code changes. Maximum impact.
- Write `docs/local-llm-guide.md`: how to run TapPass with local models today
- Run eval corpus against top 5 local models via Ollama
- Publish `docs/local-model-benchmarks.md` with results
- Add “Local LLMs” section to main README
- Add local model examples to `examples/frameworks/` (e.g. `ollama_local.py`)
- Blog post: “Air-gapped AI governance with TapPass + Ollama”
Phase 2: DX improvements (1–2 weeks)
- Add `tappass up --local` CLI mode (detect Ollama, pick models, auto-configure)
- Add `tappass models` command (list available local + cloud models)
- Add model auto-discovery to health endpoint
- Add `tappass[ollama]` SDK extra with model discovery
- Add config presets (`local-ollama`, `air-gapped`, `eu-sovereign`)
- Add local model entries to `model_gateway.py` auto-discovery
- Update `.env.example` with local model sections
Phase 3: Provider-specific extras (1 week)
- `tappass[mistral]`: Mistral API key validation, EU compliance checks
- `tappass[kimi]`: Moonshot/Kimi API config
- `tappass[vllm]`: vLLM endpoint discovery
- `tappass[lmstudio]`: LM Studio model discovery
- `tappass[airgap]`: meta-extra with everything for air-gapped deployment
Phase 4: Advanced (2–3 weeks)
- Docker image variant: `tappass/tappass:local` with Ollama baked in
- Docker Compose template: `docker-compose.local.yml` (TapPass + Ollama + PostgreSQL)
- Helm chart variant for local deployment
- Governance judge auto-selection based on available local model quality
- LlamaGuard integration as a pipeline judge step
- Streaming from local models (already supported via litellm, just needs testing)
13. Marketing angle
Headline
TapPass: The only AI governance platform that works without cloud LLMs.
Sub-messages
- “Air-gapped governance”: for defense, healthcare, government. Nothing leaves your network.
- “EUR 0.00/month LLM costs”: run the entire governance stack on Ollama. Free forever.
- “EU data sovereignty by design”: Mistral (Paris) + Ollama (your server) = no US data transfers.
- “Works with Ollama in 60 seconds”: `tappass up --local` detects your models and configures everything.
README badge
Comparison page: docs/tappass-vs-guardrails-ai.md
“Guardrails AI requires OpenAI for its core validators. TapPass’s 28 deterministic steps work without any LLM. Add a local model for the extra 20%, or don’t.”
Summary
| Dimension | Current | After |
|---|---|---|
| Local LLM for agent calls | ✅ Works (via litellm) | ✅ Documented, easy to discover |
| Local LLM for governance judge | ✅ Works (config only) | ✅ Benchmarked, recommended models, presets |
| Air-gapped deployment | ⚠️ Possible but undocumented | ✅ First-class: tappass up --local, Docker image |
| SDK extras | ❌ None | ✅ tappass[ollama], tappass[mistral], tappass[kimi], tappass[airgap] |
| Eval benchmarks | ❌ None | ✅ Published benchmarks for 8+ local models |
| Marketing | ❌ Not mentioned | ✅ “Air-gapped governance” as key differentiator |
Bottom line: The hard engineering work is done (litellm abstraction, deterministic pipeline). What’s needed is packaging, documentation, benchmarks, and marketing. This is a 4–6 week initiative that could be the single biggest driver of GitHub stars and enterprise adoption.