# Deployment Playbook

**Goal:** Deploy TapPass on a fresh server or upgrade an existing instance.

**Audience:** Platform engineers, DevOps.
## Architecture

The production stack consists of 7 services:
| Service | Image | Purpose |
|---|---|---|
| tappass | Built from Dockerfile | Core governance server, port 9620 |
| opa | openpolicyagent/opa:1.14.0 | Policy decision point (Rego policies) |
| postgres | postgres:16-alpine | Persistent storage (agents, audit, sessions) |
| redis | redis:7-alpine | Rate limiting, session state, cache (64MB, LRU) |
| spire-server | ghcr.io/spiffe/spire-server:1.11.0 | SPIFFE certificate authority |
| spire-agent | ghcr.io/spiffe/spire-agent:1.11.0 | Workload attestation, cert issuance |
| tunnel | cloudflare/cloudflared:latest | Cloudflare Tunnel for external access |
Additionally, spiffe-helper runs as a sidecar for cert rotation.
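A condensed sketch of how these services might be wired in docker-compose.prod.yml (images from the table above; volumes, healthchecks, networks, and the spiffe-helper sidecar omitted):

```yaml
services:
  tappass:
    build: .                      # Built from Dockerfile
    ports: ["9620:9620"]
    depends_on: [postgres, redis, opa]
  opa:
    image: openpolicyagent/opa:1.14.0
  postgres:
    image: postgres:16-alpine
  redis:
    image: redis:7-alpine
    # 64MB cap with LRU eviction, per the table above
    command: ["redis-server", "--maxmemory", "64mb", "--maxmemory-policy", "allkeys-lru"]
  spire-server:
    image: ghcr.io/spiffe/spire-server:1.11.0
  spire-agent:
    image: ghcr.io/spiffe/spire-agent:1.11.0
  tunnel:
    image: cloudflare/cloudflared:latest
```

The actual docker-compose.prod.yml in the repo is authoritative; this is only to show the service topology at a glance.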
## Fresh Deployment

### Prerequisites

See Server Prerequisites for full requirements.
Minimum: Ubuntu 24.04 LTS, 4 cores, 4 GB RAM, 40 GB disk, Docker + Docker Compose.
Recommended: 8+ cores, 8+ GB RAM, 100+ GB disk.
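A quick pre-flight sketch to check these minimums (assumes a GNU/Linux host with coreutils; thresholds taken from the figures above):

```sh
# Warn if the host is below the documented minimums.
cores=$(nproc)
mem_gb=$(awk '/MemTotal/ {printf "%.0f", $2/1048576}' /proc/meminfo)
disk_gb=$(df -BG --output=avail / | tail -n1 | tr -dc '0-9')

[ "$cores" -ge 4 ]    || echo "WARN: $cores cores (minimum 4)"
[ "$mem_gb" -ge 4 ]   || echo "WARN: ${mem_gb} GB RAM (minimum 4)"
[ "$disk_gb" -ge 40 ] || echo "WARN: ${disk_gb} GB free disk (minimum 40)"
command -v docker >/dev/null 2>&1        || echo "WARN: docker not found"
docker compose version >/dev/null 2>&1   || echo "WARN: docker compose plugin missing"
echo "prerequisite check complete"
```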
### Option A: Full Production Stack (Docker Compose)

```sh
git clone git@github.com:tappass/tappass.git
cd tappass/deploy
```
```sh
# Create secrets
cp ../.env.example .env.prod
```

Required secrets in `.env.prod`:
```sh
# Security (all required for production)
TAPPASS_ADMIN_API_KEY="tp_..."              # python -c "import secrets; print('tp_' + secrets.token_urlsafe(32))"
TAPPASS_JWT_SECRET="..."                    # python -c "import secrets; print(secrets.token_urlsafe(48))"
TAPPASS_VAULT_KEY="..."                     # python -c "import secrets,base64; print(base64.b64encode(secrets.token_bytes(32)).decode())"
TAPPASS_TOKEN_KEY_FILE="tappass-token.pem"  # openssl ecparam -genkey -name prime256v1 -noout -out tappass-token.pem
POSTGRES_PASSWORD="..."                     # python -c "import secrets; print(secrets.token_urlsafe(24))"
SPIRE_JOIN_TOKEN="..."                      # python -c "import secrets; print(secrets.token_urlsafe(32))"
```
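The generator one-liners in the comments above can be run in one pass. A hypothetical helper script (not part of the repo; appends to .env.prod and assumes python3 and openssl are on PATH):

```sh
# Generate all required secrets at once (sketch; mirrors the comments above).
set -eu
gen() { python3 -c "import secrets; print(secrets.token_urlsafe($1))"; }

{
  echo "TAPPASS_ADMIN_API_KEY=\"tp_$(gen 32)\""
  echo "TAPPASS_JWT_SECRET=\"$(gen 48)\""
  echo "TAPPASS_VAULT_KEY=\"$(python3 -c 'import secrets,base64; print(base64.b64encode(secrets.token_bytes(32)).decode())')\""
  echo "POSTGRES_PASSWORD=\"$(gen 24)\""
  echo "SPIRE_JOIN_TOKEN=\"$(gen 32)\""
} >> .env.prod

# EC P-256 key for TAPPASS_TOKEN_KEY_FILE
openssl ecparam -genkey -name prime256v1 -noout -out tappass-token.pem
```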
```sh
# License
TAPPASS_LICENSE="..."        # From license server

# LLM providers (at least one)
OPENAI_API_KEY="sk-..."
# ANTHROPIC_API_KEY="sk-ant-..."
# AZURE_API_KEY="..."
# AZURE_API_BASE="https://your-resource.openai.azure.com"

# LLM judge (for semantic analysis steps)
TAPPASS_LLM_JUDGE_MODEL="gpt-4o-mini"
```
```sh
# Tunnel (for external access)
TUNNEL_TOKEN="..."           # From Cloudflare Zero Trust dashboard
```

Deploy:
```sh
docker compose --env-file .env.prod -f docker-compose.prod.yml up -d

# Wait for all services to be healthy
docker compose --env-file .env.prod -f docker-compose.prod.yml ps

# Register SPIRE workload entries (one-time)
docker compose --env-file .env.prod -f docker-compose.prod.yml exec spire-server \
  bash /opt/spire/register-entries.sh

# Verify
curl -s http://localhost:9620/health | python3 -m json.tool
```

Expected health response:
```json
{
  "status": "healthy",
  "version": "0.5.0",
  "storage": "local",
  "license": {"org": "Client Corp", "tier": "enterprise", "expires": "2027-01-01"}
}
```

### Option B: Quick Start (development/demo)

```sh
pip install tappass
tappass up --license <key>
```

The interactive wizard walks through:
- License validation
- Storage selection (memory / local PostgreSQL / Supabase)
- Secret generation
- Server start
### Option C: Full Server Bootstrap

For a completely fresh Ubuntu 24.04 server, use the bootstrap script:

```sh
sudo ./deploy/bootstrap.sh
```

This installs Docker, clones the repository, sets up systemd services, configures the Cloudflare tunnel, installs MkDocs for the docs site, and creates the deploy user (`gebruiker`).
### Option 1: Cloudflare Tunnel (recommended)

Already included in the Docker Compose stack. Set `TUNNEL_TOKEN` and configure the tunnel's public hostname in the Cloudflare Zero Trust dashboard to point to `http://tappass:9620`.
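For tunnels managed by a local config file rather than the dashboard, the equivalent ingress rule would look roughly like this (the hostname is a placeholder; token-based tunnels, as used in this stack, configure this in the dashboard instead):

```yaml
# cloudflared config.yml (locally-managed tunnel variant)
ingress:
  - hostname: tappass.example.com
    service: http://tappass:9620
  - service: http_status:404   # catch-all required by cloudflared
```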
### Option 2: Caddy (self-hosted TLS)
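As a sketch, a minimal Caddyfile for this setup might look like the following (domain is a placeholder; the repo's deploy/Caddyfile is authoritative):

```
# Hypothetical Caddyfile: terminate TLS and proxy to TapPass
tappass.example.com {
    reverse_proxy localhost:9620
}
```

Caddy provisions and renews certificates automatically for the named domain.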
```sh
# Caddyfile included in deploy/
caddy run --config deploy/Caddyfile
```

## Database
### Backends
Section titled “Backends”| Backend | Config | Use case |
|---|---|---|
| Memory | No config needed | Dev/testing only. Data lost on restart. |
| Local PostgreSQL | DATABASE_URL=postgresql://... | Self-hosted production. |
| Supabase | SUPABASE_URL + SUPABASE_KEY | Managed PostgreSQL. |
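For illustration, the corresponding settings might look like this (the hostname matches the Compose service name from the architecture table; the Supabase project ref is a placeholder):

```sh
# Local PostgreSQL
DATABASE_URL="postgresql://tappass:${POSTGRES_PASSWORD}@postgres:5432/tappass"

# Supabase (managed PostgreSQL)
SUPABASE_URL="https://your-project.supabase.co"
SUPABASE_KEY="..."
```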
### Migrations

There are 12 migration files in deploy/migrations/. They run automatically on first start when using Docker Compose (mounted as init scripts).
For manual migration:
```sh
psql -h localhost -U tappass -d tappass -f deploy/migrations/001_schema.sql
# ... through 012_governance_policies.sql
```

### Backup
```sh
# PostgreSQL dump
docker compose exec postgres pg_dump -U tappass tappass > backup_$(date +%Y%m%d).sql

# Restore
docker compose exec -T postgres psql -U tappass tappass < backup_20260315.sql
```

## Upgrade
```sh
cd tappass
git pull origin main

# Rebuild and restart
cd deploy
docker compose --env-file .env.prod -f docker-compose.prod.yml up -d --build

# Verify
curl -s http://localhost:9620/health
```

New migrations run automatically when the PostgreSQL container restarts (via init scripts).
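The verify step can be made to fail loudly. A post-upgrade gate sketch, shown here against the expected response from the fresh-deployment section rather than a live server:

```sh
# Post-upgrade gate (sketch). Against a live server you would capture:
#   resp=$(curl -fsS http://localhost:9620/health)
# The expected response documented earlier stands in here for illustration.
resp='{"status": "healthy", "version": "0.5.0", "storage": "local"}'
status=$(printf '%s' "$resp" | python3 -c 'import json,sys; print(json.load(sys.stdin)["status"])')
if [ "$status" = "healthy" ]; then
  echo "OK: server healthy"
else
  echo "FAIL: status=$status" >&2
  exit 1
fi
```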
## Monitoring

### Health Endpoints

| Endpoint | Purpose | Auth required |
|---|---|---|
| GET /health | Basic health check (DB status, version) | No |
| GET /health/live | Liveness probe (process running) | No |
| GET /health/ready | Readiness probe (can serve traffic) | No |
| GET /health/startup | Startup probe (finished init) | No |
| GET /health/detailed | Full diagnostics (DB, Redis, OPA, SPIRE) | Yes (AUDITOR+) |
| GET /metrics | Prometheus metrics | Yes (AUDITOR+) |
### Kubernetes Probes

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 9620
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 9620
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /health/startup
    port: 9620
  failureThreshold: 30
  periodSeconds: 5
```

### Key Metrics to Alert On

| Signal | Alert if | Action |
|---|---|---|
| /health returns non-200 | Any occurrence | Check DB, Redis, OPA |
| Pipeline latency P99 | > 500ms | Disable heavy steps or scale |
| Error rate on /v1/chat/completions | > 1% | Check logs, LLM provider status |
| Redis memory | > 90% of 64MB limit | Increase maxmemory or check for leaks |
| PostgreSQL connections | > 80% pool | Scale DB or optimize queries |
| Disk usage | > 80% | Rotate audit logs, clean Docker images |
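As a starting point, the first two rows might translate into Prometheus alert rules like the following sketch. The metric names here are illustrative assumptions, not confirmed TapPass metric names; substitute whatever /metrics actually exposes:

```yaml
groups:
  - name: tappass
    rules:
      - alert: TapPassUnhealthy
        # Assumes a blackbox-exporter probe against /health
        expr: probe_success{job="tappass-health"} == 0
        for: 1m
      - alert: TapPassSlowPipeline
        # Hypothetical histogram name; P99 > 500ms per the table above
        expr: histogram_quantile(0.99, rate(pipeline_latency_seconds_bucket[5m])) > 0.5
        for: 10m
```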
### OpenTelemetry

Set OTEL_EXPORTER_OTLP_ENDPOINT to push traces to your collector. There is zero overhead when it is unset.

```sh
OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
OTEL_SERVICE_NAME="tappass"
```

## Scaling

### Horizontal

TapPass core is stateless. Scale by running multiple instances behind a load balancer:
```yaml
services:
  tappass:
    deploy:
      replicas: 3
```

All instances must share the same PostgreSQL and Redis.
### What's Not Horizontally Scalable (Yet)

- OPA sidecar: one per TapPass instance (stateless, low overhead)
- SPIRE: single server, but agents can run per-node
- The Docker Compose license-net network assumes a single-host license server
## Disaster Recovery

| Scenario | RTO | RPO | Action |
|---|---|---|---|
| Single container crash | ~10s | 0 | Docker auto-restarts (unless-stopped) |
| Full server failure | ~30min | Last DB backup | Redeploy from git + restore backup |
| Database corruption | ~1h | Last backup | pg_restore from backup |
| Redis data loss | 0 | 0 | Redis is cache-only, recovers automatically |
## Networking

| Port | Service | Exposed externally |
|---|---|---|
| 9620 | TapPass API | Via tunnel or reverse proxy only |
| 8181 | OPA | Internal only (container network) |
| 5432 | PostgreSQL | Internal only |
| 6379 | Redis | Internal only |
### Firewall Rules

Outbound only:
- LLM providers (api.openai.com, api.anthropic.com, etc.)
- Cloudflare (for tunnel)
- GitHub (for updates)
- License server (for license validation)