SDK Resilience: Fail-Open, Circuit Breaker, Offline Audit
TapPass sits in the hot path of every AI agent call. The resilience layer ensures TapPass never becomes a single point of failure.
The Problem
Without resilience, if TapPass goes down, every AI agent in your organization stops. The resilience layer gives you control over this trade-off.
Three Degradation Modes
| Mode | Behavior | When to use |
|---|---|---|
| fail_closed | Agent stops. TapPassConnectionError raised. | Regulated environments (healthcare, finance). Safety > availability. |
| fail_open_cached | Agent continues with last-known-good policy. Audit entries queued locally and flushed on recovery. | Recommended for most enterprises. Best balance of safety and availability. |
| fail_open_logged | Agent continues ungoverned. Every call logged locally as DEGRADED. | Development, testing, non-critical agents. |
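The three modes can be read as three branches of a single decision made per request while the server is unreachable. The sketch below is illustrative only and is not the SDK's implementation: `DegradationMode`, `ConnectionFailure`, and `handle_outage` are hypothetical local stand-ins, and refusing on a cache miss in `fail_open_cached` is an assumption the text above does not specify.

```python
from enum import Enum


class DegradationMode(Enum):
    # Hypothetical local enum; values mirror the mode names in the table above.
    FAIL_CLOSED = "fail_closed"
    FAIL_OPEN_CACHED = "fail_open_cached"
    FAIL_OPEN_LOGGED = "fail_open_logged"


class ConnectionFailure(Exception):
    """Hypothetical stand-in for the SDK's TapPassConnectionError."""


def handle_outage(mode, cached_response, call_ungoverned, log):
    """Decide what happens to one request while the server is unreachable."""
    if mode is DegradationMode.FAIL_CLOSED:
        # Safety over availability: stop the agent.
        raise ConnectionFailure("server unreachable and mode is fail_closed")
    if mode is DegradationMode.FAIL_OPEN_CACHED:
        if cached_response is None:
            # Assumption: with nothing cached, this sketch refuses the call.
            raise ConnectionFailure("server unreachable and nothing cached")
        return {"content": cached_response, "degraded": True}
    # FAIL_OPEN_LOGGED: continue ungoverned, but leave a local trace.
    log.append("DEGRADED")
    return {"content": call_ungoverned(), "degraded": True}
```

Passing the cache lookup result and the ungoverned call in as arguments keeps the decision logic free of I/O, which makes each branch easy to test in isolation.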
Quick Start
```python
from tappass import Agent
from tappass.resilience import ResiliencePolicy, FailMode

agent = Agent(
    "http://tappass:9620",
    resilience=ResiliencePolicy(
        mode=FailMode.FAIL_OPEN_CACHED,
        cache_ttl_seconds=300,      # 5 min cache
        max_offline_requests=100,   # hard cap
    ),
)

# If TapPass is down:
response = agent.chat("Analyze Q3 revenue")
print(response.pipeline.degraded)         # True
print(response.pipeline.degraded_reason)  # "TapPass unreachable: serving from cache"
```

Environment Variable Configuration
```bash
# Configure via env vars (no code change needed)
export TAPPASS_FAIL_MODE=fail_open_cached
export TAPPASS_CACHE_TTL=300
export TAPPASS_MAX_OFFLINE_REQUESTS=100
export TAPPASS_LOCAL_AUDIT_PATH=.tappass_audit_buffer.jsonl
```

The SDK auto-detects TAPPASS_FAIL_MODE and configures resilience accordingly.
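To make the mapping from environment variables to settings concrete, here is a minimal sketch of how such auto-detection could work. It is not the SDK's code: `EnvResilienceSettings` and `settings_from_env` are hypothetical names, and the fallback values are the defaults listed in the Configuration Reference.

```python
import os
from dataclasses import dataclass


@dataclass
class EnvResilienceSettings:
    """Hypothetical container; the SDK's own class is ResiliencePolicy."""
    mode: str = "fail_closed"
    cache_ttl_seconds: int = 300
    max_offline_requests: int = 100
    local_audit_path: str = ".tappass_audit_buffer.jsonl"


def settings_from_env(env=None):
    """Build settings from TAPPASS_* variables, using defaults for unset ones."""
    env = os.environ if env is None else env
    defaults = EnvResilienceSettings()
    return EnvResilienceSettings(
        mode=env.get("TAPPASS_FAIL_MODE", defaults.mode),
        cache_ttl_seconds=int(env.get("TAPPASS_CACHE_TTL", defaults.cache_ttl_seconds)),
        max_offline_requests=int(
            env.get("TAPPASS_MAX_OFFLINE_REQUESTS", defaults.max_offline_requests)
        ),
        local_audit_path=env.get("TAPPASS_LOCAL_AUDIT_PATH", defaults.local_audit_path),
    )
```

Accepting an optional mapping instead of always reading `os.environ` lets the parsing be exercised without mutating the process environment.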
How It Works
Circuit Breaker
The circuit breaker tracks consecutive failures to TapPass:

```
CLOSED ──(3 failures)──▶ OPEN ──(30s timeout)──▶ HALF_OPEN ──(probe succeeds)──▶ CLOSED
                          │                         │
                          │ (use fallback)          │ (one probe request)
                          ▼                         ▼
                    Cached response           Test real request
```

- CLOSED: Normal operation. All requests go to TapPass.
- OPEN: TapPass unreachable. Use fallback (cache or fail-closed). No requests attempted.
- HALF_OPEN: Recovery timeout expired. One probe request sent. If it succeeds → CLOSED. If it fails → OPEN.
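The three states described above can be sketched as a small state machine. This is an illustrative implementation of the general circuit breaker pattern, not the SDK's internals: the class and method names are hypothetical, and the defaults (3 failures, 30 seconds) are taken from the Configuration Reference.

```python
import time


class CircuitBreaker:
    """Hypothetical sketch of a CLOSED / OPEN / HALF_OPEN breaker."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so tests need not sleep
        self.state = "closed"
        self.consecutive_failures = 0
        self.opened_at = None

    def allow_request(self):
        """Return True if a real request may be sent to the server."""
        if self.state == "closed":
            return True
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # let exactly one probe through
                return True
            return False
        # half_open: a probe is already outstanding, so hold further requests.
        return False

    def record_success(self):
        """A request (or the probe) succeeded: close the circuit."""
        self.state = "closed"
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        """A request failed: open on a failed probe or at the threshold."""
        self.consecutive_failures += 1
        if self.state == "half_open" or self.consecutive_failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each call, use the fallback when it returns False, and report the outcome with `record_success()` or `record_failure()`.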
Response Cache
Successful TapPass responses are cached (keyed by model + last user message). When the circuit opens, cached responses are served with degraded=true metadata.
- Cache is in-memory (not persisted across restarts)
- Max 50 entries, LRU eviction
- Configurable TTL (default 5 minutes)
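A cache with these properties (in-memory, bounded, LRU eviction, TTL) can be sketched with an `OrderedDict`. The code below is a hedged illustration rather than the SDK's cache: `ResponseCache` and `make_key` are hypothetical names, and the message shape (dicts with `role` and `content` keys) is an assumption.

```python
import time
from collections import OrderedDict


class ResponseCache:
    """Hypothetical in-memory LRU cache with a per-entry TTL."""

    def __init__(self, max_entries=50, ttl_seconds=300.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, response)

    @staticmethod
    def make_key(model, messages):
        """Key on the model plus the content of the last user message."""
        last_user = next(
            (m["content"] for m in reversed(messages) if m.get("role") == "user"), ""
        )
        return (model, last_user)

    def put(self, key, response):
        self._entries[key] = (self.clock(), response)
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, response = item
        if self.clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: drop and report a miss
            return None
        self._entries.move_to_end(key)
        return response
```

Note that keying only on the model and the last user message means two conversations with different histories but the same final message share a cache entry, which is worth keeping in mind when deciding whether cached fallback is acceptable for a given agent.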
Local Audit Buffer
During degraded mode, audit entries are written to a local JSONL file:

```
.tappass_audit_buffer.jsonl
```

When TapPass recovers, the SDK automatically flushes buffered entries to the server. Entries include _degraded: true and _buffered_at timestamps for forensic analysis.
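The buffer-then-flush behavior can be sketched as two small functions over a JSONL file. This is an assumption-laden illustration, not the SDK's code: the function names and the `send` callback are hypothetical, and storing `_buffered_at` as epoch seconds is a guess at the timestamp format.

```python
import json
import os
import time


def buffer_audit_entry(path, entry):
    """Append one audit entry to the local JSONL buffer, tagged as degraded."""
    record = dict(entry, _degraded=True, _buffered_at=time.time())
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def flush_audit_buffer(path, send):
    """Send every buffered entry via `send`, then delete the buffer file.

    Returns the number of entries flushed. If `send` raises, the file is
    left in place so a later flush can retry.
    """
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    for record in records:
        send(record)
    os.remove(path)
    return len(records)
```

Because the file is only removed after every entry has been sent, a failure partway through means earlier entries are sent again on retry; a receiver in this design would need to tolerate duplicates.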
Resilience Status
```python
# Check current resilience state
status = agent.resilience_status
print(status)
# {
#   "mode": "fail_open_cached",
#   "circuit": {"state": "closed", "consecutive_failures": 0, ...},
#   "cache_size": 12,
#   "buffered_audit_entries": 0,
#   "offline_request_count": 0,
#   "degraded_duration_seconds": null,
# }
```

Configuration Reference
| Parameter | Env Var | Default | Description |
|---|---|---|---|
| mode | TAPPASS_FAIL_MODE | fail_closed | Degradation mode |
| cache_ttl_seconds | TAPPASS_CACHE_TTL | 300 | How long cached responses are valid, in seconds |
| max_offline_requests | TAPPASS_MAX_OFFLINE_REQUESTS | 100 | Hard cap on degraded calls (0 = unlimited) |
| local_audit_path | TAPPASS_LOCAL_AUDIT_PATH | .tappass_audit_buffer.jsonl | Path for local audit buffer |
| circuit_failure_threshold | TAPPASS_CIRCUIT_FAILURE_THRESHOLD | 3 | Consecutive failures before circuit opens |
| circuit_recovery_timeout | TAPPASS_CIRCUIT_RECOVERY_TIMEOUT | 30 | Seconds before half-open probe |
| alert_on_degradation | n/a | true | Log WARNING when entering degraded mode |
Offline Request Cap
The max_offline_requests parameter is a safety valve. If the cap is reached and TapPass is still down, the SDK falls back to fail_closed regardless of the configured mode. This prevents unbounded ungoverned operation.
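The cap amounts to a counter that is checked before each degraded call. The sketch below illustrates that idea under stated assumptions: `OfflineBudget` is a hypothetical name, `0` means unlimited as in the Configuration Reference, and resetting the counter on recovery is an assumption the text does not confirm.

```python
class OfflineBudget:
    """Hypothetical counter enforcing a hard cap on degraded calls."""

    def __init__(self, max_offline_requests=100):
        self.max_offline_requests = max_offline_requests
        self.offline_request_count = 0

    def try_consume(self):
        """Return True if one more degraded call is allowed, else False."""
        unlimited = self.max_offline_requests == 0
        if not unlimited and self.offline_request_count >= self.max_offline_requests:
            return False  # cap reached: the caller should behave as fail_closed
        self.offline_request_count += 1
        return True

    def reset(self):
        """Assumption: the count starts over once the server is reachable again."""
        self.offline_request_count = 0
```

A caller would consult `try_consume()` before serving any fallback and raise the connection error when it returns False, whatever mode is configured.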
Deployment Recommendations
| Environment | Recommended Mode | Cache TTL | Max Offline |
|---|---|---|---|
| Development | fail_open_logged | 60s | 0 (unlimited) |
| Staging | fail_open_cached | 300s | 1000 |
| Production (standard) | fail_open_cached | 300s | 100 |
| Production (regulated) | fail_closed | n/a | n/a |
| Healthcare / Finance | fail_closed | n/a | n/a |
See Also
- Network Architecture Guide: firewall rules, proxy configuration
- Enterprise Setup: production deployment
- Observability: health scores, drift detection