
SDK Resilience: Fail-Open, Circuit Breaker, Offline Audit

TapPass sits in the hot path of every AI agent call. The resilience layer ensures TapPass never becomes a single point of failure.

Without resilience, if TapPass goes down, every AI agent in your organization stops. The resilience layer gives you control over this trade-off.

| Mode | Behavior | When to use |
| --- | --- | --- |
| fail_closed | Agent stops. TapPassConnectionError raised. | Regulated environments (healthcare, finance). Safety > availability. |
| fail_open_cached | Agent continues with last-known-good policy. Audit entries queued locally and flushed on recovery. | Recommended for most enterprises. Best balance of safety and availability. |
| fail_open_logged | Agent continues ungoverned. Every call logged locally as DEGRADED. | Development, testing, non-critical agents. |

from tappass import Agent
from tappass.resilience import ResiliencePolicy, FailMode

agent = Agent(
    "http://tappass:9620",
    resilience=ResiliencePolicy(
        mode=FailMode.FAIL_OPEN_CACHED,
        cache_ttl_seconds=300,     # 5 min cache
        max_offline_requests=100,  # hard cap
    ),
)

# If TapPass is down:
response = agent.chat("Analyze Q3 revenue")
print(response.pipeline.degraded)         # True
print(response.pipeline.degraded_reason)  # "TapPass unreachable: serving from cache"

# Configure via env vars (no code change needed)
export TAPPASS_FAIL_MODE=fail_open_cached
export TAPPASS_CACHE_TTL=300
export TAPPASS_MAX_OFFLINE_REQUESTS=100
export TAPPASS_LOCAL_AUDIT_PATH=.tappass_audit_buffer.jsonl

The SDK auto-detects TAPPASS_FAIL_MODE and configures resilience accordingly.
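
To make the environment-variable mapping concrete, here is a minimal standalone sketch of how such settings could be read. It does not use the TapPass SDK: `ResilienceSettings` and `settings_from_env` are hypothetical names, and rejecting an unknown mode with `ValueError` is an assumption. Only the variable names and defaults come from this page.

```python
import os
from dataclasses import dataclass

VALID_MODES = {"fail_closed", "fail_open_cached", "fail_open_logged"}


@dataclass(frozen=True)
class ResilienceSettings:
    """Hypothetical container for illustration; not the SDK's ResiliencePolicy."""
    mode: str = "fail_closed"
    cache_ttl_seconds: int = 300
    max_offline_requests: int = 100
    local_audit_path: str = ".tappass_audit_buffer.jsonl"


def settings_from_env(env=None) -> ResilienceSettings:
    """Build settings from TAPPASS_* variables, using the documented defaults."""
    env = os.environ if env is None else env
    mode = env.get("TAPPASS_FAIL_MODE", "fail_closed").strip().lower()
    if mode not in VALID_MODES:
        # Assumption: an unrecognized mode is treated as a configuration error.
        raise ValueError(f"unknown TAPPASS_FAIL_MODE: {mode!r}")
    return ResilienceSettings(
        mode=mode,
        cache_ttl_seconds=int(env.get("TAPPASS_CACHE_TTL", "300")),
        max_offline_requests=int(env.get("TAPPASS_MAX_OFFLINE_REQUESTS", "100")),
        local_audit_path=env.get("TAPPASS_LOCAL_AUDIT_PATH", ".tappass_audit_buffer.jsonl"),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to exercise in tests.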

The circuit breaker tracks consecutive failures to TapPass:

CLOSED ──(3 failures)──▶ OPEN ──(30s timeout)──▶ HALF_OPEN ──(probe succeeds)──▶ CLOSED
                          │                          │
                          │ (use fallback)           │ (one probe request)
                          ▼                          ▼
                   Cached response            Test real request

  • CLOSED: Normal operation. All requests go to TapPass.
  • OPEN: TapPass unreachable. Use fallback (cache or fail-closed). No requests attempted.
  • HALF_OPEN: Recovery timeout expired. One probe request sent. If it succeeds → CLOSED. If it fails → OPEN.
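
The three states above can be sketched as a small state machine. This is an illustrative standalone class, not the SDK's internal implementation: the class name, method names, and the injectable `clock` parameter are assumptions made for the example. The defaults (3 failures, 30 seconds) match the values documented on this page.

```python
import time


class CircuitBreaker:
    """Illustrative three-state breaker; not the SDK's internal class."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self.state = "closed"
        self.consecutive_failures = 0
        self._opened_at = None

    def allow_request(self) -> bool:
        """Return True if a real request to the server should be attempted."""
        if self.state == "closed":
            return True
        if self.state == "open":
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = "half_open"  # let exactly one probe through
                return True
            return False
        return False  # half_open: a probe is already in flight

    def record_success(self) -> None:
        """A success (including a successful probe) closes the circuit."""
        self.state = "closed"
        self.consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Open the circuit on a failed probe or once the threshold is reached."""
        self.consecutive_failures += 1
        if self.state == "half_open" or self.consecutive_failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before each call, use the fallback when it returns `False`, and report the outcome with `record_success()` or `record_failure()`.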

Successful TapPass responses are cached (keyed by model + last user message). When the circuit opens, cached responses are served with degraded=true metadata.

  • Cache is in-memory (not persisted across restarts)
  • Max 50 entries, LRU eviction
  • Configurable TTL (default 5 minutes)
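
A cache with these properties can be sketched with `collections.OrderedDict`. The code below is an illustration, not the SDK's cache: `ResponseCache` and `make_key` are hypothetical names, and the message format (a list of dicts with `role` and `content`) is an assumption.

```python
import time
from collections import OrderedDict


class ResponseCache:
    """Illustrative in-memory LRU cache with a TTL; not the SDK's implementation."""

    def __init__(self, max_entries=50, ttl_seconds=300.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, response)

    @staticmethod
    def make_key(model, messages):
        """Key on the model plus the content of the last user message."""
        last_user = next(
            (m["content"] for m in reversed(messages) if m.get("role") == "user"), ""
        )
        return (model, last_user)

    def put(self, key, response):
        self._entries[key] = (self._clock(), response)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, response = item
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return response
```

Because the key ignores earlier turns of the conversation, two different conversations that end with the same user message share one cache entry. This follows from the keying scheme described above.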

During degraded mode, audit entries are written to a local JSONL file:

.tappass_audit_buffer.jsonl

When TapPass recovers, the SDK automatically flushes buffered entries to the server. Entries include _degraded: true and _buffered_at timestamps for forensic analysis.
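
The buffer-and-flush cycle can be sketched as two small functions. This is a simplified illustration rather than the SDK's code: the function names and the `send` callback are hypothetical, and storing `_buffered_at` as epoch seconds is an assumption (this page does not specify the timestamp format).

```python
import json
import os
import time


def buffer_audit_entry(path, entry):
    """Append one audit entry to the local JSONL buffer, tagged as degraded."""
    record = dict(entry, _degraded=True, _buffered_at=time.time())
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def flush_audit_buffer(path, send):
    """Send every buffered entry via `send`; delete the file only if all succeed.

    Returns the number of entries flushed.
    """
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    for record in records:
        send(record)  # an exception here leaves the buffer file in place
    os.remove(path)
    return len(records)
```

In this sketch a failure partway through a flush keeps the whole file, so entries already sent would be sent again on the next attempt. A receiver would need to tolerate duplicates under that design.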

# Check current resilience state
status = agent.resilience_status
print(status)
# {
# "mode": "fail_open_cached",
# "circuit": {"state": "closed", "consecutive_failures": 0, ...},
# "cache_size": 12,
# "buffered_audit_entries": 0,
# "offline_request_count": 0,
# "degraded_duration_seconds": null,
# }
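
One way to use this status in monitoring is to turn it into a list of warnings. The helper below is a hypothetical sketch that assumes the status behaves like a dict with the keys shown above. The function name and the 80% warning ratio are illustrative choices, not part of the SDK.

```python
def degradation_alerts(status, max_offline_requests=100, warn_ratio=0.8):
    """Return human-readable warnings for a resilience status dict."""
    alerts = []
    state = status.get("circuit", {}).get("state")
    if state and state != "closed":
        alerts.append(f"circuit is {state}")
    buffered = status.get("buffered_audit_entries", 0)
    if buffered:
        alerts.append(f"{buffered} audit entries waiting to be flushed")
    used = status.get("offline_request_count", 0)
    # A cap of 0 means unlimited, so there is no budget to warn about.
    if max_offline_requests > 0 and used >= warn_ratio * max_offline_requests:
        alerts.append(f"offline budget nearly spent: {used}/{max_offline_requests}")
    return alerts
```
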
| Parameter | Env Var | Default | Description |
| --- | --- | --- | --- |
| mode | TAPPASS_FAIL_MODE | fail_closed | Degradation mode |
| cache_ttl_seconds | TAPPASS_CACHE_TTL | 300 | How long cached responses are valid (seconds) |
| max_offline_requests | TAPPASS_MAX_OFFLINE_REQUESTS | 100 | Hard cap on degraded calls (0 = unlimited) |
| local_audit_path | TAPPASS_LOCAL_AUDIT_PATH | .tappass_audit_buffer.jsonl | Path for local audit buffer |
| circuit_failure_threshold | TAPPASS_CIRCUIT_FAILURE_THRESHOLD | 3 | Consecutive failures before circuit opens |
| circuit_recovery_timeout | TAPPASS_CIRCUIT_RECOVERY_TIMEOUT | 30 | Seconds before half-open probe |
| alert_on_degradation | - | true | Log WARNING when entering degraded mode |

The max_offline_requests parameter is a safety valve. If the cap is reached and TapPass is still down, the SDK falls back to fail_closed regardless of the configured mode. This prevents unbounded ungoverned operation.
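
The safety valve amounts to a small decision rule. The function below is a hypothetical sketch of that rule, not SDK code. Treating the cap as reached when the count equals the limit is an assumption, since this page does not say whether the comparison is inclusive.

```python
def effective_mode(configured_mode, offline_request_count, max_offline_requests):
    """Return the mode to apply to the next degraded call."""
    if configured_mode == "fail_closed":
        return "fail_closed"
    # A cap of 0 means unlimited, so the valve never trips.
    if max_offline_requests > 0 and offline_request_count >= max_offline_requests:
        return "fail_closed"  # cap reached: stop serving degraded calls
    return configured_mode
```

With the default cap of 100, the first 100 degraded calls run in the configured mode and later calls fail closed until TapPass recovers.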

| Environment | Recommended Mode | Cache TTL | Max Offline |
| --- | --- | --- | --- |
| Development | fail_open_logged | 60s | 0 (unlimited) |
| Staging | fail_open_cached | 300s | 1000 |
| Production (standard) | fail_open_cached | 300s | 100 |
| Production (regulated) | fail_closed | n/a | n/a |
| Healthcare / Finance | fail_closed | n/a | n/a |