SDK Resilience: Fail-Open, Circuit Breaker, Offline Audit

TapPass sits in the hot path of every AI agent call. The resilience layer ensures TapPass never becomes a single point of failure.

The Problem

Without a resilience layer, a TapPass outage stops every AI agent in your organization. The resilience layer lets you choose how to trade availability against governance while TapPass is unreachable.

Three Degradation Modes

Mode	Behavior	When to use
`fail_closed`	Agent stops. `TapPassConnectionError` raised.	Regulated environments (healthcare, finance). Safety > availability.
`fail_open_cached`	Agent continues with last-known-good policy. Audit entries queued locally and flushed on recovery.	Recommended for most enterprises. Best balance of safety and availability.
`fail_open_logged`	Agent continues ungoverned. Every call logged locally as `DEGRADED`.	Development, testing, non-critical agents.

Quick Start

from tappass import Agent
from tappass.resilience import ResiliencePolicy, FailMode

agent = Agent(
    "http://tappass:9620",
    resilience=ResiliencePolicy(
        mode=FailMode.FAIL_OPEN_CACHED,
        cache_ttl_seconds=300,        # 5 min cache
        max_offline_requests=100,      # hard cap
    ),
)

# If TapPass is down:
response = agent.chat("Analyze Q3 revenue")
print(response.pipeline.degraded)          # True
print(response.pipeline.degraded_reason)   # "TapPass unreachable: serving from cache"

Environment Variable Configuration

# Configure via env vars (no code change needed)
export TAPPASS_FAIL_MODE=fail_open_cached
export TAPPASS_CACHE_TTL=300
export TAPPASS_MAX_OFFLINE_REQUESTS=100
export TAPPASS_LOCAL_AUDIT_PATH=.tappass_audit_buffer.jsonl

The SDK auto-detects TAPPASS_FAIL_MODE and configures resilience accordingly.
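
As a rough illustration of what that auto-detection could look like, the sketch below reads the four variables above and falls back to the documented defaults. `PolicySketch` and `policy_from_env` are hypothetical names used only for this example, not the SDK's classes, and rejecting an unknown mode is an assumption about how invalid input might be handled.

```python
import os
from dataclasses import dataclass

VALID_MODES = {"fail_closed", "fail_open_cached", "fail_open_logged"}


@dataclass
class PolicySketch:
    # Hypothetical stand-in for ResiliencePolicy. Field names and defaults
    # follow the Configuration Reference; the class itself is illustrative.
    mode: str = "fail_closed"
    cache_ttl_seconds: int = 300
    max_offline_requests: int = 100
    local_audit_path: str = ".tappass_audit_buffer.jsonl"


def policy_from_env(env=None):
    """Build a policy from TAPPASS_* variables, using defaults when unset."""
    env = os.environ if env is None else env
    defaults = PolicySketch()
    mode = env.get("TAPPASS_FAIL_MODE", defaults.mode).strip().lower()
    if mode not in VALID_MODES:
        # Assumption: an unrecognized mode is treated as a configuration error.
        raise ValueError(f"unknown TAPPASS_FAIL_MODE: {mode!r}")
    return PolicySketch(
        mode=mode,
        cache_ttl_seconds=int(env.get("TAPPASS_CACHE_TTL", defaults.cache_ttl_seconds)),
        max_offline_requests=int(
            env.get("TAPPASS_MAX_OFFLINE_REQUESTS", defaults.max_offline_requests)
        ),
        local_audit_path=env.get("TAPPASS_LOCAL_AUDIT_PATH", defaults.local_audit_path),
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to exercise in a unit test without touching the process environment.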

How It Works

Circuit Breaker

The circuit breaker tracks consecutive failures to TapPass:

CLOSED ──(3 failures)──▶ OPEN ──(30s timeout)──▶ HALF_OPEN ──(probe succeeds)──▶ CLOSED
                           │                         │
                           │ (use fallback)           │ (one probe request)
                           ▼                         ▼
                     Cached response          Test real request

CLOSED: Normal operation. All requests go to TapPass.
OPEN: TapPass unreachable. Use fallback (cache or fail-closed). No requests attempted.
HALF_OPEN: Recovery timeout expired. One probe request sent. If it succeeds → CLOSED. If it fails → OPEN.
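
The three states above can be sketched as a small state machine. This is a minimal, single-threaded illustration under the documented defaults (3 consecutive failures, 30 second recovery timeout); `CircuitBreakerSketch` and its method names are hypothetical and do not describe the SDK's internals.

```python
import time


class CircuitBreakerSketch:
    """Illustrative CLOSED -> OPEN -> HALF_OPEN -> CLOSED state machine."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.consecutive_failures = 0
        self._opened_at = None

    def allow_request(self):
        """Return True if a real request to the server may be attempted."""
        if self.state == "closed":
            return True
        if self.state == "open":
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = "half_open"  # let exactly one probe through
                return True
            return False
        return False  # half_open: a probe is already in flight

    def record_success(self):
        """Any success (normal call or probe) closes the circuit."""
        self.state = "closed"
        self.consecutive_failures = 0
        self._opened_at = None

    def record_failure(self):
        """Open after the threshold is reached, or at once if the probe fails."""
        self.consecutive_failures += 1
        if self.state == "half_open" or self.consecutive_failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before each call, use the fallback when it returns False, and report the outcome of every attempted call through `record_success()` or `record_failure()`. A production version would also need locking if agents issue calls from multiple threads.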

Response Cache

Successful TapPass responses are cached (keyed by model + last user message). When the circuit opens, cached responses are served with degraded=true metadata.

Cache is in-memory (not persisted across restarts)
Max 50 entries, LRU eviction
Configurable TTL (default 5 minutes)
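
A cache with these properties can be sketched with an `OrderedDict`. The class below is a hypothetical illustration, not the SDK's implementation, and it assumes messages are dicts with `role` and `content` keys, which this document does not specify.

```python
import time
from collections import OrderedDict


class ResponseCacheSketch:
    """In-memory LRU cache with per-entry TTL (illustrative only)."""

    def __init__(self, max_entries=50, ttl_seconds=300.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, response)

    @staticmethod
    def make_key(model, messages):
        """Key on the model plus the content of the last user message."""
        last_user = next(
            (m["content"] for m in reversed(messages) if m.get("role") == "user"),
            "",
        )
        return (model, last_user)

    def put(self, key, response):
        self._entries[key] = (self._clock(), response)
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, response = item
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)
        return response
```

Because the key ignores earlier turns, two conversations that end in the same user message map to the same entry. That keeps the hit rate up during an outage, at the cost of serving a response that may not match the full conversation context.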

Local Audit Buffer

During degraded mode, audit entries are written to a local JSONL file:

.tappass_audit_buffer.jsonl

When TapPass recovers, the SDK automatically flushes buffered entries to the server. Entries include _degraded: true and _buffered_at timestamps for forensic analysis.
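
The buffer-and-flush cycle could look roughly like the sketch below. The function names and the `send` callback are hypothetical, and storing `_buffered_at` as Unix epoch seconds is an assumption; the actual timestamp format is not stated here.

```python
import json
import os
import time


def buffer_audit_entry(path, entry):
    """Append one audit entry to the local JSONL buffer, tagged as degraded."""
    # Assumption: _buffered_at is stored as Unix epoch seconds.
    record = dict(entry, _degraded=True, _buffered_at=time.time())
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def flush_audit_buffer(path, send):
    """Send buffered entries in order via `send`; keep any that were not sent.

    Returns the number of entries delivered.
    """
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    sent = 0
    for record in records:
        try:
            send(record)
        except Exception:
            break  # server went away again; stop and keep the rest
        sent += 1
    remaining = records[sent:]
    if remaining:
        with open(path, "w", encoding="utf-8") as f:
            for record in remaining:
                f.write(json.dumps(record) + "\n")
    else:
        os.remove(path)
    return sent
```

Stopping at the first failed send preserves entry order, so a partial flush followed by a later retry still delivers the audit trail in sequence.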

Resilience Status

# Check current resilience state
status = agent.resilience_status
print(status)
# {
#   "mode": "fail_open_cached",
#   "circuit": {"state": "closed", "consecutive_failures": 0, ...},
#   "cache_size": 12,
#   "buffered_audit_entries": 0,
#   "offline_request_count": 0,
#   "degraded_duration_seconds": null,
# }

Configuration Reference

Parameter	Env Var	Default	Description
`mode`	`TAPPASS_FAIL_MODE`	`fail_closed`	Degradation mode
`cache_ttl_seconds`	`TAPPASS_CACHE_TTL`	`300`	How long cached responses are valid
`max_offline_requests`	`TAPPASS_MAX_OFFLINE_REQUESTS`	`100`	Hard cap on degraded calls (0 = unlimited)
`local_audit_path`	`TAPPASS_LOCAL_AUDIT_PATH`	`.tappass_audit_buffer.jsonl`	Path for local audit buffer
`circuit_failure_threshold`	`TAPPASS_CIRCUIT_FAILURE_THRESHOLD`	`3`	Consecutive failures before circuit opens
`circuit_recovery_timeout`	`TAPPASS_CIRCUIT_RECOVERY_TIMEOUT`	`30`	Seconds before half-open probe
`alert_on_degradation`	(none)	`true`	Log WARNING when entering degraded mode

Offline Request Cap

The max_offline_requests parameter is a safety valve. If the cap is reached and TapPass is still down, the SDK falls back to fail_closed regardless of the configured mode. This prevents unbounded ungoverned operation. Setting the cap to 0 disables it, so use 0 only where unlimited degraded operation is acceptable.
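
The cap logic can be expressed as a small decision function. `effective_mode` is a hypothetical helper for illustration, and it assumes the cap counts calls already served in degraded mode, so with a cap of 100 the 101st degraded call is refused.

```python
def effective_mode(configured_mode, offline_request_count, max_offline_requests):
    """Return the mode to apply to the next call made while degraded."""
    if configured_mode == "fail_closed":
        return "fail_closed"
    # 0 means unlimited, per the Configuration Reference.
    if max_offline_requests > 0 and offline_request_count >= max_offline_requests:
        return "fail_closed"  # cap reached: stop serving degraded calls
    return configured_mode
```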

Deployment Recommendations

Environment	Recommended Mode	Cache TTL	Max Offline
Development	`fail_open_logged`	60s	0 (unlimited)
Staging	`fail_open_cached`	300s	1000
Production (standard)	`fail_open_cached`	300s	100
Production (regulated)	`fail_closed`	n/a	n/a
Healthcare / Finance	`fail_closed`	n/a	n/a
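
One way to keep these recommendations in a single place is a preset table keyed by environment name. The keys and the `preset_for` helper are hypothetical; the values are copied from the table above, and the resulting dict would be passed as keyword arguments to `ResiliencePolicy` after converting `mode` to the matching `FailMode` member.

```python
# Illustrative presets mirroring the Deployment Recommendations table.
PRESETS = {
    "development": {
        "mode": "fail_open_logged",
        "cache_ttl_seconds": 60,
        "max_offline_requests": 0,  # 0 = unlimited
    },
    "staging": {
        "mode": "fail_open_cached",
        "cache_ttl_seconds": 300,
        "max_offline_requests": 1000,
    },
    "production": {
        "mode": "fail_open_cached",
        "cache_ttl_seconds": 300,
        "max_offline_requests": 100,
    },
    # Regulated, healthcare, and finance deployments: no cache or offline cap applies.
    "regulated": {"mode": "fail_closed"},
}


def preset_for(environment):
    """Look up a preset; unknown names fall back to the strictest option."""
    return dict(PRESETS.get(environment.lower(), PRESETS["regulated"]))
```

Defaulting unknown environment names to `fail_closed` is a deliberate choice in this sketch: a typo in an environment name then fails safe rather than silently allowing degraded operation.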