Resilience & Reliability

Beever Atlas depends on seven external services (Weaviate, Neo4j, MongoDB, Gemini, Jina, Tavily, and Redis). A core design principle is that the failure of any single component must degrade service gracefully rather than take the whole system down.

Dependency Health Registry

Each external dependency has a circuit breaker with three states:

CLOSED (healthy): Requests pass through normally

OPEN (failing): Requests blocked, system uses fallback

HALF_OPEN (probing): Probing for recovery after timeout

Circuit Breaker States

[Diagram: transitions between the CLOSED, OPEN, and HALF_OPEN states]
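The three states and their transitions can be sketched as a small state machine. This is a minimal illustration, not the Beever Atlas implementation: the `CircuitBreaker` class, its parameter names, and the injectable clock are hypothetical. The default thresholds (3 consecutive failures, 30s before probing) are borrowed from the LLM fallback settings described later on this page.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, probes after a timeout."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        # OPEN -> HALF_OPEN once the recovery timeout has elapsed.
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.recovery_timeout:
            self.state = State.HALF_OPEN
        # CLOSED and HALF_OPEN let requests through; OPEN blocks them.
        # (This sketch does not limit HALF_OPEN to a single in-flight probe.)
        return self.state is not State.OPEN

    def record_success(self) -> None:
        # Any successful call, including a probe, closes the circuit.
        self.state = State.CLOSED
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe, or too many consecutive failures, (re)opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each dependency call, use the fallback when it returns `False`, and report the outcome with `record_success()` or `record_failure()`.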

Dependency Configuration

| Dependency | Critical | Timeout | Fallback |
|---|---|---|---|
| Weaviate | ✅ Yes | 5s | Serve cached wiki only |
| Neo4j | ❌ No | 5s | Semantic-only queries |
| MongoDB | ✅ Yes | 5s | Read-only from cache |
| Gemini | ✅ Yes | 10s | Claude via LiteLLM |
| Jina | ❌ No | 10s | BM25-only search |
| Tavily | ❌ No | 5s | Internal-only results |
| Redis | ❌ No | 2s | Chat SDK bot offline |

Critical dependencies: System severely degraded but still functional

Non-critical dependencies: Degraded mode with reduced functionality
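One way to make this configuration usable in code is a small registry keyed by dependency name. The sketch below is an assumption about structure only: the `Dependency` dataclass, the `DEPENDENCIES` mapping, and the `overall_status` helper with its `"critical"` label are hypothetical, while the values come from the dependency table.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Dependency:
    name: str
    critical: bool
    timeout_s: float
    fallback: str


# Values copied from the dependency table; the structure itself is hypothetical.
DEPENDENCIES = {
    d.name: d
    for d in [
        Dependency("weaviate", True, 5, "Serve cached wiki only"),
        Dependency("neo4j", False, 5, "Semantic-only queries"),
        Dependency("mongodb", True, 5, "Read-only from cache"),
        Dependency("gemini", True, 10, "Claude via LiteLLM"),
        Dependency("jina", False, 10, "BM25-only search"),
        Dependency("tavily", False, 5, "Internal-only results"),
        Dependency("redis", False, 2, "Chat SDK bot offline"),
    ]
}


def overall_status(down: Iterable[str]) -> str:
    """Classify system health from the names of dependencies currently failing."""
    down = set(down)
    if not down:
        return "healthy"
    if any(DEPENDENCIES[name].critical for name in down):
        return "critical"
    return "degraded"
```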

Degradation Matrix

When components fail, Beever Atlas degrades predictably:

| Component Down | Ingestion Impact | Query Impact | User Experience |
|---|---|---|---|
| Neo4j | Entity extraction skipped; facts stored in Weaviate only | route=graph → reclassified as route=semantic | Wiki People/Decisions show "temporarily unavailable" |
| Gemini | Messages queued in dead letter queue | ADK agents fall back to Claude models | Alert fired; retry on recovery |
| Redis | No impact (batch ingestion unaffected) | No impact (MCP queries unaffected) | Chat SDK bot offline; users see "bot unavailable" |
| Jina | Embeddings queued; facts stored text-only | Existing embeddings work; new facts use BM25-only | Backfill embeddings when Jina recovers |
| Tavily | No impact | Silently drop external sub-queries | User sees "external search unavailable" note |
| Weaviate | Full ingestion paused (queue in MongoDB) | Return cached wiki; graph-only for relational queries | Critical alert: system severely degraded |
| MongoDB | Full system paused | Read-only from Weaviate/Neo4j if cached connections survive | Critical alert: system offline |

LLM Fallback

All LLM calls use Google ADK with LiteLLM integration for model fallback:

Primary → Fallback Chain

| Agent Tier | Primary | Fallback | Last Resort |
|---|---|---|---|
| Fast (routing, extraction) | Gemini 2.0 Flash Lite | Claude Haiku | Regex fast-path |
| Quality (response, wiki) | Gemini 2.0 Flash | Claude Sonnet | Return raw results |

Per-Agent Fallback

| ADK Agent | Primary | Fallback | Last Resort |
|---|---|---|---|
| query_router_agent | Gemini Flash Lite | Claude Haiku | Regex classifier |
| fact_extractor_agent | Gemini Flash Lite | Claude Haiku | Dead letter queue |
| entity_extractor_agent | Gemini Flash Lite | Claude Haiku | Skip (Weaviate-only) |
| response_agent | Gemini Flash | Claude Sonnet | Return raw results |
| consolidation_agent | Gemini Flash Lite | Claude Haiku | Serve stale cache |

Fallback trigger: 3 consecutive failures OR 30s timeout

Recovery: Circuit breaker HALF_OPEN after 30s, probe with one request
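The primary → fallback → last resort chain can be sketched without any LLM SDK. The example below is illustrative only: `call_with_fallback` and the `(name, callable)` chain format are hypothetical, it does not use the ADK or LiteLLM APIs, and `LLMUnavailableError` is defined locally so the snippet runs on its own. It falls back per call on any error or timeout; tracking 3 consecutive failures would be the circuit breaker's job.

```python
import asyncio


class LLMUnavailableError(Exception):
    """Raised when every tier in the chain has failed."""


async def call_with_fallback(prompt, chain, timeout_s=30.0):
    """Try each (name, async callable) in order; return (name, result) from the first success."""
    errors = []
    for name, call in chain:
        try:
            # Enforce the per-call timeout; a slow tier counts as a failure.
            result = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return name, result
        except Exception as exc:  # includes asyncio.TimeoutError
            errors.append(f"{name}: {exc!r}")
    raise LLMUnavailableError("; ".join(errors))
```

The last entry in the chain can be a non-LLM callable, such as a regex classifier for routing or a function that returns the raw retrieved results.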

Ingestion Pipeline Resilience

Each pipeline stage is independently skippable:

Stage-Level Skips

async def ingest_message(msg: NormalizedMessage):
    # Stage 1: Preprocess (required)
    preprocessed = await preprocessor.process(msg)
    
    # Stage 2a: Extract facts (required — queue to DLQ on failure)
    try:
        facts = await extractor.extract(preprocessed)
    except LLMUnavailableError:
        await dead_letter_queue.enqueue(msg)
        return
    
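    # Stage 2b: Entity extraction (optional; queue for backfill if Neo4j/LLM down)
    entities = []
    if await health.check("neo4j") and await health.check("gemini"):
        try:
            entities = await entity_extractor.extract(preprocessed, facts)
        except Exception as e:
            logger.warning(f"Entity extraction failed, continuing: {e}")
            await backfill_queue.enqueue("entities", msg.id, preprocessed)
    else:
        # Dependency unavailable: queue so entities are backfilled on recovery
        await backfill_queue.enqueue("entities", msg.id, preprocessed)

    # Stage 3: Embed (optional; queue for backfill if Jina is down or the call fails)
    embeddings = None
    if await health.check("jina"):
        try:
            embeddings = await embedder.embed(facts)
        except Exception as e:
            logger.warning(f"Embedding failed, continuing: {e}")
            await backfill_queue.enqueue("embeddings", msg.id, facts)
    else:
        await backfill_queue.enqueue("embeddings", msg.id, facts)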
    
    # Stage 4: Persist via outbox pattern
    await persister.persist(facts, entities, embeddings)

Backfill Queues

Failed optional stages are queued for backfill:

Entities not extracted: Queued when Neo4j or LLM unavailable

  • Backfilled when both dependencies recover
  • Processed in order by timestamp

Embeddings not generated: Queued when Jina unavailable

  • Backfilled when Jina recovers
  • Processed in batches of 100

Wiki not updated: Rebuild on next scheduled run

  • No data loss (facts already stored)
  • Wiki regenerates from complete memory
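The ordering and batching rules above can be combined in one helper. This is a sketch under assumptions: `backfill_batches` is a hypothetical name, and each queued item is assumed to be a dict with a sortable `timestamp` field.

```python
from typing import Dict, Iterator, List


def backfill_batches(items: List[Dict], batch_size: int = 100) -> Iterator[List[Dict]]:
    """Yield queued items oldest-first in fixed-size batches."""
    ordered = sorted(items, key=lambda item: item["timestamp"])
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```

Bounding the batch size keeps a large backlog from flooding a dependency that has only just recovered.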

Write Safety — Outbox Pattern

Cross-store writes use the outbox pattern for safety:

Two-Phase Persist

Phase 1: Write Intent

# Atomic write to MongoDB
intent = WriteIntent(
    id=deterministic_uuid(facts),
    facts=facts,
    entities=entities,
    embeddings=embeddings,
    status={
        "weaviate": "pending",
        "neo4j": "pending" if entities else "skipped",
        "state": "pending"
    },
    retry_count=0,
    created_at=now()  # the reconciler filters on this field (assumes WriteIntent defines it)
)
await mongo.write_intents.insert_one(intent.dict())
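The intent ID has to be reproducible so that a retried write targets the same record. The page does not show how `deterministic_uuid` is implemented; the sketch below is one plausible approach, assuming facts are JSON-serializable dicts. The namespace value is hypothetical.

```python
import hashlib
import json
import uuid
from typing import Dict, List

# Hypothetical namespace; any fixed UUID works as long as it never changes.
INTENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "write-intents.example")


def deterministic_uuid(facts: List[Dict]) -> str:
    """Derive a stable UUID from fact content so retried writes reuse the same ID."""
    # Canonical JSON: sorted keys and fixed separators make the encoding order-independent.
    canonical = json.dumps(facts, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return str(uuid.uuid5(INTENT_NAMESPACE, digest))
```

Because the ID depends only on content, ingesting the same message twice produces the same intent ID, which is what makes the downstream upserts idempotent.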

Phase 2: Fan Out

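async def fan_out(intent: WriteIntent):
    # "failed" must be retryable, otherwise the reconciler can never repair it
    retryable = ("pending", "failed")

    # Weaviate: idempotent via deterministic UUID
    if intent.status["weaviate"] in retryable:
        try:
            await weaviate.upsert(intent.facts, intent.embeddings)
            intent.status["weaviate"] = "done"
        except Exception:
            intent.status["weaviate"] = "failed"
        await mark(intent.id, "weaviate", intent.status["weaviate"])

    # Neo4j: idempotent via MERGE semantics
    if intent.status["neo4j"] in retryable:
        try:
            for entity in intent.entities:
                await neo4j.upsert_entity(entity)
            intent.status["neo4j"] = "done"
        except Exception:
            intent.status["neo4j"] = "failed"
        await mark(intent.id, "neo4j", intent.status["neo4j"])

    # MongoDB sync state: final step, only once no store write is outstanding
    if "failed" not in (intent.status["weaviate"], intent.status["neo4j"]):
        await update_sync_state(intent)
        await mark(intent.id, "state", "done")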

Background Write Reconciler

Runs every 15 minutes to retry incomplete writes:

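from datetime import timedelta

async def reconcile():
    # Select intents older than 5 minutes with an incomplete store write,
    # still under the retry limit and inside the 24-hour abandonment window
    stale = await mongo.write_intents.find({
        "$or": [
            {"status.weaviate": {"$in": ["pending", "failed"]}},
            {"status.neo4j": {"$in": ["pending", "failed"]}}
        ],
        "created_at": {
            "$lt": now() - timedelta(minutes=5),
            "$gte": now() - timedelta(hours=24)
        },
        "retry_count": {"$lt": 5}
    }).to_list(length=None)

    for intent in stale:
        await fan_out(WriteIntent(**intent))
        await mongo.write_intents.update_one(
            {"id": intent["id"]},
            {"$inc": {"retry_count": 1}}
        )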

Max retries: 5 attempts before permanent failure

Timeout: Abandoned after 24 hours
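The retry limits can also be expressed as a pure predicate, which is easier to unit test than a database query. This is a sketch: `should_retry` and its parameter names are hypothetical, and the intent is assumed to be a dict shaped like the write-intent document.

```python
from datetime import datetime, timedelta, timezone


def should_retry(intent: dict, now: datetime,
                 min_age: timedelta = timedelta(minutes=5),
                 max_age: timedelta = timedelta(hours=24),
                 max_retries: int = 5) -> bool:
    """Return True if the reconciler should retry this write intent."""
    # At least one store write is still pending or failed.
    incomplete = any(
        intent["status"].get(store) in ("pending", "failed")
        for store in ("weaviate", "neo4j")
    )
    # Old enough that the original fan-out has finished, but not yet abandoned.
    age = now - intent["created_at"]
    return incomplete and min_age < age <= max_age and intent["retry_count"] < max_retries
```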

Query Resilience

Query Router Fallbacks

Graph timeout: Fall back to semantic-only

try:
    graph_results = await graph_agent.traverse(entities)
except TimeoutError:
    logger.warning("Graph traversal timed out, using semantic-only")
    graph_results = []

Both systems down: Return cached wiki

if not await health.check("weaviate") and not await health.check("neo4j"):
    cached = await wiki_cache.get(channel_id)
    if cached:
        return cached
    return {"error": "System temporarily unavailable"}

External search failure: Return internal-only results

try:
    external = await tavily.search(query)
except Exception:
    logger.warning("External search failed, using internal-only")
    external = None

Graceful Degradation

| Scenario | Behavior |
|---|---|
| Neo4j timeout | Use semantic-only results (no graph context) |
| Weaviate timeout | Use graph-only results (no factual context) |
| Both timeout | Return cached wiki if available |
| LLM timeout | Return raw retrieved results |
| External search timeout | Return internal-only results with note |
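The scenarios above amount to a small decision function over dependency health. The sketch below is illustrative: `plan_query`, its flags, and the shape of the returned dict are hypothetical names, not part of the query router.

```python
def plan_query(weaviate_ok: bool, neo4j_ok: bool, llm_ok: bool = True,
               external_ok: bool = True) -> dict:
    """Choose retrieval sources and degradation notes from dependency health."""
    notes = []
    if weaviate_ok and neo4j_ok:
        sources = ["semantic", "graph"]
    elif weaviate_ok:
        sources = ["semantic"]
        notes.append("no graph context")
    elif neo4j_ok:
        sources = ["graph"]
        notes.append("no factual context")
    else:
        # Neither store is reachable: the only option left is the cached wiki.
        sources = ["cached_wiki"]
    if not external_ok:
        notes.append("external search unavailable")
    # When the LLM is down, callers return raw retrieved results unsynthesized.
    return {"sources": sources, "synthesize": llm_ok, "notes": notes}
```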

Error Recovery Strategies

Automatic Recovery

Circuit breaker: Automatically closes after successful probe request

Backfill processing: Automatic on dependency recovery

Retry queues: Processed in order with exponential backoff

Dead letter queue: Manual inspection required after 5 failures
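Exponential backoff for the retry queues can be sketched as below. The page does not state a base delay, a cap, or whether jitter is used, so `base`, `cap`, and the optional full jitter are assumptions, as is the `backoff_delay` name.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: bool = False) -> float:
    """Delay in seconds before retry number `attempt` (0-based), doubling each time up to `cap`."""
    delay = min(cap, base * 2 ** attempt)
    if jitter:
        # Full jitter: spread retries uniformly over [0, delay] to avoid synchronized bursts.
        return random.uniform(0, delay)
    return delay
```

With the defaults, the first five retries wait 1, 2, 4, 8, and 16 seconds; after the fifth failure the item would move to the dead letter queue.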

Manual Recovery

Admin API endpoints:

# Retry failed writes
POST /api/admin/reconcile

# Rebuild wiki for channel
POST /api/admin/wiki/rebuild
{ "channel_id": "C123456" }

# Refresh ACL for user
POST /api/admin/acl/refresh
{ "user_id": "U123456" }

# Trigger backfill
POST /api/admin/backfill/:type
# types: entities, embeddings

Monitoring

Health check endpoint:

GET /api/health
{
  "status": "degraded",
  "dependencies": {
    "weaviate": "healthy",
    "neo4j": "failing",
    "mongodb": "healthy",
    "gemini": "healthy",
    "jina": "degraded",
    "tavily": "healthy"
  },
  "uptime": 99.5,
  "degraded_since": "2026-04-13T12:00:00Z"
}

Alerts:

  • Critical dependencies down → PagerDuty alert
  • Non-critical dependencies down → Slack notification
  • High failure rate → Warning alert
  • Circuit breaker opens → Info log
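The routing rule for dependency alerts can be applied directly to the `dependencies` object of the health response. This is a sketch with assumptions: `route_alerts` is a hypothetical helper, only the `"failing"` status is treated as "down", and the critical set mirrors the dependency configuration table.

```python
from typing import Dict, List, Tuple

# Critical dependencies per the dependency configuration table.
CRITICAL = {"weaviate", "mongodb", "gemini"}


def route_alerts(dependencies: Dict[str, str]) -> List[Tuple[str, str]]:
    """Map each failing dependency to an alert channel and message."""
    alerts = []
    for name, status in sorted(dependencies.items()):
        if status != "failing":  # assumption: "degraded" alone does not alert
            continue
        channel = "pagerduty" if name in CRITICAL else "slack"
        alerts.append((channel, f"{name} is {status}"))
    return alerts
```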

Operational Impact

User Experience During Failures

Neo4j down:

  • ✅ Factual queries work
  • ⚠️ Relational queries are reclassified as semantic (no graph context)
  • ⚠️ Wiki entity pages show "temporarily unavailable"

Gemini down:

  • ✅ Queries work (fallback to Claude)
  • ⚠️ Ingestion pauses (messages queued)
  • ✅ Wiki serves cached content

Weaviate down:

  • ❌ Ingestion paused
  • ⚠️ Queries return cached wiki only
  • ❌ No new content until recovery

MongoDB down:

  • ❌ Full system offline
  • ❌ No ingestion or queries
  • 🚨 Critical alert

Recovery Time Objectives

| Component | RTO (Recovery Time) | RPO (Data Loss) |
|---|---|---|
| Neo4j | < 5 min | 0 min (queued) |
| Gemini | < 1 min (auto fallback) | 0 min (queued) |
| Jina | < 15 min (backfill) | 0 min (queued) |
| Weaviate | < 15 min | < 5 min (queued) |
| MongoDB | < 15 min | < 5 min (queued) |

Best Practices

For Operators

Monitor circuit breakers: Set up alerts for OPEN state

Check reconciliation: Review reconciler logs for failed writes

Test failover: Regularly test dependency failure scenarios

Plan capacity: Ensure backfill queues don't grow unbounded

Document runbooks: Step-by-step recovery procedures

For Users

Expect degraded service: Some features unavailable during outages

Check status page: Real-time system status at /status

Use cached wiki: Wiki content available even during outages

Report issues: Slack channel for system status updates
