
Resilience & Reliability

Beever Atlas depends on six external services (Weaviate, Neo4j, MongoDB, Gemini, Jina, Tavily), plus Redis for the Chat SDK bot. A core design principle is that the failure of any single component must degrade the system gracefully rather than take it down entirely.

Dependency Health Registry

Each external dependency has a circuit breaker with three states:

CLOSED (healthy): Requests pass through normally

OPEN (failing): Requests blocked, system uses fallback

HALF_OPEN (probing): After the recovery timeout, a probe request tests whether the dependency has recovered
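The three states can be sketched as a small state machine. This is a minimal illustration, not the Beever Atlas implementation: the class name, method names, and the injectable `clock` are hypothetical, and the defaults (3 failures, 30 seconds) are illustrative parameters.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # healthy: requests pass through
    OPEN = "open"            # failing: requests blocked, fallback used
    HALF_OPEN = "half_open"  # probing: one request tests recovery


class CircuitBreaker:
    """Hypothetical breaker; thresholds are illustrative, not Atlas defaults."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.CLOSED:
            return True
        timed_out = self.clock() - self.opened_at >= self.recovery_timeout
        if self.state is State.OPEN and timed_out:
            self.state = State.HALF_OPEN
            return True  # this caller carries the single probe request
        return False  # still OPEN, or a probe is already in flight

    def record_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

Callers would check `allow_request()` before each call to the dependency, then report the outcome with `record_success()` or `record_failure()`; when the check returns `False`, they go straight to the fallback.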

[Diagram: Circuit Breaker States]

Dependency Configuration

| Dependency | Critical | Timeout | Fallback |
| --- | --- | --- | --- |
| Weaviate | ✅ Yes | 5s | Serve cached wiki only |
| Neo4j | ❌ No | 5s | Semantic-only queries |
| MongoDB | ✅ Yes | 5s | Read-only from cache |
| Gemini | ✅ Yes | 10s | Claude via LiteLLM |
| Jina | ❌ No | 10s | BM25-only search |
| Tavily | ❌ No | 5s | Internal-only results |
| Redis | ❌ No | 2s | Chat SDK bot offline |

Critical dependencies: System severely degraded but still functional

Non-critical dependencies: Degraded mode with reduced functionality
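One way to hold this configuration in code is a small typed registry. The sketch below copies the values from the table above; the `DependencyConfig` type, the `DEPENDENCIES` mapping, and the helper function are hypothetical names, not part of the Beever Atlas codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyConfig:
    critical: bool     # does losing it severely degrade the system?
    timeout_s: float   # per-request timeout in seconds
    fallback: str      # behavior while the breaker is OPEN


# Values taken from the Dependency Configuration table.
DEPENDENCIES = {
    "weaviate": DependencyConfig(True, 5, "Serve cached wiki only"),
    "neo4j": DependencyConfig(False, 5, "Semantic-only queries"),
    "mongodb": DependencyConfig(True, 5, "Read-only from cache"),
    "gemini": DependencyConfig(True, 10, "Claude via LiteLLM"),
    "jina": DependencyConfig(False, 10, "BM25-only search"),
    "tavily": DependencyConfig(False, 5, "Internal-only results"),
    "redis": DependencyConfig(False, 2, "Chat SDK bot offline"),
}


def critical_dependencies(registry):
    """Return the names of critical dependencies in alphabetical order."""
    return sorted(name for name, cfg in registry.items() if cfg.critical)
```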

Degradation Matrix

When components fail, Beever Atlas degrades predictably:

| Component Down | Ingestion Impact | Query Impact | User Experience |
| --- | --- | --- | --- |
| Neo4j | Entity extraction skipped; facts stored in Weaviate only | route=graph → reclassified as route=semantic | Wiki People/Decisions show "temporarily unavailable" |
| Gemini | Messages queued in dead letter queue | ADK agents fall back to Claude models | Alert fired; retry on recovery |
| Redis | No impact (batch ingestion unaffected) | No impact (MCP queries unaffected) | Chat SDK bot offline; users see "bot unavailable" |
| Jina | Embeddings queued; facts stored text-only | Existing embeddings work; new facts use BM25-only | Backfill embeddings when Jina recovers |
| Tavily | No impact | Silently drop external sub-queries | User sees "external search unavailable" note |
| Weaviate | Full ingestion paused (queue in MongoDB) | Return cached wiki; graph-only for relational queries | Critical alert: system severely degraded |
| MongoDB | Full system paused | Read-only from Weaviate/Neo4j if cached connections survive | Critical alert: system offline |
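The query-side rows for Neo4j and Weaviate amount to a route reclassification rule. The following is a rough sketch of that rule only; the function name and the `"cached_wiki"` route label are assumptions made for illustration, while `"graph"` and `"semantic"` come from the matrix.

```python
def resolve_route(requested: str, weaviate_up: bool, neo4j_up: bool) -> str:
    """Pick the route to serve, given which stores are reachable."""
    if not weaviate_up:
        # Relational queries can still run graph-only; everything else
        # is served from the cached wiki.
        if requested == "graph" and neo4j_up:
            return "graph"
        return "cached_wiki"
    if requested == "graph" and not neo4j_up:
        # Neo4j down: graph queries are reclassified as semantic.
        return "semantic"
    return requested
```

For example, `resolve_route("graph", weaviate_up=True, neo4j_up=False)` yields `"semantic"`, matching the Neo4j row.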

LLM Fallback

All LLM calls use Google ADK with LiteLLM integration for model fallback:

Primary → Fallback Chain

| Agent Tier | Primary | Fallback | Last Resort |
| --- | --- | --- | --- |
| Fast (routing, extraction) | Gemini 2.0 Flash Lite | Claude Haiku | Regex fast-path |
| Quality (response, wiki) | Gemini 2.0 Flash | Claude Sonnet | Return raw results |

Per-Agent Fallback

| ADK Agent | Primary | Fallback | Last Resort |
| --- | --- | --- | --- |
| query_router_agent | Gemini Flash Lite | Claude Haiku | Regex classifier |
| fact_extractor_agent | Gemini Flash Lite | Claude Haiku | Dead letter queue |
| entity_extractor_agent | Gemini Flash Lite | Claude Haiku | Skip (Weaviate-only) |
| response_agent | Gemini Flash | Claude Sonnet | Return raw results |
| consolidation_agent | Gemini Flash Lite | Claude Haiku | Serve stale cache |

Fallback trigger: 3 consecutive failures OR 30s timeout

Recovery: Circuit breaker HALF_OPEN after 30s, probe with one request
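The Primary → Fallback → Last Resort chain can be illustrated with a per-request fall-through. This is a simplified sketch that does not use the Google ADK or LiteLLM APIs: `call_with_fallback`, the `tiers` list of `(name, async callable)` pairs, and `last_resort` are hypothetical, and the "3 consecutive failures" trigger is left to the circuit breaker rather than modeled here.

```python
import asyncio


async def call_with_fallback(prompt, tiers, last_resort, timeout_s=30.0):
    """Try each model tier in order; use the non-LLM last resort if all fail.

    tiers: list of (name, async callable taking the prompt).
    last_resort: plain callable, e.g. a regex classifier or raw-results formatter.
    Returns (name of the tier that answered, its result).
    """
    for name, call in tiers:
        try:
            result = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return name, result
        except Exception:
            # Covers provider errors and asyncio.TimeoutError alike;
            # fall through to the next tier.
            continue
    return "last_resort", last_resort(prompt)
```

In a real deployment the loop would likely also consult each model's breaker and skip tiers whose breaker is OPEN, so a known-bad provider is not retried on every request.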

Ingestion Pipeline Resilience

Each pipeline stage is independently skippable:

Stage-Level Skips

import logging

logger = logging.getLogger(__name__)


async def ingest_message(msg: NormalizedMessage):
    # Stage 1: Preprocess (required)
    preprocessed = await preprocessor.process(msg)

    # Stage 2a: Extract facts (required; send to the DLQ on failure)
    try:
        facts = await extractor.extract(preprocessed)
    except LLMUnavailableError:
        await dead_letter_queue.enqueue(msg)
        return

    # Stage 2b: Entity extraction (optional; skip if Neo4j or the LLM is down)
    entities = []
    if await health.check("neo4j") and await health.check("gemini"):
        try:
            entities = await entity_extractor.extract(preprocessed, facts)
        except Exception as e:
            logger.warning("Entity extraction failed, continuing: %s", e)
            await backfill_queue.enqueue("entities", msg.id, preprocessed)
    else:
        # A skipped stage is queued as well, so it can be backfilled on recovery
        await backfill_queue.enqueue("entities", msg.id, preprocessed)

    # Stage 3: Embed (optional; queue for backfill if Jina is down or the call fails)
    embeddings = None
    if await health.check("jina"):
        try:
            embeddings = await embedder.embed(facts)
        except Exception as e:
            logger.warning("Embedding failed, continuing: %s", e)
            await backfill_queue.enqueue("embeddings", msg.id, facts)
    else:
        await backfill_queue.enqueue("embeddings", msg.id, facts)

    # Stage 4: Persist via outbox pattern
    await persister.persist(facts, entities, embeddings)

Backfill Queues

Failed optional stages are queued for backfill:

Entities not extracted: Queued when Neo4j or LLM unavailable

  • Backfilled when both dependencies recover
  • Processed in order by timestamp

Embeddings not generated: Queued when Jina unavailable

  • Backfilled when Jina recovers
  • Processed in batches of 100

Wiki not updated: Rebuild on next scheduled run

  • No data loss (facts already stored)
  • Wiki regenerates from complete memory
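The ordering and batching rules above can be combined in a small planning helper. This is a sketch under assumptions: the document states timestamp ordering for entities and batches of 100 for embeddings, and applying both to one queue here is an illustrative simplification. The function names and the `(timestamp, message_id)` tuple shape are hypothetical.

```python
from itertools import islice


def batches(items, size=100):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def plan_backfill(queued, batch_size=100):
    """Order queued work oldest-first and split it into batches.

    queued: iterable of (timestamp, message_id) tuples.
    Returns a list of message-id batches.
    """
    ordered = sorted(queued, key=lambda entry: entry[0])
    return list(batches([msg_id for _, msg_id in ordered], batch_size))
```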

Write Safety — Outbox Pattern

Cross-store writes use the outbox pattern for safety:

Two-Phase Persist

Phase 1: Write Intent

from datetime import datetime, timezone

# Atomic write to MongoDB
intent = WriteIntent(
    id=deterministic_uuid(facts),
    facts=facts,
    entities=entities,
    embeddings=embeddings,
    status={
        "weaviate": "pending",
        "neo4j": "pending" if entities else "skipped",
        "state": "pending"
    },
    retry_count=0,
    # The reconciler filters on created_at, so the intent must carry it
    created_at=datetime.now(timezone.utc)
)
await mongo.write_intents.insert_one(intent.dict())
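The `deterministic_uuid` helper is not shown on this page. One plausible way to build it, assuming `facts` is a JSON-serializable list of dicts, is a name-based UUID (version 5) over a canonical JSON encoding. The namespace value below is a made-up placeholder, and this is a sketch rather than the actual Beever Atlas helper.

```python
import json
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
ATLAS_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "atlas.example")


def deterministic_uuid(facts) -> str:
    """Derive a stable ID from the content of `facts`.

    Keys are sorted and whitespace is removed so that logically equal
    payloads always hash to the same UUID.
    """
    canonical = json.dumps(facts, sort_keys=True, separators=(",", ":"))
    return str(uuid.uuid5(ATLAS_NAMESPACE, canonical))
```

Because the ID depends only on content, replaying the same message produces the same intent ID, which is what makes retried upserts idempotent.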

Phase 2: Fan Out

async def fan_out(intent: WriteIntent):
    # Weaviate: idempotent via deterministic UUID.
    # "failed" is retried too, so the reconciler can re-run this function.
    if intent.status["weaviate"] in ("pending", "failed"):
        try:
            await weaviate.upsert(intent.facts, intent.embeddings)
            await mark(intent.id, "weaviate", "done")
        except Exception:
            await mark(intent.id, "weaviate", "failed")

    # Neo4j: idempotent via MERGE semantics
    if intent.status["neo4j"] in ("pending", "failed"):
        try:
            for entity in intent.entities:
                await neo4j.upsert_entity(entity)
            await mark(intent.id, "neo4j", "done")
        except Exception:
            await mark(intent.id, "neo4j", "failed")

    # MongoDB sync state: final step
    await update_sync_state(intent)
    await mark(intent.id, "state", "done")

Background Write Reconciler

Runs every 15 minutes to retry incomplete writes:

from datetime import datetime, timedelta, timezone


async def reconcile():
    now = datetime.now(timezone.utc)
    stale = await mongo.write_intents.find({
        "$or": [
            {"status.weaviate": {"$in": ["pending", "failed"]}},
            {"status.neo4j": {"$in": ["pending", "failed"]}}
        ],
        # Older than 5 minutes, but not past the 24-hour abandonment cutoff
        "created_at": {
            "$lt": now - timedelta(minutes=5),
            "$gte": now - timedelta(hours=24)
        },
        "retry_count": {"$lt": 5}
    }).to_list(length=None)

    for intent in stale:
        await fan_out(WriteIntent(**intent))
        await mongo.write_intents.update_one(
            {"id": intent["id"]},
            {"$inc": {"retry_count": 1}}
        )

Max retries: 5 attempts before permanent failure

Timeout: Abandoned after 24 hours

Query Resilience

Query Router Fallbacks

Graph timeout: Fall back to semantic-only

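import asyncio

try:
    graph_results = await graph_agent.traverse(entities)
except (asyncio.TimeoutError, TimeoutError):
    # asyncio.TimeoutError is a separate class before Python 3.11
    logger.warning("Graph traversal timed out, using semantic-only")
    graph_results = []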

Both systems down: Return cached wiki

if not await health.check("weaviate") and not await health.check("neo4j"):
    cached = await wiki_cache.get(channel_id)
    if cached:
        return cached
    return {"error": "System temporarily unavailable"}

External search failure: Return internal-only results

try:
    external = await tavily.search(query)
except Exception:
    logger.warning("External search failed, using internal-only")
    external = None

Graceful Degradation

| Scenario | Behavior |
| --- | --- |
| Neo4j timeout | Use semantic-only results (no graph context) |
| Weaviate timeout | Use graph-only results (no factual context) |
| Both timeout | Return cached wiki if available |
| LLM timeout | Return raw retrieved results |
| External search timeout | Return internal-only results with note |
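The first three rows of this table can be sketched as a single retrieval step that queries both stores concurrently and degrades according to which one timed out. The `retrieve` function, its callable parameters, and the `mode` labels are hypothetical; the 5-second default mirrors the store timeouts in the Dependency Configuration table.

```python
import asyncio


async def retrieve(semantic_call, graph_call, cached_wiki, timeout_s=5.0):
    """Query both stores concurrently and degrade based on what timed out.

    semantic_call / graph_call: zero-argument async callables returning lists.
    cached_wiki: precomputed fallback content (may be None).
    """
    async def guarded(call):
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return None  # None marks "this store timed out"

    semantic, graph = await asyncio.gather(guarded(semantic_call), guarded(graph_call))

    if semantic is None and graph is None:
        return {"mode": "cached_wiki", "results": cached_wiki}
    if graph is None:
        return {"mode": "semantic_only", "results": semantic}
    if semantic is None:
        return {"mode": "graph_only", "results": graph}
    return {"mode": "full", "results": semantic + graph}
```

Running the two lookups with `asyncio.gather` means a slow store costs at most one timeout window instead of delaying the healthy store's results.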

Error Recovery Strategies

Automatic Recovery

Circuit breaker: Automatically closes after successful probe request

Backfill processing: Automatic on dependency recovery

Retry queues: Processed in order with exponential backoff

Dead letter queue: Manual inspection required after 5 failures
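Exponential backoff for the retry queues could look like the sketch below. The base delay and cap are illustrative assumptions (the page does not state them); only the limit of 5 attempts comes from the text above.

```python
def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based), doubling each time."""
    return min(cap_s, base_s * 2 ** attempt)


def retry_schedule(max_attempts: int = 5, base_s: float = 1.0, cap_s: float = 300.0):
    """Delays for every attempt before the item moves to the dead letter queue."""
    return [backoff_delay(n, base_s, cap_s) for n in range(max_attempts)]
```

With these defaults, `retry_schedule()` gives delays of 1, 2, 4, 8, and 16 seconds. Adding random jitter to each delay is a common refinement when many items retry at once.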

Manual Recovery

Admin API endpoints:

# Retry failed writes
POST /api/admin/reconcile

# Rebuild wiki for channel
POST /api/admin/wiki/rebuild
{ "channel_id": "C123456" }

# Refresh ACL for user
POST /api/admin/acl/refresh
{ "user_id": "U123456" }

# Trigger backfill
POST /api/admin/backfill/:type
# types: entities, embeddings

Monitoring

Health check endpoint:

GET /api/health
{
  "status": "degraded",
  "dependencies": {
    "weaviate": "healthy",
    "neo4j": "failing",
    "mongodb": "healthy",
    "gemini": "healthy",
    "jina": "degraded",
    "tavily": "healthy"
  },
  "uptime": 99.5,
  "degraded_since": "2026-04-13T12:00:00Z"
}

Alerts:

  • Critical dependencies down → PagerDuty alert
  • Non-critical dependencies down → Slack notification
  • High failure rate → Warning alert
  • Circuit breaker opens → Info log
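The first two alert rules can be expressed as a function over the health payload. This sketch assumes that only the `"failing"` state counts as "down" and that the critical set matches the Dependency Configuration table; `route_alerts` and the channel labels are hypothetical names.

```python
# Critical dependencies per the Dependency Configuration table.
CRITICAL = frozenset({"weaviate", "mongodb", "gemini"})


def route_alerts(health: dict) -> list:
    """Map each failing dependency to an alert channel.

    health: parsed body of GET /api/health.
    Returns (dependency, channel) pairs in alphabetical order.
    """
    alerts = []
    for name, state in sorted(health["dependencies"].items()):
        if state != "failing":
            continue  # assumption: "degraded" does not count as down
        channel = "pagerduty" if name in CRITICAL else "slack"
        alerts.append((name, channel))
    return alerts
```

Applied to the sample response above, only Neo4j is failing, so the result is a single Slack notification.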

Operational Impact

User Experience During Failures

Recovery Time Objectives

| Component | RTO (Recovery Time) | RPO (Data Loss) |
| --- | --- | --- |
| Neo4j | < 5 min | 0 min (queued) |
| Gemini | < 1 min (auto fallback) | 0 min (queued) |
| Jina | < 15 min (backfill) | 0 min (queued) |
| Weaviate | < 15 min | < 5 min (queued) |
| MongoDB | < 15 min | < 5 min (queued) |

Best Practices

For Operators

Monitor circuit breakers: Set up alerts for OPEN state

Check reconciliation: Review reconciler logs for failed writes

Test failover: Regularly test dependency failure scenarios

Plan capacity: Ensure backfill queues don't grow unbounded

Document runbooks: Step-by-step recovery procedures

For Users

Expect degraded service: Some features unavailable during outages

Check status page: Real-time system status at /status

Use cached wiki: Wiki content available even during outages

Report issues: Slack channel for system status updates
