Want the OSS library? extremis.ashwanijha.dev
Layered, learning memory for AI agents — with verification, consolidation, and a triage dashboard — in one HTTP call. The same Extremis library, fully managed.
# pip install "extremis[client]"
from extremis import HostedClient

mem = HostedClient(api_key="extremis_sk_…")
mem.remember("User is building a WhatsApp AI")
hits = mem.recall("what is the user building?")
hits[0].verification  # grounded 0.93, judge 0.94
hits[0].source  # doc + chunk + span → citation
Inside one recall()
Watch a single mem.recall() travel through Extremis Cloud's pipeline. Each stage is a moving part you'd otherwise wire up, scale, and pay for separately.
Your agent: mem.recall(...) → Embed: MiniLM-L6 → Retrieve: semantic + BM25 → Verify: NLI → Judge: haiku-4-5 → Re-rank: RL-scored → Result: hits + .source, returned to your agent
1 hit · reason "similarity 0.87 · used 5×" · grounded 0.93 · source support-handbook.md#42
Dashed box = the whole hosted retrieval stack: embedder, hybrid retrieval (semantic + BM25), NLI verifier, LLM judge, RL re-ranker. One HTTP call replaces five services you'd otherwise pick, wire up, and pay for.
What you'd otherwise wire up
Rough estimate for an agent serving ~1,000 recalls per day. Numbers shift with traffic; the count of moving parts doesn't.
| Component | DIY (self-assembled) | Extremis Cloud |
|---|---|---|
| Hybrid retrieval (semantic + BM25) | Pinecone ~$30–70/mo or pgvector + tsvector on Supabase Pro $25/mo + ops + a hand-tuned hybrid scorer | Per-tenant Postgres included, semantic + BM25 fused server-side |
| Embedder | OpenAI text-embedding-3-small $0.02/M tokens or self-host sentence-transformers (90MB model, CPU/GPU) | MiniLM included; bring your own OpenAI key if you want bigger |
| NLI verifier | Free (deberta-v3-small, ~200MB) but needs Python infra + warmup | Included, pre-warmed, ~12ms per check |
| LLM judge | Claude Haiku at ~$0.001/call · ~$30/mo at 1k recalls/day | Included up to the cohort tier |
| Consolidation worker | LLM cost $30–50/mo + cron infra + queue + retry logic | Nightly dream pass runs for you |
| Observability / dashboard | Build it yourself, or pay a vendor $50+/mo | Triage queue + span tree + drift trends — included |
| Citations / source tracking | Hand-wire doc-id, chunk-id, span offsets into your retriever; usually skipped | Every recall returns .source — doc, chunk, span — for citing back to the original |
| Engineering time | 1–2 days to wire up, then ongoing maintenance | ≈30 seconds (sign in, paste API key) |
Like AWS Bedrock for retrieval pipelines, but agent-memory-shaped — and self-hostable under MIT if Cloud ever stops working for you.
A new framing
Every customer conversation, every internal doc, every decision your team has made is a memory. Extremis stores them in four layers, verifies each one at write time, returns citations on recall, and never sends your data anywhere you didn't agree to. Vector DB plus retrieval plus verification plus governance — one product, your data, your control.
Identity: who the user is, what role they hold, what they prefer.
"Alex is the new Head of Support · prefers terse Slack replies"
Semantic: durable facts about your business, products, customers, policies.
"SLA is 99.9% uptime · refund window is 30 days · EU customers pay VAT"
Procedural: how your team does things, such as playbooks, escalation paths, and rules.
"When a P0 lands: page on-call, post in #incidents, open a Linear issue"
Episodic: time-stamped events, such as conversations, tickets, decisions, and what happened when.
"Ticket 8412 closed 2026-05-18 · customer wanted refund · resolved via 30-day policy"
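As an illustration only, the four layers could be modelled as a small enum plus a record type. The names `Layer`, `Memory`, and `by_layer` are hypothetical and are not taken from the Extremis API. The layer names come from the comparison table on this page, and pairing each name with a description is an inference from the wording above.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """Hypothetical names for the four layers (not the Extremis API)."""
    IDENTITY = "identity"      # who the user is, role, preferences
    SEMANTIC = "semantic"      # durable facts about the business
    PROCEDURAL = "procedural"  # playbooks, escalation paths, rules
    EPISODIC = "episodic"      # time-stamped events


@dataclass(frozen=True)
class Memory:
    layer: Layer
    text: str


# Example texts are the ones quoted on this page.
examples = [
    Memory(Layer.IDENTITY, "Alex is the new Head of Support · prefers terse Slack replies"),
    Memory(Layer.SEMANTIC, "SLA is 99.9% uptime · refund window is 30 days · EU customers pay VAT"),
    Memory(Layer.PROCEDURAL, "When a P0 lands: page on-call, post in #incidents, open a Linear issue"),
    Memory(Layer.EPISODIC, "Ticket 8412 closed 2026-05-18 · customer wanted refund · resolved via 30-day policy"),
]


def by_layer(memories: list[Memory], layer: Layer) -> list[Memory]:
    """Return only the memories stored in the given layer."""
    return [m for m in memories if m.layer is layer]


print(by_layer(examples, Layer.PROCEDURAL)[0].text)
```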
Stays private by design
BYO Postgres or self-host on your VPC — Cloud orchestrates, you store. Air-gapped mode is one config flag away.
Citable by default
Every recall() returns .source — doc, chunk, byte-span. Show your customers where the answer came from.
Governed at write time
The two-tier verifier flags wrong memories before they enter your namespace. You triage, supersede, or accept.
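To make the "Citable by default" point concrete, here is a minimal sketch of how a doc + chunk + byte-span citation could be turned back into the quoted passage. The `Source` record, its field names, and the `cite` helper are assumptions for illustration; the actual shape of `.source` in Extremis may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical citation record: document id, chunk id, byte span."""
    doc: str
    chunk: int
    start: int  # byte offset, inclusive
    end: int    # byte offset, exclusive


def cite(source: Source, documents: dict[str, bytes]) -> str:
    """Return the cited passage by slicing the original document bytes."""
    raw = documents[source.doc]
    return raw[source.start:source.end].decode("utf-8")


# A toy document store keyed by document id (illustrative data only).
docs = {
    "support-handbook.md": "Refund window is 30 days. SLA is 99.9% uptime.".encode("utf-8"),
}
src = Source(doc="support-handbook.md", chunk=42, start=26, end=46)

print(f"{cite(src, docs)!r} ({src.doc}#{src.chunk})")
```

Slicing bytes rather than characters keeps the offsets stable for non-ASCII documents, provided the span boundaries fall on character boundaries.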
Hallucination detection
A two-tier verifier runs at write time: a fast NLI model first, then an LLM judge for grey-zone scores. Failing memories aren't silently dropped — they're tagged unverified and downranked at recall time. Every recall returns a verification trace you can inspect.
verifier thresholds (defaults)
below grey — auto-flag
below grey — store as unverified, downrank
pass — store as-is
NLI ≥ 0.85 stores as-is · 0.5–0.85 invokes judge · <0.5 flags
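The default thresholds above can be read as a three-way routing rule. The sketch below is one way to express it; the function and constant names are hypothetical, and treating a score of exactly 0.5 as grey-zone (sent to the judge) is an assumption, since the page does not say which side that boundary falls on.

```python
PASS_THRESHOLD = 0.85  # NLI score at or above this is stored as-is
GREY_FLOOR = 0.5       # NLI score below this is flagged


def route(nli_score: float) -> str:
    """Map an NLI groundedness score to the next step in a two-tier verifier."""
    if not 0.0 <= nli_score <= 1.0:
        raise ValueError("nli_score must be between 0 and 1")
    if nli_score >= PASS_THRESHOLD:
        return "store"  # confident pass, no judge call needed
    if nli_score >= GREY_FLOOR:
        return "judge"  # grey zone, escalate to the LLM judge
    return "flag"       # clear failure, flag for triage


for score in (0.93, 0.70, 0.42):
    print(score, route(score))
```

Running the cheap NLI check first means the LLM judge is only paid for on scores that fall in the grey zone.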
Memory observability
Every memory write, every recall, every verification — surfaced in the dashboard the moment it happens.
Memories flagged by NLI or the LLM judge land in a queue with the source excerpt, the extraction, the score, and a one-click supersede.
Click any recall in the feed → see the call tree (embed → search → verify → judge → re-rank). Errors are red and carry a suggested fix.
14-day rolling NDCG, MRR, and groundedness. Alerts when this week's score deviates >2σ from baseline.
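A drift alert of the ">2σ from baseline" kind can be sketched in a few lines. How Extremis defines the baseline is not stated here, so this example assumes it is the mean and sample standard deviation of a list of recent daily scores; `drifted` is a hypothetical helper name.

```python
from statistics import mean, stdev


def drifted(baseline: list[float], current: float, sigmas: float = 2.0) -> bool:
    """True when `current` lies more than `sigmas` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline points")
    mu = mean(baseline)
    sd = stdev(baseline)  # sample standard deviation
    if sd == 0:
        # A perfectly flat baseline: any change at all counts as drift.
        return current != mu
    return abs(current - mu) > sigmas * sd


# Illustrative groundedness scores, not real measurements.
baseline_scores = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.90]
print(drifted(baseline_scores, 0.80))
print(drifted(baseline_scores, 0.905))
```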
example: contradicted recall (from the live dashboard)
extremis.hosted.recall 124ms ⌐
  embedder.embed 10ms ✓
  retrieve.hybrid 11ms ✓ (semantic + BM25)
  verifier.nli 14ms ⌐ grounded 0.42
  verifier.judge 47ms ⌐ grounded 0.18
why it failed: sources self-correct from 99.95% to 99.9%; extracted memory captured the pre-correction value.
what to try: mem.remember_now(layer="semantic", confidence=0.95)
Privacy posture
Extremis is MIT-licensed. The library, the server, and the Cloud orchestrator are the same engine — what differs is where state lives and who runs the compute. The 1-click default is the right starting point for most teams; the other three are escape hatches.
Anyone who wants memory + verification + dashboard with zero setup. Pick this unless compliance says otherwise.
Sign in with Google →
Teams that want managed compute but residency stays with them.
Sign in & wire your DB →
Compliance-heavy buyers who want full control.
Self-host guide →
Regulated workloads, secure environments, dev laptops.
Read the OSS docs →
The three primitives
mem.remember(): append to fsync'd log + episodic store
mem.remember(
    "user wants the SLA in writing",
    conversation_id="c1",
)
mem.recall(): ranked by cosine × RL score × recency
hits = mem.recall("SLA")
# returns ranked results,
# each with .reason and .verification
mem.reinforce(): asymmetric 1.5× weight on negative signals
mem.report_outcome(
    [h.memory.id for h in hits[:2]],
    success=True,
)
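The ranking and reinforcement rules named above (cosine × RL score × recency, and a 1.5× weight on negative signals) can be sketched as follows. This is not the library's implementation: the `Candidate` record, the exponential recency decay with a 30-day half-life, the starting RL score of 1.0, and the additive update with step 0.1 are all assumptions chosen to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    cosine: float    # similarity to the query, 0..1
    rl_score: float  # learned usefulness (assumed to start at 1.0)
    age_days: float


def recency(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay; the 30-day half-life is an illustrative assumption."""
    return 0.5 ** (age_days / half_life_days)


def rank_score(c: Candidate) -> float:
    """Multiply the three factors named above into a single ranking score."""
    return c.cosine * c.rl_score * recency(c.age_days)


def reinforce(c: Candidate, success: bool, step: float = 0.1) -> None:
    """Raise rl_score on success; lower it 1.5x as hard on failure."""
    delta = step if success else -1.5 * step
    c.rl_score = max(0.0, c.rl_score + delta)


fresh = Candidate("SLA is 99.9% uptime", cosine=0.87, rl_score=1.0, age_days=0.0)
stale = Candidate("old SLA note", cosine=0.90, rl_score=1.0, age_days=60.0)

ranked = sorted([fresh, stale], key=rank_score, reverse=True)
print([c.text for c in ranked])

reinforce(stale, success=False)  # a bad outcome costs 0.15
reinforce(fresh, success=True)   # a good outcome earns 0.10
print(round(stale.rl_score, 2), round(fresh.rl_score, 2))
```

Weighting failures more heavily than successes means a memory that misleads the agent loses rank faster than a helpful one gains it.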
Vs the alternatives
| Feature | Extremis Cloud | Mem0 Cloud | Letta Cloud | Raw RAG (Pinecone) |
|---|---|---|---|---|
| Layered memory (identity/semantic/episodic/procedural) | ✓ | — | — | — |
| RL-scored retrieval (1.5× asymmetric on negatives) | ✓ | — | — | — |
| Per-recall reason strings | ✓ | — | — | — |
| Knowledge graph built in | ✓ | — | partial | — |
| Hallucination detection bundled | ✓ | — | — | — |
| Per-tenant Postgres / BYO storage | ✓ | shared | ✓ | — |
| Self-host the same library (MIT) | ✓ | — | — | ✓ |
| MCP server (9 tools) | ✓ | — | — | — |
Benchmarks
Same numbers as the OSS library — Hosted Extremis is the identical engine, fully managed. Reproducible benchmark run on GitHub. QA accuracy depends on the answerer model.
94.4%
Retrieval R@5
top-5 includes the answer session
38.8%
QA Accuracy
claude-haiku-4-5 as answerer
~35ms
p50 recall latency
local model · MPS · varies in prod
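For readers unfamiliar with the R@5 metric, the sketch below shows how "top-5 includes the answer session" is typically computed. It is not the benchmark harness referenced above; `recall_at_k` and the session ids are hypothetical.

```python
def recall_at_k(results: list[list[str]], answers: list[str], k: int = 5) -> float:
    """Fraction of queries whose top-k retrieved session ids include the answer session."""
    if not results or len(results) != len(answers):
        raise ValueError("results and answers must be non-empty and the same length")
    hits = sum(1 for ranked, answer in zip(results, answers) if answer in ranked[:k])
    return hits / len(results)


# Three toy queries: ranked session ids per query, and the correct session for each.
retrieved = [
    ["s1", "s2"],                          # answer s2 is at rank 2: hit
    ["s9", "s3", "s4", "s5", "s6", "s7"],  # answer s7 is at rank 6: miss
    ["s8"],                                # answer s0 was never retrieved: miss
]
gold = ["s2", "s7", "s0"]

print(recall_at_k(retrieved, gold, k=5))
```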
Every recall explains itself
No black box. Every result carries a one-line reason — the same string the OSS library returns. You see exactly why a memory surfaced, in plain English.
example reason strings
Pricing
Manual invoicing for the first 10 customers: pick a number that works for you and we'll talk. Tier pricing lands after we know what real usage looks like. The free tier covers the first 10k memories; self-hosting is always free.
30 seconds from sign-in to first recall. If Cloud isn't for you, self-host the OSS library — same engine, MIT-licensed.