Extremis Cloud · early access

Want the OSS library? extremis.ashwanijha.dev

Stop assembling memory from six moving parts.

Layered, learning memory for AI agents — with verification, consolidation, and a triage dashboard — in one HTTP call. The same Extremis library, fully managed.

✓ Sign in with Google · 30 seconds to first recall
✓ Hallucination detection bundled
✓ Your data, your Postgres if you want
agent.py
# pip install "extremis[client]"
from extremis import HostedClient

mem = HostedClient(api_key="extremis_sk_…")

mem.remember("User is building a WhatsApp AI")
hits = mem.recall("what is the user building?")

hits[0].verification  # grounded 0.93, judge 0.94
hits[0].source        # doc + chunk + span → citation

Inside one recall()

Six stages. One API call. We operate every box.

Watch a single mem.recall() travel through Extremis Cloud's pipeline. Each stage is a moving part you'd otherwise wire up, scale, and pay for separately.

live pipeline · one recall() call

Your agent: mem.recall(...) → HTTP request → [Managed by Extremis: Embed (MiniLM-L6) → Retrieve (semantic + BM25) → Verify (NLI) → Judge (haiku-4-5) → Re-rank (RL-scored)] → HTTP response → Result: hits + .source, returned to your agent

1 hit · reason "similarity 0.87 · used 5×" · grounded 0.93 · source support-handbook.md#42

Dashed box = the whole hosted retrieval stack: embedder, hybrid retrieval (semantic + BM25), NLI verifier, LLM judge, RL re-ranker. One HTTP call replaces five services you'd otherwise pick, wire up, and pay for.
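
This page does not say how the semantic and BM25 rankings are fused. One common way to merge two ranked lists is reciprocal rank fusion, sketched below purely as an illustration of what a hybrid scorer can look like. The function name, the constant k=60, and the memory ids are assumptions, not Extremis's implementation.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one ranking (illustrative only)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute more to the fused score.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

semantic = ["m3", "m1", "m2"]   # hypothetical ids ranked by embedding similarity
bm25 = ["m1", "m4", "m3"]       # hypothetical ids ranked by keyword match
fused = reciprocal_rank_fusion([semantic, bm25])
```

Here m1 comes out ahead of m3 because it sits near the top of both lists, which is the behaviour you usually want from a hybrid retriever.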

What you'd otherwise wire up

Don't rebuild this from scratch.

Rough estimate for an agent serving ~1,000 recalls per day. Numbers shift with traffic; the count of moving parts doesn't.

Component | DIY (self-assembled) | Extremis Cloud
Hybrid retrieval (semantic + BM25) | Pinecone ~$30–70/mo, or pgvector + tsvector on Supabase Pro $25/mo + ops + a hand-tuned hybrid scorer | Per-tenant Postgres included; semantic + BM25 fused server-side
Embedder | OpenAI text-embedding-3-small $0.02/M tokens, or self-host sentence-transformers (90MB model, CPU/GPU) | MiniLM included; bring your own OpenAI key if you want bigger
NLI verifier | Free (deberta-v3-small, ~200MB) but needs Python infra + warmup | Included, pre-warmed, ~12ms per check
LLM judge | Claude Haiku at ~$0.001/call · ~$30/mo at 1k recalls/day | Included up to the cohort tier
Consolidation worker | LLM cost $30–50/mo + cron infra + queue + retry logic | Nightly dream pass runs for you
Observability / dashboard | Build it yourself, or pay a vendor $50+/mo | Triage queue + span tree + drift trends, all included
Citations / source tracking | Hand-wire doc-id, chunk-id, span offsets into your retriever; usually skipped | Every recall returns .source (doc, chunk, span) for citing back to the original
Engineering time | 1–2 days to wire up, then ongoing maintenance | ≈30 seconds (sign in, paste API key)
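
The LLM judge figure in the table is simple arithmetic, reproduced here so you can plug in your own traffic. The helper name is hypothetical, and the calculation assumes one judge call per recall and a 30-day month, which the table implies but does not state.

```python
def monthly_cost(calls_per_day, cost_per_call, days=30):
    """Rough monthly spend for a per-call priced component."""
    return calls_per_day * cost_per_call * days

# ~1,000 recalls/day at ~$0.001 per judge call, as quoted in the table.
judge_monthly = monthly_cost(1_000, 0.001)
```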

Like AWS Bedrock for retrieval pipelines, but agent-memory-shaped — and self-hostable under MIT if Cloud ever stops working for you.

A new framing

Think of it as a managed knowledge base for your business — that stays private.

Every customer conversation, every internal doc, every decision your team has made is a memory. Extremis stores them in four layers, verifies each one at write time, returns citations on recall, and never sends your data anywhere you didn't agree to. Vector DB plus retrieval plus verification plus governance — one product, your data, your control.

Identity layer

Who the user is, what role they hold, what they prefer.

"Alex is the new Head of Support · prefers terse Slack replies"

Semantic layer

Durable facts about your business, products, customers, policies.

"SLA is 99.9% uptime · refund window is 30 days · EU customers pay VAT"

Procedural layer

How your team does things — playbooks, escalation paths, rules.

"When a P0 lands: page on-call, post in #incidents, open a Linear issue"

Episodic layer

Time-stamped events — conversations, tickets, decisions, what happened when.

"Ticket 8412 closed 2026-05-18 · customer wanted refund · resolved via 30-day policy"

Stays private by design

BYO Postgres or self-host on your VPC — Cloud orchestrates, you store. Air-gapped mode is one config flag away.

Citable by default

Every recall() returns .source — doc, chunk, byte-span. Show your customers where the answer came from.

Governed at write time

The two-tier verifier flags wrong memories before they enter your namespace. You triage, supersede, or accept.
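
To make "doc, chunk, byte-span" concrete, here is a minimal sketch of turning such a source record into a quotable excerpt. The Source class, its field names, and the cite() helper are assumptions for illustration; the real .source object may be shaped differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    doc: str      # document identifier
    chunk: int    # chunk index within the document (assumed meaning)
    start: int    # byte offset where the cited span begins
    end: int      # byte offset one past the end of the span

def cite(source, documents):
    """Slice the original bytes by span and format a human-readable citation."""
    raw = documents[source.doc]
    excerpt = raw[source.start:source.end].decode("utf-8")
    return f'{source.doc}#{source.chunk}: "{excerpt}"'

# Hypothetical document store: id -> raw bytes of the original text.
documents = {"support-handbook.md": "Refund window is 30 days.".encode("utf-8")}
src = Source(doc="support-handbook.md", chunk=42, start=0, end=24)
citation = cite(src, documents)
```

Slicing bytes rather than characters keeps the offsets stable regardless of how the text is later decoded or displayed.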

Hallucination detection

Wrong memories are flagged, not stored quietly.

A two-tier verifier runs at write time: a fast NLI model first, then an LLM judge for grey-zone scores. Failing memories aren't silently dropped — they're tagged unverified and downranked at recall time. Every recall returns a verification trace you can inspect.

  • On self-host: configure your own thresholds, pick the NLI model, point at any judge LLM.
  • On Cloud: the dashboard surfaces flagged memories as a triage queue and renders the trace tree with failures shown as red rows.

verifier thresholds (defaults)

NLI · cross-encoder/nli-deberta-v3-small · score 0.42
scale 0 to 1 · grey 0.5 · pass 0.85
below grey: auto-flag

LLM judge · claude-haiku-4-5 · score 0.18
scale 0 to 1 · grey 0.4 · pass 0.7
below grey: store as unverified, downrank

NLI · grounded recall · score 0.93
scale 0 to 1 · grey 0.5 · pass 0.85
pass: store as-is

NLI ≥ 0.85 stores as-is · 0.5–0.85 invokes judge · <0.5 flags
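
The routing rule above can be sketched as a small function. This is an illustration of the two-tier idea, not the library's code: the function name, the status strings, and the trace shape are assumptions. The page also does not say what happens to a judge score between its grey (0.4) and pass (0.7) marks, so this sketch treats anything below pass as unverified.

```python
NLI_GREY, NLI_PASS = 0.5, 0.85   # NLI thresholds quoted on this page
JUDGE_PASS = 0.7                 # judge pass threshold quoted on this page

def verify(nli_score, judge):
    """Return (status, trace) for one candidate memory.

    `judge` is a zero-argument callable returning a score in [0, 1].
    It is only called when the NLI score lands in the grey zone.
    """
    trace = [("nli", nli_score)]
    if nli_score >= NLI_PASS:
        return "verified", trace          # store as-is
    if nli_score < NLI_GREY:
        return "flagged", trace           # auto-flag, judge never runs
    judge_score = judge()
    trace.append(("judge", judge_score))
    # Assumption: below the judge's pass mark the memory is kept but unverified.
    status = "verified" if judge_score >= JUDGE_PASS else "unverified"
    return status, trace

status, trace = verify(0.6, lambda: 0.18)
```

Running the cheap NLI check first means the slower, paid judge call only happens for the ambiguous middle band.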

Memory observability

The dashboard you'd otherwise build yourself.

Every memory write, every recall, every verification — surfaced in the dashboard the moment it happens.

Triage

Hallucination queue

Memories flagged by NLI or the LLM judge land in a queue with the source excerpt, the extraction, the score, and a one-click supersede.

Trace

Per-recall span tree

Click any recall in the feed → see the call tree (embed → search → verify → judge → re-rank). Errors are red and carry a suggested fix.

Drift

Quality trends

14-day rolling NDCG, MRR, and groundedness. Alerts when this week's score deviates >2σ from baseline.
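
A ">2σ from baseline" alert boils down to a few lines. The sketch below assumes the baseline is a list of past daily scores and uses the sample standard deviation; both choices, along with the function name and the numbers, are illustrative rather than a description of how the dashboard computes it.

```python
from statistics import mean, stdev

def drifted(baseline, current, n_sigma=2.0):
    """True when `current` is more than n_sigma standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)   # sample standard deviation; needs at least 2 points
    return abs(current - mu) > n_sigma * sigma

# Hypothetical daily NDCG values forming the baseline window.
baseline_ndcg = [0.80, 0.82, 0.81, 0.79, 0.80, 0.83, 0.81]
alert = drifted(baseline_ndcg, 0.70)
```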

example: contradicted recall (from the live dashboard)

extremis.hosted.recall        124ms ⌐
  embedder.embed             10ms ✓
  retrieve.hybrid            11ms ✓ (semantic + BM25)
  verifier.nli               14ms ⌐ grounded 0.42
  verifier.judge             47ms ⌐ grounded 0.18

why it failed:
  sources self-correct from 99.95% to 99.9%;
  extracted memory captured the pre-correction value.

what to try:
  mem.remember_now(layer="semantic", confidence=0.95)

Privacy posture

Four deployment modes. Pick what fits.

Extremis is MIT-licensed. The library, the server, and the Cloud orchestrator are the same engine — what differs is where state lives and who runs the compute. The 1-click default is the right starting point for most teams; the other three are escape hatches.

Default · 1-click

Cloud · managed Postgres

  • Sign in with Google → API key in ~30 seconds
  • We auto-provision a per-tenant Postgres on Supabase free tier
  • Upgrade the underlying DB in Supabase when you outgrow it

Anyone who wants memory + verification + dashboard with zero setup. Pick this unless compliance says otherwise.

Compliance-friendly

Cloud · BYO Postgres

  • Paste a Neon / RDS / self-managed Postgres connection string
  • We orchestrate, you store — persistent state never leaves your DB
  • Pause Cloud anytime; your data is already with you

Teams that want managed compute but residency stays with them.

Your infra

Self-host · your VPC

  • Run extremis-server in your own infra (Docker)
  • Your Postgres, your S3, your secrets, your egress
  • Same image we run in Cloud — no version skew

Compliance-heavy buyers who want full control.

Strictest · air-gapped

OSS · fully local

  • SQLite on disk; zero network calls
  • Optionally point judge at a local model, or disable verification
  • No data leaves the machine, ever

Regulated workloads, secure environments, dev laptops.


The three primitives

A library this small, with this much running behind it.

1. mem.remember()

append to fsync'd log + episodic store

mem.remember(
  "user wants the SLA in writing",
  conversation_id="c1",
)

2. mem.recall()

ranked by cosine × RL score × recency

hits = mem.recall("SLA")
# returns ranked results,
# each with .reason and .verification
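
"Cosine × RL score × recency" names the three factors but not how each is computed. The sketch below is one plausible reading, with loud assumptions: the RL score is mapped to a positive multiplier with an exponential, and recency is an exponential decay with a 30-day half-life. None of these constants or names come from the library.

```python
import math

def rank_score(cosine, rl_score, age_days, half_life_days=30.0):
    """Combine similarity, learned score, and recency into one ranking value."""
    rl_factor = math.exp(0.1 * rl_score)           # assumed: keeps the factor positive
    recency = 0.5 ** (age_days / half_life_days)   # assumed: halves every 30 days
    return cosine * rl_factor * recency

# Hypothetical candidates: (cosine similarity, RL score, age in days).
candidates = {
    "a": rank_score(0.87, 2.0, 3),     # similar, reinforced, fresh
    "b": rank_score(0.90, -1.5, 3),    # slightly more similar, but penalised
    "c": rank_score(0.87, 2.0, 90),    # same as "a" but three months old
}
ranked = sorted(candidates, key=candidates.get, reverse=True)
```

In this toy example the reinforced memory beats the one with higher raw similarity, and the stale copy falls to the bottom.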

3. mem.reinforce()

asymmetric 1.5× weight on negative signals

mem.report_outcome(
  [h.memory.id for h in hits[:2]],
  success=True,
)
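
The "asymmetric 1.5× weight on negative signals" can be pictured as a score update in which failures cost more than successes earn. This is a minimal sketch under that reading; the function name, the plain dict of scores, and the step size of 1.0 are assumptions.

```python
NEGATIVE_WEIGHT = 1.5  # from this page: negative signals weigh 1.5x

def apply_outcome(scores, memory_ids, success, step=1.0):
    """Nudge each memory's score up on success, down 1.5x as hard on failure."""
    delta = step if success else -NEGATIVE_WEIGHT * step
    for memory_id in memory_ids:
        scores[memory_id] = scores.get(memory_id, 0.0) + delta
    return scores

scores = {}
apply_outcome(scores, ["m1", "m2"], success=True)   # both helped
apply_outcome(scores, ["m2"], success=False)        # m2 later misled the agent
```

With this weighting, one bad outcome more than cancels one good one, so a memory has to be right more often than wrong to keep climbing.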

Vs the alternatives

What sets Extremis apart.

Feature | Extremis Cloud | Mem0 Cloud | Letta Cloud | Raw RAG (Pinecone)
Layered memory (identity/semantic/episodic/procedural)
RL-scored retrieval (1.5× asymmetric on negatives)
Per-recall reason strings
Knowledge graph built in · partial
Hallucination detection bundled
Per-tenant Postgres / BYO storage · shared
Self-host the same library (MIT)
MCP server (9 tools)

Benchmarks

LongMemEval-S · 500 QA instances · ~53 sessions each.

Same numbers as the OSS library — Hosted Extremis is the identical engine, fully managed. Reproducible benchmark run on GitHub. QA accuracy depends on the answerer model.

94.4% · Retrieval R@5 · top-5 includes the answer session

38.8% · QA Accuracy · claude-haiku-4-5 as answerer

~35ms · p50 recall latency (local model · MPS · varies in prod)
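
For readers unfamiliar with R@5: it is the fraction of questions whose answer session appears among the top five retrieved sessions. The sketch below simplifies to one answer session per question, which may not match how the benchmark harness handles multi-session answers; the function name and session ids are made up.

```python
def recall_at_k(retrieved, answers, k=5):
    """Fraction of queries whose answer id appears in the top-k retrieved ids."""
    hits = sum(
        1 for ranked, answer in zip(retrieved, answers) if answer in ranked[:k]
    )
    return hits / len(answers)

# Three hypothetical questions with their ranked session ids.
retrieved = [
    ["s1", "s2", "s3", "s4", "s5", "s6"],   # answer s5 is at rank 5: hit
    ["s9", "s8"],                           # answer s8 is at rank 2: hit
    ["s7", "s6", "s5", "s4", "s3", "s2"],   # answer s2 is at rank 6: miss
]
answers = ["s5", "s8", "s2"]
score = recall_at_k(retrieved, answers)
```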

Every recall explains itself

Debuggable by default.

No black box. Every result carries a one-line reason — the same string the OSS library returns. You see exactly why a memory surfaced, in plain English.

example reason strings

  • similarity 0.87 · score +2.0 · used 5× · 3d old
  • identity layer (×2 weight) · matched user's prior preference
  • downranked: judge flagged unverified at write time
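
A reason string like the first example is easy to assemble from per-memory stats. The helper below is a hypothetical reconstruction of the format shown, not the library's own formatter.

```python
def reason(similarity, score, uses, age_days):
    """Format a one-line explanation in the style of the examples above."""
    return (
        f"similarity {similarity:.2f} · score {score:+.1f} "
        f"· used {uses}× · {age_days}d old"
    )

line = reason(0.87, 2.0, 5, 3)
```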

Pricing

First cohort is hand-priced.

Manual invoicing for the first 10 customers: pick a number that works for you and we'll talk. Tier pricing lands after we know what real usage looks like. The free tier covers the first 10k memories; self-hosting is always free.

Drop into your agent today.

30 seconds from sign-in to first recall. If Cloud isn't for you, self-host the OSS library — same engine, MIT-licensed.