Hybrid RAG that knows when it doesn't know
- #ai
- #rag
- #postgres
Acorn Reply drafts support replies for you. The draft is only ever a draft — a human edits it and hits send — but a bad draft still costs something. It costs the few seconds it takes to notice it's wrong, delete it, and start over. Do that often enough and the AI is a tax, not a tool.
The failure mode we cared about most isn't a draft that's a little off. It's a draft that is confidently wrong: a fluent, well-formatted answer to a question the knowledge base can't actually answer. A human skimming a plausible paragraph is exactly the wrong reader to catch a fabricated refund window or a made-up setting name.
So the retrieval layer that feeds our drafts is built around one idea: it should be able to tell the difference between "I found the answer" and "I found the nearest thing, which isn't the same." Here's how it works.
Why a single vector search isn't enough
The default RAG recipe is: embed the question, embed your documents, and pull back the nearest neighbors by cosine similarity. It works remarkably well — until it doesn't, and the way it fails is the problem.
Vector search always returns something. Ask about a feature that doesn't exist and you'll still get the top-k closest chunks, ranked, scored, and looking for all the world like a real match. Cosine distance tells you which chunk is nearest. It does not tell you whether nearest is near enough to be an answer. Embeddings are also fuzzy in a specific way: they're great at "these mean the same thing" and weak at exact tokens — order IDs, error codes, a product name spelled a particular way — which is precisely the vocabulary support questions hinge on.
Keyword search has the opposite personality. Postgres full-text search
(websearch_to_tsquery) is unforgiving about meaning but excellent at
literal tokens, and when it matches nothing it's honest about it: you get
zero rows, not a confidently-ranked wrong guess.
Neither arm is trustworthy alone. But their mistakes are uncorrelated — and that's the opening.
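To make the contrast concrete, here is a toy sketch. Everything in it is invented for illustration: the three-dimensional "embeddings", the document IDs, the stopword list, and the all-terms-must-match rule that loosely imitates how a full-text query behaves. It only shows the shape of the two failure modes, not how either search is actually implemented.

```python
import math

# Toy corpus: id -> (made-up 3-d embedding, text). Real embeddings come from a model.
DOCS = {
    "reset-password": ([0.9, 0.1, 0.1], "how to reset your password"),
    "export-invoices": ([0.1, 0.9, 0.1], "how to export invoices as csv"),
    "invite-teammate": ([0.1, 0.1, 0.9], "how to invite a teammate"),
}
STOPWORDS = {"how", "to", "your", "as", "a", "the"}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def vector_search(query_vec, k=2):
    """Always returns k ids, however poor the best match is."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d][0]), reverse=True)
    return ranked[:k]


def keyword_search(query):
    """Returns only docs containing every non-stopword term; may return nothing."""
    terms = set(query.lower().split()) - STOPWORDS
    return [d for d, (_, text) in DOCS.items() if terms and terms <= set(text.split())]


# A question about a feature the knowledge base does not cover:
off_topic_vec = [0.4, 0.5, 0.6]          # stand-in for embed("enable dark mode")
vector_result = vector_search(off_topic_vec)
keyword_result = keyword_search("enable dark mode")
```

The vector arm hands back two ranked IDs even though nothing in the corpus is about dark mode, while the keyword arm returns an empty list.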
Two arms, run in parallel
Every retrieval runs both searches against the same workspace's chunks at once. They share the (LLM-rewritten) query but otherwise see the world differently:
    rewritten = llm_rewrite(inbound_message)   # fall back to raw text on failure
    embedding = embed(rewritten)

    vector_hits, keyword_hits = in_parallel(
        # arm 1, nearest by meaning: pgvector cosine distance ( <=> )
        nearest_by_cosine(embedding, workspace, published_only, limit = 30),
        # arm 2, nearest by token: Postgres full-text ( websearch_to_tsquery )
        full_text_match(rewritten, workspace, published_only, limit = 30),
    )

A few things worth pointing out:
- It's all Postgres. The <=> operator is pgvector's cosine distance; tsv @@ websearch_to_tsquery(...) is the full-text arm. One database, two indexes, no separate vector service to operate. For a product built by a small team, "the search engine is just Postgres" is a feature, not a compromise.
- Tenancy is in the WHERE clause. Every row is scoped by workspace_id. Retrieval can't leak across tenants because the query can't see across tenants.
- Both arms pull a generous candidate pool (30 rows by default), not the final count, so fusion has something to work with before we trim to the handful that reaches the model.
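As a rough sketch of what the two arms could look like as SQL, the block below builds both queries with shared parameters. The table name kb_chunks, the columns (embedding, tsv, published, parent_id), the 'english' text-search configuration, and the psycopg-style %(name)s placeholders are all assumptions for illustration; only the <=> operator, tsv @@ websearch_to_tsquery(...), ts_rank, the workspace_id scoping, and the limit of 30 come from the description above.

```python
# Arm 1: nearest by meaning. <=> is pgvector's cosine distance operator.
VECTOR_ARM_SQL = """
SELECT id, parent_id
FROM kb_chunks
WHERE workspace_id = %(workspace_id)s
  AND published
ORDER BY embedding <=> %(embedding)s::vector
LIMIT %(limit)s
"""

# Arm 2: nearest by token. Rows that do not match are simply absent.
KEYWORD_ARM_SQL = """
SELECT c.id, c.parent_id
FROM kb_chunks AS c, websearch_to_tsquery('english', %(query)s) AS q
WHERE c.workspace_id = %(workspace_id)s
  AND c.published
  AND c.tsv @@ q
ORDER BY ts_rank(c.tsv, q) DESC
LIMIT %(limit)s
"""


def to_vector_literal(embedding):
    """Format a list of floats as pgvector's text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


def arm_params(workspace_id, rewritten_query, embedding, limit=30):
    """Both arms share the tenant scope and the candidate-pool size."""
    shared = {"workspace_id": workspace_id, "limit": limit}
    vector_params = {**shared, "embedding": to_vector_literal(embedding)}
    keyword_params = {**shared, "query": rewritten_query}
    return vector_params, keyword_params


vector_params, keyword_params = arm_params(42, "export invoices csv", [0.25, 0.5])
```

Because workspace_id sits in both parameter sets and both WHERE clauses, neither query can be issued without a tenant scope.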
We rewrite the raw inbound message into a cleaner search query with a small LLM call before either arm runs. If that call fails, we fall back to the original text rather than failing the retrieval — the search should degrade, not disappear.
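A minimal sketch of that flow in Python, assuming the rewrite step, the embedder, and the two arms are async callables passed in by the caller. The function and stub names here are hypothetical; the stubs exist only so the example runs without a model or a database.

```python
import asyncio


async def retrieve(inbound_message, rewrite, embed, vector_arm, keyword_arm, limit=30):
    """Rewrite the query, then run both arms concurrently."""
    try:
        rewritten = await rewrite(inbound_message)
    except Exception:
        # Degrade, don't disappear: search with the raw text instead.
        rewritten = inbound_message
    embedding = await embed(rewritten)
    vector_hits, keyword_hits = await asyncio.gather(
        vector_arm(embedding, limit),
        keyword_arm(rewritten, limit),
    )
    return rewritten, vector_hits, keyword_hits


# Hypothetical stand-ins for the real services.
async def failing_rewrite(text):
    raise TimeoutError("rewrite model unavailable")


async def fake_embed(text):
    return [float(len(text)), 0.0]


async def fake_vector_arm(embedding, limit):
    return ["faq-12", "faq-7"][:limit]


async def fake_keyword_arm(query, limit):
    return ["faq-7"][:limit]
```

Calling asyncio.run(retrieve("where is my invoice??", failing_rewrite, fake_embed, fake_vector_arm, fake_keyword_arm)) still returns hits from both arms, with the raw message used as the query.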
Fusing two rankings: RRF
Now we have two ranked lists with scores that aren't comparable —
a cosine similarity and a ts_rank live on completely different scales.
Normalizing them against each other is fiddly and brittle. So we don't.
We throw the scores away and keep only the ranks, then combine with
Reciprocal Rank Fusion:
    function rrf(rankings, k = 60):
        score = {}                                   # id -> running total, missing ids start at 0
        for each ranked_list in rankings:
            for rank, hit in enumerate(ranked_list): # rank starts at 0
                score[hit.id] += 1 / (k + rank + 1)
        return ids sorted by score, descending

RRF's whole appeal is that it only cares about position. A chunk ranked
first in an arm contributes 1 / (k + 1); one ranked tenth contributes
1 / (k + 10). The constant k (60 is the conventional value) softens
the curve so the top few ranks don't completely dominate. A chunk that
shows up in both lists gets both contributions added together, which
is the property we're about to lean on hard.
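For readers who want something runnable, here is a small Python version of the same fusion with a worked example. The chunk IDs are made up, and the rankings are plain lists of IDs rather than hit objects.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over lists of ids, each ordered best-first."""
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, chunk_id in enumerate(ranked_ids):  # rank starts at 0
            scores[chunk_id] += 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ids = ["faq-12", "faq-7", "faq-3"]   # hypothetical vector-arm ranking
keyword_ids = ["faq-7", "faq-40"]           # hypothetical keyword-arm ranking
fused = rrf([vector_ids, keyword_ids])
```

Here faq-7 is second in one arm and first in the other, so it scores 1/62 + 1/61 and outranks faq-12, which is first in the vector arm but absent from the keyword arm and scores only 1/61.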
A subtle trap: don't let one source vote twice
Our knowledge base stores more than the verbatim FAQ text. Each source also gets "hypothetical" chunks — generated paraphrases and example questions that give the vector arm more surface area to match against. They're great for recall. They're also dangerous for fusion: if two hypotheticals derived from the same parent chunk both rank in an arm, naïve RRF treats them as two independent endorsements and stacks their scores — manufacturing confidence out of what is really a single source agreeing with itself.
So before fusing, every hit collapses to its parent and we keep only the first occurrence of each parent within an arm:
    function dedupe_within_arm(hits):
        seen = empty set
        kept = []
        for hit in hits:                         # hits arrive best-ranked first
            hit.id = hit.parent_id or hit.id     # paraphrase -> its source
            if hit.id not in seen:               # keep only the best-ranked
                seen.add(hit.id)                 # occurrence per arm, so one
                kept.append(hit)                 # source can't vote twice
        return kept

It's a small function guarding a real bug: without it, the confidence score we're about to compute would be systematically too high for sources that happen to have many paraphrases.
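The same idea as a runnable Python sketch. The Hit shape (an id plus an optional parent_id) is an assumption for illustration, and this version returns parent IDs rather than mutating the hits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    id: str
    parent_id: Optional[str] = None   # set on generated paraphrase chunks


def dedupe_within_arm(hits):
    """Collapse each hit to its source and keep the best-ranked occurrence."""
    seen = set()
    kept = []
    for hit in hits:                  # hits are ordered best-first
        source_id = hit.parent_id or hit.id
        if source_id not in seen:
            seen.add(source_id)
            kept.append(source_id)
    return kept


arm = [
    Hit("para-1", parent_id="faq-7"),   # paraphrase of faq-7, ranked first
    Hit("faq-12"),
    Hit("para-2", parent_id="faq-7"),   # second paraphrase of the same source
    Hit("faq-7"),                       # the verbatim source itself
]
deduped = dedupe_within_arm(arm)
```

Three of the four hits trace back to faq-7, but after collapsing, that source holds a single position in the arm, the best one it earned.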
Turning rank into a confidence the product can use
Fusion gives us an ordered list and a top RRF score. But "RRF score 0.0312" means nothing to the drafting layer. What it needs is a single number it can reason about: how much should I trust the top result?
Two signals feed that number. The first is the raw RRF score of the top hit — higher is better. The second is binary but powerful: did the top hit rank in both arms, or just one?
    fused = rrf([vector_hits, keyword_hits])
    top = fused.first                        # null if both arms were empty
    in_both = top != null
          and top.id in ids(vector_hits)
          and top.id in ids(keyword_hits)
    confidence = calibrate(top.rrf_score, in_both) if top else 0

The in_both flag is the consensus signal. When the same chunk is both the
nearest neighbor by meaning and a strong match by keyword, two
methods that fail in different ways agreed — that's the strongest
evidence we have that we actually found the answer rather than the
nearest-looking thing.
The two signals get mapped to a 0–1 confidence with a plain logistic function:
    # A logistic curve over two inputs: the top RRF score, and the consensus
    # bit. A, B, C are seed coefficients, refit offline, not tuned in prod.
    function calibrate(rrf_top, in_both):
        z = A * rrf_top + B * (1 if in_both else 0) + C
        return sigmoid(z)                    # squashes z into a 0–1 confidence

The coefficients are seed values, not laws of nature. They're chosen
so the curve lands roughly where intuition says it should — a top hit
that's strong and present in both arms reads as high confidence; a
weak, single-arm hit reads as low; nothing at all reads as near-zero —
and they're meant to be refit by an offline eval harness as the knowledge
base grows, not hand-tuned in production. Pulling the consensus bonus (B)
out as its own term is deliberate: it keeps "two methods agreed" legible
as a thing we can dial up or down on its own, instead of burying it
inside a blended score.
The resulting confidence is an honest relative signal: it ranks "probably right" above "probably guessing." It is not a calibrated probability of correctness in the statistical sense, and we don't present it as one. Treating a sigmoid output as truth is its own kind of confident wrongness.
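A minimal sketch of the consensus check and the logistic mapping together. The coefficient values below are invented for this example and are not the seed values used in the product; they are scaled so that RRF scores with k = 60 (which top out around 0.033) land in a readable part of the curve.

```python
import math

# Hypothetical coefficients, chosen only to make the example readable.
A, B, C = 120.0, 2.0, -3.0


def calibrate(rrf_top, in_both):
    """Logistic curve over the top RRF score and the consensus bit."""
    z = A * rrf_top + B * (1.0 if in_both else 0.0) + C
    return 1.0 / (1.0 + math.exp(-z))


def confidence_for(fused, vector_ids, keyword_ids):
    """fused is a best-first list of (id, rrf_score); empty means no hits."""
    if not fused:
        return 0.0
    top_id, top_score = fused[0]
    in_both = top_id in vector_ids and top_id in keyword_ids
    return calibrate(top_score, in_both)


# Top hit ranked first in both arms: strong score plus the consensus bonus.
strong = confidence_for([("faq-7", 2 / 61)], ["faq-7"], ["faq-7"])
# Top hit found deep in a single arm: weak score, no bonus.
weak = confidence_for([("faq-9", 1 / 90)], ["faq-9"], [])
nothing = confidence_for([], [], [])
```

With these made-up coefficients the first case lands well above the second, and an empty retrieval is pinned to zero instead of being passed through the curve.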
The payoff: a draft that can say "I'm not sure"
All of this exists to make uncertainty legible to the person hitting send. The confidence score isn't thrown away after retrieval — it's stored on the draft and mapped to a triage tier the inbox can surface:
    function triage(confidence, verification):
        if verification == FAILED: return "verification_failed"
        if confidence is null:     return "no_match"
        if confidence >= 0.75:     return "confident"
        if confidence >= 0.45:     return "uncertain"
        return "no_match"

So a draft doesn't just arrive; it arrives labeled, and the human knows up front whether the system is vouching for it or hedging.
The actual answer-or-ask fork is deliberately simpler and stricter than "trust the confidence number." Two guardrails do the real work:
1. No hits, no guessing. When retrieval comes back empty, the draft is generated from a clarify-only prompt that asks a short question instead of inventing an answer from nothing:

    if retrieval.hits is empty:
        draft = run(clarify_only_prompt)     # ask a question, don't guess
    else:
        draft = run(answer_prompt, knowledge = retrieval.hits)
        # every citation the model makes is checked back against the source text
        verification = verify_citations(draft.citations, retrieval.hits)

2. Citations get verified. When there are hits, the model is
instructed that if none of the chunks actually apply it should cite
nothing and write a clarifying question rather than stretch a bad match
into an answer. And whatever it does cite is checked back against the real
chunk text — if a citation doesn't hold up, verification fails and the
triage tier flips to verification_failed, flagging the draft for a
closer human look.
So confidence is advice, not a switch: it tells the human how strongly the retrieval layer vouches for what it found, while the zero-hit fallback and citation verification are the hard guarantees that turn a weak match into a question or a flag instead of a fabrication.
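One way the verification check and the triage mapping could fit together, sketched in Python. The Citation shape (a chunk ID plus a quoted span) and the whitespace- and case-insensitive substring test are assumptions; the description above says only that citations are checked against the real chunk text. The tier names and the 0.75 and 0.45 thresholds are taken from the triage function shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    chunk_id: str
    quote: str   # the span the model claims to be citing (assumed shape)


def _normalize(text):
    return " ".join(text.lower().split())


def verify_citations(citations, chunks_by_id):
    """True only if every cited quote appears in the chunk it points to.

    An empty citation list passes: the model cited nothing, which is the
    expected shape of a clarifying question.
    """
    for citation in citations:
        source = chunks_by_id.get(citation.chunk_id)
        if source is None or _normalize(citation.quote) not in _normalize(source):
            return False
    return True


def triage(confidence, verification_ok):
    if not verification_ok:
        return "verification_failed"
    if confidence is None:
        return "no_match"
    if confidence >= 0.75:
        return "confident"
    if confidence >= 0.45:
        return "uncertain"
    return "no_match"


chunks = {"faq-7": "Go to Settings > Billing to download invoices."}
good = [Citation("faq-7", "settings > billing")]
bad = [Citation("faq-7", "Settings > Payments")]
```

A draft with high retrieval confidence but a citation that fails the check still lands in verification_failed, which is the sense in which confidence is advice and verification is the guarantee.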
That's the whole point of measuring any of this. A RAG system that can't tell when it's guessing has no choice but to always answer — and "always answer" is just "sometimes fabricate" wearing a nicer outfit. By making "I don't have a good match" a first-class, detectable state — surfaced to the human, gated at zero hits, and checked at the citation level — the worst output the system tends to produce is an honest question rather than a confident fiction. For a support tool whose entire promise is that a human still hits send, that's the right floor to design to.
What we deliberately kept simple
A few things we didn't build, on purpose:
- No reranker model. RRF over two arms plus the consensus signal gets us a useful ordering and a useful confidence without a second model in the hot path. A cross-encoder reranker is an obvious future lever; it wasn't worth the latency and complexity yet.
- No bespoke vector store. pgvector in the same Postgres that holds everything else means one backup story, one connection pool, one thing to operate.
- Hand-seeded calibration coefficients. They're good enough to make the high/low distinction the product depends on, and the eval harness is where they get sharper over time.
None of those are permanent decisions. They're the smallest thing that made the confidence signal trustworthy enough to act on — which, for a feature whose job is to know when it doesn't know, was the only bar that mattered.
Where this is heading
The current design is deliberately the simplest thing that makes the confidence signal trustworthy. A few directions we expect to take it:
- A reranker in the slow path. A cross-encoder that reads the query and each candidate together sharpens the top-of-list ordering that RRF only approximates — most valuable on the "uncertain" tier, where the right answer is often present but not first. The cost is latency and another model to operate, which is why it's a later lever, not a launch one.
- Learned fusion instead of a fixed constant. RRF's k = 60 and the equal weighting of the two arms are conventions, not truths. With enough labeled outcomes, the fusion weights (and the calibration coefficients) can be fit to our data rather than borrowed, turning today's seed values into measured ones.
- Better embeddings, and more of them. Longer-context and domain-tuned embedding models keep improving; a multilingual model would let the same pipeline serve non-English inboxes without a second design.
- Agentic, multi-hop retrieval. Some questions need two lookups — resolve an account, then answer about it. A retrieve–reason–retrieve loop sits naturally above this layer, with the same confidence gate deciding when to stop.
The broader bet is that frontier models keep getting better at not hallucinating when grounded — but "better" isn't "never," and a support tool can't tell a 2%-wrong answer from a 0%-wrong one at a glance. The grounding, the consensus signal, and the calibrated "I'm not sure" don't get less useful as the models improve; they're what let you trust the model more as it earns it.
Further reading
- Cormack, Clarke & Buettcher, Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods (SIGIR 2009): the original RRF paper, and where the k = 60 convention comes from.
- pgvector: the Postgres extension providing the <=> distance operator the vector arm relies on.
- PostgreSQL manual, Full Text Search: tsvector, websearch_to_tsquery, and ts_rank for the keyword arm.
- Azure AI Search, Hybrid search scoring (RRF): a production search engine fusing vector and keyword results with the same algorithm.
- Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020) — the paper that named RAG.