Hybrid RAG that knows when it doesn't know

Jun 26, 2026 · 11 min read

#ai
#rag
#postgres

Acorn Reply drafts support replies for you. The draft is only ever a draft — a human edits it and hits send — but a bad draft still costs something. It costs the few seconds it takes to notice it's wrong, delete it, and start over. Do that often enough and the AI is a tax, not a tool.

The failure mode we cared about most isn't a draft that's a little off. It's a draft that is confidently wrong: a fluent, well-formatted answer to a question the knowledge base can't actually answer. A human skimming a plausible paragraph is exactly the wrong reader to catch a fabricated refund window or a made-up setting name.

So the retrieval layer that feeds our drafts is built around one idea: it should be able to tell the difference between "I found the answer" and "I found the nearest thing, which isn't the same." Here's how it works.

Why a single vector search isn't enough

The default RAG recipe is: embed the question, embed your documents, and pull back the nearest neighbors by cosine similarity. It works remarkably well — until it doesn't, and the way it fails is the problem.

Vector search always returns something. Ask about a feature that doesn't exist and you'll still get the top-k closest chunks, ranked, scored, and looking for all the world like a real match. Cosine distance tells you which chunk is nearest. It does not tell you whether nearest is near enough to be an answer. Embeddings are also fuzzy in a specific way: they're great at "these mean the same thing" and weak at exact tokens — order IDs, error codes, a product name spelled a particular way — which is precisely the vocabulary support questions hinge on.

Keyword search has the opposite personality. Postgres full-text search (websearch_to_tsquery) is unforgiving about meaning but excellent at literal tokens, and when it matches nothing it's honest about it: you get zero rows, not a confidently-ranked wrong guess.

Neither arm is trustworthy alone. But their mistakes are uncorrelated — and that's the opening.

Two arms, run in parallel

Every retrieval runs both searches against the same workspace's chunks at once. They share the (LLM-rewritten) query but otherwise see the world differently:

Two retrieval arms, run in parallel (pseudocode)

rewritten  = llm_rewrite(inbound_message)      # fall back to raw text on failure
embedding  = embed(rewritten)
 
vector_hits, keyword_hits = in_parallel(
    # arm 1 — nearest by meaning: pgvector cosine distance ( <=> )
    nearest_by_cosine(embedding, workspace, published_only, limit = 30),
 
    # arm 2 — nearest by token: Postgres full-text ( websearch_to_tsquery )
    full_text_match(rewritten, workspace, published_only, limit = 30),
)

A few things worth pointing out:

It's all Postgres. The <=> operator is pgvector's cosine distance; tsv @@ websearch_to_tsquery(...) is the full-text arm. One database, two indexes, no separate vector service to operate. For a product built by a small team, "the search engine is just Postgres" is a feature, not a compromise.
Tenancy is in the WHERE clause. Every row is scoped by workspace_id. Retrieval can't leak across tenants because the query can't see across tenants.
Both arms pull a generous candidate pool (30 rows by default), not the final count — so fusion has something to work with before we trim to the handful that reaches the model.
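As a rough sketch, the two arms could look something like the queries below. The schema is an assumption made up for illustration: a hypothetical `kb_chunks` table with `workspace_id`, `status`, an `embedding` pgvector column and a `tsv` tsvector column, plus the `'english'` text-search configuration. Only the `<=>` operator, `websearch_to_tsquery`, and `ts_rank` are real Postgres/pgvector features. The queries are held as Python strings with psycopg-style named placeholders.

```python
# Hypothetical schema (names are assumptions, not the real tables):
#   kb_chunks(id, parent_id, workspace_id, status, embedding vector, tsv tsvector)

CANDIDATE_LIMIT = 30  # generous pool per arm; fusion trims it later

# Arm 1: nearest by meaning. `<=>` is pgvector's cosine-distance operator,
# so ascending order puts the closest chunks first.
VECTOR_ARM_SQL = """
SELECT id, parent_id
FROM kb_chunks
WHERE workspace_id = %(workspace_id)s      -- tenancy lives in the WHERE clause
  AND status = 'published'
ORDER BY embedding <=> %(embedding)s::vector
LIMIT %(limit)s
"""

# Arm 2: nearest by token. A query that matches nothing returns zero rows.
KEYWORD_ARM_SQL = """
SELECT id, parent_id
FROM kb_chunks
WHERE workspace_id = %(workspace_id)s
  AND status = 'published'
  AND tsv @@ websearch_to_tsquery('english', %(query)s)
ORDER BY ts_rank(tsv, websearch_to_tsquery('english', %(query)s)) DESC
LIMIT %(limit)s
"""
```

Both queries return only ids in rank order, which is all the fusion step needs; the raw distances and `ts_rank` values never leave the database in this sketch.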

We rewrite the raw inbound message into a cleaner search query with a small LLM call before either arm runs. If that call fails, we fall back to the original text rather than failing the retrieval — the search should degrade, not disappear.
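A minimal Python sketch of that degrade-don't-disappear behavior, using `asyncio.gather` to run the arms concurrently. The three coroutines are stand-ins invented for this example (the rewrite stub always fails so the fallback path is visible); real versions would call the model and the database.

```python
import asyncio
from typing import Awaitable, Callable

async def failing_rewrite(message: str) -> str:
    """Stand-in for the LLM rewrite call; always fails in this sketch."""
    raise TimeoutError("rewrite model unavailable")

async def vector_arm(query: str, limit: int = 30) -> list[str]:
    """Stand-in for the pgvector search; returns chunk ids, best first."""
    return ["chunk-a", "chunk-b"][:limit]

async def keyword_arm(query: str, limit: int = 30) -> list[str]:
    """Stand-in for the full-text search; returns chunk ids, best first."""
    return ["chunk-b"][:limit]

async def retrieve(
    message: str,
    rewrite: Callable[[str], Awaitable[str]],
) -> tuple[str, list[str], list[str]]:
    # If the rewrite fails for any reason, search with the raw message.
    try:
        query = await rewrite(message)
    except Exception:
        query = message
    # Both arms see the same query and run concurrently.
    vector_hits, keyword_hits = await asyncio.gather(
        vector_arm(query), keyword_arm(query)
    )
    return query, vector_hits, keyword_hits

query, vector_hits, keyword_hits = asyncio.run(
    retrieve("Where is my refund??", failing_rewrite)
)
```

Here `query` ends up as the original message because the rewrite raised, and both arms still return their candidates.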

Fusing two rankings: RRF

Now we have two ranked lists with scores that aren't comparable — a cosine similarity and a ts_rank live on completely different scales. Normalizing them against each other is fiddly and brittle. So we don't. We throw the scores away and keep only the ranks, then combine with Reciprocal Rank Fusion:

Reciprocal Rank Fusion (pseudocode)

function rrf(rankings, k = 60):
    score = {}                                   # id -> running total
    for each ranked_list in rankings:
        for rank, hit in enumerate(ranked_list): # rank starts at 0
            score[hit.id] += 1 / (k + rank + 1)
    return ids sorted by score, descending

RRF's whole appeal is that it only cares about position. A chunk ranked first in an arm contributes 1 / (k + 1); one ranked tenth contributes 1 / (k + 10). The constant k (60 is the conventional value) softens the curve so the top few ranks don't completely dominate. A chunk that shows up in both lists gets both contributions added together, which is the property we're about to lean on hard.
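To make the arithmetic concrete, here is a small runnable version of the same function in Python. The chunk ids and the two example lists are invented for illustration.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists; returns (id, score) pairs, best first."""
    scores: dict[str, float] = {}
    for ranked_ids in rankings:
        for rank, chunk_id in enumerate(ranked_ids):  # rank starts at 0
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_hits = ["a", "b", "c"]   # best first
keyword_hits = ["b", "d"]

fused = rrf([vector_hits, keyword_hits])
fused_ids = [chunk_id for chunk_id, _ in fused]
# "b" is second by meaning and first by token, so it collects
# 1/62 + 1/61 and outranks "a", which only has 1/61 from one arm.
```

The fused order is `b, a, d, c`: the only chunk present in both lists wins even though it was not first in the vector arm.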

A subtle trap: don't let one source vote twice

Our knowledge base stores more than the verbatim FAQ text. Each source also gets "hypothetical" chunks — generated paraphrases and example questions that give the vector arm more surface area to match against. They're great for recall. They're also dangerous for fusion: if two hypotheticals derived from the same parent chunk both rank in an arm, naïve RRF treats them as two independent endorsements and stacks their scores — manufacturing confidence out of what is really a single source agreeing with itself.

So before fusing, every hit collapses to its parent and we keep only the first occurrence of each parent within an arm:

Collapse paraphrases before fusing (pseudocode)

function dedupe_within_arm(hits):
    seen = set()                                   # parent ids already kept
    kept = []
    for hit in hits:                               # hits arrive best-ranked first
        hit.id = hit.parent_id or hit.id          # paraphrase -> its source
        if hit.id not in seen:                     # keep only the best-ranked
            seen.add(hit.id)                       # occurrence per arm, so one
            kept.append(hit)                       # source can't vote twice
    return kept

It's a small function guarding a real bug: without it, the confidence score we're about to compute would be systematically too high for sources that happen to have many paraphrases.
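A worked example of the bug, as a Python sketch with made-up chunk ids. Two paraphrases of the same source sit at the top of the vector arm, while a different chunk appears in both arms. Without the collapse, the self-agreeing source wins; with it, the chunk both arms agree on does.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranked_ids in rankings:
        for rank, chunk_id in enumerate(ranked_ids):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return scores

def to_parents(hits: list[dict]) -> list[str]:
    """Map every hit to its parent id (or its own id if it has no parent)."""
    return [hit["parent_id"] or hit["id"] for hit in hits]

def dedupe(ids: list[str]) -> list[str]:
    """Keep only the first (best-ranked) occurrence of each id."""
    return list(dict.fromkeys(ids))

vector_hits = [
    {"id": "hyp-1", "parent_id": "faq-7"},   # paraphrase of faq-7
    {"id": "hyp-2", "parent_id": "faq-7"},   # another paraphrase of faq-7
    {"id": "faq-3", "parent_id": None},
]
keyword_hits = [{"id": "faq-3", "parent_id": None}]

vec, kw = to_parents(vector_hits), to_parents(keyword_hits)

naive = rrf_scores([vec, kw])                   # faq-7 votes twice
fixed = rrf_scores([dedupe(vec), dedupe(kw)])   # one vote per source per arm

naive_top = max(naive, key=naive.get)
fixed_top = max(fixed, key=fixed.get)
```

In the naive run `faq-7` scores 1/61 + 1/62 from a single arm and narrowly beats `faq-3` (1/63 + 1/61), which is the only chunk both arms returned. After the collapse, `faq-3` is the top hit.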

Turning rank into a confidence the product can use

Fusion gives us an ordered list and a top RRF score. But "RRF score 0.0312" means nothing to the drafting layer. What it needs is a single number it can reason about: how much should I trust the top result?

Two signals feed that number. The first is the raw RRF score of the top hit — higher is better. The second is binary but powerful: did the top hit rank in both arms, or just one?

Top hit, consensus, confidence (pseudocode)

fused      = rrf([vector_hits, keyword_hits])
top        = fused.first                          # null if both arms were empty
in_both    = top != null
             and top.id in vector_hits
             and top.id in keyword_hits
confidence = calibrate(top.rrf_score, in_both) if top else 0

in_both is the consensus signal. When the same chunk is both the nearest neighbor by meaning and a strong match by keyword, two methods that fail in different ways have agreed. That's the strongest evidence we have that we actually found the answer rather than the nearest-looking thing.

The two signals get mapped to a 0–1 confidence with a plain logistic function:

Confidence calibration (pseudocode)

# A logistic curve over two inputs: the top RRF score, and the consensus
# bit. A, B, C are seed coefficients — refit offline, not tuned in prod.
function calibrate(rrf_top, in_both):
    z = A * rrf_top  +  B * (1 if in_both else 0)  +  C
    return sigmoid(z)              # squashes z into a 0–1 confidence

The coefficients are seed values, not laws of nature. They're chosen so the curve lands roughly where intuition says it should — a top hit that's strong and present in both arms reads as high confidence; a weak, single-arm hit reads as low; nothing at all reads as near-zero — and they're meant to be refit by an offline eval harness as the knowledge base grows, not hand-tuned in production. Pulling the consensus bonus (B) out as its own term is deliberate: it keeps "two methods agreed" legible as a thing we can dial up or down on its own, instead of burying it inside a blended score.
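A runnable Python sketch of the curve. The coefficient values below are placeholders invented for this example, not the seed values discussed above; they are only sized so that an RRF score (which tops out at 2/61, roughly 0.033, with two arms and k = 60) moves the output across a useful range.

```python
import math
from typing import Optional

# Placeholder coefficients, made up for illustration only.
A = 150.0   # weight on the top RRF score
B = 1.5     # consensus bonus when the top hit ranked in both arms
C = -3.5    # bias: with no evidence, lean toward low confidence

def calibrate(rrf_top: float, in_both: bool) -> float:
    z = A * rrf_top + B * (1.0 if in_both else 0.0) + C
    return 1.0 / (1.0 + math.exp(-z))   # logistic sigmoid, output in (0, 1)

def confidence(rrf_top: Optional[float], in_both: bool) -> float:
    """Zero when retrieval found nothing; otherwise the calibrated score."""
    if rrf_top is None:
        return 0.0
    return calibrate(rrf_top, in_both)

strong = confidence(2 / 61, True)    # ranked first in both arms
weak = confidence(1 / 80, False)     # 20th place in a single arm
empty = confidence(None, False)      # both arms came back empty
```

With these placeholder numbers, `strong` lands well above `weak`, and flipping `in_both` on the same RRF score always raises the result, which is the dial the separate `B` term is meant to expose.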

The confidence score is an honest relative signal: it ranks "probably right" above "probably guessing." It is not a calibrated probability of correctness in the statistical sense, and we don't present it as one. Treating a sigmoid output as truth is its own kind of confident wrongness.

The payoff: a draft that can say "I'm not sure"

All of this exists to make uncertainty legible to the person hitting send. The confidence score isn't thrown away after retrieval — it's stored on the draft and mapped to a triage tier the inbox can surface:

Triage tier from confidence (pseudocode)

function triage(confidence, verification):
    if verification == FAILED:  return "verification_failed"
    if confidence is null:      return "no_match"
    if confidence >= 0.75:      return "confident"
    if confidence >= 0.45:      return "uncertain"
    return "no_match"

So a draft doesn't just arrive — it arrives labeled, and the human knows up front whether the system is vouching for it or hedging.
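The same mapping as runnable Python, mainly to pin down the boundary behavior. The `Verification` enum and its member names are assumptions for this sketch; the thresholds and tier names are the ones from the pseudocode above.

```python
from enum import Enum
from typing import Optional

class Verification(Enum):
    PASSED = "passed"
    FAILED = "failed"
    SKIPPED = "skipped"   # assumed state for drafts with nothing to verify

def triage(confidence: Optional[float], verification: Verification) -> str:
    # A failed citation check overrides any confidence value.
    if verification is Verification.FAILED:
        return "verification_failed"
    if confidence is None:
        return "no_match"
    if confidence >= 0.75:
        return "confident"
    if confidence >= 0.45:
        return "uncertain"
    return "no_match"

tiers = [
    triage(0.75, Verification.PASSED),    # boundary is inclusive
    triage(0.74, Verification.PASSED),
    triage(0.44, Verification.PASSED),
    triage(0.99, Verification.FAILED),    # high confidence does not rescue it
    triage(None, Verification.SKIPPED),
]
```

Note the ordering of the checks: verification is tested first, so a draft with a very high retrieval confidence still lands in `verification_failed` if a citation doesn't hold up.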

The actual answer-or-ask fork is deliberately simpler and stricter than "trust the confidence number." Two guardrails do the real work:

1. No hits, no guessing. When retrieval comes back empty, the draft is generated from a clarify-only prompt that asks a short question instead of inventing an answer from nothing:

Draft routing (pseudocode)

if retrieval.hits is empty:
    draft = run(clarify_only_prompt)         # ask a question, don't guess
else:
    draft = run(answer_prompt, knowledge = retrieval.hits)
    # every citation the model makes is checked back against the source text
    verification = verify_citations(draft.citations, retrieval.hits)

2. Citations get verified. When there are hits, the model is instructed that if none of the chunks actually apply it should cite nothing and write a clarifying question rather than stretch a bad match into an answer. And whatever it does cite is checked back against the real chunk text — if a citation doesn't hold up, verification fails and the triage tier flips to verification_failed, flagging the draft for a closer human look.
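One simple way such a check could work, sketched in Python. This is an assumption about the mechanism, not a description of it: it supposes each citation carries a chunk id and a quoted span, and passes only if every quote appears in the cited chunk after normalizing case and whitespace. The `Citation` type and the sample data are hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class Citation:
    chunk_id: str
    quote: str

def _normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_citations(citations: list[Citation], chunks: dict[str, str]) -> bool:
    """True only if every cited quote is found in the chunk it points at."""
    for citation in citations:
        source = chunks.get(citation.chunk_id)
        quote = _normalize(citation.quote)
        if source is None or not quote or quote not in _normalize(source):
            return False
    return True   # also true for an empty list: citing nothing is allowed

chunks = {"faq-3": "Refunds are available within 30 days\nof purchase."}

ok = verify_citations([Citation("faq-3", "within 30 days of purchase")], chunks)
fabricated = verify_citations([Citation("faq-3", "within 90 days")], chunks)
unknown_chunk = verify_citations([Citation("faq-9", "within 30 days")], chunks)
no_citations = verify_citations([], chunks)
```

An exact-substring check like this is strict by design: it will reject a loosely paraphrased quote, which errs toward flagging a draft rather than vouching for it.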

So confidence is advice, not a switch: it tells the human how strongly the retrieval layer vouches for what it found, while the zero-hit fallback and citation verification are the hard guarantees that turn a weak match into a question or a flag instead of a fabrication.

That's the whole point of measuring any of this. A RAG system that can't tell when it's guessing has no choice but to always answer — and "always answer" is just "sometimes fabricate" wearing a nicer outfit. By making "I don't have a good match" a first-class, detectable state — surfaced to the human, gated at zero hits, and checked at the citation level — the worst output the system tends to produce is an honest question rather than a confident fiction. For a support tool whose entire promise is that a human still hits send, that's the right floor to design to.

What we deliberately kept simple

A few things we didn't build, on purpose:

No reranker model. RRF over two arms plus the consensus signal gets us a useful ordering and a useful confidence without a second model in the hot path. A cross-encoder reranker is an obvious future lever; it wasn't worth the latency and complexity yet.
No bespoke vector store. pgvector in the same Postgres that holds everything else means one backup story, one connection pool, one thing to operate.
No learned calibration. The coefficients are hand-seeded; they're good enough to make the high/low distinction the product depends on, and the eval harness is where they get sharper over time.

None of those are permanent decisions. They're the smallest thing that made the confidence signal trustworthy enough to act on — which, for a feature whose job is to know when it doesn't know, was the only bar that mattered.

Where this is heading

The current design is deliberately the simplest thing that makes the confidence signal trustworthy. A few directions we expect to take it:

A reranker in the slow path. A cross-encoder that reads the query and each candidate together sharpens the top-of-list ordering that RRF only approximates — most valuable on the "uncertain" tier, where the right answer is often present but not first. The cost is latency and another model to operate, which is why it's a later lever, not a launch one.
Learned fusion instead of a fixed constant. RRF's k = 60 and the equal weighting of the two arms are conventions, not truths. With enough labeled outcomes, the fusion weights — and the calibration coefficients — can be fit to our data rather than borrowed, turning today's seed values into measured ones.
Better embeddings, and more of them. Longer-context and domain-tuned embedding models keep improving; a multilingual model would let the same pipeline serve non-English inboxes without a second design.
Agentic, multi-hop retrieval. Some questions need two lookups — resolve an account, then answer about it. A retrieve–reason–retrieve loop sits naturally above this layer, with the same confidence gate deciding when to stop.
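For the learned-fusion direction in the list above, one possible starting point is a weighted variant of RRF where each arm gets its own multiplier. The sketch below is speculative: the weights are arbitrary numbers chosen to show the effect, and nothing here is fit to data.

```python
def weighted_rrf(
    rankings: list[list[str]],
    weights: list[float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """RRF with a per-arm weight; equal weights reduce to plain RRF ordering."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[str, float] = {}
    for ranked_ids, weight in zip(rankings, weights):
        for rank, chunk_id in enumerate(ranked_ids):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_hits = ["a", "b"]
keyword_hits = ["b", "a"]

equal = weighted_rrf([vector_hits, keyword_hits], [1.0, 1.0])
keyword_heavy = weighted_rrf([vector_hits, keyword_hits], [1.0, 2.0])
```

With equal weights the two chunks tie, since each is first in one arm and second in the other. Doubling the keyword arm's weight breaks the tie in favor of `b`, the keyword arm's top hit. Fitting those weights from labeled outcomes is the part that would need the eval harness.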

The broader bet is that frontier models keep getting better at not hallucinating when grounded — but "better" isn't "never," and a support tool can't tell a 2%-wrong answer from a 0%-wrong one at a glance. The grounding, the consensus signal, and the calibrated "I'm not sure" don't get less useful as the models improve; they're what let you trust the model more as it earns it.