
How Clio cites past visits — RAG in clinical AI

Hallucinations are a non-starter in healthcare. Here's the architecture we use to ground every Clio answer in the patient's own record — and the trade-offs we made along the way.

The shortest version of this post: Clio doesn't generate clinical answers from "the model's general knowledge of medicine." Every answer it produces is backed by a retrieval over your clinic's own records, with the retrieved evidence cited inline so you can verify the source in one tap.

This isn't a novel architecture — retrieval-augmented generation has been the standard answer to LLM hallucination for two years. The interesting parts are the implementation choices: what we retrieve, how we chunk it, how we keep the index per-clinic, what we do when retrieval fails, and what citations actually look like to a doctor at the point of care.

Why not fine-tune?

The first question we get from people who haven't built this before: why not fine-tune a small model on each clinic's data, instead of doing retrieval at inference time? Three reasons.

Recency. A patient walks into the clinic this morning with a new lab. If you fine-tune in nightly batches, the model is already up to 24 hours behind. RAG retrieves whatever's in the index right now — including the lab uploaded ten minutes ago.

Auditability. Fine-tuning bakes the data into the weights. There's no way to later say "show me which patient visit informed this answer." With RAG, the retrieved chunks are the answer's source documents — the citation is literally a row in the index.

Per-clinic isolation. Fine-tuning would require either training one model per clinic (expensive, doesn't scale) or training one model on pooled data (a privacy nightmare and probably illegal under DPDP / HIPAA / GDPR). RAG's index is namespaced per workspace at the database level — the same model serves every clinic without ever mixing their data.

What goes into the index

Five things, indexed at write time (a sketch of the chunk shape follows the list):

  • Visits. The structured SOAP note, the chief complaint, the assessment, the plan. Stored as one chunk per section so retrieval can target the relevant part.
  • Prescriptions. Drug, strength, frequency, duration, indication, doctor's notes on why they prescribed it.
  • Lab reports. Both structured (the actual values) and the doctor's interpretive note when one exists.
  • Problem list and allergies. Promoted to a permanent first-class chunk because they're frequent retrieval targets.
  • Vitals over time. Bucketed into windows (last 30 days, last 90 days, last year) so a "blood pressure trend" question can retrieve the right summary directly without us having to compute aggregates at query time.
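
Concretely, each of those ends up as a chunk shaped roughly like this. The field names are illustrative, simplified from the real schema:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class RecordChunk:
    workspace_id: str       # clinic namespace; every retrieval filters on this first
    patient_id: str         # retrieval never crosses this boundary
    source_type: str        # "visit_section" | "prescription" | "lab" | "problems" | "vitals_window"
    source_id: str          # row ID of the underlying record; this is what citations point at
    section: str | None     # e.g. "assessment" or "plan" for visit chunks
    occurred_at: datetime   # when the visit / lab / Rx happened
    text: str               # the text that gets embedded
    embedding: list[float]  # the sentence-transformer vector, stored via pgvector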

Each chunk is embedded with a multilingual sentence-transformer model (we use a fine-tuned variant of all-mpnet-base-v2 with English + Hindi pairs in the contrastive set). The vectors live in pgvector — a Postgres extension — alongside the source data, so retrieval is a SQL query rather than a network hop to an external vector store.

Why pgvector and not a dedicated vector DB?

One database is one less thing to keep in sync. Patient writes hit Postgres anyway; the embedding update fires in the same transaction. A separate vector store means dual-write headaches the day a write succeeds in one and fails in the other.
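
A minimal sketch of that write path, assuming psycopg 3, sentence-transformers, and the pgvector Python adapter. It takes the RecordChunk from above; the table name is made up for illustration:

import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# Stand-in for our fine-tuned multilingual variant of all-mpnet-base-v2.
embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def index_chunk(conn: psycopg.Connection, chunk: RecordChunk) -> None:
    register_vector(conn)  # lets psycopg send vectors to the pgvector column
    vec = embedder.encode(chunk.text, normalize_embeddings=True)
    with conn.transaction():
        # In production this INSERT runs in the same transaction as the write
        # to the source visit / Rx / lab table, so there is no dual-write drift.
        conn.execute(
            """INSERT INTO record_chunks
                   (workspace_id, patient_id, source_type, source_id,
                    section, occurred_at, text, embedding)
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
            (chunk.workspace_id, chunk.patient_id, chunk.source_type,
             chunk.source_id, chunk.section, chunk.occurred_at,
             chunk.text, vec),
        )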

The retrieval pipeline

When a doctor types a question into Clio (or when Clio is invoked silently from inside a screen), the pipeline runs (a condensed sketch follows the numbered steps):

  1. Scope. The query is rewritten with the active patient and active workspace as filters. Retrieval cannot leak across patients or across workspaces — the SQL WHERE clause enforces it before similarity search runs.
  2. Embed. The query is embedded with the same model used at index time.
  3. Recall. Top-k cosine-similarity matches from pgvector. We default to k = 24 for the recall stage — we want to over-fetch and let the reranker cull.
  4. Rerank. A small cross-encoder reranks the 24 candidates against the original query. This step is the single biggest precision win — embedding similarity alone misses negations and clinical specificity.
  5. Compose. The top 3 reranked chunks are injected into the prompt with their record IDs as anchor tags. The model is instructed to cite the anchor tags inline using a [1][2][3] notation.
  6. Render. The frontend parses the citation tags, looks up the record IDs, and renders clickable chips that link to the source visit / Rx / lab.
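
Condensed, steps 1 through 5 look roughly like this. It reuses the record_chunks table and embedder from the write-path sketch above; the cross-encoder checkpoint named here is a generic placeholder, not necessarily the one we run:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder model name

def retrieve(conn, query: str, workspace_id: str, patient_id: str,
             k: int = 24, top_n: int = 3) -> list[dict]:
    # 1. Scope + 2. Embed: the WHERE clause is applied before similarity
    # ordering, so nothing outside this patient and workspace is a candidate.
    qvec = embedder.encode(query, normalize_embeddings=True)
    # 3. Recall: over-fetch the k nearest neighbours by cosine distance (<=>).
    rows = conn.execute(
        """SELECT source_id, source_type, occurred_at, text
           FROM record_chunks
           WHERE workspace_id = %s AND patient_id = %s
           ORDER BY embedding <=> %s
           LIMIT %s""",
        (workspace_id, patient_id, qvec, k),
    ).fetchall()
    # 4. Rerank: score every candidate against the raw query with a cross-encoder.
    scores = reranker.predict([(query, text) for _, _, _, text in rows])
    ranked = sorted(zip(scores, rows), key=lambda pair: pair[0], reverse=True)
    # 5. Compose: the top chunks become numbered evidence blocks; the record ID
    # is the anchor the model is told to cite as [1], [2], [3].
    return [
        {"anchor": i + 1, "source_id": row[0], "source_type": row[1],
         "date": row[2], "text": row[3]}
        for i, (_, row) in enumerate(ranked[:top_n])
    ]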

What the prompt looks like

Heavily simplified, the prompt has three sections: the system instruction, the retrieved evidence, and the user's question.

SYSTEM:
You are Clio, a clinical assistant. Answer ONLY using the
evidence below. Every claim must cite the evidence ID in
square brackets. If the evidence does not support the
question, say "I don't have evidence for that in this
patient's record."

EVIDENCE:
[1] Visit 2026-03-12 — K+ 3.4 (low). Telmisartan 40mg
 retains potassium; thiazide depletes it.
[2] Lab 2026-01-03 — eGFR 92 (normal renal function).
[3] Rx 2025-12-18 — Telmisartan 40mg OD added for HTN.

USER:
What should I add for this patient's persistent BP?
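
The EVIDENCE block is assembled mechanically from the retrieval output. Something like this, simplified, reusing the evidence dicts returned by the sketch above:

SYSTEM_INSTRUCTION = (
    "You are Clio, a clinical assistant. Answer ONLY using the evidence below. "
    "Every claim must cite the evidence ID in square brackets. If the evidence "
    "does not support the question, say \"I don't have evidence for that in "
    "this patient's record.\""
)

def build_prompt(evidence: list[dict], question: str) -> str:
    lines = [
        f"[{e['anchor']}] {e['source_type'].title()} {e['date']:%Y-%m-%d} — {e['text']}"
        for e in evidence
    ]
    return (
        "SYSTEM:\n" + SYSTEM_INSTRUCTION
        + "\n\nEVIDENCE:\n" + "\n".join(lines)
        + "\n\nUSER:\n" + question
    )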

The "answer ONLY using evidence" line is doing a lot of work. We don't trust it on its own — we also verify at the postprocessing step that every claim in the answer maps to a citation, and we strip claims that don't. That belt-and-braces approach is annoying engineering but cheap, and it catches the few hallucinations that slip past the prompt instruction.
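
The verification doesn't need to be clever to be useful. A crude sketch of the idea (the real check has to be smarter about sentence boundaries and claim extraction):

import re

CITATION = re.compile(r"\[(\d+)\]")

def strip_uncited(answer: str, valid_anchors: set[int]) -> str:
    # Keep only sentences whose citations exist and point at evidence that was
    # actually retrieved; everything else is dropped before the doctor sees it.
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        anchors = {int(n) for n in CITATION.findall(sentence)}
        if anchors and anchors <= valid_anchors:
            kept.append(sentence)
    return " ".join(kept)

Called with the anchors that were actually injected, e.g. strip_uncited(model_answer, {e["anchor"] for e in evidence}).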

What we measured

We ran a benchmark of 1,200 anonymised primary-care cases against three configurations: a vanilla GPT-4o-mini call, a RAG-enabled call without reranking, and the full pipeline above.

  • Vanilla GPT-4o-mini: 61% top-3 diagnosis hit rate
  • RAG without reranking: 78%
  • Full pipeline (RAG + rerank): 89%

More important than the diagnosis hit rate: the hallucination rate. We define a hallucination here as a clinical claim made by the model that has no anchor in the patient's actual record. Vanilla GPT-4o-mini hallucinated in 14% of responses (often inventing prior medications or labs that didn't exist). RAG with rerank: 0.4%. The verification step catches most of the remaining 0.4% before the doctor sees it.

What citations look like to the doctor

Below the answer, Clio shows three or four small chips: [1] Mar 12 Visit · K+ 3.4 trend · [2] Jan 03 Lab · eGFR 92 · [3] Dec 18 Rx · Telmisartan 40mg added. Tap any chip and the source record opens in a side panel — full context, not a snippet.

Two design choices we wrestled with:

Chips vs footnotes. Footnotes are more "academic-paper" but harder to scan in a clinical workflow. Chips with date + record-type + 4-word summary are scannable in under a second. We picked chips.

Inline numbers vs hover-only. Inline [1][2][3] tags inside the prose are visually noisy. Hover-only is invisible. We landed on inline tags but in a smaller, lighter colour so they read as "structure" rather than "content."

Where it falls down

RAG is great when the answer lives in the record. It's worse than a vanilla LLM call when the question is, say, "what's the differential for fever in a returning traveller?" — that's general clinical knowledge, not patient-specific. We handle this by routing such questions through a different prompt path that explicitly uses the model's parametric knowledge, with a banner that says "general clinical guidance, not patient-specific."
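
One simple way to make that routing call, sketched here purely as an illustration rather than a description of our router, is to treat weak retrieval as a signal that the question isn't about this patient:

GENERAL_BANNER = "General clinical guidance, not patient-specific."

def choose_path(evidence: list[dict], best_rerank_score: float,
                threshold: float = 0.35) -> tuple[str, str | None]:
    # Threshold is arbitrary here. Empty or low-scoring retrieval suggests the
    # answer isn't in this patient's record, so fall back to the model's
    # general knowledge and show the banner.
    if not evidence or best_rerank_score < threshold:
        return "general", GENERAL_BANNER
    return "grounded", None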

The other failure mode is when the patient's record is sparse. A new patient on their first visit has nothing for retrieval to grab. The system handles this gracefully — it tells the doctor "this is a new patient; I have no priors" — but it's still a worse experience than for a longstanding patient. The fix here is partly time (records grow) and partly cross-record retrieval scoped to the doctor's own panel of patients (still under that doctor's workspace, never cross-clinic).

The unsexy stuff

Most of the complexity in shipping this isn't in the retrieval. It's in the boring around-the-edges work:

  • Re-indexing when a doctor edits a past visit (we use a Postgres trigger plus a small worker queue — sketched after this list).
  • Handling soft deletes — a deleted visit shouldn't surface in retrieval but shouldn't break audit trails.
  • Backfilling embeddings when we change models — currently a quarterly operation.
  • Multilingual edge cases (a Hinglish dictation embedded next to an English Rx).
  • Rate-limiting Clio per workspace so a single power-user doesn't accidentally DoS the embedding service.
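
The re-indexing machinery behind the first bullet is small: a trigger that notes which visit changed, and a worker that re-embeds it out of band. Roughly (table and helper names are illustrative; rechunk_and_embed stands in for the write path shown earlier):

REINDEX_TRIGGER = """
CREATE OR REPLACE FUNCTION enqueue_reindex() RETURNS trigger AS $$
BEGIN
  -- Note which visit changed; a worker re-embeds its chunks out of band.
  INSERT INTO reindex_queue (source_type, source_id)
  VALUES ('visit', NEW.id)
  ON CONFLICT DO NOTHING;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER visits_reindex
AFTER UPDATE ON visits
FOR EACH ROW EXECUTE FUNCTION enqueue_reindex();
"""

def drain_reindex_queue(conn) -> None:
    # Pop everything queued by the trigger and re-run chunking + embedding.
    rows = conn.execute(
        "DELETE FROM reindex_queue RETURNING source_type, source_id"
    ).fetchall()
    for source_type, source_id in rows:
        rechunk_and_embed(conn, source_type, source_id)  # hypothetical helper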

What's next

Two things we're working on. One: graph retrieval for relational queries. Right now retrieval is k-nearest-neighbour over chunks, which is great for "find similar past visits" but worse for "show me this patient's complete medication timeline." Graph retrieval over patient → visit → Rx links handles the second cleanly. Two: cross-patient retrieval within a single doctor's panel — useful for questions like "do my hypertensives respond better to telmisartan or losartan?". Strictly opt-in, anonymised at retrieval time, never crosses workspace boundaries.

If you want to play with Clio on a real chart, sign up at app.medisero.com/signup. If you want to argue with the architecture, write to hello@medisero.in.
