Building a reliable RAG system

If you arrived here from LinkedIn, this is the longer version of the story.

If you have used a RAG system, you have witnessed the "it works sometimes" problem. A query returns the right answer on Tuesday and an unrelated paragraph on Wednesday. The fix is rarely a single change. It is a sequence of decisions made carefully across the pipeline, where each one removes a specific class of failure.

This post walks through the decisions we made when building a private RAG assistant deployed inside client infrastructure for regulated industries. No exact model names, no copy-paste recipe. Just the shape of a system that behaves the same way today as it did yesterday.

Why most RAG systems are unreliable

The failures cluster into a small number of root causes:

  • Documents arrive as images. A scanned PDF with no embedded text returns nothing useful from a parser. Without OCR detection, the chunks are empty.
  • Query and corpus speak different languages. A French technician searches an English manual. Vector similarity collapses.
  • Chunking is optimistic. Default splitters cut through tables, headings, and procedure steps. Retrieval surfaces fragments.
  • Embeddings are too lightweight. A small model trained on generic text cannot tell two industrial part numbers apart.

Each of these is solvable. The work is in solving all of them at once, in the right order, without breaking the next stage downstream.

Ingestion: the part everyone underestimates

Ingestion is where most quality decisions are actually made. By the time a document is in the index, the ceiling on retrieval quality is already set.

[Figure: Ingestion pipeline. Documents flow through parsing, OCR detection, chunking with a token guard, embedding, and indexing into a vector store and a lexical store. Stages: source documents (PDFs, manuals, scans) → layout-aware parser (per-page markdown) → OCR detection (average characters per page) → OCR fallback (if sparse) → chunking (page-aware splits) → token guard (re-split oversized chunks) → multilingual embedding (cross-language vectors) → vector store (semantic index) and lexical store (keyword index).]

A few decisions that earned their keep:

Layout-aware parsing over naive text extraction. A parser that preserves page structure gives every chunk a real page_label. That metadata travels through the rest of the pipeline and shows up later as a citation the user can verify. If you cut this corner, you cannot show your work.

OCR detection by average characters per page. First attempt: trigger OCR if the first page has no text. That fails on cover pages with a single title. The reliable signal is the average across the document. Sparse on average means scanned. Cover pages stop being a problem.
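The check above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function name `needs_ocr` and the threshold of 100 characters are assumptions made for the example, and the right threshold depends on your own documents.

```python
from typing import Sequence

# Hypothetical threshold: tune it against a sample of your own documents.
MIN_AVG_CHARS_PER_PAGE = 100


def needs_ocr(page_texts: Sequence[str],
              min_avg_chars: int = MIN_AVG_CHARS_PER_PAGE) -> bool:
    """Return True when the document looks scanned (sparse text on average)."""
    if not page_texts:
        # Nothing was extracted at all, so OCR is the only way to get text.
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_avg_chars


# A title-only cover page followed by text-rich pages is not flagged.
digital = ["Service Manual", "x" * 1800, "y" * 2100]
# A scan yields almost no extractable text on any page.
scanned = ["", " ", "\n"]

print(needs_ocr(digital), needs_ocr(scanned))
```

Because the decision uses the whole document, one sparse cover page cannot trigger OCR on its own, and one stray text layer on a scanned page cannot suppress it.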

A token guard for oversized chunks. Even a well-behaved parser produces the occasional monster chunk that exceeds the embedding model's context window. Without a guard, those chunks get silently truncated and you lose the back half of the content. The fix is to count tokens with the embedding model's own tokenizer (not a generic one, which under-counts on specialized vocabulary) and re-split anything past a safe threshold.
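A guard of this kind might look like the sketch below. The names `guard_chunk` and `whitespace_count` are hypothetical. In a real pipeline, `count_tokens` would wrap the embedding model's own tokenizer; the whitespace counter here is only a stand-in so the example runs without any model installed.

```python
from typing import Callable, List


def guard_chunk(text: str,
                count_tokens: Callable[[str], int],
                max_tokens: int) -> List[str]:
    """Return the chunk unchanged if it fits, otherwise re-split it greedily.

    Simplifications in this sketch: oversized chunks are re-split on
    whitespace (so their original line breaks are lost), and a single word
    longer than max_tokens is still emitted as its own piece.
    """
    if count_tokens(text) <= max_tokens:
        return [text]

    pieces: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would overflow: close the piece, start a new one.
            pieces.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        pieces.append(" ".join(current))
    return pieces


def whitespace_count(text: str) -> int:
    """Stand-in token counter; replace with the embedding model's tokenizer."""
    return len(text.split())


print(guard_chunk("a b c d e f g", whitespace_count, max_tokens=3))
```

Setting `max_tokens` somewhat below the model's hard limit leaves headroom for any special tokens the model adds.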

A multilingual embedding model from the start. Retrofitting one later is painful. Pick one upfront that handles the languages your users actually speak.

Retrieval: hybrid plus re-ranking, not one or the other

This is the part of the pipeline most teams get wrong. Either they pick semantic search and accept that exact part numbers sometimes vanish, or they pick lexical and accept that paraphrased questions return nothing. Hybrid retrieval solves the recall problem. Re-ranking solves the precision problem.

Retrieval pipelineUser query is normalized, sent to hybrid retrieval combining lexical and semantic search, fused, re-ranked, and used to ground the LLM response. User queryAny language Language normalizationDetect, translate, hint Lexical retrieverExact token match Semantic retrieverVector similarity FusionMerge candidates Cross-encoder rerankQuery-aware scoring Grounded LLM response

Three things deserve their own attention:

Language normalization happens before retrieval, not after. A French query against an English corpus is translated to English for the retrieval step, then the original query is appended as a hint so the generation step responds in the user's language. The multilingual embedding model still helps. The translation step is what lets lexical retrieval pull its weight.
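As a rough sketch of that flow, the code below keeps language detection and translation behind plain callables. Everything named here (`NormalizedQuery`, `normalize_query`, `build_generation_prompt`, and the two fake helpers) is hypothetical and exists only for illustration; a real system would plug in an actual detector and translator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class NormalizedQuery:
    retrieval_text: str  # what the retrievers see (corpus language)
    original_text: str   # what the user typed
    user_language: str   # language the answer should be written in


def normalize_query(query: str,
                    detect_language: Callable[[str], str],
                    translate: Callable[[str, str, str], str],
                    corpus_language: str = "en") -> NormalizedQuery:
    """Translate the query into the corpus language, keeping the original."""
    lang = detect_language(query)
    if lang == corpus_language:
        return NormalizedQuery(query, query, lang)
    translated = translate(query, lang, corpus_language)
    return NormalizedQuery(translated, query, lang)


def build_generation_prompt(nq: NormalizedQuery, context: str) -> str:
    """Append the original query as a hint for the answer language."""
    return (
        f"Context:\n{context}\n\n"
        f"Question: {nq.retrieval_text}\n"
        f"Original question ({nq.user_language}): {nq.original_text}\n"
        "Answer in the language of the original question."
    )


# Fake helpers so the sketch runs without any external service.
def fake_detect(text: str) -> str:
    return "fr" if "comment" in text.lower() else "en"


def fake_translate(text: str, source: str, target: str) -> str:
    known = {"Comment remplacer le filtre ?": "How do I replace the filter?"}
    return known.get(text, text)


nq = normalize_query("Comment remplacer le filtre ?", fake_detect, fake_translate)
print(build_generation_prompt(nq, "Section 4.2: Filter replacement ..."))
```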

Fusion is not the finish line. The merged candidate list is good. It is not great. Before re-ranking, you get a list with the right answer somewhere in the top ten, surrounded by plausible-but-wrong neighbours. The LLM then has to reason through that noise.
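This post does not prescribe a fusion method. One common option, used here purely as an assumed example, is reciprocal rank fusion (RRF): each document earns 1 / (k + rank) from every list it appears in, and the sums decide the merged order. The function name, the document IDs, and the constant k = 60 are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical_hits = ["p12", "p40", "p07"]   # hypothetical IDs from the lexical retriever
semantic_hits = ["p33", "p12", "p40"]  # hypothetical IDs from the semantic retriever

fused = reciprocal_rank_fusion([lexical_hits, semantic_hits])
print(fused)  # ['p12', 'p40', 'p33', 'p07']
```

Documents found by both retrievers rise to the top, but nothing here looks at the query text itself, which is why the merged list still needs a re-ranker.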

Re-ranking is what makes the system feel reliable. A cross-encoder looks at the query and each candidate together, scoring them in context rather than independently. The right answer climbs to position one or two consistently. Pre-rerank responses are noisy. Post-rerank responses are clean. Same retrieval, different ordering, very different user experience.
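The shape of the re-ranking step can be sketched as follows. The scorer is deliberately a pluggable function: in a real system `score_pair` would wrap a cross-encoder model, while `toy_overlap_score` below is only a word-overlap stand-in so the example runs on its own. All names and sample passages are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(query: str,
           candidates: Sequence[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 5) -> List[Tuple[str, float]]:
    """Score each (query, candidate) pair jointly and keep the best top_n."""
    scored = [(text, score_pair(query, text)) for text in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of query words in passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


candidates = [
    "Safety notes for the hydraulic press",
    "Replace the hydraulic filter after depressurizing the system",
    "Warranty terms and conditions",
]
ranked = rerank("replace hydraulic filter", candidates, toy_overlap_score, top_n=2)
for text, score in ranked:
    print(f"{score:.2f}  {text}")
```

The retrieval results are unchanged; only their order and the cut-off differ, which is the point made above.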

The full picture

Pull back and the system has a clear shape: an ingestion pipeline that feeds an index, a retrieval pipeline that serves a chat client, and an LLM that generates the final answer. Everything sits inside a private boundary. No data leaves the client's infrastructure, which is the point of building it this way.

[Figure: Full system overview. The system is deployed inside a private boundary (client infrastructure, no data egress). The ingestion service (parse, chunk, embed; scheduled or on-demand) feeds the index stores (vector and lexical, plus a metadata DB). The retrieval API (hybrid retrieval and re-ranking, with auth and an audit log) serves the chat client (user interface, citations, feedback). The local LLM (on-prem inference, no external calls) runs inside the same boundary.]

A few principles hold the architecture together:

Storage is abstracted. Only one module knows where files actually live. Swapping local disk for object storage requires no changes to the routes that serve documents. The same applies to authentication. One boundary, one place to change.
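One way to picture that boundary is a small interface that the rest of the code depends on. The sketch below is illustrative only: `DocumentStorage`, `LocalDiskStorage`, `InMemoryStorage`, and `serve_document` are hypothetical names, and an object-storage backend would be one more class with the same two methods.

```python
from pathlib import Path
from typing import Dict, Protocol


class DocumentStorage(Protocol):
    """The only contract the rest of the application relies on."""

    def read(self, key: str) -> bytes: ...

    def write(self, key: str, data: bytes) -> None: ...


class LocalDiskStorage:
    """Stores files under a root directory (no path sanitizing in this sketch)."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def write(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryStorage:
    """Drop-in replacement, handy for tests."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        self._files[key] = data

    def read(self, key: str) -> bytes:
        return self._files[key]


def serve_document(storage: DocumentStorage, key: str) -> bytes:
    """A route handler only sees the interface, never the backend."""
    return storage.read(key)


storage = InMemoryStorage()
storage.write("manuals/pump.pdf", b"%PDF-demo")
print(serve_document(storage, "manuals/pump.pdf"))
```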

Schema is version-controlled. Migrations are the single source of truth for the database. No more "it works on my machine because that column exists locally" surprises.

Errors stay controlled. Raw exception strings never reach the client. The server logs them with full context. The user sees a message that tells them what to do next.
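A minimal sketch of that rule, assuming a hypothetical wrapper called `safe_call` around each request handler: the full traceback goes to the server log under a reference ID, and the client receives only a fixed message plus that ID.

```python
import logging
import uuid
from typing import Callable, Dict

logger = logging.getLogger("rag.api")  # hypothetical logger name

USER_MESSAGE = (
    "We could not complete your request. Please try again, and quote the "
    "reference ID if the problem persists."
)


def safe_call(handler: Callable[[], Dict[str, object]]) -> Dict[str, object]:
    """Run a handler; on failure, log the details and return a safe payload."""
    try:
        return handler()
    except Exception:
        error_id = uuid.uuid4().hex
        # logger.exception records the traceback on the server side only.
        logger.exception("request failed (error_id=%s)", error_id)
        return {"ok": False, "error_id": error_id, "message": USER_MESSAGE}


def failing_handler() -> Dict[str, object]:
    raise RuntimeError("connection to db-internal-01 refused")


response = safe_call(failing_handler)
print(response["message"])
```

The reference ID lets support staff find the matching log entry without exposing internal hostnames or stack traces to the user.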

Citations are non-negotiable. Every answer points back to the page it came from. Without that, you have a confident chatbot. With it, you have a tool people can trust.
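To make this concrete, here is a small sketch of how the page_label attached at ingestion can surface as a citation. The `Chunk` fields other than page_label, the `format_citations` helper, and the sample file name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str      # hypothetical field: document name
    page_label: str  # set by the layout-aware parser at ingestion


def format_citations(chunks: Sequence[Chunk]) -> List[str]:
    """Build one citation per (source, page), in the order chunks were used."""
    seen: Set[Tuple[str, str]] = set()
    citations: List[str] = []
    for chunk in chunks:
        key = (chunk.source, chunk.page_label)
        if key not in seen:
            seen.add(key)
            citations.append(f"{chunk.source}, p. {chunk.page_label}")
    return citations


used_chunks = [
    Chunk("Depressurize the system first.", "pump-manual.pdf", "42"),
    Chunk("Then remove the filter housing.", "pump-manual.pdf", "42"),
    Chunk("Torque values are listed in table 7.", "pump-manual.pdf", "57"),
]
print(format_citations(used_chunks))
```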

What we are not telling you

The exact embedding model. The exact reranker. The exact LLM. Those details are choices, and the right choice depends on your hardware, languages, and tolerance for latency. The principles above hold whether you run on a single GPU droplet or a rack.

What we will tell you is that none of this came from a tutorial. Each decision was made because the previous version of the system failed in a specific, traceable way. Reliability in RAG is not a feature you turn on. It is the residue of having taken every failure mode seriously.


Mantrax Software Solutions builds private AI assistants for regulated industries where data sovereignty is not optional. If you are thinking about RAG for your own organization and want to skip the failure modes above, get in touch.

Acknowledgement

Feature photo by Mika Baumeister on Unsplash
