Company-knowledge RAG: how AI answers from your documents

A language model knows “the internet in general”, but not your knowledge base: pricing, policies, documentation, customer history. Ask it directly and it will confidently make something up. RAG (retrieval-augmented generation) closes that gap: the model answers from fragments retrieved from your documents, not from memory.

Let’s look at how the RAG pipeline works, where accuracy breaks, and when it’s genuinely worth it for a business and when it’s needless complexity.

Why RAG

A bare LLM has three business problems: it doesn’t know your data, it produces plausible but wrong answers (hallucinations), and it can’t cite a source. RAG addresses all three: the answer is assembled from specific fragments of your documents, with a link to the original.

Answers are grounded in your current documents, not “general knowledge”
Every answer is verifiable — you can see which document it came from
Updating knowledge means updating documents, not retraining a model

How the pipeline works

RAG has two loops. Indexing (once, and on updates): documents are split into chunks, each turned into an embedding — a vector of meaning — and stored in a vector database. Answering (per question): the question is also embedded, the nearest chunks are retrieved, and the model answers strictly from them.

Diagram: documents → chunks → embeddings → vector DB → retrieval → answer
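Here is a minimal sketch of both loops in Python. It rests on toy stand-ins: embed() is a plain bag-of-words counter rather than a real embedding model, so it matches words, not meaning, and InMemoryIndex is a hypothetical list standing in for a vector database. Chunks are split on blank lines.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse bag-of-words vector.
    A real embedding model captures meaning (synonyms, paraphrases); this only counts words."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InMemoryIndex:
    """Hypothetical stand-in for a vector database: a list of (doc_id, chunk, vector)."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, str, Counter]] = []

    def add_document(self, doc_id: str, text: str) -> None:
        # Indexing loop: split into chunks (here: on blank lines), embed each, store.
        for chunk in (p.strip() for p in text.split("\n\n")):
            if chunk:
                self.entries.append((doc_id, chunk, embed(chunk)))

    def search(self, query: str, k: int = 3) -> list[tuple[str, str]]:
        # Answering loop, step 1: embed the question and return the k nearest chunks.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(doc_id, chunk) for doc_id, chunk, _ in ranked[:k]]

index = InMemoryIndex()
index.add_document("refund-policy", "Refunds are issued within 14 days.\n\nShipping is free over 50 EUR.")
print(index.search("Within how many days are refunds issued?", k=1))
# [('refund-policy', 'Refunds are issued within 14 days.')]
```

Swapping embed() for a real embedding model and the list for a vector database changes the quality of the matches, not the shape of the two loops.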

In code, answering takes two steps: retrieve(query) returns the top-k most relevant fragments, and generate(query, context) asks the model to answer using only the provided context.
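The two steps might look like the sketch below. Several parts are assumptions: CHUNKS is a hypothetical pre-built index, word overlap stands in for vector similarity, and llm is any function that takes a prompt and returns text, so you can plug in whichever model client you use. Numbering the fragments and asking for [n] citations is what makes each answer traceable to its source document.

```python
import re
from typing import Callable

# Hypothetical pre-indexed fragments; in production these come from the vector DB.
CHUNKS = [
    {"source": "pricing.pdf#p2", "text": "The Pro plan costs 49 EUR per user per month."},
    {"source": "terms.pdf#p5", "text": "Annual contracts can be cancelled with 30 days' notice."},
    {"source": "faq.html", "text": "Support is available on weekdays from 9 to 18."},
]

def _words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Return the top-k fragments. Word overlap stands in for vector similarity here."""
    q = _words(query)
    ranked = sorted(CHUNKS, key=lambda c: len(q & _words(c["text"])), reverse=True)
    return ranked[:k]

def generate(query: str, context: list[dict], llm: Callable[[str], str]) -> str:
    """Ask the model to answer strictly from the numbered fragments and to cite them."""
    fragments = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(context, start=1)
    )
    prompt = (
        "Answer the question using ONLY the fragments below. "
        "Cite fragment numbers like [1]. If the answer is not in the fragments, "
        "say you don't know.\n\n"
        f"Fragments:\n{fragments}\n\nQuestion: {query}"
    )
    return llm(prompt)

question = "How much does the Pro plan cost per month?"
# llm=lambda p: p echoes the prompt, which shows exactly what the model would see.
print(generate(question, retrieve(question, k=1), llm=lambda p: p))
```

The "say you don't know" instruction matters as much as the context itself: it gives the model a permitted way out when retrieval comes back empty-handed.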

Where accuracy breaks

RAG quality is determined not by the model but by the retrieval loop. If retrieval brings the wrong fragments, even the best model answers off-target. The main levers:

Chunking. Chunks too large blur meaning; too small lose context. Split on semantic boundaries, not character counts.
Hybrid search. Vector search captures meaning but misses exact terms, SKUs and acronyms. Pairing it with keyword search (BM25) fixes that.
Re-ranker. A separate model reorders the retrieved fragments by relevance before they reach the LLM, which clearly raises the hit rate: how often the right fragment actually makes it into the context.
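Of the levers above, chunking is the easiest to show in a few lines. In this sketch, paragraph and sentence breaks serve as a simple approximation of semantic boundaries. That is an assumption: production systems may also split on headings or by topic. The max_chars limit is illustrative, not a recommendation.

```python
import re

def chunk_by_paragraphs(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks of up to max_chars, never cutting mid-paragraph.
    A paragraph longer than max_chars is split on sentence boundaries instead.
    A single sentence longer than max_chars is kept whole, so it may exceed the limit."""
    units = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            units.append(para)
        else:
            # Fall back to sentence boundaries for oversized paragraphs.
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)

    chunks, current = [], ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if len(candidate) <= max_chars:
            current = candidate  # unit still fits: keep growing the chunk
        else:
            if current:
                chunks.append(current)
            current = unit  # start a new chunk with this unit
    if current:
        chunks.append(current)
    return chunks

doc = "Refund policy.\n\nRefunds take 14 days.\n\nShipping is free over 50 EUR."
print(chunk_by_paragraphs(doc, max_chars=40))
# ['Refund policy.\n\nRefunds take 14 days.', 'Shipping is free over 50 EUR.']
```

Note how the heading stays glued to the paragraph it introduces, while the unrelated shipping rule becomes its own chunk. A fixed 40-character cut would have split "Refunds take 14 days." in half.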

In RAG, retrieval owns quality, not the model. A strong LLM on bad context is a confident wrong answer.
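Hybrid search, from the levers above, can be sketched as two rankings merged into one. The example implements Okapi BM25 for the keyword side and merges it with the vector side using reciprocal rank fusion (RRF). RRF is one common fusion method, chosen here for illustration, not the only option. The vector ranking is passed in as a plain list of ids and is hypothetical in the usage below. k1=1.5, b=0.75 and k=60 are commonly used defaults.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Keep hyphenated codes such as SKUs ("AB-1042") as single tokens.
    return re.findall(r"[\w-]+", text.lower())

def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Okapi BM25: return ids of documents matching at least one query term, best first."""
    if not docs:
        return []
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized.values()) / n or 1.0
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term in tf:
                idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
                score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each id scores the sum of 1 / (k + rank) over the lists."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]

docs = {
    "d1": "Replacement filter for model AB-1042 coffee machine",
    "d2": "How to descale your coffee machine",
    "d3": "Warranty terms for coffee machines",
}
keyword_ranking = bm25_rank("AB-1042 filter", docs)
vector_ranking = ["d2", "d3", "d1"]  # hypothetical vector-search output that misses the SKU
print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# ['d1', 'd2', 'd3']
```

The exact SKU match found by BM25 lifts d1 to the top even though the vector side ranked it last. That is the failure mode hybrid search exists to cover.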

Freshness and access

Two requirements that separate a demo from production. Freshness: the index must update when documents change — otherwise the agent confidently answers from an outdated policy. Access: a user must not see a document they lack rights to via the chat — permission filtering is built into the retrieval step itself.
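Both requirements can be sketched in one small index. The sketch assumes three things. Each chunk carries the set of groups allowed to read its source document. A SHA-256 hash over the content and its permissions detects changes. Word overlap again stands in for vector similarity. What matters is where the checks live: stale chunks are replaced on sync, and the permission filter runs inside search(), before ranking, rather than on the model’s output.

```python
import hashlib
import re
from dataclasses import dataclass, field

def _words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

@dataclass
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups allowed to read the source document (assumption)

@dataclass
class PermissionAwareIndex:
    chunks: list = field(default_factory=list)
    versions: dict = field(default_factory=dict)  # doc_id -> hash of content + permissions

    def sync_document(self, doc_id: str, text: str, allowed_groups: set[str]) -> bool:
        """Freshness: re-index a document when its content or permissions change.
        Returns True if the index was updated."""
        key = text + "\x00" + ",".join(sorted(allowed_groups))
        digest = hashlib.sha256(key.encode()).hexdigest()
        if self.versions.get(doc_id) == digest:
            return False  # unchanged: nothing to re-embed
        # Drop the stale chunks first so an outdated policy cannot be retrieved.
        self.chunks = [c for c in self.chunks if c.doc_id != doc_id]
        for para in (p.strip() for p in text.split("\n\n")):
            if para:
                self.chunks.append(Chunk(doc_id, para, frozenset(allowed_groups)))
        self.versions[doc_id] = digest
        return True

    def search(self, query: str, user_groups: set[str], k: int = 3) -> list[Chunk]:
        """Access: filter by permissions inside retrieval, before ranking,
        so text the user may not read never reaches the model's context."""
        q = _words(query)
        visible = [c for c in self.chunks if c.allowed_groups & user_groups]
        scored = [(len(q & _words(c.text)), c) for c in visible]
        scored = [(s, c) for s, c in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [c for _, c in scored[:k]]

idx = PermissionAwareIndex()
idx.sync_document("refunds", "Refunds are issued within 14 days.", {"support", "sales"})
idx.sync_document("salaries", "Salary bands for 2024.", {"hr"})
print([c.doc_id for c in idx.search("salary bands", {"support"})])
# []
```

In a real deployment, sync_document would be triggered by change events or a scheduled crawl of the source systems. The hash check keeps that cheap by re-embedding only what actually changed.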

When RAG fits — and when it doesn’t

RAG is justified where knowledge is large, changing and accuracy matters: product support, answers over an internal base, helping reps with pricing and terms. It’s overkill for a dozen standard FAQ entries — a simple script wins there, no vector DB.

We deploy RAG on your stack, with data stored in Russia and explicit boundaries: what the agent answers itself and what it escalates to a human. For how this pays off in money for our clients, see our case studies.
