Building a RAG System on Your Own Business Data: What Actually Works

Almost every week now, someone asks us a version of the same question: "Can we have a ChatGPT that knows our business?" They have watched a generic chatbot answer questions confidently, and they want that — but answering from their contracts, their product manuals, and five years of support tickets, not the open internet.

The technique that makes this possible is retrieval-augmented generation, or RAG. We touched on it briefly when we wrote about how LangChain is changing business automation, but it deserves a proper treatment, because RAG is where most companies' first serious AI project lands — and where a surprising number of them get stuck.

This is a guide to what RAG actually is, how to build one that holds up in production, and the decisions that quietly determine whether the project succeeds or becomes another abandoned proof of concept. It is written from the perspective of having built these systems, not having read about them.

Problem Statement

A large language model on its own knows a great deal about the world and nothing whatsoever about your business. Ask it about your refund policy and it will either admit it does not know or, worse, invent something plausible. That second failure mode — confident fabrication — is the reason you cannot simply put a raw model in front of customers and hope for the best.

The instinctive fix is "train the model on our data." For most businesses, that is the wrong instinct. Fine-tuning teaches a model a style or a behaviour; it is poor at teaching it facts, it goes stale the moment your data changes, and it is expensive to repeat. If your prices changed yesterday, a model fine-tuned last month is now confidently wrong.

RAG solves the problem differently. Instead of baking knowledge into the model's weights, you retrieve the relevant facts at the moment of the question and hand them to the model as context. The model then answers using what you gave it. Change a document, and the next answer reflects the change immediately. That single property — knowledge that updates without retraining — is why RAG has become the default architecture for grounding AI in private data.

Industry Challenges

The concept is simple to describe and deceptively hard to do well. The challenges we see repeatedly:

Your data is messier than you think — Knowledge lives in scanned PDFs, in wiki pages three reorganisations out of date, and in the heads of two people who are always busy. Confronting the state of the source material is usually the largest hidden cost of the project, and it is rarely budgeted for.
Retrieval quality, not the model, is the bottleneck — Teams obsess over which model to use when answer quality is almost always limited by whether the right passage reached the model in the first place. A brilliant model fed irrelevant context gives a confident, irrelevant answer.
Hallucination does not disappear, it gets rarer — RAG dramatically reduces fabrication, but a poorly grounded system will still occasionally invent. For internal tools that is tolerable; for regulated or customer-facing use it needs deliberate guardrails.
Permissions are a first-class problem — If your HR documents and your board minutes go into the same searchable index, you have just built a very efficient way for the wrong person to read the wrong thing.

How RAG Actually Works

A RAG system has two phases: an ingestion pipeline that runs ahead of time, and a query pipeline that runs when a user asks a question.

The ingestion phase is the part nobody photographs but everybody depends on:

Load documents from their sources — file storage, a CMS, a database, a ticketing system.
Chunk each document into passages. Chunk too large and retrieval becomes imprecise; too small and you lose the context that makes a passage meaningful. Splitting on headings and paragraphs consistently beats splitting every few hundred characters.
Embed each chunk, converting it into a vector — a list of numbers that captures its meaning — using an embedding model.
Store those vectors in a database alongside the original text and metadata such as source, date, and access level.
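
As a rough illustration of the chunking step above, the sketch below splits a document on blank lines and packs whole paragraphs into chunks up to a character budget. The function name chunk_paragraphs and the max_chars default are our own placeholders for illustration, not recommended values; a real pipeline would also respect headings and carry metadata along with each chunk.

```python
def chunk_paragraphs(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than
    being cut mid-sentence.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because chunks always end on a paragraph boundary, each passage stays readable on its own, which is the property the character-count approach tends to lose.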

The query phase is what the user actually sees. The question is embedded with the same model, the database returns the handful of passages closest in meaning, and those passages are assembled into a prompt that instructs the model to answer using only the supplied context and to say so when the answer is not present. The model then produces a grounded answer, ideally citing which source each claim came from. Picture the data flowing left to right through load, chunk, embed and store into the vector database, then the question flowing through embed, search and generate back to the user, with the vector store sitting in the middle as the shared component between both flows.
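
To make the "closest in meaning" step concrete, here is a minimal sketch of similarity search over an in-memory list. It assumes the vectors have already been produced by an embedding model; the two-dimensional vectors in the usage note are toy values, and the names cosine_similarity and top_k are ours. A production system would delegate this to the vector database rather than sorting in Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          store: list[tuple[str, list[float]]],
          k: int = 3) -> list[str]:
    """Return the texts of the k stored passages most similar to the query.

    store holds (passage_text, embedding) pairs.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

With a store of [("refund", [1.0, 0.0]), ("shipping", [0.0, 1.0]), ("returns", [0.9, 0.1])] and a query vector of [1.0, 0.0], asking for the top two returns "refund" followed by "returns".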

Two small details do more for reliability than any model upgrade: setting the model's temperature low so it stops improvising, and explicitly instructing it to refuse when the answer is not in the retrieved context.
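
Those two details might look something like the sketch below. The prompt wording, the REFUSAL string and the build_prompt helper are illustrative assumptions, and GENERATION_SETTINGS is a placeholder: the exact name and range of the temperature setting depend on the model provider you use.

```python
# Hypothetical refusal text; pick wording that suits your product.
REFUSAL = "I could not find that in the provided documents."

# Placeholder settings; check your provider's documentation for the
# actual parameter name and accepted range.
GENERATION_SETTINGS = {"temperature": 0.0}


def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (source_id, text) pairs."""
    context = "\n\n".join(f"[{source}] {text}" for source, text in passages)
    return (
        "Answer the question using only the context below. "
        "Cite the source id in square brackets for each claim. "
        f'If the answer is not in the context, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

Using a fixed refusal string has a practical side benefit: refusals become easy to count in logs, which shows you which questions your documents cannot yet answer.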

Implementation Considerations

Choosing where to store the vectors

For most of our clients, the right answer is pgvector, the widely used vector extension for PostgreSQL, because they already run Postgres and adding a capability to an existing database beats operating a new one. Dedicated vector databases earn their place at larger scale or when you need advanced filtering and hybrid search. Start with what you already operate.
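
For a sense of what "adding a capability" means in practice, here is a sketch assuming the extension in question is pgvector. The passages table, its columns and the three-dimensional vector are hypothetical, and the %s placeholders assume a psycopg-style driver. The `<=>` operator is pgvector's cosine-distance operator.

```python
# SQL held as Python strings; run them with whichever Postgres driver you use.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS passages (
    id bigserial PRIMARY KEY,
    source text NOT NULL,
    access_level text NOT NULL,
    body text NOT NULL,
    embedding vector(3)  -- toy dimension; use your embedding model's size
);
"""

# Nearest passages by cosine distance (smaller distance = more similar).
SEARCH_SQL = """
SELECT source, body
FROM passages
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(vec: list[float]) -> str:
    """Format a Python list as pgvector's text form, e.g. '[1.0,2.5,0.0]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"
```

The query vector would be passed as to_vector_literal(question_embedding) alongside the row limit. Nothing here requires a new service to operate, which is the point of starting with the database you already have.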

Hybrid search beats pure semantic search

Vector search is excellent at meaning but can miss exact terms — a product code, an error number, a clause reference. Combining vector search with traditional keyword search and re-ranking the merged results is the single most effective quality improvement we deploy. If a system is close but not quite, this is usually the fix.
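
One simple way to merge the two result lists is reciprocal rank fusion, sketched below. We are naming this technique as an example; it is not the only way to re-rank merged results. The constant k = 60 is a commonly used default, and the document ids in the usage note are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by both searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Given a vector ranking of ["a", "b", "c"] and a keyword ranking of ["c", "a", "d"], the fused order is a, c, b, d: the two documents found by both searches outrank those found by only one.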

Cost

Running costs split into three buckets: embeddings, which are cheap and mostly one-off per document; vector storage, which is modest; and model generation, which is the bulk. A practical lever is routing simple lookups to a small, cheap model and reserving a larger model for genuinely complex synthesis. We have cut clients' running costs by more than half this way without users noticing a quality drop.
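
A routing rule can start out very simple. The sketch below is a deliberately crude heuristic under our own assumptions: the keyword list, the thresholds and the model names "small-model" and "large-model" are all placeholders, and a real router should be tuned against an evaluation set rather than guessed.

```python
import re

# Hypothetical cue words suggesting the question needs synthesis.
SYNTHESIS_WORDS = {"compare", "summarise", "summarize", "explain", "why", "difference"}


def choose_model(question: str, num_passages: int) -> str:
    """Send simple lookups to the cheap model, everything else to the large one."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    needs_synthesis = bool(words & SYNTHESIS_WORDS)
    if needs_synthesis or num_passages > 3 or len(question.split()) > 30:
        return "large-model"
    return "small-model"
```

For example, "What is the refund window?" with two retrieved passages goes to the small model, while "Compare the 2022 and 2023 supplier contracts" goes to the large one.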

Security

Three things matter most. First, access-scoped retrieval: store each passage's permission level as metadata and filter retrieval by the requesting user's entitlements, so a query never returns passages the user is not allowed to see. Second, prompt-injection defence: a malicious document can contain instructions like "ignore previous rules", so treat retrieved content as untrusted data, not as commands. Third, data residency: if you are a UK business handling personal data, know where your embeddings and prompts are processed, and consider self-hosted open models for the most sensitive workloads.
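
The first two points can be sketched in a few lines. The Passage class, the access-level labels and the document delimiter format are assumptions for illustration. The filter is applied before any similarity ranking, so restricted passages never become candidates; in a database this would be a WHERE clause on the metadata column. Wrapping retrieved text in delimiters is one common mitigation for prompt injection, not a complete defence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str
    source: str
    access_level: str  # e.g. "staff", "hr", "board" (hypothetical labels)


def visible_passages(passages: list[Passage], entitlements: set[str]) -> list[Passage]:
    """Keep only passages whose access level the user is entitled to."""
    return [p for p in passages if p.access_level in entitlements]


def wrap_untrusted(passage: Passage) -> str:
    """Mark retrieved text as data so the prompt can tell the model
    to treat anything inside the delimiters as content, not instructions."""
    return f'<document source="{passage.source}">\n{passage.text}\n</document>'
```

A user entitled only to "staff" content would see the holiday policy but never the board minutes, regardless of how the question is phrased.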

Keeping the index fresh

The query path scales comfortably: with an approximate nearest-neighbour index, search over millions of passages typically returns in milliseconds. The part that strains is freshness. A document changes; how quickly does the assistant know? Building incremental ingestion triggered by document updates, rather than a nightly full rebuild, is what separates a demo from a system people trust.
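
One way to keep incremental ingestion cheap is to compare a content hash against what was last indexed, so an update event only triggers re-embedding when the text really changed. The approach and the names content_hash and plan_reindex are our illustration; the mechanism that delivers update events will depend on where your documents live.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str],
                 indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Decide which document ids to re-embed and which to remove.

    documents maps id -> current text; indexed_hashes maps id -> the
    hash stored when the document was last embedded.
    """
    to_upsert = sorted(
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(
        doc_id for doc_id in indexed_hashes if doc_id not in documents
    )
    return to_upsert, to_delete
```

Unchanged documents cost nothing, new and edited ones are re-embedded, and deleted ones are removed from the index so the assistant stops citing them.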

Real-World Use Cases

Customer support deflection: A grounded assistant drawing on your help centre and past tickets answers the repetitive majority of enquiries and escalates the rest with full context. The return is easy to measure: tickets deflected multiplied by cost per ticket.
Internal knowledge assistant: New staff ask how something is done and get an answer with a link to the source policy, instead of interrupting a colleague.
Sales and bid support: Teams answering tenders query past proposals, specifications and pricing to assemble responses in a fraction of the time.
Document-heavy professional services: Legal, financial and property firms query across contracts and case files, turning hours of manual searching into a single question.

Common Mistakes to Avoid

Starting with the model, not the data — The first month should be spent on what is in scope, how clean it is, and who is allowed to see it.
One giant index for everything — Mixing document types and permission levels into a single undifferentiated index hurts both relevance and security.
Ignoring chunking — It is unglamorous and it is where half of all quality problems originate.
No citations — Users trust answers they can verify. An assistant that links to its sources is believed; one that does not is quietly abandoned.
Skipping evaluation — "It seemed good in the demo" is not a quality bar. Without a test set of real questions and known good answers, you will never know whether last week's tweak helped or hurt.
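
An evaluation harness does not need to be elaborate to be useful. The sketch below measures only one thing, whether the expected source document appears among the top results, and assumes a test set shaped as a list of question and expected_source pairs. The function name and the data shape are ours, and scoring answer faithfulness would need a further step not shown here.

```python
from typing import Callable


def retrieval_hit_rate(test_set: list[dict[str, str]],
                       retrieve: Callable[[str], list[str]],
                       k: int = 3) -> float:
    """Fraction of test questions whose expected source is in the top k.

    retrieve takes a question and returns source ids, best match first.
    """
    if not test_set:
        return 0.0
    hits = 0
    for case in test_set:
        retrieved = retrieve(case["question"])[:k]
        if case["expected_source"] in retrieved:
            hits += 1
    return hits / len(test_set)
```

Run it before and after every change to chunking, embeddings or search settings; a single number that moves is far more persuasive than an impression from a demo.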

Future Trends

Three shifts are worth watching. The first is agentic RAG, where instead of a single retrieve-then-answer step the system reasons about what it needs, retrieves, checks whether it has enough, and retrieves again. The second is multimodal retrieval, pulling from images, diagrams and tables rather than text alone, which matters enormously for technical documentation. The third is standardised tool connectivity, where emerging protocols let an assistant reach into live systems such as your CRM rather than only static documents. None of these replaces the fundamentals; they build on a well-constructed retrieval foundation.

Why Businesses Should Act Now

The competitive gap is opening between organisations whose staff can ask their data a question and get an instant, sourced answer, and those still searching shared drives by hand. The technology has matured to the point where a focused first project is achievable in weeks, and the cost of the underlying models has fallen sharply over the past two years. The risk is not moving too early; it is building badly. A rushed system that hallucinates or leaks does lasting damage to internal confidence in AI.

Conclusion

RAG is the most reliable way to put your own knowledge behind an AI assistant, and for most businesses it is the right first AI project. The architecture is well understood; the difficulty lives in the details — chunking, retrieval quality, access control, evaluation, and freshness. Those details are exactly where experience pays for itself, because each has a cheap wrong answer that looks fine in a demo and fails in production. We have been building software since 2013 and working with the modern AI ecosystem as these tools have matured, and we are happy to talk through your data, your constraints, and what a realistic first step looks like.

Frequently Asked Questions

What is the difference between RAG and fine-tuning?

Fine-tuning adjusts the model's internal weights to change its behaviour or style and is poor at teaching changeable facts. RAG retrieves relevant facts at query time and feeds them to the model. For knowledge that updates, such as prices and policies, RAG is almost always the right choice, and the two can be combined.

How long does it take to build a RAG system?

A focused proof of concept on a clean, well-scoped document set can take two to four weeks. A production system with access control, evaluation, monitoring and an incremental refresh pipeline takes longer, and the timeline depends heavily on how messy the source data is.

Do we need a special database?

Not necessarily. If you already run PostgreSQL, its vector extension is often enough to start. Dedicated vector databases make sense at larger scale or when you need advanced filtering.

Will it still hallucinate?

RAG sharply reduces fabrication but does not eliminate it. With a low temperature, explicit instructions to refuse when context is missing, citations and good retrieval, you get answers that are grounded and verifiable.

Is our data safe with a model provider?

It depends on the provider and configuration. Reputable providers offer agreements not to train on your data and provide regional processing for compliance. Where data is especially sensitive, the whole pipeline can run on self-hosted open-source models so nothing leaves your infrastructure.

How do we know if it is actually good?

Build a test set of real questions with known correct answers and score retrieval accuracy and answer faithfulness, re-running it on every change. Without that, you are guessing.
