RAG (Retrieval-Augmented Generation) lets your internal AI answer from your own documents: instead of "remembering" everything in model weights, the system finds the relevant passages in your internal knowledge base, inserts them into the prompt, and lets the model answer with source citations. This reduces hallucination and lets you update knowledge without retraining. This is Part 5/8 of the Build Internal AI series.
Quick summary
- What RAG is: retrieve relevant internal passages → inject them into context → the model answers from them, with citations.
- Why use it: answers grounded in your documents, less fabrication, and knowledge updates just by adding/editing documents — no model retraining.
- Pipeline: ingest → chunk → embed → store in a vector DB → retrieve → augment prompt → answer.
- Infrastructure: one embedding model (prefer multilingual, good for Vietnamese) + a vector DB (pgvector, Qdrant, Chroma).
- Quality comes from: sensible chunking, rich metadata, and always answering with sources.
What is RAG and why for internal documents?
A language model only "knows" what it learned during training; it doesn't know your processes, contracts or internal policies. Ask it directly and it tends to fabricate answers that sound plausible but are wrong (hallucination). RAG fixes this: before answering, the system retrieves the most relevant internal passages and puts them into context, so the model only has to synthesize an answer grounded in the real documents and point to the source each part came from.
Compared with fine-tuning (further training the model on your data), RAG is more pragmatic for the "let the AI read internal documents" case:
| Criterion | RAG | Fine-tuning |
|---|---|---|
| Updating knowledge | Add/edit a document, effective immediately | Must retrain when data changes |
| Source citations | Yes — shows which document it used | Hard to trace answers back to sources |
| Cost & skills | Lighter, no training GPUs needed | Needs data, compute and training expertise |
| Best fit | Document Q&A, frequently-changing knowledge | Shaping tone/style, specialized skills |
The two aren't mutually exclusive — but for most enterprises wanting an "assistant that knows every company document," RAG is the right starting point, and safer too since documents aren't "baked" into model weights.
How the RAG pipeline works
A basic RAG system has two phases: indexing (done once, re-run when documents change) and querying (each time a user asks). The steps:
- Ingest: gather internal documents — PDF, Word, wiki, intranet pages — and extract the text.
- Chunk: split the text into reasonably-sized passages, with metadata (file name, section, page).
- Embed: turn each chunk into a numeric vector with an embedding model.
- Store in a vector DB: keep the vectors and metadata for similarity search.
- Retrieve: when a question comes in, embed it and find the nearest chunks in the vector DB.
- Augment prompt: insert the retrieved chunks into the prompt as context.
- Answer: the model generates an answer grounded in that context, with source citations.
The first four steps "build the knowledge base"; the last three run in real time per question. To see where RAG sits in the whole system, see the Internal AI system architecture diagram.
Vector DB & embedding model
The heart of RAG is the vector database — where embedding vectors are stored and the "nearest" vectors to a question are found fast (semantic search, not keyword matching). Three popular open-source options to run on-premise:
- pgvector: an extension for PostgreSQL, most convenient if you already use Postgres, since it keeps knowledge and business data in one database.
- Qdrant: a dedicated vector DB written in Rust, strong at metadata filtering and scaling.
- Chroma: lightweight, easy to start with for prototypes and small projects.
Alongside it is the embedding model — the thing that turns text into vectors. For Vietnamese documents, choose a multilingual embedding model so a Vietnamese question retrieves the right Vietnamese passages; an English-only embedding will retrieve poorly. One important rule: use the same embedding model for indexing and for querying — change the model and you must re-embed the whole store.
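To make "nearest vectors" and the same-model rule concrete, here is a minimal sketch of an in-memory store. `TinyVectorStore`, its `add` and `search` methods, and the model name `multilingual-v1` are hypothetical names invented for illustration, not the API of pgvector, Qdrant or Chroma. It scans every row, whereas a real vector DB typically relies on an index (often approximate nearest-neighbor) for speed.

```python
import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector DB (illustration only)."""

    model_name: str  # embedding model this index was built with
    rows: list[tuple[list[float], dict]] = field(default_factory=list)

    def _check_model(self, model_name: str) -> None:
        # Vectors from different embedding models are not comparable.
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got {model_name!r}: "
                "re-embed the whole store before switching models"
            )

    def add(self, vector: list[float], metadata: dict, model_name: str) -> None:
        self._check_model(model_name)
        self.rows.append((vector, metadata))

    def search(self, query_vec: list[float], k: int, model_name: str) -> list[tuple[float, dict]]:
        """Return the k most similar rows as (score, metadata), best first."""
        self._check_model(model_name)
        scored = [(cosine(query_vec, vec), meta) for vec, meta in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Made-up two-dimensional vectors standing in for real embeddings.
store = TinyVectorStore(model_name="multilingual-v1")
store.add([1.0, 0.0], {"source": "leave-policy.pdf"}, model_name="multilingual-v1")
store.add([0.0, 1.0], {"source": "expenses.docx"}, model_name="multilingual-v1")
hits = store.search([0.9, 0.1], k=1, model_name="multilingual-v1")
print(hits[0][1]["source"])
```

Recording the model name next to the index is one simple way to catch an accidental model switch at query time instead of silently returning poor matches.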
Chunking & retrieval quality
Answer quality depends heavily on how you chunk. If chunks are too large, the vector gets "diluted" and drags in noise; too small and the passage loses context and gives clipped answers. A few pragmatic principles:
- Chunk size: pick a moderate size so each passage stands on its own (e.g. by paragraph or section), rather than cutting mid-sentence.
- Overlap: let chunks overlap slightly so ideas at boundaries aren't cut off.
- Metadata: attach source (file name, section title, page, date) to each chunk — both for filtering and for showing citations.
- Number retrieved (top-k): pull just enough relevant passages; too many dilutes context and wastes tokens.
There's no "one size fits all" number — chunk size, overlap and top-k should be tuned against your real documents and questions. How to measure retrieval and answer quality is covered in the Evaluation & tuning article.
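The chunking principles above can be sketched as a small function. This is one possible approach under stated assumptions, not a recommended setting: paragraphs are assumed to be separated by blank lines, `max_chars` is a soft limit in characters, `overlap` counts whole paragraphs repeated between neighboring chunks, and the function name, metadata keys and sample text are all made up for illustration.

```python
def chunk_paragraphs(text: str, source: str, max_chars: int = 800, overlap: int = 1) -> list[dict]:
    """Group whole paragraphs into chunks of roughly max_chars, repeating the
    last `overlap` paragraphs of each chunk at the start of the next one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[dict] = []
    current: list[str] = []
    start = 0  # index of the first paragraph held in `current`

    for i, para in enumerate(paragraphs):
        size = sum(len(p) for p in current) + len(para)
        if current and size > max_chars:
            # Close the chunk at a paragraph boundary, never mid-sentence.
            chunks.append({
                "text": "\n\n".join(current),
                "source": source,
                "paragraphs": (start, i - 1),  # inclusive range, for citations
            })
            current = current[-overlap:] if overlap else []
            start = i - len(current)
        current.append(para)

    if current:
        chunks.append({
            "text": "\n\n".join(current),
            "source": source,
            "paragraphs": (start, len(paragraphs) - 1),
        })
    return chunks


# Made-up sample document with three short paragraphs.
sample = (
    "Leave policy overview.\n\n"
    "Requests go to the line manager.\n\n"
    "Unused days expire in March."
)
chunks = chunk_paragraphs(sample, source="leave-policy.md", max_chars=60)
for c in chunks:
    print(c["source"], c["paragraphs"], repr(c["text"]))
```

Because the overlap is carried over after a chunk is closed, a chunk can end up slightly larger than `max_chars`; treat the limit as a tuning knob rather than a hard cap.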
Citations & fighting hallucination
The biggest enterprise benefit of RAG is traceability: each answer can come with the internal sources used, so the reader can open the original document and verify. A few practices to limit fabrication:
- Require grounding in context: instruct the model to use only the provided passages, and to say "not found in the documents" when there's no basis.
- Always show citations: attach the document/section name to each claim so users can check for themselves.
- Keep the knowledge base clean & fresh: stale or contradictory documents lead to wrong answers — RAG is only as good as its sources.
RAG reduces hallucination substantially but doesn't eliminate it entirely; you still need evaluation and guardrails as covered in the Evaluation article. To bring in knowledge from outside the organization (news, world trends) while still keeping data on-site, see the article on the Trending Pool mechanism for updating world knowledge, which feeds fresh knowledge into the very same RAG store.
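As a sketch of the first two practices, the prompt can number each retrieved passage, tell the model to answer only from those passages, and ask for the passage number after each claim. The function names, the instruction wording, the `source`/`section`/`text` keys and the sample passage are all assumptions made for illustration; tune the wording against the model you actually serve.

```python
import re

REFUSAL = "not found in the documents"


def build_grounded_prompt(question: str, hits: list[dict]) -> str:
    """Number each retrieved passage and instruct the model to answer only from them."""
    context = "\n\n".join(
        f"[{n}] ({hit['source']}, {hit['section']})\n{hit['text']}"
        for n, hit in enumerate(hits, start=1)
    )
    return (
        "Answer using only the passages below. "
        "Cite the passage number in square brackets after each claim. "
        f'If the passages do not contain the answer, reply "{REFUSAL}".\n\n'
        f"Passages:\n{context}\n\n"
        f"Question: {question}"
    )


def cited_sources(answer: str, hits: list[dict]) -> list[str]:
    """Map [n] markers in an answer back to source names, ignoring out-of-range numbers."""
    numbers = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [hits[n - 1]["source"] for n in sorted(numbers) if 1 <= n <= len(hits)]


# Made-up passage standing in for a retrieved chunk.
hits = [{
    "source": "hr-handbook.pdf",
    "section": "Leave",
    "text": "Leave requests are approved by the line manager.",
}]
prompt = build_grounded_prompt("Who approves leave requests?", hits)
print(prompt)
print(cited_sources("The line manager approves them [1]. See also [7].", hits))
```

Checking citation markers after generation is a cheap guardrail: an answer that cites no passage, or a passage number that was never provided, can be flagged for review rather than shown as grounded.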
A minimal RAG stack that runs entirely on-premise:
- Embedding: a multilingual embedding model running locally (served via your own serving layer).
- Vector DB: pgvector if you already have Postgres, or Qdrant/Chroma for a dedicated DB.
- Retriever: a thin layer: embed the question → query top-k by similarity → insert into the prompt.
Conceptual flow (pseudocode) for one RAG query:
# indexing (once / when documents change)
chunks = chunk(documents, overlap=True)
vectors = embed(chunks) # same model used for querying
vectordb.upsert(vectors, metadata=chunks.meta)
# querying (per question)
q_vec = embed(question)
top = vectordb.search(q_vec, k=5) # relevant chunks + sources
prompt = system + context(top) + question
answer = llm(prompt) # answer with citations from top
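For readers who want to run the flow end to end, here is a toy, self-contained version of the pseudocode. It is a sketch only: `embed` is a bag-of-words stand-in that matches on shared words rather than meaning, so it is not a substitute for a real multilingual embedding model; the in-memory list replaces the vector DB; the documents are made-up samples; and the final call to the model is left as a comment.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(index: list[dict], q_vec: Counter, k: int) -> list[dict]:
    """Return the k rows most similar to the query vector, best first."""
    return sorted(index, key=lambda row: cosine(q_vec, row["vector"]), reverse=True)[:k]


# indexing (once / when documents change)
documents = [
    {"source": "leave-policy.md", "text": "Annual leave must be approved by the line manager."},
    {"source": "expenses.md", "text": "Travel expenses are reimbursed after receipts are submitted."},
]
index = [{"vector": embed(doc["text"]), **doc} for doc in documents]

# querying (per question); the same embed() is used as for indexing
question = "How is annual leave approved?"
top = search(index, embed(question), k=1)
context = "\n".join(f"[{hit['source']}] {hit['text']}" for hit in top)
prompt = (
    "Answer only from the context and cite the source in brackets.\n\n"
    f"{context}\n\nQuestion: {question}"
)
print(prompt)  # pass this to your locally served model in place of llm(prompt)
```

Swapping the toy parts for real ones keeps the same shape: `embed` becomes a call to your local embedding model, and `index` plus `search` become upserts and queries against the vector DB.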
The Namtech view
Namtech builds RAG directly on the customer's internal documents, running 100% on-site on Apple Silicon alongside an open-source serving layer: the embedding model, the vector DB and the answer-generating model all sit inside your infrastructure — documents never leave the organization. We prefer multilingual embeddings that work well for Vietnamese, chunking that follows real document structure, and always answer with internal source citations so users can verify. The next step is exposing this RAG through a chat UI and an API for staff — see the UI & integration article.
Frequently asked questions
How is RAG different from fine-tuning?
RAG retrieves relevant documents and puts them into context for the model to answer — updating knowledge just by adding/editing documents, and answering with citations. Fine-tuning further trains the model on your data, good for shaping tone/style or skills but hard to update and hard to trace to sources. For internal document Q&A, RAG is usually the right starting point.
Which vector DB should I pick: pgvector, Qdrant or Chroma?
If you already use PostgreSQL, pgvector is most convenient since it keeps everything in one database. Qdrant is a dedicated vector DB, strong at metadata filtering and scaling. Chroma is lightweight, good for getting started and small projects. All three are open-source and run on-premise.
Does RAG make data leave the organization?
No, if deployed correctly: the embedding model, vector DB and answer-generating model all run on your infrastructure, with no external API calls. That's exactly why on-premise RAG suits sensitive documents. The defense layers are detailed in the Security article.
Does RAG eliminate hallucination completely?
Not completely, but it reduces it substantially because answers are grounded in real documents and come with citations to verify. You still need to keep the knowledge base clean/fresh, instruct the model to use only the provided context, and measure quality as in the Evaluation & tuning article.
Want your internal AI to read your own documents?
Namtech builds RAG on your internal documents — embedding, vector DB and model running 100% on-site, answering with citations, data never leaving the organization.
Note: This is a general guide, updated 02/07/2026; tools and models change fast, so verify the latest versions when you deploy.