Evaluating quality & tuning internal AI (eval, guardrails)

Q: Is LLM-as-judge trustworthy?

Trustworthy for screening and trend tracking but with limits: the judging model can be biased and can miss subtle errors. Use it for fast broad runs, but keep a human in the loop for important decisions on a representative sample.

Q: Do I have to fine-tune?

No. For most enterprise tasks RAG + a good prompt is enough and far easier to update. Fine-tuning is only needed when you require a very specific voice or format and the data is stable, and it should be the last step.

Evaluating (eval) internal AI is the systematic process of measuring answer quality before you trust it — because you can't improve what you don't measure. The core method: build a "golden" question set matched to your business, score it with humans + LLM-as-judge, reduce hallucination with RAG + citations + good prompts, and put safety guardrails on the input and output. Fine-tuning is only needed when RAG + prompts aren't enough. This is Part 7/8 in the build-your-own internal AI series.

Quick summary

Why evaluate: AI can answer fluently yet wrongly (hallucination); without a metric you can't tell whether a new version is better or worse.
How to measure: a golden set of real questions + expected answers, scored by humans and complemented by LLM-as-judge (knowing its limits).
Reduce hallucination: RAG supplies context, force source citations, use good prompts, and let the AI "say it doesn't know when unsure".
Guardrails: filter input/output, handle PII, refuse out-of-scope or harmful requests.
Fine-tuning: usually unnecessary — only when you need a specific voice/format and the data is stable.

Why evaluate?

A language model can always produce an answer — even when it's wrong. Fluent, confident, grammatical text does not mean factually correct. When a model invents plausible-looking but false information, that's called hallucination, and it's the main risk of putting AI into real work: a wrong answer delivered convincingly is more dangerous than an honest "I don't know".

Without measurement, you have no way to know whether a model update, a prompt change, or a RAG tweak is better or worse — you're just going on feel. The engineering principle here is simple: you can't improve what you don't measure. Evaluation turns quality from a gut feeling into a number you can compare across versions.

How to measure quality — golden set + humans + LLM-as-judge

The first step is building a "golden" question set: a list of questions that actually arise in your business, each with an expected answer (or the key points a correct answer must hit). You don't need many — a few dozen to a few hundred covering common cases and the "hard" edge cases already adds real value. The crucial point: the golden set must reflect your business, not a generic benchmark.

There are two ways to score, best used together:

Human scoring: subject-matter experts read the answers and judge right/wrong, complete/incomplete, and whether sources are cited correctly. This is the gold standard for reliability, but slow and effort-heavy.
LLM-as-judge: use a model to grade an answer against the expected answer, which scales up and runs automatically, fast and repeatable.

Table — Two ways to score the golden set
Method	Strengths	Limits	Role
Human scoring	Gold standard for reliability; judges right/wrong, complete/incomplete, correct citations	Slow, effort-heavy	Human in the loop on a representative sample & important decisions
LLM-as-judge	Scales up, automatic, fast, repeatable	Can be biased (favoring longer answers, its own model family, presentation order), grade leniently, miss subtle errors	Fast screening & trend tracking

Limits of LLM-as-judge to be clear about: the judging model can be biased — favoring longer answers, answers from its own model family, or being swayed by presentation order; it can also grade leniently or miss subtle errors. So LLM-as-judge should be used for fast screening and trend tracking, while important decisions still need a human in the loop on a representative sample. Don't let a machine-generated score fully replace human judgment.

The evaluation loop: test set → run the model → score → improve (prompt/RAG/guardrails) → repeat. Diagram: Namtech.

Reducing hallucination

There's no way to eliminate hallucination 100%, but several layers reduce it sharply when combined:

RAG (document retrieval): instead of letting the model answer from "memory", give it the relevant internal-document passages as context. Answers anchor to real sources rather than inventing. See the RAG for internal documents article.
Force source citations: require the model to indicate which document/passage it drew from. Users can verify, and the citation requirement itself keeps the model more disciplined.
Good prompts: clearly specify role, scope and desired format; remind the model to use only the information in the provided context.
Allow "I don't know": explicitly instruct that when unsure or when the documents lack the information, the model must say "I don't know / not found" instead of guessing. An honest "I don't know" is safer than a confident fabrication.

Safety guardrails

Guardrails are control layers around the model to ensure safety and compliance — independent of content quality:

Input filtering: detect and block prompt-injection attempts, privilege-escalation requests, or harmful content before they reach the model.
Output filtering: check the answer before returning it to the user — block inappropriate content, secret leakage, or exposure of sensitive data.
PII handling: identify and mask/redact personal information (names, phone numbers, ID numbers…) per compliance requirements — directly relevant to PDPL.
Refuse out-of-scope: internal AI should politely decline requests outside its purpose (e.g. personal legal advice, harmful content) and steer back to the business scope.

Table — Four safety guardrail layers
Guardrail layer	Function
Input filtering	Block prompt-injection, privilege-escalation requests, harmful content before they reach the model
Output filtering	Check the answer before returning — block inappropriate content, secret leakage, exposure of sensitive data
PII handling	Identify and mask/redact personal information (names, phone numbers, ID numbers…) per compliance (PDPL)
Refuse out-of-scope	Politely decline requests outside purpose, steer back to the business scope

Guardrails are one facet of a broader security picture; see the companion article Internal AI security system for the defense layers at the infrastructure and access levels. A common risk reference for LLM applications is the OWASP Top 10 for LLM Applications.

For the IT team

Start small: build an eval set as a CSV of questions + expected answers, run it periodically, and compare scores across versions (model swap, prompt change, RAG tweak). Even a few dozen rows is enough to catch regressions.

# eval.csv — the "golden" question set for your business
question,expected
"How many annual leave days are there?","12 leave days/year"
"Approval process for spend over 50M?","Two levels: dept head + director"

# run eval, score with humans + LLM-as-judge, save scores per version
python run_eval.py --set eval.csv --model qwen2.5:7b --out scores_v1.json
python run_eval.py --set eval.csv --model qwen2.5:14b --out scores_v2.json
# compare the two versions to decide whether to upgrade

When do you need to fine-tune?

Many people assume that making AI "good at your job" requires fine-tuning (further training) the model. In reality, for most enterprise tasks, RAG + a good prompt is enough — cheaper, faster, and far easier to update (just change the documents, no retraining). Fine-tuning is only worth considering when:

You need a very specific voice / style that a prompt can't enforce reliably.
You need a fixed, strict output format repeated at large scale.
The data is stable (not constantly changing) — because fine-tuning "freezes" knowledge into the weights, so new documents won't self-update the way RAG does.

The pragmatic rule: prioritize RAG + prompt first; fine-tune last, only after you've measured and confirmed those two layers have truly hit their ceiling.

The Namtech view

For every internal-AI deployment, Namtech builds a golden set matched to the client's business from the start and runs it as part of the process — so every change (model swap, prompt tweak, RAG update) is scored objectively rather than "feels better". We set input/output guardrails and PII handling aligned with PDPL requirements, and by default prioritize RAG + citations to reduce hallucination before considering fine-tuning. This approach keeps quality transparent and improvable over time — true to the spirit of "you can't improve what you don't measure".

Frequently asked questions

How many questions does a golden set need?

Not many. A few dozen to a few hundred real questions, covering common cases plus a few "hard" edge cases, is enough to catch regressions across versions. What matters is that the questions reflect your actual business, not a generic benchmark, and that you expand the set as new errors surface.

Is LLM-as-judge trustworthy?

Trustworthy for screening and trend tracking, but with limits: the judging model can be biased (favoring longer answers, answers from the same model family, or presentation order) and can miss subtle errors. Use it for fast, broad runs, but keep a human in the loop for important decisions on a representative sample.

What's the most effective way to reduce hallucination?

Combine several layers: RAG supplies context from real documents, force source citations for verification, prompt the model to use only information in the context, and allow it to "say it doesn't know" when unsure. No single layer eliminates hallucination 100%, but combined they cut the risk substantially.

Do I have to fine-tune?

No. For most enterprise tasks, RAG + a good prompt is enough and far easier to update. Fine-tuning is only needed when you require a very specific voice/format and the data is stable — and it should be the last step, only after measurement shows the other two layers have hit their ceiling.

← Previous · Part 6/8UI & integration for internal AI Next · Part 8/8 →Operations & scaling for internal AI

Want internal AI without starting from zero?

Namtech deploys private internal AI platforms — open-source models running 100% on your own infrastructure, data never leaving the organization.

Book a free consultation

Note: This is a general guide, updated 02/07/2026; tools and models change fast — verify the latest versions when you deploy.

References