An internal AI (on-premise) system is a set of stacked layers that all sit inside your own infrastructure: bottom to top, hardware → serving → open-source model → RAG → OpenAI-compatible API → UI & integration; a security layer wraps the whole stack, and a Trending Pool feeds fresh knowledge into RAG through a controlled channel. This article explains each layer and traces the data flow of a single question through the system.
Quick summary
- Layered architecture: hardware at the bottom, UI at the top; each layer only talks to its neighbors.
- On-premise boundary: every layer sits inside your infrastructure — prompts, documents and answers never leave the organization.
- Security is a wrapping layer, not a separate box: authentication, authorization, network isolation and logging apply to every layer.
- Trending Pool is a controlled channel to periodically update the RAG layer with fresh knowledge.
- Single-question flow: user → UI → API → RAG retrieval → model generates an answer with citations.
The overall architecture
The easiest way to picture internal AI is as a stack of layers living entirely within your infrastructure boundary. External users and apps only touch the top layer (UI & API); everything below — the model, the data, retrieval — stays on-site. The diagram below is the map of the whole system.
Each layer, explained
Read bottom to top — each layer builds on the one below and serves the one above:
- Hardware (bottom): the machine that runs the model on-site — Apple Silicon (Mac Mini/Studio) or a GPU box. It's the foundation of the whole stack; memory capacity determines the model size you can run.
- Serving engine: the software that loads the model into memory and answers requests —
Ollamafor a fast start,vLLMwhen you need to serve many users at once. - Open-source model: the language "brain" — the Qwen, Gemma, Llama families. Choose by a license that allows commercial use and a size that fits the hardware.
- RAG (vector DB + internal documents): the layer that lets the AI "read" your documents. Documents are embedded and stored in a vector database; on a question, the system retrieves relevant passages so the model answers with citations, reducing hallucination.
- OpenAI-compatible API: a standardized entry point. Because it matches the OpenAI spec, existing tools and apps can call it with almost no code changes.
- UI & integration (top): the layer users touch — a chat interface for staff and connection points into internal software.
- Security (wrapping): not a separate box but a layer applied to every layer — user authentication, authorization, network isolation and access logging.
- Trending Pool: a controlled knowledge-update channel — it periodically curates fresh information and feeds it into the RAG layer, so internal AI doesn't "stand still" over time while still never opening a free outbound connection.
The data flow of a single question
The layered diagram shows how the system is stacked; when a real question arrives, data moves across the layers in sequence. Below is the journey of one question, all inside the on-premise boundary.
Mapping each layer to concrete, popular tools today:
- Hardware:
Apple Silicon(Mac Mini/Studio) or a GPU box. - Serving:
Ollama(fast start) orvLLM(high throughput). - Model:
Qwen·Gemma— pick by commercial license and a size that fits VRAM/RAM. - RAG: a vector DB —
pgvector·Qdrant+ your internal documents. - API: an
OpenAI-compatibleendpoint exposed by the serving engine. - UI:
Open WebUIor a custom app calling the API.
The on-premise boundary — why data never leaves the org
The crux of both diagrams is the dashed boundary that wraps every layer. Because hardware, serving, model, vector DB and documents all sit inside your infrastructure, when a question moves through the stack, the prompt, retrieved documents and the answer never leave the internal network. No step calls out to a public AI API.
The only channel that touches the outside is the Trending Pool — and it's a controlled, inbound-only channel: it curates fresh knowledge and loads it into RAG periodically, rather than pushing internal data out. That keeps on-site data easy to demonstrate for PDPL compliance while the system's knowledge stays refreshed.
The Namtech view
Namtech deploys exactly this layered architecture on Apple Silicon (Mac Mini/Studio clusters, low power draw) with commercially-safe open-source models. We treat the on-premise boundary and the wrapping security layer as defaults — not add-ons — and use the Trending Pool as a controlled knowledge-update channel. Understanding this diagram helps your team see what you own and what a partner carries.
Frequently asked questions
Why is security drawn as a "wrapping" layer instead of a separate box?
Because security doesn't live in one place. Authentication is at the UI, authorization at the API, network isolation at the infrastructure tier, and access logging is everywhere. Drawing it as a wrapping layer reflects that every layer must be controlled — details in the Internal AI security article.
How is RAG different from the model "just knowing" the answer?
The model only knows what's in its training data. RAG lets the model read your internal documents at question time: the system retrieves relevant passages from the vector DB and adds them to the context, so answers stay grounded in real documents and come with citations. See the RAG article.
What does "OpenAI-compatible" API mean?
It means the internal system's endpoint uses the same request/response format as OpenAI's API. So libraries, tools and apps built for OpenAI can call your internal AI with almost no code changes — you just point them at a different endpoint.
Does the Trending Pool leak data outward?
No, if implemented correctly: it's an inbound-only channel — it only curates fresh knowledge and loads it into RAG periodically, without pushing prompts or internal documents outward. Details in the Trending Pool article.
Want internal AI without starting from zero?
Namtech deploys private internal AI platforms — open-source models running 100% on your own infrastructure, data never leaving the organization.
Book a free consultationNote: This is a high-level architectural guide, updated 02/07/2026; tools and models change fast — verify the latest versions when you deploy.