Internal AI system architecture, layer by layer (with diagrams)

Q: What does OpenAI-compatible API mean?

The internal system's endpoint uses the same request/response format as OpenAI's API, so libraries, tools and apps built for OpenAI can call your internal AI with almost no code changes, just pointing them at a different endpoint.

Q: Does the Trending Pool leak data outward?

No if implemented correctly: it's an inbound-only channel, only curating fresh knowledge and loading it into RAG periodically, without pushing prompts or internal documents outward.

An internal AI (on-premise) system is a set of stacked layers that all sit inside your own infrastructure: bottom to top, hardware → serving → open-source model → RAG → OpenAI-compatible API → UI & integration; a security layer wraps the whole stack, and a Trending Pool feeds fresh knowledge into RAG through a controlled channel. This article explains each layer and traces the data flow of a single question through the system.

Quick summary

Layered architecture: hardware at the bottom, UI at the top; each layer only talks to its neighbors.
On-premise boundary: every layer sits inside your infrastructure — prompts, documents and answers never leave the organization.
Security is a wrapping layer, not a separate box: authentication, authorization, network isolation and logging apply to every layer.
Trending Pool is a controlled channel to periodically update the RAG layer with fresh knowledge.
Single-question flow: user → UI → API → RAG retrieval → model generates an answer with citations.

The overall architecture

The easiest way to picture internal AI is as a stack of layers living entirely within your infrastructure boundary. External users and apps only touch the top layer (UI & API); everything below — the model, the data, retrieval — stays on-site. The diagram below is the map of the whole system.

Layered internal AI architecture: hardware at the bottom → UI at the top, security wrapping it, Trending Pool feeding RAG — all inside the on-premise boundary. Diagram: Namtech.

Each layer, explained

Read bottom to top — each layer builds on the one below and serves the one above:

Hardware (bottom): the machine that runs the model on-site — Apple Silicon (Mac Mini/Studio) or a GPU box. It's the foundation of the whole stack; memory capacity determines the model size you can run.
Serving engine: the software that loads the model into memory and answers requests — Ollama for a fast start, vLLM when you need to serve many users at once.
Open-source model: the language "brain" — the Qwen, Gemma, Llama families. Choose by a license that allows commercial use and a size that fits the hardware.
RAG (vector DB + internal documents): the layer that lets the AI "read" your documents. Documents are embedded and stored in a vector database; on a question, the system retrieves relevant passages so the model answers with citations, reducing hallucination.
OpenAI-compatible API: a standardized entry point. Because it matches the OpenAI spec, existing tools and apps can call it with almost no code changes.
UI & integration (top): the layer users touch — a chat interface for staff and connection points into internal software.
Security (wrapping): not a separate box but a layer applied to every layer — user authentication, authorization, network isolation and access logging.
Trending Pool: a controlled knowledge-update channel — it periodically curates fresh information and feeds it into the RAG layer, so internal AI doesn't "stand still" over time while still never opening a free outbound connection.

Table — The internal AI system's layers & their roles
Layer	Role
Hardware (bottom)	The machine that runs the model on-site — Apple Silicon (Mac Mini/Studio) or a GPU box; memory capacity determines the model size you can run
Serving engine	Loads the model into memory and answers requests — Ollama for a fast start, vLLM when you need to serve many users at once
Open-source model	The language "brain" — the Qwen, Gemma, Llama families; choose by a license that allows commercial use and a size that fits the hardware
RAG (vector DB + internal documents)	The layer that lets the AI "read" your documents; retrieves relevant passages so the model answers with citations, reducing hallucination
OpenAI-compatible API	A standardized entry point; existing tools and apps can call it with almost no code changes
UI & integration (top)	The layer users touch — a chat interface for staff and connection points into internal software
Security (wrapping)	A layer applied to every layer — user authentication, authorization, network isolation and access logging
Trending Pool	A controlled knowledge-update channel — periodically curates fresh information and feeds it into the RAG layer

The data flow of a single question

The layered diagram shows how the system is stacked; when a real question arrives, data moves across the layers in sequence. Below is the journey of one question, all inside the on-premise boundary.

Single-question flow: (1) user → UI, (2) → API, (3) → RAG retrieves context from the Vector DB, (4) → model generates the answer, (5) answer with citations returns to the user. Diagram: Namtech.

For the IT team

Mapping each layer to concrete, popular tools today:

Hardware: Apple Silicon (Mac Mini/Studio) or a GPU box.
Serving: Ollama (fast start) or vLLM (high throughput).
Model: Qwen · Gemma — pick by commercial license and a size that fits VRAM/RAM.
RAG: a vector DB — pgvector · Qdrant + your internal documents.
API: an OpenAI-compatible endpoint exposed by the serving engine.
UI: Open WebUI or a custom app calling the API.

Table — Mapping each layer to concrete tools
Layer	Tools
Hardware	Apple Silicon (Mac Mini/Studio) or a GPU box
Serving	Ollama (fast start) or vLLM (high throughput)
Model	Qwen · Gemma — pick by commercial license and a size that fits VRAM/RAM
RAG	A vector DB — pgvector · Qdrant + your internal documents
API	An OpenAI-compatible endpoint exposed by the serving engine
UI	Open WebUI or a custom app calling the API

The on-premise boundary — why data never leaves the org

The crux of both diagrams is the dashed boundary that wraps every layer. Because hardware, serving, model, vector DB and documents all sit inside your infrastructure, when a question moves through the stack, the prompt, retrieved documents and the answer never leave the internal network. No step calls out to a public AI API.

The only channel that touches the outside is the Trending Pool — and it's a controlled, inbound-only channel: it curates fresh knowledge and loads it into RAG periodically, rather than pushing internal data out. That keeps on-site data easy to demonstrate for PDPL compliance while the system's knowledge stays refreshed.

The Namtech view

Namtech deploys exactly this layered architecture on Apple Silicon (Mac Mini/Studio clusters, low power draw) with commercially-safe open-source models. We treat the on-premise boundary and the wrapping security layer as defaults — not add-ons — and use the Trending Pool as a controlled knowledge-update channel. Understanding this diagram helps your team see what you own and what a partner carries.

Frequently asked questions

Why is security drawn as a "wrapping" layer instead of a separate box?

Because security doesn't live in one place. Authentication is at the UI, authorization at the API, network isolation at the infrastructure tier, and access logging is everywhere. Drawing it as a wrapping layer reflects that every layer must be controlled — details in the Internal AI security article.

How is RAG different from the model "just knowing" the answer?

The model only knows what's in its training data. RAG lets the model read your internal documents at question time: the system retrieves relevant passages from the vector DB and adds them to the context, so answers stay grounded in real documents and come with citations. See the RAG article.

What does "OpenAI-compatible" API mean?

It means the internal system's endpoint uses the same request/response format as OpenAI's API. So libraries, tools and apps built for OpenAI can call your internal AI with almost no code changes — you just point them at a different endpoint.

Does the Trending Pool leak data outward?

No, if implemented correctly: it's an inbound-only channel — it only curates fresh knowledge and loads it into RAG periodically, without pushing prompts or internal documents outward. Details in the Trending Pool article.

Want internal AI without starting from zero?

Namtech deploys private internal AI platforms — open-source models running 100% on your own infrastructure, data never leaving the organization.

Book a free consultation

Note: This is a high-level architectural guide, updated 02/07/2026; tools and models change fast — verify the latest versions when you deploy.

References