Serving: install & optimize model speed

Serving internal AI models on-premise: Ollama, vLLM, llama.cpp and an OpenAI-compatible API

Serving is the layer that turns a static model file into a running service — accepting requests, generating answers, serving many users. Three popular choices: Ollama (easiest, start with one command), vLLM (high throughput when many users hit it at once), and llama.cpp (lightweight, great for CPU/Apple Silicon). Combine them with quantization (GGUF Q4/Q5/Q8) to save memory and an OpenAI-compatible API so internal apps can call in. This is Part 4/8 of the build-your-own internal AI series.

Quick summary

  • What serving is: software that loads the model and exposes an endpoint to accept requests — without it, the model is just a file on disk.
  • Ollama: easiest; install, then run a model with a single command — ideal for getting started and prototyping.
  • vLLM: optimized for throughput when serving many concurrent users (continuous batching).
  • llama.cpp: lightweight, runs well on CPU and Apple Silicon; the foundation behind the GGUF format.
  • Quantization: Q4/Q5/Q8 trade some output quality for lower memory use; Q4 is usually a good balance to start with.
  • OpenAI-compatible API: a standardized /v1/chat/completions endpoint so any internal app calls in as if calling OpenAI.

Choosing a serving tool: Ollama, vLLM or llama.cpp?

There's no single "best" tool; each is optimized for a different situation. Choose based on the number of concurrent users, the hardware you have, and how much you want to configure yourself.

  • Ollama: easiest, start with one command. Compact install, manages models much like container images (pull, list, run), and exposes a local API out of the box. Great for standing up a quick prototype on one machine for a small team. Ollama uses llama.cpp under the hood, so it runs well on both Apple Silicon and GPU machines.
  • vLLM: high throughput for many users. Built to serve many concurrent requests with continuous batching and efficient memory management, typically on GPU. A fit once internal AI is in "production" and many staff use it at once.
  • llama.cpp: lightweight, great for CPU/Apple Silicon. Written in C/C++ with minimal dependencies, it is the native engine for the GGUF format. A fit when you want deep control, need to run on a machine without a discrete GPU, or want to embed it in an application.

| Criterion | Ollama | vLLM | llama.cpp |
|---|---|---|---|
| Ease of getting started | Easiest (one command) | More configuration | Moderate (build/CLI) |
| Many concurrent users | Fine for small teams | Strongest (batching) | More limited |
| Best-fit hardware | Apple Silicon / GPU | GPU | CPU / Apple Silicon |
| Model format | GGUF (via llama.cpp) | Mostly native weights | GGUF |
| When to use | Getting started, prototype | Production, high load | Embedded, CPU-only |

The pragmatic path: start with Ollama to prove value, then as user count grows consider moving to vLLM for throughput. On an Apple Silicon machine without a discrete GPU, llama.cpp (or Ollama on top of llama.cpp) is the natural choice. See the Hardware article to pick the matching machine.
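
The selection logic above can be written down as a small decision helper. This is only a sketch: the function name `suggest_engine`, its parameters, and the cut-off of 10 concurrent users are illustrative assumptions, not figures from any tool's documentation. Replace them with numbers measured on your own load.

```python
def suggest_engine(concurrent_users: int, has_discrete_gpu: bool,
                   embed_in_app: bool = False) -> str:
    """Return a starting-point engine, following the comparison table.

    SMALL_TEAM_MAX is a hypothetical boundary used for illustration only.
    """
    SMALL_TEAM_MAX = 10  # assumption: where a "small team" ends

    if embed_in_app:
        # Embedding the engine inside an application is the llama.cpp case.
        return "llama.cpp"
    if has_discrete_gpu and concurrent_users > SMALL_TEAM_MAX:
        # Many simultaneous users on GPU: throughput matters most.
        return "vLLM"
    # Prototypes, small teams, and machines without a discrete GPU.
    return "Ollama"


print(suggest_engine(concurrent_users=3, has_discrete_gpu=False))
print(suggest_engine(concurrent_users=50, has_discrete_gpu=True))
```

Treat the result as a first guess to validate, not a rule: a machine without a discrete GPU always lands on Ollama or llama.cpp here, matching the guidance above.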

Quantization — GGUF Q4/Q5/Q8

Quantization is a technique that reduces the precision of model weights (e.g., from 16-bit down to 4-bit) so the model uses less memory and often runs faster. The GGUF format (used by llama.cpp and Ollama) ships common quantization levels ready to go: Q4, Q5, Q8.

  • Q4: the heaviest compression of the three, saving the most memory — the popular level because it balances size and quality well and is usually a reasonable starting point.
  • Q5: moderate compression, a slight quality bump over Q4, in exchange for a bit more memory.
  • Q8: light compression, keeping quality closest to the original of the three, but using the most memory.

The general rule: heavier quantization → less memory, but output quality may gradually drop. It's a trade-off, not free. The size of the effect depends on the model and task, so the right approach is to test on your own data rather than trusting a fixed percentage. For most enterprises starting out, Q4 is a safe launch point; if quality falls short on important tasks, move up to Q5/Q8 or pick a larger model (see the Model selection article).
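
To get a feel for the memory side of this trade-off, the weights alone can be estimated as parameters × bits per weight ÷ 8. The sketch below assumes nominal bit widths (4, 5, 8, 16) and a 7-billion-parameter model as an example. Real GGUF files are usually somewhat larger than this estimate, because quantized blocks also store scale factors and some tensors are kept at higher precision, and the running server needs extra memory on top for the context cache.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Example: a 7B-parameter model at the nominal bit widths (an assumption;
# actual quantized files carry some overhead on top of this).
for label, bits in [("16-bit", 16), ("Q8", 8), ("Q5", 5), ("Q4", 4)]:
    print(f"{label:>6}: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

For the 7B example this gives about 14 GB at 16-bit and about 3.5 GB at 4 bits, which is why Q4 often decides whether a model fits on a given machine at all.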

[Diagram: your infrastructure (on-premise, data never leaves the org): Client (internal app) → API (OpenAI-compatible) → Serving engine (batching) → Model (quantized) → Response]
The flow of a single inference request, entirely within your infrastructure boundary: Client → API (OpenAI-compatible) → Serving engine (batching) → Model (quantized) → response. Diagram: Namtech.

Context length & concurrency/batching

Two parameters have a big impact on memory and latency when serving many people:

  • Context length: the maximum number of tokens the model can handle in a single request, counting the prompt (including any conversation history sent with it) plus the answer. A longer context lets you feed in more documents, but uses more memory and can be slower. Set it just large enough for the task rather than maxing it out "to be safe."
  • Concurrency / batching: to serve many people at once, the engine groups requests and processes them in parallel. vLLM stands out for continuous batching — raising total throughput under load. The trade-off: more concurrent requests → more memory needed, and the latency of any single request can rise as the queue lengthens.

In short: longer context + more concurrent users = more memory needed, and you have to balance throughput (total served) against latency (delay each person feels). Measure on your real load to pick the right configuration instead of guessing.
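
The "longer context + more users = more memory" relationship comes mostly from the key/value cache that transformer models keep for every token of every active request. A rough upper bound can be sketched as below. The model dimensions (32 layers, 8 key/value heads, head size 128, 16-bit cache values) are hypothetical values chosen for illustration; read the real ones from your model's configuration. The estimate also assumes every request fills its whole context, which engines that allocate cache on demand will usually undercut.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, concurrent_seqs: int,
                   bytes_per_value: int = 2) -> int:
    """Worst-case KV cache size when every sequence fills its context."""
    # Factor 2: one key vector and one value vector per layer, per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_seqs


GIB = 2**30
# Hypothetical model shape (assumption, not taken from a specific model card).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for context, users in [(8192, 1), (8192, 8), (32768, 8)]:
    size = kv_cache_bytes(context_tokens=context, concurrent_seqs=users, **shape)
    print(f"context={context:>6} users={users}: ~{size / GIB:.0f} GiB")
```

With these assumed dimensions, one 8,192-token request needs about 1 GiB of cache, and the total scales linearly with both context length and the number of simultaneous requests. This is the arithmetic behind setting the context "just large enough."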

OpenAI-compatible API

A major strength of today's serving ecosystem is that most tools expose an OpenAI-compatible API — the same /v1/chat/completions endpoint shape, the same request/response structure. As a result, any internal app (chat UI, backend, automation script) can call your internal AI exactly as if calling OpenAI, just by changing the base URL to your on-premise server.

The benefits: standardization — existing libraries, examples and tools all work; and easy switching — if you later change engines (Ollama → vLLM) or models, the apps calling in barely need edits. It's also the foundation for the next step — RAG — and for the UI & integration step.
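
A minimal sketch of what "just change the base URL" looks like from an internal app, using only the Python standard library. The helper names (`build_chat_request`, `extract_reply`, `chat`) are made up for this example, and the host names in the usage comment are assumptions; the request and response shapes follow the /v1/chat/completions format described above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url: str, model: str, user_message: str,
         timeout: float = 60.0) -> str:
    """Send one message to the server and return the model's answer."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))


# Usage (requires a running server; host names are assumptions):
#   chat("http://localhost:11434/v1", "qwen2.5:7b", "Hello")
#   chat("http://gpu-server.internal:8000/v1", "<model>", "Hello")
```

Only the `base_url` (and the model name) differ between the two calls in the usage comment, so in practice both are best kept in configuration rather than in code.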

See also the companion posts: Internal AI system architecture diagram, Internal AI security system and Trending Pool — updating world knowledge.

For the IT team

Fastest start — Ollama, run a model with a single command:

# install Ollama, then run an open-source model
ollama run qwen2.5:7b # downloads the model on first run, then chat right in the terminal, fully offline

When you need high throughput for many users — vLLM exposes an OpenAI-compatible server:

# serve a model via vLLM (OpenAI-compatible server)
vllm serve <model> # opens the /v1/... endpoint on GPU

Internal apps call the OpenAI-compatible endpoint — just point the base URL at your on-premise machine:

# call a chat completion on the internal server (OpenAI-compatible)
# 11434 is Ollama's default port; vLLM listens on 8000 by default
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5:7b","messages":[{"role":"user","content":"Hello"}]}'
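
Once the endpoint answers, the next question is how it behaves under concurrent load. The sketch below is one hedged way to measure that: it fires a fixed number of requests through a thread pool and reports throughput and per-request latency. The function `measure` and its `send` parameter are hypothetical. `send` stands for any zero-argument callable that performs one request, for example a wrapper around the same POST as the curl call above. It measures whole requests, not tokens per second.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median
from typing import Callable


def measure(send: Callable[[], object], total_requests: int,
            concurrency: int) -> dict:
    """Run `send` total_requests times with `concurrency` parallel workers."""

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        send()  # one complete request/response round trip
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall_seconds = time.perf_counter() - wall_start

    return {
        "requests": total_requests,
        "concurrency": concurrency,
        "throughput_rps": total_requests / wall_seconds,
        "median_latency_s": median(latencies),
        "max_latency_s": max(latencies),
    }


# Stand-in for a real request so the sketch runs without a server.
def fake_request() -> None:
    time.sleep(0.01)


print(measure(fake_request, total_requests=8, concurrency=4))
```

Running it at several concurrency levels (for example 1, 4, 16) against the real server shows where throughput stops growing and latency starts climbing, which is the point to tune context length and batching around.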

The Namtech view

Namtech builds the serving layer for internal AI running 100% on-site on Apple Silicon (Mac Mini/Studio clusters, low power draw): starting with a simple engine to prove value, then tuning quantization, context and batching against the customer's real load. We always standardize on an OpenAI-compatible API so your internal apps aren't locked into a specific engine — swapping models or engines later stays painless. Specific speed numbers depend on hardware, model and load, so we measure directly on your system rather than promising generic figures.

Frequently asked questions

Should I start with Ollama or vLLM?

Most teams should start with Ollama: you install it and run a model with a single command, which makes it fast to prove value. When concurrent users grow and you need high throughput on GPU, consider moving to vLLM. Both expose an OpenAI-compatible API, so the apps calling in barely need edits.

Does Q4 quantization hurt the model much?

Quantization is a trade-off: the heavier the compression, the more memory you save but the more quality may gradually drop. Q4 is popular because it balances well and is usually a reasonable starting point. The size of the effect depends on the model and task — test on your own data; if it falls short on important tasks, move up to Q5/Q8.

Can I run internal AI without a GPU?

Yes. llama.cpp (and Ollama on top of llama.cpp) work well on CPU and especially on Apple Silicon thanks to unified memory. The speed and the model size you can run depend on the hardware — see the Hardware article to size a configuration by number of users.

What does "OpenAI-compatible API" mean?

It means the serving engine exposes an endpoint with the same shape as OpenAI's API (e.g., /v1/chat/completions). So internal apps call your on-premise AI exactly as they'd call OpenAI, just by changing the base URL to your server — no need to rewrite the integration.

Note: This is a general guide, updated 02/07/2026; tools and models change fast — verify the latest versions when you deploy.
