On-premise hardware for internal AI: how to choose

Hardware for internal AI is decided by four factors: model size (parameter count), number of concurrent users, context length, and the speed you expect. The pivotal factor is memory — RAM or VRAM must hold the model weights plus context. For most enterprises, Apple Silicon (Mac Mini/Studio, unified memory, low power draw) is a tidy starting point; NVIDIA GPUs fit when you need very high throughput or training. This is Part 2/8 in the build-your-own internal AI series.

Quick summary

What decides it: model size × concurrent users × context length × desired speed — all of which reduce to a memory requirement.
Apple Silicon vs GPU: Apple Silicon has unified memory, so it runs large models on shared RAM with low power, small footprint and quiet operation; NVIDIA GPUs give high throughput and suit training but cost more in power, heat and space.
Memory rule: RAM/VRAM ≈ (parameter count × bytes per parameter at the chosen quantization) + a portion for context. This is a rule of thumb, not an exact figure.
Sizing by scale: AI Box (a small team) → AI Pro (a department) → AI Cluster (whole enterprise); the numbers below are approximate guides.
Namtech: deploys low-power Apple Silicon Mac Mini/Studio clusters, scaling up as needs grow.

What decides your hardware needs?

Before debating "which machine to buy", answer four questions — they drive everything else:

Model size (parameter count): a bigger model (7B, 14B, 32B, 70B…) is smarter but consumes more memory and runs slower. This is the single biggest variable.
Concurrent users: one person asking occasionally is very different from 30 people typing at once. Many concurrent users need higher throughput (often a GPU or several machines).
Context length: letting the AI read long documents, long conversations or many RAG passages consumes extra memory for the "context" (KV cache) — the longer the context, the more memory.
Desired speed: is "fast enough to read along" acceptable, or do you need near-instant replies? Higher speed expectations demand more powerful hardware or a smaller model.

The key point: all four factors reduce to memory and throughput. Once you pick a model size and user count, the rest (machine type, RAM/VRAM capacity) follows fairly naturally. See how the pieces fit together in the internal AI system architecture diagram.
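To make the link between context length, concurrent users and memory concrete, here is a minimal sketch of the standard KV-cache estimate: keys and values stored per layer, per KV head, per token. The architecture numbers (28 layers, 4 KV heads, head size 128) and the 16-bit cache are illustrative assumptions, not figures from this article; read the real values from the model's configuration on its model page. It also assumes every user fills the whole context, which is a worst case.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   concurrent_sequences=1, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    The factor 2 covers keys and values; bytes_per_value=2 assumes a
    16-bit cache. Every sequence is assumed to use the full context.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_sequences

# Illustrative architecture values (assumed, check the actual model config).
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128

one_user = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context_tokens=8192)
ten_users = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context_tokens=8192,
                           concurrent_sequences=10)

print(f"1 user, 8k context:   {one_user / GIB:.2f} GiB")
print(f"10 users, 8k context: {ten_users / GIB:.2f} GiB")
```

The cache grows linearly in both context length and number of concurrent sequences, which is why user count and context length end up as memory questions.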

Apple Silicon or GPU?

This is the biggest hardware decision. Both run open-source models; they differ in memory architecture, power draw and the situations they suit.

Criterion	Apple Silicon (Mac Mini/Studio)	NVIDIA GPU
Memory	Unified memory — CPU & GPU share it, so a single machine can load a large model if configured with high RAM	Dedicated VRAM per card — powerful, but per-card capacity is limited; large models must span multiple cards
Power & heat	Low power, cool, quiet — fine in a normal office	High power draw & heat, usually needs a server room/cooling
Multi-user throughput	Good for small–medium teams; scale by adding machines	Very high, suited to serving many concurrent users
Heavy training / fine-tune	Light workloads only; not its strength	Its strength — the mature CUDA ecosystem for training
Size & installation	Compact, plug in and go	Bulkier, needs matching power & cooling

For most enterprise tasks (internal assistant, document Q&A, drafting, summarizing) and moderate user counts, Namtech chooses Apple Silicon Mac Mini/Studio clusters: unified memory lets a single machine load a fairly large model, low power draw means it sits in a normal office, and it scales by adding machines. When needs lean toward very high throughput or heavy training, an NVIDIA GPU is the more sensible choice.

Sizing by scale

Namtech packages hardware into three tiers for easy planning. The numbers below are approximate guides, not absolute commitments — the number of users you can serve depends on model size, context length and real load; assess against your specific needs.

Tier	For	Memory (RAM/VRAM) — guide	Storage — guide	Users — approx.
AI Box	A small team / a pilot room	Around tens of GB (enough for small–medium models)	SSD ~1 TB or more	Roughly a small shared team
AI Pro	A department	More than AI Box (larger models or longer context)	Larger SSD for multiple models + vector DB	Roughly a department, concurrent
AI Cluster	Whole enterprise	Multi-machine cluster, total memory scales with load	Centralized, redundant storage	Roughly whole enterprise, scaling up

A pragmatic principle: start at the lowest tier that solves your first clear problem, measure the real impact, then expand — rather than over-buying up front. Details of the three packages are on the pricing page.

Three internal AI hardware tiers scaling up — user counts are approximate guides. Diagram: Namtech.

The memory rule of thumb

This is the most important part of choosing a configuration. Memory (RAM on Apple Silicon, VRAM on GPU) must hold:

Model weights: roughly parameter count × bytes per parameter. Bytes per parameter depend on quantization, which compresses the weights to use less memory. At the common Q4 level (about half a byte per parameter), a ~7B model works out to about 3.5 GB of weights by this formula, and real Q4 files usually land a little higher, around 4 to 5 GB; larger models (14B, 32B, 70B) need proportionally more.
Context (KV cache): the memory for context — the longer the context or the more concurrent users, the more this grows.

Add the two together, then leave a safety margin for the operating system and load variation. This is a rule of thumb for a quick estimate, not an exact figure for every model — real numbers vary by architecture and quantization level, so check the specific model page on Hugging Face. Exactly which model size and quantization to pick is covered in the Model selection article.
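The rule above can be sketched as a small calculator. This is a rough estimate under stated assumptions, not an exact sizing tool: the function name, the 1 GB context allowance and the 20% safety margin are placeholder values chosen for illustration, and should be replaced with figures measured on your own workload.

```python
def estimate_memory_gb(params_billion, bits_per_param=4.0,
                       context_gb=1.0, margin=0.2):
    """Rule-of-thumb memory need in GB.

    weights = parameters x bytes per parameter
    total   = (weights + context) x (1 + safety margin)
    context_gb and margin are assumed placeholders, not measured values.
    """
    # params_billion * 1e9 params * (bits / 8) bytes / 1e9 bytes per GB
    weights_gb = params_billion * bits_per_param / 8
    return (weights_gb + context_gb) * (1 + margin)

# Same assumptions applied to the model sizes mentioned in the article.
for size in (7, 14, 32, 70):
    print(f"{size}B at 4-bit: about {estimate_memory_gb(size):.1f} GB")
```

Real quantized files often use slightly more than 4 bits per parameter, so passing a value such as `bits_per_param=4.5` gives a more cautious figure.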

Storage & networking

Beyond memory and processing, three often-overlooked items directly affect the experience:

SSD: models can be many GB and must load quickly into memory at startup. Choose an SSD large enough to hold multiple models plus the vector database for RAG (the indexed internal documents).
Internal network: users call the AI server over the LAN. A stable internal network keeps responses smooth and keeps everything within the on-premise boundary — no data pushed outside.
UPS (battery backup): so models and services don't shut down abruptly during a power cut, avoiding data corruption and disruption.
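For the SSD item, a quick check of whether the planned models and the vector database fit on a disk could look like the sketch below. The model sizes, the 50 GB vector database and the 20% headroom are hypothetical numbers for illustration only; `shutil.disk_usage` is the standard-library call that reports free space.

```python
import shutil

def fits_on_disk(model_sizes_gb, vector_db_gb, free_gb, headroom=0.2):
    """Return (fits, needed_gb) for a set of models plus a vector DB.

    headroom reserves extra space for logs, updates and index growth;
    the 20% default is an assumption, not a recommendation.
    """
    needed_gb = (sum(model_sizes_gb) + vector_db_gb) * (1 + headroom)
    return needed_gb <= free_gb, needed_gb

# Free space on the root volume, in decimal GB.
free_gb = shutil.disk_usage("/").free / 1e9

# Hypothetical plan: three models and a 50 GB vector database.
ok, needed = fits_on_disk([4.7, 9.0, 20.0], vector_db_gb=50.0, free_gb=free_gb)
print(f"Need about {needed:.0f} GB, have {free_gb:.0f} GB free, fits: {ok}")
```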

How to lock the data flow inside the network and control access is covered in the Internal AI security system article.

For the IT team

How to check machine resources and the models you have, plus a quick memory-estimation rule:

List installed models: ollama list — shows the name and size of each model on the machine.
Check resources: free RAM, GPU/VRAM (on Apple Silicon it's unified memory, so total RAM is the number to watch).
Estimate memory: ≈ (parameter count × bytes per parameter at the chosen quantization) + a portion for context, then leave a safety margin. For exact numbers, check the model page on Hugging Face.

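# list the models on this machine
ollama list
# rough estimate: 7B at Q4 ~ 3.5 GB of weights by the formula; larger models need more
# pull & run a small model to measure real memory use
ollama run qwen2.5:7b
# in a second terminal: show loaded models and the memory each one occupies
ollama ps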
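To total up what the installed models occupy, the text printed by ollama list can be parsed with a few lines of Python. This is a sketch that assumes the output is a header row followed by rows whose columns (name, ID, size, modified) are separated by two or more spaces; the sample rows below are made up for illustration and the layout may differ between versions, so check it against your own output.

```python
import re

UNIT_TO_GB = {"KB": 1e-6, "MB": 1e-3, "GB": 1.0, "TB": 1e3}

def parse_model_sizes(listing):
    """Return {model name: size in GB} from `ollama list` style text.

    Assumes columns NAME, ID, SIZE, MODIFIED separated by 2+ spaces.
    """
    sizes = {}
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        columns = re.split(r"\s{2,}", line.strip())
        if len(columns) < 3:
            continue  # not a model row
        value, unit = columns[2].split()
        sizes[columns[0]] = float(value) * UNIT_TO_GB[unit]
    return sizes

# Made-up sample in the assumed layout (names and IDs are hypothetical).
SAMPLE = """\
NAME            ID              SIZE      MODIFIED
example-a:7b    000000000000    4.7 GB    2 weeks ago
example-b:1b    111111111111    800 MB    3 days ago
"""

sizes = parse_model_sizes(SAMPLE)
print(f"{len(sizes)} models, {sum(sizes.values()):.1f} GB on disk")
```

On a real machine the input could come from `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout`. Note that these are file sizes on disk; memory in use also includes the context.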

The Namtech view

Namtech deploys private internal AI platforms on Apple Silicon Mac Mini/Studio clusters: unified memory lets a machine load a fairly large model, low power draw means it sits in a normal office, and you scale by adding machines as load grows instead of over-investing up front. The philosophy is to start with just enough and scale gradually: pick a configuration for your first clear problem, measure the real impact, then upgrade as needed. The next step is to choose the open-source model that fits the hardware you've picked.

Frequently asked questions

Is Apple Silicon or GPU better for internal AI?

There's no absolute answer. Apple Silicon (Mac Mini/Studio) has unified memory, so it runs large models on shared RAM with low power, a small footprint and quiet operation — a fit for most enterprise tasks at moderate scale. NVIDIA GPUs give high throughput and excel at training, but cost more in power, heat and space. Namtech uses Apple Silicon clusters for most deployments.

How much memory does a 7B model need?

As a rule of thumb, a ~7B model at Q4 quantization (about half a byte per parameter) needs roughly 3.5 GB for weights by the formula, typically 4 to 5 GB in practice, plus a portion for context. That's a quick estimate: real numbers vary by architecture and quantization level, so check the specific model page on Hugging Face.

What configuration do I need for the whole company?

It depends on model size, concurrent users and context length. You typically start from AI Box for a team, then move to AI Pro (department) and AI Cluster (whole enterprise). User counts are only approximate guides; assess against your real load.

Do I need a dedicated server room?

With a low-power Apple Silicon cluster, usually not — the machines are compact, cool and quiet, and can sit in a normal office (a UPS is advisable). High-wattage GPUs need matching power and cooling, in which case a server room makes sense.

← Previous · Part 1/8: Overview & roadmap · Next · Part 3/8: Model selection →

Want internal AI without starting from zero?

Namtech deploys private internal AI platforms — open-source models running 100% on your own infrastructure, data never leaving the organization.

Note: The memory and user-count figures in this article are rules of thumb, updated 02/07/2026; hardware and models change fast — verify the specific configuration against your real needs when you deploy.
