Hardware for internal AI is decided by four factors: model size (parameter count), number of concurrent users, context length, and the speed you expect. The pivotal factor is memory — RAM or VRAM must hold the model weights plus context. For most enterprises, Apple Silicon (Mac Mini/Studio, unified memory, low power draw) is a tidy starting point; NVIDIA GPUs fit when you need very high throughput or training. This is Part 2/8 in the build-your-own internal AI series.
Quick summary
- What decides it: model size × concurrent users × context length × desired speed. All four reduce to memory and throughput, with memory usually the binding constraint.
- Apple Silicon vs GPU: Apple Silicon has unified memory, so it runs large models on shared RAM with low power, small footprint and quiet operation; NVIDIA GPUs give high throughput and suit training but cost more in power, heat and space.
- Memory rule: RAM/VRAM ≈ (parameter count × bytes per parameter at the chosen quantization) + a portion for context. This is a rule of thumb, not an exact figure.
- Sizing by scale: AI Box (a small team) → AI Pro (a department) → AI Cluster (whole enterprise); the numbers below are approximate guides.
- Namtech: deploys low-power Apple Silicon Mac Mini/Studio clusters, scaling up as needs grow.
What decides your hardware needs?
Before debating "which machine to buy", answer four questions — they drive everything else:
- Model size (parameter count): a bigger model (7B, 14B, 32B, 70B…) is smarter but consumes more memory and runs slower. This is the single biggest variable.
- Concurrent users: one person asking occasionally is very different from 30 people typing at once. Many concurrent users need higher throughput (often a GPU or several machines).
- Context length: letting the AI read long documents, long conversations or many RAG passages consumes extra memory for the "context" (KV cache) — the longer the context, the more memory.
- Desired speed: is "fast enough to read along" acceptable, or do you need near-instant replies? Higher speed expectations demand more powerful hardware or a smaller model.
The key point: all four factors reduce to memory and throughput. Once you pick a model size and user count, the rest (machine type, RAM/VRAM capacity) follows fairly naturally. See how the pieces fit together in the internal AI system architecture diagram.
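The link between model size and speed can be sketched with a common back-of-envelope bound. This bound is an assumption added here, not a figure from Namtech: when generating text for a single user, each new token requires reading roughly all the weights from memory once, so tokens per second cannot exceed memory bandwidth divided by weight size. The function name and the bandwidth value are hypothetical placeholders; substitute the published figure for your machine.

```python
def max_decode_tokens_per_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on single-user generation speed for a memory-bound model.

    Assumes every generated token reads all weights once; real speed is lower.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Hypothetical machine with 400 GB/s of memory bandwidth (placeholder value).
BANDWIDTH_GB_S = 400.0

for params_b in (7, 14, 32, 70):
    weights_gb = params_b * 0.5  # Q4: about half a byte per parameter
    bound = max_decode_tokens_per_s(weights_gb, BANDWIDTH_GB_S)
    print(f"{params_b}B at Q4: at most ~{bound:.0f} tokens/s")
```

The bound halves each time the model doubles in size, which is why a smaller model or faster memory is the usual answer when replies feel slow.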
Apple Silicon or GPU?
This is the biggest hardware decision. Both run open-source models; they differ in memory architecture, power draw and the situations they suit.
| Criterion | Apple Silicon (Mac Mini/Studio) | NVIDIA GPU |
|---|---|---|
| Memory | Unified memory — CPU & GPU share it, so a single machine can load a large model if configured with high RAM | Dedicated VRAM per card — powerful, but per-card capacity is limited; large models must span multiple cards |
| Power & heat | Low power, cool, quiet — fine in a normal office | High power draw & heat, usually needs a server room/cooling |
| Multi-user throughput | Good for small–medium teams; scale by adding machines | Very high, suited to serving many concurrent users |
| Heavy training / fine-tune | Light workloads only; not its strength | Its strength — the mature CUDA ecosystem for training |
| Size & installation | Compact, plug in and go | Bulkier, needs matching power & cooling |
For most enterprise tasks (internal assistant, document Q&A, drafting, summarizing) and moderate user counts, Namtech chooses Apple Silicon Mac Mini/Studio clusters: unified memory lets a single machine load a fairly large model, low power draw means it sits in a normal office, and it scales by adding machines. When needs lean toward very high throughput or heavy training, an NVIDIA GPU is the more sensible choice.
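The memory row of the comparison can be illustrated with a minimal sketch. It assumes a hypothetical workload that needs 40 GB in total, hypothetical 24 GB cards, and a hypothetical unified-memory machine with 56 GB usable after the operating system; none of these numbers come from a specific product. It also ignores the overhead of splitting a model across cards, so treat the card count as a lower bound.

```python
import math


def units_needed(required_gb: float, capacity_per_unit_gb: float) -> int:
    """Smallest number of identical memory pools that together cover the need."""
    if capacity_per_unit_gb <= 0:
        raise ValueError("capacity_per_unit_gb must be positive")
    return max(1, math.ceil(required_gb / capacity_per_unit_gb))


REQUIRED_GB = 40.0        # hypothetical: weights + context + margin
VRAM_PER_CARD_GB = 24.0   # hypothetical dedicated-VRAM card
UNIFIED_USABLE_GB = 56.0  # hypothetical unified memory left after the OS

print("GPU cards needed:", units_needed(REQUIRED_GB, VRAM_PER_CARD_GB))
print("Unified-memory machines needed:", units_needed(REQUIRED_GB, UNIFIED_USABLE_GB))
```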
Sizing by scale
Namtech packages hardware into three tiers for easy planning. The numbers below are approximate guides, not absolute commitments — the number of users you can serve depends on model size, context length and real load; assess against your specific needs.
| Tier | For | Memory (RAM/VRAM) — guide | Storage — guide | Users — approx. |
|---|---|---|---|---|
| AI Box | A small team / a pilot room | Around tens of GB (enough for small–medium models) | SSD ~1 TB or more | Roughly a small shared team |
| AI Pro | A department | More than AI Box (larger models or longer context) | Larger SSD for multiple models + vector DB | Roughly a department, concurrent |
| AI Cluster | Whole enterprise | Multi-machine cluster, total memory scales with load | Centralized, redundant storage | Roughly whole enterprise, scaling up |
A pragmatic principle: start at the lowest tier that solves your first clear problem, measure the real impact, then expand — rather than over-buying up front. Details of the three packages are on the pricing page.
The memory rule of thumb
This is the most important part of choosing a configuration. Memory (RAM on Apple Silicon, VRAM on GPU) must hold:
- Model weights: roughly parameter count × bytes per parameter. Bytes per parameter depend on quantization, which compresses the weights to use less memory. At the common Q4 level (about half a byte per parameter), a ~7B model needs about 3.5 GB for weights (7 billion × 0.5 bytes), and real Q4 files usually come out somewhat larger; larger models (14B, 32B, 70B) need proportionally more.
- Context (KV cache): the keys and values cached for every token in context. It grows roughly linearly with context length and with the number of concurrent users.
Add the two together, then leave a safety margin for the operating system and load variation. This is a rule of thumb for a quick estimate, not an exact figure for every model — real numbers vary by architecture and quantization level, so check the specific model page on Hugging Face. Exactly which model size and quantization to pick is covered in the Model selection article.
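The rule can be written out as a short calculation. This is a sketch under stated assumptions, not an exact sizing tool: the layer count, KV-head count and head dimension below are illustrative values for a hypothetical 7B-class model (read the real ones from the model's config), the KV cache is assumed to be stored at 16 bits, every user is assumed to fill the whole context, and the 20% margin is an arbitrary placeholder. All names are hypothetical.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabyte, in bytes


@dataclass
class ModelShape:
    """Architecture values normally read from a model's config file."""
    params_billions: float
    n_layers: int
    n_kv_heads: int
    head_dim: int


def weights_gb(shape: ModelShape, bytes_per_param: float) -> float:
    """Parameter count x bytes per parameter."""
    return shape.params_billions * 1e9 * bytes_per_param / GB


def kv_cache_gb(shape: ModelShape, context_tokens: int,
                concurrent_users: int, kv_bytes: float = 2.0) -> float:
    """Worst-case KV cache: every user holds a full context."""
    # Factor 2 = one key vector plus one value vector per layer and KV head.
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * kv_bytes
    return per_token * context_tokens * concurrent_users / GB


def total_gb(shape: ModelShape, bytes_per_param: float, context_tokens: int,
             concurrent_users: int, margin: float = 0.2) -> float:
    """Weights plus context, with a safety margin on top."""
    base = weights_gb(shape, bytes_per_param)
    base += kv_cache_gb(shape, context_tokens, concurrent_users)
    return base * (1 + margin)


# Illustrative 7B-class shape (assumed values, not a specific model).
example = ModelShape(params_billions=7, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"weights : {weights_gb(example, 0.5):.1f} GB")        # Q4
print(f"KV cache: {kv_cache_gb(example, 8192, 4):.1f} GB")   # 4 users, 8k context
print(f"total   : {total_gb(example, 0.5, 8192, 4):.1f} GB")
```

With these assumed inputs the context for four users takes more memory than the Q4 weights themselves, which is why concurrency and context length matter as much as model size.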
Storage & networking
Beyond memory and processing, three often-overlooked items directly affect the experience:
- SSD: models can be many GB and must load quickly into memory at startup. Choose an SSD large enough to hold multiple models plus the vector database for RAG (indexed internal documents).
- Internal network: users call the AI server over the LAN. A stable internal network keeps responses smooth and keeps everything within the on-premise boundary — no data pushed outside.
- UPS (battery backup): so models and services don't shut down abruptly during a power cut, avoiding data corruption and disruption.
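The SSD point can be turned into two quick estimates: how much capacity to plan for, and how long a model takes to load at startup. This is a rough sketch; the model sizes, vector-database size, read speed and 30% headroom are hypothetical placeholders, and the load time ignores everything except sequential read speed.

```python
def ssd_needed_gb(model_sizes_gb: list[float], vector_db_gb: float,
                  headroom: float = 0.3) -> float:
    """Capacity for all model files plus the vector database, with headroom."""
    return (sum(model_sizes_gb) + vector_db_gb) * (1 + headroom)


def load_seconds(model_gb: float, ssd_read_gb_s: float) -> float:
    """Best-case time to read a model file from disk into memory."""
    if ssd_read_gb_s <= 0:
        raise ValueError("ssd_read_gb_s must be positive")
    return model_gb / ssd_read_gb_s


# Hypothetical setup: three models on disk and a 25 GB vector database.
models_gb = [5.0, 9.0, 20.0]
print(f"plan for at least {ssd_needed_gb(models_gb, 25.0):.0f} GB of SSD")
print(f"largest model loads in about {load_seconds(20.0, 2.0):.0f} s at 2 GB/s")
```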
How to lock the data flow inside the network and control access is covered in the Internal AI security system article.
How to check machine resources and the models you have, plus a quick memory-estimation rule:
- List installed models: ollama list shows the name and size of each model on the machine.
- Check resources: free RAM, GPU/VRAM (on Apple Silicon it's unified memory, so total RAM is the number to watch).
- Estimate memory: ≈ (parameter count × bytes per parameter at the chosen quantization) + a portion for context, then leave a safety margin. For exact numbers, check the model page on Hugging Face.
# list the models on this machine
ollama list
# rough estimate: 7B at Q4 ~ 3.5 GB of weights; larger models need more
# pull & run a small model (opens an interactive session)
ollama run qwen2.5:7b
# in a second terminal: show loaded models and the memory they occupy
ollama ps
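For the resource check, total physical memory can also be read programmatically and compared with an estimate. The sketch below assumes a POSIX system (macOS or Linux) where Python exposes the SC_PAGE_SIZE and SC_PHYS_PAGES sysconf names; the function names and the 8 GB operating-system reserve are hypothetical choices for illustration.

```python
import os


def total_ram_gb() -> float:
    """Total physical memory in decimal GB, read via POSIX sysconf."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1e9


def fits_in_ram(required_gb: float, reserve_gb: float = 8.0) -> bool:
    """True if the estimate plus an OS reserve fits in total memory."""
    return required_gb + reserve_gb <= total_ram_gb()


if __name__ == "__main__":
    estimate_gb = 9.4  # hypothetical estimate: weights + context + margin
    print(f"total RAM: {total_ram_gb():.1f} GB")
    print("estimate fits:", fits_in_ram(estimate_gb))
```

On Apple Silicon this total is the unified memory shared by CPU and GPU, so it is the figure to compare against; on a GPU server the relevant number is VRAM, which this sketch does not read.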
The Namtech view
Namtech deploys private internal AI platforms on Apple Silicon Mac Mini/Studio clusters: unified memory lets a machine load a fairly large model, low power draw means it sits in a normal office, and you scale by adding machines as load grows — instead of over-investing up front. The philosophy is start with just enough, scale gradually: pick a configuration for your first clear problem, measure the real impact, then upgrade as needed. The next step is to choose the open-source model that fits the hardware you've picked.
Frequently asked questions
Is Apple Silicon or GPU better for internal AI?
There's no absolute answer. Apple Silicon (Mac Mini/Studio) has unified memory, so it runs large models on shared RAM with low power, a small footprint and quiet operation — a fit for most enterprise tasks at moderate scale. NVIDIA GPUs give high throughput and excel at training, but cost more in power, heat and space. Namtech uses Apple Silicon clusters for most deployments.
How much memory does a 7B model need?
As a rule of thumb, a ~7B model at Q4 quantization needs about 3.5 GB for weights (7 billion parameters × about half a byte), often a little more in practice, plus a portion for context. That's a quick estimate; real numbers vary by architecture and quantization level, so check the specific model page on Hugging Face.
What configuration do I need for the whole company?
It depends on model size, concurrent users and context length. You typically start from AI Box for a team, then move to AI Pro (department) and AI Cluster (whole enterprise). User counts are only approximate guides; assess against your real load.
Do I need a dedicated server room?
With a low-power Apple Silicon cluster, usually not — the machines are compact, cool and quiet, and can sit in a normal office (a UPS is advisable). High-wattage GPUs need matching power and cooling, in which case a server room makes sense.
Want internal AI without starting from zero?
Namtech deploys private internal AI platforms — open-source models running 100% on your own infrastructure, data never leaving the organization.
Book a free consultation
Note: The memory and user-count figures in this article are rules of thumb, updated 02/07/2026; hardware and models change fast, so verify the specific configuration against your real needs when you deploy.