On-premise hardware for internal AI: how to choose

Choosing on-premise hardware for internal AI: Apple Silicon and GPU

Hardware for internal AI is decided by four factors: model size (parameter count), number of concurrent users, context length, and the speed you expect. The pivotal factor is memory — RAM or VRAM must hold the model weights plus context. For most enterprises, Apple Silicon (Mac Mini/Studio, unified memory, low power draw) is a tidy starting point; NVIDIA GPUs fit when you need very high throughput or training. This is Part 2/8 in the build-your-own internal AI series.

Quick summary

  • What decides it: model size × concurrent users × context length × desired speed — all of which reduce to a memory requirement.
  • Apple Silicon vs GPU: Apple Silicon has unified memory, so it runs large models on shared RAM with low power, small footprint and quiet operation; NVIDIA GPUs give high throughput and suit training but cost more in power, heat and space.
  • Memory rule: RAM/VRAM ≈ (parameter count × bytes per parameter at the chosen quantization) + a portion for context. This is a rule of thumb, not an exact figure.
  • Sizing by scale: AI Box (a small team) → AI Pro (a department) → AI Cluster (whole enterprise); the numbers below are approximate guides.
  • Namtech: deploys low-power Apple Silicon Mac Mini/Studio clusters, scaling up as needs grow.

What decides your hardware needs?

Before debating "which machine to buy", answer four questions — they drive everything else:

  • Model size (parameter count): a bigger model (7B, 14B, 32B, 70B…) is smarter but consumes more memory and runs slower. This is the single biggest variable.
  • Concurrent users: one person asking occasionally is very different from 30 people typing at once. Many concurrent users need higher throughput (often a GPU or several machines).
  • Context length: letting the AI read long documents, long conversations or many RAG passages consumes extra memory for the "context" (KV cache) — the longer the context, the more memory.
  • Desired speed: is "fast enough to read along" acceptable, or do you need near-instant replies? Higher speed expectations demand more powerful hardware or a smaller model.

The key point: all four factors reduce to memory and throughput. Once you pick a model size and user count, the rest (machine type, RAM/VRAM capacity) follows fairly naturally. See how the pieces fit together in the internal AI system architecture diagram.
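To make the context-length factor concrete, here is a minimal Python sketch of the usual KV-cache estimate for a standard transformer: keys and values are stored for every layer, and each token takes n_kv_heads × head_dim values in each. The layer and head counts below are hypothetical placeholders rather than any particular model's configuration, so read the real values from the model's config file. The sketch also assumes a 16-bit cache and that every concurrent user fills the whole context window, which makes it an upper-bound style estimate.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, concurrent_users: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size in bytes for a standard transformer.

    The leading factor 2 covers the key tensor and the value tensor.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_users


# Hypothetical model shape (assumption, not taken from a real model card).
one_user = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          context_tokens=8192)
ten_users = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                           context_tokens=8192, concurrent_users=10)

print(f"1 user:   {one_user / 2**30:.2f} GiB")
print(f"10 users: {ten_users / 2**30:.2f} GiB")
```

The estimate grows linearly with both context length and user count, which is why those two factors end up as a memory requirement alongside model size.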

Apple Silicon or GPU?

This is the biggest hardware decision. Both run open-source models; they differ in memory architecture, power draw and the situations they suit.

| Criterion | Apple Silicon (Mac Mini/Studio) | NVIDIA GPU |
|---|---|---|
| Memory | Unified memory: CPU and GPU share it, so a single machine can load a large model if configured with high RAM | Dedicated VRAM per card: powerful, but per-card capacity is limited; large models must span multiple cards |
| Power & heat | Low power, cool, quiet; fine in a normal office | High power draw and heat, usually needs a server room/cooling |
| Multi-user throughput | Good for small to medium teams; scale by adding machines | Very high, suited to serving many concurrent users |
| Heavy training / fine-tuning | Light workloads only; not its strength | Its strength: the mature CUDA ecosystem for training |
| Size & installation | Compact, plug in and go | Bulkier, needs matching power and cooling |

For most enterprise tasks (internal assistant, document Q&A, drafting, summarizing) and moderate user counts, Namtech chooses Apple Silicon Mac Mini/Studio clusters: unified memory lets a single machine load a fairly large model, low power draw means it sits in a normal office, and it scales by adding machines. When needs lean toward very high throughput or heavy training, an NVIDIA GPU is the more sensible choice.
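The memory difference between the two options can be turned into a quick check. The Python sketch below counts how many devices a model would have to span when each device offers a fixed amount of memory. The 40 GB requirement, the 24 GB and 64 GB capacities, and the 10% reserve are illustrative assumptions, not vendor specifications or Namtech figures, and the sketch ignores the overhead of actually splitting a model across devices.

```python
import math


def devices_needed(required_gb: float, memory_per_device_gb: float,
                   reserve_fraction: float = 0.10) -> int:
    """Number of devices needed to hold `required_gb` of weights plus context.

    `reserve_fraction` is the share of each device kept free for the
    operating system, drivers and load variation (an assumed value).
    """
    usable_gb = memory_per_device_gb * (1.0 - reserve_fraction)
    return max(1, math.ceil(required_gb / usable_gb))


required_gb = 40.0  # hypothetical total for weights + context

print("24 GB devices:", devices_needed(required_gb, 24.0))
print("64 GB devices:", devices_needed(required_gb, 64.0))
```

A result of 1 corresponds to the single-machine case; anything higher means the model must be split, which adds setup and communication cost that this estimate does not capture.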

Sizing by scale

Namtech packages hardware into three tiers for easy planning. The numbers below are approximate guides, not absolute commitments — the number of users you can serve depends on model size, context length and real load; assess against your specific needs.

| Tier | For | Memory (RAM/VRAM), guide | Storage, guide | Users, approx. |
|---|---|---|---|---|
| AI Box | A small team / a pilot room | Around tens of GB (enough for small to medium models) | SSD ~1 TB or more | Roughly a small shared team |
| AI Pro | A department | More than AI Box (larger models or longer context) | Larger SSD for multiple models + vector DB | Roughly a department, concurrent |
| AI Cluster | Whole enterprise | Multi-machine cluster, total memory scales with load | Centralized, redundant storage | Roughly whole enterprise, scaling up |

A pragmatic principle: start at the lowest tier that solves your first clear problem, measure the real impact, then expand — rather than over-buying up front. Details of the three packages are on the pricing page.

[Figure: Scale up with your size: start small, add machines when needed. Step 1, AI Box: one machine, small team (approx. a shared team). Step 2, AI Pro: stronger machine, department (approx. a department). Step 3, AI Cluster: multi-machine, whole organization (approx. whole enterprise). User count and load increase from left to right.]
Three internal AI hardware tiers scaling up — user counts are approximate guides. Diagram: Namtech.

The memory rule of thumb

This is the most important part of choosing a configuration. Memory (RAM on Apple Silicon, VRAM on GPU) must hold:

  • Model weights: roughly parameter count × bytes per parameter. Bytes per parameter depend on quantization, which compresses the weights to use less memory. At the common Q4 level (about half a byte per parameter), a ~7B model needs roughly 7 billion × 0.5 bytes ≈ 3.5 GB for weights; larger models (14B, 32B, 70B) need proportionally more.
  • Context (KV cache): the memory for context — the longer the context or the more concurrent users, the more this grows.

Add the two together, then leave a safety margin for the operating system and load variation. This is a rule of thumb for a quick estimate, not an exact figure for every model — real numbers vary by architecture and quantization level, so check the specific model page on Hugging Face. Exactly which model size and quantization to pick is covered in the Model selection article.
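The rule above can be written as a few lines of Python for quick what-if comparisons. This is a sketch under stated assumptions: the bytes-per-parameter values are nominal (2 for 16-bit, 1 for 8-bit, 0.5 for 4-bit), the 1 GB context allowance and 20% safety margin are placeholder choices, and the function name is made up for illustration. Real model files and runtimes will differ, so treat the output as a first pass before checking the model page.

```python
# Nominal bytes per parameter for each precision (assumed round values).
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}


def estimate_memory_gb(params_billion: float, quant: str = "q4",
                       context_gb: float = 1.0, margin: float = 0.2) -> float:
    """Rule-of-thumb memory need in GB: (weights + context) plus a margin.

    Billions of parameters times bytes per parameter gives GB directly.
    """
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return (weights_gb + context_gb) * (1.0 + margin)


for size in (7, 14, 32, 70):
    print(f"{size}B at q4: ~{estimate_memory_gb(size):.1f} GB")

# Same 7B model at different precisions.
for quant in ("fp16", "q8", "q4"):
    print(f"7B at {quant}: ~{estimate_memory_gb(7, quant):.1f} GB")
```

Raising `context_gb` for long documents or many concurrent users shows how quickly the context share can rival the weights themselves.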

Storage & networking

Beyond memory and processing, three often-overlooked items directly affect the experience:

  • SSD: models can be many GB and must load quickly into memory at startup. Choose an SSD large enough to hold multiple models plus the vector database for RAG (indexed internal documents).
  • Internal network: users call the AI server over the LAN. A stable internal network keeps responses smooth and keeps everything within the on-premise boundary — no data pushed outside.
  • UPS (battery backup): so models and services don't shut down abruptly during a power cut, avoiding data corruption and disruption.
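For the SSD item, a rough disk budget can be sketched in Python by adding the model files to the raw size of the RAG vectors and keeping part of the disk free. Everything numeric here is an assumption for illustration: the model file sizes, the chunk count, the embedding dimension, the 1.5× index overhead and the 30% free-space target. Actual vector database footprints depend on the product and index type.

```python
def ssd_budget_gb(model_file_sizes_gb, n_chunks: int, embedding_dim: int,
                  bytes_per_value: int = 4, index_overhead: float = 1.5,
                  free_fraction: float = 0.3) -> float:
    """Rough SSD capacity in GB for model files plus a RAG vector index.

    Vectors are assumed to be stored as 32-bit floats; `index_overhead`
    is a guessed multiplier for index structures and metadata.
    """
    models_gb = sum(model_file_sizes_gb)
    vectors_gb = n_chunks * embedding_dim * bytes_per_value * index_overhead / 1e9
    return (models_gb + vectors_gb) / (1.0 - free_fraction)


# Hypothetical deployment: three model files and one million document chunks.
needed = ssd_budget_gb([5.0, 9.0, 20.0], n_chunks=1_000_000, embedding_dim=1024)
print(f"Suggested SSD capacity: ~{needed:.0f} GB")
```

With these placeholder inputs the model files dominate the total, which is typical when only a few million chunks are indexed.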

How to lock the data flow inside the network and control access is covered in the Internal AI security system article.

For the IT team

How to check machine resources and the models you have, plus a quick memory-estimation rule:

  • List installed models: ollama list — shows the name and size of each model on the machine.
  • Check resources: free RAM, GPU/VRAM (on Apple Silicon it's unified memory, so total RAM is the number to watch).
  • Estimate memory: ≈ (parameter count × bytes per parameter at the chosen quantization) + a portion for context, then leave a safety margin. For exact numbers, check the model page on Hugging Face.
# list the models on this machine
ollama list
# rough estimate: 7B at Q4 ~ a few GB; larger models need more
# pull & run a small model to measure real memory use
ollama run qwen2.5:7b
# in a second terminal: show the loaded models and the memory each one occupies
ollama ps
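For teams that prefer a script to the CLI, the same inventory can be read from Ollama's local HTTP API. The Python sketch below assumes the Ollama server is running on its default address (http://localhost:11434) and uses the /api/tags endpoint, which lists locally installed models with their size in bytes. The helper names and the sample payload are made up for illustration; the sample only mimics the fields the sketch reads.

```python
import json
from urllib.request import urlopen


def fetch_installed_models(base_url: str = "http://localhost:11434") -> dict:
    """Query /api/tags, the HTTP counterpart of `ollama list`."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


def model_sizes_gb(payload: dict) -> list[tuple[str, float]]:
    """Return (model name, size in GB) pairs, largest model first."""
    rows = [(m["name"], m["size"] / 1e9) for m in payload.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical payload for offline illustration (names and sizes are made up).
sample = {"models": [
    {"name": "small-model:latest", "size": 2_000_000_000},
    {"name": "large-model:latest", "size": 5_000_000_000},
]}

for name, size_gb in model_sizes_gb(sample):
    print(f"{name:<24} {size_gb:6.1f} GB")

# Against a live server: model_sizes_gb(fetch_installed_models())
```

The on-disk sizes reported here are a lower bound for memory planning, since the context cache and runtime overhead come on top once a model is loaded.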

The Namtech view

Namtech deploys private internal AI platforms on Apple Silicon Mac Mini/Studio clusters: unified memory lets a machine load a fairly large model, low power draw means it sits in a normal office, and you scale by adding machines as load grows — instead of over-investing up front. The philosophy is start with just enough, scale gradually: pick a configuration for your first clear problem, measure the real impact, then upgrade as needed. The next step is to choose the open-source model that fits the hardware you've picked.

Frequently asked questions

Is Apple Silicon or GPU better for internal AI?

There's no absolute answer. Apple Silicon (Mac Mini/Studio) has unified memory, so it runs large models on shared RAM with low power, a small footprint and quiet operation — a fit for most enterprise tasks at moderate scale. NVIDIA GPUs give high throughput and excel at training, but cost more in power, heat and space. Namtech uses Apple Silicon clusters for most deployments.

How much memory does a 7B model need?

As a rule of thumb, a ~7B model at Q4 quantization needs a few GB for weights (about 3.5 GB at half a byte per parameter), plus a portion for context. That's a quick estimate; real numbers vary by architecture and quantization level, so check the specific model page on Hugging Face.

What configuration do I need for the whole company?

It depends on model size, concurrent users and context length. You typically start from AI Box for a team, then move to AI Pro (department) and AI Cluster (whole enterprise). User counts are only approximate guides; assess against your real load.

Do I need a dedicated server room?

With a low-power Apple Silicon cluster, usually not — the machines are compact, cool and quiet, and can sit in a normal office (a UPS is advisable). High-wattage GPUs need matching power and cooling, in which case a server room makes sense.

Want internal AI without starting from zero?

Namtech deploys private internal AI platforms — open-source models running 100% on your own infrastructure, data never leaving the organization.


Note: The memory and user-count figures in this article are rules of thumb, updated 02/07/2026; hardware and models change fast — verify the specific configuration against your real needs when you deploy.
