What is an AI token? Data units & the context window

A token is the smallest data unit an AI model (LLM) reads in and produces: usually a word fragment, a short word or a punctuation mark, rather than a whole sentence or, in many cases, even a whole word. The model doesn't "see" letters the way people do; it turns text into a sequence of tokens, each mapped to a numeric ID, and does all its math on those numbers. Understanding tokens unlocks two things that matter in practice: the context window (how much text the AI can keep in view at once) and cost (cloud AI services bill per token).

Quick summary

What a token is: the smallest chunk of text a model processes — a word fragment, a short word or a punctuation mark — each with a numeric ID.
Rule of thumb (English): 1 token ≈ 4 characters ≈ ¾ of a word, so 100 tokens ≈ 75 words.
Vietnamese uses more tokens than English because of diacritics, multi-byte accented characters and word segmentation — affecting both cost and context capacity.
The context window is "working memory": exceed the limit and old tokens are pushed out, so the model "forgets".
Tokens = money with cloud AI; running internal AI on-premise means near-fixed cost, not per-token.

What is a token?

For people, the natural unit of reading is the word. For a language model, that unit is the token — the smallest piece the model reads in and generates. A token can be a short whole word (house), a fragment of a longer word (ternational in "international"), a punctuation mark (.) or even a space. Common words tend to be a single token; rare or long words are usually split into several tokens.

The key point: the model doesn't process letters, it processes numbers. Each token maps to an integer ID in the model's token vocabulary. Your sentence becomes a sequence of IDs, the model predicts the next token ID, and the resulting sequence is translated back into text. All the "intelligence" you see is one predicted token after another.
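The text-to-IDs-to-text round trip can be sketched with a toy vocabulary. The five entries and their IDs below are invented for illustration; a real model's vocabulary is far larger and its IDs will differ.

```python
# Toy vocabulary (hypothetical): token string -> integer ID.
VOCAB = {"What": 0, " is": 1, " a": 2, " token": 3, "?": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def encode(tokens):
    """Map each token string to its integer ID."""
    return [VOCAB[token] for token in tokens]

def decode(ids):
    """Map IDs back to token strings and join them into text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode(["What", " is", " a", " token", "?"])
print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # What is a token?
```

Note that the leading space is part of the token here, which is how many real tokenizers handle word boundaries.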

How does tokenization work?

Turning text into tokens is called tokenization. Most models today use subword tokenization, most commonly a BPE (Byte Pair Encoding)-style algorithm. The idea: keep frequently-seen character fragments as single tokens, and combine those fragments to build rare words — so the model can handle words it has never seen without needing an infinite dictionary.

For example, the word "darkness" may be split into two tokens: "dark" + "ness". The stem "dark" is very common so it's one token; the suffix "ness" recurs across many words (kindness, softness…) so it gets reused. Thanks to this, a novel word like "darkishness" still breaks down into familiar pieces.
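To make the idea concrete, here is a minimal, toy sketch of BPE-style merging. It is not how any production tokenizer is implemented (real ones typically work on bytes, pre-split the text and train on huge corpora), and the training words and merge count are assumptions chosen for illustration. The exact pieces you get depend entirely on that toy training data.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Every word starts out split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = Counter({apply_merge(s, best): f for s, f in corpus.items()})
    return merges

def segment(word, merges):
    """Split a word (seen or unseen) by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Hypothetical training words; a real tokenizer trains on a large corpus.
training_words = ["dark", "darker", "darkness", "kindness", "softness", "sadness"]
merges = learn_merges(training_words, num_merges=12)

print(segment("darkness", merges))
print(segment("darkishness", merges))  # unseen word, still built from known pieces
```

The unseen word is never rejected: in the worst case it falls back to single characters, which is why subword tokenizers need no "unknown word" dictionary entry.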

For English, a handy rule of thumb: 1 token ≈ 4 characters ≈ ¾ of a word, meaning roughly 100 tokens ≈ 75 words. This is an estimate, not a hard rule.
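As a quick sketch, the rule of thumb translates into two one-line estimators. The function names are made up for this example, and the constants apply to English only; treat the results as rough guesses, not counts.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb from the text (English only)
WORDS_PER_TOKEN = 0.75   # i.e. roughly 3/4 of a word per token

def estimate_tokens_from_chars(text):
    """Rough token estimate from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_words_from_tokens(n_tokens):
    """Rough word estimate from a token count."""
    return round(n_tokens * WORDS_PER_TOKEN)

print(estimate_words_from_tokens(100))                     # 75
print(estimate_tokens_from_chars("What is an AI token?"))  # 20 characters -> 5
```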

Vietnamese typically uses more tokens than English for the same amount of content. The reasons: accented characters (tone marks, multi-byte Unicode characters), and Vietnamese word segmentation that doesn't match a tokenizer optimized for English, so many syllables get split into several small tokens. The practical consequence: for the same text, the Vietnamese version uses more tokens → higher cost (with cloud AI) and a larger share of the context window.
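The "multi-byte" part is easy to check with the standard library alone. The sketch below compares character counts with UTF-8 byte counts; the sample strings are arbitrary. Bytes are not tokens, but tokenizers that start from bytes have more raw material to merge when a text is full of accented characters.

```python
def utf8_profile(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

# Escapes are used so the strings are precomposed characters regardless of editor.
print(utf8_profile("a"))                    # (1, 1)  plain ASCII letter
print(utf8_profile("\u00e1"))               # (1, 2)  a with acute accent
print(utf8_profile("\u1ebf"))               # (1, 3)  e with circumflex and acute
print(utf8_profile("language"))             # (8, 8)
print(utf8_profile("ti\u1ebfng Vi\u1ec7t"))  # (10, 14) "tieng Viet" with its diacritics
```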

The flow: text → tokenizer (BPE) → tokens with IDs → held in the context window → model → answer. Diagram: Namtech.

What is the context window?

The context window is the maximum number of tokens a model can "see" at once — including both what you feed in and what it is generating. Think of it as working memory: anything inside the window, the model still "remembers"; anything beyond it effectively doesn't exist.

When a conversation or document grows past the limit, something has to give. Most chat applications drop or truncate the oldest tokens to make room for new ones, so the model genuinely "forgets" the beginning (some APIs reject the over-long request instead). That's why a chatbot sometimes loses track of what you said early in a long conversation.
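One simple way an application can enforce the limit is a sliding window that keeps only the most recent tokens. This is a minimal sketch with a made-up function name and a tiny window size; real systems often protect the system prompt and truncate at message boundaries instead.

```python
def fit_to_window(token_ids, max_tokens):
    """Keep only the most recent `max_tokens` tokens; older ones are dropped."""
    if max_tokens <= 0:
        return []
    return token_ids[-max_tokens:]

history = list(range(1, 11))        # stand-in for 10 token IDs, oldest first
window = fit_to_window(history, 4)  # hypothetical 4-token context window
print(window)                       # [7, 8, 9, 10]
```

Tokens 1 to 6 are simply gone from the model's point of view, which is the "forgetting" described above.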

Models today have very different context windows, ranging from tens of thousands of tokens up to around a million tokens depending on the model. These numbers change quickly over time and across versions, so when you deploy you should look up the specific limit of the model you use rather than memorizing a fixed figure.

Tokens & cost

With cloud AI services, the token is the billing unit: you pay for the tokens you send in (input) and the tokens generated (output), and many providers price output tokens higher than input tokens. So the longer your prompt, the more documents you paste, and the longer the answer, the higher the bill. Rates also differ from model to model.

One thing that's easy to miss: "reasoning" / "thinking" tokens (when a model writes out a chain of reasoning before answering) use noticeably more resources than answering directly — because the model generates many intermediate tokens you don't see but are still counted.
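The arithmetic behind a per-request bill can be sketched as follows. The prices are placeholders, not any provider's real rates, and the sketch assumes reasoning tokens are billed at the output rate; check your provider's pricing page for how they are actually counted.

```python
def request_cost(input_tokens, output_tokens, reasoning_tokens=0,
                 input_price_per_m=1.00, output_price_per_m=4.00):
    """Cost of one request in currency units.

    Prices are per million tokens and purely hypothetical.
    Assumption: hidden reasoning tokens are billed like output tokens.
    """
    billed_output = output_tokens + reasoning_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000

direct = request_cost(input_tokens=2_000, output_tokens=500)
with_reasoning = request_cost(input_tokens=2_000, output_tokens=500,
                              reasoning_tokens=3_000)
print(f"{direct:.4f}")          # 0.0040
print(f"{with_reasoning:.4f}")  # 0.0160
```

With these made-up numbers, the same visible answer costs four times as much once 3,000 hidden reasoning tokens are added.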

This is where internal AI is fundamentally different: when the model runs on-premise on your own infrastructure, cost is nearly fixed (hardware + power + operations) regardless of how many tokens you use. No matter how heavily you use it, the bill doesn't jump per token. See also: Building your own internal AI — overview & roadmap.
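A rough way to compare the two models is a break-even volume: the monthly token count at which per-token billing would equal the fixed on-premise cost. Both figures below are invented for illustration, and the sketch ignores hardware throughput limits and differences in model quality.

```python
def break_even_tokens_per_month(fixed_monthly_cost, blended_price_per_m):
    """Monthly token volume at which cloud spend equals the fixed cost.

    `blended_price_per_m` is a hypothetical average price per million
    tokens across input and output.
    """
    return fixed_monthly_cost / blended_price_per_m * 1_000_000

# Hypothetical: 1,500 per month for hardware, power and operations,
# versus a blended cloud price of 3.00 per million tokens.
volume = break_even_tokens_per_month(1_500, 3.00)
print(f"{volume:,.0f} tokens per month")  # 500,000,000 tokens per month
```

Below that volume the cloud bill is lower on cost alone; above it the fixed-cost setup is cheaper, under these assumptions.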

How to optimize token usage

Whether you use cloud AI (to save money) or internal AI (to save context and speed), spending tokens wisely always helps:

Tight, clear prompts: drop the filler, write concise requests — fewer input tokens while still conveying enough.
Use RAG instead of dumping the whole knowledge base: load only the relevant passages into context rather than pasting a long document. See RAG over internal documents.
Summarize long conversations: for long-running chats, replace old turns with a short summary to save tokens while keeping continuity.
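The third technique can be sketched as a small history compactor: keep the newest turns that fit a token budget and replace everything older with a summary. The function names, the budget and the stub summarizer are assumptions; in practice the summary would come from a model call, and token counts from the real tokenizer rather than the 4-characters rule used here.

```python
import math

def estimate_tokens(text):
    """Rough estimate (about 4 characters per token); swap in a real tokenizer."""
    return math.ceil(len(text) / 4)

def compact_history(turns, budget, summarize):
    """Keep the newest turns that fit in `budget` tokens; summarize the rest."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()                        # restore chronological order
    older = turns[: len(turns) - len(kept)]
    if older:
        return [summarize(older)] + kept
    return kept

def stub_summarize(older_turns):
    """Placeholder summarizer; a real one would call a model."""
    return f"[summary of {len(older_turns)} earlier turns]"

turns = ["a" * 40, "b" * 40, "c" * 40]    # about 10 tokens each
compacted = compact_history(turns, budget=20, summarize=stub_summarize)
print(compacted[0])                        # [summary of 1 earlier turns]
print(len(compacted))                      # 3
```

The summary itself also consumes tokens, so a production version would reserve part of the budget for it.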

For the IT team

To know how many tokens a piece of text uses, don't guess — count with the model's own tokenizer. Two common tools:

OpenAI tiktoken: a Python library that counts tokens using the exact encoding of each model.
Hugging Face tokenizers: load the tokenizer of an open-source model (Qwen, Llama…) to count for internal AI.

Quick count with tiktoken:

# pip install tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4 / GPT-3.5-turbo; newer models may use another (e.g. o200k_base)
tokens = enc.encode("What is an AI token?")
print(len(tokens))       # number of tokens in the string
print(tokens)            # the numeric ID of each token

Tip: try the same sentence in English and Vietnamese to see for yourself that Vietnamese uses more tokens.
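For open-source models, a similar count can be sketched with the Hugging Face `tokenizers` library. The model ID below is only an example of a Hub identifier and is an assumption, not a recommendation; loading it downloads the tokenizer file, so it needs network access the first time.

```python
# pip install tokenizers

def count_tokens_hf(text, model_id="Qwen/Qwen2.5-7B-Instruct"):
    """Count tokens with the tokenizer published for an open-source model.

    `model_id` is an example Hugging Face Hub identifier (assumption);
    replace it with the model you actually deploy.
    """
    # Imported lazily so this file still loads where the library is absent.
    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_pretrained(model_id)
    # Special tokens (e.g. begin-of-sequence) are excluded to count the text only.
    encoding = tokenizer.encode(text, add_special_tokens=False)
    return len(encoding.ids)
```

Calling `count_tokens_hf("What is an AI token?")` and then the same sentence in Vietnamese gives two counts to compare directly. Different models ship different tokenizers, so the numbers will not match the tiktoken result.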

The Namtech view

For Vietnamese enterprises, tokens aren't just a technical concept — they directly affect cost and the ability to handle long documents, especially since Vietnamese content is more token-heavy. Namtech deploys private internal AI platforms running 100% on-site on Apple Silicon with open-source models: you don't pay per token, you have full control over context capacity, and data never leaves the organization. Understanding tokens is the first step to using AI efficiently — whether you choose cloud, internal, or a hybrid.

Frequently asked questions

Is a token the same as a word?

Not exactly. A token is the smallest piece a model processes — it can be a short word, a fragment of a longer word, a punctuation mark or a space. In English, a token averages about ¾ of a word; a long or rare word may be split into several tokens.

Why does Vietnamese use more tokens than English?

Because Vietnamese text has diacritics (multi-byte Unicode characters) and word segmentation that doesn't match a tokenizer optimized for English, so many syllables get split into several small tokens. For the same content, the Vietnamese version usually uses more tokens — costing more and consuming more context.

What happens when the context window fills up?

The oldest tokens are pushed out of the window to make room for new ones, and the model "forgets" that part. So very long conversations can make the model miss what you said at the start; the fix is to summarize or use RAG to reload just what's needed.

How do I count the tokens in a piece of text?

Use the model's own tokenizer: tiktoken for OpenAI models, or Hugging Face tokenizers for open-source models. Rule-of-thumb estimates (1 token ≈ 4 characters) are only for quick reference; for accuracy you must run the real tokenizer.

Want AI to handle long documents without token bills?

Namtech deploys private internal AI platforms — open-source models running 100% on your own infrastructure, at fixed cost, with data never leaving the organization.


Note: This article explains concepts, updated 02/07/2026; context-window limits and token pricing change fast across models — check the official documentation of the model you use.
