Nuface Blog

Token/s and Concurrency: The Two Most Misunderstood Metrics in Enterprise LLM Deployment

Posted on 2026-01-16 by Rico

When evaluating Large Language Model (LLM) deployment options, many teams focus on GPU models and parameter counts—70B, 235B, 671B—while overlooking two metrics that actually determine whether a system is usable in real life:

  • Token/s (generation throughput)
  • Concurrency

These two metrics directly affect:

  • User experience
  • System scalability
  • Hardware sizing
  • Budget accuracy

This article explains what Token/s and concurrency really mean, why they are frequently misunderstood, and how enterprises should evaluate them when planning on-prem or private LLM deployments.


1. What Is Token/s?

It Is Simply the Output Speed of an AI Model

1.1 What Is a Token?

LLMs do not generate text sentence by sentence. They generate text token by token.

A token can be:

  • A Chinese character
  • An English word
  • A punctuation mark
  • A word fragment (e.g., “-ing”, “-tion”)

An LLM produces all of its reasoning and output as a sequence of generated tokens.


1.2 Definition of Token/s

Token/s = the number of tokens a model can generate per second

Examples:

  • 10 Token/s → ~10 Chinese characters per second
  • 40 Token/s → ~40 Chinese characters per second

Token/s does not affect correctness or intelligence.
It affects:

  • How long an answer takes to finish
  • How well the system handles multiple users

2. Token/s vs User Perception

When users interact with an LLM, they experience two distinct phases:

  1. Time to First Token (TTFT)
    How long until the first character appears
  2. Generation Speed (Token/s)
    How fast content streams after it starts

These two are often confused, but they have very different impacts.

Practical Comparison

Assume a 400-token response:

Scenario | TTFT  | Token/s | Total Time
A        | 0.5 s | 10      | ~40 s
B        | 3 s   | 40      | ~13 s
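As a quick sanity check, the totals in the comparison above can be reproduced with a small helper (a sketch; the 400-token response length and the TTFT/Token/s pairs are the example values from this section):

```python
def total_time(ttft_s: float, tokens: int, tokens_per_s: float) -> float:
    """Total response time: time to first token plus streaming time."""
    return ttft_s + tokens / tokens_per_s

# Scenario A: fast first token, slow streaming
print(total_time(ttft_s=0.5, tokens=400, tokens_per_s=10))  # 40.5 s
# Scenario B: slow first token, fast streaming
print(total_time(ttft_s=3.0, tokens=400, tokens_per_s=40))  # 13.0 s
```

Scenario B feels slower at first (3 s of silence) yet finishes roughly three times sooner, which is why long enterprise responses reward high Token/s over low TTFT.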

In enterprise workloads—analysis, reports, reasoning—Token/s usually matters more than TTFT.


3. What Is Concurrency?

And Why User Count ≠ Concurrency

3.1 The Most Common Mistake

“We have 100 employees using AI, so we need concurrency for 100 users.”

This assumption is almost always wrong.


3.2 Correct Definition of Concurrency

Concurrency = the number of requests simultaneously generating tokens

What matters is:

  • How many requests are active at the same time
  • How long each request occupies the system

3.3 Why LLM Concurrency Is Expensive

Unlike databases or web APIs, LLM requests are long-lived:

  • TTFT: 2–5 seconds
  • Generation time: 20–60 seconds (or more)

That means:

One request can occupy GPU resources for 30–60 seconds continuously

This makes concurrency a critical capacity constraint.


4. Token/s × Concurrency = Real System Load

4.1 A Practical Example

Assume:

  • Token/s = 10
  • Average response length = 300 tokens
  • TTFT = 4 seconds

Request duration:

4 + (300 ÷ 10) = 34 seconds

That request counts as one active concurrent load for 34 seconds.


4.2 Why Systems Degrade Quickly

If:

  • 5 requests start at the same time

Then:

  • Aggregate Token/s is divided among the active requests
  • TTFT increases
  • Responses slow down

Users experience this as:

  • Lag
  • “System freezing”
  • Unreliable performance
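The degradation can be illustrated with a deliberately simplified model that assumes the system's aggregate Token/s is split evenly among active requests. Real inference servers use continuous batching and behave more subtly, so treat this strictly as a sketch; the 300-token response and 4 s TTFT are the example values from section 4.1:

```python
def per_request_speed(aggregate_tokens_per_s: float, active_requests: int) -> float:
    """Naive model: aggregate throughput split evenly across active requests."""
    return aggregate_tokens_per_s / active_requests

def response_time(tokens: int, aggregate_tokens_per_s: float,
                  active_requests: int, ttft_s: float) -> float:
    """Time for one request to finish under the even-split assumption."""
    return ttft_s + tokens / per_request_speed(aggregate_tokens_per_s, active_requests)

# One user vs. five users hitting a 40 Token/s system at the same time
print(response_time(300, 40, active_requests=1, ttft_s=4))  # 11.5 s
print(response_time(300, 40, active_requests=5, ttft_s=4))  # 41.5 s
```

Going from one to five simultaneous requests turns an 11.5-second answer into a 41.5-second one, which users report as the system "freezing".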

5. How Enterprises Should Estimate Concurrency

Step 1: Start With Use Cases, Not Headcount

Use Case                   | Characteristics
Technical document lookup  | Low frequency, short
Legal or contract analysis | Low frequency, long
R&D reasoning              | Few users, deep
Customer support           | High frequency, short

Step 2: Estimate Requests per Minute (Peak)

This is far more accurate than counting users.


Step 3: Use a Simple Formula

Concurrency ≈ (Requests per minute × Duration per request) ÷ 60

Example:

  • 4 requests per minute
  • 30 seconds per request

(4 × 30) ÷ 60 = 2

Expected concurrency ≈ 2
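The Step 3 formula translates directly into code (a minimal sketch of the estimate above, which is essentially Little's law: arrival rate × duration):

```python
def estimated_concurrency(requests_per_minute: float, seconds_per_request: float) -> float:
    """Expected simultaneous in-flight requests: arrival rate x request duration."""
    return requests_per_minute * seconds_per_request / 60

# 4 requests/min, each occupying the system for 30 s
print(estimated_concurrency(4, 30))  # 2.0
```

Plugging in peak-hour numbers from your own usage logs gives a far more defensible sizing figure than headcount ever will.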


6. Typical Enterprise Concurrency Levels

Concurrency | Typical Scenario
1–3         | IT, R&D, management
3–10        | SME internal AI
10–30       | Department-level support
50+         | Public SaaS platforms

Most enterprises operate comfortably in the 1–10 range.


7. Why High Concurrency Costs Grow Non-Linearly

Supporting high concurrency usually requires:

  • More GPUs
  • Higher Token/s throughput
  • Advanced scheduling and caching
  • Significantly higher CapEx and OpEx

Paying 5–10× more for rare peak usage is unnecessary for most organizations.


8. Conclusion: Token/s and Concurrency Are Decision-Level Metrics

For enterprise LLM deployment:

  • Token/s determines productivity
  • Concurrency determines infrastructure scale and cost
  • They must be evaluated together

A practical strategy is:

Start with low concurrency and high-quality responses,
observe real usage patterns,
then scale based on actual data—not assumptions.

This approach delivers better ROI, lower risk, and sustainable AI adoption.
