The Two Most Misunderstood Metrics in Enterprise LLM Deployment
When evaluating Large Language Model (LLM) deployment options, many teams focus on GPU models and parameter counts—70B, 235B, 671B—while overlooking two metrics that actually determine whether a system is usable in real life:
- Token/s (generation throughput)
- Concurrency
These two metrics directly affect:
- User experience
- System scalability
- Hardware sizing
- Budget accuracy
This article explains what Token/s and concurrency really mean, why they are frequently misunderstood, and how enterprises should evaluate them when planning on-prem or private LLM deployments.
1. What Is Token/s?
It Is Simply the Output Speed of an AI Model
1.1 What Is a Token?
LLMs do not generate text sentence by sentence. They generate text token by token.
A token can be:
- A Chinese character
- An English word
- A punctuation mark
- A word fragment (e.g., “-ing”, “-tion”)
All reasoning and output in an LLM are produced as a sequence of tokens, generated one at a time.
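As a quick illustration, the snippet below uses OpenAI's tiktoken library (chosen purely as an example; every model family ships its own tokenizer, so the exact split will differ) to show how a sentence breaks into tokens:

```python
# Illustration only: counting tokens with OpenAI's tiktoken tokenizer.
# Other models use different tokenizers, so the exact split will differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization drives both speed and cost."
token_ids = enc.encode(text)

print(len(text), "characters ->", len(token_ids), "tokens")
print([enc.decode([t]) for t in token_ids])  # the individual token fragments
```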
1.2 Definition of Token/s
Token/s = the number of tokens a model can generate per second
Examples:
- 10 Token/s → ~10 Chinese characters per second
- 40 Token/s → ~40 Chinese characters per second
Token/s does not affect correctness or intelligence.
It affects:
- How long an answer takes to finish
- How well the system handles multiple users
2. Token/s vs User Perception
When users interact with an LLM, they experience two distinct phases:
- Time to First Token (TTFT): how long until the first character appears
- Generation speed (Token/s): how fast content streams after it starts
These two are often confused, but they have very different impacts.
Practical Comparison
Assume a 400-token response:
| Scenario | TTFT | Token/s | Total Time |
|---|---|---|---|
| A | 0.5 s | 10 | ~40 s |
| B | 3 s | 40 | ~13 s |
In enterprise workloads—analysis, reports, reasoning—Token/s usually matters more than TTFT.
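The arithmetic behind the table is simple enough to express directly. The function below is a minimal sketch with illustrative names, not a standard API:

```python
# Total perceived time = time to first token + generation time.
def total_response_time(ttft_s: float, response_tokens: int, tokens_per_s: float) -> float:
    return ttft_s + response_tokens / tokens_per_s

RESPONSE_TOKENS = 400  # the response length assumed in the table

print(total_response_time(0.5, RESPONSE_TOKENS, 10))  # Scenario A: 40.5 s
print(total_response_time(3.0, RESPONSE_TOKENS, 40))  # Scenario B: 13.0 s
```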
3. What Is Concurrency?
And Why User Count ≠ Concurrency
3.1 The Most Common Mistake
“We have 100 employees using AI, so we need concurrency for 100 users.”
This assumption is almost always wrong.
3.2 Correct Definition of Concurrency
Concurrency = the number of requests simultaneously generating tokens
What matters is:
- How many requests are active at the same time
- How long each request occupies the system
3.3 Why LLM Concurrency Is Expensive
Unlike databases or web APIs, LLM requests are long-lived:
- TTFT: 2–5 seconds
- Generation time: 20–60 seconds (or more)
That means:
One request can occupy GPU resources for 30–60 seconds continuously
This makes concurrency a critical capacity constraint.
4. Token/s × Concurrency = Real System Load
4.1 A Practical Example
Assume:
- Token/s = 10
- Average response length = 300 tokens
- TTFT = 4 seconds
Request duration:
4 + (300 ÷ 10) = 34 seconds
That request counts as one active concurrent load for 34 seconds.
4.2 Why Systems Degrade Quickly
If:
- 5 requests start generating at the same time
Then:
- The available Token/s is split across them
- TTFT increases
- Every response streams more slowly (see the sketch below)
Users experience this as:
- Lag
- “System freezing”
- Unreliable performance
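A rough sketch of this effect, reusing the numbers from 4.1. It assumes a fixed aggregate decode capacity split evenly across active requests and a constant TTFT; real serving stacks (continuous batching, paged KV caches) degrade more gracefully, but the trend is the same:

```python
# Assumption: a fixed total decode capacity shared evenly across requests.
# Real servers scale better than this, but per-request speed still drops
# as concurrency rises.
AGGREGATE_TOKENS_PER_S = 10.0   # assumed total capacity (matches 4.1 for one user)
RESPONSE_TOKENS = 300           # average response length from 4.1
TTFT_S = 4.0                    # TTFT from 4.1 (held constant for simplicity)

for active in (1, 2, 5):
    per_request_speed = AGGREGATE_TOKENS_PER_S / active
    duration = TTFT_S + RESPONSE_TOKENS / per_request_speed
    print(f"{active} active request(s): {per_request_speed:.0f} Token/s each, "
          f"~{duration:.0f} s per response")
# 1 -> 10 Token/s, ~34 s   2 -> 5 Token/s, ~64 s   5 -> 2 Token/s, ~154 s
```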
5. How Enterprises Should Estimate Concurrency
Step 1: Start With Use Cases, Not Headcount
| Use Case | Characteristics |
|---|---|
| Technical document lookup | Low frequency, short responses |
| Legal or contract analysis | Low frequency, long responses |
| R&D reasoning | Few users, long reasoning sessions |
| Customer support | High frequency, short responses |
Step 2: Estimate Requests per Minute (Peak)
This is far more accurate than counting users.
Step 3: Use a Simple Formula
Concurrency ≈ (Requests per minute × Duration per request) ÷ 60
Example:
- 4 requests per minute
- 30 seconds per request
(4 × 30) ÷ 60 = 2
Expected concurrency ≈ 2
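The same formula in code, as a minimal sketch with illustrative (not standard) names:

```python
# Concurrency ≈ (requests per minute × seconds per request) ÷ 60
def estimate_concurrency(requests_per_minute: float, seconds_per_request: float) -> float:
    return requests_per_minute * seconds_per_request / 60.0

print(estimate_concurrency(4, 30))   # the example above -> 2.0
print(estimate_concurrency(12, 45))  # a busier, hypothetical peak -> 9.0
```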
6. Typical Enterprise Concurrency Levels
| Concurrency | Typical Scenario |
|---|---|
| 1–3 | IT, R&D, management |
| 3–10 | SME internal AI |
| 10–30 | Department-level support |
| 50+ | Public SaaS platforms |
Most enterprises operate comfortably in the 1–10 range.
7. Why High Concurrency Costs Grow Non-Linearly
Supporting high concurrency usually requires:
- More GPUs
- Higher Token/s throughput
- Advanced scheduling and caching
- Significantly higher CapEx and OpEx
Paying 5–10× more to cover rare usage peaks is unnecessary for most organizations.
8. Conclusion: Token/s and Concurrency Are Decision-Level Metrics
For enterprise LLM deployment:
- Token/s determines productivity
- Concurrency determines infrastructure scale and cost
- They must be evaluated together
A practical strategy is:
Start with low concurrency and high-quality responses,
observe real usage patterns,
then scale based on actual data—not assumptions.
This approach delivers better ROI, lower risk, and sustainable AI adoption.