Nuface Blog

Token/s and Concurrency: The Two Most Misunderstood Metrics in Enterprise LLM Deployment

Posted on 2026-01-16 by Rico

When evaluating Large Language Model (LLM) deployment options, many teams focus on GPU models and parameter counts—70B, 235B, 671B—while overlooking two metrics that actually determine whether a system is usable in real life:

  • Token/s (generation throughput)
  • Concurrency

These two metrics directly affect:

  • User experience
  • System scalability
  • Hardware sizing
  • Budget accuracy

This article explains what Token/s and concurrency really mean, why they are frequently misunderstood, and how enterprises should evaluate them when planning on-prem or private LLM deployments.


1. What Is Token/s?

It Is Simply the Output Speed of an AI Model

1.1 What Is a Token?

LLMs do not generate text sentence by sentence. They generate text token by token.

A token can be:

  • A Chinese character
  • An English word
  • A punctuation mark
  • A word fragment (e.g., “-ing”, “-tion”)

An LLM produces all of its reasoning and output as a sequence of generated tokens.


1.2 Definition of Token/s

Token/s = the number of tokens a model can generate per second

Examples:

  • 10 Token/s → ~10 Chinese characters per second
  • 40 Token/s → ~40 Chinese characters per second

Token/s does not affect correctness or intelligence.
It affects:

  • How long an answer takes to finish
  • How well the system handles multiple users

2. Token/s vs User Perception

When users interact with an LLM, they experience two distinct phases:

  1. Time to First Token (TTFT)
    How long until the first character appears
  2. Generation Speed (Token/s)
    How fast content streams after it starts

These two are often confused, but they have very different impacts.

Practical Comparison

Assume a 400-token response:

Scenario | TTFT  | Token/s | Total Time
A        | 0.5 s | 10      | ~40 s
B        | 3 s   | 40      | ~13 s
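As a quick sanity check, the totals in the comparison above can be reproduced with a small helper (a sketch; the 400-token response length and the TTFT/Token/s pairs are the example values from this section):

```python
def total_time(ttft_s: float, tokens: int, tokens_per_s: float) -> float:
    """Total response time: time to first token plus streaming time."""
    return ttft_s + tokens / tokens_per_s

# Scenario A: fast first token, slow streaming
print(total_time(ttft_s=0.5, tokens=400, tokens_per_s=10))  # 40.5 s
# Scenario B: slow first token, fast streaming
print(total_time(ttft_s=3.0, tokens=400, tokens_per_s=40))  # 13.0 s
```

Scenario B feels slower at first (3 s of silence) yet finishes roughly three times sooner, which is why long enterprise responses reward high Token/s over low TTFT.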

In enterprise workloads—analysis, reports, reasoning—Token/s usually matters more than TTFT.


3. What Is Concurrency?

And Why User Count ≠ Concurrency

3.1 The Most Common Mistake

“We have 100 employees using AI, so we need concurrency for 100 users.”

This assumption is almost always wrong.


3.2 Correct Definition of Concurrency

Concurrency = the number of requests simultaneously generating tokens

What matters is:

  • How many requests are active at the same time
  • How long each request occupies the system

3.3 Why LLM Concurrency Is Expensive

Unlike databases or web APIs, LLM requests are long-lived:

  • TTFT: 2–5 seconds
  • Generation time: 20–60 seconds (or more)

That means:

One request can occupy GPU resources for 30–60 seconds continuously

This makes concurrency a critical capacity constraint.


4. Token/s × Concurrency = Real System Load

4.1 A Practical Example

Assume:

  • Token/s = 10
  • Average response length = 300 tokens
  • TTFT = 4 seconds

Request duration:

4 + (300 ÷ 10) = 34 seconds

That request counts as one active concurrent load for 34 seconds.


4.2 Why Systems Degrade Quickly

If:

  • 5 requests start at the same time

Then:

  • Aggregate Token/s is divided among the active requests
  • TTFT increases
  • Responses slow down

Users experience this as:

  • Lag
  • “System freezing”
  • Unreliable performance
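The degradation can be illustrated with a deliberately simplified model that assumes the system's aggregate Token/s is split evenly among active requests. Real inference servers use continuous batching and behave more subtly, so treat this strictly as a sketch; the 300-token response and 4 s TTFT are the example values from section 4.1:

```python
def per_request_speed(aggregate_tokens_per_s: float, active_requests: int) -> float:
    """Naive model: aggregate throughput split evenly across active requests."""
    return aggregate_tokens_per_s / active_requests

def response_time(tokens: int, aggregate_tokens_per_s: float,
                  active_requests: int, ttft_s: float) -> float:
    """Time for one request to finish under the even-split assumption."""
    return ttft_s + tokens / per_request_speed(aggregate_tokens_per_s, active_requests)

# One user vs. five users hitting a 40 Token/s system at the same time
print(response_time(300, 40, active_requests=1, ttft_s=4))  # 11.5 s
print(response_time(300, 40, active_requests=5, ttft_s=4))  # 41.5 s
```

Going from one to five simultaneous requests turns an 11.5-second answer into a 41.5-second one, which users report as the system "freezing".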

5. How Enterprises Should Estimate Concurrency

Step 1: Start With Use Cases, Not Headcount

Use Case                   | Characteristics
Technical document lookup  | Low frequency, short
Legal or contract analysis | Low frequency, long
R&D reasoning              | Few users, deep
Customer support           | High frequency, short

Step 2: Estimate Requests per Minute (Peak)

This is far more accurate than counting users.


Step 3: Use a Simple Formula

Concurrency ≈ (Requests per minute × Duration per request) ÷ 60

Example:

  • 4 requests per minute
  • 30 seconds per request

(4 × 30) ÷ 60 = 2

Expected concurrency ≈ 2
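The Step 3 formula translates directly into code (a minimal sketch of the estimate above, which is essentially Little's law: arrival rate × duration):

```python
def estimated_concurrency(requests_per_minute: float, seconds_per_request: float) -> float:
    """Expected simultaneous in-flight requests: arrival rate x request duration."""
    return requests_per_minute * seconds_per_request / 60

# 4 requests/min, each occupying the system for 30 s
print(estimated_concurrency(4, 30))  # 2.0
```

Plugging in peak-hour numbers from your own usage logs gives a far more defensible sizing figure than headcount ever will.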


6. Typical Enterprise Concurrency Levels

Concurrency | Typical Scenario
1–3         | IT, R&D, management
3–10        | SME internal AI
10–30       | Department-level support
50+         | Public SaaS platforms

Most enterprises operate comfortably in the 1–10 range.


7. Why High Concurrency Costs Grow Non-Linearly

Supporting high concurrency usually requires:

  • More GPUs
  • Higher Token/s throughput
  • Advanced scheduling and caching
  • Significantly higher CapEx and OpEx

Paying 5–10× more for rare peak usage is unnecessary for most organizations.


8. Conclusion: Token/s and Concurrency Are Decision-Level Metrics

For enterprise LLM deployment:

  • Token/s determines productivity
  • Concurrency determines infrastructure scale and cost
  • They must be evaluated together

A practical strategy is:

Start with low concurrency and high-quality responses,
observe real usage patterns,
then scale based on actual data—not assumptions.

This approach delivers better ROI, lower risk, and sustainable AI adoption.
