Why Token Maxing is a Garbage Metric for Developer Productivity
There’s a new metric creeping into some engineering orgs, dressed up in fancy AI vocabulary: token count. The idea sounds seductive at first glance — if developers are using AI coding assistants, why not measure how many tokens they generate? More tokens = more output. It’s just basic numbers, right?
Wrong. It’s one of the worst productivity metrics you can imagine, and it has a long, embarrassing lineage.
The Ghost of Lines of Code Past
Let’s be clear — this isn’t new. “Lines of code” (LOC) has been the go-to misguided productivity metric for decades. And every senior engineer will tell you the same thing: it’s garbage.
You know who writes the most lines of code? The junior developer who doesn’t know about the standard library. The engineer who adds unnecessary verbosity because they’re insecure about their contributions. The team that inherits someone else’s bloat because everyone was “maxing” on output.
The developer who replaces 50 lines of nested conditionals with a clean pattern-matching table? They just deleted 90% of that code and made it dramatically better. The LOC metric says they’re less productive. They’re actually the most productive person in the room.
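To make that refactor concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration: the shipping-cost domain, the function names, and the prices. A plain dictionary lookup stands in for the pattern-matching table. Both versions behave the same; only the line count differs.

```python
# Hypothetical example: the same pricing rule written two ways.

def shipping_cost_nested(region: str, tier: str) -> int:
    """Before: every new region or tier adds another branch."""
    if region == "us":
        if tier == "standard":
            return 5
        elif tier == "express":
            return 15
        else:
            raise ValueError(f"unsupported combination: {region}/{tier}")
    elif region == "eu":
        if tier == "standard":
            return 8
        elif tier == "express":
            return 20
        else:
            raise ValueError(f"unsupported combination: {region}/{tier}")
    else:
        raise ValueError(f"unsupported combination: {region}/{tier}")


# After: the rules are data, and the logic is a single lookup.
SHIPPING_COST = {
    ("us", "standard"): 5,
    ("us", "express"): 15,
    ("eu", "standard"): 8,
    ("eu", "express"): 20,
}


def shipping_cost_table(region: str, tier: str) -> int:
    """Look the price up in the table; reject unknown combinations."""
    try:
        return SHIPPING_COST[(region, tier)]
    except KeyError:
        raise ValueError(f"unsupported combination: {region}/{tier}") from None
```

Adding a third region to the table version is one new line per tier. In the nested version it is another block of branches, and a LOC dashboard would score that as the more productive change.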
Code is a Liability, Not an Asset
Every line of code is:
- A bug waiting to happen
- Something someone else has to read and understand
- Something that needs to be tested
- Something that ages, degrades, and eventually needs rewriting
- A maintenance cost that compounds for years
The best engineers I’ve worked with share one trait: they write less code than anyone else, and the product is better for it. They don’t reach for a framework when a config file works. They don’t build a microservice when a function call does. They delete code. They say “no.” They push back on feature creep.
They understand that the best feature is the one that never gets built. The best code is the code you don’t write.
AI Doesn’t Change the Fundamentals
AI coding assistants are incredible tools. They accelerate thought. They handle boilerplate. They turn a frustrating 20-minute search into a 10-second prompt.
But they also make it easier to generate bulk. A developer who previously spent 3 hours writing 200 lines of careful code can now “generate” many times that volume in minutes. The effort-to-output ratio has collapsed. And that’s precisely when bad metrics become dangerous.
Because now you can fake productivity at scale. A developer can mindlessly hit “generate” on everything — verbose implementations, redundant abstractions, overly complex solutions — and look like a superstar on the dashboard. Meanwhile, the codebase is getting fat, slow, and unmaintainable.
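Here is a small sketch of how easily the number gets inflated. It is illustrative only: the two snippets are invented, and `rough_token_count` is a crude regex stand-in for a tokenizer, not how any real AI assistant counts tokens. The point is that two functions with identical behavior can score very differently on a volume metric.

```python
import re

# Two hypothetical implementations of the same function, stored as source text.
VERBOSE = '''
def sum_of_even_numbers(list_of_numbers):
    running_total = 0
    for index in range(len(list_of_numbers)):
        current_number = list_of_numbers[index]
        if current_number % 2 == 0:
            running_total = running_total + current_number
        else:
            running_total = running_total + 0
    return running_total
'''

CONCISE = '''
def sum_of_even_numbers(numbers):
    return sum(n for n in numbers if n % 2 == 0)
'''


def rough_token_count(source: str) -> int:
    """Crude proxy for a tokenizer: count word-like chunks and symbols."""
    return len(re.findall(r"\w+|[^\w\s]", source))


def load(source: str):
    """Execute a snippet and return the function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["sum_of_even_numbers"]


verbose_fn = load(VERBOSE)
concise_fn = load(CONCISE)

sample = [1, 2, 3, 4, 10]
# Same behavior on the sample input...
assert verbose_fn(sample) == concise_fn(sample)

# ...but very different scores on a volume-based dashboard.
print("verbose:", rough_token_count(VERBOSE))
print("concise:", rough_token_count(CONCISE))
```

On a token leaderboard the verbose version wins, even though it does nothing the concise version doesn't and leaves more for the next person to read.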
This isn’t hypothetical. I’ve seen it happen. The developers who produce the most AI-generated code are rarely the ones producing the best code. They’re producing the most confident-sounding code. Which is not the same thing at all.
What Actually Measures Productivity?
If you can’t measure by volume, what can you measure?
Reliability. Does the code work? Does it break things? How many bugs escape to production?
Simplicity. Is the next developer going to be glad or sorry when they inherit this? Can it be understood in 5 minutes?
Speed of delivery. Not lines shipped per hour — features delivered per sprint. Working software that users actually need.
Debt avoided. How much future work did this engineer prevent by making good design decisions, writing good tests, or just saying “we don’t need this”?
Code review impact. Do their PRs get reviewed quickly with minimal changes? That’s a sign of someone who knows what they’re doing. PRs that need 15 rounds of review are not “high output” — they’re high friction.
None of these are dashboard metrics. They require human judgment. Which is exactly why some managers avoid them: it’s easier to count tokens than to think.
The Irony Nobody Talks About
Here’s the most uncomfortable truth: developers who are genuinely good at their jobs tend to produce less visible output.
The senior engineer who refactors a mess into something maintainable? Nobody celebrates the code that doesn’t exist anymore.
The architect who says “we don’t need a database for this” instead of provisioning a cluster? That’s not in any sprint report.
The lead who catches a flawed design before a single line is written? That’s an invisible win.
Meanwhile, the developer who churns through features with AI-assisted bulk code — messy, untested, undocumented — looks productive on the surface. And surface metrics are what get people promoted.
We’ve built an incentive system that rewards the appearance of work over the reality of results.
A Better Question
Instead of “how much code did you write?” ask: “what did you ship, and how hard would it be to maintain in six months?”
Instead of “how many tokens did you generate?” ask: “did the customer’s problem get solved, or did we just get more code?”
Instead of measuring output, measure outcome. A shipped feature that nobody uses delivers zero value. Deleting a feature that was confusing users delivers positive value. A three-line function that replaces a thousand-line module is one of the highest-impact things an engineer can do.
The Bottom Line
Token maxing, LOC counting, “code velocity” — they’re all the same flawed pattern dressed in different packaging. They confuse activity with achievement. They reward busyness over brilliance. They incentivize the wrong behaviors.
The engineers who shaped the most impactful software in this industry — the ones building systems that millions rely on every day — didn’t get there by generating volume. They got there by thinking carefully, writing sparingly, and having the discipline to leave things simpler than they found them.
Any metric that can’t capture that isn’t measuring productivity. It’s measuring noise.