Agentic Punks

Tokenmaxxing - The wrong metric for Agentic AI

March 25, 2026 | Roman Zenner

Companies have no idea how to measure the productivity of their AI agents. So they measure token consumption. Last week, a developer at OpenAI ran 210 billion tokens through the company’s in-house models—enough text to fill Wikipedia 33 times over. At Anthropic, a single Claude Code user generated over $150,000 in monthly costs. And at Meta, AI usage is factored into performance reviews. The whole thing has a name: tokenmaxxing. And it’s the first major example of how companies are miscalibrating agentic AI.

tl;dr

  • In tech companies, internal leaderboards track who consumes the most AI tokens. The frontrunner at OpenAI: 210 billion tokens in a week.
  • Meta evaluates employees based on "AI-driven impact." Token budgets appear as a benefit in job postings.
  • Autonomous coding agents like OpenClaw work around the clock and generate token volumes that human users can no longer surpass.

What is tokenmaxxing?

As Kevin Roose reports in the New York Times, internal leaderboards have emerged in tech companies that track individual employees’ token consumption. Those at the top are considered productive. Those who consume little have some explaining to do. VC Nikunj Kothari has coined a term for this: “Token Anxiety.” Dinner conversations in Silicon Valley no longer start with “What are you building?” but with “How many agents do you have running?”

Meta has included “AI-driven impact” as a formal criterion in performance reviews. And according to TechCrunch, token budgets are appearing as a benefit in job postings, alongside dental insurance and free lunch.

Sounds like the future. Feels like 2015, when companies celebrated the number of Slack messages as an indicator of activity.

The problem is called Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.” That is Goodhart’s Law, named after economist Charles Goodhart and popularized in this wording by anthropologist Marilyn Strathern. Tokenmaxxing is the real-time demonstration.

Because what does token consumption actually measure? Not the quality of the code. Not the relevance of the results. Not the efficiency of the prompt. But rather: how much text has flowed through a model. A developer who needs 50 prompts to write a function consumes more tokens than one who manages it in 3 prompts. On the leaderboard, the inefficient ones win.
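To make the arithmetic concrete, here is a minimal Python sketch. The names dev_a and dev_b, the assumed 2,000 tokens per prompt, and the single shipped function are illustrative assumptions, not figures from the reporting. It ranks the same two developers once by raw token count and once by tokens per shipped function.

```python
# Hypothetical numbers: both developers ship the same working function.
TOKENS_PER_PROMPT = 2_000  # assumed average for one prompt plus its response

developers = {
    "dev_a": {"prompts": 50, "functions_shipped": 1},
    "dev_b": {"prompts": 3, "functions_shipped": 1},
}


def tokens_used(stats):
    """Total tokens consumed, under the assumed per-prompt average."""
    return stats["prompts"] * TOKENS_PER_PROMPT


def tokens_per_function(stats):
    """Tokens spent for each function that actually shipped."""
    return tokens_used(stats) / stats["functions_shipped"]


# Leaderboard view: most tokens first.
leaderboard = sorted(
    developers, key=lambda name: tokens_used(developers[name]), reverse=True
)

# Efficiency view: fewest tokens per shipped function first.
by_efficiency = sorted(
    developers, key=lambda name: tokens_per_function(developers[name])
)

print("Token leaderboard:", leaderboard)
print("Efficiency ranking:", by_efficiency)
```

Under these assumed numbers, dev_a tops the token leaderboard with 100,000 tokens against 6,000 for dev_b, while the efficiency ranking puts dev_b first. Same output, opposite verdict.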

An example from the NYT shows just how far this goes: A startup founder used a $20-a-month Figma plan to consume $70,000 worth of Claude tokens, and built six software projects in parallel. Figma has since closed the loophole. But the point remains: If you can get tokens cheap enough, volume becomes an end in itself.

Autonomous agents make it worse

And now come the coding agents. Systems like OpenClaw operate around the clock. They spawn sub-agents, which in turn generate tokens. Ege Erdil, co-founder of the AI startup Mechanize, estimates his personal consumption at one to ten billion tokens per week. “700 million per week from a single agent—it doesn’t take much,” he says. A Stockholm-based developer spends more on Claude than he earns—the company picks up the tab.

The promise: Agents do the work while you sleep. The reality: Agents generate token volumes that count as productivity on leaderboards—regardless of whether the output is useful.

Yes—during the learning phase, high AI usage makes sense. Employees need to experiment, iterate, and test boundaries. But leaderboards don’t measure learning; they measure consumption. And when token consumption is rewarded, it pays to let the agent run as many loops as possible. More iterations, more tokens, better performance review. Whether the code works afterward is another matter.

The real-time warning

Gergely Orosz, one of the most influential voices among software developers through his newsletter The Pragmatic Engineer, defends leaderboards as a “supercheap way to learn about new and interesting ways of working.” His logic: The old metrics (lines of code, number of commits) were no better. And at the most AI-enthusiastic companies, it’s now “a career risk to not use AI at an accelerated pace, regardless of output quality.”

That’s exactly where the problem lies. “Regardless of output quality”—that’s not a feature, it’s a bug. For anyone currently introducing agentic systems in companies: If you evaluate agents based on input rather than output, that’s exactly what you’ll get—a lot of input. The token leaderboards at Meta and OpenAI show what happens when volume becomes the metric. You don’t measure an author’s quality by the number of keystrokes. You don’t measure an agent’s quality by its token consumption.
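What an output-based view could look like, as a rough Python sketch: the AgentRun record, the tests_passed flag, and all numbers below are hypothetical, and a real setup would need its own definition of a useful result. The idea is only to report tokens relative to accepted output instead of tokens alone.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    agent: str
    tokens: int
    tests_passed: bool  # hypothetical output signal, e.g. result passed tests or review


def summarize(runs):
    """Aggregate runs per agent and relate token spend to accepted results."""
    summary = {}
    for run in runs:
        entry = summary.setdefault(run.agent, {"tokens": 0, "runs": 0, "passed": 0})
        entry["tokens"] += run.tokens
        entry["runs"] += 1
        entry["passed"] += int(run.tests_passed)
    for entry in summary.values():
        entry["pass_rate"] = entry["passed"] / entry["runs"]
        # No accepted result means there is no meaningful cost per result.
        entry["tokens_per_pass"] = (
            entry["tokens"] / entry["passed"] if entry["passed"] else None
        )
    return summary


# Made-up runs for two hypothetical agents.
runs = [
    AgentRun("agent_x", 900_000, False),
    AgentRun("agent_x", 1_100_000, True),
    AgentRun("agent_y", 120_000, True),
    AgentRun("agent_y", 80_000, True),
]

report = summarize(runs)
for agent, stats in report.items():
    print(agent, stats)
```

In this made-up data, agent_x consumes ten times the tokens of agent_y but delivers half as many accepted results. A token leaderboard would rank it first; a tokens-per-accepted-result view would not.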

It’s not the technology that’s miscalibrated. It’s the metrics.

An LLM researched and wrote. A human read, edited, and approved it. We’re still debating which of the two did more work.