Why Use Many Token When Few Token Do Trick: The Real Cost of AI Verbosity
A joke GitHub project that makes Claude Code talk like a caveman reveals a serious problem — AI coding tools burn through tokens at an alarming rate, and developers are finding creative ways to fight back.
A small open-source project called Caveman has been circulating among developers who use Anthropic's Claude Code, the AI-powered coding agent. The premise is absurd: install a custom "skill" that forces the AI to communicate in terse, caveman-like speech, and you cut token consumption by roughly 75%. The tagline on its GitHub repository says it all: "why use many token when few token do trick."
It's funny. It's also a sharp commentary on one of the most expensive and least-discussed inefficiencies in AI-assisted development. Every token an AI model generates costs money, consumes compute, and adds latency. When a coding agent wraps a one-line fix in three paragraphs of polite explanation, someone is paying for that politeness. Caveman asks a fair question: what if they didn't?
The Token Economy Nobody Talks About
To understand why a caveman-talking AI matters, you need to understand how AI pricing works. Large language models like Claude process text in "tokens," which are fragments of words. As Ars Technica explained when Anthropic expanded Claude's context window to 100,000 tokens — roughly 75,000 words — tokens are the fundamental unit of both input and output for these systems. Providers charge per token, and the costs add up fast when you're running an AI coding agent that generates thousands of tokens of explanation alongside its actual code.
The problem is structural. Models like Claude are trained to be helpful, thorough, and conversational. When Anthropic first introduced Claude in 2023, the company emphasized that it was "easier to converse with" and "more steerable" than competitors. Those are selling points for a chatbot. For a coding agent running autonomously through a complex codebase, though, conversational fluency becomes expensive overhead.
Consider a typical interaction: a developer asks an AI agent to fix a bug. The agent identifies the issue, explains its reasoning, proposes a solution, explains the solution, implements the fix, then summarizes what it did. The actual code change might be ten tokens. The surrounding narration might be five hundred. In an agentic workflow where the model is calling itself repeatedly, those five hundred tokens multiply across dozens or hundreds of steps.
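The compounding effect is easy to sketch in back-of-envelope terms. The figures below — the per-token price, the step count, and the narration sizes — are illustrative assumptions for the sake of arithmetic, not Anthropic's actual pricing or measured Caveman numbers:

```python
# Illustrative cost model: each agentic step emits a small code change
# plus surrounding narration, and the narration dominates the bill.
PRICE_PER_OUTPUT_TOKEN = 15 / 1_000_000  # assumed: $15 per million output tokens

def session_cost(steps, code_tokens, narration_tokens):
    """Total output cost of an agentic session, in dollars."""
    total_tokens = steps * (code_tokens + narration_tokens)
    return total_tokens * PRICE_PER_OUTPUT_TOKEN

# 200 steps, ten tokens of code per step, verbose vs. terse narration
verbose = session_cost(steps=200, code_tokens=10, narration_tokens=500)
terse = session_cost(steps=200, code_tokens=10, narration_tokens=50)
print(f"verbose: ${verbose:.2f}, terse: ${terse:.2f}")
```

Under these assumed numbers, the code itself accounts for about 2% of the verbose session's output tokens; everything else is commentary.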
Caveman Logic: Compression as Cost Control
The Caveman project, created by developer Julius Brussee, works as a Claude Code "skill" — essentially a behavioral instruction that shapes how the model responds. Instead of writing "I've identified the issue in your authentication middleware. The problem is that the session token validation function doesn't account for expired tokens. Here's my proposed fix," the caveman-mode Claude might output something like "bug in auth. token expiry not checked. fix:" followed by the code.
The claimed 75% token reduction is striking, but the underlying principle is straightforward. Most of what AI coding tools say isn't code — it's commentary. Strip the commentary down to its minimum viable form, and you dramatically reduce the token footprint without losing the information that matters.
This isn't just about saving a few cents per query. For teams running AI agents at scale — where models are autonomously planning, executing, and reviewing code changes across large projects — token costs can become a significant line item. Tools like Plandex, an open-source AI coding agent designed for large projects and real-world tasks, represent the kind of agentic workflow where token efficiency directly impacts both cost and speed. When an agent is making hundreds of API calls to complete a complex task, every unnecessary token compounds.
Performance and Latency: The Hidden Benefits
Cost is the obvious win, but minimal token usage has a second benefit that's arguably more important: speed. Token generation is sequential. Each token the model produces adds latency to the response. Cut 75% of the output tokens, and you don't just save money — you get answers faster.
For interactive coding sessions, this is the difference between a tool that feels responsive and one that feels sluggish. Developers are impatient by nature and by necessity. A coding agent that takes eight seconds to deliver a fix wrapped in three paragraphs of explanation is slower than one that takes two seconds to deliver the same fix with a terse annotation. The code is identical. The experience is not.
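Because decoding is sequential, the latency math is just division. A rough sketch, using an assumed decode speed (real throughput varies by model and load):

```python
# Response time scales linearly with output length when generation
# is sequential. The decode speed is an assumption, not a benchmark.
TOKENS_PER_SECOND = 75  # assumed decode throughput

def response_seconds(output_tokens):
    """Approximate wall-clock time to generate a response."""
    return output_tokens / TOKENS_PER_SECOND

print(response_seconds(600))  # fix wrapped in three paragraphs: 8.0 s
print(response_seconds(150))  # same fix, 75% fewer tokens: 2.0 s
```

The eight-seconds-versus-two figure above falls out directly: cut output tokens by 75% and, all else equal, response time drops by the same fraction.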
There's a deeper performance question, too. Context windows are large but finite. Anthropic's expansion of Claude's context window to 100,000 tokens was a significant leap, but even that space fills up quickly in long agentic sessions. Every verbose response the model generates becomes part of the conversation history, consuming context that could be used for actual code, documentation, or project files. Terse outputs preserve context space for the information that matters.
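The context-budget effect can be sketched the same way. With an assumed per-turn size (the turn figures here are illustrative), verbose responses exhaust a 100,000-token window several times faster than terse ones:

```python
# How quickly conversation history fills a fixed context window.
# Per-turn token counts are illustrative assumptions.
CONTEXT_WINDOW = 100_000

def turns_until_full(tokens_per_turn):
    """Whole turns that fit before the window is exhausted."""
    return CONTEXT_WINDOW // tokens_per_turn

print(turns_until_full(2_000))  # verbose agent turns: 50
print(turns_until_full(500))    # terse turns: 200
```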
The Security and Alignment Angle
There's a tension here that's worth naming. Anthropic built Claude with a specific philosophy around safety and transparency. As Ars Technica detailed in its coverage of Constitutional AI, the company trains its models using explicit behavioral principles designed to make outputs safer and more transparent. Verbose explanations aren't just a quirk — they're partly a design choice. When a model explains its reasoning, it's easier to audit, easier to catch errors, and easier to verify that it's not doing something harmful.
Stripping that verbosity away creates a tradeoff. A caveman-mode agent that says "fix auth bug" and outputs code is faster and cheaper. It's also harder to review. If the model makes a subtle error in its reasoning, the terse output gives you fewer signals to catch it. In safety-critical applications, that's a real concern.
The counterargument is that developers don't actually read most of the verbose output anyway. They scan for the code block, copy it, and move on. If the explanatory text isn't being consumed, it's not providing the safety benefit it's designed to deliver. Better, perhaps, to make the output concise by default and offer verbose mode as an opt-in for situations where auditability matters.
What This Means for the Industry
Caveman is a joke project with a serious insight at its core: the default verbosity of AI models is a design choice, not a technical necessity, and it has real costs. As AI coding agents become more autonomous and more deeply integrated into development workflows, the economics of token usage will become harder to ignore.
Model providers face a design tension. Helpful, explanatory outputs make for good demos and satisfied first-time users. They also make for expensive, slow production workloads. The market is already signaling its preference. Developers are hacking together caveman modes and terse-output prompts because the tools don't offer a built-in efficiency toggle.
The smarter path forward is probably adaptive verbosity — models that are concise by default in agentic loops and explanatory when a human is actively reading. Some of this is already happening through system prompts and custom instructions, but it's still largely a manual process. The next generation of AI development tools will likely need to treat token efficiency as a first-class feature, not an afterthought.
In the meantime, a developer in a loincloth is showing the rest of the industry what efficient AI communication looks like. It's not pretty. But it works, and it's 75% cheaper. In a market where AI compute costs are everyone's problem, that's an argument that's hard to ignore.