anthropic ·claude ·llm-eval ·test-time-compute ·reasoning-models ·arxiv-research

Fable may be smarter. Show me the token bill.

Amir Khakshour

Lukas Brandt

Sofia Ruiz

Jun 11, 2026 · 8 min read

We want the token bill next to any chart calling Fable more intelligent.

The uncomfortable part of current reasoning-model marketing is that a public score can hide two very different improvements. One is a better model. The other is a model spending more inference-time compute: more hidden reasoning, more retries, longer traces, more latency, more dollars. Both can produce better answers. Only one deserves to be sold as raw intelligence without a footnote.

That distinction matters because test-time compute is no longer a side trick. It is a whole research area.

flowchart TD
    A[Claim: more intelligent model] --> B{What bought the gain?}
    B --> C[Better weights]
    B --> D[More test-time compute — token tax]
    B --> E[Better budget allocation]
    C --> F[Publish the frontier<br/>alongside the claim]
    D --> F
    E --> F

This is the scientific version of our complaint: Fable may be more capable at Anthropic’s chosen operating point, but the public claim is incomplete unless it reports the inference budget that bought the capability.

The friendly reading

Giving a model more compute at inference time can improve reasoning, and that is the strongest version of Anthropic’s case.

The Art of Scaling Test-Time Compute for Large Language Models, a study across eight open-source LLMs and more than thirty billion generated tokens, lands on a narrower result than “more tokens good”: the best strategy depends on model type, problem difficulty, and compute budget. No single test-time strategy universally dominates.

That is already enough to change how a model card should read. If Fable gets its jump by using a more aggressive reasoning policy, that is a valid engineering achievement. But it is a compute-allocation achievement. It should be measured as one.

The same point shows up in Can 1B LLM Surpass 405B LLM?, which argues that compute-optimal test-time scaling can let much smaller models beat much larger ones on some math benchmarks. The lesson is awkward for vendor marketing: benchmark wins can come from the inference procedure, not only from a more intelligent base model.

In other words, a model endpoint is a bundle:

Layer	What the user sees	What the benchmark may hide
Base model	The answer	Parameter count, training mix, RL policy
Inference policy	The answer quality	Number of samples, verifier passes, hidden reasoning budget
Serving stack	The latency	batching, routing, speculative decoding, hardware
Product defaults	The bill	token caps, retry policy, tool-call policy

Calling the whole bundle “more intelligent” is convenient but not precise.

The token tax

The term we would use for Fable-style claims is the token tax: the extra reasoning budget paid to move an answer from acceptable to impressive.

That tax is sometimes worth paying. Economic Evaluation of LLMs makes the case cleanly: if a wrong answer is expensive, the most powerful model can be the economically correct choice even when its per-call cost is higher. For a legal review, a production migration, or an autonomous agent editing a repository, paying more for fewer mistakes may be rational.

But that does not make the tax disappear. It means the tax has to be priced against the cost of an error.

The failure mode is using the expensive setting everywhere and calling the resulting average “intelligence.” Plan and Budget names the pattern directly: reasoning models often overthink, generating verbose or tangential traces even for simple queries. The paper’s proposed fix is not “never think.” It is adaptive budgeting: decompose the problem, estimate complexity, and allocate tokens where uncertainty is high.

That is the standard Fable should be held to. Not “can it produce the best answer if allowed to spend?” But “does it know when not to spend?”

More thinking can hurt

The easiest way to overstate a reasoning model is to draw only the high-budget point on the curve.

When More Thinking Hurts, the paper we would put in the center of the critique, reports diminishing returns at higher reasoning budgets, and identifies cases where extended reasoning is associated with abandoning previously correct answers.

That result should make everyone cautious about the phrase “more intelligent.” Longer reasoning is not a monotone good. On some tasks it helps. On some tasks it wastes compute. On some tasks it gives the model enough rope to talk itself out of the right answer.

The Price of a Second Thought frames the same issue as reasoning efficiency. Thinking models can waste computation on easy problems while adding value on harder ones. That sounds obvious until you look at how leaderboards are usually consumed: a single score, detached from how much compute was spent to get it.

If Fable wins hard reasoning tasks by spending more on hard reasoning tasks, good. That is the right use of test-time compute. If it spends the same swollen budget on trivial tasks, the product is not smarter in the way users care about. It is expensive by default. That default only becomes defensible if a benchmark separates Fable’s accuracy gain from the tokens, latency, and dollars spent to buy it.

What our own logs show

We ran the obvious check first: pull 334 claude-fable-5 assistant turns from a researcher project’s session logs and bucket them by user-prompt length. If Fable were adaptive, longer prompts should buy more output. The single-turn medians came out flat. That looks damning until you remember the proxy is broken. In an agentic CLI, prompt length is a terrible signal for task difficulty. A one-word “yes” can authorize a multi-step refactor. A 200-word prompt can be context the model mostly skips.

So we re-cut the data at the task grain. A task is one user prompt through to the next, summing across every assistant turn Fable ran inside that window. Two findings survive that re-cut.

1. There is a per-task floor — even on warm sessions. The 60 tasks where Fable made zero tool calls — pure conversational replies like “what do you think?” or “is this a good idea?” — still cost a median of 2,346 output tokens and 20 seconds each. Those questions arrived in sessions where the model already had the full project context loaded and the conversation had done the heavy lifting. The literature has a name for what is missing: difficulty-aware reasoning budget allocation. AdaCtrl frames it directly — current reasoning models “frequently generate unnecessarily lengthy reasoning chains for simple problems” — and proposes self-assessed difficulty as the missing capability. Fable defaults to spending; the user does not get to opt out.

2. Short user prompts are not cheap prompts. Tasks triggered by a ≤5-word user message (n=36) had a median of 6,221 output tokens of subsequent Fable work — more than the median for 6–20-word or 21–50-word prompts. The one-word prompts are the green lights: “yes”, “do all”, “c”, “add”. What they unleash is whatever complex thing Fable just proposed. In Claude Code there is no API for “yes, but do it cheaply.”

User prompt	n	median output tokens	median wall-clock
≤ 5 words (“yes”, “do all”, “c”)	36	6,221	54 s
6–20 words	78	3,135	29 s
21–50 words	20	6,087	57 s
51–150 words	13	4,037	48 s
151+ words	187	8,705	80 s

We do not have a baseline to compare against — no parallel run with a non-reasoning model on the same tasks. So we cannot claim Fable is over-spending in some absolute sense. What we can see is that the bill is real, the floor is non-trivial, and the user has no knob to dial it down on tasks they already know are cheap.

That is the shape of an “intelligence” that has decided what it costs before the user gets a vote. Until vendors publish the cost-accuracy frontier and let the user pick a point, the token tax is whatever the default reasoner felt like spending.

The benchmark we want

OckBench says the quiet part clearly: current benchmarks over-emphasize accuracy and output quality while neglecting token efficiency. It reports that models with similar accuracy can differ by up to 5× in token length. That is not a cosmetic metric. It changes latency, serving cost, energy, and whether an agentic workflow fits inside a real budget.

So we do not want a single Fable win-rate chart. We want a frontier.

Question	Why it matters
Accuracy at fixed output-token ceilings	Separates better reasoning from longer reasoning
Accuracy at fixed dollar budget	Tells operators what they can actually buy
Accuracy at fixed latency budget	Matters for interactive products
Easy/hard task split	Reveals overthinking on easy tasks
Token distribution, not just average	Shows tail behavior and runaway traces
Prior Claude vs Fable at same budget	Tests whether the endpoint moved the frontier
Fable vs cheaper non-Anthropic endpoints at same budget	Tests whether the premium is economically justified

The key phrase is “moved the frontier.” If Fable gets higher accuracy at the same cost and latency, Anthropic has a strong claim. If it gets higher accuracy by moving to a much more expensive point on the same curve, the claim is weaker. It may still be a useful product. It is not the same scientific statement.

The line we would draw

Here is the charitable, technical version:

Fable may be a better endpoint, but Anthropic has not shown whether it is a more efficient reasoner. Without token-normalized and cost-normalized results, the claim “more intelligent” mixes model capability with test-time compute policy.

That is not anti-Anthropic. It is anti-unpriced-intelligence.

The field already has the vocabulary: test-time scaling, overthinking, reasoning efficiency, compute-accuracy Pareto frontiers, economic evaluation. Vendors should use it. If a model is better because it thinks longer, say that. If it is better because it spends tokens more selectively, show the easy/hard split. If it is better at the same budget, publish the frontier and take the win.

Until then, our working assumption is simple: every “smarter” reasoning model comes with a hidden invoice. We want the invoice printed next to the benchmark.