The Context Window Tax: How 48% of AI Tool Output Is Wasted Tokens

TL;DR

AI coding assistants burn tokens on structure, not meaning. A 200-row JSON API response sends thousands of tokens of brackets, quotes, and repeated key names that the model uses exactly zero times for reasoning. In measured runs across JSON, CSV, YAML, and debug output:

Content Type	Token Savings
Medium JSON	65.0%
API-shaped responses	50.3%
Nested objects	48.2%
Multilingual content	46.3%
Large JSON	42.6%
CSV-like data	24.5%
Average (48 cases)	48.1%

Source: Toonify benchmark run, April 30, 2026.

The mechanism: intercept tool output via a PostToolUse hook before it reaches the context window. No changes to how you use Claude Code.

The Problem: JSON Is the Wrong Format for LLM Input

JSON was designed for machine-to-machine serialization. Every value is wrapped in quotes, every key is repeated for every object in an array, every nesting level adds brackets and commas. For a program parsing JSON, that structure is the contract. For an LLM reading it, most of that structure is noise.

Consider a 10-row API response:

[
  {"id": 1, "name": "Blue Lake Trail", "distanceKm": 7.5, "elevationGain": 320},
  {"id": 2, "name": "Royal Arch", "distanceKm": 5.2, "elevationGain": 490},
  ...8 more rows
]

The four keys id, name, distanceKm, and elevationGain each appear once per row, so 10 rows carry 40 key occurrences. The model does not need that repetition to understand the schema; it needs the schema once, then the values.

TOON (Token-Oriented Object Notation) addresses this directly. For uniform arrays of objects, it uses a CSV-style tabular layout: declare the headers once, then list values row by row:

id | name | distanceKm | elevationGain
1 | Blue Lake Trail | 7.5 | 320
2 | Royal Arch | 5.2 | 490

The data model is lossless — TOON is a drop-in representation of JSON, not a lossy summary. The difference is structural overhead: TOON strips what exists for parsers, not for models.
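To make the header-once idea concrete, here is a minimal sketch, assuming a uniform array in which every object has the same keys in the same order. The helper name to_tabular and the pipe-delimited layout are illustrative and simply mirror the example above; this is not the Toonify or TOON encoder, and it does not escape values that contain the delimiter.

```python
import json


def to_tabular(rows: list[dict]) -> str:
    """Render a uniform list of dicts as one header line plus one line per row.

    Hypothetical helper for illustration; not part of Toonify or the TOON spec.
    """
    if not rows:
        return ""
    headers = list(rows[0].keys())
    # Only uniform arrays qualify: every row must share the same keys.
    if any(list(row.keys()) != headers for row in rows):
        raise ValueError("rows are not uniform; tabular layout does not apply")
    lines = [" | ".join(headers)]
    for row in rows:
        lines.append(" | ".join(str(row[key]) for key in headers))
    return "\n".join(lines)


trails = json.loads(
    '[{"id": 1, "name": "Blue Lake Trail", "distanceKm": 7.5, "elevationGain": 320},'
    ' {"id": 2, "name": "Royal Arch", "distanceKm": 5.2, "elevationGain": 490}]'
)
table = to_tabular(trails)
print(table)
```

The key names are written once regardless of how many rows follow, which is where the savings on uniform arrays come from.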

Why Hook-Based Compression Works

The first version of Toonify was an MCP server: you called a tool explicitly to compress content. No one used it consistently.

The insight that unlocked real usage was the PostToolUse hook. Claude Code fires this event after every tool call, before the result enters the context window. Intercepting there means:

Every Bash output gets compressed automatically
Every Read result gets compressed automatically
No extra commands, no workflow changes, no discipline required

The hook reads the tool output, detects the content type (JSON, CSV, YAML, or debug output), compresses when the savings exceed a 30% threshold, and otherwise passes the result through unchanged. Three more cases also pass through: content under 50 tokens, content that compression would break, and any error during processing. The implementation never breaks the workflow.

This design principle — always fail silently, never interrupt the session — is what made automatic hook execution safe to ship without per-tool opt-in.
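The pass-through rules can be sketched as a small guard around whatever compressor is in use. This is an illustration under stated assumptions, not Toonify's actual code: the names guard and estimate_tokens are hypothetical, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and a compressor that cannot safely represent the content is assumed to raise.

```python
from typing import Callable

MIN_TOKENS = 50      # below this, pass through (threshold from the text)
MIN_SAVINGS = 0.30   # compress only if savings exceed this fraction


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def guard(text: str, compress: Callable[[str], str]) -> str:
    """Return the compressed text only when it is clearly worth it."""
    try:
        original = estimate_tokens(text)
        if original < MIN_TOKENS:
            return text
        candidate = compress(text)
        savings = 1 - estimate_tokens(candidate) / original
        return candidate if savings > MIN_SAVINGS else text
    except Exception:
        # Fail open: any error leaves the tool output untouched.
        return text
```

Wrapping the whole body in one try block is deliberate: the worst outcome of a bug in the compressor is an uncompressed result, never a failed tool call.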

The v0.7.0 Expansion: Debug Output Has the Same Problem

Six months after launch, usage patterns revealed a second culprit: debug output.

Long test failure logs, TypeScript compiler diagnostics, and lint output share the same token-waste pattern as JSON — repetition that isn’t meaning. A TypeScript error that repeats the file path and error code on 40 lines for 40 instances of the same problem is structurally identical to a JSON array repeating 40 object keys.

The v0.7.0 compressor handles this by collapsing repeated diagnostic patterns:

Identical consecutive lines collapse to one with a (×N) count
Similar TypeScript/lint diagnostics with different file paths collapse to a representative sample with a count
Stack traces retain the top frames (where the failure is) and collapse the library internals below

In practice this makes test-failure context significantly shorter without losing the information that matters for debugging — which file, which line, which error.
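The first of those rules, collapsing identical consecutive lines, is simple enough to sketch in a few lines. The function name collapse_repeats and the sample log are made up for illustration; the other two rules need format-specific parsing and are not shown here.

```python
from itertools import groupby


def collapse_repeats(text: str) -> str:
    """Collapse runs of identical consecutive lines into one line with a count."""
    out = []
    for line, run in groupby(text.splitlines()):
        count = sum(1 for _ in run)
        out.append(f"{line} (×{count})" if count > 1 else line)
    return "\n".join(out)


log = "\n".join(
    ["FAIL src/app.test.ts"] + ["  retrying connection..."] * 3 + ["  giving up"]
)
collapsed = collapse_repeats(log)
print(collapsed)
```

Only adjacent duplicates are merged, so the order of events in the log is preserved.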

When Compression Helps Most (and Least)

High compression (50–65%): Uniform arrays of objects, such as API responses, database query results, and log entries with shared schemas. The key-repetition problem is largest here.

Medium compression (42–48%): Nested JSON, multilingual content (where TOON handles Unicode correctly), large JSON payloads.

Low compression (24%): CSV-like data that’s already column-structured. TOON still removes quoting overhead and normalizes whitespace, but the gains are smaller because CSV is already relatively compact.

Skip entirely: Short content under 50 tokens, free-form prose, content where exact original formatting must be preserved (binary data, certain configs). The evaluator skips these rather than risking corruption.

The Caching Layer

For real-world AI coding sessions, the same content appears repeatedly — you read the same config file, you run the same test suite, you get the same API response. Rather than re-compressing identical content on every hook firing, Toonify caches compression results in a local LRU cache with a one-hour TTL.

The cache is keyed by content hash. A cache hit means near-zero overhead per call: compression work happens once per unique content for as long as the entry stays in the cache.
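A content-hash cache with LRU eviction and a TTL might look roughly like the sketch below. The class name CompressionCache, the choice of SHA-256, the 256-entry limit, and the injectable clock are all assumptions made for illustration; the source only states that the cache is LRU, keyed by content hash, with a one-hour TTL.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable


class CompressionCache:
    """LRU cache with per-entry expiry, keyed by a hash of the content."""

    def __init__(self, max_entries: int = 256, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._entries = OrderedDict()  # key -> (stored_at, compressed)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def get_or_compute(self, content: str, compress: Callable[[str], str]) -> str:
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            self._entries.move_to_end(key)  # mark as most recently used
            return hit[1]
        result = compress(content)  # miss or expired: do the work once
        self._entries[key] = (now, result)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
        return result
```

Hashing the content rather than the file path or command means the same bytes hit the cache no matter which tool produced them.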

Field Note

The 30–65% range in the original project description came from early benchmark averages. After broader testing across edge cases — very small payloads, already-compact CSVs, multilingual content — the current measured average of 48.1% is more representative of real-world sessions.

The design constraint that shaped every tradeoff: token savings matter only if the compression is transparent. A developer who has to think about when to compress has already paid most of the cognitive cost back.