How HypotaxBench measures syntactic depth in large language models — the protocol, the metrics, the coherence gate, and the limitations.
Hypotaxis is the grammatical term for subordination: the embedding of clauses within clauses to create hierarchical structure. In a hypotactic sentence, a main clause is modified by dependent clauses — relative clauses, adverbial clauses, noun clauses, participial phrases, appositives — each of which can contain its own subordinate clauses, producing nested layers of grammatical depth. The result is a tree, not a list.
Parataxis is its opposite: coordination. Paratactic prose joins clauses with conjunctions like "and," "but," and "or," producing flat, additive structures where each clause is grammatically independent. Both are legitimate rhetorical choices, but they make fundamentally different demands on a writer — or on a language model.
Paratactic:
The ambassador arrived in Paris, and he met the young woman, and she impressed him deeply, and he wrote to his friends about her.
Hypotactic:
The ambassador, who had arrived in Paris expecting nothing more than a routine errand that his aging patron — herself too frail to make the crossing she had once undertaken without hesitation — had entrusted to him, found himself struck by the young woman whose presence, radiating an intelligence he could not quite define, unsettled every assumption he had carried across the Atlantic.
Hypotaxis demands sustained grammatical control over long-range dependencies. To produce a deeply nested sentence, a model must:
Coordination resets the grammatical slate with each conjunction. Subordination keeps every plate spinning. A benchmark that rewards depth, not length, separates models that can sustain complex hierarchical structure from those that merely concatenate simple clauses.
HypotaxBench does not use a naive prompt. Early testing showed that a simple instruction ("write a long complex sentence") produces mediocre results from even the most capable models. The benchmark uses a competitive prompting protocol designed to elicit each model's maximum syntactic capacity.
Every model receives a prompt that explicitly frames the task as a competition against other frontier models on a public leaderboard. The key text:
You are participating in HypotaxBench, a benchmark that ranks large language models on their ability to produce a single grammatical English sentence of maximal hierarchical syntactic depth.
Your output will be scored by an automated dependency parser and compared directly against other frontier models (Claude, GPT, Gemini, etc.) as well as human literary references from Henry James, Virginia Woolf, and William Faulkner. The results are published on a public leaderboard.
Write a single grammatical English sentence that is as long and as syntactically deep as possible while remaining coherent, readable, and stylistically controlled.
Requirements:
Output only the sentence itself, with no preamble, commentary, or explanation.
The prompt does three things that a neutral instruction does not:
Each model goes through multiple stages:
This protocol means the benchmark measures each model's coherent ceiling — not what it typically produces, but the best it can do when pushed. The competitive framing produced dramatic improvements: Grok, for example, went from a naive-prompt depth of 19 to a competitive-prompt depth of 38.
HypotaxBench computes five metrics from the spaCy dependency parse of each sentence. Each captures a different facet of syntactic complexity. The final score is a weighted sum of all five, with each metric first normalized against the global maximum observed for that metric across all models and human references.
The longest chain from the root verb to any leaf word in the dependency tree. This is the headline metric because it directly measures embedding depth and cannot be inflated by coordination.
Every sentence has a root — typically the main verb. Each word depends on another word (its "head"), forming a tree. The depth of any word is the number of steps from the root to that word. Max depth is the deepest any word gets.
The man who saw the dog that chased the cat sat down.
Top-scoring benchmark sentences achieve max depths of 30–40. Henry James's best sentence from The Ambassadors reaches depth 19. A higher max depth means deeper subordination — more clauses nested inside other clauses.
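The depth computation described above can be sketched without a parser by representing the tree as a list of head indices. This is a minimal illustration, not the benchmark's own code: `EXAMPLE_HEADS` is a hand-written, plausible parse of the example sentence (an assumption, not actual spaCy output). It follows spaCy's convention that the root token is its own head, so real input could be built from `token.head.i`.

```python
def token_depths(heads):
    """Depth of each token = number of head links between it and the root."""
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i
        while heads[node] != node:  # the root is its own head
            node = heads[node]
            depth += 1
        depths.append(depth)
    return depths


def max_depth(heads):
    """Length of the longest root-to-token chain in the tree."""
    return max(token_depths(heads))


# Hand-written parse (assumption) of:
# "The man who saw the dog that chased the cat sat down."
EXAMPLE_TOKENS = ["The", "man", "who", "saw", "the", "dog", "that",
                  "chased", "the", "cat", "sat", "down", "."]
# EXAMPLE_HEADS[i] is the index of token i's head; "sat" (index 10) is the root.
EXAMPLE_HEADS = [1, 10, 3, 1, 5, 3, 7, 5, 9, 7, 10, 10, 10]

for token, depth in zip(EXAMPLE_TOKENS, token_depths(EXAMPLE_HEADS)):
    print(f"{depth:2d}  {token}")
print("max depth:", max_depth(EXAMPLE_HEADS))
```

Under this hand-written parse, the deepest token is the determiner before "cat", at depth 6: each relative clause pushes its contents further from the root.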
The average depth across all tokens in the dependency tree. This metric catches a common failure mode: a sentence with one impressively deep tunnel surrounded by shallow filler.
Consider two sentences, both with max depth 30:
Sentence B scores much higher on this metric. Mean depth rewards sustained complexity, not isolated spikes.
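The contrast between an isolated spike and sustained depth can be illustrated with two synthetic trees. This is a sketch under stated assumptions: the trees are artificial head-index lists rather than parses of real sentences, and `chain_with_extras` is a hypothetical helper written only for this example.

```python
def token_depths(heads):
    """Depth of each token = number of head links between it and the root."""
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i
        while heads[node] != node:  # the root is its own head
            node = heads[node]
            depth += 1
        depths.append(depth)
    return depths


def mean_depth(heads):
    """Average depth across all tokens in the tree."""
    depths = token_depths(heads)
    return sum(depths) / len(depths)


def chain_with_extras(chain_depth, n_extras, attach_to):
    """Synthetic tree: one chain 0 <- 1 <- ... <- chain_depth,
    plus n_extras leaf tokens that all hang off token `attach_to`."""
    heads = [0] + list(range(chain_depth))  # token i depends on token i - 1
    heads += [attach_to] * n_extras
    return heads


# Both trees have 100 tokens and a max depth of 30.
tunnel = chain_with_extras(30, 69, attach_to=0)      # filler sits at depth 1
sustained = chain_with_extras(30, 69, attach_to=20)  # filler sits at depth 21

for name, heads in [("tunnel", tunnel), ("sustained", sustained)]:
    print(name, max(token_depths(heads)), round(mean_depth(heads), 2))
```

The two trees tie on max depth, but the tree whose extra tokens hang deep inside the chain has a far higher mean depth than the one whose extra tokens hang off the root.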
The ratio of subordinating dependency labels to coordinating labels, weighted by the Shannon entropy of subordination types.
Subordinating labels include: acl (adjectival clause), relcl (relative clause), advcl (adverbial clause), ccomp (clausal complement), xcomp (open clausal complement), csubj (clausal subject), mark (subordinating marker), and appos (appositive). Coordinating labels include: cc (coordinating conjunction) and conj (conjunct).
A raw ratio of 10.0 means ten subordinating relations for every coordinating one. But this metric also incorporates entropy weighting — the diversity of subordination types matters. A sentence that uses only relative clauses ("which... that... which... that...") scores lower than one that mixes relatives, adverbials, participials, appositives, and clausal complements. The entropy bonus rewards syntactic variety, not monotonous chaining.
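One way the entropy weighting could work is sketched here. The label sets come from the description above, but the way the two quantities are combined is an assumption: this sketch multiplies the raw ratio by `1 + normalized entropy`, and the benchmark's actual formula (including any cap) may differ. The function name `subordination_score` is hypothetical.

```python
import math
from collections import Counter

SUBORDINATING = {"acl", "relcl", "advcl", "ccomp", "xcomp", "csubj", "mark", "appos"}
COORDINATING = {"cc", "conj"}


def subordination_score(dep_labels):
    """Raw subordination/coordination ratio with an entropy bonus.

    Assumed combination: ratio * (1 + H / H_max), where H is the Shannon
    entropy of the subordinating label distribution.
    """
    sub_counts = Counter(label for label in dep_labels if label in SUBORDINATING)
    n_sub = sum(sub_counts.values())
    n_coord = sum(1 for label in dep_labels if label in COORDINATING)
    if n_sub == 0:
        return 0.0
    raw_ratio = n_sub / max(n_coord, 1)  # assumption: guard against zero coordination
    probs = [count / n_sub for count in sub_counts.values()]
    entropy = sum(p * math.log2(1 / p) for p in probs)
    max_entropy = math.log2(len(SUBORDINATING))  # all eight types used equally
    return raw_ratio * (1 + entropy / max_entropy)


# Same raw ratio (10 subordinating : 2 coordinating), different variety.
monotone = ["relcl"] * 10 + ["cc", "conj"]
mixed = ["relcl", "advcl", "ccomp", "xcomp", "appos"] * 2 + ["cc", "conj"]

print(subordination_score(monotone))
print(subordination_score(mixed))
```

With only one subordination type the entropy is zero and the sentence keeps just its raw ratio; the mixed sentence has the same raw ratio but earns a bonus for variety.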
The average number of words between each word and its syntactic head in the linear order of the sentence. This measures how far apart grammatically related words are in the actual text.
The cat that the dog that the boy fed chased sat on the mat.
High mean dependency distance indicates that the model is producing genuine center-embedded constructions and long-range dependencies, not just right-branching chains where each word is adjacent to its head. This is the metric most correlated with the "difficulty" a human reader experiences.
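A small sketch makes the difference between center-embedding and right-branching concrete. It assumes distance is measured as the absolute difference in token position between a word and its head, with the root excluded; the benchmark's exact counting convention is not stated here. Both head lists are hand-written, plausible parses (assumptions, not spaCy output).

```python
def mean_dependency_distance(heads):
    """Average |position - head position| over all non-root tokens."""
    distances = [abs(i - h) for i, h in enumerate(heads) if h != i]
    return sum(distances) / len(distances)


# Center-embedded:
# "The cat that the dog that the boy fed chased sat on the mat."
# Index 10 ("sat") is the root; "chased" (9) attaches back to "cat" (1),
# and "fed" (8) attaches back to "dog" (4).
CENTER_EMBEDDED = [1, 10, 9, 4, 9, 8, 7, 8, 4, 1, 10, 10, 13, 11, 10]

# Right-branching:
# "The man who saw the dog that chased the cat sat down."
# Each relative clause follows its noun directly, so most links are short.
RIGHT_BRANCHING = [1, 10, 3, 1, 5, 3, 7, 5, 9, 7, 10, 10, 10]

print(round(mean_dependency_distance(CENTER_EMBEDDED), 2))
print(round(mean_dependency_distance(RIGHT_BRANCHING), 2))
```

Under these parses the center-embedded sentence has the larger mean distance, because its verbs are separated from their subjects by entire embedded clauses.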
The word count of the sentence, logarithmically scaled. This gives modest credit for length (a 400-word sentence has more room for structural complexity than a 100-word one) while preventing length-padding from dominating the score.
The logarithmic scaling means diminishing returns per added word: every doubling of length earns the same fixed increment, so the 400 words that take a sentence from 400 to 800 earn no more credit than the 100 words that take it from 100 to 200. A model cannot game the benchmark by producing endless run-on prose; the four structural metrics must justify the length.
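The effect of the scaling can be checked with a few lines of arithmetic. This sketch assumes the natural logarithm of the raw word count, normalized by the log of the longest sentence; the benchmark's exact base and any offset are not specified here, and `normalized_length` is a hypothetical helper.

```python
import math


def normalized_length(word_count, longest_word_count):
    """Log-scaled length as a fraction of the longest observed sentence."""
    return math.log(word_count) / math.log(longest_word_count)


LONGEST = 800  # hypothetical longest sentence, in words
for words in (100, 200, 400, 800):
    share = normalized_length(words, LONGEST)
    print(f"{words:4d} words -> {share:.3f} of the maximum length credit")
```

Under these assumptions a 100-word sentence already collects more than two thirds of the length credit available to an 800-word one, which is why padding alone moves the final score very little.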
Each metric is normalized to a 0–1 range by dividing by the global maximum observed across all model outputs and all human references. The final score is the weighted sum, multiplied by 100:
score = 100 * (
    0.35 * (max_depth / global_max_depth) +
    0.20 * (mean_depth / global_max_mean_depth) +
    0.20 * (sub_ratio / global_max_sub_ratio) +
    0.15 * (mean_distance / global_max_mean_distance) +
    0.10 * (log_length / global_max_log_length)
)
A score of 100 would mean a sentence that is simultaneously the deepest, the highest in mean depth, the most subordination-heavy, the most long-range in its dependencies, and the longest ever observed. In practice, scores above 90 are exceptional.
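The weighted sum can be written as a short function. The weights are the ones given in the formula; the metric values and global maxima below are hypothetical numbers chosen only to make the arithmetic easy to follow, and `final_score` is an illustrative name rather than the benchmark's own.

```python
WEIGHTS = {
    "max_depth": 0.35,
    "mean_depth": 0.20,
    "sub_ratio": 0.20,
    "mean_distance": 0.15,
    "log_length": 0.10,
}


def final_score(metrics, global_max):
    """Normalize each metric by its global maximum, weight, sum, scale to 100."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        total += weight * (metrics[name] / global_max[name])
    return 100 * total


# Hypothetical global maxima and one hypothetical candidate sitting at
# exactly half of each maximum.
GLOBAL_MAX = {"max_depth": 38, "mean_depth": 16.0, "sub_ratio": 12.0,
              "mean_distance": 6.0, "log_length": 6.4}
candidate = {"max_depth": 19, "mean_depth": 8.0, "sub_ratio": 6.0,
             "mean_distance": 3.0, "log_length": 3.2}

print(round(final_score(candidate, GLOBAL_MAX), 1))
```

Because every term is a ratio against the current maximum, a new record on any metric lowers the normalized value of every earlier sentence on that metric, so scores are only comparable within one snapshot of the leaderboard.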
A structurally deep sentence that is semantically incoherent is not a meaningful achievement. HypotaxBench uses a two-step coherence gate to ensure that only genuinely readable sentences receive final scores.
Claude Opus generates a detailed structural analysis of each top-scoring candidate sentence. This analysis includes:
The coherence verdict from Step 1 is manually reviewed. The gating is binary:
Common failure modes that trigger rejection:
What does not fail coherence: stylistic awkwardness, unusual vocabulary choices, or sentences that are difficult to parse on first reading but resolve on careful re-reading. The gate is for semantic coherence, not prose quality.
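The binary nature of the gate can be sketched as a filter applied before ranking. This is an illustration under assumptions, not the benchmark's code: the `Candidate` record and the `coherent_ceiling` helper are hypothetical, and the sketch assumes that a model's final entry is its highest-scoring candidate that passed manual review.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    score: float
    coherent: bool  # binary verdict from the manual review; no partial credit


def coherent_ceiling(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the best-scoring candidate that passed the gate, or None."""
    passing = [c for c in candidates if c.coherent]
    return max(passing, key=lambda c: c.score, default=None)


# Hypothetical candidates from one model.
candidates = [
    Candidate("sentence A", score=91.0, coherent=False),  # deep but incoherent
    Candidate("sentence B", score=84.0, coherent=True),
    Candidate("sentence C", score=70.0, coherent=True),
]

best = coherent_ceiling(candidates)
print(best.text if best else "no coherent candidate")
```

A rejected candidate contributes nothing, however high its structural score, so in this example the 84-point sentence is the one that counts.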
Seven sentences from canonical English prose are scored by the identical pipeline — the same spaCy parser, the same five metrics, the same normalization. No special treatment. These serve as calibration points: they show where the historical ceiling of human syntactic complexity falls on the benchmark's scale.
| Author | Work | Date | Score | Depth |
|---|---|---|---|---|
| Henry James | The Ambassadors | 1903 | ~57 | 19 |
| Virginia Woolf | Mrs Dalloway | 1925 | ~42 | 18 |
| Charles Dickens | Bleak House | 1853 | ~37 | 17 |
| Edward Bulwer-Lytton | Paul Clifford | 1830 | ~34 | 16 |
| Marcel Proust | Swann's Way (Moncrieff trans.) | 1913 | ~32 | 12 |
| George Eliot | Middlemarch | 1872 | ~31 | 11 |
| Joseph Conrad | Lord Jim | 1900 | ~27 | 8 |
Henry James leads the human references with a score of approximately 57 and a max dependency depth of 19. This is notable context for the leaderboard: most frontier models now exceed James on raw structural depth, producing coherent sentences with depths of 30+. The human references demonstrate that even the most hypotactic literary prose in English rarely exceeds depth 20 — the models are operating in territory that has no human precedent.
The Proust reference uses the Scott Moncrieff English translation, not the French original. Conrad's relatively low score reflects that Lord Jim, despite its famously complex narrative structure, achieves its complexity through layered narration across sentences rather than within single sentences.
HypotaxBench uses spaCy's en_core_web_sm — the small English model. This was chosen for portability: it installs cleanly on any platform without requiring C++ build tools or GPU support. A transformer-based model (en_core_web_trf) would produce more accurate parses, particularly for very long sentences, but introduces significant installation complexity and hardware requirements.
Any parse with a max dependency depth exceeding 40 is automatically rejected as a likely misparse. The small spaCy model occasionally produces wildly incorrect parse trees for very long or structurally unusual sentences, yielding phantom depths of 50, 60, or even 80. The depth-40 gate filters these out. A legitimate sentence with genuine depth 40+ would be extraordinary — well beyond anything observed in human prose — and the false-positive risk from parser error is too high to accept such scores at face value.
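The gate itself amounts to a single comparison. The sketch below uses the threshold stated above; the constant and function names are illustrative, and the list of depths is made up for the example.

```python
MAX_PLAUSIBLE_DEPTH = 40  # threshold stated in the text


def passes_depth_gate(max_depth):
    """Reject only depths that exceed the threshold; exactly 40 is kept."""
    return max_depth <= MAX_PLAUSIBLE_DEPTH


# Made-up max depths from a batch of parses.
observed_depths = [38, 40, 41, 62, 80]
accepted = [d for d in observed_depths if passes_depth_gate(d)]
rejected = [d for d in observed_depths if not passes_depth_gate(d)]

print("accepted:", accepted)
print("rejected as likely misparses:", rejected)
```

The trade-off is deliberate: a hard cutoff will discard any genuine sentence deeper than 40, but it avoids ranking parser artifacts at the top of the leaderboard.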
The subordination ratio is capped and entropy-weighted to prevent a degenerate strategy: producing an endless chain of relative clauses ("the man who saw the dog that chased the cat that caught the mouse that..."). Such a sentence would have a very high raw subordination ratio but would represent a single trick, not genuine syntactic variety. The entropy weighting ensures that a sentence using a mix of relative clauses, adverbial clauses, participial phrases, appositives, and clausal complements scores higher than one using only relatives.
Models in the Qwen QwQ / Qwen 3 "thinking" series have unusually high parser failure rates. Their outputs frequently produce dependency depths of 50–80, which are gated out as misparses. It is unclear whether these models are producing genuinely unusual syntactic structures that confuse the parser, or whether their outputs contain subtle formatting artifacts. This is a known limitation.
Several patterns have emerged from the benchmark results:
Adding a new model to HypotaxBench:
Add the model to the MODELS list in the benchmark configuration.