HypotaxBench — Methodology

What is Hypotaxis?
The Competitive Protocol
What We Measure
Coherence Gating
Human References
Parser & Limitations
Key Findings
How to Add a Model

1. What is Hypotaxis?

Hypotaxis is the grammatical term for subordination: the embedding of clauses within clauses to create hierarchical structure. In a hypotactic sentence, a main clause is modified by dependent clauses — relative clauses, adverbial clauses, noun clauses, participial phrases, appositives — each of which can contain its own subordinate clauses, producing nested layers of grammatical depth. The result is a tree, not a list.

Parataxis is its opposite: coordination. Paratactic prose joins clauses with conjunctions like "and," "but," and "or," producing flat, additive structures where each clause is grammatically independent. Both are legitimate rhetorical choices, but they make fundamentally different demands on a writer — or on a language model.

Paratactic (coordination)

The ambassador arrived in Paris, and he met the young woman, and she impressed him deeply, and he wrote to his friends about her.

Four independent clauses joined by "and." Each clause resets the grammatical slate. A model only needs to generate one clause at a time.

Hypotactic (subordination)

The ambassador, who had arrived in Paris expecting nothing more than a routine errand that his aging patron — herself too frail to make the crossing she had once undertaken without hesitation — had entrusted to him, found himself struck by the young woman whose presence, radiating an intelligence he could not quite define, unsettled every assumption he had carried across the Atlantic.

One main clause ("The ambassador ... found himself struck") with five levels of embedding. The model must hold the incomplete main clause in memory while generating each nested subordinate clause, then land the sentence grammatically.

Why This Matters for LLMs

Hypotaxis demands sustained grammatical control over long-range dependencies. To produce a deeply nested sentence, a model must:

Track subject-verb dependencies across intervening clauses: the main verb "found" must take "ambassador" as its subject, not "patron," "crossing," or "woman," even though those nouns are closer in the token sequence.
Maintain referential clarity as pronouns accumulate — "herself," "she," "him," "he," and "her" must all point to the correct antecedent despite layers of nesting.
Preserve logical coherence across nested propositions — the relationships between clauses (cause, concession, temporal sequence, attribution) must make sense to a reader.
Close every bracket — each opened subordinate clause must resolve grammatically before or as the sentence terminates.

Coordination resets the grammatical slate with each conjunction. Subordination keeps every plate spinning. A benchmark that rewards depth, not length, separates models that can sustain complex hierarchical structure from those that merely concatenate simple clauses.

2. The Competitive Protocol

HypotaxBench does not use a naive prompt. Early testing showed that a simple instruction ("write a long complex sentence") produces mediocre results from even the most capable models. The benchmark uses a competitive prompting protocol designed to elicit each model's maximum syntactic capacity.

The Core Prompt

Every model receives a prompt that explicitly frames the task as a competition against other frontier models on a public leaderboard. The key text:

You are participating in HypotaxBench, a benchmark that ranks large language models on their ability to produce a single grammatical English sentence of maximal hierarchical syntactic depth.

Your output will be scored by an automated dependency parser and compared directly against other frontier models (Claude, GPT, Gemini, etc.) as well as human literary references from Henry James, Virginia Woolf, and William Faulkner. The results are published on a public leaderboard.

Write a single grammatical English sentence that is as long and as syntactically deep as possible while remaining coherent, readable, and stylistically controlled.

Requirements:

The output must be exactly one sentence, terminated by a single sentence-final punctuation mark.
Maximize hierarchical embedding — nested subordinate clauses, relative clauses, appositives, participial phrases, parentheticals — rather than coordination. Coordinating conjunctions (and, or, but, nor, so, yet) may appear but must not be the primary mechanism of length. Use a VARIETY of subordination types, not just chains of relative clauses.
The sentence must be parseable as a single well-formed structure with one main clause.
Maintain clear logical and narrative progression; a reader should be able to follow the thought from beginning to end. Referential consistency is critical — do not contradict yourself.
Aim for the register of a highly literate prose stylist (Henry James, Marcel Proust, George Eliot) — not pastiche, but controlled literary English.
Use punctuation (commas, semicolons, dashes, parentheses) to support structure, not to disguise run-ons.
AIM HIGH. The current top-scoring models produce sentences of 300–500 words with dependency depth exceeding 30. Your sentence will be judged on both structural depth AND coherence — a deep but incoherent sentence will be penalized.

Output only the sentence itself, with no preamble, commentary, or explanation.

Why Competitive Framing Works

The prompt does three things that a neutral instruction does not:

Names the competition. Models are told they are being compared to Claude, GPT, Gemini, and others. This activates whatever alignment toward high performance the model has internalized during training.
Sets explicit targets. By stating that top models reach depth 30+ and 300–500 words, the prompt anchors the model's ambition. Without this, most models produce sentences of 80–150 words with depth 12–18.
Demands variety. The second requirement explicitly discourages monotonous relative-clause chains ("the man who saw the dog that chased the cat that caught the mouse that...") in favor of mixed subordination types.

The Iterative Process

Each model goes through multiple stages:

5 independent runs at temperature 0.8 — capturing the variance in the model's generative capacity.
Structural scoring of all 5 candidates using the spaCy dependency parser.
Improvement rounds — the model is shown its own best sentence and asked to beat it. This iterative self-competition finds the model's ceiling: the deepest coherent sentence it can produce.
Coherence review of the top candidates (see Section 4).

This protocol means the benchmark measures each model's coherent ceiling — not what it typically produces, but the best it can do when pushed. The competitive framing produced dramatic improvements: Grok, for example, went from a naive-prompt depth of 19 to a competitive-prompt depth of 38.
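The stages above can be sketched as a best-of-n sample followed by a self-improvement loop. This is a minimal sketch, not the benchmark's actual code: `generate` and `score` are hypothetical callables standing in for the model API call (at temperature 0.8) and the structural scorer, and the number of improvement rounds (`n_rounds`) is an assumption, since the text does not state it.

```python
from typing import Callable, List, Tuple


def find_ceiling(
    generate: Callable[[str], str],   # hypothetical: prompt -> one sentence
    score: Callable[[str], float],    # hypothetical: sentence -> structural score
    base_prompt: str,
    n_runs: int = 5,                  # five independent runs, as in the protocol
    n_rounds: int = 3,                # assumed; the source gives no round count
) -> Tuple[str, float]:
    """Best-of-n sampling followed by self-competition rounds."""
    # Independent runs, each scored structurally.
    candidates: List[str] = [generate(base_prompt) for _ in range(n_runs)]
    best = max(candidates, key=score)
    best_score = score(best)

    # Improvement rounds: show the model its own best sentence and ask it to beat it.
    for _ in range(n_rounds):
        challenge = (
            f"{base_prompt}\n\nYour best sentence so far:\n{best}\n\n"
            "Write a deeper, still coherent sentence."
        )
        attempt = generate(challenge)
        attempt_score = score(attempt)
        if attempt_score > best_score:  # keep only strict improvements
            best, best_score = attempt, attempt_score
    return best, best_score
```

The coherence review is deliberately left out of this loop; in the protocol it runs afterwards, on the top candidates only.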

3. What We Measure

HypotaxBench computes five metrics from the spaCy dependency parse of each sentence. Each captures a different facet of syntactic complexity. The final score is a weighted sum of all five, normalized against the global maximum observed across all models and human references.

a. Max Dependency Depth (35% weight)

The longest chain from the root verb to any leaf word in the dependency tree. This is the headline metric because it directly measures embedding depth and cannot be inflated by coordination.

Every sentence has a root — typically the main verb. Each word depends on another word (its "head"), forming a tree. The depth of any word is the number of steps from the root to that word. Max depth is the deepest any word gets.

Example

The man who saw the dog that chased the cat sat down.

Root: "sat" (depth 0) → "man" (depth 1) → "saw" (depth 2, relative clause) → "dog" (depth 3) → "chased" (depth 4, nested relative clause) → "cat" (depth 5). The noun "cat" is five dependency steps from the root verb "sat." Its determiner "the" hangs one step lower, so a count over every token, leaves included, gives max depth = 6.

Top-scoring benchmark sentences achieve max depths of 30–40. Henry James's best sentence from The Ambassadors reaches depth 19. A higher max depth means deeper subordination — more clauses nested inside other clauses.
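The depth computation can be illustrated without a parser. The sketch below assumes each token stores the index of its head and that the root is its own head (the convention spaCy uses, where such a list can be built as `[t.head.i for t in doc]`). The head indices for the example sentence are hand-assigned for illustration; a real parse may differ.

```python
from typing import List


def token_depths(heads: List[int]) -> List[int]:
    """Depth of each token; heads[i] is the index of token i's head,
    and the root token is its own head."""
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i
        while heads[node] != node:  # climb toward the root
            node = heads[node]
            depth += 1
        depths.append(depth)
    return depths


tokens = ["The", "man", "who", "saw", "the", "dog", "that",
          "chased", "the", "cat", "sat", "down", "."]
# Hand-assigned heads (an assumption, not parser output).
heads = [1, 10, 3, 1, 5, 3, 7, 5, 9, 7, 10, 10, 10]

depths = token_depths(heads)
for token, depth in zip(tokens, depths):
    print(f"{token:8s} {depth}")
print("max depth:", max(depths))
```

Under these assumed heads, "cat" sits at depth 5 and its determiner at depth 6, so whether function words count toward the maximum changes the reported figure by one.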

b. Mean Dependency Depth (20% weight)

The average depth across all tokens in the dependency tree. This metric catches a common failure mode: a sentence with one impressively deep tunnel surrounded by shallow filler.

Consider two sentences, both with max depth 30:

Sentence A: Mean depth 5. One deep chain of relative clauses hangs off an otherwise flat main clause. Most tokens sit at depths 1–3.
Sentence B: Mean depth 15. Complexity is sustained throughout. Multiple branches of the dependency tree reach significant depth.

Sentence B scores much higher on this metric. Mean depth rewards sustained complexity, not isolated spikes.
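A small sketch of the contrast between a spike and sustained depth, using two toy trees that share the same max depth. The trees are invented for illustration, and the choice to include the root (depth 0) in the average is an assumption.

```python
from typing import List


def token_depths(heads: List[int]) -> List[int]:
    """Depth of each token; the root token is its own head."""
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i
        while heads[node] != node:
            node = heads[node]
            depth += 1
        depths.append(depth)
    return depths


def mean_depth(heads: List[int]) -> float:
    depths = token_depths(heads)
    return sum(depths) / len(depths)


# Tree A: one chain of depth 4 plus six tokens attached directly to the root.
spike = [0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0]
# Tree B: two separate chains, each reaching depth 4.
sustained = [0, 0, 1, 2, 3, 0, 5, 6, 7]

for name, heads in (("spike", spike), ("sustained", sustained)):
    print(name, max(token_depths(heads)), round(mean_depth(heads), 2))
```

Both trees report max depth 4, but the second has the higher mean because more of its tokens sit deep in the tree.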

c. Subordination Ratio (20% weight)

The ratio of subordinating dependency labels to coordinating labels, weighted by the Shannon entropy of subordination types.

Subordinating labels include: acl (adjectival clause), relcl (relative clause), advcl (adverbial clause), ccomp (clausal complement), xcomp (open clausal complement), csubj (clausal subject), mark (subordinating marker), and appos (appositive). Coordinating labels include: cc (coordinating conjunction) and conj (conjunct).

A raw ratio of 10.0 means ten subordinating relations for every coordinating one. But this metric also incorporates entropy weighting — the diversity of subordination types matters. A sentence that uses only relative clauses ("which... that... which... that...") scores lower than one that mixes relatives, adverbials, participials, appositives, and clausal complements. The entropy bonus rewards syntactic variety, not monotonous chaining.
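The text does not give the exact formula that combines the raw ratio with entropy, so the following is only one plausible reading: the raw ratio multiplied by the Shannon entropy of the subordination types, normalized by the maximum entropy over the eight labels. The multiplicative form, the floor of 1 on the coordination count, and the absence of a cap are all assumptions.

```python
import math
from collections import Counter
from typing import Iterable

SUBORDINATING = {"acl", "relcl", "advcl", "ccomp", "xcomp", "csubj", "mark", "appos"}
COORDINATING = {"cc", "conj"}


def shannon_entropy(counts: Counter) -> float:
    """Entropy in bits of the distribution given by the counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def subordination_score(labels: Iterable[str]) -> float:
    labels = list(labels)
    sub = Counter(label for label in labels if label in SUBORDINATING)
    n_sub = sum(sub.values())
    n_coord = sum(1 for label in labels if label in COORDINATING)
    raw_ratio = n_sub / max(n_coord, 1)  # assumed guard against division by zero
    # Diversity in [0, 1]: 1.0 only when all eight types are equally frequent.
    diversity = shannon_entropy(sub) / math.log2(len(SUBORDINATING))
    return raw_ratio * diversity


monotone = ["relcl"] * 10 + ["cc", "conj"]
mixed = ["relcl", "advcl", "ccomp", "appos", "acl"] * 2 + ["cc", "conj"]
print(subordination_score(monotone), round(subordination_score(mixed), 3))
```

Both label lists have the same raw ratio (10 subordinating to 2 coordinating), yet the mixed one scores higher. Under this purely multiplicative assumption a single-type sentence scores zero, which may be harsher than the benchmark's actual weighting.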

d. Mean Dependency Distance (15% weight)

The average distance between each word and its syntactic head, counted as the difference between their positions in the linear order of the sentence. This measures how far apart grammatically related words are in the actual text.

Example: Center-embedding

The cat that the dog that the boy fed chased sat on the mat.

"cat" (position 2) depends on "sat" (position 11): a dependency distance of 9. "dog" (position 5) depends on "chased" (position 10): a distance of 5. "boy" (position 8) depends on "fed" (position 9): a distance of 1. The mean dependency distance is high because center-embedding forces related words apart.

High mean dependency distance indicates that the model is producing genuine center-embedded constructions and long-range dependencies, not just right-branching chains where each word is adjacent to its head. This is the metric most correlated with the "difficulty" a human reader experiences.
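As a rough illustration, the sketch below computes the metric for the center-embedded example and for a right-branching paraphrase of it. The head indices are hand-assigned (zero-based, punctuation omitted) rather than parser output, and excluding the root from the average is an assumption.

```python
from typing import List


def mean_dependency_distance(heads: List[int]) -> float:
    """Average |i - heads[i]| over non-root tokens (the root is its own head)."""
    distances = [abs(i - h) for i, h in enumerate(heads) if i != h]
    return sum(distances) / len(distances) if distances else 0.0


# "The cat that the dog that the boy fed chased sat on the mat"
center_heads = [1, 10, 9, 4, 9, 8, 7, 8, 4, 1, 10, 10, 13, 11]

# "The boy fed the dog that chased the cat that sat on the mat"
right_heads = [1, 2, 2, 4, 2, 6, 4, 8, 6, 10, 8, 10, 13, 11]

print("center-embedded:", round(mean_dependency_distance(center_heads), 2))
print("right-branching:", round(mean_dependency_distance(right_heads), 2))
```

With these assumed heads, the center-embedded version averages more than twice the distance of the right-branching one, even though both sentences use the same fourteen words.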

e. Log Length (10% weight)

The word count of the sentence, logarithmically scaled. This gives modest credit for length (a 400-word sentence has more room for structural complexity than a 100-word one) while preventing length-padding from dominating the score.

The logarithmic scaling means diminishing returns per added word: every doubling of length adds the same fixed increment, so going from 400 to 800 words earns no more than going from 100 to 200, despite costing four times as many words. A model cannot game the benchmark by producing endless run-on prose; the four structural metrics must justify the length.
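A quick numerical check of how a logarithm treats length. The base is not specified in the text; the natural logarithm is assumed here, and the choice does not affect the normalized metric because the base cancels when dividing by the global maximum.

```python
import math


def log_length(word_count: int) -> float:
    """Log-scaled length (natural log assumed)."""
    return math.log(word_count)


previous = None
for n in (100, 200, 400, 800):
    value = log_length(n)
    gain = "" if previous is None else f"  (+{value - previous:.3f})"
    print(f"{n:4d} words -> {value:.3f}{gain}")
    previous = value
```

Each doubling adds the same increment, log 2, so the credit earned per additional word keeps shrinking as the sentence grows.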

Final Score

Each metric is normalized to a 0–1 range by dividing by the global maximum observed across all model outputs and all human references. The final score is the weighted sum, multiplied by 100:

score = 100 * (
    0.35 * (max_depth / global_max_depth) +
    0.20 * (mean_depth / global_max_mean_depth) +
    0.20 * (sub_ratio / global_max_sub_ratio) +
    0.15 * (mean_distance / global_max_mean_distance) +
    0.10 * (log_length / global_max_log_length)
)

A score of 100 would mean a sentence that is simultaneously the deepest, the most consistently complex, the most subordination-heavy, the most long-range in its dependencies, and the longest ever observed. In practice, scores above 90 are exceptional.
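A worked example of the weighted sum, as a sketch. The `Metrics` container and all numeric values below are hypothetical, chosen only so the arithmetic is easy to follow; they are not benchmark results.

```python
from dataclasses import dataclass

# Weights as stated in this section.
WEIGHTS = {
    "max_depth": 0.35,
    "mean_depth": 0.20,
    "sub_ratio": 0.20,
    "mean_distance": 0.15,
    "log_length": 0.10,
}


@dataclass
class Metrics:  # hypothetical container for the five metrics
    max_depth: float
    mean_depth: float
    sub_ratio: float
    mean_distance: float
    log_length: float


def final_score(sentence: Metrics, global_max: Metrics) -> float:
    """Weighted sum of each metric divided by its global maximum, times 100."""
    return 100 * sum(
        weight * getattr(sentence, name) / getattr(global_max, name)
        for name, weight in WEIGHTS.items()
    )


# Invented values: a sentence at exactly half of every global maximum.
global_max = Metrics(40, 16.0, 8.0, 5.0, 6.2)
sentence = Metrics(20, 8.0, 4.0, 2.5, 3.1)
print(round(final_score(sentence, global_max), 1))  # 50.0
```

Because every term is divided by a maximum taken over all outputs and references, a new entry that sets a record on any metric lowers the normalized scores of everything already on the board.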

4. Coherence Gating

A structurally deep sentence that is semantically incoherent is not a meaningful achievement. HypotaxBench uses a two-step coherence gate to ensure that only genuinely readable sentences receive final scores.

Step 1: Structural Analysis

Claude Opus generates a detailed structural analysis of each top-scoring candidate sentence. This analysis includes:

The core skeleton of the sentence — the main clause with all subordination stripped away.
An embedding summary identifying each level of nesting and the subordination type used.
A coherence verdict: does the sentence make sense from beginning to end? Are all referential chains consistent? Does every pronoun have a clear antecedent?

Step 2: Manual Review

The coherence verdict from Step 1 is manually reviewed. The gating is binary:

"Coherent" = pass. The sentence receives its structural score as its final score.
"Not fully coherent" with specific referential contradictions = fail. The sentence is rejected and the model's next-best candidate is evaluated.

Coherence is binary, not a multiplier. A sentence either passes or it does not. There is no partial credit for "mostly coherent." The structural score IS the final score for sentences that pass. This prevents the benchmark from becoming a subjective quality judgment.
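The pass/fail logic with fallback to the next-best candidate can be sketched as follows. The `is_coherent` callable is a hypothetical stand-in for the combined Claude Opus analysis and manual review, which cannot be reduced to a function in practice.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (sentence, structural score)


def select_final(
    candidates: List[Candidate],
    is_coherent: Callable[[str], bool],  # hypothetical stand-in for the review
) -> Optional[Candidate]:
    """Return the highest-scoring candidate that passes the gate, if any."""
    for sentence, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if is_coherent(sentence):
            # Binary gate: the structural score is kept unchanged.
            return sentence, score
    return None  # every candidate was rejected


candidates = [
    ("deep but contradictory", 91.0),
    ("slightly shallower", 88.5),
    ("shallow", 60.0),
]
# Toy verdict function used only for this illustration.
result = select_final(candidates, lambda s: "contradictory" not in s)
print(result)  # ('slightly shallower', 88.5)
```

The score of the rejected top candidate has no influence on the result: the gate selects, it never rescales.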

What Fails Coherence

Common failure modes that trigger rejection:

Referential contradiction — a character described as having departed in one clause is described as present in a later clause of the same sentence.
Truncation — the sentence ends mid-word or mid-clause, typically because the model hit a token limit.
Pronoun reference errors — a pronoun whose antecedent is genuinely ambiguous or points to the wrong entity.
Logical impossibility — nested propositions that flatly contradict each other in a way no charitable reading can resolve.

What does not fail coherence: stylistic awkwardness, unusual vocabulary choices, or sentences that are difficult to parse on first reading but resolve on careful re-reading. The gate is for semantic coherence, not prose quality.

5. Human References

Seven sentences from canonical English prose are scored by the identical pipeline — the same spaCy parser, the same five metrics, the same normalization. No special treatment. These serve as calibration points: they show where the historical ceiling of human syntactic complexity falls on the benchmark's scale.

Author	Work	Date	Score (approx.)	Max Depth
Henry James	The Ambassadors	1903	~57	19
Virginia Woolf	Mrs Dalloway	1925	~42	18
Charles Dickens	Bleak House	1853	~37	17
Edward Bulwer-Lytton	Paul Clifford	1830	~34	16
Marcel Proust	Swann's Way (Moncrieff trans.)	1913	~32	12
George Eliot	Middlemarch	1872	~31	11
Joseph Conrad	Lord Jim	1900	~27	8

Henry James leads the human references with a score of approximately 57 and a max dependency depth of 19. This is notable context for the leaderboard: most frontier models now exceed James on raw structural depth, producing coherent sentences with depths of 30+. None of the seven reference sentences exceeds depth 20, so on this measure the models are operating beyond anything in the reference set.

The Proust reference uses the Scott Moncrieff English translation, not the French original. Conrad's relatively low score reflects that Lord Jim, despite its famously complex narrative structure, achieves its complexity through layered narration across sentences rather than within single sentences.

6. Parser & Limitations

The Parser

HypotaxBench uses spaCy's en_core_web_sm — the small English model. This was chosen for portability: it installs cleanly on any platform without requiring C++ build tools or GPU support. A transformer-based model (en_core_web_trf) would produce more accurate parses, particularly for very long sentences, but introduces significant installation complexity and hardware requirements.

Depth Gate

Any parse with a max dependency depth exceeding 40 is automatically rejected as a likely misparse. The small spaCy model occasionally produces wildly incorrect parse trees for very long or structurally unusual sentences, yielding phantom depths of 50, 60, or even 80. The depth-40 gate filters these out. A legitimate sentence with genuine depth 40+ would be extraordinary — well beyond anything observed in human prose — and the false-positive risk from parser error is too high to accept such scores at face value.
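The gate itself is a simple threshold. In this sketch the record layout (`id`, `max_depth`) is hypothetical; only the limit of 40 and the rule that depths exceeding it are rejected come from the text.

```python
from typing import Dict, List, Union

DEPTH_GATE = 40  # parses deeper than this are treated as likely misparses


def passes_depth_gate(max_depth: int, limit: int = DEPTH_GATE) -> bool:
    """True when the parse is kept; a depth of exactly 40 still passes."""
    return max_depth <= limit


# Hypothetical parse records for illustration.
parses: List[Dict[str, Union[str, int]]] = [
    {"id": "a", "max_depth": 38},
    {"id": "b", "max_depth": 40},
    {"id": "c", "max_depth": 62},
]
kept = [p["id"] for p in parses if passes_depth_gate(int(p["max_depth"]))]
print(kept)  # ['a', 'b']
```

A fixed threshold is a blunt instrument: it discards a genuine depth-41 sentence along with phantom parses, which is the trade-off this section accepts.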

Subordination Capping

The subordination ratio is capped and entropy-weighted to prevent a degenerate strategy: producing an endless chain of relative clauses ("the man who saw the dog that chased the cat that caught the mouse that..."). Such a sentence would have a very high raw subordination ratio but would represent a single trick, not genuine syntactic variety. The entropy weighting ensures that a sentence using a mix of relative clauses, adverbial clauses, participial phrases, appositives, and clausal complements scores higher than one using only relatives.

Thinking Models

Models in the Qwen QwQ / Qwen 3 "thinking" series have unusually high parser failure rates. Their outputs frequently produce dependency depths of 50–80, which are gated out as misparses. It is unclear whether these models are producing genuinely unusual syntactic structures that confuse the parser, or whether their outputs contain subtle formatting artifacts. This is a known limitation.

What This Benchmark Does Not Measure

HypotaxBench measures structural capacity, not prose quality. A structurally deep sentence that is also beautiful prose scores the same as one that is merely grammatical. The benchmark cannot distinguish a sentence that reads like Henry James from one that reads like a legal contract — only the dependency structure matters. This is by design: prose quality is subjective, but parse depth is measurable.

7. Key Findings

Several patterns have emerged from the benchmark results:

Syntactic depth correlates with general capability. Flagship models average a score of approximately 85.6, while small/lightweight models average approximately 46.5. The ability to sustain deep grammatical structure appears to track overall model capability, not just training data volume.
Competitive prompting dramatically improves scores. The difference between a naive prompt and the competitive protocol is often enormous. Grok went from a naive-prompt depth of 19 to a competitive-prompt depth of 38. Most models show similar jumps. The models can produce deep structure — they just do not do so unprompted.
"That" is the universal subordinator. Across nearly every model, the word "that" is the single most common subordination marker. It serves as both a relative pronoun and a complementizer, which makes it the workhorse of English hypotaxis (its demonstrative use, by contrast, is not subordinating).
Most frontier models beat Henry James on structural depth. James's best sentence reaches depth 19. Most flagship models produce coherent sentences at depth 30+. The models are generating syntactic structures deeper than any of the benchmark's human reference sentences.

8. How to Add a Model

Adding a new model to HypotaxBench:

Fork the repository on GitHub.
Add the model to the MODELS list in the benchmark configuration.
Set your API key as an environment variable for the relevant provider.
Run the pipeline — this queries the model 5 times, parses each output, computes metrics, and generates results.
Submit a pull request with your results.
