HypotaxBench

Measuring whether LLMs can sustain deep hierarchical syntax in a single sentence.

The task: Write one grammatical sentence that is as syntactically deep as possible — nested subordinate clauses, relative clauses, participial phrases — not length through coordination ("and... and... and"). What we measure: dependency tree depth, subordination density, and embedding complexity, scored against human literary references from Henry James, Proust, and Bulwer-Lytton. Full methodology →


Leaderboard

All 34 models ranked by composite score.

Columns: Rank · Model · Score · Max Depth · Mean Depth · Subord Ratio · Dep Distance · Words

Methodology

- Max Depth (35%): deepest node in the dependency parse tree
- Mean Depth (20%): average depth across all tokens
- Subord Ratio (20%): subordinate clauses per main clause
- Dep Distance (15%): mean token distance between head and dependent
- Log Length (10%): log-scaled word count with diminishing returns
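Combining these components is a straightforward weighted sum. A minimal sketch, assuming each raw metric has already been normalized to [0, 1] (the actual normalization scheme is not specified in this summary):

```python
# Hypothetical sketch: the five component weights from the table above.
# Assumes each metric is pre-normalized to [0, 1]; the real pipeline's
# normalization is not shown here.
WEIGHTS = {
    "max_depth": 0.35,
    "mean_depth": 0.20,
    "subord_ratio": 0.20,
    "dep_distance": 0.15,
    "log_length": 0.10,
}

def composite_score(normalized: dict) -> float:
    """Weighted sum of normalized component metrics."""
    return sum(w * normalized[name] for name, w in WEIGHTS.items())
```

Because the weights sum to 1.0, a sentence that maxes out every component scores 1.0.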

A sentence must pass gating checks (single root, no run-on fragments or semicolon splices) to receive a score. Each model gets five independent runs; the best valid sentence is kept. All parsing uses spaCy's en_core_web_sm model. Full methodology and examples →
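The two depth components fall directly out of the parse. A minimal sketch of the idea, written against a plain head-index array (heads[i] is the index of token i's head, with the root pointing to itself) rather than a live spaCy pipeline, so the computation is self-contained:

```python
# Sketch of the depth metrics over a head-index array, the shape of
# information a spaCy dependency parse provides via token.head.

def token_depth(heads: list[int], i: int) -> int:
    """Number of head-links from token i up to the root."""
    depth = 0
    while heads[i] != i:  # the root is its own head
        i = heads[i]
        depth += 1
    return depth

def depth_metrics(heads: list[int]) -> tuple[int, float]:
    """(max depth, mean depth) across all tokens."""
    depths = [token_depth(heads, i) for i in range(len(heads))]
    return max(depths), sum(depths) / len(depths)

# "The cat that the dog chased slept ."
# 0-indexed heads: The->cat, cat->slept, that->chased, the->dog,
# dog->chased, chased->cat (relative clause), slept (root), .->slept
print(depth_metrics([1, 6, 5, 4, 5, 1, 6, 6]))  # → (4, 2.0)
```

The relative clause ("that the dog chased") is what pushes the maximum depth to 4: its innermost token sits four head-links below the root.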

Structural Patterns

Beyond the composite score, we catalog the types of subordination each model deploys and measure phrasal-level complexity. These patterns reveal how different models approach the problem of deep embedding.

Subordination Strategy

Marker Vocabulary

Key Structural Observations

Relative clauses dominate
Most models achieve depth primarily through relative clauses (who, which, that), the simplest embedding strategy. More sophisticated models like GPT-5.3 and GPT-5.2-chat diversify into participial and adjectival clauses.
'That' is the universal subordinator
It appears as the top marker for 30 of 34 models. The exceptions (Proust uses 'when', Claude 3.5 Haiku uses 'where') suggest more varied syntactic strategies.
Clause nesting tracks with score
Models with higher HypotaxBench scores tend to have deeper clause nesting (clauses embedded within clauses). Qwen 3.5-122B reaches nesting depth 24; Claude 3.5 Haiku only 2.
Branching factor is stable
Nearly all models produce parse trees with a branching factor of about 2.2, suggesting a shared structural prior in the sentence shapes transformers generate.
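Branching factor is cheap to compute from the same parse. A hypothetical sketch over a head-index array (heads[i] is token i's head index; the root points to itself), counting mean dependents per node that has any dependents:

```python
from collections import Counter

def branching_factor(heads: list[int]) -> float:
    """Mean number of dependents per non-leaf node in the parse tree."""
    children = Counter(h for i, h in enumerate(heads) if h != i)
    return sum(children.values()) / len(children)

# "The cat that the dog chased slept .": 7 dependency edges
# distributed over 4 distinct head nodes.
print(branching_factor([1, 6, 5, 4, 5, 1, 6, 6]))  # → 1.75
```

A flat coordinated sentence concentrates edges under one head (high branching); deep hypotaxis spreads them down a chain (branching near 1), so a stable ~2.2 across models is a meaningful structural signature.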