HypotaxBench

Measuring whether LLMs can sustain deep hierarchical syntax in a single sentence.

The task: Write one grammatical sentence that is as syntactically deep as possible — nested subordinate clauses, relative clauses, participial phrases — not length through coordination ("and... and... and"). What we measure: dependency tree depth, subordination density, and embedding complexity, scored against human literary references from Henry James, Proust, and Bulwer-Lytton. Full methodology →


Leaderboard

All 34 models ranked by composite score.

Columns: Rank · Model · Score · Max Depth · Mean Depth · Subord Ratio · Dep Distance · Words

Methodology

- Max Depth (35%): deepest node in the dependency parse tree
- Mean Depth (20%): average depth across all tokens
- Subord Ratio (20%): subordinate clauses per main clause
- Dep Distance (15%): mean token distance between head and dependent
- Log Length (10%): log-scaled word count with diminishing returns
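Combining these components is a straightforward weighted sum. A minimal sketch, assuming each raw metric has already been normalized to [0, 1] (the actual normalization scheme is not specified in this summary):

```python
# Hypothetical sketch: the five component weights from the table above.
# Assumes each metric is pre-normalized to [0, 1]; the real pipeline's
# normalization is not shown here.
WEIGHTS = {
    "max_depth": 0.35,
    "mean_depth": 0.20,
    "subord_ratio": 0.20,
    "dep_distance": 0.15,
    "log_length": 0.10,
}

def composite_score(normalized: dict) -> float:
    """Weighted sum of normalized component metrics."""
    return sum(w * normalized[name] for name, w in WEIGHTS.items())
```

Because the weights sum to 1.0, a sentence that maxes out every component scores 1.0.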

A sentence must pass gating checks (single root, no run-on fragments or semicolon splices) to receive a score. Each model gets five independent runs; the best valid sentence is kept. All parsing uses spaCy's en_core_web_sm model. Full methodology and examples →
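The two depth components fall directly out of the parse. A minimal sketch of the idea, written against a plain head-index array (heads[i] is the index of token i's head, with the root pointing to itself) rather than a live spaCy pipeline, so the computation is self-contained:

```python
# Sketch of the depth metrics over a head-index array, the shape of
# information a spaCy dependency parse provides via token.head.

def token_depth(heads: list[int], i: int) -> int:
    """Number of head-links from token i up to the root."""
    depth = 0
    while heads[i] != i:  # the root is its own head
        i = heads[i]
        depth += 1
    return depth

def depth_metrics(heads: list[int]) -> tuple[int, float]:
    """(max depth, mean depth) across all tokens."""
    depths = [token_depth(heads, i) for i in range(len(heads))]
    return max(depths), sum(depths) / len(depths)

# "The cat that the dog chased slept ."
# 0-indexed heads: The->cat, cat->slept, that->chased, the->dog,
# dog->chased, chased->cat (relative clause), slept (root), .->slept
print(depth_metrics([1, 6, 5, 4, 5, 1, 6, 6]))  # → (4, 2.0)
```

The relative clause ("that the dog chased") is what pushes the maximum depth to 4: its innermost token sits four head-links below the root.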

Structural Patterns

Beyond the composite score, we catalog the types of subordination each model deploys and measure phrasal-level complexity. These patterns reveal how different models approach the problem of deep embedding.

Subordination Strategy

Marker Vocabulary

Key Structural Observations

Relative clauses dominate
Most models achieve depth primarily through relative clauses (who, which, that), the simplest embedding strategy. More sophisticated models like GPT-5.3 and GPT-5.2-chat diversify into participial and adjectival clauses.
'That' is the universal subordinator
It appears as the top marker for 30 of 34 models. The exceptions (Proust uses 'when', Claude 3.5 Haiku uses 'where') suggest more varied syntactic strategies.
Clause nesting tracks with score
Models with higher HypotaxBench scores tend to have deeper clause nesting (clauses embedded within clauses). Qwen 3.5-122B reaches nesting depth 24; Claude 3.5 Haiku only 2.
Branching factor is stable
Nearly all models produce parse trees with a branching factor of about 2.2, suggesting a shared structural prior in the sentence shapes transformers generate.
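Branching factor is cheap to compute from the same parse. A hypothetical sketch over a head-index array (heads[i] is token i's head index; the root points to itself), counting mean dependents per node that has any dependents:

```python
from collections import Counter

def branching_factor(heads: list[int]) -> float:
    """Mean number of dependents per non-leaf node in the parse tree."""
    children = Counter(h for i, h in enumerate(heads) if h != i)
    return sum(children.values()) / len(children)

# "The cat that the dog chased slept .": 7 dependency edges
# distributed over 4 distinct head nodes.
print(branching_factor([1, 6, 5, 4, 5, 1, 6, 6]))  # → 1.75
```

A flat coordinated sentence concentrates edges under one head (high branching); deep hypotaxis spreads them down a chain (branching near 1), so a stable ~2.2 across models is a meaningful structural signature.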