HypotaxBench

We ask each model to write a single grammatical English sentence that is as syntactically deep as possible: clauses nested within clauses, not length gained through coordination. Models are told they are competing against other frontier models and are iteratively pushed to beat their own best output. Each sentence is scored by automated dependency parsing and gated by a coherence check from Claude Opus. The results are compared against literary sentences from Henry James, Virginia Woolf, and five other canonical prose stylists.

v1.0 · April 2026 · 27 models · 7 human references


Leaderboard

All 34 entries (27 models and 7 human references) ranked by composite score.

Rank · Model · Score · Max Depth · Mean Depth · Subord Ratio · Dep Distance · Words

Methodology

Max Depth (35%): deepest node in the dependency parse tree
Mean Depth (20%): average depth across all tokens
Subord Ratio (20%): subordinate clauses per main clause
Dep Distance (15%): mean token distance between head and dependent
Log Length (10%): log-scaled word count with diminishing returns
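
As a rough illustration, the five weights could combine into a single score as a weighted sum. This is a minimal sketch, not the benchmark's actual scoring code: the page does not say how each component is normalized, so the sketch assumes every component has already been scaled to the range 0 to 1. The names `WEIGHTS`, `composite_score`, and `log_length` are hypothetical.

```python
import math

# Weights as listed in the methodology; they sum to 1.0.
WEIGHTS = {
    "max_depth": 0.35,
    "mean_depth": 0.20,
    "subord_ratio": 0.20,
    "dep_distance": 0.15,
    "log_length": 0.10,
}


def composite_score(normalized: dict) -> float:
    """Weighted sum of components.

    Assumption: every value in `normalized` is already scaled to [0, 1].
    The real normalization is not described on this page.
    """
    missing = set(WEIGHTS) - set(normalized)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS)


def log_length(word_count: int) -> float:
    """Log-scaled word count, so additional words give diminishing returns."""
    return math.log(1 + word_count)
```

Under this assumption, a sentence at the top of every scale scores 1.0, and the log term keeps a very long sentence from outscoring a deeper one on length alone.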

A sentence must pass gating (a single root, no run-on fragments, no semicolon splices) to receive a score. Each model gets five independent runs, and the best valid sentence is kept. All parsing uses spaCy en_core_web_sm.
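
The depth and distance metrics can be sketched from nothing more than each token's head index, which is the form spaCy exposes as `token.head.i` (the root token is its own head). The code below is an illustrative sketch under stated assumptions, not the benchmark's implementation: the helper names are hypothetical, the gate rejects any semicolon at all (stricter than "semicolon splices"), and run-on fragment detection is not attempted.

```python
def token_depths(heads):
    """Depth of every token; heads[i] is the head index of token i.

    The root token points to itself, as in spaCy.
    """
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i
        while heads[node] != node:
            node = heads[node]
            depth += 1
            if depth > len(heads):
                raise ValueError("cycle in head indices")
        depths.append(depth)
    return depths


def has_single_root(heads):
    """True when exactly one token is its own head."""
    return sum(1 for i, h in enumerate(heads) if h == i) == 1


def passes_gate(text, heads):
    """Simplified gate: one root and no semicolons (an assumption)."""
    return has_single_root(heads) and ";" not in text


def mean_dep_distance(heads):
    """Mean absolute token distance between each dependent and its head."""
    dists = [abs(i - h) for i, h in enumerate(heads) if h != i]
    return sum(dists) / len(dists) if dists else 0.0


def heads_from_spacy(text):
    """Optional helper; requires spaCy and the en_core_web_sm model."""
    import spacy

    nlp = spacy.load("en_core_web_sm")
    return [tok.head.i for tok in nlp(text)]


# Hand-built parse of "The dog that I saw barked" (heads chosen by hand).
example_heads = [1, 5, 4, 4, 1, 5]
example_depths = token_depths(example_heads)  # [2, 1, 3, 3, 2, 0]
```

Max Depth is then `max(example_depths)` and Mean Depth is the average of the same list. Keeping the metric functions independent of spaCy makes them easy to check against hand-built trees.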

Structural Patterns

Beyond the composite score, we catalog the types of subordination each model deploys and measure phrasal-level complexity. These patterns reveal how different models approach the problem of deep embedding.
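
One way such a catalog could be built is by counting clausal dependency labels and subordinating markers in the parse. The sketch below is an assumption about the approach, not the benchmark's code: the label set uses spaCy's English dependency labels, the choice of which labels count as subordinate clauses is ours, and the function names are hypothetical. Tokens are passed as plain `(text, dep, tag)` tuples so the example runs without a parser.

```python
from collections import Counter

# spaCy English dependency labels treated here as subordinate clauses.
# Which labels the benchmark actually counts is an assumption.
CLAUSE_LABELS = {
    "relcl", "advcl", "ccomp", "xcomp", "acl", "csubj", "csubjpass", "pcomp",
}
# Penn Treebank tags for wh-words and relativizers (that, which, who, when, where).
WH_TAGS = {"WDT", "WP", "WP$", "WRB"}


def clause_type_counts(tokens):
    """Count subordinate clauses by dependency label."""
    return Counter(dep for _, dep, _ in tokens if dep in CLAUSE_LABELS)


def subord_ratio(tokens):
    """Subordinate clauses per main clause, taking each ROOT as a main clause."""
    main = sum(1 for _, dep, _ in tokens if dep == "ROOT")
    return sum(clause_type_counts(tokens).values()) / max(1, main)


def top_marker(tokens):
    """Most frequent subordinating marker, lowercased, or None if there is none."""
    counts = Counter(
        text.lower()
        for text, dep, tag in tokens
        if dep == "mark" or tag in WH_TAGS
    )
    return counts.most_common(1)[0][0] if counts else None


# Hand-annotated tokens for "The dog that I saw barked".
example = [
    ("The", "det", "DT"),
    ("dog", "nsubj", "NN"),
    ("that", "dobj", "WDT"),
    ("I", "nsubj", "PRP"),
    ("saw", "relcl", "VBD"),
    ("barked", "ROOT", "VBD"),
]
```

With a real parse, the same tuples could be produced from `(tok.text, tok.dep_, tok.tag_)` for each spaCy token.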

Subordination Strategy

Marker Vocabulary

Key Structural Observations

Relative clauses dominate
Most models achieve depth primarily through relative clauses (who, which, that), the simplest embedding strategy. A few, such as GPT-5.3 and GPT-5.2-chat, diversify into participial and adjectival clauses.
'That' is the universal subordinator
'That' is the top marker for 30 of the 34 entries. The exceptions, which include Proust ('when') and Claude 3.5 Haiku ('where'), suggest more varied syntactic strategies.
Clause nesting tracks with score
Models with higher HypotaxBench scores tend to have deeper clause nesting (clauses embedded within clauses). Qwen 3.5-122B reaches a nesting depth of 24; Claude 3.5 Haiku reaches only 2.
Branching factor is stable
Nearly all models produce sentences whose parse trees have a branching factor of about 2.2, suggesting a shared structural prior in how transformers construct sentences.
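
Clause nesting depth and branching factor can both be read off a dependency tree. The sketch below shows one plausible definition of each, since the page does not give exact formulas: nesting depth is taken as the largest number of clause-labelled tokens on any path from a token up to the root, and branching factor as the mean number of children per token that has at least one child. The function names and the clause label set are assumptions.

```python
from collections import Counter

# spaCy English labels treated as clause boundaries (an assumption).
CLAUSE_LABELS = {
    "relcl", "advcl", "ccomp", "xcomp", "acl", "csubj", "csubjpass", "pcomp",
}


def clause_nesting_depth(heads, deps):
    """Largest number of clause-labelled tokens on any token-to-root path.

    heads[i] is the head index of token i (the root points to itself);
    deps[i] is its dependency label.
    """
    best = 0
    for start in range(len(heads)):
        node, nested = start, 0
        # A path visits at most len(heads) tokens; the bound guards
        # against malformed input that contains a cycle.
        for _ in range(len(heads)):
            if deps[node] in CLAUSE_LABELS:
                nested += 1
            if heads[node] == node:
                break
            node = heads[node]
        best = max(best, nested)
    return best


def branching_factor(heads):
    """Mean number of children over tokens that have at least one child."""
    child_counts = Counter(h for i, h in enumerate(heads) if h != i)
    if not child_counts:
        return 0.0
    return sum(child_counts.values()) / len(child_counts)


# Hand-built parse of "The dog that I saw barked".
heads = [1, 5, 4, 4, 1, 5]
deps = ["det", "nsubj", "dobj", "nsubj", "relcl", "ROOT"]
```

For this small tree the nesting depth is 1 (a single relative clause), and three tokens have children (two, two, and one), giving a branching factor of 5/3.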