How HypotaxBench measures syntactic depth, and why it works the way it does.
Hypotaxis is subordination: the embedding of clauses within clauses to create hierarchical grammatical structure. A hypotactic sentence contains a main clause modified by dependent clauses — relative clauses, adverbial clauses, noun clauses, participial phrases — each of which can in turn contain its own subordinate clauses, producing nested layers of grammatical depth.
Parataxis is its opposite: coordination. Paratactic prose joins clauses with conjunctions like "and," "but," and "or," producing flat, additive structures. Both are legitimate stylistic choices, but they place very different demands on a writer — or a language model.
The rain fell and the wind blew and the lamps flickered.
The rain, which fell in torrents that were checked only by gusts of wind sweeping through streets where lamps struggled against the darkness, beat against the windows.
The paratactic version is three independent clauses strung together. The hypotactic version is a single main clause ("The rain ... beat against the windows") into which three levels of subordination have been embedded. The grammatical structure is a tree, not a list.
Why it matters for LLMs: Hypotaxis demands sustained grammatical control. The model must track subject-verb agreement across intervening clauses, maintain referential clarity as pronouns accumulate, and preserve the logical relationships between nested propositions — all while holding the incomplete main clause in working memory. Coordination, by contrast, resets the grammatical slate with each conjunction. A benchmark that rewards depth, not length, separates models that can sustain complex structure from those that merely concatenate simple ones.
HypotaxBench computes five metrics from the spaCy dependency parse of each output sentence. Each metric captures a different facet of syntactic complexity.
The longest path from the root verb to any leaf word in the dependency tree. This is the headline metric because it directly measures embedding depth and cannot be inflated by coordination.
Consider two sentences parsed into dependency trees:
Depth 3:                  Depth 6:

     sat (ROOT)               sat (ROOT)
     /    \                       |
   cat     on                    cat
    |      |                      |
   The    mat                  chased
           |                      |
          the                    dog
                                  |
                                 fed
                                  |
                                 boy
                                  |
                                 The
"The cat sat on the mat" has a max depth of 3. A deeply nested sentence like "The cat that the dog that the boy fed chased sat" pushes tokens further from the root, yielding a depth of 6 or more. Higher max depth means deeper subordination.
The average depth across all tokens in the dependency tree. This catches models that achieve one impressive spike of depth surrounded by shallow filler.
A sentence with max depth 10 but mean depth 2 has one deep tunnel embedded in otherwise flat prose. A sentence with max depth 10 and mean depth 8 sustains complexity throughout. Mean depth rewards the latter.
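Both depth metrics can be sketched without spaCy by representing a parse as a list of head indices — `heads[i]` is the index of token `i`'s head, with the root pointing to itself — as a simplified stand-in for spaCy's `Token.head`. The function names and representation below are illustrative, not the benchmark's actual code:

```python
def token_depth(i, heads):
    """Number of arcs from token i up to the root (heads[i] == i marks the root)."""
    d = 0
    while heads[i] != i:
        i = heads[i]
        d += 1
    return d

def max_depth(heads):
    return max(token_depth(i, heads) for i in range(len(heads)))

def mean_depth(heads):
    return sum(token_depth(i, heads) for i in range(len(heads))) / len(heads)

# "The cat sat on the mat":
# The->cat, cat->sat, sat=ROOT, on->sat, the->mat, mat->on
heads = [1, 2, 2, 2, 5, 3]
print(max_depth(heads))   # 3
print(mean_depth(heads))  # 1.5
```

The token depths here are [2, 1, 0, 1, 3, 2]: the deepest token is the second "the", three arcs from the root, while the mean (1.5) reflects how shallow the sentence is overall.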
The count of subordinating dependency labels divided by the count of coordinating labels. Subordinating labels include: acl, relcl, advcl, ccomp, xcomp, csubj, mark, and appos. Coordinating labels include: cc and conj.
A ratio of 1.0 means equal amounts of subordination and coordination. A ratio of 10.0 means ten subordinating relations for every coordinating one. This metric penalizes models that pad length with "and ... and ... and" constructions.
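Given the dependency labels of a parse, the ratio is a direct count. The document does not specify how a sentence with zero coordinating labels is handled, so the fallback below is an assumption:

```python
SUBORDINATING = {"acl", "relcl", "advcl", "ccomp", "xcomp", "csubj", "mark", "appos"}
COORDINATING = {"cc", "conj"}

def subordination_ratio(labels):
    sub = sum(1 for lab in labels if lab in SUBORDINATING)
    coord = sum(1 for lab in labels if lab in COORDINATING)
    # Assumed fallback: with no coordination, return the raw subordination count.
    return sub / coord if coord else float(sub)

# "The rain fell and the wind blew": one cc, one conj, no subordination -> 0.0
print(subordination_ratio(["det", "nsubj", "cc", "det", "nsubj", "conj"]))
```

A purely paratactic sentence scores 0.0 here, which is exactly the "and ... and ... and" pattern the metric is designed to penalize.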
The average linear distance, in token positions, between each word and its syntactic head in the surface order of the sentence. Higher values indicate more center-embedding and long-range dependencies.
"The cat sat" has a mean dependency distance of roughly 1 — each word is adjacent to its head. "The cat, which the dog that the boy fed chased, sat" has a much higher mean distance because "cat" and "sat" are separated by an entire nested clause, and "dog" is far from "chased." This metric rewards the kind of long-range grammatical control that characterizes genuinely complex prose.
The word count of the sentence, logarithmically scaled. This prevents raw length-padding from dominating the score while still giving some credit for longer outputs (which have more opportunity for structural complexity).
Doubling length from 200 to 400 words only modestly improves this metric. A model cannot game the benchmark by producing endless run-on prose — the structural metrics must justify the length.
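The document does not give the exact scaling, but a natural-log sketch shows why doubling length helps only modestly (the function name and the +1 offset are assumptions):

```python
import math

def log_length(sentence):
    # Assumed form: natural log of word count, offset to keep it defined at 0.
    return math.log(len(sentence.split()) + 1)

# Doubling raw length from 200 to 400 words raises the unnormalized metric
# by only about 13%:
print(math.log(401) / math.log(201))  # ~1.13
```

Under any logarithmic scaling the marginal credit for extra words shrinks as length grows, which is the anti-padding property the metric relies on.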
Before any metrics are computed, each output must pass three gating checks. If any gate fails, the output receives a score of 0.
These gates ensure that scores reflect genuine single-sentence complexity, not multi-sentence assemblages or degenerate outputs.
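The three gates are not enumerated in this document. A plausible sketch, assuming they check for non-empty output, a single sentence, and a minimum token count — all three checks and the threshold are assumptions, not the benchmark's actual gates:

```python
def passes_gates(output: str, min_tokens: int = 5) -> bool:
    text = output.strip()
    if not text:
        return False                      # degenerate: empty output
    # Naive single-sentence check: no sentence-final punctuation before the
    # end (the real benchmark presumably uses the parser's sentence splitter).
    if any(ch in ".!?" for ch in text[:-1]):
        return False
    if len(text.split()) < min_tokens:
        return False                      # degenerate: too short
    return True

print(passes_gates("The rain fell and the wind blew."))  # True
print(passes_gates("It rained. It poured."))             # False
```

Whatever the actual gates are, the key design point stands: a failed gate zeroes the score before any metric is computed, so multi-sentence assemblages cannot collect partial credit.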
Each of the five metrics is normalized to a 0–1 range by dividing by the global maximum observed across all model outputs and all human reference texts. This means a score of 1.0 on any metric represents the best performance seen from any source, human or model.
The final score is the weighted sum of the five normalized metrics, multiplied by 100:
score = 100 * (
0.35 * (max_depth / global_max_depth) +
0.20 * (mean_depth / global_max_mean_depth) +
0.20 * (sub_ratio / global_max_sub_ratio) +
0.15 * (mean_distance / global_max_mean_distance) +
0.10 * (log_length / global_max_log_length)
)
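The formula runs directly as Python once the per-metric global maxima have been collected; the dictionary keys and function name here are illustrative:

```python
WEIGHTS = {
    "max_depth": 0.35,
    "mean_depth": 0.20,
    "sub_ratio": 0.20,
    "mean_distance": 0.15,
    "log_length": 0.10,
}

def hypotax_score(metrics, global_max):
    # Weighted sum of max-normalized metrics, scaled to 0-100.
    return 100 * sum(w * metrics[k] / global_max[k] for k, w in WEIGHTS.items())

# An output that sets the global maximum on every metric scores 100
# (the metric values below are made up for illustration).
best = {"max_depth": 12, "mean_depth": 6.0, "sub_ratio": 9.0,
        "mean_distance": 4.5, "log_length": 5.3}
print(hypotax_score(best, best))  # ~100.0
```

Because the weights sum to 1.0, the score is bounded by 100, attained only by an output that is simultaneously the best seen on every metric.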
Human references are scored with the identical pipeline — they receive no special treatment and are subject to the same gating criteria and normalization. This makes comparisons between human and model outputs direct and fair.
For each model, best of 5 runs is reported. All runs use temperature 0.8 to capture variance in the model's generative capacity. The best-of-5 design means the benchmark measures a model's ceiling performance: what it can produce, not just what it typically produces.
Every model receives the following prompt verbatim, with no system message or additional context:
Write a single grammatical English sentence that is as long and as syntactically deep as possible while remaining coherent, readable, and stylistically controlled.
Requirements:
Output only the sentence itself, with no preamble, commentary, or explanation.
HypotaxBench uses spaCy's en_core_web_sm (small English model) for dependency parsing. The small model was chosen because it installs cleanly on any platform without requiring C++ build tools; a transformer-based model (e.g., en_core_web_trf) would produce more accurate parses but introduces significant installation complexity.
Known limitations:
- en_core_web_sm trades parse accuracy for easy installation, so depths and label counts can be noisy on exactly the long, deeply nested sentences the benchmark elicits.
- Normalization against the observed global maximum makes scores relative: a new best output on any metric rescales every other score.
- Best-of-5 reporting captures ceiling performance, not typical behavior.
Adding a new model to HypotaxBench takes a few steps:
1. Add the model to the MODELS list in the benchmark configuration.
2. Run python run_benchmark.py — this queries the model 5 times and saves the raw outputs.
3. Run python score.py — this parses each output, computes metrics, and generates the results files.