1. What is Hypotaxis?

Hypotaxis is subordination: the embedding of clauses within clauses to create hierarchical grammatical structure. A hypotactic sentence contains a main clause modified by dependent clauses — relative clauses, adverbial clauses, noun clauses, participial phrases — each of which can in turn contain its own subordinate clauses, producing nested layers of grammatical depth.

Parataxis is its opposite: coordination. Paratactic prose joins clauses with conjunctions like "and," "but," and "or," producing flat, additive structures. Both are legitimate stylistic choices, but they place very different demands on a writer — or a language model.

Paratactic

The rain fell and the wind blew and the lamps flickered.

Hypotactic

The rain, which fell in torrents that were checked only by gusts of wind sweeping through streets where lamps struggled against the darkness, beat against the windows.

The paratactic version is three independent clauses strung together. The hypotactic version is a single main clause ("The rain ... beat against the windows") into which three levels of subordination have been embedded. The grammatical structure is a tree, not a list.

Why it matters for LLMs: Hypotaxis demands sustained grammatical control. The model must track subject-verb agreement across intervening clauses, maintain referential clarity as pronouns accumulate, and preserve the logical relationships between nested propositions — all while holding the incomplete main clause in working memory. Coordination, by contrast, resets the grammatical slate with each conjunction. A benchmark that rewards depth, not length, separates models that can sustain complex structure from those that merely concatenate simple ones.

2. What We Measure

HypotaxBench computes five metrics from the spaCy dependency parse of each output sentence. Each metric captures a different facet of syntactic complexity.

a. Max Dependency Depth (35% weight)

The longest path from the root verb to any leaf word in the dependency tree. This is the headline metric because it directly measures embedding depth and cannot be inflated by coordination.

Consider two sentences parsed into dependency trees:

Depth 3:                      Depth 6:

      sat (ROOT)                  sat (ROOT)
      /  \                            |
   cat    on                         cat
    |      |                          |
   The    mat                      chased
           |                          |
          the                        dog
                                      |
                                     fed
                                      |
                                     boy
                                      |
                                     The

"The cat sat on the mat" has a max depth of 3. A deeply nested sentence like "The cat that the dog that the boy fed chased sat" pushes tokens further from the root, yielding a depth of 6 or more. Higher max depth means deeper subordination.
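
The depth computation can be sketched in a few lines of Python. This is an illustration, not the benchmark's actual code: it works on a plain head-index array (where `heads[i]` is the index of token i's syntactic head and the root points to itself), a representation that is easy to extract from a spaCy parse.

```python
# Illustrative sketch, not the benchmark's implementation. heads[i] is
# the index of token i's syntactic head; the root points to itself.
def max_depth(heads):
    def depth(i):
        d = 0
        while heads[i] != i:  # walk up the tree until we reach the root
            i = heads[i]
            d += 1
        return d
    return max(depth(i) for i in range(len(heads)))

# "The cat sat on the mat":
# The->cat, cat->sat, sat=ROOT, on->sat, the->mat, mat->on
print(max_depth([1, 2, 2, 2, 5, 3]))  # → 3
```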

b. Mean Dependency Depth (20% weight)

The average depth across all tokens in the dependency tree. This catches models that achieve one impressive spike of depth surrounded by shallow filler.

A sentence with max depth 10 but mean depth 2 has one deep tunnel embedded in otherwise flat prose. A sentence with max depth 10 and mean depth 8 sustains complexity throughout. Mean depth rewards the latter.
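
A minimal sketch of the computation, again on a head-index array (`heads[i]` is the index of token i's head; the root points to itself) rather than the benchmark's actual spaCy objects:

```python
# Illustrative sketch, not the benchmark's implementation. heads[i] is
# the index of token i's syntactic head; the root points to itself.
def mean_depth(heads):
    def depth(i):
        d = 0
        while heads[i] != i:  # walk up the tree until we reach the root
            i = heads[i]
            d += 1
        return d
    return sum(depth(i) for i in range(len(heads))) / len(heads)

# "The cat sat on the mat" -> per-token depths (2, 1, 0, 1, 3, 2)
print(mean_depth([1, 2, 2, 2, 5, 3]))  # → 1.5
```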

c. Subordination Ratio (20% weight)

The count of subordinating dependency labels divided by the count of coordinating labels. Subordinating labels include: acl, relcl, advcl, ccomp, xcomp, csubj, mark, and appos. Coordinating labels include: cc and conj.

A ratio of 1.0 means equal amounts of subordination and coordination. A ratio of 10.0 means ten subordinating relations for every coordinating one. This metric penalizes models that pad length with "and ... and ... and" constructions.
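
The ratio can be sketched directly from a sentence's dependency labels, using the label sets listed above. How the benchmark handles a sentence with zero coordinating labels is not specified here, so the fallback (returning the raw subordination count) is an assumption made for this illustration:

```python
# Label sets as listed in the metric description above.
SUBORDINATING = {"acl", "relcl", "advcl", "ccomp", "xcomp", "csubj", "mark", "appos"}
COORDINATING = {"cc", "conj"}

def subordination_ratio(dep_labels):
    sub = sum(1 for d in dep_labels if d in SUBORDINATING)
    coord = sum(1 for d in dep_labels if d in COORDINATING)
    # Assumption: with no coordination, return the raw subordination count.
    return sub / coord if coord else float(sub)

print(subordination_ratio(["relcl", "mark", "advcl", "cc", "conj"]))  # → 1.5
```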

d. Mean Dependency Distance (15% weight)

The average linear distance, in token positions, between each word and its syntactic head. Higher values indicate more center-embedding and long-range dependencies.

"The cat sat" has a mean dependency distance of roughly 1 — each word is adjacent to its head. "The cat, which the dog that the boy fed chased, sat" has a much higher mean distance because "cat" and "sat" are separated by an entire nested clause, and "dog" is far from "chased." This metric rewards the kind of long-range grammatical control that characterizes genuinely complex prose.
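
A sketch of the computation on a head-index array (`heads[i]` is the index of token i's head; the root points to itself and is excluded from the average, an assumption of this illustration):

```python
# Illustrative sketch, not the benchmark's implementation.
def mean_dependency_distance(heads):
    # Linear distance between each token and its head; skip the root.
    dists = [abs(i - h) for i, h in enumerate(heads) if h != i]
    return sum(dists) / len(dists)

# "The cat sat": The->cat, cat->sat, sat=ROOT
print(mean_dependency_distance([1, 2, 2]))  # → 1.0
```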

e. Log Length (10% weight)

The word count of the sentence, logarithmically scaled. This prevents raw length-padding from dominating the score while still giving some credit for longer outputs (which have more opportunity for structural complexity).

Doubling length from 200 to 400 words only modestly improves this metric. A model cannot game the benchmark by producing endless run-on prose — the structural metrics must justify the length.
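
A sketch of the metric; the exact scaling is an assumption here (natural log of the word count), and the benchmark may use a different base or offset:

```python
import math

# Illustrative sketch: natural log of the word count. The base and any
# offset used by the actual benchmark are assumptions of this example.
def log_length(sentence):
    return math.log(len(sentence.split()))
```

Under this scaling, doubling from 200 to 400 words adds only ln 2 ≈ 0.69 to a value that is already about 5.3, which is the modest improvement described above.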

3. Gating Criteria

Before any metrics are computed, each output must pass three gating checks. If any gate fails, the output receives a score of 0.

These gates ensure that scores reflect genuine single-sentence complexity, not multi-sentence assemblages or degenerate outputs.
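
The three gates are not enumerated in this section, but requirement 1 of the prompt (exactly one sentence, one sentence-final punctuation mark) implies that one of them is a single-sentence check. Purely as an illustration, such a gate might look like:

```python
import re

# Illustrative only — not one of the benchmark's actual gates. Checks for
# exactly one sentence-final punctuation mark, at the very end of the
# output. Abbreviations like "Mr." would need extra handling in practice.
def single_sentence_gate(text):
    text = text.strip()
    if not text:
        return False
    finals = re.findall(r"[.!?]", text)
    return len(finals) == 1 and text[-1] in ".!?"
```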

4. Scoring & Normalization

Each of the five metrics is normalized to a 0–1 range by dividing by the global maximum observed across all model outputs and all human reference texts. This means a score of 1.0 on any metric represents the best performance seen from any source, human or model.

The final score is the weighted sum of the five normalized metrics, multiplied by 100:

score = 100 * (
    0.35 * (max_depth / global_max_depth) +
    0.20 * (mean_depth / global_max_mean_depth) +
    0.20 * (sub_ratio / global_max_sub_ratio) +
    0.15 * (mean_distance / global_max_mean_distance) +
    0.10 * (log_length / global_max_log_length)
)
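
As a worked example of this formula, the following computes a score from invented numbers. Every metric value and global maximum below is hypothetical; only the weights come from the benchmark:

```python
# Worked example of the scoring formula. All metric values and global
# maxima are invented for illustration; only the weights are real.
weights = {"max_depth": 0.35, "mean_depth": 0.20, "sub_ratio": 0.20,
           "mean_distance": 0.15, "log_length": 0.10}
metrics = {"max_depth": 12, "mean_depth": 6.5, "sub_ratio": 8.0,
           "mean_distance": 4.2, "log_length": 5.0}      # hypothetical output
global_max = {"max_depth": 15, "mean_depth": 9.0, "sub_ratio": 12.0,
              "mean_distance": 6.0, "log_length": 6.2}   # hypothetical maxima

score = 100 * sum(weights[m] * metrics[m] / global_max[m] for m in weights)
print(round(score, 2))  # → 74.34
```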

Human references are scored with the identical pipeline — they receive no special treatment and are subject to the same gating criteria and normalization. This makes comparisons between human and model outputs direct and fair.

For each model, the best score across 5 runs is reported. All runs use temperature 0.8 to capture variance in the model's generative capacity. The best-of-5 design means the benchmark measures a model's ceiling performance: what it can produce, not just what it typically produces.

5. The Prompt

Every model receives the following prompt verbatim, with no system message or additional context:

Write a single grammatical English sentence that is as long and as syntactically deep as possible while remaining coherent, readable, and stylistically controlled.

Requirements:

  1. The output must be exactly one sentence, terminated by a single sentence-final punctuation mark.
  2. Maximize hierarchical embedding — nested subordinate clauses, relative clauses, appositives, participial phrases, parentheticals — rather than coordination. Coordinating conjunctions (and, or, but, nor, so, yet) may appear but must not be the primary mechanism of length.
  3. The sentence must be parseable as a single well-formed structure with one main clause.
  4. Maintain clear logical and narrative progression; a reader should be able to follow the thought from beginning to end.
  5. Aim for the register of a highly literate prose stylist (Henry James, Marcel Proust, George Eliot) — not pastiche, but controlled literary English.
  6. Use punctuation (commas, semicolons, dashes, parentheses) to support structure, not to disguise run-ons.

Output only the sentence itself, with no preamble, commentary, or explanation.

6. Parser & Limitations

HypotaxBench uses spaCy's en_core_web_sm (small English model) for dependency parsing. The small model was chosen because it installs cleanly on any platform without requiring C++ build tools; a transformer-based model (e.g., en_core_web_trf) would produce more accurate parses but introduces significant installation complexity.

Known limitations:

7. How to Add a Model

Adding a new model to HypotaxBench takes a few steps:

  1. Fork the repository on GitHub.
  2. Add the model to the MODELS list in the benchmark configuration.
  3. Set your API key as an environment variable for the relevant provider.
  4. Run the benchmark: python run_benchmark.py — this queries the model 5 times and saves the raw outputs.
  5. Run the scorer: python score.py — this parses each output, computes metrics, and generates the results files.
  6. Submit a pull request with your results.
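
Step 2 might look something like the following sketch. The field names here are hypothetical — check the repository's configuration file for the real `MODELS` schema:

```python
# Hypothetical sketch of step 2; the actual MODELS schema is defined in
# the repository's configuration and may differ from this shape.
MODELS = [
    # ...existing entries...
    {"name": "my-new-model", "provider": "my-provider"},
]
```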