Measuring whether LLMs can sustain deep hierarchical syntax in a single sentence.
The task: Write one grammatical sentence that is as syntactically deep as possible — nested subordinate clauses, relative clauses, participial phrases — not length through coordination ("and... and... and"). What we measure: dependency tree depth, subordination density, and embedding complexity, scored against human literary references from Henry James, Proust, and Bulwer-Lytton. Full methodology →
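Dependency tree depth, the core metric, can be sketched in a few lines. This is a simplified stand-in for the spaCy-based pipeline: it works on a bare head-index array (each token stores the index of its head, and the root points to itself), and the example sentence is illustrative only.

```python
def token_depth(heads, i):
    """Number of head links from token i up to the root (the root's head is itself)."""
    d = 0
    while heads[i] != i:
        i = heads[i]
        d += 1
    return d

def max_depth(heads):
    """Maximum dependency tree depth over all tokens in the sentence."""
    return max(token_depth(heads, i) for i in range(len(heads)))

# "the cat that slept purred": heads[i] is the index of token i's head
heads = [1, 4, 3, 1, 4]
print(max_depth(heads))  # "that" -> "slept" -> "cat" -> "purred": depth 3
```

Each embedded clause adds at least one link to the chain, so deeply nested sentences push this maximum up while flat coordination does not.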
All 34 models ranked by composite score. Click column headers to sort.
| Rank | Model | Score ▼ | Max Depth | Mean Depth | Subord Ratio | Dep Distance | Words |
|---|---|---|---|---|---|---|---|
A sentence must pass gating (single root, no run-on fragments or semicolon splices) to receive a score. Each model gets five independent runs; the best valid sentence is kept. All parsing uses spaCy's `en_core_web_sm` model. Full methodology and examples →
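The gating step can be sketched with the same head-index representation used above; the real checks run on spaCy parses, and the function name and examples here are illustrative assumptions, not the actual implementation.

```python
def passes_gate(text, heads):
    """Gate a candidate sentence: exactly one root, and no semicolon splices.
    heads[i] is the head index of token i; a root points to itself."""
    roots = sum(1 for i, h in enumerate(heads) if h == i)
    if roots != 1:   # zero roots = fragment; multiple roots = run-on
        return False
    if ";" in text:  # semicolon splices are disqualified outright
        return False
    return True

print(passes_gate("The cat that slept purred", [1, 4, 3, 1, 4]))  # True
print(passes_gate("It purred; it slept", [1, 1, 4, 4, 4]))        # False: splice, two roots
```

Only sentences that clear this gate are scored, which keeps models from inflating depth metrics with stitched-together fragments.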
Beyond the composite score, we catalog the types of subordination each model deploys and measure phrasal-level complexity. These patterns reveal how different models approach the problem of deep embedding.
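A catalog like this can be built by counting clause-embedding dependency labels. The label set below follows Universal Dependencies conventions; treating it as the exact set used by the benchmark is an assumption.

```python
from collections import Counter

# Clause-embedding dependency labels (Universal Dependencies style)
SUBORDINATING = {"advcl", "ccomp", "xcomp", "acl", "acl:relcl", "relcl", "csubj"}

def subordination_profile(dep_labels):
    """Count each subordination type and return the overall subordination ratio."""
    counts = Counter(d for d in dep_labels if d in SUBORDINATING)
    ratio = sum(counts.values()) / len(dep_labels)
    return counts, ratio

# Labels for "the cat that slept purred": one relative clause attached to "cat"
counts, ratio = subordination_profile(["det", "nsubj", "nsubj", "acl:relcl", "ROOT"])
print(counts, round(ratio, 2))  # Counter({'acl:relcl': 1}) 0.2
```

The resulting counts distinguish, say, a model that stacks relative clauses (`acl:relcl`) from one that chains adverbial clauses (`advcl`), even when their composite scores are similar.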