
FedLock

FedLock is an experiment from Joe Weisenthal, examining whether large language models can reliably measure textual sentiment at scale. Fed speeches are the test corpus. Results reflect LLM judgment, not definitive characterizations of any official’s views. See Methodology for details.

This methodology was written by Claude (Anthropic) with editing and direction from Joe Weisenthal.

What This Measures

Each point represents a single Federal Reserve speech scored on a hawkish–dovish spectrum. A score of 50 is neutral. Higher scores indicate a preference for tighter monetary policy; lower scores indicate a preference for accommodation.

The scores are not based on keyword counts, dictionary methods, or pre-trained classifiers. They come from a pairwise tournament: Gemini 2.0 Flash reads two anonymized speeches side by side and decides which takes the more hawkish position. After 67,000 of these head-to-head comparisons across 4,479 policy speeches, a TrueSkill rating engine produces a continuous score for each speech. That works out to roughly 30 comparisons per speech on average, well past the point where ratings stabilize. By the end of the tournament, every speech’s rating uncertainty (σ) had fallen below 2.0 on a 100-point scale, meaning additional comparisons would not meaningfully change the rankings.

How the Comparisons Work

Each comparison presents two speeches with speaker identities stripped out. Names, titles, and structural labels (e.g., “CHAIR POWELL.”) are replaced with generic tokens so the judge cannot identify who is speaking. This matters because the model has been trained on vast amounts of text about the Fed, and it “knows” who the hawks and doves are. Without anonymization, it could let that prior knowledge influence its judgments rather than reading the text on its merits. (We tested this directly. See Name Contamination below.)

Alongside each excerpt, the judge sees macroeconomic conditions at the time of the speech: Core PCE inflation, unemployment rate, GDP growth, and the VIX. This macro-relative calibration is the key design choice. A speaker urging caution on rate cuts when inflation is at 2.1% is meaningfully hawkish; the same language when inflation is at 8.5% is just stating the obvious.

How a pairwise comparison is presented to the judge

Each comparison presents two redacted excerpts with macroeconomic context. The judge decides which speaker takes the more hawkish stance relative to conditions at the time.
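As an illustrative sketch, a single comparison prompt might be assembled like this. The field names, wording, and structure below are hypothetical, not the actual FedLock prompt:

```python
# Hypothetical sketch of how one pairwise prompt could be built.
# The macro dict keys and prompt wording are illustrative assumptions.
def build_prompt(excerpt_a: str, excerpt_b: str, macro: dict) -> str:
    context = (
        f"Core PCE inflation: {macro['core_pce']}%\n"
        f"Unemployment rate: {macro['unemployment']}%\n"
        f"GDP growth: {macro['gdp_growth']}%\n"
        f"VIX: {macro['vix']}"
    )
    return (
        "Two anonymized Federal Reserve speeches, delivered under these "
        "macroeconomic conditions:\n\n" + context + "\n\n"
        "SPEECH A:\n" + excerpt_a + "\n\n"
        "SPEECH B:\n" + excerpt_b + "\n\n"
        "Relative to the conditions above, which speaker takes the more "
        "hawkish stance? Answer with exactly 'A' or 'B'."
    )
```

The key point is that the macro context travels with every pair, so the judge’s decision is always conditioned on the economy the speaker was facing.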

Why Gemini Flash?

The judge model is Gemini 2.0 Flash, accessed via the OpenRouter API. The entire tournament cost approximately $33. Using a fast, inexpensive model demonstrates that the LLM-as-judge technique is accessible and reproducible. You don’t need expensive frontier models to produce meaningful sentiment rankings. Parallel tournaments using Claude Sonnet and DeepSeek V3 produced closely aligned results, confirming that the signal comes from the method, not the model.

Why Pairwise?

Why not just ask the model to score each speech directly on a 1–100 scale? We tried that too. The same model scored the same speeches both ways. The difference is striking:

Pairwise tournament vs direct scoring distribution

Both distributions use Gemini 2.0 Flash on the same FOMC policy corpus.

When asked to score speeches directly, the model collapses into two camps (“dovish” or “hawkish”) with heavy anchoring on multiples of 5 and 10. There is almost no middle ground. The pairwise tournament, by contrast, produces a smooth, continuous distribution with far more discriminating power. Two speeches that a direct scorer would both call “65” can be separated by meaningful differences when compared head to head.

This is a well-documented limitation of LLM evaluation. When asked to assign absolute scores, models struggle to maintain a consistent internal scale across thousands of items. They anchor on round numbers, default to a few modal values, and lose discriminating power. A growing body of research confirms that pairwise comparison produces more reliable, human-aligned judgments.

Forced comparison sidesteps the calibration problem entirely. The model doesn’t need to assign a number. It just needs to decide which of two speeches is more hawkish. That’s a much easier judgment call, and TrueSkill aggregates those binary decisions into a continuous rating.

The Rating System

The tournament uses Microsoft’s TrueSkill algorithm. Each speech starts at μ=50 with high uncertainty (σ=8.33). As it wins or loses comparisons, its rating updates. Pairs are selected through Swiss-style matching, uncertainty-targeted sampling, and random draws for efficient convergence. At 67,000 comparisons (averaging 30 per speech), all speeches converged to σ<2.0.
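As a sketch of the update rule, here is a minimal, self-contained version of the TrueSkill 1v1 win/loss update (no draws, no dynamics factor; β is assumed to be σ₀/2, the conventional default). Production code would use Microsoft’s full algorithm or the `trueskill` Python package:

```python
import math

MU0, SIGMA0 = 50.0, 8.33      # priors from the methodology above
BETA = SIGMA0 / 2             # assumed performance noise (conventional default)

def phi(x):                   # standard normal pdf
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):                   # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def rate_1vs1(winner, loser):
    """One TrueSkill win/loss update for two (mu, sigma) ratings."""
    mu_w, s_w = winner
    mu_l, s_l = loser
    c = math.sqrt(2 * BETA**2 + s_w**2 + s_l**2)
    t = (mu_w - mu_l) / c
    v = phi(t) / Phi(t)       # mean-shift factor
    w = v * (v + t)           # variance-shrink factor
    new_winner = (mu_w + s_w**2 / c * v,
                  s_w * math.sqrt(max(1 - s_w**2 / c**2 * w, 1e-9)))
    new_loser = (mu_l - s_l**2 / c * v,
                 s_l * math.sqrt(max(1 - s_l**2 / c**2 * w, 1e-9)))
    return new_winner, new_loser

# Two fresh speeches; the judge calls speech A more hawkish.
a, b = rate_1vs1((MU0, SIGMA0), (MU0, SIGMA0))
```

After one decision the winner’s μ rises, the loser’s falls, and both σ values shrink; repeated over dozens of comparisons per speech, the ratings converge.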

Corpus

The corpus contains 4,479 policy-relevant speeches by 69 FOMC members from 1995 to 2026. Sources include federalreserve.gov, FRASER, FOMC press conference transcripts, and regional Federal Reserve bank websites. Non-policy content (commencement addresses, fintech talks, bank supervision, community development) is filtered out via a two-stage classifier: an initial LLM pass identifies policy-relevant speeches, followed by a stricter re-review that removes speeches whose primary purpose is not communicating a monetary policy view.

Era Adjustment

Raw scores conflate dispositional hawkishness with the era a speaker served in. The Rankings and Speaker Detail tabs use era-adjusted scores: for each speech, adjusted = raw − quarterly mean + 50. This re-centers 50 to “neutral relative to contemporaries.” The Timeline tab retains raw scores to preserve the visible dynamics of hawkish and dovish eras.
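The adjustment is one line of arithmetic per speech. A minimal sketch:

```python
from collections import defaultdict
from statistics import mean

def era_adjust(speeches):
    """speeches: list of (quarter, raw_score) pairs.
    Returns scores re-centered so 50 means 'neutral relative
    to contemporaries' in that quarter."""
    by_quarter = defaultdict(list)
    for quarter, raw in speeches:
        by_quarter[quarter].append(raw)
    q_mean = {q: mean(scores) for q, scores in by_quarter.items()}
    # adjusted = raw - quarterly mean + 50
    return [raw - q_mean[q] + 50 for q, raw in speeches]
```

A speech at 70 in a quarter averaging 65 adjusts to 55; a 60 in the same quarter adjusts to 45.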

The Timeline

The white trend line on the Timeline tab is a Gaussian-weighted moving average (σ=45 days, sampled weekly). At each point in time, nearby speeches contribute more to the average and distant speeches contribute less, with a smooth falloff. This captures the overall mood of Fed communication at any given moment without being dominated by individual outlier speeches.
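A minimal sketch of that smoother, with dates reduced to day counts for simplicity:

```python
import math

def gaussian_trend(speech_days, scores, sample_day, sigma=45.0):
    """Gaussian-weighted average of speech scores around sample_day.
    speech_days: days-since-epoch for each speech; sigma is in days."""
    weights = [math.exp(-((d - sample_day) ** 2) / (2 * sigma ** 2))
               for d in speech_days]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total
```

Evaluating this at weekly intervals across the corpus traces out the white trend line: a speech 200 days away contributes almost nothing, while a speech from the same week carries nearly full weight.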

Sanity Checks

There is no ground truth for hawkishness, but there are plenty of reasons to think these scores are well-calibrated.

Speaker rankings match conventional wisdom. Thomas Hoenig, famous for his eight consecutive dissents against zero rates in 2010, ranks as the most hawkish speaker in the dataset. Charles Plosser, Richard Fisher, and Jeffrey Lacker round out the top five. On the dovish side, Lael Brainard, Sarah Bloom Raskin, and Charles Evans all land in the bottom quartile. None of this was programmed in. These rankings emerged purely from pairwise text comparisons.

The timeline tracks macro cycles. The smoothed trend line peaks in Q3 2022, exactly when the Fed was at its most aggressive on inflation. It troughs in Q2 2020, during the COVID emergency response. It drops sharply during the financial crisis (2008–2009) and rises during the Greenspan tightening of 2004–2006. These are not coincidences.

Individual speeches are recognizable. Powell’s August 2022 Jackson Hole address, widely viewed as a defining moment in the inflation fight, scores 78 out of 100. Hoenig’s 2010 speeches all score well above average despite coming during ZIRP, when the quarterly mean was deeply dovish. Meanwhile, COVID-era emergency speeches from Williams, Powell, and Bullard score in the mid-20s.

The hawkishness line tracks the Taylor Rule. The classic Taylor Rule prescribes an interest rate based on inflation and the unemployment gap. It has nothing to do with speech text. Yet when overlaid against the smoothed hawkishness trend, the two lines move in close alignment across three decades:

Hawkishness trend vs Taylor Rule

White: smoothed hawkishness score (left axis). Pink: Taylor Rule rate (right axis). The Taylor Rule is computed from Core PCE, unemployment, and CBO NAIRU estimates.

The major swings match: the dot-com tightening, the post-GFC collapse, the long ZIRP era, and the sharp 2022 inflation spike. The implication is that Fed officials talk roughly the way a simple policy rule would suggest they should, and the model is picking that up from text alone.
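For reference, here is the classic Taylor (1993) rule with the output gap proxied by the unemployment gap via Okun’s law. The coefficients below are textbook defaults, not necessarily the ones used for the chart on this site:

```python
def taylor_rule(core_pce, unemployment, nairu,
                r_star=2.0, pi_star=2.0, okun=2.0):
    """Classic Taylor rule, unemployment-gap variant.
    r_star, pi_star, and the Okun coefficient are illustrative
    textbook defaults (assumptions, not the site's exact inputs)."""
    inflation_gap = core_pce - pi_star
    output_gap = okun * (nairu - unemployment)  # Okun's-law proxy
    return r_star + core_pce + 0.5 * inflation_gap + 0.5 * output_gap
```

At 2% inflation with unemployment at NAIRU, the rule prescribes a 4% rate; every extra point of inflation above target raises the prescription by 1.5 points.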

Name Contamination

A natural concern with using an LLM as judge: does the model already “know” who these speakers are? There are two distinct risks here:

  1. Direct identification. The model sees a name or title in the text and recognizes the speaker. “Chair Powell” or “CHAIRMAN GREENSPAN.” appears in the excerpt, and the model’s training data tells it how hawkish that person is.
  2. Stylistic fingerprinting. Even with names stripped, the model might recognize a speaker from their vocabulary, sentence structure, or rhetorical habits. Hoenig hammers “moral hazard”; Evans talks about “optimal control.” These patterns could be enough.

What we do about it

The first risk is addressed by deep redaction. Before any speech enters the tournament, a five-layer anonymization pipeline processes the text using a roster of all 77 known FOMC members and their aliases:

  1. Structural speaker labels are replaced: “CHAIR POWELL.” → “SPEAKER:”
  2. Press conference reporter names are stripped: “NICK TIMIRAOS.” → “REPORTER:”
  3. Title-name combinations in running text are removed: “Governor Waller”, “President Williams” → [OFFICIAL]
  4. Full name mentions (first + last, aliases) are replaced: “Janet Yellen” → [OFFICIAL]
  5. Last-name-only mentions are caught: “as Bernanke argued” → “as [OFFICIAL] argued”

An exclusion list guards against false positives: “Phillips curve,” “Taylor rule,” and “Jackson Hole” are preserved. The result is that the judge model sees no names, no titles, and no identifying structural labels.
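A simplified sketch of layers 3 through 5, including the exclusion-list shielding. The roster, titles, and aliases here are a tiny illustrative subset; the real pipeline covers all 77 members and their aliases:

```python
import re

# Illustrative subset of the roster (full name -> bare-surname aliases).
ROSTER = {"Jerome Powell": ["Powell"],
          "Ben Bernanke": ["Bernanke"],
          "Janet Yellen": ["Yellen"]}
TITLES = r"(?:Chair(?:man|woman)?|Governor|President|Vice Chair)"
EXCLUSIONS = ("Phillips curve", "Taylor rule", "Jackson Hole")

def redact(text: str) -> str:
    # Shield exclusion-list phrases so "Taylor rule" survives intact.
    protected = {}
    for i, phrase in enumerate(EXCLUSIONS):
        token = f"\x00{i}\x00"
        protected[token] = phrase
        text = text.replace(phrase, token)
    for full, aliases in ROSTER.items():
        first, last = full.split()[0], full.split()[-1]
        # Layer 3: title + surname ("Governor Waller") -> [OFFICIAL]
        text = re.sub(rf"\b{TITLES}\s+{last}\b", "[OFFICIAL]", text)
        # Layer 4: full name ("Janet Yellen") -> [OFFICIAL]
        text = re.sub(rf"\b{first}\s+{last}\b", "[OFFICIAL]", text)
        # Layer 5: bare surnames ("as Bernanke argued") -> [OFFICIAL]
        for alias in aliases:
            text = re.sub(rf"\b{alias}\b", "[OFFICIAL]", text)
    # Restore the shielded phrases.
    for token, phrase in protected.items():
        text = text.replace(token, phrase)
    return text
```

Shielding the exclusions before the name passes is what lets “Taylor rule” pass through untouched even when a surname pattern would otherwise match.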

The second risk, stylistic fingerprinting, is harder to eliminate. We cannot strip a speaker’s rhetorical habits without destroying the signal we’re trying to measure. But we can test whether it matters.

How we tested it

We ran a series of controlled experiments to measure exactly how much the model’s prior knowledge affects scores. Each experiment ran a mini-tournament (8,000 pairwise comparisons) on the same corpus with the same model, but with names deliberately manipulated in different ways. The question: if the model already “knows” a speech is Lacker’s, does telling it that, or telling it the opposite, change the score?

| Arm | Design | What it tests |
| --- | --- | --- |
| A | Baseline: fully anonymized, no names | Reference scores (the ones shown on this site) |
| B | True names shown with each speech | Does providing the real name shift scores? |
| C | Swapped names: hawk speeches labeled with a dove’s name and vice versa | Does the model follow the name or the text? |
| E | All real names except Lacker, who is labeled “Whitfield” (a fictional name) | Does a single real name carry measurable signal? |

What we found

The model does have priors. In Arm B (true names), hawks shift up +1–2 μ and doves shift down −2–5 μ compared to the anonymous baseline. In Arm C (swapped names), scores shift 3–6 μ in the direction of the swapped name (p<0.001 for both hawk and dove pools). The model knows who these people are, and that knowledge nudges its judgments.

But the text dominates. Even in Arm C, where Lacker’s speeches are labeled “Evans” (a well-known dove), his scores drop by about 6 μ but remain above the corpus mean. The model doesn’t just follow the name. It reads the actual text, disagrees with the label, and still rates the speech as relatively hawkish. The name is a thumb on the scale, not the whole scale.

Arm E is the cleanest test. We re-ran the tournament with every speaker keeping their real name except Lacker, whose speeches were labeled with the fictional surname “Whitfield.”

The “Whitfield” speeches scored the same as they do with no name at all. This confirms that the “Lacker” label is worth about +2 μ of hawkish bias, and that the anonymized baseline successfully removes that signal: a fictional name and no name produce the same score.

Why this isn’t damning

The bias exists but is modest. Even for the most affected speakers, name contamination shifts scores by 2–5 points on a 100-point scale, far smaller than the 20+ point gaps between genuine hawks and doves. Hoenig doesn’t rank as the most hawkish speaker because the model knows his reputation. He ranks there because his speeches argue for tighter policy in language that consistently wins pairwise comparisons.

More importantly, the baseline scores used on this site are the anonymized ones (Arm A), and the experiments confirm that anonymization is working: stripping the name removes the name signal. The remaining variation comes from the text itself, which is exactly what we want to measure.

Limitations

These scores reflect LLM judgment applied consistently across 67,000 comparisons. They are reproducible and internally coherent, but not ground truth.

The corpus is imperfect. While it draws from multiple official sources, it does not capture most interviews, media appearances, or informal communication, all of which often carry significant policy signal. Some speeches were likely missed, and despite two rounds of filtering, a small number of borderline speeches may remain. Coverage is uneven: Chairs and Governors are well represented, while some regional bank presidents have fewer rated speeches, particularly in earlier decades, when digital records are sparse.