How well does extract<T> actually work?
Nomos's thesis stands or falls on the LLM bridge, so this page is where we put real numbers. The second run covers three models on the same 40 questions each (10 samples × 4 categories), 120 real extractions in total. Honest wins, honest losses.
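For orientation, here is a minimal sketch of the kind of call being benchmarked. The extract<T> signature, option names, and the CuadClause shape below are illustrative assumptions, not the real Nomos API; the stub is declared so the sketch type-checks on its own.

// Hypothetical stand-in for the Nomos bridge (assumed signature).
declare function extract<T>(opts: {
  model: string;
  document: string;
  instruction: string;
}): Promise<T>;
declare const contractText: string; // full CUAD contract text

// The shallow schema used in this run: one free-text value plus a
// self-rated confidence score.
type CuadClause = { value: string; confidence: number };

// One extraction per (contract, category, model) triple.
const clause = await extract<CuadClause>({
  model: "anthropic/claude-sonnet-4.5",
  document: contractText,
  instruction: "Extract the Governing Law clause.",
});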
Dataset: CUAD v1 (Atticus Project)
Extractions: 120 (10 samples × 4 categories × 3 models)
Models: anthropic/claude-sonnet-4.5 · openai/gpt-4o · google/gemini-2.5-pro
Run date: 2026-04-21
Overall, by model.
— Claude Sonnet 4.5 (anthropic/claude-sonnet-4.5)
— GPT-4o (openai/gpt-4o)
— Gemini 2.5 Pro (google/gemini-2.5-pro)
The three frontier models land within ~10 points of each other on contains and F1. Sonnet edges ahead on exact match (0.47); GPT-4o leads on containment (0.75). Gemini reports 0.98 confidence, matching Sonnet despite lower exact-match accuracy: exactly the self-rated-confidence problem that motivates the ensemble voting primitive.
Per category, per model.
EM: the extracted string equals a ground-truth answer (case-insensitive). Contains: the extracted string is a substring of the gold answer, or vice versa. F1: token-level F1 against the best-matching ground-truth answer. Conf: confidence as self-rated by the model.
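For concreteness, here is a minimal sketch of how those scores can be computed. This mirrors standard SQuAD-style scoring; the harness's exact normalization and tokenization may differ.

// Normalize: lowercase and collapse whitespace (harness specifics may differ).
const norm = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();

// EM: normalized strings are identical.
const exactMatch = (pred: string, gold: string) => norm(pred) === norm(gold);

// Contains: either normalized string is a substring of the other.
const contains = (pred: string, gold: string) =>
  norm(pred).includes(norm(gold)) || norm(gold).includes(norm(pred));

// Token-level F1: harmonic mean of precision and recall over shared tokens.
function tokenF1(pred: string, gold: string): number {
  const p = norm(pred).split(" ").filter(Boolean);
  const g = norm(gold).split(" ").filter(Boolean);
  if (p.length === 0 || g.length === 0) return Number(p.length === g.length);
  const goldCounts = new Map<string, number>();
  for (const t of g) goldCounts.set(t, (goldCounts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of p) {
    const c = goldCounts.get(t) ?? 0;
    if (c > 0) { overlap++; goldCounts.set(t, c - 1); }
  }
  if (overlap === 0) return 0;
  const precision = overlap / p.length;
  const recall = overlap / g.length;
  return (2 * precision * recall) / (precision + recall);
}

// "Best ground-truth": max F1 over all gold answers (assumes at least one).
const bestF1 = (pred: string, golds: string[]) =>
  Math.max(...golds.map((g) => tokenF1(pred, g)));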
What this means.
Document Name is a 1.00/1.00 win for Sonnet and GPT-4o; Gemini lands at 0.80. Field-name extraction is essentially solved when the schema is explicit.
Parties shows a consistent gap across all three models: contains is 0.80–1.00 but F1 is 0.15–0.19. The LLM returns the full party paragraph; CUAD wants just the name. A stricter schema (adding a canonical_name: String field, sketched below) would close most of this gap.
Effective Date is where the shallow schema hurts us. All three models return phrases ("This Agreement is entered into on January 13, 2005") instead of just dates. The fix is a strict Date-only schema variant for the value field.
Governing Law ranges from 0.60 to 0.70 EM across models — all correct in spirit, differing only in boilerplate trimming.
Confidence sits at 0.93–1.00 across every model and every category, even when exact match is 0.00. That is the core problem with self-rated confidence, and why Nomos ships an extractEnsemble primitive: cross-model agreement is a much stronger signal than any single model's opinion of itself.
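A minimal sketch of what cross-model voting can look like. The real extractEnsemble API is Nomos's; the consensus helper below is a hypothetical stand-in that majority-votes over normalized answers.

// Hypothetical majority vote over N single-model extractions; the
// agreement rate, not self-rated confidence, becomes the trust signal.
function consensus(answers: string[]): { value: string; agreement: number } {
  const norm = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();
  const tally = new Map<string, { raw: string; count: number }>();
  for (const a of answers) {
    const key = norm(a);
    const entry = tally.get(key) ?? { raw: a, count: 0 };
    entry.count++;
    tally.set(key, entry);
  }
  // Most common normalized answer wins; agreement = share of models.
  const best = [...tally.values()].sort((a, b) => b.count - a.count)[0];
  return { value: best.raw, agreement: best.count / answers.length };
}

// Two of three models agreeing is a stronger signal than one model
// reporting 0.98 confidence in its own answer.
consensus(["Delaware", "delaware", "State of Delaware"]);
// -> { value: "Delaware", agreement: 0.67 } under exact-string voting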
Reproduce.
Clone the repo, install, pull CUAD, run the harness:
git clone https://github.com/sboghossian/nomos
cd nomos && npm install && npm run build
# Download CUAD (38 MB) from Zenodo
curl -sL -o /tmp/cuad.zip \
  "https://zenodo.org/record/4595826/files/CUAD_v1.zip?download=1"
unzip -oq /tmp/cuad.zip -d /tmp/cuad
cp /tmp/cuad/CUAD_v1/CUAD_v1.json /tmp/cuad.json
# Run the cross-model benchmark
export OPENROUTER_API_KEY=sk-or-...
node bench/cuad/harness.mjs \
  --cuad /tmp/cuad.json \
  --samples 10 \
  --categories "Document Name,Effective Date,Parties,Governing Law" \
  --models "claude-sonnet-4-5,gpt-4o,gemini-2-5-pro"
Results land in bench/cuad/results/cuad-<timestamp>.json with per-item scores, per-model aggregates, and the raw extracted text. The on-disk LLM cache means a second identical run is free.
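To poke at a finished run, something like the following works. The perModel field name is an assumption about the harness output, not a documented shape, so inspect the keys first.

// Hypothetical peek at a results file; field names are assumptions.
import { readFileSync } from "node:fs";

const path = process.argv[2]; // e.g. bench/cuad/results/cuad-<timestamp>.json
const run = JSON.parse(readFileSync(path, "utf8"));

console.log(Object.keys(run));     // discover the actual top-level shape
console.table(run.perModel ?? {}); // assumed: per-model EM / contains / F1 / conf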
Next runs.
— Scale: 50 samples × 10 categories = 500 per model; 1,500 extractions.
— Strict schemas: add canonical_name + Date-only variants; quantify the F1 lift.
— Ensemble consensus: instead of single-model scores, run extractEnsemble with all three and report agreement rates.
— MAUD: the same harness on 47k M&A labels across 152 contracts.
— ACORD: evaluate retrieval accuracy on 126k query-clause pairs.
All results here get posted verbatim — wins and losses. Nomos's credibility rides on the graph, not the hype.