How well does extract<T> actually work?
Nomos's thesis stands or falls on the LLM bridge, so this page is where we put real numbers. The second run covers three models on the same 40 questions each (10 samples × 4 categories), 120 real extractions in total. Honest wins, honest losses.
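For orientation, here is a minimal sketch of the kind of call being benchmarked. The extract<T> signature, option names, and the CuadClause shape below are illustrative assumptions, not the real Nomos API; the stub is declared so the sketch type-checks on its own.

// Hypothetical stand-in for the Nomos bridge (assumed signature).
declare function extract<T>(opts: {
  model: string;
  document: string;
  instruction: string;
}): Promise<T>;
declare const contractText: string; // full CUAD contract text

// The shallow schema used in this run: one free-text value plus a
// self-rated confidence score.
type CuadClause = { value: string; confidence: number };

// One extraction per (contract, category, model) triple.
const clause = await extract<CuadClause>({
  model: "anthropic/claude-sonnet-4.5",
  document: contractText,
  instruction: "Extract the Governing Law clause.",
});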
Dataset: CUAD v1 (Atticus Project)
Extractions: 120 (10 samples × 4 categories × 3 models)
Models: anthropic/claude-sonnet-4.5 · openai/gpt-4o · google/gemini-2.5-pro
Run date: 2026-04-21
Overall, by model.
— Claude Sonnet 4.5 (anthropic/claude-sonnet-4.5)
— GPT-4o (openai/gpt-4o)
— Gemini 2.5 Pro (google/gemini-2.5-pro)
The three frontier models land within ~10 points of each other on contains and F1. Sonnet edges ahead on exact match (0.47); GPT-4o leads on containment (0.75). Gemini reports 0.98 confidence, matching Sonnet despite lower exact-match accuracy: exactly the self-rated-confidence problem that motivates the ensemble voting primitive.
Per category, per model.
EM: the extracted string equals a ground-truth answer (case-insensitive). Contains: the extracted string is a substring of the gold answer, or vice versa. F1: token-level F1 against the best-matching ground-truth answer. Conf: confidence as self-rated by the model.
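For concreteness, here is a minimal sketch of how those scores can be computed. This mirrors standard SQuAD-style scoring; the harness's exact normalization and tokenization may differ.

// Normalize: lowercase and collapse whitespace (harness specifics may differ).
const norm = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();

// EM: normalized strings are identical.
const exactMatch = (pred: string, gold: string) => norm(pred) === norm(gold);

// Contains: either normalized string is a substring of the other.
const contains = (pred: string, gold: string) =>
  norm(pred).includes(norm(gold)) || norm(gold).includes(norm(pred));

// Token-level F1: harmonic mean of precision and recall over shared tokens.
function tokenF1(pred: string, gold: string): number {
  const p = norm(pred).split(" ").filter(Boolean);
  const g = norm(gold).split(" ").filter(Boolean);
  if (p.length === 0 || g.length === 0) return Number(p.length === g.length);
  const goldCounts = new Map<string, number>();
  for (const t of g) goldCounts.set(t, (goldCounts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of p) {
    const c = goldCounts.get(t) ?? 0;
    if (c > 0) { overlap++; goldCounts.set(t, c - 1); }
  }
  if (overlap === 0) return 0;
  const precision = overlap / p.length;
  const recall = overlap / g.length;
  return (2 * precision * recall) / (precision + recall);
}

// "Best ground-truth": max F1 over all gold answers (assumes at least one).
const bestF1 = (pred: string, golds: string[]) =>
  Math.max(...golds.map((g) => tokenF1(pred, g)));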
What this means.
Document Name is a 1.00/1.00 win for Sonnet and GPT-4o; Gemini lands at 0.80. Field-name extraction is essentially solved when the schema is explicit.
Parties shows a consistent gap across all three models: contains is 0.80–1.00 but F1 is 0.15–0.19. The LLM returns the full party paragraph; CUAD wants just the name. A stricter schema (adding a canonical_name: String field, sketched below) would close most of this gap.
Effective Date is where the shallow schema hurts us. All three models return phrases ("This Agreement is entered into on January 13, 2005") instead of just dates. The fix is a strict Date-only schema variant for the value field.
Governing Law ranges from 0.60 to 0.70 EM across models — all correct in spirit, differing only in boilerplate trimming.
Confidence sits at 0.93–1.00 across every model and every category, even when exact match is 0.00. That is the core problem with self-rated confidence, and why Nomos ships an extractEnsemble primitive: cross-model agreement is a much stronger signal than any single model's opinion of itself.
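A minimal sketch of what cross-model voting can look like. The real extractEnsemble API is Nomos's; the consensus helper below is a hypothetical stand-in that majority-votes over normalized answers.

// Hypothetical majority vote over N single-model extractions; the
// agreement rate, not self-rated confidence, becomes the trust signal.
function consensus(answers: string[]): { value: string; agreement: number } {
  const norm = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();
  const tally = new Map<string, { raw: string; count: number }>();
  for (const a of answers) {
    const key = norm(a);
    const entry = tally.get(key) ?? { raw: a, count: 0 };
    entry.count++;
    tally.set(key, entry);
  }
  // Most common normalized answer wins; agreement = share of models.
  const best = [...tally.values()].sort((a, b) => b.count - a.count)[0];
  return { value: best.raw, agreement: best.count / answers.length };
}

// Two of three models agreeing is a stronger signal than one model
// reporting 0.98 confidence in its own answer.
consensus(["Delaware", "delaware", "State of Delaware"]);
// -> { value: "Delaware", agreement: 0.67 } under exact-string voting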
Reproduce.
Clone the repo, install, pull CUAD, run the harness:
git clone https://github.com/sboghossian/nomos
cd nomos && npm install && npm run build
# Download CUAD (38 MB) from Zenodo
curl -sL -o /tmp/cuad.zip \
  "https://zenodo.org/record/4595826/files/CUAD_v1.zip?download=1"
unzip -oq /tmp/cuad.zip -d /tmp/cuad
cp /tmp/cuad/CUAD_v1/CUAD_v1.json /tmp/cuad.json
# Run the cross-model benchmark
export OPENROUTER_API_KEY=sk-or-...
node bench/cuad/harness.mjs \
  --cuad /tmp/cuad.json \
  --samples 10 \
  --categories "Document Name,Effective Date,Parties,Governing Law" \
  --models "claude-sonnet-4-5,gpt-4o,gemini-2-5-pro"
Results land in bench/cuad/results/cuad-<timestamp>.json with per-item scores, per-model aggregates, and the raw extracted text. The on-disk LLM cache means a second identical run is free.
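To poke at a finished run, something like the following works. The perModel field name is an assumption about the harness output, not a documented shape, so inspect the keys first.

// Hypothetical peek at a results file; field names are assumptions.
import { readFileSync } from "node:fs";

const path = process.argv[2]; // e.g. bench/cuad/results/cuad-<timestamp>.json
const run = JSON.parse(readFileSync(path, "utf8"));

console.log(Object.keys(run));     // discover the actual top-level shape
console.table(run.perModel ?? {}); // assumed: per-model EM / contains / F1 / conf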
Next runs.
— Scale: 50 samples × 10 categories = 500 per model; 1,500 extractions.
— Strict schemas: add canonical_name + Date-only variants; quantify the F1 lift.
— Ensemble consensus: instead of single-model scores, run extractEnsemble with all three and report agreement rates.
— MAUD: the same harness on 47k M&A labels across 152 contracts.
— ACORD: evaluate retrieval accuracy on 126k query-clause pairs.
All results here get posted verbatim — wins and losses. Nomos's credibility rides on the graph, not the hype.