Performance and accuracy measurements on standard entity resolution datasets: the Leipzig benchmarks (DBLP-ACM, DBLP-Scholar, Abt-Buy, Amazon-Google) plus synthetic data.
| Dataset | Records | Strategy | Precision | Recall | F1 | Cost |
|---|---|---|---|---|---|---|
| DBLP-ACM (bibliographic) | 4,910 | Weighted fuzzy | 97.2% | 97.1% | 97.2% | $0 |
| DBLP-ACM | 4,910 | Fellegi-Sunter (opt-in) | 98.8% | 57.6% | 72.8% | $0 |
| DBLP-ACM | 4,910 | Learned blocking | 97.6% | 96.3% | 96.9% | $0 |
| DBLP-Scholar | 2.6K × 64K | Multi-pass + fuzzy | – | – | 74.7% | $0 |
For structured data (names, addresses, bibliographic records), fuzzy matching alone achieves 97%+ F1 with zero cost and zero training labels.
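The weighted fuzzy strategy is simple enough to sketch. Below is a minimal illustration of field-weighted fuzzy scoring using rapidfuzz; the field names, weights, and threshold are assumptions for illustration, not GoldenMatch's actual configuration.

```python
# Minimal sketch of field-weighted fuzzy scoring (illustrative, not the
# GoldenMatch API). Weights and threshold are assumed values.
from rapidfuzz import fuzz

WEIGHTS = {"title": 0.6, "authors": 0.3, "venue": 0.1}

def weighted_score(a: dict, b: dict) -> float:
    """Weighted average of token-sort similarity across configured fields."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        total += weight * fuzz.token_sort_ratio(a.get(field, ""), b.get(field, "")) / 100
    return total

rec_a = {"title": "Entity Resolution at Scale", "authors": "A. Smith", "venue": "VLDB"}
rec_b = {"title": "Entity resolution at scale.", "authors": "Smith, A.", "venue": "VLDB"}
print(weighted_score(rec_a, rec_b) >= 0.9)  # accept as a match above threshold
```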
| Dataset | Records | Strategy | Precision | Recall | F1 | Cost |
|---|---|---|---|---|---|---|
| Abt-Buy (electronics) | 2,162 | Embedding + ANN | 35.5% | 59.4% | 44.5% | $0 |
| Abt-Buy | 2,162 | Model extraction + emb | 39.3% | 71.0% | 50.6% | $0 |
| Abt-Buy | 2,162 | Domain + emb + LLM | 94.8% | 58.3% | 72.2% | $0.04 |
| Abt-Buy | 2,162 | Vertex AI + GPT-4o-mini | 94.8% | 71.2% | 81.7% | $0.74 |
| Amazon-Google (software) | 4,589 | emb + ANN + LLM | 63.3% | 35.2% | 45.3% | $0.02 |
| Amazon-Google | 4,589 | Vertex AI + reranking | – | – | 44.0% | $0.10 |
Product matching benefits from domain extraction (electronics) and LLM scoring (borderline pairs). Adding too many candidate sources can hurt software matching: keep the candidate set clean.
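The embedding + ANN rows above reduce to: encode product titles, take nearest neighbors as candidate pairs, then score. A minimal sketch assuming sentence-transformers and scikit-learn (the model name and neighbor count are illustrative, not the benchmarked setup):

```python
# Sketch of embedding + ANN candidate generation (illustrative only).
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

titles_a = ["Sony BDP-S360 Blu-ray Player", "Canon PowerShot SD1200 Silver"]
titles_b = ["Sony Blu-ray Disc Player BDPS360", "Canon SD1200 IS Digital ELPH"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
emb_a = model.encode(titles_a, normalize_embeddings=True)
emb_b = model.encode(titles_b, normalize_embeddings=True)

# Cosine distance over normalized vectors; each left record gets k candidates.
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(emb_b)
dist, idx = nn.kneighbors(emb_a)
for i, (d, j) in enumerate(zip(dist[:, 0], idx[:, 0])):
    print(titles_a[i], "->", titles_b[j], f"(cosine sim {1 - d:.2f})")
```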
| Tool | Abt-Buy F1 | DBLP-ACM F1 | Training Required | Zero-Config |
|---|---|---|---|---|
| GoldenMatch | 81.7% | 97.2% | No | Yes |
| dedupe | ~75% | ~96% | Yes | No |
| Splink | ~70% | ~95% | Yes | No |
| Zingg | ~80% | ~96% | Yes | No |
| Ditto | 89.3% | 99.0% | Yes (1000+ labels) | No |
GoldenMatch trades ~8 points of F1 on Abt-Buy for zero training labels and no GPU requirement. On DBLP-ACM it comes within 2 points of the state of the art.
Head-to-head against Splink, Dedupe, and RecordLinkage. GoldenMatch uses an explicit configuration with zero training data.
Febrl (5,000 synthetic PII records, 6,538 true pairs):
| Library | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| Splink | 1.000 | 0.995 | 0.998 | 2.0s |
| GoldenMatch | 1.000 | 0.943 | 0.971 | 6.8s |
| Dedupe | 1.000 | 0.865 | 0.928 | 7.2s |
| RecordLinkage | 0.999 | 0.733 | 0.845 | 2.2s |
DBLP-ACM (4,910 bibliographic records, 2,224 true matches):
| Library | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| RecordLinkage | 0.888 | 0.961 | 0.923 | 13.0s |
| GoldenMatch | 0.891 | 0.945 | 0.918 | 6.2s |
| Dedupe | 0.604 | 0.936 | 0.734 | 10.5s |
| Splink | 0.646 | 0.834 | 0.728 | 3.4s |
Key findings:

- Splink dominates on clean synthetic PII (0.998 F1 on Febrl) but drops to 0.728 F1 on real bibliographic data.
- GoldenMatch is the only library within 3 points of the leader on both datasets, with perfect precision on Febrl.
- RecordLinkage leads on DBLP-ACM (0.923 F1) but misses over a quarter of true pairs on Febrl (recall 0.733).
- Dedupe keeps recall high on DBLP-ACM but its precision collapses to 0.604.

Real-world run (bulldozer auction listings, 401K records):
| Dataset | Records | Strategy | Clusters | Matched | LLM Cost | Time |
|---|---|---|---|---|---|---|
| Bulldozer auctions | 401,125 | Multi-pass + ANN hybrid + LLM calibration | 27,937 | 384,650 | ~$0.01 | 323s |
With iterative LLM calibration, the system learned a match threshold of 0.947 from 200 LLM-scored sample pairs instead of scoring all 37,500 pairs directly. ANN hybrid blocking recovered 363 sub-blocks from 15 oversized blocks, matching 949 additional records that string blocking missed. 87.7% of clusters have confidence >= 0.4.
See examples/equipment_dedup.py for the full configuration.
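The calibration loop itself is easy to sketch: label a small sample of scored pairs with the LLM, then pick the similarity threshold that maximizes F1 against those labels. In the sketch below, `llm_label` is a hypothetical stand-in for the actual LLM call; only the threshold-selection logic is the point.

```python
# Sketch of threshold calibration from a small LLM-labeled sample.
import random

def llm_label(pair) -> bool:
    """Ask an LLM whether the pair is a true match (hypothetical stub)."""
    raise NotImplementedError

def calibrate_threshold(scored_pairs, sample_size=200):
    """scored_pairs: list of (pair, similarity). Returns the best threshold."""
    sample = random.sample(scored_pairs, min(sample_size, len(scored_pairs)))
    labeled = [(sim, llm_label(pair)) for pair, sim in sample]

    best_thr, best_f1 = 0.5, -1.0
    for thr in (i / 1000 for i in range(500, 1000)):
        tp = sum(1 for sim, y in labeled if sim >= thr and y)
        fp = sum(1 for sim, y in labeled if sim >= thr and not y)
        fn = sum(1 for sim, y in labeled if sim < thr and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr  # e.g. 0.947 on the bulldozer run
```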
Benchmarked on FEBRL4 (5K vs 5K synthetic person records) and NCVR (North Carolina Voter Registration).
| Dataset | Strategy | Precision | Recall | F1 | Privacy |
|---|---|---|---|---|---|
| FEBRL4 | Normal fuzzy (baseline) | 56.5% | 74.6% | 64.3% | None |
| FEBRL4 | PPRL manual tuning | 98.2% | 82.6% | 89.8% | HMAC |
| FEBRL4 | PPRL auto-config | 99.7% | 86.1% | 92.4% | Per-field HMAC |
| FEBRL4 | PPRL paranoid | 98.9% | 76.0% | 86.0% | HMAC + balanced |
| NCVR | PPRL auto-config | 64.0% | 93.8% | 76.1% | Per-field HMAC |
Auto-configuration beats manual tuning: 92.4% vs 89.8% F1 on FEBRL4 with zero hand-tuning. PPRL auto-config profiles your data and picks the fields, Bloom filter parameters, and match threshold.
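For intuition, here is a minimal sketch of the standard PPRL building block: encode each field's bigrams into a Bloom filter via HMAC-keyed hashes, then compare filters with the Dice coefficient. The filter size, hash count, and key are illustrative assumptions, not the auto-config output.

```python
# Sketch of PPRL-style Bloom filter encoding with HMAC-keyed hashing.
import hashlib
import hmac

FILTER_BITS = 256
NUM_HASHES = 4
SECRET = b"shared-linkage-key"  # both parties must hold the same key

def bigrams(value: str):
    padded = f"_{value.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def encode(value: str) -> set[int]:
    """Map each bigram to NUM_HASHES bit positions via keyed HMAC digests."""
    bits = set()
    for gram in bigrams(value):
        for seed in range(NUM_HASHES):
            digest = hmac.new(SECRET, f"{seed}:{gram}".encode(), hashlib.sha256)
            bits.add(int.from_bytes(digest.digest()[:4], "big") % FILTER_BITS)
    return bits

def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient between two filters (as sets of set bit positions)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

print(dice(encode("Jonathan Smith"), encode("Jonathon Smith")))  # high similarity
```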
Measured on a laptop (Windows 11, Python 3.12, 16GB RAM) with fuzzy + exact + golden record pipeline.
| Records | Time | Throughput | Pairs Found | Memory |
|---|---|---|---|---|
| 1,000 | 0.15s | 6,667 rec/s | 210 | 101 MB |
| 10,000 | 1.67s | 5,975 rec/s | 7,000 | 123 MB |
| 100,000 | 12.78s | 7,823 rec/s | 571,000 | 546 MB |
| 1,000,000 | 7.8s | 128,205 rec/s | – | – |
The 1M benchmark is exact-only (Polars self-join). Fuzzy matching at 1M requires chunked processing or the DuckDB backend.
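For reference, the exact-only path is essentially a hash self-join on a normalized key. A minimal Polars sketch (the column names are assumptions for illustration):

```python
# Sketch of exact matching as a Polars self-join on a normalized key.
import polars as pl

df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["ACME Corp.", "acme corp", "Globex"],
})

# Normalize, join the frame against itself, and keep each pair once.
keyed = df.with_columns(
    pl.col("name").str.to_lowercase().str.replace_all(r"[^a-z0-9]", "").alias("key")
)
pairs = (
    keyed.join(keyed, on="key", suffix="_right")
    .filter(pl.col("id") < pl.col("id_right"))
    .select("id", "id_right")
)
print(pairs)  # (1, 2): both names normalize to "acmecorp"
```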
Parallel block scoring + intra-field early termination reduced 100K fuzzy matching from ~100s to ~39s (2.5x speedup):
| Optimization | 100K Time |
|---|---|
| Baseline (sequential, no early termination) | ~100s |
| + Parallel block scoring (ThreadPoolExecutor) | ~55s |
| + Intra-field early termination | ~39s |

Pipeline overhead (ingest, standardize, cluster, golden record) adds ~12.8s on top of matching.
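A sketch of both optimizations, assuming a per-block scoring function: blocks are scored concurrently, and per-pair scoring bails out as soon as the remaining fields cannot lift the weighted score over the threshold. Field names and weights are illustrative assumptions.

```python
# Sketch: parallel block scoring + intra-field early termination.
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from rapidfuzz import fuzz

WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}  # assumed weights
THRESHOLD = 0.85

def pair_score(a: dict, b: dict) -> float:
    """Stop early once even perfect remaining fields cannot reach the threshold."""
    score, remaining = 0.0, sum(WEIGHTS.values())
    for field, weight in WEIGHTS.items():
        remaining -= weight
        score += weight * fuzz.token_sort_ratio(a.get(field, ""), b.get(field, "")) / 100
        if score + remaining < THRESHOLD:  # best case is still below threshold
            return 0.0
    return score

def score_block(records: list[dict]):
    return [(a, b, s) for a, b in combinations(records, 2)
            if (s := pair_score(a, b)) >= THRESHOLD]

def score_all(blocks: list[list[dict]]):
    with ThreadPoolExecutor() as pool:  # one task per block
        return [m for matches in pool.map(score_block, blocks) for m in matches]
```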
| Records | Recommended Approach |
|---|---|
| < 100K | Default (in-memory Polars) |
| 100K – 500K | Default with blocking tuning |
| 500K – 1M | Chunked processing (--chunked) |
| 1M – 10M | DuckDB backend or database sync |
| 10M+ | Ray backend (--backend ray) or database sync + ANN |
`python tests/benchmarks/run_leipzig.py`
Runs DBLP-ACM and DBLP-Scholar with multiple strategies and reports F1.
`python tests/benchmarks/run_v030_quick.py`
Tests Fellegi-Sunter, learned blocking, and LLM budget features.
`python tests/benchmarks/run_domain_bench.py` (Abt-Buy)
`python tests/benchmarks/run_amazon_google_bench.py` (Amazon-Google)
`OPENAI_API_KEY=... python tests/benchmarks/run_llm_budget_bench.py`
Requires an OpenAI API key.
`python tests/bench_1m.py`
Generates synthetic data at multiple scales and measures throughput.
`python tests/analyze_results.py`
- Structured data does not need LLMs or embeddings. Fuzzy matching achieves 97%+ F1 on bibliographic and person records.
- Product matching needs domain extraction + LLM. Domain extraction gets 393/1,081 model matches for free on electronics; LLM scoring handles the borderline pairs.
- More candidates can hurt. Extra candidate sources (domain extraction, token normalization, manufacturer blocking) help electronics but hurt software matching. Keep the candidate set clean for domains without precise identifiers.
- Blocking key choice dominates performance. A coarse key (state) makes 100K fuzzy matching 30x slower than a fine key (zip + soundex); see the sketch after this list.
- PPRL auto-config beats manual tuning: 92.4% vs 89.8% F1 on FEBRL4, with zero manual configuration.
- 4 PPRL fields beat 6. Fewer, higher-quality fields reduce noise in Bloom filter comparison.
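A minimal sketch of the fine-grained blocking key mentioned above (zip prefix + surname Soundex), assuming the jellyfish library for Soundex; the field names are illustrative:

```python
# Sketch of fine-grained blocking keys: zip prefix + surname Soundex.
from collections import defaultdict
import jellyfish

records = [
    {"id": 1, "surname": "Smith",  "zip": "27601"},
    {"id": 2, "surname": "Smyth",  "zip": "27601"},
    {"id": 3, "surname": "Nguyen", "zip": "94110"},
]

def block_key(rec: dict) -> str:
    # zip[:3] keeps blocks geographically tight; Soundex absorbs spelling drift.
    return f"{rec['zip'][:3]}|{jellyfish.soundex(rec['surname'])}"

blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec)].append(rec["id"])

print(dict(blocks))  # Smith/Smyth share a block; Nguyen is isolated
```

A state-level key would instead pool every record in a state into one block, blowing up the pairwise comparison count, which is where the 30x slowdown comes from.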