Performance and accuracy measurements on standard entity resolution datasets: the Leipzig benchmarks (DBLP-ACM, DBLP-Scholar, Abt-Buy, Amazon-Google) plus synthetic data.
| Dataset | Records | Strategy | Precision | Recall | F1 | Cost |
|---|---|---|---|---|---|---|
| DBLP-ACM (bibliographic) | 4,910 | Weighted fuzzy | 97.2% | 97.1% | 97.2% | $0 |
| DBLP-ACM | 4,910 | Fellegi-Sunter (opt-in) | 98.8% | 57.6% | 72.8% | $0 |
| DBLP-ACM | 4,910 | Learned blocking | 97.6% | 96.3% | 96.9% | $0 |
| DBLP-Scholar | 2.6K × 64K | Multi-pass + fuzzy | – | – | 74.7% | $0 |
For structured data (names, addresses, bibliographic records), fuzzy matching alone achieves 97%+ F1 with zero cost and zero training labels.
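The weighted fuzzy strategy is simple enough to sketch. Below is a minimal illustration of field-weighted fuzzy scoring using rapidfuzz; the field names, weights, and threshold are assumptions for illustration, not GoldenMatch's actual configuration.

```python
# Minimal sketch of field-weighted fuzzy scoring (illustrative, not the
# GoldenMatch API). Weights and threshold are assumed values.
from rapidfuzz import fuzz

WEIGHTS = {"title": 0.6, "authors": 0.3, "venue": 0.1}

def weighted_score(a: dict, b: dict) -> float:
    """Weighted average of token-sort similarity across configured fields."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        total += weight * fuzz.token_sort_ratio(a.get(field, ""), b.get(field, "")) / 100
    return total

rec_a = {"title": "Entity Resolution at Scale", "authors": "A. Smith", "venue": "VLDB"}
rec_b = {"title": "Entity resolution at scale.", "authors": "Smith, A.", "venue": "VLDB"}
print(weighted_score(rec_a, rec_b) >= 0.9)  # accept as a match above threshold
```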
| Dataset | Records | Strategy | Precision | Recall | F1 | Cost |
|---|---|---|---|---|---|---|
| Abt-Buy (electronics) | 2,162 | Embedding + ANN | 35.5% | 59.4% | 44.5% | $0 |
| Abt-Buy | 2,162 | Model extraction + emb | 39.3% | 71.0% | 50.6% | $0 |
| Abt-Buy | 2,162 | Domain + emb + LLM | 94.8% | 58.3% | 72.2% | $0.04 |
| Abt-Buy | 2,162 | Vertex AI + GPT-4o-mini | 94.8% | 71.2% | 81.7% | $0.74 |
| Amazon-Google (software) | 4,589 | emb + ANN + LLM | 63.3% | 35.2% | 45.3% | $0.02 |
| Amazon-Google | 4,589 | Vertex AI + reranking | – | – | 44.0% | $0.10 |
Product matching benefits from domain extraction (electronics) and LLM scoring (borderline pairs). Adding too many candidate sources can hurt software matching: keep the candidate set clean.
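The embedding + ANN rows above reduce to: encode product titles, take nearest neighbors as candidate pairs, then score. A minimal sketch assuming sentence-transformers and scikit-learn (the model name and neighbor count are illustrative, not the benchmarked setup):

```python
# Sketch of embedding + ANN candidate generation (illustrative only).
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

titles_a = ["Sony BDP-S360 Blu-ray Player", "Canon PowerShot SD1200 Silver"]
titles_b = ["Sony Blu-ray Disc Player BDPS360", "Canon SD1200 IS Digital ELPH"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
emb_a = model.encode(titles_a, normalize_embeddings=True)
emb_b = model.encode(titles_b, normalize_embeddings=True)

# Cosine distance over normalized vectors; each left record gets k candidates.
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(emb_b)
dist, idx = nn.kneighbors(emb_a)
for i, (d, j) in enumerate(zip(dist[:, 0], idx[:, 0])):
    print(titles_a[i], "->", titles_b[j], f"(cosine sim {1 - d:.2f})")
```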
| Tool | Abt-Buy F1 | DBLP-ACM F1 | Training Required | Zero-Config |
|---|---|---|---|---|
| GoldenMatch | 81.7% | 97.2% | No | Yes |
| dedupe | ~75% | ~96% | Yes | No |
| Splink | ~70% | ~95% | Yes | No |
| Zingg | ~80% | ~96% | Yes | No |
| Ditto | 89.3% | 99.0% | Yes (1000+ labels) | No |
GoldenMatch trades ~8 points of F1 on Abt-Buy for zero training labels and no GPU requirement. On DBLP-ACM it comes within 2 points of the state of the art.
Head-to-head against Splink, Dedupe, and RecordLinkage. GoldenMatch uses an explicit configuration with zero training data.
Febrl (5,000 synthetic PII records, 6,538 true pairs):
| Library | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| Splink | 1.000 | 0.995 | 0.998 | 2.0s |
| GoldenMatch | 1.000 | 0.943 | 0.971 | 6.8s |
| Dedupe | 1.000 | 0.865 | 0.928 | 7.2s |
| RecordLinkage | 0.999 | 0.733 | 0.845 | 2.2s |
DBLP-ACM (4,910 bibliographic records, 2,224 true matches):
| Library | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| RecordLinkage | 0.888 | 0.961 | 0.923 | 13.0s |
| GoldenMatch | 0.891 | 0.945 | 0.918 | 6.2s |
| Dedupe | 0.604 | 0.936 | 0.734 | 10.5s |
| Splink | 0.646 | 0.834 | 0.728 | 3.4s |
Key findings:

- Splink dominates on clean synthetic PII (0.998 F1 on Febrl) but drops to 0.728 F1 on real bibliographic data.
- GoldenMatch is the only library within 3 points of the leader on both datasets, with perfect precision on Febrl.
- RecordLinkage leads on DBLP-ACM (0.923 F1) but misses over a quarter of true pairs on Febrl (recall 0.733).
- Dedupe keeps recall high on DBLP-ACM but its precision collapses to 0.604.

Real-world run (bulldozer auction listings, 401K records):
| Dataset | Records | Strategy | Clusters | Matched | LLM Cost | Time |
|---|---|---|---|---|---|---|
| Bulldozer auctions | 401,125 | Multi-pass + ANN hybrid + LLM calibration | 27,937 | 384,650 | ~$0.01 | 323s |
With iterative LLM calibration, the system learned a match threshold of 0.947 from 200 LLM-scored sample pairs instead of scoring all 37,500 pairs directly. ANN hybrid blocking recovered 363 sub-blocks from 15 oversized blocks, matching 949 additional records that string blocking missed. 87.7% of clusters have confidence >= 0.4.
See examples/equipment_dedup.py for the full configuration.
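The calibration loop itself is easy to sketch: label a small sample of scored pairs with the LLM, then pick the similarity threshold that maximizes F1 against those labels. In the sketch below, `llm_label` is a hypothetical stand-in for the actual LLM call; only the threshold-selection logic is the point.

```python
# Sketch of threshold calibration from a small LLM-labeled sample.
import random

def llm_label(pair) -> bool:
    """Ask an LLM whether the pair is a true match (hypothetical stub)."""
    raise NotImplementedError

def calibrate_threshold(scored_pairs, sample_size=200):
    """scored_pairs: list of (pair, similarity). Returns the best threshold."""
    sample = random.sample(scored_pairs, min(sample_size, len(scored_pairs)))
    labeled = [(sim, llm_label(pair)) for pair, sim in sample]

    best_thr, best_f1 = 0.5, -1.0
    for thr in (i / 1000 for i in range(500, 1000)):
        tp = sum(1 for sim, y in labeled if sim >= thr and y)
        fp = sum(1 for sim, y in labeled if sim >= thr and not y)
        fn = sum(1 for sim, y in labeled if sim < thr and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr  # e.g. 0.947 on the bulldozer run
```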
Benchmarked on FEBRL4 (5K vs 5K synthetic person records) and NCVR (North Carolina Voter Registration).
| Dataset | Strategy | Precision | Recall | F1 | Privacy |
|---|---|---|---|---|---|
| FEBRL4 | Normal fuzzy (baseline) | 56.5% | 74.6% | 64.3% | None |
| FEBRL4 | PPRL manual tuning | 98.2% | 82.6% | 89.8% | HMAC |
| FEBRL4 | PPRL auto-config | 99.7% | 86.1% | 92.4% | Per-field HMAC |
| FEBRL4 | PPRL paranoid | 98.9% | 76.0% | 86.0% | HMAC + balanced |
| NCVR | PPRL auto-config | 64.0% | 93.8% | 76.1% | Per-field HMAC |
Auto-configuration beats manual tuning: 92.4% vs 89.8% F1 on FEBRL4 with zero hand-tuning. PPRL auto-config profiles your data and picks the fields, Bloom filter parameters, and match threshold.
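For intuition, here is a minimal sketch of the standard PPRL building block: encode each field's bigrams into a Bloom filter via HMAC-keyed hashes, then compare filters with the Dice coefficient. The filter size, hash count, and key are illustrative assumptions, not the auto-config output.

```python
# Sketch of PPRL-style Bloom filter encoding with HMAC-keyed hashing.
import hashlib
import hmac

FILTER_BITS = 256
NUM_HASHES = 4
SECRET = b"shared-linkage-key"  # both parties must hold the same key

def bigrams(value: str):
    padded = f"_{value.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def encode(value: str) -> set[int]:
    """Map each bigram to NUM_HASHES bit positions via keyed HMAC digests."""
    bits = set()
    for gram in bigrams(value):
        for seed in range(NUM_HASHES):
            digest = hmac.new(SECRET, f"{seed}:{gram}".encode(), hashlib.sha256)
            bits.add(int.from_bytes(digest.digest()[:4], "big") % FILTER_BITS)
    return bits

def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient between two filters (as sets of set bit positions)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

print(dice(encode("Jonathan Smith"), encode("Jonathon Smith")))  # high similarity
```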
Measured on a laptop (Windows 11, Python 3.12, 16GB RAM) with fuzzy + exact + golden record pipeline.
| Records | Time | Throughput | Pairs Found | Memory |
|---|---|---|---|---|
| 1,000 | 0.15s | 6,667 rec/s | 210 | 101 MB |
| 10,000 | 1.67s | 5,975 rec/s | 7,000 | 123 MB |
| 100,000 | 12.78s | 7,823 rec/s | 571,000 | 546 MB |
| 1,000,000 | 7.8s | 128,205 rec/s | – | – |
The 1M benchmark is exact-only (Polars self-join). Fuzzy matching at 1M requires chunked processing or the DuckDB backend.
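For reference, the exact-only path is essentially a hash self-join on a normalized key. A minimal Polars sketch (the column names are assumptions for illustration):

```python
# Sketch of exact matching as a Polars self-join on a normalized key.
import polars as pl

df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["ACME Corp.", "acme corp", "Globex"],
})

# Normalize, join the frame against itself, and keep each pair once.
keyed = df.with_columns(
    pl.col("name").str.to_lowercase().str.replace_all(r"[^a-z0-9]", "").alias("key")
)
pairs = (
    keyed.join(keyed, on="key", suffix="_right")
    .filter(pl.col("id") < pl.col("id_right"))
    .select("id", "id_right")
)
print(pairs)  # (1, 2): both names normalize to "acmecorp"
```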
Parallel block scoring + intra-field early termination reduced 100K fuzzy matching from ~100s to ~39s (2.5x speedup):
| Optimization | 100K Time |
|---|---|
| Baseline (sequential, no early termination) | ~100s |
| + Parallel block scoring (ThreadPoolExecutor) | ~55s |
| + Intra-field early termination | ~39s |

Pipeline overhead (ingest, standardize, cluster, golden record) adds ~12.8s on top of matching.
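A sketch of both optimizations, assuming a per-block scoring function: blocks are scored concurrently, and per-pair scoring bails out as soon as the remaining fields cannot lift the weighted score over the threshold. Field names and weights are illustrative assumptions.

```python
# Sketch: parallel block scoring + intra-field early termination.
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from rapidfuzz import fuzz

WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}  # assumed weights
THRESHOLD = 0.85

def pair_score(a: dict, b: dict) -> float:
    """Stop early once even perfect remaining fields cannot reach the threshold."""
    score, remaining = 0.0, sum(WEIGHTS.values())
    for field, weight in WEIGHTS.items():
        remaining -= weight
        score += weight * fuzz.token_sort_ratio(a.get(field, ""), b.get(field, "")) / 100
        if score + remaining < THRESHOLD:  # best case is still below threshold
            return 0.0
    return score

def score_block(records: list[dict]):
    return [(a, b, s) for a, b in combinations(records, 2)
            if (s := pair_score(a, b)) >= THRESHOLD]

def score_all(blocks: list[list[dict]]):
    with ThreadPoolExecutor() as pool:  # one task per block
        return [m for matches in pool.map(score_block, blocks) for m in matches]
```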
| Records | Recommended Approach |
|---|---|
| < 100K | Default (in-memory Polars) |
| 100K – 500K | Default with blocking tuning |
| 500K – 1M | Chunked processing (--chunked) |
| 1M – 10M | DuckDB backend or database sync |
| 10M+ | Ray backend (--backend ray) or database sync + ANN |
`python tests/benchmarks/run_leipzig.py`
Runs DBLP-ACM and DBLP-Scholar with multiple strategies and reports F1.
`python tests/benchmarks/run_v030_quick.py`
Tests Fellegi-Sunter, learned blocking, and LLM budget features.
`python tests/benchmarks/run_domain_bench.py` (Abt-Buy)
`python tests/benchmarks/run_amazon_google_bench.py` (Amazon-Google)
`OPENAI_API_KEY=... python tests/benchmarks/run_llm_budget_bench.py`
Requires an OpenAI API key.
`python tests/bench_1m.py`
Generates synthetic data at multiple scales and measures throughput.
`python tests/analyze_results.py`
- Structured data does not need LLMs or embeddings. Fuzzy matching achieves 97%+ F1 on bibliographic and person records.
- Product matching needs domain extraction + LLM. Domain extraction gets 393/1,081 model matches for free on electronics; LLM scoring handles the borderline pairs.
- More candidates can hurt. Extra candidate sources (domain extraction, token normalization, manufacturer blocking) help electronics but hurt software matching. Keep the candidate set clean for domains without precise identifiers.
- Blocking key choice dominates performance. A coarse key (state) makes 100K fuzzy matching 30x slower than a fine key (zip + soundex); see the sketch after this list.
- PPRL auto-config beats manual tuning: 92.4% vs 89.8% F1 on FEBRL4, with zero manual configuration.
- 4 PPRL fields beat 6. Fewer, higher-quality fields reduce noise in Bloom filter comparison.
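A minimal sketch of the fine-grained blocking key mentioned above (zip prefix + surname Soundex), assuming the jellyfish library for Soundex; the field names are illustrative:

```python
# Sketch of fine-grained blocking keys: zip prefix + surname Soundex.
from collections import defaultdict
import jellyfish

records = [
    {"id": 1, "surname": "Smith",  "zip": "27601"},
    {"id": 2, "surname": "Smyth",  "zip": "27601"},
    {"id": 3, "surname": "Nguyen", "zip": "94110"},
]

def block_key(rec: dict) -> str:
    # zip[:3] keeps blocks geographically tight; Soundex absorbs spelling drift.
    return f"{rec['zip'][:3]}|{jellyfish.soundex(rec['surname'])}"

blocks = defaultdict(list)
for rec in records:
    blocks[block_key(rec)].append(rec["id"])

print(dict(blocks))  # Smith/Smyth share a block; Nguyen is isolated
```

A state-level key would instead pool every record in a state into one block, blowing up the pairwise comparison count, which is where the 30x slowdown comes from.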