Benchmarks
Performance and accuracy measurements on standard entity resolution datasets.
Accuracy
Tested on Leipzig benchmark datasets and synthetic data.
Structured data
| Dataset | Records | Strategy | Precision | Recall | F1 | Cost |
|---|---|---|---|---|---|---|
| DBLP-ACM (bibliographic) | 4,910 | Weighted fuzzy | 97.2% | 97.1% | 97.2% | $0 |
| DBLP-ACM | 4,910 | Fellegi-Sunter (opt-in) | 98.8% | 57.6% | 72.8% | $0 |
| DBLP-ACM | 4,910 | Learned blocking | 97.6% | 96.3% | 96.9% | $0 |
| DBLP-Scholar | 2.6K vs 64K | Multi-pass + fuzzy | – | – | 74.7% | $0 |
For structured data (names, addresses, bibliographic records), fuzzy matching alone achieves 97%+ F1 with zero cost and zero training labels.
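The weighted-fuzzy idea can be sketched in a few lines. This uses stdlib `difflib` as a stand-in for the library's tuned similarity functions; the field names and weights are illustrative, not GoldenMatch's actual defaults:

```python
from difflib import SequenceMatcher

# Illustrative per-field weights -- not the tool's real configuration.
WEIGHTS = {"title": 0.6, "authors": 0.3, "venue": 0.1}

def field_sim(a: str, b: str) -> float:
    """Case-insensitive sequence similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def weighted_score(a: dict, b: dict) -> float:
    """Weighted average of per-field similarity."""
    return sum(w * field_sim(a.get(f, ""), b.get(f, "")) for f, w in WEIGHTS.items())

rec_a = {"title": "Entity Resolution: A Survey",
         "authors": "L. Getoor, A. Machanavajjhala", "venue": "VLDB"}
rec_b = {"title": "Entity resolution: a survey",
         "authors": "Lise Getoor; Ashwin Machanavajjhala", "venue": "PVLDB"}
score = weighted_score(rec_a, rec_b)  # high despite formatting differences
```

The point of the weighting is that a near-perfect title match can carry a pair even when author formatting differs, which is exactly the situation in bibliographic data.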
Product matching
| Dataset | Records | Strategy | Precision | Recall | F1 | Cost |
|---|---|---|---|---|---|---|
| Abt-Buy (electronics) | 2,162 | Embedding + ANN | 35.5% | 59.4% | 44.5% | $0 |
| Abt-Buy | 2,162 | Model extraction + emb | 39.3% | 71.0% | 50.6% | $0 |
| Abt-Buy | 2,162 | Domain + emb + LLM | 94.8% | 58.3% | 72.2% | $0.04 |
| Abt-Buy | 2,162 | Vertex AI + GPT-4o-mini | 94.8% | 71.2% | 81.7% | $0.74 |
| Amazon-Google (software) | 4,589 | emb + ANN + LLM | 63.3% | 35.2% | 45.3% | $0.02 |
| Amazon-Google | 4,589 | Vertex AI + reranking | – | – | 44.0% | $0.10 |
Product matching benefits from domain extraction (electronics) and from LLM scoring of borderline pairs. Adding too many candidate sources can hurt software matching; keep the candidate set clean.
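Model extraction of the kind used for electronics can be sketched with a simple regex. The pattern and example titles below are illustrative, not the extraction rules the benchmark actually uses:

```python
import re

# Illustrative pattern for alphanumeric model identifiers (e.g. "DMC-FZ28K"):
# a token of 4+ chars containing at least one letter and one digit.
# Real domain extraction would use per-category rules; this is a sketch.
MODEL_RE = re.compile(r"\b(?=[A-Z0-9-]*[A-Z])(?=[A-Z0-9-]*\d)[A-Z0-9][A-Z0-9-]{3,}\b")

def extract_models(title: str) -> set[str]:
    return set(MODEL_RE.findall(title.upper()))

left = "Panasonic Lumix DMC-FZ28K 10MP Digital Camera"
right = "DMC-FZ28K Panasonic 10.1 megapixel camera black"

# A shared extracted model yields a high-precision candidate pair for free.
shared = extract_models(left) & extract_models(right)
```

Pairs recovered this way cost nothing and need no embedding call, which is why the "Model extraction + emb" row improves recall over embeddings alone.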
Comparison with other tools
| Tool | Abt-Buy F1 | DBLP-ACM F1 | Training Required | Zero-Config |
|---|---|---|---|---|
| GoldenMatch | 81.7% | 97.2% | No | Yes |
| dedupe | ~75% | ~96% | Yes | No |
| Splink | ~70% | ~95% | Yes | No |
| Zingg | ~80% | ~96% | Yes | No |
| Ditto | 89.3% | 99.0% | Yes (1000+ labels) | No |
GoldenMatch trades roughly 8 points of F1 on Abt-Buy for zero training labels and no GPU requirement. On DBLP-ACM it comes within 2 points of the state of the art.
PPRL accuracy
Benchmarked on FEBRL4 (5K vs 5K synthetic person records) and NCVR (North Carolina Voter Registration).
| Dataset | Strategy | Precision | Recall | F1 | Privacy |
|---|---|---|---|---|---|
| FEBRL4 | Normal fuzzy (baseline) | 56.5% | 74.6% | 64.3% | None |
| FEBRL4 | PPRL manual tuning | 98.2% | 82.6% | 89.8% | HMAC |
| FEBRL4 | PPRL auto-config | 99.7% | 86.1% | 92.4% | Per-field HMAC |
| FEBRL4 | PPRL paranoid | 98.9% | 76.0% | 86.0% | HMAC + balanced |
| NCVR | PPRL auto-config | 64.0% | 93.8% | 76.1% | Per-field HMAC |
Auto-configuration beats manual tuning on FEBRL4 (92.4% vs 89.8% F1) and holds up on real-world NCVR data. PPRL auto-config profiles your data and picks fields, bloom filter parameters, and a threshold for you.
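A minimal sketch of the underlying encoding, assuming a standard CLK-style construction (keyed HMAC over character bigrams, Dice comparison of bit sets). The filter size, hash count, and example field here are illustrative, not what auto-config would actually pick:

```python
import hashlib
import hmac

FILTER_BITS = 256  # illustrative; production CLK filters are typically larger
NUM_HASHES = 10

def bigrams(s: str) -> set[str]:
    s = f"_{s.lower().strip()}_"
    return {s[i:i + 2] for i in range(len(s) - 1)}

def encode(value: str, key: bytes) -> set[int]:
    """Set NUM_HASHES bit positions per bigram, keyed with a per-field HMAC."""
    bits = set()
    for gram in bigrams(value):
        digest = hmac.new(key, gram.encode(), hashlib.sha256).digest()
        for k in range(NUM_HASHES):
            bits.add(int.from_bytes(digest[2 * k:2 * k + 2], "big") % FILTER_BITS)
    return bits

def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two bit-position sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

key = b"per-field secret shared by both parties"
sim = dice(encode("katherine", key), encode("catherine", key))  # high similarity
```

Because both parties key the HMAC with a shared secret, similarity survives encoding (shared bigrams set the same bits) while the raw values never leave either side.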
Performance (throughput)
Measured on a laptop (Windows 11, Python 3.12, 16GB RAM) with fuzzy + exact + golden record pipeline.
| Records | Time | Throughput | Pairs Found | Memory |
|---|---|---|---|---|
| 1,000 | 0.15s | 6,667 rec/s | 210 | 101 MB |
| 10,000 | 1.67s | 5,975 rec/s | 7,000 | 123 MB |
| 100,000 | 12.78s | 7,823 rec/s | 571,000 | 546 MB |
| 1,000,000 | 7.8s | 128,205 rec/s | – | – |
The 1M benchmark is exact-only (Polars self-join). Fuzzy matching at 1M requires chunked processing or the DuckDB backend.
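The exact pass is conceptually a self-join on a normalized key. A stdlib sketch of the same grouping logic (the Polars path does this vectorized; the field names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def exact_pairs(records: list[dict], key_fields: tuple[str, ...]) -> list[tuple[int, int]]:
    """Group records by a normalized key, then emit all pairs within each group.
    Equivalent in spirit to a self-join on the key column."""
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        key = tuple(str(rec.get(f, "")).strip().lower() for f in key_fields)
        groups[key].append(i)
    return [pair for ids in groups.values() for pair in combinations(ids, 2)]

records = [
    {"name": "Ada Lovelace", "zip": "10001"},
    {"name": "ada lovelace ", "zip": "10001"},
    {"name": "Alan Turing", "zip": "10001"},
]
pairs = exact_pairs(records, ("name", "zip"))
```

Exact matching scales near-linearly because it never compares records across groups, which is why the 1M exact-only run is faster than the 100K fuzzy run.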
Fuzzy matching speedup
Parallel block scoring + intra-field early termination reduced 100K fuzzy matching from ~100s to ~39s (2.5x speedup):
| Optimization | 100K Time |
|---|---|
| Baseline (sequential, no early termination) | ~100s |
| + Parallel block scoring (ThreadPoolExecutor) | ~55s |
| + Intra-field early termination | ~39s |
| Pipeline overhead (ingest, standardize, cluster, golden) | ~12.8s total |
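The two optimizations can be sketched as follows. Weights, threshold, and field names are illustrative, and the real speedup depends on the similarity kernels releasing the GIL (pure-Python scoring under `ThreadPoolExecutor` gains little):

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative weights and threshold -- not the pipeline's real values.
WEIGHTS = {"name": 0.7, "city": 0.3}
THRESHOLD = 0.85

def score_pair(a: dict, b: dict) -> float:
    """Weighted similarity with intra-field early termination: bail out as
    soon as perfect scores on the remaining fields cannot reach THRESHOLD."""
    score, remaining = 0.0, sum(WEIGHTS.values())
    for field, w in WEIGHTS.items():
        remaining -= w
        score += w * SequenceMatcher(None, a[field], b[field]).ratio()
        if score + remaining < THRESHOLD:
            return 0.0  # cannot reach threshold; skip remaining fields
    return score

def score_block(block: list[dict]) -> list[tuple[dict, dict, float]]:
    """Score all pairs inside one block, keeping only above-threshold matches."""
    out = []
    for a, b in combinations(block, 2):
        s = score_pair(a, b)
        if s >= THRESHOLD:
            out.append((a, b, s))
    return out

def score_blocks(blocks: list[list[dict]]) -> list[tuple[dict, dict, float]]:
    """Score independent blocks in parallel threads."""
    with ThreadPoolExecutor() as pool:
        return [m for res in pool.map(score_block, blocks) for m in res]
```

Blocks are independent, so they parallelize cleanly; early termination then skips most of the per-pair work for clearly non-matching pairs.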
Scaling guidance
| Records | Recommended Approach |
|---|---|
| < 100K | Default (in-memory Polars) |
| 100K – 500K | Default with blocking tuning |
| 500K – 1M | Chunked processing (--chunked) |
| 1M – 10M | DuckDB backend or database sync |
| 10M+ | Ray backend (--backend ray) or database sync + ANN |
How to run benchmarks
Leipzig benchmarks
```bash
python tests/benchmarks/run_leipzig.py
```
Runs DBLP-ACM, DBLP-Scholar with multiple strategies and reports F1.
v0.3.0 quick benchmarks
```bash
python tests/benchmarks/run_v030_quick.py
```
Tests Fellegi-Sunter, learned blocking, and LLM budget features.
Domain extraction
```bash
python tests/benchmarks/run_domain_bench.py        # Abt-Buy
python tests/benchmarks/run_amazon_google_bench.py # Amazon-Google
```
LLM + embedding
```bash
OPENAI_API_KEY=... python tests/benchmarks/run_llm_budget_bench.py
```
Requires an OpenAI API key.
Throughput benchmark
```bash
python tests/bench_1m.py
```
Generates synthetic data at multiple scales and measures throughput.
Analyze results
```bash
python tests/analyze_results.py
```
Key findings
- Structured data does not need LLMs or embeddings. Fuzzy matching achieves 97%+ F1 on bibliographic and person records.
- Product matching needs domain extraction + LLM. Domain extraction gets 393/1081 model matches for free on electronics. LLM scoring handles the borderline pairs.
- More candidates can hurt. Adding candidate sources (domain extraction, token normalization, manufacturer blocking) helps electronics but hurts software matching. Keep the candidate set clean for domains without precise identifiers.
- Blocking key choice dominates performance. A coarse blocking key (state) makes 100K fuzzy matching 30x slower than a fine key (zip + soundex).
- PPRL auto-config beats manual tuning. 92.4% F1 vs 89.8% on FEBRL4, with zero manual configuration.
- 4 PPRL fields beat 6. Fewer, higher-quality fields reduce noise in bloom filter comparison.
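To make the blocking-key finding concrete, here is an illustrative fine key (zip prefix + surname Soundex). The Soundex below is a simplified textbook variant, not necessarily the library's implementation:

```python
def soundex(name: str) -> str:
    """Classic four-character Soundex code (simplified: no special h/w rule)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return "0000"
    out, prev = name[0].upper(), codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def blocking_key(rec: dict) -> str:
    # Fine key: zip prefix + surname Soundex keeps blocks small, so the
    # quadratic within-block comparison stays cheap. A coarse key like state
    # would put far more records in each block.
    return f'{rec["zip"][:5]}-{soundex(rec["surname"])}'
```

Because pairwise comparison is quadratic in block size, halving the average block size roughly quarters the fuzzy-matching work; that is the mechanism behind the 30x gap reported above.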