GoldenMatch

Entity resolution that finds duplicates in your data so you don’t have to define the rules yourself.

What It Does

GoldenMatch takes messy records and determines which ones refer to the same entity, without requiring you to hand-write matching rules.

INGEST → STANDARDIZE → BLOCK → SCORE → CLUSTER → GOLDEN RECORD

| Step | What Happens |
|---|---|
| Ingest | Load CSV, Excel, Parquet, or a DataFrame |
| Standardize | Normalize casing, whitespace, phonetic encoding |
| Block | Group candidates to avoid N^2 comparisons |
| Score | Fuzzy match (jaro-winkler, levenshtein, token sort) |
| Cluster | Union-Find with confidence scoring |
| Golden | Merge clusters into canonical records |
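
To make the stages above concrete, here is a minimal standard-library sketch of the same flow. It is not GoldenMatch's implementation: the sample records, the `resolve` helper, blocking on `zip`, and the golden-record rule (longest name, first non-empty email) are all assumptions made for illustration, and `difflib.SequenceMatcher` stands in for a Jaro-Winkler or Levenshtein scorer.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data, invented for this sketch.
records = [
    {"id": 0, "name": "Jon Smith", "zip": "10001", "email": "jon@example.com"},
    {"id": 1, "name": "john smith ", "zip": "10001", "email": ""},
    {"id": 2, "name": "Mary Jones", "zip": "94105", "email": "mary@example.com"},
    {"id": 3, "name": "JON  SMITH", "zip": "10001", "email": "jon@example.com"},
]


def standardize(value):
    """STANDARDIZE: lowercase and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def similarity(a, b):
    """SCORE: difflib ratio in [0, 1], a stand-in for a real fuzzy scorer."""
    return SequenceMatcher(None, a, b).ratio()


def find(parent, x):
    """Union-Find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def resolve(records, threshold=0.85):
    names = {r["id"]: standardize(r["name"]) for r in records}

    # BLOCK: only records sharing a zip code are compared with each other.
    blocks = defaultdict(list)
    for r in records:
        blocks[r["zip"]].append(r["id"])

    # SCORE + CLUSTER: union every pair whose similarity clears the threshold.
    parent = {r["id"]: r["id"] for r in records}
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if similarity(names[a], names[b]) >= threshold:
                parent[find(parent, a)] = find(parent, b)

    clusters = defaultdict(list)
    for r in records:
        clusters[find(parent, r["id"])].append(r)

    # GOLDEN RECORD: assumed rule, longest name and first non-empty email.
    golden = []
    for members in clusters.values():
        golden.append({
            "ids": sorted(m["id"] for m in members),
            "name": max((names[m["id"]] for m in members), key=len),
            "email": next((m["email"] for m in members if m["email"]), ""),
        })
    return golden


for row in resolve(records):
    print(row)
```

With these four records, comparing every pair would take 6 comparisons; blocking on `zip` reduces that to 3, which is the point of the Block step as datasets grow.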

Quick Install

pip install goldenmatch

import goldenmatch as gm

result = gm.dedupe("customers.csv", exact=["email"], fuzzy={"name": 0.85})
print(f"{result.total_clusters} clusters, {result.match_rate:.0%} match rate")
result.golden.write_csv("golden_records.csv")

Benchmarks

| Dataset | Records | Method | F1 | Time |
|---|---|---|---|---|
| DBLP-ACM (academic) | 4,910 | Fuzzy matching | 97.2% | 2.1s |
| Abt-Buy (electronics) | 2,162 | Domain + LLM | 72.2% | 4.2s |
| FEBRL4 (PPRL) | 10,000 | Auto-config bloom filters | 92.4% | 14s |
| Synthetic | 100K | Fuzzy (name+zip) | | 12.8s |
| Synthetic | 1M | Exact dedupe | | 7.8s |
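
For readers unfamiliar with the F1 column, the sketch below shows one common way to score entity resolution: pairwise F1 over matched record pairs. This README does not say how the figures above were computed, so treat the pairwise definition as an assumption; the `pairwise_f1` helper and the sample pairs are hypothetical.

```python
def pairwise_f1(predicted_pairs, true_pairs):
    """Harmonic mean of precision and recall over unordered record pairs."""
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}

    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Invented example: three correct pairs, one false match, one missed match.
predicted = [(0, 1), (0, 3), (1, 3), (2, 4)]
truth = [(0, 1), (0, 3), (1, 3), (4, 5)]
score = pairwise_f1(predicted, truth)
print(f"{score:.1%}")
```

Here precision and recall are both 3/4, so the F1 is 75%.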

Scale: 7,823 records/sec on a laptop (fuzzy + exact + golden).


7 Ways to Use It

| Interface | Install | Best For |
|---|---|---|
| Python API | pip install goldenmatch | Notebooks, scripts, AI agents |
| CLI | Same package, 21 commands | Terminal workflows |
| Interactive TUI | goldenmatch tui | Visual exploration |
| PostgreSQL | Pre-built .deb/.rpm | Production databases |
| DuckDB | pip install goldenmatch-duckdb | Analytics |
| REST API / MCP | goldenmatch serve / mcp-serve | Microservices, AI assistants |
| ER Agent (A2A) | goldenmatch agent-serve | AI-to-AI discovery, autonomous ER |

Documentation

| Guide | Description |
|---|---|
| Installation | pip, apt, rpm, Docker, build from source |
| Quick Start | First dedupe in 30 seconds |
| Python API | 101 exports: dedupe, match, score, explain, PPRL |
| CLI Reference | 21 commands with examples |
| Interactive TUI | 6-tab visual interface |
| Configuration | YAML config with matchkeys, blocking, golden rules |
| Pipeline | 10-step pipeline architecture |
| Blocking Strategies | Static, learned, ANN blocking |
| Scoring | Fuzzy, exact, probabilistic, LLM scoring |
| Domain Packs | 7 built-in YAML rulebooks |
| PPRL | Privacy-preserving record linkage |
| LLM Integration | LLM scorer, LLM clustering, budget tracking |
| Streaming & Incremental | Real-time matching, append-only mode |
| PostgreSQL Extension | 18 SQL functions, pipeline schema |
| DuckDB Extension | 12 Python UDFs |
| REST API | HTTP endpoints, review queue |
| MCP Server | Claude Desktop integration |
| Evaluation | Benchmarks, CI/CD quality gates |
| ER Agent | A2A + MCP autonomous agent, confidence gating |
| Architecture | Module map, code patterns |
| Benchmarks | Performance and accuracy numbers |

Part of the Golden Suite

| Package | What It Does |
|---|---|
| GoldenMatch | Entity resolution (this project) |
| GoldenCheck | Data validation that discovers rules |
| goldenmatch-extensions | SQL extensions for Postgres + DuckDB |
| goldenmatch-duckdb | DuckDB UDFs for entity resolution |