GoldenMatch

Entity resolution that finds duplicates in your data so you don’t have to define the rules yourself.

What It Does

GoldenMatch takes messy records and figures out which ones refer to the same entity — without requiring you to hand-write matching rules.

INGEST → STANDARDIZE → BLOCK → SCORE → CLUSTER → GOLDEN RECORD

Step	What Happens
Ingest	Load CSV, Excel, Parquet, or a DataFrame
Standardize	Normalize casing and whitespace; apply phonetic encoding
Block	Group candidates to avoid N^2 comparisons
Score	Fuzzy match (Jaro-Winkler, Levenshtein, token sort)
Cluster	Union-Find with confidence scoring
Golden Record	Merge clusters into canonical records
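
The steps in the table can be sketched in plain Python to show how they fit together. This is an illustration only, not GoldenMatch's implementation: the sample records, the `zip` blocking key, the helper names (`standardize`, `find`, `resolve`, `golden`), and the "most common name wins" merge rule are all assumptions, and `difflib.SequenceMatcher` stands in for a real fuzzy metric such as Jaro-Winkler.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data; ids 0-2 are the same person spelled three ways.
records = [
    {"id": 0, "name": "Jon Smith", "zip": "10001"},
    {"id": 1, "name": "John  Smith", "zip": "10001"},
    {"id": 2, "name": "JOHN SMITH", "zip": "10001"},
    {"id": 3, "name": "Mary Jones", "zip": "94105"},
    {"id": 4, "name": "Mary Jonas", "zip": "60601"},
]


def standardize(name):
    # Standardize: lowercase and collapse runs of whitespace.
    return " ".join(name.lower().split())


def find(parent, x):
    # Union-Find root lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def resolve(records, threshold=0.85):
    names = {r["id"]: standardize(r["name"]) for r in records}

    # Block: only records that share a zip code become candidate pairs.
    blocks = defaultdict(list)
    for r in records:
        blocks[r["zip"]].append(r["id"])

    parent = {r["id"]: r["id"] for r in records}
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            # Score: similarity ratio in [0, 1] (stand-in for Jaro-Winkler).
            score = SequenceMatcher(None, names[a], names[b]).ratio()
            if score >= threshold:
                # Cluster: merge the two records' sets.
                parent[find(parent, a)] = find(parent, b)

    clusters = defaultdict(list)
    for r in records:
        clusters[find(parent, r["id"])].append(r["id"])
    return sorted(sorted(c) for c in clusters.values())


def golden(records, clusters):
    # Golden record: keep the most common standardized name per cluster.
    by_id = {r["id"]: r for r in records}
    merged = []
    for cluster in clusters:
        counts = Counter(standardize(by_id[i]["name"]) for i in cluster)
        merged.append({"ids": cluster, "name": counts.most_common(1)[0][0]})
    return merged


clusters = resolve(records)
golden_records = golden(records, clusters)
print(clusters)
print(golden_records[0])
```

With five records there are 10 possible pairs, but blocking on `zip` leaves only the 3 pairs inside the `10001` block. The trade-off is visible too: "Mary Jones" and "Mary Jonas" sit in different blocks, so they are never compared.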

Quick Install

pip install goldenmatch

import goldenmatch as gm

result = gm.dedupe("customers.csv", exact=["email"], fuzzy={"name": 0.85})
print(f"{result.total_clusters} clusters, {result.match_rate:.0%} match rate")
result.golden.write_csv("golden_records.csv")
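
To try the snippet above without your own data, a small `customers.csv` can be generated with the standard library. The `name` and `email` columns are inferred from the `exact=["email"]` and `fuzzy={"name": 0.85}` arguments; the rows themselves are made up, and whether GoldenMatch expects any other columns is not stated here.

```python
import csv

# Hypothetical sample rows: the first two share an email, and the
# last two have similar names but different emails.
rows = [
    {"name": "Jon Smith", "email": "jsmith@example.com"},
    {"name": "John Smith", "email": "jsmith@example.com"},
    {"name": "Mary Jones", "email": "mary.jones@example.com"},
    {"name": "Mary Jonas", "email": "mjonas@example.com"},
]

with open("customers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "email"])
    writer.writeheader()
    writer.writerows(rows)
```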

Benchmarks

Dataset	Records	Method	F1	Time
DBLP-ACM (academic)	4,910	Fuzzy matching	97.2%	2.1s
Abt-Buy (electronics)	2,162	Domain + LLM	72.2%	4.2s
FEBRL4 (PPRL)	10,000	Auto-config Bloom filters	92.4%	14s
Synthetic	100K	Fuzzy (name+zip)	–	12.8s
Synthetic	1M	Exact dedupe	–	7.8s

Scale: 7,823 records/sec on a laptop (fuzzy + exact + golden).
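
For readers new to the F1 column in the table above, here is a minimal sketch of how a pairwise F1 score is commonly computed for entity resolution. It assumes evaluation over sets of matched record pairs; the exact protocol behind the published numbers is not stated here, and the function name and sample pairs are hypothetical.

```python
def f1_score(true_pairs, predicted_pairs):
    """Harmonic mean of precision and recall over matched pairs."""
    true_positives = len(true_pairs & predicted_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_pairs)
    recall = true_positives / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and predictions, as pairs of record ids.
true_pairs = {(1, 2), (3, 4), (5, 6), (7, 8)}
predicted_pairs = {(1, 2), (3, 4), (5, 6), (9, 10)}

score = f1_score(true_pairs, predicted_pairs)
print(f"F1 = {score:.1%}")
```

Here three of the four predicted pairs are correct and three of the four true pairs are found, so precision and recall are both 0.75, and so is F1.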

7 Ways to Use It

Interface	Install	Best For
Python API	`pip install goldenmatch`	Notebooks, scripts, AI agents
CLI	Same package, 21 commands	Terminal workflows
Interactive TUI	`goldenmatch tui`	Visual exploration
PostgreSQL	Pre-built .deb/.rpm	Production databases
DuckDB	`pip install goldenmatch-duckdb`	Analytics
REST API / MCP	`goldenmatch serve` / `mcp-serve`	Microservices, AI assistants
ER Agent (A2A)	`goldenmatch agent-serve`	AI-to-AI discovery, autonomous ER

Documentation

Guide	Description
Installation	pip, apt, rpm, Docker, build from source
Quick Start	First dedupe in 30 seconds
Python API	101 exports: dedupe, match, score, explain, PPRL
CLI Reference	21 commands with examples
Interactive TUI	6-tab visual interface
Configuration	YAML config with matchkeys, blocking, golden rules
Pipeline	10-step pipeline architecture
Blocking Strategies	Static, learned, ANN blocking
Scoring	Fuzzy, exact, probabilistic, LLM scoring
Domain Packs	7 built-in YAML rulebooks
PPRL	Privacy-preserving record linkage
LLM Integration	LLM scorer, LLM clustering, budget tracking
Streaming & Incremental	Real-time matching, append-only mode
PostgreSQL Extension	18 SQL functions, pipeline schema
DuckDB Extension	12 Python UDFs
REST API	HTTP endpoints, review queue
MCP Server	Claude Desktop integration
Evaluation	Benchmarks, CI/CD quality gates
ER Agent	A2A + MCP autonomous agent, confidence gating
Architecture	Module map, code patterns
Benchmarks	Performance and accuracy numbers

Part of the Golden Suite

Package	What It Does
GoldenMatch	Entity resolution (this project)
GoldenCheck	Data validation that discovers rules
goldenmatch-extensions	SQL extensions for Postgres + DuckDB
goldenmatch-duckdb	DuckDB UDFs for entity resolution