Entity resolution that finds duplicates in your data so you don't have to define the rules yourself.
Now part of the Golden Suite monorepo. This site documents GoldenMatch (entity resolution). The same repository hosts GoldenCheck (data quality), GoldenFlow (transforms), GoldenPipe (orchestrator), InferMap (schema mapping), the Rust SQL extensions, the dbt package, and the GitHub Action, plus a master MCP server (`goldensuite-mcp`), seven `ghcr.io` container images, and 12 drop-in Airflow DAGs. See the repository README and `examples/` for the suite-wide picture.
GoldenMatch takes messy records and figures out which ones refer to the same entity, without requiring you to hand-write matching rules.
INGEST → STANDARDIZE → BLOCK → SCORE → CLUSTER → GOLDEN RECORD
| Step | What Happens |
|---|---|
| Ingest | Load CSV, Excel, Parquet, or a DataFrame |
| Standardize | Normalize casing, whitespace, phonetic encoding |
| Block | Group candidates to avoid N^2 comparisons |
| Score | Fuzzy match (Jaro-Winkler, Levenshtein, token sort) |
| Cluster | Union-Find with confidence scoring |
| Golden | Merge clusters into canonical records |
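
The block, score, and cluster steps in the table can be illustrated with a small standard-library sketch. This is not GoldenMatch's implementation: the records, the `zip` blocking key, and the helper names are hypothetical, and `difflib.SequenceMatcher` stands in for the Jaro-Winkler and Levenshtein scorers, which Python's standard library does not provide.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; the field names are illustrative only.
records = [
    {"id": 0, "name": "Jon Smith", "zip": "10001"},
    {"id": 1, "name": "John Smith", "zip": "10001"},
    {"id": 2, "name": "JOHN  SMITH ", "zip": "10001"},
    {"id": 3, "name": "Mary Jones", "zip": "94105"},
    {"id": 4, "name": "Marie Jones", "zip": "94105"},
    {"id": 5, "name": "Alan Turing", "zip": "94105"},
]


def standardize(name: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (difflib ratio, not Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


class UnionFind:
    """Minimal disjoint-set structure with path halving."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def cluster(records, threshold=0.85):
    """Block on zip, score name pairs within each block, merge matches."""
    # Block: only records sharing a zip code are ever compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec["zip"]].append(idx)

    uf = UnionFind(len(records))
    compared = 0
    for members in blocks.values():
        for i, j in combinations(members, 2):
            compared += 1
            a = standardize(records[i]["name"])
            b = standardize(records[j]["name"])
            if similarity(a, b) >= threshold:
                uf.union(i, j)  # Cluster: matched pairs share a root.

    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[uf.find(idx)].append(rec["id"])
    return sorted(groups.values()), compared


if __name__ == "__main__":
    clusters, compared = cluster(records)
    all_pairs = len(records) * (len(records) - 1) // 2
    print(f"compared {compared} of {all_pairs} possible pairs")
    print(clusters)
```

With six records split into two blocks of three, only 6 of the 15 possible pairs are scored. That saving is what blocking buys, and it grows quickly as the dataset gets larger.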
```bash
pip install goldenmatch
```

```python
import goldenmatch as gm

result = gm.dedupe("customers.csv", exact=["email"], fuzzy={"name": 0.85})
print(f"{result.total_clusters} clusters, {result.match_rate:.0%} match rate")
result.golden.write_csv("golden_records.csv")
```
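
If you don't have a `customers.csv` on hand, a throwaway file is enough to try the call above. The sketch below uses only the standard library. The `name` and `email` columns are an assumption inferred from the `exact` and `fuzzy` arguments, and the rows and the `write_sample` helper are made up for illustration.

```python
import csv

# Hypothetical sample rows; "name" and "email" are assumed column names,
# taken from the keyword arguments in the quick-start call.
rows = [
    {"name": "John Smith", "email": "john.smith@example.com"},
    {"name": "Jon Smith", "email": "john.smith@example.com"},
    {"name": "Mary Jones", "email": "mary.jones@example.com"},
    {"name": "Marie Jones", "email": "m.jones@example.com"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]


def write_sample(path="customers.csv"):
    """Write the sample rows to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email"])
        writer.writeheader()
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    print(f"wrote {write_sample()}")
```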
| Dataset | Records | Method | F1 | Time |
|---|---|---|---|---|
| DBLP-ACM (academic) | 4,910 | Fuzzy matching | 97.2% | 2.1s |
| Abt-Buy (electronics) | 2,162 | Domain + LLM | 72.2% | 4.2s |
| FEBRL4 (PPRL) | 10,000 | Auto-config bloom filters | 92.4% | 14s |
| Synthetic | 100K | Fuzzy (name+zip) | - | 12.8s |
| Synthetic | 1M | Exact dedupe | - | 7.8s |
Scale: 7,823 records/sec on a laptop (fuzzy + exact + golden).
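
The benchmark table does not say how F1 is computed. One common convention in entity resolution is pairwise F1, where every pair of records placed in the same cluster counts as a predicted match. The sketch below assumes that convention; the function names and example clusters are hypothetical.

```python
from itertools import combinations


def pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    out = set()
    for members in clusters:
        out.update(frozenset(p) for p in combinations(sorted(members), 2))
    return out


def pairwise_f1(predicted, truth):
    """Return (precision, recall, f1) over same-cluster record pairs."""
    pred, gold = pairs(predicted), pairs(truth)
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    truth = [[1, 2, 3], [4, 5]]          # 4 true matching pairs
    predicted = [[1, 2], [3], [4, 5]]    # 2 predicted pairs, both correct
    p, r, f1 = pairwise_f1(predicted, truth)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this example the prediction under-merges: every predicted pair is correct (precision 1.0), but only two of the four true pairs are found (recall 0.5), so F1 is 2/3.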
| Interface | Install | Best For |
|---|---|---|
| Python API | `pip install goldenmatch` | Notebooks, scripts, AI agents |
| TypeScript / Node.js | `npm install goldenmatch` | Edge runtimes, web apps, Node services |
| CLI | Same package, 21 commands | Terminal workflows |
| Interactive TUI | `goldenmatch tui` | Visual exploration |
| PostgreSQL | Pre-built .deb/.rpm | Production databases |
| DuckDB | `pip install goldenmatch-duckdb` | Analytics |
| REST API / MCP | `goldenmatch serve` / `mcp-serve` | Microservices, AI assistants |
| ER Agent (A2A) | `goldenmatch agent-serve` | AI-to-AI discovery, autonomous ER |
| Guide | Description |
|---|---|
| Installation | pip, apt, rpm, Docker, build from source |
| Quick Start | First dedupe in 30 seconds |
| Python API | 101 exports: dedupe, match, score, explain, PPRL |
| TypeScript API | npm package with edge-safe core and Node entrypoint |
| CLI Reference | 23 commands with examples |
| Interactive TUI | 6-tab visual interface |
| Configuration | YAML config with matchkeys, blocking, golden rules |
| Pipeline | 10-step pipeline architecture |
| Blocking Strategies | Static, learned, ANN blocking |
| Scoring | Fuzzy, exact, probabilistic, LLM scoring |
| Domain Packs | 7 built-in YAML rulebooks |
| PPRL | Privacy-preserving record linkage |
| LLM Integration | LLM scorer, LLM clustering, budget tracking |
| Learning Memory | Persistent corrections + threshold learning (v1.6.0) |
| Streaming & Incremental | Real-time matching, append-only mode |
| PostgreSQL Extension | 18 SQL functions, pipeline schema |
| DuckDB Extension | 12 Python UDFs |
| REST API | HTTP endpoints, review queue |
| MCP Server | Claude Desktop integration |
| Evaluation | Benchmarks, CI/CD quality gates, cluster comparison |
| ER Agent | A2A + MCP autonomous agent, confidence gating |
| Architecture | Module map, code patterns |
| Benchmarks | Performance and accuracy numbers |
| Package | What It Does |
|---|---|
| GoldenMatch | Entity resolution (this project) |
| GoldenCheck | Data validation that discovers rules |
| goldenmatch-extensions | SQL extensions for Postgres + DuckDB |
| goldenmatch-duckdb | DuckDB UDFs for entity resolution |