Quick Start

Go from raw CSV to deduplicated golden records in under a minute.


Deduplicate a CSV (zero-config)

goldenmatch dedupe customers.csv

GoldenMatch auto-detects column types (name, email, phone, zip, address), assigns appropriate scorers, picks a blocking strategy, and launches the TUI for review.


Deduplicate with Python

import goldenmatch as gm

# Zero-config: auto-detects everything
result = gm.dedupe("customers.csv")

# Exact + fuzzy matching
result = gm.dedupe("customers.csv", exact=["email"], fuzzy={"name": 0.85, "zip": 0.95})
result.golden.write_csv("deduped.csv")
print(result)  # DedupeResult(records=5000, clusters=847, match_rate=12.0%)

Deduplicate a DataFrame

import goldenmatch as gm
import polars as pl

df = pl.read_csv("customers.csv")
result = gm.dedupe_df(df, exact=["email"], fuzzy={"name": 0.85})
result.golden  # Polars DataFrame of canonical records

Match two files

import goldenmatch as gm

result = gm.match("new_customers.csv", "master.csv", fuzzy={"name": 0.85})
result.matched.write_csv("matches.csv")
print(result)  # MatchResult(matched=412, unmatched=88)

CLI equivalent:

goldenmatch match new_customers.csv --against master.csv --config config.yaml

Score two strings

import goldenmatch as gm

score = gm.score_strings("John Smith", "Jon Smyth", "jaro_winkler")
print(score)  # 0.884

Use a YAML config

# config.yaml
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.5
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.3
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.2

blocking:
  strategy: adaptive
  keys:
    - fields: [zip]

golden_rules:
  default_strategy: most_complete
import goldenmatch as gm

result = gm.dedupe("customers.csv", config="config.yaml")

Privacy-preserving linkage (PPRL)

Match across organizations without sharing raw data:

import goldenmatch as gm

# Auto-configured: picks fields and threshold from your data
result = gm.pprl_link("hospital_a.csv", "hospital_b.csv")
print(f"Found {result['match_count']} matches")

# Manual field selection
result = gm.pprl_link(
    "party_a.csv", "party_b.csv",
    fields=["first_name", "last_name", "dob", "zip"],
    threshold=0.85,
    security_level="high",
)

CLI equivalent:

goldenmatch pprl link party_a.csv party_b.csv --security-level high

LLM scoring for hard datasets

For product matching or other domains where fuzzy matching alone falls short:

import goldenmatch as gm

result = gm.dedupe("products.csv", fuzzy={"title": 0.80}, llm_scorer=True)

The LLM scorer sends borderline pairs (score 0.75–0.95) to GPT-4o-mini and auto-accepts pairs above 0.95. Budget cap defaults to $0.05.


Evaluate accuracy

import goldenmatch as gm

metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}")
# CI/CD quality gate: fail if F1 drops below 90%
goldenmatch evaluate data.csv --config config.yaml --gt gt.csv --min-f1 0.90

Next steps

Topic Link
Full Python API (101 exports) Python API
All 21 CLI commands CLI Reference
Interactive TUI walkthrough TUI
Complete YAML config reference Configuration