Configuration
GoldenMatch uses YAML config files with Pydantic validation. Every section is optional – GoldenMatch auto-configures what you leave out.
Full YAML reference
matchkeys:
- name: exact_email
type: exact
fields:
- field: email
transforms: [lowercase, strip]
- name: fuzzy_name_zip
type: weighted
threshold: 0.85
rerank: true
rerank_band: 0.1
fields:
- field: first_name
scorer: jaro_winkler
weight: 0.4
transforms: [lowercase, strip]
- field: last_name
scorer: jaro_winkler
weight: 0.4
transforms: [lowercase, strip]
- field: zip
scorer: exact
weight: 0.2
- name: probabilistic_fs
type: probabilistic
em_iterations: 20
convergence_threshold: 0.001
fields:
- field: first_name
scorer: jaro_winkler
levels: 3
partial_threshold: 0.8
- field: last_name
scorer: jaro_winkler
levels: 2
- field: zip
scorer: exact
levels: 2
- name: semantic
type: weighted
threshold: 0.80
fields:
- columns: [title, authors, venue]
scorer: record_embedding
weight: 1.0
column_weights: {title: 2.0, authors: 1.0, venue: 0.5}
blocking:
strategy: adaptive
auto_select: true
auto_suggest: true
max_block_size: 5000
skip_oversized: false
keys:
- fields: [zip]
- fields: [last_name]
transforms: [lowercase, soundex]
# Sorted neighborhood
window_size: 20
sort_key:
- column: last_name
transforms: [lowercase, soundex]
# Multi-pass
passes:
- fields: [zip]
- fields: [last_name]
transforms: [lowercase, soundex]
union_mode: true
# ANN blocking
ann_column: description
ann_model: all-MiniLM-L6-v2
ann_top_k: 20
# Learned blocking
learned_sample_size: 5000
learned_min_recall: 0.95
learned_min_reduction: 0.90
learned_predicate_depth: 2
learned_cache_path: .goldenmatch/learned_blocking.pkl
# Canopy
canopy:
fields: [name, address]
loose_threshold: 0.3
tight_threshold: 0.7
max_canopy_size: 500
golden_rules:
default_strategy: most_complete
max_cluster_size: 100
field_rules:
email:
strategy: majority_vote
first_name:
strategy: source_priority
source_priority: [crm, marketing]
updated_at:
strategy: most_recent
date_column: updated_at
standardization:
rules:
email: [email]
first_name: [name_proper, strip]
last_name: [name_proper, strip]
phone: [phone]
zip: [zip5]
address: [address, strip]
state: [state]
validation:
auto_fix: true
rules:
- column: email
rule_type: regex
params: {pattern: "^.+@.+\\..+$"}
action: flag
- column: zip
rule_type: min_length
params: {length: 5}
action: null
- column: name
rule_type: not_null
action: quarantine
domain:
enabled: true
pack: electronics
llm_scorer:
enabled: true
provider: openai
model: gpt-4o-mini
auto_threshold: 0.95
candidate_lo: 0.75
candidate_hi: 0.95
batch_size: 20
mode: pairwise # or "cluster" for in-context LLM clustering
cluster_max_size: 100
cluster_min_size: 5
budget:
max_cost_usd: 0.05
max_calls: 100
warn_at_pct: 80
output:
directory: ./output
format: csv
run_name: dedupe_run_001
backend: null # null (Polars), "ray", or "duckdb"
Matchkeys
Three matchkey types:
| Type | Description | Required Fields |
|---|---|---|
exact | Binary match on transformed values | field, optional transforms |
weighted | Weighted average of field scores | field, scorer, weight, threshold |
probabilistic | Fellegi-Sunter log-likelihood ratios | field, scorer, optional levels |
Transforms
Applied to field values before scoring.
| Transform | Description |
|---|---|
lowercase | Convert to lowercase |
uppercase | Convert to uppercase |
strip | Remove leading/trailing whitespace |
strip_all | Remove all whitespace |
soundex | Soundex phonetic encoding |
metaphone | Metaphone phonetic encoding |
digits_only | Keep only digits |
alpha_only | Keep only letters |
normalize_whitespace | Collapse multiple spaces |
token_sort | Sort tokens alphabetically |
first_token | First whitespace-delimited token |
last_token | Last whitespace-delimited token |
substring:start:end | Substring extraction |
qgram:n | Q-gram tokenization |
bloom_filter or bloom_filter:ngram:k:size | Bloom filter (for PPRL) |
Scorers
| Scorer | Description | Best For |
|---|---|---|
exact | Binary 0/1 match | Email, phone, ID |
jaro_winkler | Edit distance with prefix bonus | Names |
levenshtein | Normalized Levenshtein distance | General strings |
token_sort | Order-invariant token matching | Names, addresses |
soundex_match | Phonetic match | Names |
ensemble | max(jaro_winkler, token_sort, soundex) | Names with reordering |
embedding | Cosine similarity of embeddings | Semantic matching |
record_embedding | Concatenated multi-field embeddings | Cross-field semantic |
dice | Dice coefficient on bloom filters | PPRL |
jaccard | Jaccard similarity on bloom filters | PPRL |
Cross-encoder reranking
Add rerank: true to a weighted matchkey to re-score borderline pairs with a cross-encoder model:
matchkeys:
- name: fuzzy_name
type: weighted
threshold: 0.85
rerank: true
rerank_band: 0.1 # pairs within threshold +/- 0.1 get reranked
rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
Blocking
| Strategy | Description |
|---|---|
static | Group by blocking key (default) |
adaptive | Static + recursive sub-blocking for oversized blocks |
sorted_neighborhood | Sliding window over sorted records |
multi_pass | Union of blocks from multiple passes |
ann | ANN via FAISS on embeddings |
ann_pairs | Direct-pair ANN scoring (50–100x faster than ann) |
canopy | TF-IDF canopy clustering |
learned | Data-driven predicate selection |
Set auto_select: true to auto-pick the best blocking key by histogram analysis. Set auto_suggest: true to get blocking suggestions when no keys are specified.
Golden rules
Five merge strategies for building canonical records:
| Strategy | Description |
|---|---|
most_complete | Pick value with fewest nulls |
majority_vote | Most common value across cluster members |
source_priority | Prefer values from specified sources (requires source_priority list) |
most_recent | Latest value by date (requires date_column) |
first_non_null | First non-null value encountered |
Set a default strategy and override per field:
golden_rules:
default_strategy: most_complete
field_rules:
email: { strategy: majority_vote }
name: { strategy: source_priority, source_priority: [crm, erp] }
Standardization
Map column names to standardizer functions:
standardization:
rules:
email: [email]
phone: [phone]
zip: [zip5]
first_name: [name_proper, strip]
address: [address, strip]
state: [state]
| Standardizer | Description |
|---|---|
email | Lowercase, strip, validate format |
name_proper | Title case |
name_upper | Uppercase |
name_lower | Lowercase |
phone | Strip non-digits, normalize format |
zip5 | First 5 digits |
address | Normalize abbreviations (St->Street, etc.) |
state | Normalize state abbreviations |
strip | Remove leading/trailing whitespace |
trim_whitespace | Collapse multiple spaces |
Validation
validation:
auto_fix: true
rules:
- column: email
rule_type: regex
params: { pattern: "^.+@.+\\..+$" }
action: flag
- column: name
rule_type: not_null
action: quarantine
- column: zip
rule_type: min_length
params: { length: 5 }
action: null
Rule types: regex, min_length, max_length, not_null, in_set, format. Actions: flag (mark but keep), null (set to null), quarantine (remove from matching).
Settings persistence
- Global:
~/.goldenmatch/settings.yaml– output mode, default model, API keys - Project:
.goldenmatch.yaml– column mappings, thresholds, blocking config
Settings tuned in the TUI can be saved to the project file. Next run picks them up automatically.
Programmatic config
import goldenmatch as gm
config = gm.GoldenMatchConfig(
matchkeys=[
gm.MatchkeyConfig(name="exact_email", type="exact",
fields=[gm.MatchkeyField(field="email", transforms=["lowercase"])]),
gm.MatchkeyConfig(name="fuzzy_name", type="weighted", threshold=0.85,
fields=[
gm.MatchkeyField(field="name", scorer="jaro_winkler", weight=0.7),
gm.MatchkeyField(field="zip", scorer="exact", weight=0.3),
]),
],
blocking=gm.BlockingConfig(strategy="learned"),
llm_scorer=gm.LLMScorerConfig(enabled=True, mode="cluster"),
backend="ray",
)
result = gm.dedupe("data.csv", config=config)
Or auto-generate from data:
config = gm.auto_configure([("data.csv", "source")])