Configuration

GoldenMatch uses YAML config files with Pydantic validation. Every section is optional; GoldenMatch auto-configures anything you leave out.


Full YAML reference

matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name_zip
    type: weighted
    threshold: 0.85
    rerank: true
    rerank_band: 0.1
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.2

  - name: probabilistic_fs
    type: probabilistic
    em_iterations: 20
    convergence_threshold: 0.001
    fields:
      - field: first_name
        scorer: jaro_winkler
        levels: 3
        partial_threshold: 0.8
      - field: last_name
        scorer: jaro_winkler
        levels: 2
      - field: zip
        scorer: exact
        levels: 2

  - name: semantic
    type: weighted
    threshold: 0.80
    fields:
      - columns: [title, authors, venue]
        scorer: record_embedding
        weight: 1.0
        column_weights: {title: 2.0, authors: 1.0, venue: 0.5}

blocking:
  strategy: adaptive
  auto_select: true
  auto_suggest: true
  max_block_size: 5000
  skip_oversized: false
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
  # Sorted neighborhood
  window_size: 20
  sort_key:
    - column: last_name
      transforms: [lowercase, soundex]
  # Multi-pass
  passes:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
  union_mode: true
  # ANN blocking
  ann_column: description
  ann_model: all-MiniLM-L6-v2
  ann_top_k: 20
  # Learned blocking
  learned_sample_size: 5000
  learned_min_recall: 0.95
  learned_min_reduction: 0.90
  learned_predicate_depth: 2
  learned_cache_path: .goldenmatch/learned_blocking.pkl
  # Canopy
  canopy:
    fields: [name, address]
    loose_threshold: 0.3
    tight_threshold: 0.7
    max_canopy_size: 500

golden_rules:
  default_strategy: most_complete
  max_cluster_size: 100
  field_rules:
    email:
      strategy: majority_vote
    first_name:
      strategy: source_priority
      source_priority: [crm, marketing]
    updated_at:
      strategy: most_recent
      date_column: updated_at

standardization:
  rules:
    email: [email]
    first_name: [name_proper, strip]
    last_name: [name_proper, strip]
    phone: [phone]
    zip: [zip5]
    address: [address, strip]
    state: [state]

validation:
  auto_fix: true
  rules:
    - column: email
      rule_type: regex
      params: {pattern: "^.+@.+\\..+$"}
      action: flag
    - column: zip
      rule_type: min_length
      params: {length: 5}
      action: "null"   # quoted: the string "null", not YAML's null literal
    - column: name
      rule_type: not_null
      action: quarantine

domain:
  enabled: true
  pack: electronics

llm_scorer:
  enabled: true
  provider: openai
  model: gpt-4o-mini
  auto_threshold: 0.95
  candidate_lo: 0.75
  candidate_hi: 0.95
  batch_size: 20
  mode: pairwise           # or "cluster" for in-context LLM clustering
  cluster_max_size: 100
  cluster_min_size: 5
  budget:
    max_cost_usd: 0.05
    max_calls: 100
    warn_at_pct: 80

memory:
  enabled: true                 # default: false. Off => no memory work, zero overhead.
  backend: sqlite               # sqlite | postgres
  path: .goldenmatch/memory.db  # sqlite path or postgres DSN
  reanchor: true                # default: true. Look up corrections by record_hash when row IDs miss.
  dataset: customers            # tag corrections so one DB can hold memory for many tables
  learning:
    threshold_min_corrections: 10   # learner runs once per matchkey at this floor
    weights_min_corrections: 50     # field-weight learner floor (stub in v1.6, returns null)

output:
  directory: ./output
  format: csv
  run_name: dedupe_run_001

backend: null              # null (Polars), "ray", or "duckdb"

Matchkeys

Three matchkey types:

| Type | Description | Required fields |
|---|---|---|
| exact | Binary match on transformed values | field, optional transforms |
| weighted | Weighted average of field scores | field, scorer, weight, threshold |
| probabilistic | Fellegi-Sunter log-likelihood ratios | field, scorer, optional levels |
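To make the weighted type concrete, here is an illustrative sketch of how a weighted matchkey combines per-field scores. The similarity function is a stand-in (difflib, not GoldenMatch's jaro_winkler), and the helper names are not GoldenMatch internals:

```python
from difflib import SequenceMatcher

def field_score(a: str, b: str) -> float:
    # Stand-in string similarity in [0, 1]; GoldenMatch would use the
    # configured scorer (e.g. jaro_winkler) here.
    return SequenceMatcher(None, a, b).ratio()

def weighted_score(rec_a: dict, rec_b: dict, fields: list) -> float:
    # Weighted average of per-field scores, mirroring a `type: weighted` matchkey.
    total_weight = sum(f["weight"] for f in fields)
    score = sum(f["weight"] * field_score(rec_a[f["field"]], rec_b[f["field"]])
                for f in fields)
    return score / total_weight

fields = [
    {"field": "first_name", "weight": 0.4},
    {"field": "last_name", "weight": 0.4},
    {"field": "zip", "weight": 0.2},
]
a = {"first_name": "jon", "last_name": "smith", "zip": "94105"}
b = {"first_name": "john", "last_name": "smith", "zip": "94105"}
s = weighted_score(a, b, fields)
print(round(s, 3))   # a pair matches when s >= threshold
```

With `threshold: 0.85`, this pair would match despite the first-name typo, because the exact last-name and zip scores pull the average up.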

Transforms

Applied to field values before scoring.

| Transform | Description |
|---|---|
| lowercase | Convert to lowercase |
| uppercase | Convert to uppercase |
| strip | Remove leading/trailing whitespace |
| strip_all | Remove all whitespace |
| soundex | Soundex phonetic encoding |
| metaphone | Metaphone phonetic encoding |
| digits_only | Keep only digits |
| alpha_only | Keep only letters |
| normalize_whitespace | Collapse multiple spaces |
| token_sort | Sort tokens alphabetically |
| first_token | First whitespace-delimited token |
| last_token | Last whitespace-delimited token |
| substring:start:end | Substring extraction |
| qgram:n | Q-gram tokenization |
| bloom_filter or bloom_filter:ngram:k:size | Bloom filter (for PPRL) |
| legal_form_strip | Strip corporate legal-form suffixes (Inc, LLC, Ltd, GmbH, S.A., …); bundled refdata |
| address_normalize | USPS Pub. 28 street-suffix and unit-abbreviation canonicalization (Avenue→AVE, Apartment→APT); bundled refdata |
| naics_normalize | NAICS 2022 industry-code canonicalization (code-or-title input → canonical code); bundled refdata |

Refdata transforms are auto-prepended by the controller when a column name matches the relevant pattern AND its profiled col_type agrees. See Reference Data.
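A transform chain like `transforms: [strip, soundex]` applies left to right before the scorer runs. The sketch below illustrates the idea with a simplified American Soundex; it is not GoldenMatch's implementation:

```python
def soundex(value: str, length: int = 4) -> str:
    # Simplified American Soundex: keep the first letter, encode the rest
    # as digits, drop adjacent duplicates, pad/truncate to 4 characters.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    s = "".join(c for c in value.lower() if c.isalpha())
    if not s:
        return ""
    encoded, last = s[0].upper(), codes.get(s[0], "")
    for c in s[1:]:
        code = codes.get(c, "")
        if code and code != last:
            encoded += code
        if c not in "hw":   # h/w do not separate duplicate codes
            last = code
    return (encoded + "000")[:length]

TRANSFORMS = {
    "lowercase": str.lower,
    "strip": str.strip,
    "soundex": soundex,
}

def apply_transforms(value: str, names: list) -> str:
    # Apply a transform chain left to right, as in `transforms: [strip, soundex]`.
    for name in names:
        value = TRANSFORMS[name](value)
    return value

print(apply_transforms("  Robert ", ["strip", "soundex"]))   # R163
```

"Robert" and "Rupert" both encode to R163, which is why soundex is useful as a blocking transform as well as a scorer.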

Scorers

| Scorer | Description | Best for |
|---|---|---|
| exact | Binary 0/1 match | Email, phone, ID |
| jaro_winkler | Edit distance with prefix bonus | Names |
| levenshtein | Normalized Levenshtein distance | General strings |
| token_sort | Order-invariant token matching | Names, addresses |
| soundex_match | Phonetic match | Names |
| ensemble | max(jaro_winkler, token_sort, soundex) | Names with reordering |
| embedding | Cosine similarity of embeddings | Semantic matching |
| record_embedding | Concatenated multi-field embeddings | Cross-field semantic |
| dice | Dice coefficient on bloom filters | PPRL |
| jaccard | Jaccard similarity on bloom filters | PPRL |
| name_freq_weighted_jw | Surname IDF-weighted Jaro-Winkler; bundled refdata | last_name / surname |
| given_name_aliased_jw | Alias-aware Jaro-Winkler; bundled refdata | first_name / given_name |
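To illustrate why an order-invariant scorer matters, here is a minimal token_sort-style sketch (difflib stand-in; GoldenMatch's token_sort scorer may normalize differently):

```python
from difflib import SequenceMatcher

def token_sort_score(a: str, b: str) -> float:
    # Sort whitespace tokens so "Smith, John" and "John Smith" compare equal,
    # then score the normalized strings with a plain sequence ratio.
    def norm(s: str) -> str:
        return " ".join(sorted(s.lower().replace(",", " ").split()))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

print(token_sort_score("Smith, John", "John Smith"))   # 1.0: same tokens
print(token_sort_score("Jon Smith", "John Smith"))     # high, but below 1.0
```

A plain edit-distance scorer would score the reordered pair poorly; token sorting removes the word-order penalty entirely.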

Cross-encoder reranking

Add rerank: true to a weighted matchkey to re-score borderline pairs with a cross-encoder model:

matchkeys:
  - name: fuzzy_name
    type: weighted
    threshold: 0.85
    rerank: true
    rerank_band: 0.1       # pairs within threshold +/- 0.1 get reranked
    rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
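Only borderline pairs are reranked: those whose base score lands within rerank_band of the threshold. A sketch of that selection logic (illustrative, not GoldenMatch's code):

```python
def split_for_rerank(scored_pairs, threshold=0.85, band=0.1):
    # Pairs with a base score in [threshold - band, threshold + band] go to
    # the cross-encoder; everything else keeps its base score.
    borderline, settled = [], []
    for pair, score in scored_pairs:
        if threshold - band <= score <= threshold + band:
            borderline.append((pair, score))
        else:
            settled.append((pair, score))
    return borderline, settled

scored = [(("r1", "r2"), 0.97), (("r1", "r3"), 0.86), (("r2", "r3"), 0.60)]
borderline, settled = split_for_rerank(scored)
print([p for p, _ in borderline])   # only the 0.86 pair is reranked
```

This keeps the expensive cross-encoder off the clear matches (0.97) and clear non-matches (0.60).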

Blocking

| Strategy | Description |
|---|---|
| static | Group by blocking key (default) |
| adaptive | Static + recursive sub-blocking for oversized blocks |
| sorted_neighborhood | Sliding window over sorted records |
| multi_pass | Union of blocks from multiple passes |
| ann | ANN via FAISS on embeddings |
| ann_pairs | Direct-pair ANN scoring (50–100x faster than ann) |
| canopy | TF-IDF canopy clustering |
| learned | Data-driven predicate selection |

Set auto_select: true to auto-pick the best blocking key by histogram analysis. Set auto_suggest: true to get blocking suggestions when no keys are specified.
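The point of any blocking strategy is to shrink the candidate-pair set. A self-contained sketch of static blocking on a `fields: [zip]` key, compared against the full cross product:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "zip": "94105"}, {"id": 2, "zip": "94105"},
    {"id": 3, "zip": "10001"}, {"id": 4, "zip": "10001"},
    {"id": 5, "zip": "60601"},
]

# Without blocking: every record is compared with every other record.
all_pairs = list(combinations([r["id"] for r in records], 2))

# With a `fields: [zip]` blocking key: compare only within each block.
blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r["id"])
blocked_pairs = [p for ids in blocks.values() for p in combinations(ids, 2)]

print(len(all_pairs), len(blocked_pairs))   # 10 vs 2 candidate pairs
```

The comparison count drops quadratically with block size, which is why max_block_size and adaptive sub-blocking matter at scale.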


Golden rules

Five merge strategies for building canonical records:

| Strategy | Description |
|---|---|
| most_complete | Pick value with fewest nulls |
| majority_vote | Most common value across cluster members |
| source_priority | Prefer values from specified sources (requires source_priority list) |
| most_recent | Latest value by date (requires date_column) |
| first_non_null | First non-null value encountered |

Set a default strategy and override per field:

golden_rules:
  default_strategy: most_complete
  field_rules:
    email: { strategy: majority_vote }
    name: { strategy: source_priority, source_priority: [crm, erp] }
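An illustrative sketch of how a default strategy plus a per-field override combine when building a golden record (simplified; not GoldenMatch's merge code):

```python
from collections import Counter

def most_complete(cluster: list) -> dict:
    # Start from the record with the fewest nulls.
    return min(cluster, key=lambda r: sum(v is None for v in r.values()))

def majority_vote(cluster: list, field: str):
    # Most common non-null value for a single field.
    values = [r[field] for r in cluster if r[field] is not None]
    return Counter(values).most_common(1)[0][0] if values else None

cluster = [
    {"email": "a@x.com", "name": "Ann", "phone": None},
    {"email": "a@x.com", "name": None, "phone": "555-0100"},
    {"email": "ann@x.com", "name": "Ann", "phone": "555-0100"},
]
golden = dict(most_complete(cluster))               # default_strategy
golden["email"] = majority_vote(cluster, "email")   # field_rules override
print(golden)
```

Here most_complete picks the third record as the base, but the email field_rule overrides it with the majority value a@x.com.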

Standardization

Map column names to standardizer functions:

standardization:
  rules:
    email: [email]
    phone: [phone]
    zip: [zip5]
    first_name: [name_proper, strip]
    address: [address, strip]
    state: [state]

| Standardizer | Description |
|---|---|
| email | Lowercase, strip, validate format |
| name_proper | Title case |
| name_upper | Uppercase |
| name_lower | Lowercase |
| phone | Strip non-digits, normalize format |
| zip5 | First 5 digits |
| address | Normalize abbreviations (St->Street, etc.) |
| state | Normalize state abbreviations |
| strip | Remove leading/trailing whitespace |
| trim_whitespace | Collapse multiple spaces |
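A minimal sketch of the zip5 and phone standardizers. The exact output format of the phone standardizer is an assumption (the table only says "normalize format"):

```python
import re

def zip5(value: str) -> str:
    # Keep the first five digits: "94105-1234" -> "94105".
    digits = re.sub(r"\D", "", value)
    return digits[:5]

def phone(value: str) -> str:
    # Strip non-digits; format 10-digit US numbers (assumed output format).
    digits = re.sub(r"\D", "", value)
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return digits

print(zip5("94105-1234"), phone("415.555.0100"))
```

Standardizing before matching means exact scorers on zip and phone see canonical values rather than formatting noise.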

Validation

validation:
  auto_fix: true
  rules:
    - column: email
      rule_type: regex
      params: { pattern: "^.+@.+\\..+$" }
      action: flag
    - column: name
      rule_type: not_null
      action: quarantine
    - column: zip
      rule_type: min_length
      params: { length: 5 }
      action: "null"

Rule types: regex, min_length, max_length, not_null, in_set, format. Actions: flag (mark but keep), "null" (set the value to null; quote it in YAML so it is read as the string "null" rather than YAML's null literal), quarantine (remove from matching).
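A sketch of how rules and actions interact (illustrative helper, not GoldenMatch's validation engine):

```python
import re

def apply_rule(rows: list, column: str, check, action: str) -> list:
    # flag: keep the row but mark it; null: blank the value;
    # quarantine: drop the row from matching entirely.
    kept = []
    for row in rows:
        if check(row.get(column)):
            kept.append(row)
        elif action == "flag":
            kept.append({**row, f"_{column}_flagged": True})
        elif action == "null":
            kept.append({**row, column: None})
        # action == "quarantine": row is not kept
    return kept

rows = [{"email": "ok@x.com", "name": "Ann"}, {"email": "bad", "name": None}]
rows = apply_rule(rows, "email",
                  lambda v: bool(v and re.match(r"^.+@.+\..+$", v)), "flag")
rows = apply_rule(rows, "name", lambda v: v is not None, "quarantine")
print(rows)
```

In this run the bad email is flagged but kept, and the null-name row is then quarantined, so only the clean row reaches matching.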


Settings persistence

Settings tuned in the TUI can be saved to the project file. Next run picks them up automatically.


Programmatic config

import goldenmatch as gm

config = gm.GoldenMatchConfig(
    matchkeys=[
        gm.MatchkeyConfig(name="exact_email", type="exact",
            fields=[gm.MatchkeyField(field="email", transforms=["lowercase"])]),
        gm.MatchkeyConfig(name="fuzzy_name", type="weighted", threshold=0.85,
            fields=[
                gm.MatchkeyField(field="name", scorer="jaro_winkler", weight=0.7),
                gm.MatchkeyField(field="zip", scorer="exact", weight=0.3),
            ]),
    ],
    blocking=gm.BlockingConfig(strategy="learned"),
    llm_scorer=gm.LLMScorerConfig(enabled=True, mode="cluster"),
    backend="ray",
)

result = gm.dedupe("data.csv", config=config)

Or auto-generate from data:

config = gm.auto_configure([("data.csv", "source")])

Verification (v1.5.0)

auto_configure_df runs preflight at the end of config generation: six checks that auto-repair missing domain-extracted columns, drop exact matchkeys with useless cardinality, flag oversized blocks, demote remote-asset scorers, and cap low-confidence weights. Unrepairable issues raise ConfigValidationError, with the full report attached to the exception as err.report.

The pipeline runs postflight after scoring and before clustering: four signals (score histogram and bimodality, blocking recall, cluster sizes and bottleneck pairs, threshold-band overlap). On clear bimodality it can auto-nudge the threshold, and it attaches the report as DedupeResult.postflight_report / MatchResult.postflight_report.

Two new kwargs on auto_configure_df:

import goldenmatch as gm

# Offline-safe (default): remote-asset scorers demoted, postflight may adjust threshold
cfg = gm.auto_configure_df(df)

# Opt in to cross-encoder rerank / embedding scorers
cfg = gm.auto_configure_df(df, allow_remote_assets=True)

# Strict: compute signals + advisories, but suppress auto-adjustments (DQBench, regression)
cfg = gm.auto_configure_df(df, strict=True)

The preflight report is available on the returned config (underscore is private-by-convention but stable across v1.5.x):

cfg = gm.auto_configure_df(df)
for finding in cfg._preflight_report.findings:
    print(f"[{finding.severity}] {finding.check}: {finding.message}")

See the Verification section in the Python API docs for the full preflight / postflight signatures and the PostflightSignals schema.


Learning Memory (v1.6.0)

The optional memory: section enables persistent corrections. Once a steward, agent, or LLM decides a pair, that decision is stored, re-anchored across row reorders by record_hash, and applied automatically on every subsequent dedupe_df / match_df call. After 10+ corrections accumulate against a matchkey, the learner adjusts that matchkey’s threshold for the next run. Off by default; enable via the YAML block above or config.memory.enabled = True.

| Field | Default | Notes |
|---|---|---|
| enabled | false | Zero-config preserved. Enabling does not change pipeline output until corrections exist. |
| backend | "sqlite" | "postgres" requires pip install goldenmatch[postgres]. |
| path | ".goldenmatch/memory.db" | SQLite file or full DSN for postgres. |
| reanchor | true | Re-anchor by record_hash when row IDs miss; ambiguous re-anchors report stale_ambiguous. |
| dataset | null | Tag corrections; isolates per-table memory in shared DBs. |
| learning.threshold_min_corrections | 10 | Trust-weighted grid search runs once a matchkey crosses this floor. |
| learning.weights_min_corrections | 50 | Field-weight learning is stubbed in v1.6.0 and returns null. |
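The hash scheme behind record_hash is GoldenMatch-internal; the sketch below only illustrates the idea of content-based re-anchoring: hashing normalized field values so a stored correction still applies when the row ID changes between runs. The normalization shown is an assumption:

```python
import hashlib

def record_hash(record: dict, fields: list) -> str:
    # Hash normalized field values so the same record is found even when
    # its row ID changes between runs (normalization is illustrative).
    payload = "|".join((record.get(f) or "").strip().lower() for f in fields)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

fields = ["email", "first_name", "last_name"]
run1 = {"row_id": 7, "email": "Ann@x.com ", "first_name": "Ann", "last_name": "Lee"}
run2 = {"row_id": 42, "email": "ann@x.com", "first_name": "Ann", "last_name": "Lee"}

# Same content hash despite different row IDs, so a stored correction re-applies.
print(record_hash(run1, fields) == record_hash(run2, fields))   # True
```

When two distinct records produce the same hash, a re-anchor is ambiguous, which is what the stale_ambiguous status in the table above reports.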

Full guide: Learning Memory.