GoldenMatch uses YAML config files with Pydantic validation. Every section is optional – GoldenMatch auto-configures what you leave out.
```yaml
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]
  - name: fuzzy_name_zip
    type: weighted
    threshold: 0.85
    rerank: true
    rerank_band: 0.1
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.4
        transforms: [lowercase, strip]
      - field: zip
        scorer: exact
        weight: 0.2
  - name: probabilistic_fs
    type: probabilistic
    em_iterations: 20
    convergence_threshold: 0.001
    fields:
      - field: first_name
        scorer: jaro_winkler
        levels: 3
        partial_threshold: 0.8
      - field: last_name
        scorer: jaro_winkler
        levels: 2
      - field: zip
        scorer: exact
        levels: 2
  - name: semantic
    type: weighted
    threshold: 0.80
    fields:
      - columns: [title, authors, venue]
        scorer: record_embedding
        weight: 1.0
        column_weights: {title: 2.0, authors: 1.0, venue: 0.5}

blocking:
  strategy: adaptive
  auto_select: true
  auto_suggest: true
  max_block_size: 5000
  skip_oversized: false
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
  # Sorted neighborhood
  window_size: 20
  sort_key:
    - column: last_name
      transforms: [lowercase, soundex]
  # Multi-pass
  passes:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
  union_mode: true
  # ANN blocking
  ann_column: description
  ann_model: all-MiniLM-L6-v2
  ann_top_k: 20
  # Learned blocking
  learned_sample_size: 5000
  learned_min_recall: 0.95
  learned_min_reduction: 0.90
  learned_predicate_depth: 2
  learned_cache_path: .goldenmatch/learned_blocking.pkl
  # Canopy
  canopy:
    fields: [name, address]
    loose_threshold: 0.3
    tight_threshold: 0.7
    max_canopy_size: 500

golden_rules:
  default_strategy: most_complete
  max_cluster_size: 100
  field_rules:
    email:
      strategy: majority_vote
    first_name:
      strategy: source_priority
      source_priority: [crm, marketing]
    updated_at:
      strategy: most_recent
      date_column: updated_at

standardization:
  rules:
    email: [email]
    first_name: [name_proper, strip]
    last_name: [name_proper, strip]
    phone: [phone]
    zip: [zip5]
    address: [address, strip]
    state: [state]

validation:
  auto_fix: true
  rules:
    - column: email
      rule_type: regex
      params: {pattern: "^.+@.+\\..+$"}
      action: flag
    - column: zip
      rule_type: min_length
      params: {length: 5}
      action: null
    - column: name
      rule_type: not_null
      action: quarantine

domain:
  enabled: true
  pack: electronics

llm_scorer:
  enabled: true
  provider: openai
  model: gpt-4o-mini
  auto_threshold: 0.95
  candidate_lo: 0.75
  candidate_hi: 0.95
  batch_size: 20
  mode: pairwise  # or "cluster" for in-context LLM clustering
  cluster_max_size: 100
  cluster_min_size: 5
  budget:
    max_cost_usd: 0.05
    max_calls: 100
    warn_at_pct: 80

memory:
  enabled: true                  # default: false. Off => no memory work, zero overhead.
  backend: sqlite                # sqlite | postgres
  path: .goldenmatch/memory.db   # sqlite path or postgres DSN
  reanchor: true                 # default: true. Look up corrections by record_hash when row IDs miss.
  dataset: customers             # tag corrections so one DB can hold memory for many tables
  learning:
    threshold_min_corrections: 10  # learner runs once per matchkey at this floor
    weights_min_corrections: 50    # field-weight learner floor (stub in v1.6, returns null)

output:
  directory: ./output
  format: csv
  run_name: dedupe_run_001

backend: null  # null (Polars), "ray", or "duckdb"
```
Three matchkey types:

| Type | Description | Required Fields |
|---|---|---|
| exact | Binary match on transformed values | field, optional transforms |
| weighted | Weighted average of field scores | field, scorer, weight, threshold |
| probabilistic | Fellegi-Sunter log-likelihood ratios | field, scorer, optional levels |
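For intuition, the probabilistic type's Fellegi-Sunter weights come down to log-likelihood ratios per field. A minimal sketch of the underlying math (not GoldenMatch's API; the m/u parameters here are illustrative):

```python
import math

def fs_weight(agrees: bool, m: float, u: float) -> float:
    """Log2 likelihood-ratio weight for one field comparison.

    m: P(field agrees | records truly match)
    u: P(field agrees | records truly differ)
    """
    if agrees:
        return math.log2(m / u)            # agreement: positive evidence
    return math.log2((1 - m) / (1 - u))    # disagreement: negative evidence

# Illustrative last_name parameters: m=0.9, u=0.01
# agreement adds ~6.49, disagreement subtracts ~3.31;
# field weights sum into a total match score per pair
```

EM (`em_iterations`, `convergence_threshold` above) estimates the m/u probabilities from the data rather than requiring them up front.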
Transforms are applied to field values before scoring.
| Transform | Description |
|---|---|
| lowercase | Convert to lowercase |
| uppercase | Convert to uppercase |
| strip | Remove leading/trailing whitespace |
| strip_all | Remove all whitespace |
| soundex | Soundex phonetic encoding |
| metaphone | Metaphone phonetic encoding |
| digits_only | Keep only digits |
| alpha_only | Keep only letters |
| normalize_whitespace | Collapse multiple spaces |
| token_sort | Sort tokens alphabetically |
| first_token | First whitespace-delimited token |
| last_token | Last whitespace-delimited token |
| substring:start:end | Substring extraction |
| qgram:n | Q-gram tokenization |
| bloom_filter or bloom_filter:ngram:k:size | Bloom filter (for PPRL) |
| legal_form_strip | Strip corporate legal-form suffixes (Inc, LLC, Ltd, GmbH, S.A., …) — bundled refdata |
| address_normalize | USPS Pub. 28 street-suffix + unit-abbrev canonicalization (Avenue→AVE, Apartment→APT) — bundled refdata |
| naics_normalize | NAICS 2022 industry-code canonicalization (code-or-title input → canonical code) — bundled refdata |
Refdata transforms are auto-prepended by the controller when a column name matches the relevant pattern AND its profiled col_type agrees. See Reference Data.
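A few of the transforms above are easy to pin down with plain-Python sketches (illustrative re-implementations, not GoldenMatch internals):

```python
def token_sort(s: str) -> str:
    # Sort whitespace tokens so "smith john" matches "john smith"
    return " ".join(sorted(s.split()))

def qgram(s: str, n: int = 2) -> list[str]:
    # Overlapping character n-grams
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def soundex(s: str) -> str:
    # Classic 4-character Soundex: first letter + up to 3 digits
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    s = "".join(c for c in s.lower() if c.isalpha())
    if not s:
        return ""
    out, prev = s[0].upper(), codes.get(s[0], "")
    for c in s[1:]:
        d = codes.get(c, "")
        if d and d != prev:
            out += d
        if c not in "hw":   # h/w do not separate same-code consonants
            prev = d
    return (out + "000")[:4]

# token_sort("Smith John".lower()) -> "john smith"
# soundex("Robert") -> "R163", soundex("Rupert") -> "R163"
```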
| Scorer | Description | Best For |
|---|---|---|
| exact | Binary 0/1 match | Email, phone, ID |
| jaro_winkler | Edit distance with prefix bonus | Names |
| levenshtein | Normalized Levenshtein distance | General strings |
| token_sort | Order-invariant token matching | Names, addresses |
| soundex_match | Phonetic match | Names |
| ensemble | max(jaro_winkler, token_sort, soundex) | Names with reordering |
| embedding | Cosine similarity of embeddings | Semantic matching |
| record_embedding | Concatenated multi-field embeddings | Cross-field semantic |
| dice | Dice coefficient on bloom filters | PPRL |
| jaccard | Jaccard similarity on bloom filters | PPRL |
| name_freq_weighted_jw | Surname IDF-weighted Jaro-Winkler — bundled refdata | last_name / surname |
| given_name_aliased_jw | Alias-aware Jaro-Winkler — bundled refdata | first_name / given_name |
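The two PPRL scorers reduce to standard set similarities over the positions of set bits in the Bloom filters produced by the bloom_filter transform. A minimal sketch (illustrative, not GoldenMatch's implementation):

```python
def dice(a: set, b: set) -> float:
    # Dice coefficient over set-bit positions of two Bloom filters
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

bits_a = {1, 3, 5, 8}   # set-bit positions of filter A
bits_b = {1, 3, 7, 8}   # set-bit positions of filter B
# dice(bits_a, bits_b) == 0.75, jaccard(bits_a, bits_b) == 0.6
```

Dice is the common choice in PPRL because it is less sensitive than Jaccard to the filter's overall bit density.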
Add rerank: true to a weighted matchkey to re-score borderline pairs with a cross-encoder model:
```yaml
matchkeys:
  - name: fuzzy_name
    type: weighted
    threshold: 0.85
    rerank: true
    rerank_band: 0.1  # pairs within threshold +/- 0.1 get reranked
    rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
```
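The band logic is simple: only pairs whose first-pass score is near the threshold pay for the cross-encoder. A sketch of the selection (hypothetical pair IDs and scores):

```python
def in_rerank_band(score: float, threshold: float, band: float) -> bool:
    # Clear accepts/rejects skip the expensive model; only borderline pairs are re-scored
    return threshold - band <= score <= threshold + band

scores = {("a1", "b1"): 0.99, ("a2", "b2"): 0.80, ("a3", "b3"): 0.60}
borderline = [pair for pair, s in scores.items()
              if in_rerank_band(s, threshold=0.85, band=0.1)]
# only ("a2", "b2") lands in [0.75, 0.95]
```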
| Strategy | Description |
|---|---|
| static | Group by blocking key (default) |
| adaptive | Static + recursive sub-blocking for oversized blocks |
| sorted_neighborhood | Sliding window over sorted records |
| multi_pass | Union of blocks from multiple passes |
| ann | ANN via FAISS on embeddings |
| ann_pairs | Direct-pair ANN scoring (50–100x faster than ann) |
| canopy | TF-IDF canopy clustering |
| learned | Data-driven predicate selection |
Set auto_select: true to auto-pick the best blocking key by histogram analysis. Set auto_suggest: true to get blocking suggestions when no keys are specified.
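The point of any blocking strategy is pair reduction: score only within blocks instead of all n*(n-1)/2 pairs. A minimal sketch of static blocking with toy records (illustrative, not GoldenMatch internals):

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key):
    """Yield candidate index pairs only within blocks sharing a key value."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[key(rec)].append(i)
    for ids in blocks.values():
        yield from combinations(ids, 2)

records = [
    {"last_name": "Smith", "zip": "10001"},
    {"last_name": "Smyth", "zip": "10001"},
    {"last_name": "Jones", "zip": "94105"},
]
pairs = list(block_pairs(records, key=lambda r: r["zip"]))
# zip blocking keeps (0, 1) and never scores the cross-block pairs with record 2
```

The adaptive strategy adds one refinement: any block larger than max_block_size is recursively split on a secondary key instead of being scored (or skipped, per skip_oversized).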
Five merge strategies for building canonical records:
| Strategy | Description |
|---|---|
| most_complete | Pick value with fewest nulls |
| majority_vote | Most common value across cluster members |
| source_priority | Prefer values from specified sources (requires source_priority list) |
| most_recent | Latest value by date (requires date_column) |
| first_non_null | First non-null value encountered |
Set a default strategy and override per field:
```yaml
golden_rules:
  default_strategy: most_complete
  field_rules:
    email: { strategy: majority_vote }
    name: { strategy: source_priority, source_priority: [crm, erp] }
```
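Two of these strategies are easy to sketch in plain Python (illustrative semantics only, assuming each cluster row carries a source column):

```python
from collections import Counter

def majority_vote(values):
    # Most common non-null value across cluster members
    vals = [v for v in values if v is not None]
    return Counter(vals).most_common(1)[0][0] if vals else None

def source_priority(rows, field, priority):
    # First non-null value found when scanning sources in priority order
    for src in priority:
        for row in rows:
            if row.get("source") == src and row.get(field) is not None:
                return row[field]
    return None

cluster = [
    {"source": "crm", "email": "a@x.com", "name": None},
    {"source": "erp", "email": "a@x.com", "name": "Jonathan"},
    {"source": "web", "email": "A@x.com", "name": "Jon"},
]
# majority_vote on email -> "a@x.com" (2 of 3 members agree)
# source_priority on name with [crm, erp] -> "Jonathan" (crm's value is null, so erp wins)
```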
Map column names to standardizer functions:
```yaml
standardization:
  rules:
    email: [email]
    phone: [phone]
    zip: [zip5]
    first_name: [name_proper, strip]
    address: [address, strip]
    state: [state]
```
| Standardizer | Description |
|---|---|
| email | Lowercase, strip, validate format |
| name_proper | Title case |
| name_upper | Uppercase |
| name_lower | Lowercase |
| phone | Strip non-digits, normalize format |
| zip5 | First 5 digits |
| address | Normalize abbreviations (St->Street, etc.) |
| state | Normalize state abbreviations |
| strip | Remove leading/trailing whitespace |
| trim_whitespace | Collapse multiple spaces |
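Three of these have obvious one-line sketches; the shipped versions may do more (e.g. format validation), so treat these as illustrative semantics only:

```python
import re

def phone(v: str) -> str:
    # Digit-only normalization
    return re.sub(r"\D", "", v)

def zip5(v: str) -> str:
    # First 5 digits, so ZIP+4 collapses to the base ZIP
    return re.sub(r"\D", "", v)[:5]

def name_proper(v: str) -> str:
    return v.strip().title()

# phone("(212) 555-0142")  -> "2125550142"
# zip5("10001-1234")        -> "10001"
# name_proper(" jane doe ") -> "Jane Doe"
```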
```yaml
validation:
  auto_fix: true
  rules:
    - column: email
      rule_type: regex
      params: { pattern: "^.+@.+\\..+$" }
      action: flag
    - column: name
      rule_type: not_null
      action: quarantine
    - column: zip
      rule_type: min_length
      params: { length: 5 }
      action: null
```
Rule types: regex, min_length, max_length, not_null, in_set, format.
Actions: flag (mark but keep), null (set to null), quarantine (remove from matching).
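The three actions can be sketched as one rule applier (illustrative, not GoldenMatch's implementation; the flag-column name is made up here):

```python
import re

def apply_rule(rows, column, check, action):
    """Apply one validation rule; returns (kept_rows, quarantined_rows)."""
    kept, quarantined = [], []
    for row in rows:
        if check(row.get(column)):
            kept.append(row)
        elif action == "flag":
            kept.append({**row, f"_{column}_flag": True})  # mark but keep
        elif action == "null":
            kept.append({**row, column: None})             # blank the bad value
        elif action == "quarantine":
            quarantined.append(row)                        # remove from matching
    return kept, quarantined

rows = [{"email": "a@x.com"}, {"email": "not-an-email"}]
valid = lambda v: bool(v and re.match(r"^.+@.+\..+$", v))
kept, quarantined = apply_rule(rows, "email", valid, "flag")
# both rows kept; the second one carries _email_flag
```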
- ~/.goldenmatch/settings.yaml – output mode, default model, API keys
- goldenmatch.yaml – column mappings, thresholds, blocking config

Settings tuned in the TUI can be saved to the project file. The next run picks them up automatically.
```python
import goldenmatch as gm

config = gm.GoldenMatchConfig(
    matchkeys=[
        gm.MatchkeyConfig(name="exact_email", type="exact",
                          fields=[gm.MatchkeyField(field="email", transforms=["lowercase"])]),
        gm.MatchkeyConfig(name="fuzzy_name", type="weighted", threshold=0.85,
                          fields=[
                              gm.MatchkeyField(field="name", scorer="jaro_winkler", weight=0.7),
                              gm.MatchkeyField(field="zip", scorer="exact", weight=0.3),
                          ]),
    ],
    blocking=gm.BlockingConfig(strategy="learned"),
    llm_scorer=gm.LLMScorerConfig(enabled=True, mode="cluster"),
    backend="ray",
)
result = gm.dedupe("data.csv", config=config)
```
Or auto-generate from data:
```python
config = gm.auto_configure([("data.csv", "source")])
```
auto_configure_df runs preflight at the end of config generation — 6 checks that auto-repair missing domain-extracted columns, drop useless-cardinality exact matchkeys, flag oversized blocks, demote remote-asset scorers, and cap low-confidence weights. Unrepairable issues raise ConfigValidationError; the full report is attached to the exception as err.report.
The pipeline runs postflight after scoring and before clustering — 4 signals (score histogram + bimodality, blocking recall, cluster sizes + bottleneck pairs, threshold-band overlap) that can auto-nudge the threshold on clear bimodality and attach the report to DedupeResult.postflight_report / MatchResult.postflight_report.
Two new kwargs on auto_configure_df:
```python
import goldenmatch as gm

# Offline-safe (default): remote-asset scorers demoted, postflight may adjust threshold
cfg = gm.auto_configure_df(df)

# Opt in to cross-encoder rerank / embedding scorers
cfg = gm.auto_configure_df(df, allow_remote_assets=True)

# Strict: compute signals + advisories, but suppress auto-adjustments (DQBench, regression)
cfg = gm.auto_configure_df(df, strict=True)
```
The preflight report is available on the returned config (underscore is private-by-convention but stable across v1.5.x):
```python
cfg = gm.auto_configure_df(df)
for finding in cfg._preflight_report.findings:
    print(f"[{finding.severity}] {finding.check}: {finding.message}")
```
See the Verification section in the Python API docs for the full preflight / postflight signatures and the PostflightSignals schema.
The optional memory: section enables persistent corrections. Once a steward, agent, or LLM decides a pair, that decision is stored, re-anchored across row reorders by record_hash, and applied automatically on every subsequent dedupe_df / match_df call. After 10+ corrections accumulate against a matchkey, the learner adjusts that matchkey’s threshold for the next run. Off by default; enable via the YAML block above or config.memory.enabled = True.
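The re-anchoring idea can be sketched as a content hash over normalized field values, so a correction survives row reorders and re-sorted input. This is purely illustrative; the actual record_hash scheme is not specified here:

```python
import hashlib

def record_hash(row: dict, fields: list[str]) -> str:
    # Hypothetical sketch: hash normalized field values in a fixed field order,
    # so the same record hashes identically regardless of row position or casing.
    payload = "|".join(str(row.get(f, "")).strip().lower() for f in sorted(fields))
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

a = {"first_name": "Ann", "email": "ANN@x.com"}
b = {"email": "ann@x.com ", "first_name": "ann"}  # re-cased, extra whitespace
assert record_hash(a, ["first_name", "email"]) == record_hash(b, ["first_name", "email"])
```

A stored correction keyed by this hash can then be looked up even when the row ID from the earlier run no longer matches.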
| Field | Default | Notes |
|---|---|---|
| enabled | false | Zero-config preserved. Enabling does not change pipeline output until corrections exist. |
| backend | "sqlite" | "postgres" requires pip install goldenmatch[postgres]. |
| path | ".goldenmatch/memory.db" | SQLite file or full DSN for postgres. |
| reanchor | true | Re-anchor by record_hash when row IDs miss; ambiguous re-anchors report stale_ambiguous. |
| dataset | null | Tag corrections; isolates per-table memory in shared DBs. |
| learning.threshold_min_corrections | 10 | Trust-weighted grid search runs once a matchkey crosses this floor. |
| learning.weights_min_corrections | 50 | Field-weight learning is stubbed in v1.6.0 and returns null. |
Full guide: Learning Memory.