GoldenMatch uses LLMs (GPT-4o-mini, Claude) to score borderline pairs that fuzzy matching alone cannot resolve. Two modes: pairwise scoring and in-context block clustering.
```python
import goldenmatch as gm

# Enable LLM scoring via convenience API
result = gm.dedupe("products.csv", fuzzy={"title": 0.80}, llm_scorer=True)
```
```bash
# CLI
goldenmatch dedupe products.csv --config config.yaml --llm-scorer
```
Requires OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable.
Pairwise scoring is the default mode: it sends individual borderline pairs to the LLM for match/no-match decisions.
```yaml
llm_scorer:
  enabled: true
  mode: pairwise
  provider: openai      # auto-detected from env vars if omitted
  model: gpt-4o-mini    # cheapest option, default
  auto_threshold: 0.95  # auto-accept pairs above this (no LLM call)
  candidate_lo: 0.75    # lower bound of LLM scoring range
  candidate_hi: 0.95    # upper bound (same as auto_threshold)
  batch_size: 75        # pairs per API call
  max_workers: 3        # concurrent LLM requests
```
How it works:
- Pairs above `auto_threshold` (0.95) are auto-accepted – no LLM call
- Pairs in `[candidate_lo, candidate_hi]` (0.75–0.95) are candidates for LLM scoring
- Pairs below `candidate_lo` (0.75) keep their original fuzzy score

```python
import goldenmatch as gm

scored = gm.llm_score_pairs(borderline_pairs, df, llm_config)
```
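The three-band routing can be sketched as a small helper (illustrative only; `classify_pair` is not part of the GoldenMatch API, and the defaults mirror the config above):

```python
def classify_pair(fuzzy_score, candidate_lo=0.75, auto_threshold=0.95):
    """Route a fuzzy score to auto-accept, LLM scoring, or keep-as-is."""
    if fuzzy_score >= auto_threshold:
        return "auto_accept"    # no LLM call needed
    if fuzzy_score >= candidate_lo:
        return "llm_candidate"  # sent to the LLM scorer
    return "keep_fuzzy"         # score left unchanged

classify_pair(0.97)  # "auto_accept"
classify_pair(0.85)  # "llm_candidate"
classify_pair(0.60)  # "keep_fuzzy"
```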
New in v1.2.6. When the candidate set is large (>100 pairs), GoldenMatch uses iterative calibration instead of scoring every pair.
Typically converges in 2-3 rounds (~200 pairs, ~$0.01). On the Bulldozer dataset (401K rows, 23.7M candidate pairs), calibration learned threshold=0.947 from just 200 pairs.
```yaml
llm_scorer:
  enabled: true
  calibration_sample_size: 100         # pairs per round
  calibration_max_rounds: 5            # max iterations
  calibration_convergence_delta: 0.01  # stop when threshold shift < this
```
Calibration activates automatically when candidates exceed calibration_sample_size. For small candidate sets (<=100 pairs), all pairs are scored directly.
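The calibration loop can be approximated as follows. This is a sketch of the idea, not GoldenMatch internals: a toy oracle on the fuzzy score stands in for the LLM (the real scorer looks at record contents), and the midpoint update rule is an assumption.

```python
import random

def calibrate_threshold(candidates, is_match, sample_size=100,
                        max_rounds=5, convergence_delta=0.01,
                        start=0.85):
    """Each round scores a sample of candidates and moves the threshold
    to the midpoint between the lowest confirmed match and the highest
    confirmed non-match, stopping once the shift is small."""
    threshold = start
    for _ in range(max_rounds):
        sample = random.sample(candidates, min(sample_size, len(candidates)))
        matches = [s for s in sample if is_match(s)]
        non_matches = [s for s in sample if not is_match(s)]
        if not matches or not non_matches:
            break  # sample was one-sided; keep the current threshold
        new_threshold = (min(matches) + max(non_matches)) / 2
        if abs(new_threshold - threshold) < convergence_delta:
            return new_threshold
        threshold = new_threshold
    return threshold

# Toy oracle standing in for the LLM: treat scores >= 0.9 as true matches.
scores = [0.75 + 0.002 * i for i in range(100)]
learned = calibrate_threshold(scores, is_match=lambda s: s >= 0.9)
```

With this toy oracle the loop converges near the true boundary (0.9) after two rounds, mirroring the "typically converges in 2-3 rounds" behavior described above.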
Cluster mode sends entire blocks of borderline records to the LLM for in-context clustering. This is more efficient than pairwise scoring for large candidate sets.
```yaml
llm_scorer:
  enabled: true
  mode: cluster
  cluster_max_size: 100  # max records per LLM cluster block
  cluster_min_size: 5    # below this, fall back to pairwise
```
How it works:
```python
import goldenmatch as gm

scored = gm.llm_cluster_pairs(borderline_pairs, df, llm_config)
```
Graceful degradation: cluster mode falls back to pairwise if a block is too small, then stops if the budget is exhausted.
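The degradation order can be sketched as a routing function. This is hypothetical, not library code, and the "split" branch for oversized blocks is an assumption:

```python
def route_block(block, budget_exhausted,
                cluster_min_size=5, cluster_max_size=100):
    """Decide how a block of borderline records gets scored."""
    if budget_exhausted:
        return "keep_fuzzy"   # budget gone: records keep their fuzzy scores
    if len(block) < cluster_min_size:
        return "pairwise"     # too small for in-context clustering
    if len(block) > cluster_max_size:
        return "split"        # assumed: oversized blocks are split first
    return "cluster"
```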
Control LLM spending with BudgetConfig:
```yaml
llm_scorer:
  enabled: true
  budget:
    max_cost_usd: 0.05  # hard cost cap
    max_calls: 100      # max API calls
    warn_at_pct: 80     # warn at 80% of budget
  escalation_model: gpt-4o       # escalate to better model for hard pairs
  escalation_band: [0.80, 0.90]
  escalation_budget_pct: 20      # reserve 20% of budget for escalation
```
```python
import goldenmatch as gm

tracker = gm.BudgetTracker(max_cost_usd=0.05, max_calls=100)
# tracker.record_call(input_tokens, output_tokens, model)
# tracker.remaining_budget
# tracker.total_cost
# tracker.is_exhausted
```
The BudgetTracker class tracks token usage and cost and enforces the configured limits. When the budget runs out, scoring stops gracefully – remaining pairs keep their fuzzy scores.
Budget summary is available in EngineStats.llm_cost after a pipeline run.
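The accounting BudgetTracker performs can be approximated like this. `SimpleBudget` is a stand-in, not the library class, and the per-million-token prices are placeholder values, not GoldenMatch's pricing table:

```python
# Placeholder (input, output) USD prices per 1M tokens.
PRICES = {"gpt-4o-mini": (0.15, 0.60)}

class SimpleBudget:
    def __init__(self, max_cost_usd, max_calls):
        self.max_cost_usd = max_cost_usd
        self.max_calls = max_calls
        self.total_cost = 0.0
        self.calls = 0

    def record_call(self, input_tokens, output_tokens, model):
        in_price, out_price = PRICES[model]
        self.total_cost += (input_tokens * in_price
                            + output_tokens * out_price) / 1e6
        self.calls += 1

    @property
    def remaining_budget(self):
        return max(0.0, self.max_cost_usd - self.total_cost)

    @property
    def is_exhausted(self):
        return (self.total_cost >= self.max_cost_usd
                or self.calls >= self.max_calls)

b = SimpleBudget(max_cost_usd=0.05, max_calls=100)
b.record_call(2000, 200, "gpt-4o-mini")
# cost so far = (2000 * 0.15 + 200 * 0.60) / 1e6 = $0.00042
```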
Automatic escalation sends harder pairs to a better (more expensive) model:
The escalation budget percentage (default 20%) reserves a portion of the total budget for tier-2 calls.
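The escalation decision can be sketched as follows. The helper and its logic are illustrative, not the library's implementation; defaults mirror the config above:

```python
def pick_model(fuzzy_score, escalation_cost, total_budget,
               band=(0.80, 0.90), escalation_budget_pct=20,
               base_model="gpt-4o-mini", escalation_model="gpt-4o"):
    """Hard pairs (fuzzy score inside the escalation band) go to the
    stronger model while the reserved slice of the budget lasts."""
    reserve = total_budget * escalation_budget_pct / 100
    lo, hi = band
    if lo <= fuzzy_score <= hi and escalation_cost < reserve:
        return escalation_model
    return base_model
```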
A separate feature from the LLM scorer. LLM boost fine-tunes an embedding model using LLM-generated labels:
```bash
goldenmatch dedupe products.csv --llm-boost
```
LLM boost uses tiered auto-escalation.
Active sampling selects the most informative pairs for labeling, reducing cost by ~45%.
LLM boost is most valuable for product matching with local models (MiniLM). For structured data, fuzzy matching alone achieves 97%+ F1.
Feature extraction uses the LLM to pull structured fields out of unstructured text. This is O(N) preprocessing, not O(N^2) pair scoring.
```python
import goldenmatch as gm

enhanced_df = gm.llm_extract_features(df, column="description", budget=tracker)
```
GoldenMatch auto-detects the provider from environment variables:
| Variable | Provider |
|---|---|
| `OPENAI_API_KEY` | OpenAI (GPT-4o-mini, GPT-4o) |
| `ANTHROPIC_API_KEY` | Anthropic (Claude) |
Both providers return (text, input_tokens, output_tokens) tuples for budget tracking.
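A minimal sketch of env-var auto-detection, assuming OpenAI is checked first when both keys are set (the function name and precedence are illustrative, not part of the GoldenMatch API):

```python
import os

def detect_provider(env=None):
    """Return the provider implied by available API keys."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY")
```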
| Dataset | Strategy | LLM Cost | F1 |
|---|---|---|---|
| Abt-Buy (electronics) | Domain + emb + LLM | $0.04 | 72.2% |
| Amazon-Google (software) | emb + ANN + LLM | $0.02 | 45.3% |
| Abt-Buy (Vertex AI + LLM) | Embeddings + GPT-4o-mini | $0.74 | 81.7% |
| Bulldozer 401K (equipment) | Multi-pass + ANN + calibration | ~$0.01 | 87.7% conf |
| Typical 5K dataset | LLM scorer (borderline only) | ~$0.05 | varies |
With iterative calibration (v1.2.6+), the LLM scores only ~200 pairs to learn the optimal threshold, then applies it to all candidates. This reduced the Bulldozer benchmark from ~$0.50 (37,500 pairs) to ~$0.01 (200 pairs).
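A back-of-envelope check of the pair counts quoted above:

```python
# Fraction of candidate pairs the LLM actually scores with calibration,
# using the Bulldozer benchmark numbers from the text.
full_pairs = 37_500
calibration_pairs = 200
fraction = calibration_pairs / full_pairs
print(f"Calibration scores {fraction:.2%} of the pairs")
```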
| Function | Description |
|---|---|
| `gm.llm_score_pairs(pairs, df, config)` | Pairwise LLM scoring |
| `gm.llm_cluster_pairs(pairs, df, config)` | In-context block clustering |
| `gm.BudgetTracker(max_cost_usd, max_calls)` | Track and limit LLM spending |
| `gm.llm_label_pairs(pairs, df)` | Generate LLM-labeled training pairs |
| `gm.llm_extract_features(df, column)` | LLM-based feature extraction |