Evaluation
Measure matching accuracy against ground truth and enforce quality gates in CI/CD pipelines.
Quick start
import goldenmatch as gm
metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}, Recall: {metrics['recall']:.1%}")
goldenmatch evaluate data.csv --config config.yaml --gt ground_truth.csv
Ground truth format
A CSV file with two columns identifying matched pairs:
id_a,id_b
1,42
1,108
5,200
5,201
5,203
Each row represents a known true match. Column names default to id_a and id_b but are configurable.
IDs correspond to GoldenMatch’s __row_id__ (int64). Ground truth CSVs may have string IDs – load_ground_truth_csv attempts int conversion automatically.
gt_pairs = gm.load_ground_truth_csv("gt.csv", col_a="id_a", col_b="id_b")
# Returns set of (int, int) tuples
CI/CD quality gates
Exit with code 1 if accuracy falls below thresholds:
goldenmatch evaluate data.csv \
--config config.yaml \
--gt ground_truth.csv \
--min-f1 0.90 \
--min-precision 0.80 \
--min-recall 0.70
Use in GitHub Actions:
# .github/workflows/quality.yml
jobs:
quality-gate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install goldenmatch
- run: |
goldenmatch evaluate data.csv \
--config config.yaml \
--gt ground_truth.csv \
--min-f1 0.90 --min-precision 0.80
EvalResult
@dataclass
class EvalResult:
precision: float # TP / (TP + FP)
recall: float # TP / (TP + FN)
f1: float # 2 * P * R / (P + R)
tp: int # True positives (correct matches)
fp: int # False positives (incorrect matches)
fn: int # False negatives (missed matches)
def summary(self) -> dict
Evaluate pairs directly
import goldenmatch as gm
predicted = {(1, 42), (1, 108), (5, 200), (7, 300)}
ground_truth = {(1, 42), (1, 108), (5, 200), (5, 201)}
result = gm.evaluate_pairs(predicted, ground_truth)
print(f"Precision: {result.precision:.1%}") # 3/4 = 75%
print(f"Recall: {result.recall:.1%}") # 3/4 = 75%
print(f"F1: {result.f1:.1%}") # 75%
Evaluate clusters
Evaluate a cluster dict (as returned by build_clusters). Expands cluster members into pairs for comparison.
import goldenmatch as gm
result = gm.evaluate_clusters(clusters, ground_truth_pairs)
print(result.f1)
Note: run_dedupe() does not return scored_pairs – use the clusters dict instead.
Build ground truth with label command
Interactively label record pairs to create a ground truth CSV:
goldenmatch label customers.csv --config config.yaml --gt ground_truth.csv
The label command shows pairs and prompts for your judgment:
| Key | Meaning |
|---|---|
y | Match (add to ground truth) |
n | No match (skip) |
s | Skip (unsure) |
Pairs are selected from actual pipeline output, focusing on borderline cases near the threshold.
Evaluation workflow
- Build ground truth: Use
goldenmatch labelor create a CSV manually - Run evaluation:
goldenmatch evaluate --gt gt.csv - Iterate: Adjust config (thresholds, scorers, blocking) and re-evaluate
- Gate CI: Add
--min-f1threshold to your CI pipeline
label pairs --> ground_truth.csv --> evaluate --> adjust config --> repeat
|
CI/CD gate (--min-f1 0.90)
Metrics explained
| Metric | Formula | Interpretation |
|---|---|---|
| Precision | TP / (TP + FP) | Of the pairs GoldenMatch found, how many are correct? |
| Recall | TP / (TP + FN) | Of the true matches, how many did GoldenMatch find? |
| F1 | 2PR / (P+R) | Harmonic mean of precision and recall |
For entity resolution:
- High precision means few false merges (records incorrectly combined)
- High recall means few missed duplicates
- Most production systems prioritize precision (false merges are harder to fix than missed dupes)
Benchmark evaluation tips
- Always use threshold-based pair generation, NOT top-1-per-record (argmax)
- Leipzig benchmark CSVs have invalid UTF-8 – use
pl.read_csv(encoding="utf8-lossy", ignore_errors=True) - Run benchmarks:
python tests/benchmarks/run_leipzig.py