Evaluation

Measure matching accuracy against ground truth and enforce quality gates in CI/CD pipelines.

Quick start

import goldenmatch as gm

metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}, Recall: {metrics['recall']:.1%}")

goldenmatch evaluate data.csv --config config.yaml --gt ground_truth.csv

Ground truth format

A CSV file with two columns identifying matched pairs:

id_a,id_b
1,42
1,108
5,200
5,201
5,203

Each row represents a known true match. Column names default to id_a and id_b but are configurable.

IDs correspond to GoldenMatch’s __row_id__ (int64). Ground truth CSVs may have string IDs – load_ground_truth_csv attempts int conversion automatically.

gt_pairs = gm.load_ground_truth_csv("gt.csv", col_a="id_a", col_b="id_b")
# Returns set of (int, int) tuples
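The format is simple enough to read with the standard library alone, which can help when checking a ground truth file outside GoldenMatch. The sketch below is not the library's implementation: read_pairs is a hypothetical helper, and raising ValueError on a non-numeric ID is an assumption about how strict the conversion should be.

```python
import csv
import io


def read_pairs(handle, col_a="id_a", col_b="id_b"):
    """Read a ground truth CSV into a set of (int, int) tuples.

    Hypothetical helper, not part of GoldenMatch. IDs are converted
    with int(), so a non-numeric ID raises ValueError.
    """
    pairs = set()
    for row in csv.DictReader(handle):
        pairs.add((int(row[col_a]), int(row[col_b])))
    return pairs


# The sample file from this section, held in memory for the demo.
sample = io.StringIO("id_a,id_b\n1,42\n1,108\n5,200\n5,201\n5,203\n")
pairs = read_pairs(sample)
print(sorted(pairs))
```

With a file on disk, pass an open handle instead, for example open("gt.csv", newline="").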

CI/CD quality gates

The evaluate command exits with code 1 if any metric falls below its threshold:

goldenmatch evaluate data.csv \
    --config config.yaml \
    --gt ground_truth.csv \
    --min-f1 0.90 \
    --min-precision 0.80 \
    --min-recall 0.70

Use in GitHub Actions:

# .github/workflows/quality.yml
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install goldenmatch
      - run: |
          goldenmatch evaluate data.csv \
            --config config.yaml \
            --gt ground_truth.csv \
            --min-f1 0.90 --min-precision 0.80
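When a pipeline calls the Python API rather than the CLI, the same gate can be approximated by hand. This is a minimal sketch under stated assumptions: failed_gates is a hypothetical helper, the metric values are invented, and the dict keys follow the f1, precision and recall keys shown in the quick start.

```python
def failed_gates(metrics, minimums):
    """Return the names of metrics that fall below their minimum.

    Hypothetical helper: `metrics` and `minimums` are plain dicts keyed
    by metric name, e.g. {"f1": 0.92}.
    """
    return [
        name
        for name, floor in minimums.items()
        if metrics[name] < floor
    ]


# Illustrative numbers, not real results.
metrics = {"f1": 0.88, "precision": 0.91, "recall": 0.85}
minimums = {"f1": 0.90, "precision": 0.80, "recall": 0.70}

failures = failed_gates(metrics, minimums)
exit_code = 1 if failures else 0
print(failures, exit_code)
# A CI script would finish with: raise SystemExit(exit_code)
```

Reporting which metric failed, rather than only the exit code, makes a red build easier to diagnose.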

EvalResult

from dataclasses import dataclass

@dataclass
class EvalResult:
    precision: float    # TP / (TP + FP)
    recall: float       # TP / (TP + FN)
    f1: float           # 2 * P * R / (P + R)
    tp: int             # True positives (correct matches)
    fp: int             # False positives (incorrect matches)
    fn: int             # False negatives (missed matches)

    def summary(self) -> dict: ...

Evaluate pairs directly

import goldenmatch as gm

predicted = {(1, 42), (1, 108), (5, 200), (7, 300)}
ground_truth = {(1, 42), (1, 108), (5, 200), (5, 201)}

result = gm.evaluate_pairs(predicted, ground_truth)
print(f"Precision: {result.precision:.1%}")  # 3/4 = 75%
print(f"Recall: {result.recall:.1%}")        # 3/4 = 75%
print(f"F1: {result.f1:.1%}")                # 75%
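The numbers in the comments above can be reproduced with plain set arithmetic. The sketch below is not GoldenMatch's implementation; pair_metrics is a hypothetical helper, and sorting each pair so that (42, 1) and (1, 42) count as the same match is an assumption about how pair order should be treated.

```python
def pair_metrics(predicted, truth):
    """Compute precision, recall, and F1 for two sets of ID pairs.

    Hypothetical helper, not GoldenMatch's implementation. Pairs are
    sorted so (42, 1) and (1, 42) count as the same match.
    """
    predicted = {tuple(sorted(p)) for p in predicted}
    truth = {tuple(sorted(p)) for p in truth}
    tp = len(predicted & truth)   # found and correct
    fp = len(predicted - truth)   # found but wrong
    fn = len(truth - predicted)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "tp": tp, "fp": fp, "fn": fn}


predicted = {(1, 42), (1, 108), (5, 200), (7, 300)}
ground_truth = {(1, 42), (1, 108), (5, 200), (5, 201)}
scores = pair_metrics(predicted, ground_truth)
print(scores)
```

Here (7, 300) is the single false positive and (5, 201) the single false negative, which gives 3/4 for both precision and recall.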

Evaluate clusters

evaluate_clusters takes a cluster dict (as returned by build_clusters) and expands each cluster's members into pairs before comparing them with the ground truth.

import goldenmatch as gm

result = gm.evaluate_clusters(clusters, ground_truth_pairs)
print(result.f1)

Note: run_dedupe() does not return scored_pairs – evaluate its output with the clusters dict instead.
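Expanding clusters into pairs can be pictured as taking every two-member combination inside each cluster. The sketch below assumes that clusters maps a cluster ID to a list of member row IDs; the real build_clusters output may be shaped differently, and cluster_pairs is a hypothetical helper.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Expand clusters into the set of all within-cluster ID pairs.

    Assumes `clusters` maps a cluster ID to a list of member row IDs;
    the real build_clusters output may be shaped differently.
    """
    pairs = set()
    for members in clusters.values():
        # A cluster of n members yields n * (n - 1) / 2 pairs.
        pairs.update(combinations(sorted(members), 2))
    return pairs


clusters = {0: [1, 42, 108], 1: [5, 200], 2: [7]}
pairs = cluster_pairs(clusters)
print(sorted(pairs))
```

Note that the three-member cluster implies the pair (42, 108) as well. If expansion works this way, a ground truth file that lists only (1, 42) and (1, 108) would count (42, 108) as a false positive, so it may be worth listing every pair within a known duplicate group.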

Build ground truth with label command

Interactively label record pairs to create a ground truth CSV:

goldenmatch label customers.csv --config config.yaml --gt ground_truth.csv

The label command shows pairs and prompts for your judgment:

Key	Meaning
`y`	Match (add to ground truth)
`n`	No match (do not add)
`s`	Skip (unsure)

Pairs are selected from actual pipeline output, focusing on borderline cases near the threshold.
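One way to picture "borderline cases near the threshold" is to rank candidate pairs by how far their score sits from the threshold and label the closest ones first. This is only an illustration: borderline_pairs is a hypothetical helper, the scores are invented, and the label command's actual sampling strategy may differ.

```python
def borderline_pairs(scored_pairs, threshold, k):
    """Return the k pairs whose scores lie closest to the threshold.

    Hypothetical helper: `scored_pairs` is a list of (id_a, id_b, score)
    tuples. GoldenMatch's actual sampling strategy may differ.
    """
    return sorted(scored_pairs, key=lambda p: abs(p[2] - threshold))[:k]


# Made-up scores for illustration only.
scored = [
    (1, 42, 0.99),
    (1, 108, 0.86),
    (5, 200, 0.83),
    (7, 300, 0.40),
    (9, 12, 0.91),
]
to_label = borderline_pairs(scored, threshold=0.85, k=3)
print(to_label)
```

Pairs scored 0.99 or 0.40 are rarely in doubt, so labeling effort goes further on the ones near the decision boundary.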

Evaluation workflow

Build ground truth: Use goldenmatch label or create a CSV manually
Run evaluation: goldenmatch evaluate data.csv --config config.yaml --gt ground_truth.csv
Iterate: Adjust config (thresholds, scorers, blocking) and re-evaluate
Gate CI: Add --min-f1 threshold to your CI pipeline

label pairs --> ground_truth.csv --> evaluate --> adjust config --> repeat
                                         |
                                    CI/CD gate (--min-f1 0.90)

Metrics explained

Metric	Formula	Interpretation
Precision	TP / (TP + FP)	Of the pairs GoldenMatch found, how many are correct?
Recall	TP / (TP + FN)	Of the true matches, how many did GoldenMatch find?
F1	2 * P * R / (P + R)	Harmonic mean of precision (P) and recall (R)

For entity resolution:

High precision means few false merges (records incorrectly combined)
High recall means few missed duplicates
Most production systems prioritize precision, because false merges are harder to fix than missed duplicates

Cluster comparison (CCMS)

Compare two clustering outcomes on the same dataset without ground truth. Based on the Case Count Metric System (Talburt et al., arXiv:2601.02824v1).

import goldenmatch as gm

result = gm.compare_clusters(clusters_a, clusters_b)
print(result.summary())
# {"unchanged": 42, "merged": 3, "partitioned": 5, "overlapping": 1, "twi": 0.92, ...}

Each cluster from run A is classified into one of four cases:

Case	Meaning
Unchanged	Identical cluster in both runs
Merged	Run A cluster absorbed into a larger cluster in run B
Partitioned	Run A cluster split into smaller clusters in run B
Overlapping	Complex reorganization – members redistributed across clusters

The TWI (Talburt-Wang Index) measures overall clustering similarity, normalized to [0, 1] where 1.0 means identical outcomes.

goldenmatch compare-clusters run_a.json run_b.json --details --case-type merged
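The four cases in the table can be sketched with set operations. This is a simplified reading of the table, not the library's algorithm or the paper's formal definition: classify_clusters is a hypothetical helper, and it assumes both inputs map a cluster ID to a list of record IDs and cover the same records.

```python
def classify_clusters(clusters_a, clusters_b):
    """Assign each run A cluster to one of the four cases in the table.

    Simplified sketch, not the library's algorithm. Assumes both inputs
    map a cluster ID to a list of record IDs and cover the same records.
    """
    # Look up which run B cluster each record belongs to.
    owner_b = {rec: cid for cid, members in clusters_b.items() for rec in members}
    cases = {}
    for cid, members in clusters_a.items():
        a_set = set(members)
        b_ids = {owner_b[rec] for rec in a_set}
        b_union = set().union(*(clusters_b[b] for b in b_ids))
        if len(b_ids) == 1 and b_union == a_set:
            cases[cid] = "unchanged"
        elif len(b_ids) == 1:
            cases[cid] = "merged"        # one larger B cluster contains A
        elif b_union == a_set:
            cases[cid] = "partitioned"   # several B clusters tile A exactly
        else:
            cases[cid] = "overlapping"   # members spill across boundaries
    return cases


clusters_a = {"a1": [1, 2], "a2": [3], "a3": [4], "a4": [5, 6, 7],
              "a5": [8, 9], "a6": [10, 11]}
clusters_b = {"b1": [1, 2], "b2": [3, 4], "b3": [5, 6], "b4": [7],
              "b5": [8], "b6": [9, 10], "b7": [11]}
cases = classify_clusters(clusters_a, clusters_b)
print(cases)
```

In this toy data a2 and a3 are both absorbed into b2, a4 splits cleanly into b3 and b4, and a5 and a6 trade members through b6, which is why neither fits the merged or partitioned pattern.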

Parameter sensitivity analysis

Sweep a parameter across a range and compare each run against a baseline:

import goldenmatch as gm

results = gm.run_sensitivity(
    file_specs=[("data.csv", "src")],
    config=gm.load_config("config.yaml"),
    sweep_params=[gm.SweepParam("threshold", 0.70, 0.95, 0.05)],
    sample_size=5000,
)
for r in results:
    print(r.stability_report())

goldenmatch sensitivity data.csv -c config.yaml --sweep threshold:0.70:0.95:0.05 --sample 5000

Supported sweep fields: threshold, matchkey.<name>.threshold, blocking.max_block_size.
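The --sweep argument packs a field name and a numeric range into one string. The sketch below shows one way such a spec could be expanded into concrete values; expand_sweep is a hypothetical parser, and treating the stop value as inclusive is an assumption, so check the CLI help for the actual behavior.

```python
def expand_sweep(spec):
    """Expand 'field:start:stop:step' into (field, list of values).

    Hypothetical parser. Treating the stop value as inclusive is an
    assumption about the CLI, not documented behavior.
    """
    # Split from the right so dotted field names stay intact.
    field, start, stop, step = spec.rsplit(":", 3)
    start, stop, step = float(start), float(stop), float(step)
    # Count steps with integers to avoid accumulating float error.
    count = int(round((stop - start) / step)) + 1
    return field, [round(start + i * step, 10) for i in range(count)]


field, values = expand_sweep("threshold:0.70:0.95:0.05")
print(field, values)
```

Computing each value as start + i * step, rather than adding step repeatedly, keeps the last value from drifting past the stop value through rounding error.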

Benchmark evaluation tips

Always use threshold-based pair generation, NOT top-1-per-record (argmax)
Leipzig benchmark CSVs contain invalid UTF-8 – read them with Polars: pl.read_csv(path, encoding="utf8-lossy", ignore_errors=True)
Run benchmarks: python tests/benchmarks/run_leipzig.py
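A likely reason for the first tip is that argmax keeps exactly one candidate per record, which both drops true matches for records with several duplicates and forces a match for records that have none. The toy comparison below uses invented scores and hypothetical helper names to show the effect.

```python
def threshold_pairs(scored_pairs, threshold):
    """Keep every pair whose score meets the threshold."""
    return {(a, b) for a, b, score in scored_pairs if score >= threshold}


def argmax_pairs(scored_pairs):
    """Keep only the best-scoring candidate for each left-hand record."""
    best = {}
    for a, b, score in scored_pairs:
        if a not in best or score > best[a][1]:
            best[a] = (b, score)
    return {(a, b) for a, (b, _) in best.items()}


# Made-up scores: record 5 has three strong candidates, record 7 has none.
scored = [(5, 200, 0.97), (5, 201, 0.93), (5, 203, 0.91), (7, 300, 0.42)]
by_threshold = threshold_pairs(scored, 0.85)
by_argmax = argmax_pairs(scored)
print(sorted(by_threshold))
print(sorted(by_argmax))
```

The threshold rule keeps all three candidates for record 5 and rejects the weak (7, 300) pair, while argmax keeps only (5, 200) and still emits (7, 300).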