Measure matching accuracy against ground truth and enforce quality gates in CI/CD pipelines.
```python
import goldenmatch as gm

metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}, Recall: {metrics['recall']:.1%}")
```
Or from the command line:

```bash
goldenmatch evaluate data.csv --config config.yaml --gt ground_truth.csv
```
A CSV file with two columns identifying matched pairs:
```csv
id_a,id_b
1,42
1,108
5,200
5,201
5,203
```
Each row represents a known true match. Column names default to `id_a` and `id_b` but are configurable.

IDs correspond to GoldenMatch's `__row_id__` (int64). Ground truth CSVs may have string IDs; `load_ground_truth_csv` attempts int conversion automatically.
```python
gt_pairs = gm.load_ground_truth_csv("gt.csv", col_a="id_a", col_b="id_b")
# Returns a set of (int, int) tuples
```
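For intuition, here is a minimal standard-library sketch of what such a loader does (`load_pairs` is a hypothetical name, not GoldenMatch's implementation):

```python
import csv
from io import StringIO

def load_pairs(f, col_a="id_a", col_b="id_b"):
    """Read matched-pair rows, coercing IDs to int where possible."""
    pairs = set()
    for row in csv.DictReader(f):
        a, b = row[col_a], row[col_b]
        try:
            a, b = int(a), int(b)  # GoldenMatch row IDs are int64
        except ValueError:
            pass  # leave as strings if conversion fails
        pairs.add((a, b))
    return pairs

gt = load_pairs(StringIO("id_a,id_b\n1,42\n1,108\n5,200\n"))
# gt == {(1, 42), (1, 108), (5, 200)}
```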
The CLI exits with code 1 if accuracy falls below any of the given thresholds:

```bash
goldenmatch evaluate data.csv \
  --config config.yaml \
  --gt ground_truth.csv \
  --min-f1 0.90 \
  --min-precision 0.80 \
  --min-recall 0.70
```
Use in GitHub Actions:
```yaml
# .github/workflows/quality.yml
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install goldenmatch
      - run: |
          goldenmatch evaluate data.csv \
            --config config.yaml \
            --gt ground_truth.csv \
            --min-f1 0.90 --min-precision 0.80
```
```python
@dataclass
class EvalResult:
    precision: float  # TP / (TP + FP)
    recall: float     # TP / (TP + FN)
    f1: float         # 2 * P * R / (P + R)
    tp: int           # True positives (correct matches)
    fp: int           # False positives (incorrect matches)
    fn: int           # False negatives (missed matches)

    def summary(self) -> dict: ...
```
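The metric fields derive directly from the confusion counts; a self-contained plain-Python sketch of the arithmetic (not the library's code):

```python
from dataclasses import dataclass

@dataclass
class Counts:
    tp: int  # correct matches
    fp: int  # incorrect matches
    fn: int  # missed matches

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp) if self.tp + self.fp else 0.0

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn) if self.tp + self.fn else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if p + r else 0.0

c = Counts(tp=3, fp=1, fn=1)
print(c.precision, c.recall, c.f1)  # 0.75 0.75 0.75
```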
```python
import goldenmatch as gm

predicted = {(1, 42), (1, 108), (5, 200), (7, 300)}
ground_truth = {(1, 42), (1, 108), (5, 200), (5, 201)}

result = gm.evaluate_pairs(predicted, ground_truth)
print(f"Precision: {result.precision:.1%}")  # 3/4 = 75%
print(f"Recall: {result.recall:.1%}")        # 3/4 = 75%
print(f"F1: {result.f1:.1%}")                # 75%
```
Evaluate a cluster dict (as returned by `build_clusters`). Expands cluster members into pairs for comparison.
```python
import goldenmatch as gm

result = gm.evaluate_clusters(clusters, ground_truth_pairs)
print(result.f1)
```
Note: `run_dedupe()` does not return `scored_pairs`; use the clusters dict instead.
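The pair expansion can be sketched in plain Python (illustrative only; `clusters_to_pairs` is a hypothetical helper, and `evaluate_clusters`' internals may differ):

```python
from itertools import combinations

def clusters_to_pairs(clusters):
    """Expand each cluster's members into all unordered pairs."""
    pairs = set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs

clusters = {0: [1, 42, 108], 1: [5, 200]}
print(clusters_to_pairs(clusters))
# {(1, 42), (1, 108), (42, 108), (5, 200)} (set order may vary)
```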
Interactively label record pairs to create a ground truth CSV:
```bash
goldenmatch label customers.csv --config config.yaml --gt ground_truth.csv
```
The label command shows pairs and prompts for your judgment:
| Key | Meaning |
|---|---|
| `y` | Match (add to ground truth) |
| `n` | No match (skip) |
| `s` | Skip (unsure) |
Pairs are selected from actual pipeline output, focusing on borderline cases near the threshold.
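One way to pick borderline pairs near the threshold (a sketch with a hypothetical `borderline` helper; the `label` command's actual selection strategy may differ):

```python
def borderline(scored_pairs, threshold, margin=0.05):
    """Keep pairs whose score is within +/- margin of the decision threshold."""
    return [(a, b, s) for a, b, s in scored_pairs if abs(s - threshold) <= margin]

scored = [(1, 2, 0.91), (3, 4, 0.52), (5, 6, 0.88), (7, 8, 0.86)]
print(borderline(scored, threshold=0.90))
# [(1, 2, 0.91), (5, 6, 0.88), (7, 8, 0.86)]
```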
1. Label pairs with `goldenmatch label`, or create a CSV manually.
2. Run `goldenmatch evaluate --gt gt.csv`.
3. Add a `--min-f1` threshold to your CI pipeline.

The loop: label pairs → ground_truth.csv → evaluate → adjust config → repeat, with a CI/CD gate (`--min-f1 0.90`) enforcing quality on every run.
| Metric | Formula | Interpretation |
|---|---|---|
| Precision | TP / (TP + FP) | Of the pairs GoldenMatch found, how many are correct? |
| Recall | TP / (TP + FN) | Of the true matches, how many did GoldenMatch find? |
| F1 | 2PR / (P+R) | Harmonic mean of precision and recall |
For entity resolution, which metric to prioritize depends on cost: favor precision when false merges are expensive to undo, and recall when missed matches are the bigger risk.
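A handy equivalent form of the F1 formula is 2·TP / (2·TP + FP + FN), which skips computing precision and recall separately; a quick check in plain Python:

```python
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)  # 8/10 = 0.8
recall = tp / (tp + fn)     # 8/12 ~= 0.667
f1 = 2 * precision * recall / (precision + recall)

# Same value via the direct form: 16/22 = 8/11 ~= 0.727
assert abs(f1 - 2 * tp / (2 * tp + fp + fn)) < 1e-12
print(round(f1, 4))  # 0.7273
```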
Compare two clustering outcomes on the same dataset without ground truth. Based on the Case Count Metric System (Talburt et al., arXiv:2601.02824v1).
```python
import goldenmatch as gm

result = gm.compare_clusters(clusters_a, clusters_b)
print(result.summary())
# {"unchanged": 42, "merged": 3, "partitioned": 5, "overlapping": 1, "twi": 0.92, ...}
```
Each cluster from run A is classified into one of four cases:
| Case | Meaning |
|---|---|
| Unchanged | Identical cluster in both runs |
| Merged | Run A cluster absorbed into a larger cluster in run B |
| Partitioned | Run A cluster split into smaller clusters in run B |
| Overlapping | Complex reorganization – members redistributed across clusters |
The TWI (Talburt-Wang Index) measures overall clustering similarity, normalized to [0, 1] where 1.0 means identical outcomes.
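For intuition, the classical Talburt-Wang index of two partitions A and B is √(|A|·|B|) / |V|, where V is the set of nonempty pairwise intersections between A's clusters and B's clusters. A plain-Python sketch (GoldenMatch's exact computation may differ):

```python
from math import sqrt

def twi(a, b):
    """a, b: lists of clusters (sets of member IDs) over the same records."""
    overlaps = sum(1 for ca in a for cb in b if ca & cb)
    return sqrt(len(a) * len(b)) / overlaps

a = [{1, 2}, {3, 4}]
print(twi(a, a))                         # identical partitions -> 1.0
print(round(twi([{1, 2, 3, 4}], a), 3))  # one merge -> 0.707
```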
```bash
goldenmatch compare-clusters run_a.json run_b.json --details --case-type merged
```
Sweep a parameter across a range and compare each run against a baseline:
```python
import goldenmatch as gm

results = gm.run_sensitivity(
    file_specs=[("data.csv", "src")],
    config=gm.load_config("config.yaml"),
    sweep_params=[gm.SweepParam("threshold", 0.70, 0.95, 0.05)],
    sample_size=5000,
)
for r in results:
    print(r.stability_report())
```
```bash
goldenmatch sensitivity data.csv -c config.yaml --sweep threshold:0.70:0.95:0.05 --sample 5000
```
Supported sweep fields: `threshold`, `matchkey.<name>.threshold`, `blocking.max_block_size`.
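A start:stop:step spec presumably expands to an inclusive, evenly spaced list of values; a sketch of that expansion (`expand_sweep` is a hypothetical name):

```python
def expand_sweep(start, stop, step):
    """Inclusive float range with rounding to avoid float-drift artifacts."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]

print(expand_sweep(0.70, 0.95, 0.05))  # [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
```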
```python
pl.read_csv(encoding="utf8-lossy", ignore_errors=True)
```

```bash
python tests/benchmarks/run_leipzig.py
```