GoldenCheck ships with a custom benchmark suite in the benchmarks/ directory. Results below were measured on the development machine. Your results may vary by hardware and dataset characteristics.


DQBench Score

GoldenCheck v0.5.0 — profiler-only, zero-config: 88.40

GoldenCheck’s zero-config discovery outperforms every competitor, even when those competitors are given hand-written rules.

Score Progression

Version   Mode            DQBench Score
v0.1.0    profiler-only            9.10
v0.1.5    profiler-only           34.22
v0.2.0    profiler-only           72.00
v0.3.0    profiler-only           87.71
v0.5.0    profiler-only           88.40

Head-to-Head Comparison

Tool                 Mode                T1 F1   T2 F1   T3 F1   DQBench Score
GoldenCheck          zero-config         94.1%   90.9%   83.0%           88.40
Pandera              best-effort rules   36.4%   38.1%   25.0%           32.51
Soda Core            best-effort rules   38.1%   23.5%   13.3%           22.36
Great Expectations   best-effort rules   36.4%   23.5%   12.5%           21.68
Great Expectations   auto-profiled       22.2%   42.1%    0.0%           21.29
Soda Core            auto-profiled        0.0%   11.1%    6.2%            6.94
All competitors      zero-config          0.0%    0.0%    0.0%            0.00

DQBench Score formula: Tier1_F1 × 20% + Tier2_F1 × 40% + Tier3_F1 × 40%
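The weighting can be reproduced directly from the table's F1 percentages; a minimal sketch (the function name is ours, not part of DQBench):

```python
# Reproduce the DQBench weighting from the table's rounded F1 percentages.
# The function name is illustrative; DQBench computes this internally.
def dqbench_score(tier1_f1: float, tier2_f1: float, tier3_f1: float) -> float:
    """Tier 1 counts for 20%; Tiers 2 and 3 count for 40% each."""
    return 0.20 * tier1_f1 + 0.40 * tier2_f1 + 0.40 * tier3_f1

# GoldenCheck's row (94.1, 90.9, 83.0) lands within rounding of the
# published 88.40, since the table F1s are themselves rounded.
print(round(dqbench_score(94.1, 90.9, 83.0), 1))
```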

Run the benchmark yourself:

pip install dqbench goldencheck
dqbench run goldencheck

Speed Benchmark

Script: benchmarks/speed_benchmark.py

Synthetic datasets are generated with realistic data quality issues (malformed emails, outlier ages, mixed phone formats, status anomalies). Each size is written to a temp CSV, scanned, and the temp file is deleted.
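A hypothetical sketch of that generation step, assuming stdlib-only helpers: plant a few known issue types, write a temp CSV, and let the caller clean up. Column names and issue rates here are illustrative, not the benchmark's actual values.

```python
# Sketch of synthetic-data generation with planted quality issues.
# Column names and issue rates are illustrative assumptions.
import csv
import os
import random
import tempfile

def generate_dirty_rows(n, seed=0):
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        email = f"user{i}@example.com"
        if rng.random() < 0.02:                  # malformed emails
            email = email.replace("@", " at ")
        age = rng.randint(18, 90)
        if rng.random() < 0.01:                  # outlier ages
            age = 999
        phone = rng.choice(["555-0100", "(555) 0100", "555.0100"])  # mixed formats
        status = "UNKNOWN" if rng.random() < 0.01 else "active"     # status anomalies
        rows.append({"id": i, "email": email, "age": age,
                     "phone": phone, "status": status})
    return rows

def write_temp_csv(rows):
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path  # caller scans the file, then os.remove(path)
```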

Results

Dataset     File size   Time    Memory (peak)   Throughput
1K rows     ~0.1 MB     0.05s   8.2 MB          19K rows/sec
10K rows    ~1 MB       0.23s   12.4 MB         43K rows/sec
100K rows   ~10 MB      2.29s   48.1 MB         44K rows/sec
1M rows     ~100 MB     2.07s   412.3 MB        482K rows/sec

The throughput jump at 1M rows is due to Polars’ vectorized operations becoming more efficient at scale relative to Python-level profiler overhead.

Running the speed benchmark

python benchmarks/speed_benchmark.py

Output:

================================================================================
                          SPEED BENCHMARK RESULTS
================================================================================
Dataset                                File MB   Time (s)    Memory (MB)  Rows/sec     Findings
--------------------------------------------------------------------------------
1,000 rows (synthetic)                     0.1      0.05           8.2      19,000        24
10,000 rows (synthetic)                    1.0      0.23          12.4      43,000       180
100,000 rows (synthetic)                  10.2      2.29          48.1      44,000     1,800
1,000,000 rows (synthetic)               102.1      2.07         412.3     482,000    18,000
================================================================================

Detection Benchmark

Script: benchmarks/detection_benchmark.py

The detection benchmark measures column recall: the fraction of columns containing ground-truth errors that GoldenCheck flags with at least one ERROR or WARNING finding.

Methodology

  • Uses the Raha benchmark datasets (hospital, flights, beers)
  • Ground truth is computed by comparing dirty.csv vs clean.csv cell-by-cell
  • A column is considered “detected” if GoldenCheck raises at least one ERROR or WARNING on it
  • Metric: column recall = detected error columns / total error columns
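The steps above can be sketched in a few lines; `dirty_rows`/`clean_rows` are aligned lists of dicts and `flagged` is the set of columns GoldenCheck raised findings on (names here are illustrative, not the benchmark script's actual API):

```python
# Sketch of the column-recall methodology: cell-by-cell diff for ground
# truth, then set intersection against the flagged columns.
def error_columns(dirty_rows, clean_rows):
    """Columns whose values differ cell-by-cell between dirty and clean."""
    cols = set()
    for dirty, clean in zip(dirty_rows, clean_rows):
        for col, value in dirty.items():
            if value != clean[col]:
                cols.add(col)
    return cols

def column_recall(flagged, truth_cols):
    """Detected error columns / total error columns."""
    if not truth_cols:
        return 1.0
    return len(flagged & truth_cols) / len(truth_cols)
```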

Custom GoldenCheck Benchmark

In addition to Raha, a purpose-built dataset (benchmarks/datasets/goldencheck_bench/dirty.csv) plants 341 data quality issues across 9 categories:

Category                      Examples
Type mismatch                 Numeric values in string columns
Missing values                Unexpected nulls in required columns
Format violations             Malformed emails and phone numbers
Range violations              Ages of 999, negative prices
Enum violations               "UNKNOWN" status values
Pattern inconsistency         Mixed phone number formats
Uniqueness violations         Duplicate IDs
Temporal order violations     end_date before start_date
Null correlation violations   address present but city null
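For the two less common categories, illustrative checks look like the following. These mirror what a profiler rule might flag; they are not GoldenCheck internals, and the column names are assumptions.

```python
# Illustrative checks for the temporal-order and null-correlation categories.
# Rows are dicts; dates are assumed comparable (e.g. ISO strings or date objects).
def temporal_order_violations(rows, start="start_date", end="end_date"):
    """Rows where end_date precedes start_date."""
    return [
        r for r in rows
        if r[start] is not None and r[end] is not None and r[end] < r[start]
    ]

def null_correlation_violations(rows, present="address", expected="city"):
    """Rows where `address` is populated but the correlated `city` is null."""
    return [r for r in rows if r[present] is not None and r[expected] is None]
```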

Detection Results

Mode                                     Column Recall   Cost
Profiler-only (v0.1.0)                   87%             $0
Profiler-only (v0.2.0 with confidence)   100%            $0
With LLM Boost                           100%            ~$0.003-0.01

v0.2.0 improvements: minority wrong-type detection, range profiler chaining, broader temporal heuristics, and confidence scoring pushed profiler-only recall from 87% to 100%.

The v0.1.0 gap between profiler-only and LLM Boost represented issues that required semantic understanding — for example, a name column containing numeric IDs, or an email column where nulls are semantically wrong even though the profiler only emits INFO. As of v0.2.0, the profiler alone achieves 100% recall on this benchmark.


Raha Benchmark Datasets

Dataset    Rows     Columns   Column Recall
Flights    2,376    7         100% (4/4 error columns detected)
Beers      2,410    11        80% (4/5 error columns detected)
Hospital   varies   varies    see benchmark output

Flights and Beers datasets

All 4 Flights columns with ground-truth errors are detected. The one missed Beers column contains errors that require domain knowledge to spot (brewery-name inconsistencies that look like valid strings).

Running the detection benchmark

First, clone the Raha datasets:

git clone https://github.com/BigDaMa/raha.git benchmarks/raha_repo

Then run:

python benchmarks/detection_benchmark.py

LLM Boost Benchmark

Script: benchmarks/goldencheck_benchmark_llm.py

Compares profiler-only vs LLM-boosted recall on the custom GoldenCheck benchmark dataset.

export ANTHROPIC_API_KEY=sk-ant-...
python benchmarks/goldencheck_benchmark_llm.py

Benchmark results summary

Mode                                     Column Recall   Issues Found   LLM Cost
Profiler-only (v0.1.0)                   87%             297/341        $0
Profiler-only (v0.2.0 with confidence)   100%            341/341        $0
With LLM Boost                           100%            341/341        ~$0.003-0.01

The LLM upgrade/downgrade mechanism also reduces false positives. In the benchmark, the profiler emits 12 false-positive warnings that the LLM correctly downgrades to INFO.
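That reduction can be measured with a small helper; a sketch assuming findings are dicts with `id` and `severity` fields (an assumption, not GoldenCheck's actual schema):

```python
# Count profiler WARNINGs that the LLM pass downgraded to INFO.
# The finding structure here is an illustrative assumption.
def count_downgrades(before, after):
    severity_after = {f["id"]: f["severity"] for f in after}
    return sum(
        1 for f in before
        if f["severity"] == "WARNING" and severity_after.get(f["id"]) == "INFO"
    )
```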


Benchmark Data Generation

python benchmarks/generate_datasets.py

Generates the goldencheck_bench dataset with planted issues. Run this before speed_benchmark.py to produce the goldencheck_bench portion of the results.