GoldenCheck ships with a custom benchmark suite in the benchmarks/ directory. Results below were measured on the development machine. Your results may vary by hardware and dataset characteristics.
DQBench Score
GoldenCheck v0.5.0 — profiler-only, zero-config: 88.40
GoldenCheck’s zero-config discovery scores higher than every other tool in the comparison below, including tools configured with hand-written rules.
Score Progression
| Version | Mode | DQBench Score |
|---|---|---|
| v0.1.0 | profiler-only | 9.10 |
| v0.1.5 | profiler-only | 34.22 |
| v0.2.0 | profiler-only | 72.00 |
| v0.3.0 | profiler-only | 87.71 |
| v0.5.0 | profiler-only | 88.40 |
Head-to-Head Comparison
| Tool | Mode | T1 F1 | T2 F1 | T3 F1 | DQBench Score |
|---|---|---|---|---|---|
| GoldenCheck | zero-config | 94.1% | 90.9% | 83.0% | 88.40 |
| Pandera | best-effort rules | 36.4% | 38.1% | 25.0% | 32.51 |
| Soda Core | best-effort rules | 38.1% | 23.5% | 13.3% | 22.36 |
| Great Expectations | best-effort rules | 36.4% | 23.5% | 12.5% | 21.68 |
| Great Expectations | auto-profiled | 22.2% | 42.1% | 0.0% | 21.29 |
| Soda Core | auto-profiled | 0.0% | 11.1% | 6.2% | 6.94 |
| All other tools | zero-config | 0.0% | 0.0% | 0.0% | 0.00 |
DQBench Score formula: Tier1_F1 × 20% + Tier2_F1 × 40% + Tier3_F1 × 40%
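As a worked example of the formula, here is a minimal sketch that recomputes a score from per-tier F1 values. The name `dqbench_score` is illustrative and is not claimed to be part of the dqbench package. Because the F1 values in the table are rounded to one decimal place, a recomputed score can differ from the published one by a few hundredths (the GoldenCheck row recomputes to 88.38 rather than 88.40).

```python
# Weights taken from the DQBench Score formula above.
TIER_WEIGHTS = (0.20, 0.40, 0.40)


def dqbench_score(t1_f1: float, t2_f1: float, t3_f1: float) -> float:
    """Weighted sum of per-tier F1 values (given as percentages)."""
    w1, w2, w3 = TIER_WEIGHTS
    return round(t1_f1 * w1 + t2_f1 * w2 + t3_f1 * w3, 2)


# Great Expectations, best-effort rules row from the table above.
print(dqbench_score(36.4, 23.5, 12.5))  # 21.68
```

Tier 1 carries half the weight of each of the other two tiers, so a tool that only does well on Tier 1 cannot score above 20.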
Run the benchmark yourself:
pip install dqbench goldencheck
dqbench run goldencheck
Speed Benchmark
Script: benchmarks/speed_benchmark.py
Synthetic datasets are generated with realistic data quality issues (malformed emails, outlier ages, mixed phone formats, status anomalies). Each size is written to a temp CSV, scanned, and the temp file is deleted.
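The write, scan, delete cycle described above could look roughly like the sketch below. It is not the contents of speed_benchmark.py: `make_rows`, `scan_csv`, and `time_scan` are hypothetical names, and `scan_csv` is a stand-in that only counts rows where the real script would call GoldenCheck.

```python
import csv
import os
import random
import tempfile
import time
import tracemalloc


def make_rows(n: int, seed: int = 0) -> list[dict]:
    """Generate n synthetic rows, planting an occasional bad value."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        bad = rng.random() < 0.05  # roughly 5% of rows get planted issues
        rows.append({
            "id": i,
            "email": "not-an-email" if bad else f"user{i}@example.com",
            "age": 999 if bad else rng.randint(18, 90),
        })
    return rows


def scan_csv(path: str) -> int:
    """Hypothetical stand-in for the real scan: counts data rows only."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))


def time_scan(n_rows: int) -> dict:
    """Write a temp CSV, time one scan of it, then delete the file."""
    rows = make_rows(n_rows)
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        size_mb = os.path.getsize(path) / 1e6
        tracemalloc.start()
        start = time.perf_counter()
        scanned = scan_csv(path)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    finally:
        os.remove(path)  # the temp file never outlives the measurement
    return {
        "rows": scanned,
        "file_mb": size_mb,
        "seconds": elapsed,
        "peak_mb": peak / 1e6,
        "path": path,
    }


result = time_scan(1_000)
print(result["rows"], os.path.exists(result["path"]))  # 1000 False
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so a harness like this would under-report memory used by native code.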
Results
| Dataset | File size | Time | Memory (peak) | Throughput |
|---|---|---|---|---|
| 1K rows | ~0.1 MB | 0.05s | — | 19K rows/sec |
| 10K rows | ~1 MB | 0.23s | — | 43K rows/sec |
| 100K rows | ~10 MB | 2.29s | — | 44K rows/sec |
| 1M rows | ~100 MB | 2.07s | — | 482K rows/sec |
The throughput jump at 1M rows is due to Polars’ vectorized operations becoming more efficient at scale relative to Python-level profiler overhead.
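The throughput column is just rows divided by wall-clock seconds. A minimal sketch for recomputing it from the table (the helper name `rows_per_second` is illustrative); since the times are rounded to two decimals, the recomputed figures land close to, but not exactly on, the reported ones.

```python
# (rows, seconds) pairs copied from the results table above.
MEASUREMENTS = [(1_000, 0.05), (10_000, 0.23), (100_000, 2.29), (1_000_000, 2.07)]


def rows_per_second(rows: int, seconds: float) -> float:
    """Throughput as rows scanned per second of wall-clock time."""
    return rows / seconds


for rows, seconds in MEASUREMENTS:
    print(f"{rows:>9,} rows: {rows_per_second(rows, seconds):>9,.0f} rows/sec")
```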
Running the speed benchmark
python benchmarks/speed_benchmark.py
Output:
================================================================================
SPEED BENCHMARK RESULTS
================================================================================
Dataset                       File MB  Time (s)  Memory (MB)  Rows/sec  Findings
--------------------------------------------------------------------------------
1,000 rows (synthetic)            0.1      0.05          8.2    19,000        24
10,000 rows (synthetic)           1.0      0.23         12.4    43,000       180
100,000 rows (synthetic)         10.2      2.29         48.1    44,000     1,800
1,000,000 rows (synthetic)      102.1      2.07        412.3   482,000    18,000
================================================================================
Detection Benchmark
Script: benchmarks/detection_benchmark.py
The detection benchmark measures column recall: of the columns that contain ground-truth errors, the fraction on which GoldenCheck raises at least one ERROR or WARNING finding.
Methodology
- Uses the Raha benchmark datasets (hospital, flights, beers)
- Ground truth is computed by comparing dirty.csv vs clean.csv cell-by-cell
- A column is considered “detected” if GoldenCheck raises at least one ERROR or WARNING on it
- Metric: column recall = detected error columns / total error columns
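The methodology above can be sketched in a few lines. This is an illustration under assumptions, not the code in detection_benchmark.py: rows are plain dicts (as `csv.DictReader` would produce), the dirty and clean files are assumed to be row-aligned, and the function names are made up for the example.

```python
def error_columns(dirty: list[dict], clean: list[dict]) -> set[str]:
    """Columns where at least one cell differs between dirty and clean rows."""
    if len(dirty) != len(clean):
        raise ValueError("dirty and clean must have the same number of rows")
    return {
        col
        for d_row, c_row in zip(dirty, clean)
        for col in c_row
        if d_row.get(col) != c_row[col]
    }


def column_recall(truth: set[str], flagged: set[str]) -> float:
    """Detected error columns divided by total error columns."""
    if not truth:
        return 1.0  # assumption: nothing to detect counts as full recall
    return len(truth & flagged) / len(truth)


clean = [
    {"id": "1", "city": "Boston", "zip": "02101"},
    {"id": "2", "city": "Austin", "zip": "73301"},
]
dirty = [
    {"id": "1", "city": "Bostn", "zip": "02101"},
    {"id": "2", "city": "Austin", "zip": "7330x"},
]

truth = error_columns(dirty, clean)   # {"city", "zip"}
flagged = {"city", "id"}              # columns with an ERROR or WARNING finding
print(column_recall(truth, flagged))  # 0.5
```

The flag on `id` does not change the result: column recall ignores false positives, so it should be read alongside a precision-style measure.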
Custom GoldenCheck Benchmark
In addition to Raha, a purpose-built dataset (benchmarks/datasets/goldencheck_bench/dirty.csv) plants 341 data quality issues across 9 categories:
| Category | Examples |
|---|---|
| Type mismatch | Numeric values in string columns |
| Missing values | Unexpected nulls in required columns |
| Format violations | Malformed emails and phone numbers |
| Range violations | Ages of 999, negative prices |
| Enum violations | "UNKNOWN" status values |
| Pattern inconsistency | Mixed phone number formats |
| Uniqueness violations | Duplicate IDs |
| Temporal order violations | end_date before start_date |
| Null correlation violations | address present but city null |
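To make two of the less familiar categories concrete, here is a small sketch of hand-written checks for temporal order and null correlation. These are illustrative functions with assumed column names, not GoldenCheck's profilers, which discover such relationships without configuration.

```python
from datetime import date


def temporal_order_violations(rows, start="start_date", end="end_date"):
    """Indices of rows whose end date falls before the start date."""
    return [
        i for i, row in enumerate(rows)
        if row[start] is not None and row[end] is not None and row[end] < row[start]
    ]


def null_correlation_violations(rows, present="address", required="city"):
    """Indices of rows where `present` is filled in but `required` is null."""
    return [
        i for i, row in enumerate(rows)
        if row[present] is not None and row[required] is None
    ]


rows = [
    {"start_date": date(2024, 1, 1), "end_date": date(2024, 2, 1),
     "address": "1 Main St", "city": "Springfield"},
    {"start_date": date(2024, 3, 1), "end_date": date(2024, 1, 15),
     "address": "2 Oak Ave", "city": None},
    {"start_date": date(2024, 5, 1), "end_date": None,
     "address": None, "city": None},
]

print(temporal_order_violations(rows))    # [1]
print(null_correlation_violations(rows))  # [1]
```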
Detection Results
| Mode | Column Recall | Cost |
|---|---|---|
| Profiler-only (v0.1.0) | 87% | $0 |
| Profiler-only (v0.2.0 with confidence) | 100% | $0 |
| With LLM Boost | 100% | ~$0.003-0.01 |
v0.2.0 improvements: minority wrong-type detection, range profiler chaining, broader temporal heuristics, and confidence scoring pushed profiler-only recall from 87% to 100%.
The v0.1.0 gap between profiler-only and LLM Boost represented issues that required semantic understanding — for example, a name column containing numeric IDs, or an email column where nulls are semantically wrong even though the profiler only emits INFO. As of v0.2.0, the profiler alone achieves 100% recall on this benchmark.
Raha Benchmark Datasets
| Dataset | Rows | Columns | Column Recall |
|---|---|---|---|
| Flights | 2,376 | 7 | 100% (4/4 error columns detected) |
| Beers | 2,410 | 11 | 80% (4/5 error columns detected) |
| Hospital | varies | varies | see benchmark output |
Flights dataset
All 4 Flights columns with ground-truth errors are detected. In Beers, the one missed column contains errors that require domain knowledge (brewery name inconsistencies that look like valid strings).
Running the detection benchmark
First, clone the Raha datasets:
git clone https://github.com/BigDaMa/raha.git benchmarks/raha_repo
Then run:
python benchmarks/detection_benchmark.py
LLM Boost Benchmark
Script: benchmarks/goldencheck_benchmark_llm.py
Compares profiler-only vs LLM-boosted recall on the custom GoldenCheck benchmark dataset.
export ANTHROPIC_API_KEY=sk-ant-...
python benchmarks/goldencheck_benchmark_llm.py
Benchmark results summary
| Mode | Column Recall | Issues Found | LLM Cost |
|---|---|---|---|
| Profiler-only (v0.1.0) | 87% | 297/341 | $0 |
| Profiler-only (v0.2.0 with confidence) | 100% | 341/341 | $0 |
| With LLM Boost | 100% | 341/341 | ~$0.003-0.01 |
The LLM upgrade/downgrade mechanism also reduces false positives. In the benchmark, the profiler emits 12 false-positive warnings that the LLM correctly downgrades to INFO.
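One way to picture the upgrade/downgrade step is as a severity override applied to profiler findings. The sketch below is purely illustrative: the `Finding` class, the `(column, check)` key, and `apply_adjustments` are assumptions made for the example, not GoldenCheck's actual data model or LLM interface.

```python
from dataclasses import dataclass, replace

SEVERITIES = ("INFO", "WARNING", "ERROR")


@dataclass(frozen=True)
class Finding:
    column: str
    check: str
    severity: str


def apply_adjustments(findings, adjustments):
    """Return findings with severities overridden where a verdict exists.

    `adjustments` maps (column, check) to a new severity; this shape is a
    hypothetical stand-in for whatever the LLM pass actually returns.
    """
    adjusted = []
    for f in findings:
        new_severity = adjustments.get((f.column, f.check), f.severity)
        if new_severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {new_severity}")
        adjusted.append(replace(f, severity=new_severity))
    return adjusted


findings = [
    Finding("notes", "format", "WARNING"),    # profiler false positive
    Finding("email", "nullability", "INFO"),  # nulls are semantically wrong
    Finding("age", "range", "ERROR"),
]
adjustments = {
    ("notes", "format"): "INFO",          # downgrade
    ("email", "nullability"): "WARNING",  # upgrade
}

for f in apply_adjustments(findings, adjustments):
    print(f.column, f.severity)
```

Findings without a verdict pass through unchanged, so the profiler's output remains the baseline when the LLM has nothing to say about a column.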
Benchmark Data Generation
python benchmarks/generate_datasets.py
Generates the goldencheck_bench dataset with planted issues. Run this before speed_benchmark.py if you want the goldencheck_bench portion of the results.