LLM Boost is an optional enhancement pass that runs after the standard profilers. It sends a compact representation of your data to an LLM and merges the LLM’s assessments back into the findings list.
How It Works
LLM Boost operates in two stages: type classification and finding review.
Stage 1 — Semantic type classification
Before the finding review call, a lightweight LLM call classifies each column’s semantic type (e.g., email, name, currency, id, category). This classification is used to:
- Improve the severity of findings that depend on column meaning (e.g., nulls in an email column are more likely errors)
- Provide context to the Stage 2 finding review prompt
This call uses the cheapest available model and typically costs under $0.001.
Stage 2 — Finding review
Step 1 — Profiler scan
The standard profiler pipeline runs first and produces a list[Finding] along with the sampled DataFrame.
Step 2 — Sample block construction
build_sample_blocks() compiles a JSON summary for each column (up to 50 columns; columns with the most existing findings are prioritized if the dataset exceeds that limit):
{
"email": {
"column": "email",
"dtype": "String",
"semantic_type": "email",
"row_count": 10000,
"null_count": 45,
"null_pct": 0.005,
"unique_count": 9821,
"top_values": [{"value": "user@example.com", "count": 3}],
"rare_values": [{"value": "bad@", "count": 1}],
"random_sample": ["alice@corp.com", "bob@example.org"],
"flagged_values": ["bad@", "notanemail"],
"existing_findings": [
{"severity": "warning", "check": "format_detection",
"message": "6 value(s) do not match expected email format"}
]
}
}
Step 3 — Single LLM call
The sample blocks are serialized to JSON and sent in a single API call with this system prompt:
You are a data quality analyst. Identify issues the profilers missed, upgrade severity of findings that are worse than assessed, downgrade false positives, and identify cross-column relationships.
The LLM returns structured JSON with per-column assessments and relation findings.
Step 4 — Merge
merge_llm_findings() integrates the LLM response:
- New issues from the LLM are appended as
Findingobjects withsource="llm" - Upgrades change the
severityon an existing finding matched bycheckname - Downgrades reduce the
severityon matched findings - Relations become new cross-column findings
The final list is sorted by severity (ERROR first) and returned.
Scores: profiler-only vs LLM boost
| Mode | DQBench Score | Cost |
|---|---|---|
| Profiler-only (v0.2.0) | 72.00 | $0 |
| With LLM Boost | ~74–76 (varies by model) | ~$0.003–0.01/scan |
The profiler-only score of 72.00 already outperforms all competitors’ hand-written rules. LLM Boost provides incremental gains on adversarial tier issues requiring semantic understanding.
What LLM Boost Catches That Profilers Miss
| Category | Example |
|---|---|
| Semantic type violations | "12345" in a first_name column |
| Business rule knowledge | Email columns should almost never be null |
| Contextual severity | A status column with "UNKNOWN" is an error, not info |
| Implicit relations | signup_date should precede last_login_date |
| False positive reduction | Mixed phone formats in a global dataset are expected |
Provider Setup
Anthropic (default)
pip install goldencheck[llm]
export ANTHROPIC_API_KEY=sk-ant-...
goldencheck data.csv --llm-boost --no-tui
Default model: claude-haiku-4-5-20251001
OpenAI
pip install goldencheck[llm]
export OPENAI_API_KEY=sk-...
goldencheck data.csv --llm-boost --llm-provider openai --no-tui
Default model: gpt-4o-mini
Cost Tracking
GoldenCheck tracks actual token usage after each LLM call and logs the cost:
INFO LLM boost cost: $0.0082 (input: 8420, output: 312, model: claude-haiku-4-5-20251001)
Cost is calculated per-model using known pricing rates:
| Model | Input (per 1K tokens) | Output (per 1K tokens) |
|---|---|---|
| claude-haiku-4-5-20251001 | $0.0008 | $0.004 |
| claude-sonnet-4-20250514 | $0.003 | $0.015 |
| gpt-4o-mini | $0.00015 | $0.0006 |
| gpt-4o | $0.0025 | $0.01 |
For unknown models, a conservative fallback rate is used.
Budget Limits
Set a maximum spend per scan with GOLDENCHECK_LLM_BUDGET:
export GOLDENCHECK_LLM_BUDGET=0.10 # max $0.10 per scan
goldencheck data.csv --llm-boost --no-tui
If the estimated cost exceeds the budget before the API call is made, the LLM pass is skipped and profiler-only results are returned. A warning is logged:
WARNING Estimated LLM cost ($0.1240) exceeds budget ($0.10). Skipping LLM boost.
Budget is pre-checked using a conservative estimate (~2,000 input + ~500 output tokens). The actual call is only made if the estimate is within budget.
Environment Variables
| Variable | Description | Example |
|---|---|---|
ANTHROPIC_API_KEY | Required when using the anthropic provider | sk-ant-... |
OPENAI_API_KEY | Required when using the openai provider | sk-... |
GOLDENCHECK_LLM_BUDGET | Maximum USD spend per scan | 0.50 |
GOLDENCHECK_LLM_MODEL | Override the default model for the selected provider | claude-sonnet-4-20250514 |
Failure Handling
If the LLM call fails (network error, invalid response, API error), GoldenCheck logs a warning and returns profiler-only results. It never crashes or exits because of an LLM failure:
WARNING LLM boost failed: Connection timeout. Showing profiler-only results.
If the response cannot be parsed as valid JSON matching the expected schema:
WARNING LLM response could not be parsed. Showing profiler-only results.
Column Limit
For datasets with more than 50 columns, LLM Boost prioritizes the columns with the most existing profiler findings. A warning is logged:
WARNING LLM boost limited to 50 columns (dataset has 120).
Columns with most findings prioritized.