LLM Boost

LLM Boost is an optional enhancement pass that runs after the standard profilers. It sends a compact representation of your data to an LLM and merges the LLM’s assessments back into the findings list.

How It Works

LLM Boost operates in two stages: type classification and finding review.

Stage 1 — Semantic type classification

Before the finding review call, a lightweight LLM call classifies each column’s semantic type (e.g., email, name, currency, id, category). This classification is used to:

Improve the severity of findings that depend on column meaning (e.g., nulls in an email column are more likely errors)
Provide context to the Stage 2 finding review prompt

This call uses the cheapest available model and typically costs under $0.001.

Stage 2 — Finding review

Step 1 — Profiler scan

The standard profiler pipeline runs first and produces a list[Finding] along with the sampled DataFrame.

Step 2 — Sample block construction

build_sample_blocks() compiles a JSON summary for each column (up to 50 columns; columns with the most existing findings are prioritized if the dataset exceeds that limit):

{
  "email": {
    "column": "email",
    "dtype": "String",
    "semantic_type": "email",
    "row_count": 10000,
    "null_count": 45,
    "null_pct": 0.005,
    "unique_count": 9821,
    "top_values": [{"value": "user@example.com", "count": 3}],
    "rare_values": [{"value": "bad@", "count": 1}],
    "random_sample": ["alice@corp.com", "bob@example.org"],
    "flagged_values": ["bad@", "notanemail"],
    "existing_findings": [
      {"severity": "warning", "check": "format_detection",
       "message": "6 value(s) do not match expected email format"}
    ]
  }
}

Step 3 — Single LLM call

The sample blocks are serialized to JSON and sent in a single API call with this system prompt:

You are a data quality analyst. Identify issues the profilers missed, upgrade severity of findings that are worse than assessed, downgrade false positives, and identify cross-column relationships.

The LLM returns structured JSON with per-column assessments and relation findings.

Step 4 — Merge

merge_llm_findings() integrates the LLM response:

New issues from the LLM are appended as Finding objects with source="llm"
Upgrades change the severity on an existing finding matched by check name
Downgrades reduce the severity on matched findings
Relations become new cross-column findings

The final list is sorted by severity (ERROR first) and returned.

Scores: profiler-only vs LLM boost

Mode	DQBench Score	Cost
Profiler-only (v0.2.0)	72.00	$0
With LLM Boost	~74–76 (varies by model)	~$0.003–0.01/scan

The profiler-only score of 72.00 already outperforms all competitors’ hand-written rules. LLM Boost provides incremental gains on adversarial tier issues requiring semantic understanding.

What LLM Boost Catches That Profilers Miss

Category	Example
Semantic type violations	`"12345"` in a `first_name` column
Business rule knowledge	Email columns should almost never be null
Contextual severity	A `status` column with `"UNKNOWN"` is an error, not info
Implicit relations	`signup_date` should precede `last_login_date`
False positive reduction	Mixed phone formats in a global dataset are expected

Provider Setup

Anthropic (default)

pip install goldencheck[llm]
export ANTHROPIC_API_KEY=sk-ant-...
goldencheck data.csv --llm-boost --no-tui

Default model: claude-haiku-4-5-20251001

OpenAI

pip install goldencheck[llm]
export OPENAI_API_KEY=sk-...
goldencheck data.csv --llm-boost --llm-provider openai --no-tui

Default model: gpt-4o-mini

Cost Tracking

GoldenCheck tracks actual token usage after each LLM call and logs the cost:

INFO  LLM boost cost: $0.0082 (input: 8420, output: 312, model: claude-haiku-4-5-20251001)

Cost is calculated per-model using known pricing rates:

Model	Input (per 1K tokens)	Output (per 1K tokens)
claude-haiku-4-5-20251001	$0.0008	$0.004
claude-sonnet-4-20250514	$0.003	$0.015
gpt-4o-mini	$0.00015	$0.0006
gpt-4o	$0.0025	$0.01

For unknown models, a conservative fallback rate is used.

Budget Limits

Set a maximum spend per scan with GOLDENCHECK_LLM_BUDGET:

export GOLDENCHECK_LLM_BUDGET=0.10  # max $0.10 per scan
goldencheck data.csv --llm-boost --no-tui

If the estimated cost exceeds the budget before the API call is made, the LLM pass is skipped and profiler-only results are returned. A warning is logged:

WARNING  Estimated LLM cost ($0.1240) exceeds budget ($0.10). Skipping LLM boost.

Budget is pre-checked using a conservative estimate (~2,000 input + ~500 output tokens). The actual call is only made if the estimate is within budget.

Environment Variables

Variable	Description	Example
`ANTHROPIC_API_KEY`	Required when using the `anthropic` provider	`sk-ant-...`
`OPENAI_API_KEY`	Required when using the `openai` provider	`sk-...`
`GOLDENCHECK_LLM_BUDGET`	Maximum USD spend per scan	`0.50`
`GOLDENCHECK_LLM_MODEL`	Override the default model for the selected provider	`claude-sonnet-4-20250514`

Failure Handling

If the LLM call fails (network error, invalid response, API error), GoldenCheck logs a warning and returns profiler-only results. It never crashes or exits because of an LLM failure:

WARNING  LLM boost failed: Connection timeout. Showing profiler-only results.

If the response cannot be parsed as valid JSON matching the expected schema:

WARNING  LLM response could not be parsed. Showing profiler-only results.

Column Limit

For datasets with more than 50 columns, LLM Boost prioritizes the columns with the most existing profiler findings. A warning is logged:

WARNING  LLM boost limited to 50 columns (dataset has 120).
         Columns with most findings prioritized.