GoldenCheck runs 10 column-level profilers and 2 cross-column profilers on every scan. Each profiler is independent — they do not share state and can be extended without touching any other profiler.
Column-Level Profilers
Column profilers implement BaseProfiler and receive a single column at a time:
```python
class BaseProfiler(ABC):
    @abstractmethod
    def profile(self, df: pl.DataFrame, column: str) -> list[Finding]:
        ...
```
TypeInferenceProfiler
File: goldencheck/profilers/type_inference.py
Detects string columns where most values are actually numeric. This happens when a CSV is read without type inference or when a numeric column has been stored as text.
Triggers on: String / Utf8 dtype columns only.
Logic: Attempts to cast the column to Float64. If 80%+ of non-null values cast successfully, a finding is raised. A secondary cast to Int64 determines whether to label the type as integer or numeric.
| Severity | Condition |
|---|---|
| WARNING | >=80% of string values are numeric |
Example finding:
Column is string but 98% of values are integer (2 non-integer values)
Suggestion: Consider casting to integer
NullabilityProfiler
File: goldencheck/profilers/nullability.py
Classifies whether a column is required (no nulls), optional (some nulls), or entirely null.
Triggers on: All column types.
| Severity | Condition |
|---|---|
| ERROR | 100% of rows are null |
| INFO | 0 nulls and row count >= 10 (likely required) |
| INFO | Some nulls but not all (optional column) |
Example findings:
0 nulls across 50,000 rows — likely required
12 nulls (0.2%) — column is optional
Column is entirely null (100 rows)
UniquenessProfiler
File: goldencheck/profilers/uniqueness.py
Identifies columns that are likely primary keys (100% unique) and columns that are nearly unique but have a small number of duplicates.
Triggers on: All column types. Requires at least 10 rows.
| Severity | Condition |
|---|---|
| INFO | 100% unique across all non-null rows |
| WARNING | >95% unique but not 100% (near-unique with duplicates) |
Example findings:
100% unique across 10,000 rows — likely primary key
Near-unique column (98.3% unique) with 17 duplicates
FormatDetectionProfiler
File: goldencheck/profilers/format_detection.py
Checks string columns for known formats: email addresses, US phone numbers, and URLs. When a column is predominantly one format, any non-matching values are flagged as a separate WARNING.
Triggers on: String / Utf8 dtype columns only.
Detected formats:
| Format | Pattern used |
|---|---|
| email | ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ |
| phone | ^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$ |
| url | ^https?:// |
Threshold: 70% match to classify the column as that format.
| Severity | Condition |
|---|---|
| INFO | >=70% of values match a known format (column classified) |
| WARNING | Non-matching values present in a classified column |
Example findings:
Column appears to contain email values (94.3% match)
6 value(s) do not match expected email format
Sample: [bad@, notanemail, user @domain.com]
RangeDistributionProfiler
File: goldencheck/profilers/range_distribution.py
Reports the numeric range and detects statistical outliers using a 3-standard-deviation threshold.
Triggers on: Numeric dtypes (Int8 through Float64). Requires at least 2 non-null values.
| Severity | Condition |
|---|---|
| INFO | Always emitted — reports min, max, mean |
| WARNING | Values beyond 3 standard deviations from the mean |
Example findings:
Range: min=1, max=120, mean=34.21
3 outlier(s) detected beyond 3 standard deviations
Sample: [999, 1050, -5]
CardinalityProfiler
File: goldencheck/profilers/cardinality.py
Flags low-cardinality columns as enum candidates. Columns with fewer than 20 unique values and at least 50 rows are surfaced as potential enums.
Triggers on: All column types.
Thresholds:
- ENUM_UNIQUE_THRESHOLD = 20 (maximum unique values for an enum candidate)
- ENUM_MIN_ROWS = 50 (minimum row count)
| Severity | Condition |
|---|---|
| INFO | Low cardinality — enum candidate |
| INFO | Standard cardinality report |
Example findings:
Low cardinality: 4 unique value(s) across 5,000 rows — consider using an enum type
Sample: [active, closed, inactive, pending]
Suggestion: Define an enum or categorical constraint for this column
PatternConsistencyProfiler
File: goldencheck/profilers/pattern_consistency.py
Detects mixed structural patterns within a string column. Values are generalized to a pattern signature (digits become D, letters become L, punctuation preserved), and minority patterns are flagged.
Triggers on: String / Utf8 dtype columns only.
Logic: Builds a frequency distribution of generalized patterns. Any pattern representing less than 30% of values is a minority pattern and gets its own finding.
Threshold: MINORITY_THRESHOLD = 0.30
| Severity | Condition |
|---|---|
| WARNING | A minority pattern (<30% of values) is present |
Example finding:
Inconsistent pattern detected: 'DDDDDDDDDD' appears in 47 row(s) (0.9%)
vs dominant pattern 'LLL LLL-LLLL' (5,100 row(s))
Sample: [2025551234, 8005559999]
Suggestion: Standardize values to a single format/pattern
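The generalization step and the minority test can be sketched in plain Python (`pattern_signature` and `minority_patterns` are illustrative names):

```python
from collections import Counter

MINORITY_THRESHOLD = 0.30

def pattern_signature(value: str) -> str:
    """Generalize a value: digits -> 'D', letters -> 'L', punctuation kept."""
    return "".join("D" if c.isdigit() else "L" if c.isalpha() else c for c in value)

def minority_patterns(values: list[str]) -> list[str]:
    """Patterns carried by fewer than 30% of values."""
    counts = Counter(pattern_signature(v) for v in values)
    return [p for p, n in counts.items() if n / len(values) < MINORITY_THRESHOLD]

phones = ["800 555-9999"] * 9 + ["2025551234"]
minority_patterns(phones)  # ["DDDDDDDDDD"] - the all-digit minority at 10%
```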
EncodingDetectionProfiler
File: goldencheck/profilers/encoding_detection.py
Detects encoding artifacts and invisible character issues in string columns. These are common when data has been exported from Excel, copy-pasted from web pages, or converted between character sets.
Triggers on: String / Utf8 dtype columns only.
Detected issues:
| Issue | Characters | Description |
|---|---|---|
| Zero-width characters | U+200B, U+200C, U+200D, U+FEFF | Invisible characters that cause silent comparison failures |
| Smart quotes | U+201C “, U+201D ”, U+2018 ‘, U+2019 ’ | Typographic quotes that break exact-match lookups |
| Latin-1 mojibake | Ã, Â, † | UTF-8 bytes decoded as Latin-1 — garbled accented characters |
| Severity | Condition |
|---|---|
| WARNING | Any of the above patterns detected in one or more values |
Example finding:
3 value(s) contain zero-width Unicode characters (U+200B/U+FEFF)
Sample: ["JohnSmith", "Alice"]
Suggestion: Strip zero-width characters before storing or comparing values
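The first two checks can be sketched with character-class regexes over the code points listed in the table (the mojibake heuristic is omitted here; `encoding_issues` is an invented name):

```python
import re

ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")
SMART_QUOTES = re.compile("[\u201c\u201d\u2018\u2019]")

def encoding_issues(values: list[str]) -> dict[str, list[str]]:
    """Map issue name -> offending values (sketch of the checks above)."""
    checks = {"zero_width": ZERO_WIDTH, "smart_quotes": SMART_QUOTES}
    return {
        name: hits
        for name, rx in checks.items()
        if (hits := [v for v in values if rx.search(v)])
    }

issues = encoding_issues(["John\u200bSmith", "\u201cquoted\u201d", "clean"])
# {"zero_width": ["John\u200bSmith"], "smart_quotes": ["\u201cquoted\u201d"]}
```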
SequenceGapProfiler
File: goldencheck/profilers/sequence_gap.py
Detects gaps in numeric sequences. Useful for identifying missing records in ID columns, invoice numbers, order sequences, or any column expected to be a contiguous integer range.
Triggers on: Integer dtype columns that are 100% unique and have low cardinality relative to the row count.
Logic: Computes expected_count = max - min + 1 and compares against actual_count. If the ratio is below 0.98 (more than 2% of values are missing), a finding is raised. The first few missing values are included as samples.
| Severity | Condition |
|---|---|
| WARNING | Sequence has gaps (missing integers between min and max) |
Example finding:
Sequence gaps detected: 47 missing value(s) between 1 and 10000
Missing sample: [23, 47, 102, 891, 1204]
Suggestion: Investigate whether records were deleted or IDs were never assigned
DriftDetectionProfiler
File: goldencheck/profilers/drift_detection.py
Detects statistical drift between the first and second half of a dataset. This surfaces data that changes character over time — common in logs, event streams, or pipelines that append data from different sources.
Triggers on: All columns. Requires at least 100 rows.
Categorical drift: Compares the top-value distribution between the first and second half. If a dominant value in the first half disappears or a new dominant value appears in the second half, drift is flagged.
Numeric drift: Compares the mean of the first half vs. the second half. If the means differ by more than 20% of the overall standard deviation, drift is flagged.
| Severity | Condition |
|---|---|
| WARNING | Categorical distribution shift between first and second half |
| WARNING | Numeric mean shift > 20% of standard deviation between halves |
Example findings:
Categorical drift: value 'active' drops from 72% to 31% between halves
Suggestion: Investigate whether data was loaded from different time periods or sources
Numeric drift: mean shifts from 42.3 to 61.7 between first and second half of dataset
Suggestion: Check for batch effects or pipeline changes that may have altered values over time
Cross-Column Profilers
Cross-column profilers receive the full DataFrame and look at relationships between columns. They implement a compatible profile(df) interface but are not subclasses of BaseProfiler.
TemporalOrderProfiler
File: goldencheck/relations/temporal.py
Detects column pairs where start-like columns have values later than their corresponding end-like columns.
Heuristics for pairing columns by name:
| Start keyword | End keyword |
|---|---|
| start | end |
| created | updated |
| begin | finish |
Pairing is done by substring match on lowercased column names.
Type handling: The profiler attempts to parse string columns as %Y-%m-%d dates; columns that cannot be interpreted as dates are skipped.
| Severity | Condition |
|---|---|
| ERROR | Any row where start > end |
Example finding:
Column 'start_date' has 3 row(s) where its value is later than 'end_date',
violating expected temporal order.
Sample: [2024-06-01 > 2024-05-15, ...]
Suggestion: Ensure 'start_date' <= 'end_date' for all rows.
NullCorrelationProfiler
File: goldencheck/relations/null_correlation.py
Identifies pairs of columns whose null/non-null patterns are highly correlated. This surfaces logical groups where fields should always be populated together (e.g., shipping_address and shipping_city).
Threshold: _DEFAULT_THRESHOLD = 0.90 (90% agreement on null/non-null pattern).
Pairs where neither column has any nulls are skipped — there is no interesting signal.
| Severity | Condition |
|---|---|
| INFO | Null pattern agreement >= 90% |
Example finding:
Columns 'billing_address' and 'billing_zip' have strongly correlated null patterns
(96.2% agreement). They may represent a logical group.
Suggestion: Consider treating 'billing_address' and 'billing_zip' as a unit —
validate that they are both populated or both absent together.
Severity Levels
| Level | Integer value | Meaning |
|---|---|---|
| INFO | 1 | Informational observation, no action required |
| WARNING | 2 | Potential issue worth reviewing |
| ERROR | 3 | Definite data quality problem |
Adding a Custom Profiler
- Create a new file in goldencheck/profilers/:
```python
# goldencheck/profilers/my_profiler.py
from __future__ import annotations

import polars as pl

from goldencheck.models.finding import Finding, Severity
from goldencheck.profilers.base import BaseProfiler


class MyProfiler(BaseProfiler):
    def profile(self, df: pl.DataFrame, column: str) -> list[Finding]:
        findings: list[Finding] = []
        col = df[column]
        # Your logic here
        if some_condition:
            findings.append(Finding(
                severity=Severity.WARNING,
                column=column,
                check="my_check_name",
                message="Description of the issue",
                affected_rows=0,
                sample_values=[],
                suggestion="What the user should do",
            ))
        return findings
```
- Register it in goldencheck/engine/scanner.py:
```python
from goldencheck.profilers.my_profiler import MyProfiler

COLUMN_PROFILERS = [
    TypeInferenceProfiler(),
    NullabilityProfiler(),
    # ... existing profilers ...
    MyProfiler(),
]
```
That is all. The scanner loops over COLUMN_PROFILERS for every column automatically.
For a cross-column profiler, add it to RELATION_PROFILERS instead. The only requirement is a profile(df: pl.DataFrame) -> list[Finding] method.