Blocking reduces the comparison space from O(N^2) to O(N*B) by grouping records that share a key. GoldenMatch supports 8 strategies.
| Strategy | Description | Best For |
|---|---|---|
static |
Group by blocking key | Clean data with reliable keys |
adaptive |
Static + recursive sub-blocking for oversized blocks | Default choice |
sorted_neighborhood |
Sliding window over sorted records | Typos in blocking key |
multi_pass |
Union of blocks from multiple passes | Noisy data, best recall |
ann |
FAISS nearest-neighbor on embeddings | Semantic matching |
ann_pairs |
Direct-pair ANN scoring | 50–100x faster than ann |
canopy |
TF-IDF canopy clustering | Text-heavy data |
learned |
Data-driven predicate selection | Auto-discovers rules |
Group records by exact value of the blocking key.
blocking:
strategy: static
keys:
- fields: [zip]
- fields: [last_name]
transforms: [lowercase, soundex]
Multiple keys produce independent blocks that are unioned. Transforms are applied before grouping.
Static blocking with automatic sub-splitting for oversized blocks. When a block exceeds max_block_size, it splits on the highest-cardinality column within the block.
blocking:
strategy: adaptive
max_block_size: 5000
keys:
- fields: [zip]
Sliding window over records sorted by a key. Catches near-matches that differ by one character in the blocking key.
blocking:
strategy: sorted_neighborhood
window_size: 20
sort_key:
- column: last_name
transforms: [lowercase, soundex]
Run multiple blocking passes and union the results. Best recall for noisy data.
blocking:
strategy: multi_pass
union_mode: true
passes:
- fields: [zip]
- fields: [last_name]
transforms: [lowercase, soundex]
- fields: [first_name]
transforms: [lowercase, first_token]
New in v1.2.6. Combine multi-pass string blocking with ANN fallback for oversized blocks. When a block exceeds max_block_size and would normally be skipped, GoldenMatch embeds only the unique text values in that block and uses FAISS to create smaller sub-blocks.
blocking:
strategy: multi_pass
passes:
- fields: [model_desc, state]
transforms: [lowercase, strip]
- fields: [base_model]
transforms: [lowercase, soundex]
max_block_size: 1000
skip_oversized: true
ann_column: description_text # enables ANN fallback
ann_top_k: 20
How it works:
max_block_size trigger ANN fallback instead of being skippedmax_block_size (after 10x cap) are skippedOn the Bulldozer dataset (401K rows), this recovered 363 sub-blocks from 15 oversized blocks that would otherwise be skipped, matching 949 additional records.
Requires Vertex AI (GOLDENMATCH_GPU_MODE=vertex) or local sentence-transformers for embedding.
Use FAISS approximate nearest-neighbor search on sentence-transformer embeddings. Requires pip install goldenmatch[embeddings].
blocking:
strategy: ann
ann_column: description
ann_model: all-MiniLM-L6-v2
ann_top_k: 20
ann_pairs is a faster variant (50–100x) that returns direct pairs instead of block groups:
blocking:
strategy: ann_pairs
ann_column: title
ann_top_k: 20
TF-IDF-based canopy clustering with loose and tight thresholds.
blocking:
strategy: canopy
canopy:
fields: [name, address]
loose_threshold: 0.3
tight_threshold: 0.7
max_canopy_size: 500
Data-driven predicate selection via a two-pass approach: sample pairs, train predicates, apply to full data. Achieves 96.9% F1 matching hand-tuned static blocking on DBLP-ACM.
blocking:
strategy: learned
learned_sample_size: 5000
learned_min_recall: 0.95
learned_min_reduction: 0.90
learned_predicate_depth: 2
learned_cache_path: .goldenmatch/learned_blocking.pkl
import goldenmatch as gm
rules = gm.learn_blocking_rules(df, matchkey, sample_size=5000)
blocks = gm.apply_learned_blocks(df, rules)
Cache the learned rules to skip re-training on subsequent runs.
Let GoldenMatch pick the best blocking key by histogram analysis:
blocking:
auto_select: true
keys:
- fields: [zip]
- fields: [last_name]
transforms: [lowercase, soundex]
- fields: [city]
The analyzer scores each key by block count, max block size, estimated comparisons, and recall. Use the CLI to see suggestions:
goldenmatch analyze-blocking customers.csv --config config.yaml
Blocking key choice dominates fuzzy matching performance. A coarse key (e.g., state) creates huge blocks and slow scoring. A fine key (e.g., email) misses near-duplicates.
| Key | Records | Blocks | Max Size | Comparisons | Time |
|---|---|---|---|---|---|
zip |
100K | 8,200 | 340 | 1.2M | 12s |
state |
100K | 50 | 12,000 | 45M | 320s |
last_name + soundex |
100K | 4,100 | 180 | 0.8M | 9s |
learned |
100K | 3,800 | 200 | 0.9M | 10s |
Rules of thumb:
multi_pass for best recall, adaptive for best speedlearned to auto-discover optimal predicatesann_pairs for semantic/product matching