Blocking Strategies
Blocking reduces the comparison space from O(N^2) to O(N*B), where N is the number of records and B is the typical block size, by comparing only records that share a blocking key. GoldenMatch supports 8 strategies.
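To make the size of that reduction concrete, here is a minimal sketch that counts pairwise comparisons with and without blocking. It assumes equal-sized blocks, which real data rarely has; the function names are illustrative and not part of GoldenMatch.

```python
from math import comb

def all_pairs(n: int) -> int:
    """Unordered pairs among n records when nothing is blocked."""
    return comb(n, 2)

def blocked_pairs_uniform(n: int, block_size: int) -> int:
    """Pairs compared when n records fall into equal blocks of block_size.

    Any leftover records form one smaller block.
    """
    full_blocks, remainder = divmod(n, block_size)
    return full_blocks * comb(block_size, 2) + comb(remainder, 2)

n = 100_000
print(all_pairs(n))                   # 4999950000
print(blocked_pairs_uniform(n, 100))  # 4950000
```

With blocks of 100 records, the comparison count in this idealized case drops by roughly a factor of 1,000.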
Strategy overview
| Strategy | Description | Best For |
|---|---|---|
| static | Group by blocking key | Clean data with reliable keys |
| adaptive | Static + recursive sub-blocking for oversized blocks | Default choice |
| sorted_neighborhood | Sliding window over sorted records | Typos in blocking key |
| multi_pass | Union of blocks from multiple passes | Noisy data, best recall |
| ann | FAISS nearest-neighbor on embeddings | Semantic matching |
| ann_pairs | Direct-pair ANN scoring | 50–100x faster than ann |
| canopy | TF-IDF canopy clustering | Text-heavy data |
| learned | Data-driven predicate selection | Auto-discovers rules |
Static blocking
Group records by exact value of the blocking key.
blocking:
  strategy: static
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
Multiple keys produce independent blocks that are unioned. Transforms are applied before grouping.
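The grouping and union described above can be sketched in plain Python. This is an illustration of the idea, not GoldenMatch's implementation: the `static_block_pairs` helper and the record layout are assumptions, and only a lowercase transform is shown (soundex is omitted for brevity).

```python
from collections import defaultdict
from itertools import combinations

def static_block_pairs(records, keys):
    """Return the union of candidate pairs produced by each blocking key.

    `keys` is a list of (field, transform) tuples; the transform runs
    on the field value before grouping.
    """
    pairs = set()
    for field, transform in keys:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            value = rec.get(field)
            if value is None:
                continue  # records without a value join no block for this key
            blocks[transform(value)].append(idx)
        # Every pair of records inside the same block is a candidate.
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs

records = [
    {"zip": "10001", "last_name": "Smith"},
    {"zip": "10001", "last_name": "Jones"},
    {"zip": "94105", "last_name": "smith"},
]
keys = [("zip", str), ("last_name", str.lower)]
print(sorted(static_block_pairs(records, keys)))  # [(0, 1), (0, 2)]
```

Records 0 and 2 share no zip, but the lowercased last name puts them in the same block, so the union contains both pairs.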
Adaptive blocking
Static blocking with automatic sub-splitting for oversized blocks. When a block exceeds max_block_size, it splits on the highest-cardinality column within the block.
blocking:
  strategy: adaptive
  max_block_size: 5000
  keys:
    - fields: [zip]
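A rough sketch of the sub-splitting step, under the assumption that an oversized block is split recursively on whichever remaining column has the most distinct values inside it. The helper names and the tiny `max_block_size` are hypothetical and chosen only to keep the example readable.

```python
from collections import defaultdict

def split_block(records, indices, field):
    """Group the record indices of one block by the value of `field`."""
    groups = defaultdict(list)
    for i in indices:
        groups[records[i][field]].append(i)
    return list(groups.values())

def adaptive_split(records, indices, split_fields, max_block_size):
    """Recursively split a block until it fits max_block_size or cannot be split."""
    if len(indices) <= max_block_size or not split_fields:
        return [indices]
    # Choose the field with the most distinct values inside this block.
    best = max(split_fields, key=lambda f: len({records[i][f] for i in indices}))
    if len({records[i][best] for i in indices}) < 2:
        return [indices]  # no field distinguishes the records; keep the block
    remaining = [f for f in split_fields if f != best]
    result = []
    for sub in split_block(records, indices, best):
        result.extend(adaptive_split(records, sub, remaining, max_block_size))
    return result

records = [
    {"zip": "10001", "city": "NYC", "last_name": "Smith"},
    {"zip": "10001", "city": "NYC", "last_name": "Jones"},
    {"zip": "10001", "city": "NYC", "last_name": "Smith"},
    {"zip": "10001", "city": "NYC", "last_name": "Lee"},
]
oversized = [0, 1, 2, 3]  # one zip block holding all four records
print(adaptive_split(records, oversized, ["city", "last_name"], max_block_size=2))
# [[0, 2], [1], [3]]
```

Here `city` has a single value in the block, so the split falls on `last_name`.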
Sorted neighborhood
Sliding window over records sorted by a key. Catches near-matches that differ by one character in the blocking key.
blocking:
  strategy: sorted_neighborhood
  window_size: 20
  sort_key:
    - column: last_name
      transforms: [lowercase, soundex]
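The sliding window can be illustrated with a short sketch. It assumes each record is paired with the next `window_size - 1` records in sort order and uses only a lowercase transform; the function name is illustrative.

```python
def sorted_neighborhood_pairs(values, window_size):
    """Pair each record with the next window_size - 1 records in sort order."""
    order = sorted(range(len(values)), key=lambda i: values[i].lower())
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window_size]:
            pairs.add((min(i, j), max(i, j)))
    return pairs

names = ["Smith", "Smyth", "Jones", "Smithe", "Adams"]
pairs = sorted_neighborhood_pairs(names, window_size=3)
print((0, 1) in pairs)  # True: "Smith" and "Smyth" land in the same window
print(len(pairs))       # 7 of the 10 possible pairs
```

"Smith" and "Smyth" would never share an exact-match block on the raw name, but they sit close together once sorted. With only five records a window of 3 prunes little; the savings grow with the number of records.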
Multi-pass blocking
Run multiple blocking passes and union the results. Best recall for noisy data.
blocking:
  strategy: multi_pass
  union_mode: true
  passes:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
    - fields: [first_name]
      transforms: [lowercase, first_token]
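Why the union helps recall can be seen in a small sketch that measures how many known duplicate pairs each pass recovers on its own and together. The data, the `true_pairs` labels, and the helper names are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def pass_pairs(records, key_fn):
    """Candidate pairs from one blocking pass keyed by key_fn(record)."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key_fn(rec)].append(idx)
    return {p for members in blocks.values() for p in combinations(members, 2)}

def recall(candidates, true_pairs):
    """Fraction of known duplicate pairs that survive blocking."""
    return len(candidates & true_pairs) / len(true_pairs)

records = [
    {"zip": "10001", "last_name": "Smith"},
    {"zip": "10002", "last_name": "smith"},  # zip differs from record 0
    {"zip": "94105", "last_name": "Jones"},
    {"zip": "94105", "last_name": "Jonse"},  # name typo
]
true_pairs = {(0, 1), (2, 3)}

by_zip = pass_pairs(records, lambda r: r["zip"])
by_name = pass_pairs(records, lambda r: r["last_name"].lower())
print(recall(by_zip, true_pairs))            # 0.5
print(recall(by_name, true_pairs))           # 0.5
print(recall(by_zip | by_name, true_pairs))  # 1.0
```

Each pass misses the duplicate whose error falls in its own key; the union recovers both, at the cost of more candidate pairs to score.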
ANN blocking
Use FAISS approximate nearest-neighbor search on sentence-transformer embeddings. Requires pip install goldenmatch[embeddings].
blocking:
  strategy: ann
  ann_column: description
  ann_model: all-MiniLM-L6-v2
  ann_top_k: 20
ann_pairs is a faster variant (50–100x) that returns direct pairs instead of block groups:
blocking:
  strategy: ann_pairs
  ann_column: title
  ann_top_k: 20
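The direct-pair idea can be sketched without FAISS or an embedding model. The example below swaps in an exact brute-force cosine search over hand-written 2-D vectors, so it shows only the shape of the output (pairs rather than block groups); it is not how GoldenMatch or FAISS performs the search, and `top_k_pairs` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_pairs(vectors, top_k):
    """Link each vector to its top_k most similar neighbors and return the pairs."""
    pairs = set()
    for i, vec in enumerate(vectors):
        scored = [(cosine(vec, other), j) for j, other in enumerate(vectors) if j != i]
        scored.sort(reverse=True)  # most similar first
        for _, j in scored[:top_k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs

# Stand-ins for sentence embeddings: two tight groups of two vectors each.
vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
print(sorted(top_k_pairs(vectors, top_k=1)))  # [(0, 1), (2, 3)]
```

The brute-force loop is quadratic; an approximate index exists precisely to avoid that cost on large datasets.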
Canopy blocking
TF-IDF-based canopy clustering with loose and tight thresholds.
blocking:
  strategy: canopy
  canopy:
    fields: [name, address]
    loose_threshold: 0.3
    tight_threshold: 0.7
    max_canopy_size: 500
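A minimal sketch of how the two thresholds interact in classic canopy clustering. To stay short it uses token Jaccard similarity instead of TF-IDF and ignores `max_canopy_size`, so treat it as an illustration of the thresholds rather than GoldenMatch's algorithm.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def canopies(texts, loose_threshold, tight_threshold):
    """Build overlapping canopies around successive center records."""
    remaining = list(range(len(texts)))  # records still eligible as centers
    result = []
    while remaining:
        center = remaining[0]
        sims = {i: jaccard(texts[center], texts[i]) for i in range(len(texts))}
        # Everything within the loose threshold joins this canopy.
        result.append([i for i in range(len(texts)) if sims[i] >= loose_threshold])
        # The center and everything within the tight threshold stop being centers.
        remaining = [i for i in remaining if i != center and sims[i] < tight_threshold]
    return result

texts = [
    "acme corp new york",
    "acme corp new york ny",
    "acme corporation boston",
    "globex inc chicago",
]
print(canopies(texts, loose_threshold=0.3, tight_threshold=0.7))
# [[0, 1], [2], [3]]
```

A record between the two thresholds joins the canopy but can still seed its own, which is what lets canopies overlap.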
Learned blocking
Data-driven predicate selection via a two-pass approach: sample pairs and train predicates, then apply them to the full data. On DBLP-ACM it achieves 96.9% F1, on par with hand-tuned static blocking.
blocking:
  strategy: learned
  learned_sample_size: 5000
  learned_min_recall: 0.95
  learned_min_reduction: 0.90
  learned_predicate_depth: 2
  learned_cache_path: .goldenmatch/learned_blocking.pkl
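import goldenmatch as gm

# Assumes df (the input records) and matchkey are already defined.
# Learn blocking rules from a sample of the data.
rules = gm.learn_blocking_rules(df, matchkey, sample_size=5000)
# Apply the learned rules to the full dataset to build blocks.
blocks = gm.apply_learned_blocks(df, rules)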
Cache the learned rules to skip re-training on subsequent runs.
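To show what the recall and reduction thresholds measure, here is a simplified sketch that scores candidate predicates on a labeled sample and keeps those that clear both bars. It is a stand-in for the idea, not the algorithm behind `learn_blocking_rules`; the predicates, sample data, and helper names are all hypothetical.

```python
from itertools import combinations

def candidate_pairs(records, predicate):
    """Pairs of records that receive the same value from the predicate."""
    blocks = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(predicate(rec), []).append(idx)
    return {p for members in blocks.values() for p in combinations(members, 2)}

def evaluate(records, predicate, true_pairs):
    """Return (recall, reduction ratio) for one blocking predicate."""
    pairs = candidate_pairs(records, predicate)
    total = len(records) * (len(records) - 1) // 2
    recall = len(pairs & true_pairs) / len(true_pairs)
    reduction = 1 - len(pairs) / total
    return recall, reduction

def select_predicates(records, predicates, true_pairs, min_recall, min_reduction):
    """Names of the predicates that meet both thresholds on the sample."""
    selected = []
    for name, predicate in predicates.items():
        recall, reduction = evaluate(records, predicate, true_pairs)
        if recall >= min_recall and reduction >= min_reduction:
            selected.append(name)
    return selected

sample = [
    {"title": "Query Optimization", "year": 1999},
    {"title": "query optimization", "year": 1999},
    {"title": "Data Cleaning", "year": 1999},
    {"title": "Data cleaning", "year": 2001},
]
true_pairs = {(0, 1), (2, 3)}
predicates = {
    "same_year": lambda r: r["year"],
    "title_lower": lambda r: r["title"].lower(),
    "title_first_token": lambda r: r["title"].lower().split()[0],
}
print(select_predicates(sample, predicates, true_pairs, min_recall=0.95, min_reduction=0.6))
# ['title_lower', 'title_first_token']
```

`same_year` is rejected because it recovers only one of the two true pairs while still generating three candidates.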
Auto-select
Let GoldenMatch pick the best blocking key by histogram analysis:
blocking:
  auto_select: true
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
    - fields: [city]
The analyzer scores each key by block count, max block size, estimated comparisons, and recall. Use the CLI to see suggestions:
goldenmatch analyze-blocking customers.csv --config config.yaml
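The histogram statistics behind that scoring can be approximated by hand. The sketch below computes block count, max block size, and estimated comparisons for a candidate key; it leaves out recall, which needs labeled pairs, and `block_stats` is an illustrative helper rather than the analyzer's actual code.

```python
from collections import Counter
from math import comb

def block_stats(records, field):
    """Block-size histogram statistics for one candidate blocking key."""
    sizes = Counter(rec[field] for rec in records)
    return {
        "blocks": len(sizes),
        "max_size": max(sizes.values()),
        # Each block of size n contributes n * (n - 1) / 2 comparisons.
        "comparisons": sum(comb(n, 2) for n in sizes.values()),
    }

records = [
    {"state": "NY", "zip": "10001"},
    {"state": "NY", "zip": "10001"},
    {"state": "NY", "zip": "10002"},
    {"state": "CA", "zip": "94105"},
    {"state": "CA", "zip": "94105"},
]
for field in ("state", "zip"):
    print(field, block_stats(records, field))
# state {'blocks': 2, 'max_size': 3, 'comparisons': 4}
# zip {'blocks': 3, 'max_size': 2, 'comparisons': 2}
```

Even on five records the coarser key produces twice the comparisons of the finer one.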
Performance impact
Blocking key choice dominates fuzzy matching performance. A coarse key (e.g., state) creates huge blocks and slow scoring. A fine key (e.g., email) misses near-duplicates.
| Key | Records | Blocks | Max Size | Comparisons | Time |
|---|---|---|---|---|---|
| zip | 100K | 8,200 | 340 | 1.2M | 12s |
| state | 100K | 50 | 12,000 | 45M | 320s |
| last_name + soundex | 100K | 4,100 | 180 | 0.8M | 9s |
| learned | 100K | 3,800 | 200 | 0.9M | 10s |
Rules of thumb:
- Target max block size under 1,000 records
- Use multi_pass for best recall, adaptive for best speed
- Use learned to auto-discover optimal predicates
- Use ann_pairs for semantic/product matching