Blocking Strategies

Blocking reduces the comparison space from O(N^2) to roughly O(N*B), where N is the number of records and B is the typical block size, by comparing only records that share a blocking key. GoldenMatch supports 8 strategies.
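
To make the saving concrete, the arithmetic can be sketched as below. This is an illustration only: it assumes perfectly equal block sizes, which real keys rarely produce, and the helper names are hypothetical rather than part of GoldenMatch.

```python
def pair_count(n: int) -> int:
    """Unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pair_count(n: int, block_size: int) -> int:
    """Pairs compared when n records fall into equal-sized blocks.

    Equal block sizes are an idealisation; real keys are skewed.
    """
    full_blocks, remainder = divmod(n, block_size)
    return full_blocks * pair_count(block_size) + pair_count(remainder)


n = 100_000
all_pairs = pair_count(n)             # every record against every other
blocked = blocked_pair_count(n, 100)  # only within blocks of 100 records
print(all_pairs, blocked)
```

With skewed keys the largest blocks dominate the total, which is why the maximum block size matters more than the average.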


Strategy overview

| Strategy | Description | Best For |
| --- | --- | --- |
| static | Group by blocking key | Clean data with reliable keys |
| adaptive | Static + recursive sub-blocking for oversized blocks | Default choice |
| sorted_neighborhood | Sliding window over sorted records | Typos in blocking key |
| multi_pass | Union of blocks from multiple passes | Noisy data, best recall |
| ann | FAISS nearest-neighbor on embeddings | Semantic matching |
| ann_pairs | Direct-pair ANN scoring | 50–100x faster than ann |
| canopy | TF-IDF canopy clustering | Text-heavy data |
| learned | Data-driven predicate selection | Auto-discovers rules |

Static blocking

Group records by exact value of the blocking key.

blocking:
  strategy: static
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]

Multiple keys produce independent blocks that are unioned. Transforms are applied before grouping.
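
As a rough sketch of the idea (not GoldenMatch's implementation), grouping by key and unioning the resulting pairs could look like this. The function names are hypothetical, and plain lowercasing stands in for the soundex transform.

```python
from collections import defaultdict
from itertools import combinations


def make_key(record, fields, transforms=()):
    """Build a blocking key by applying each transform to each field value."""
    parts = []
    for field in fields:
        value = str(record[field])
        for transform in transforms:
            value = transform(value)
        parts.append(value)
    return tuple(parts)


def static_blocks(records, fields, transforms=()):
    """Map each key to the indices of the records that share it."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[make_key(record, fields, transforms)].append(idx)
    return blocks


def candidate_pairs(records, key_specs):
    """Union the within-block pairs produced by every key spec."""
    pairs = set()
    for fields, transforms in key_specs:
        for members in static_blocks(records, fields, transforms).values():
            pairs.update(combinations(members, 2))
    return pairs


records = [
    {"zip": "10001", "last_name": "Smith"},
    {"zip": "10001", "last_name": "Jones"},
    {"zip": "94105", "last_name": "SMITH"},
]
key_specs = [(["zip"], ()), (["last_name"], (str.lower,))]
pairs = candidate_pairs(records, key_specs)
print(sorted(pairs))
```

Records 0 and 2 share no zip, but the transformed last_name key still pairs them, which is the point of unioning independent keys.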


Adaptive blocking

Static blocking with automatic sub-splitting of oversized blocks. When a block exceeds max_block_size, it is split on the highest-cardinality column within that block.

blocking:
  strategy: adaptive
  max_block_size: 5000
  keys:
    - fields: [zip]
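
The recursive split can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the explicit split_columns argument are hypothetical, and how GoldenMatch chooses eligible columns is not specified here.

```python
from collections import defaultdict


def split_block(records, members, max_block_size, split_columns):
    """Recursively split a block (list of record indices) until it fits.

    split_columns is an assumed whitelist of columns allowed for splitting.
    """
    if len(members) <= max_block_size or not split_columns:
        return [members]

    def cardinality(column):
        return len({records[i][column] for i in members})

    # Split on the column with the most distinct values inside this block.
    best = max(split_columns, key=cardinality)
    if cardinality(best) <= 1:
        return [members]  # no column can separate these records

    groups = defaultdict(list)
    for i in members:
        groups[records[i][best]].append(i)

    remaining = [c for c in split_columns if c != best]
    result = []
    for sub_block in groups.values():
        result.extend(split_block(records, sub_block, max_block_size, remaining))
    return result


records = [
    {"zip": "10001", "city": "NYC", "last_name": "smith"},
    {"zip": "10001", "city": "NYC", "last_name": "smith"},
    {"zip": "10001", "city": "NYC", "last_name": "jones"},
    {"zip": "10001", "city": "NYC", "last_name": "jones"},
    {"zip": "10001", "city": "NYC", "last_name": "lee"},
]
sub_blocks = split_block(records, [0, 1, 2, 3, 4], 2, ["city", "last_name"])
print(sub_blocks)
```

Note the trade-off: every split lowers cost but can separate true duplicates whose values differ in the split column.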

Sorted neighborhood

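Slides a fixed-size window over records sorted by a key and compares only the records inside the same window. This catches near-matches whose keys differ slightly (for example, by one character) but still sort close to each other.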

blocking:
  strategy: sorted_neighborhood
  window_size: 20
  sort_key:
    - column: last_name
      transforms: [lowercase, soundex]
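
A minimal sketch of the sliding window, assuming each record is compared with the next window_size - 1 records in sort order. The function name is hypothetical and plain lowercasing stands in for the configured transforms.

```python
def sorted_neighborhood_pairs(records, sort_key, window_size):
    """Pair each record with the next (window_size - 1) records in sort order."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window_size]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


names = ["Smith", "Smyth", "Jones", "Smithe", "Adams"]
pairs = sorted_neighborhood_pairs(names, str.lower, window_size=3)
print(sorted(pairs))
```

The number of comparisons grows as roughly N * (window_size - 1), independent of how skewed the key values are. A typo near the start of the key can still move a record far away in sort order, so the window only helps with differences that keep records adjacent.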

Multi-pass blocking

Run multiple blocking passes and union the results. Best recall for noisy data.

blocking:
  strategy: multi_pass
  union_mode: true
  passes:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
    - fields: [first_name]
      transforms: [lowercase, first_token]
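
The recall benefit of unioning passes can be illustrated with a small sketch. The names are hypothetical, and lowercasing plus a first-token split stand in for the configured transforms (soundex is omitted).

```python
from collections import defaultdict
from itertools import combinations


def pass_pairs(records, key_fn):
    """Candidate pairs from one blocking pass."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[key_fn(record)].append(idx)
    return {p for members in blocks.values() for p in combinations(members, 2)}


records = [
    {"zip": "10001", "last_name": "Smith", "first_name": "Ann Marie"},
    {"zip": "10002", "last_name": "smith", "first_name": "ann"},   # zip differs
    {"zip": "10001", "last_name": "Smyth", "first_name": "Anne"},  # name differs
]
passes = {
    "zip": lambda r: r["zip"],
    "last_name": lambda r: r["last_name"].lower(),
    "first_name": lambda r: r["first_name"].lower().split()[0],
}
per_pass = {name: pass_pairs(records, fn) for name, fn in passes.items()}
candidates = set().union(*per_pass.values())
print(per_pass, candidates)
```

No single pass finds both pairs here; the union does, at the cost of scoring more candidates.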

ANN blocking

Use FAISS approximate nearest-neighbor search on sentence-transformer embeddings. Requires pip install goldenmatch[embeddings].

blocking:
  strategy: ann
  ann_column: description
  ann_model: all-MiniLM-L6-v2
  ann_top_k: 20

ann_pairs is a faster variant (50–100x) that returns direct pairs instead of block groups:

blocking:
  strategy: ann_pairs
  ann_column: title
  ann_top_k: 20
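
To show what "direct pairs" from a top-k search look like, here is a toy sketch. It deliberately swaps in exact brute-force cosine search over hand-written 2-D vectors in place of FAISS and sentence-transformer embeddings, so it illustrates the output shape only, not the library's internals or its speed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_pairs(vectors, k):
    """Exact top-k neighbours per record, returned as unique (i, j) pairs."""
    pairs = set()
    for i, v in enumerate(vectors):
        scored = sorted(
            ((cosine(v, w), j) for j, w in enumerate(vectors) if j != i),
            reverse=True,
        )
        for _, j in scored[:k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Toy stand-ins for embeddings: two clusters along different axes.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
pairs = top_k_pairs(vectors, k=1)
print(sorted(pairs))
```

Each record contributes at most k candidates, so the candidate count is bounded by N * k regardless of how the embeddings cluster.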

Canopy blocking

TF-IDF-based canopy clustering with loose and tight thresholds.

blocking:
  strategy: canopy
  canopy:
    fields: [name, address]
    loose_threshold: 0.3
    tight_threshold: 0.7
    max_canopy_size: 500
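
The role of the two thresholds can be sketched with the classic canopy procedure. This is a simplified illustration: token-set Jaccard similarity replaces TF-IDF cosine similarity, max_canopy_size is ignored, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Token-set overlap in [0, 1]; a stand-in for TF-IDF cosine similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def canopy_blocks(texts, loose_threshold, tight_threshold):
    """Build overlapping canopies of record indices."""
    tokens = [set(t.lower().split()) for t in texts]
    remaining = list(range(len(texts)))  # records still eligible as centers
    canopies = []
    while remaining:
        center = remaining[0]
        sims = [jaccard(tokens[center], tok) for tok in tokens]
        # Loose threshold: everything similar enough joins this canopy.
        canopies.append([i for i, s in enumerate(sims) if s >= loose_threshold])
        # Tight threshold: very similar records can no longer start a canopy.
        remaining = [
            i for i in remaining if i != center and sims[i] < tight_threshold
        ]
    return canopies


texts = [
    "acme corp main st",
    "acme corp main st ny",
    "acme corp oak ave",
    "zenith labs oak ave",
]
canopies = canopy_blocks(texts, loose_threshold=0.3, tight_threshold=0.7)
print(canopies)
```

Canopies overlap by design: record 2 lands in every canopy here, so it is compared with both the "acme" and the "oak ave" records.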

Learned blocking

Data-driven predicate selection in two passes: sample pairs and train predicates on the sample, then apply the learned predicates to the full data. On DBLP-ACM it reaches 96.9% F1, matching hand-tuned static blocking.

blocking:
  strategy: learned
  learned_sample_size: 5000
  learned_min_recall: 0.95
  learned_min_reduction: 0.90
  learned_predicate_depth: 2
  learned_cache_path: .goldenmatch/learned_blocking.pkl

import goldenmatch as gm

rules = gm.learn_blocking_rules(df, matchkey, sample_size=5000)
blocks = gm.apply_learned_blocks(df, rules)

Cache the learned rules to skip re-training on subsequent runs.
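
The two thresholds, learned_min_recall and learned_min_reduction, can be read as filters on candidate predicates. The sketch below is a simplification under stated assumptions: it scores single predicates against a hand-labelled sample and keeps those that pass both thresholds, whereas the actual learner may combine predicates (see learned_predicate_depth). All function names are hypothetical.

```python
from collections import Counter


def evaluate(records, key_fn, true_pairs):
    """Return (recall, reduction ratio) for one candidate predicate."""
    keys = [key_fn(r) for r in records]
    n = len(records)
    total_pairs = n * (n - 1) // 2
    candidate_pairs = sum(c * (c - 1) // 2 for c in Counter(keys).values())
    covered = sum(1 for i, j in true_pairs if keys[i] == keys[j])
    recall = covered / len(true_pairs)
    reduction = 1 - candidate_pairs / total_pairs
    return recall, reduction


def select_predicates(records, predicates, true_pairs, min_recall, min_reduction):
    """Keep predicates that meet both the recall and reduction thresholds."""
    chosen = []
    for name, key_fn in predicates.items():
        recall, reduction = evaluate(records, key_fn, true_pairs)
        if recall >= min_recall and reduction >= min_reduction:
            chosen.append(name)
    return chosen


sample = [
    {"last_name": "Smith", "zip": "10001"},
    {"last_name": "smith", "zip": "10001"},
    {"last_name": "Jones", "zip": "10001"},
    {"last_name": "JONES", "zip": "10002"},
    {"last_name": "Lee", "zip": "10002"},
    {"last_name": "lee", "zip": "10003"},
]
true_pairs = [(0, 1), (2, 3), (4, 5)]  # labelled duplicates in the sample
predicates = {
    "zip": lambda r: r["zip"],
    "last_name_lower": lambda r: r["last_name"].lower(),
    "constant": lambda r: "",  # one giant block: perfect recall, no reduction
}
chosen = select_predicates(sample, predicates, true_pairs, 0.95, 0.75)
print(chosen)
```

The constant predicate shows why both thresholds are needed: recall alone would accept a rule that blocks nothing.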


Auto-select

Let GoldenMatch pick the best blocking key by histogram analysis:

blocking:
  auto_select: true
  keys:
    - fields: [zip]
    - fields: [last_name]
      transforms: [lowercase, soundex]
    - fields: [city]

The analyzer scores each key by block count, max block size, estimated comparisons, and recall. Use the CLI to see suggestions:

goldenmatch analyze-blocking customers.csv --config config.yaml
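
The histogram statistics mentioned above can be approximated by hand when exploring a dataset. This sketch computes block count, maximum block size, and estimated comparisons for each candidate key; it omits the recall component, which needs labelled or sampled matches, and the function name is hypothetical rather than the analyzer's API.

```python
from collections import Counter


def block_stats(key_values):
    """Summarise the block-size histogram for one candidate key."""
    sizes = Counter(key_values)
    return {
        "block_count": len(sizes),
        "max_block_size": max(sizes.values()),
        "estimated_comparisons": sum(s * (s - 1) // 2 for s in sizes.values()),
    }


# One list of key values per candidate key, aligned by record.
candidate_keys = {
    "state": ["NY", "NY", "NY", "NY", "CA", "CA"],
    "zip": ["10001", "10001", "10002", "10002", "94105", "94105"],
}
stats = {name: block_stats(values) for name, values in candidate_keys.items()}
for name, summary in stats.items():
    print(name, summary)
```

Cost statistics alone would always favour the finest key, so they should be weighed against recall before a key is chosen.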

Performance impact

Blocking key choice dominates fuzzy matching performance. A coarse key (e.g., state) creates huge blocks and slow scoring. A fine key (e.g., email) misses near-duplicates.

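| Key | Records | Blocks | Max Size | Comparisons | Time |
| --- | --- | --- | --- | --- | --- |
| zip | 100K | 8,200 | 340 | 1.2M | 12s |
| state | 100K | 50 | 12,000 | 45M | 320s |
| last_name + soundex | 100K | 4,100 | 180 | 0.8M | 9s |
| learned | 100K | 3,800 | 200 | 0.9M | 10s |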

Rules of thumb:

  • Target max block size under 1,000 records
  • Use multi_pass for best recall, adaptive for best speed
  • Use learned to auto-discover optimal predicates
  • Use ann_pairs for semantic/product matching