Privacy-Preserving Record Linkage (PPRL)
Match records across organizations without sharing raw data. GoldenMatch uses bloom filter encoding with HMAC salting and supports both trusted third party (TTP) and secure multi-party computation (SMC) protocols.
Quick start
import goldenmatch as gm
# Auto-configured: profiles data, picks fields and threshold
result = gm.pprl_link("hospital_a.csv", "hospital_b.csv")
print(f"Found {result['match_count']} matches across {len(result['clusters'])} clusters")
# Manual configuration
result = gm.pprl_link(
    "party_a.csv", "party_b.csv",
    fields=["first_name", "last_name", "dob", "zip"],
    threshold=0.85,
    security_level="high",
)
CLI:
goldenmatch pprl link party_a.csv party_b.csv --security-level high
goldenmatch pprl link a.csv b.csv --fields first_name last_name dob zip --threshold 0.85
How it works
Party A data -> Bloom filter encoding -> Encoded vectors
Party B data -> Bloom filter encoding -> Encoded vectors
                                               |
                                    Dice/Jaccard similarity
                                               |
                                         Matched pairs
- Each field value is converted to character n-grams (e.g., bigrams)
- N-grams are hashed with multiple hash functions into a bloom filter bit array
- HMAC salting ensures the same value produces different encodings with different keys
- Encoded vectors are compared using Dice or Jaccard similarity
- Pairs above the threshold are matched
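The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not GoldenMatch's implementation: the helper names (`ngrams`, `encode`, `dice`), the underscore padding, and the use of one HMAC-SHA256 call per hash index are all assumptions made for the example.

```python
import hashlib
import hmac


def ngrams(value: str, n: int = 2) -> list[str]:
    """Split a lowercased, underscore-padded string into character n-grams."""
    padded = f"_{value.strip().lower()}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def encode(value: str, key: bytes, n: int = 2, k: int = 30, m: int = 1024) -> set[int]:
    """Return the set of bit positions set in an m-bit bloom filter."""
    positions = set()
    for gram in ngrams(value, n):
        for i in range(k):
            # One keyed hash per (hash index, n-gram); the secret key acts as the salt.
            digest = hmac.new(key, f"{i}:{gram}".encode(), hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % m)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # hypothetical key agreed between the parties
same = dice(encode("Jonathan", key), encode("Jonathan", key))
typo = dice(encode("Jonathan", key), encode("Jonathon", key))
other_key = dice(encode("Jonathan", key), encode("Jonathan", b"different-secret"))
```

Identical values under the same key score 1.0, a one-letter typo still scores high because most bigrams are shared, and the same value encoded under a different key scores low because the bit positions no longer line up.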
Bloom filter parameters
| Parameter | Description | Standard | High | Paranoid |
|---|---|---|---|---|
| ngram_size | Character n-gram size | 2 | 2 | 3 |
| hash_functions | Number of hash functions (k) | 20 | 30 | 40 |
| bloom_filter_size | Bit array length | 512 | 1024 | 2048 |
Larger bloom filters and more hash functions increase privacy at the cost of matching precision.
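To get a feel for how these parameters interact, the standard bloom filter fill formula estimates what fraction of bits end up set. The sketch below assumes uniform, independent hash functions and a value that yields 10 n-grams for every preset (a simplification, since trigrams produce slightly fewer n-grams than bigrams); `expected_fill` is a hypothetical helper, not part of GoldenMatch.

```python
def expected_fill(num_ngrams: int, k: int, m: int) -> float:
    """Expected fraction of set bits after num_ngrams * k insertions into m bits."""
    return 1.0 - (1.0 - 1.0 / m) ** (num_ngrams * k)


# (hash_functions, bloom_filter_size) taken from the table above
presets = {"standard": (20, 512), "high": (30, 1024), "paranoid": (40, 2048)}

fills = {name: expected_fill(10, k, m) for name, (k, m) in presets.items()}
for name, fill in fills.items():
    print(f"{name}: {fill:.1%} of bits set")
```

Running the same calculation with the n-gram counts typical of your own fields shows how dense each encoding will be before any padding is applied.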
Security levels
Standard
Basic bloom filter encoding. Suitable for internal use across trusted departments.
result = gm.pprl_link("a.csv", "b.csv", security_level="standard")
High (default)
HMAC salting with per-field keys. Prevents frequency analysis attacks.
result = gm.pprl_link("a.csv", "b.csv", security_level="high")
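As an illustration of what per-field keying means, the sketch below derives a separate HMAC key for each field from one shared secret, so the same n-gram is hashed differently in `first_name` than in `last_name`. The derivation scheme and the names `field_key` and `bit_position` are assumptions for this example; the source does not describe how GoldenMatch manages keys.

```python
import hashlib
import hmac


def field_key(master_secret: bytes, field: str) -> bytes:
    """Derive a distinct key per field (hypothetical derivation scheme)."""
    return hmac.new(master_secret, field.encode(), hashlib.sha256).digest()


def bit_position(key: bytes, token: str, m: int = 1024) -> int:
    """Map one token to one bit position using a keyed hash."""
    digest = hmac.new(key, token.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % m


master = b"agreed-out-of-band"  # placeholder secret shared by both parties
k_first = field_key(master, "first_name")
k_last = field_key(master, "last_name")

# The same bigram is hashed under a different key in each field.
pos_first = bit_position(k_first, "sm")
pos_last = bit_position(k_last, "sm")
```

Both parties must derive the same keys for their encodings to be comparable, while anyone without the secret cannot recompute the mapping from n-grams to bits.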
Paranoid
HMAC salting + balanced padding. Padding equalizes bloom filter density to prevent inference from bit population counts.
result = gm.pprl_link("a.csv", "b.csv", security_level="paranoid")
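One known way to equalize density is to append the bitwise complement of the filter and then apply a secret permutation, so every encoding has exactly the same number of set bits. The sketch below shows that construction only as an illustration; the source does not say which padding scheme GoldenMatch uses, and `balance` is a hypothetical helper. Python's `random` module stands in for a keyed permutation and is not cryptographically secure.

```python
import random


def balance(bits: list[int], seed: int) -> list[int]:
    """Append the bitwise complement, then apply a seeded permutation."""
    doubled = bits + [1 - b for b in bits]
    order = list(range(len(doubled)))
    # Same seed -> same permutation, so both parties stay comparable.
    random.Random(seed).shuffle(order)
    return [doubled[i] for i in order]


sparse = [1, 0, 0, 0, 0, 0, 0, 0]  # short value: few bits set
dense = [1, 1, 1, 1, 1, 1, 0, 1]   # long value: many bits set

balanced_sparse = balance(sparse, seed=7)
balanced_dense = balance(dense, seed=7)
```

After balancing, both vectors have 8 of 16 bits set, so the population count no longer hints at the length of the underlying value.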
Protocols
Trusted Third Party (TTP)
Both parties send encoded vectors to a trusted intermediary who performs the matching.
from goldenmatch.pprl.protocol import PPRLConfig, link_trusted_third_party
config = PPRLConfig(fields=["name", "dob", "zip"], threshold=0.85)
result = link_trusted_third_party(party_a_data, party_b_data, config)
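Conceptually, the third party only ever sees record IDs and encoded vectors. The toy sketch below illustrates that division of labor with hand-written sets of bit positions; `linkage_unit` and the sample data are hypothetical and do not reflect the internals of `link_trusted_third_party`.

```python
def dice(a: frozenset, b: frozenset) -> float:
    """Dice coefficient over two sets of bit positions."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def linkage_unit(encoded_a: dict, encoded_b: dict, threshold: float) -> list[tuple]:
    """Runs at the third party: compares encodings, returns matching ID pairs."""
    return [
        (id_a, id_b)
        for id_a, bits_a in encoded_a.items()
        for id_b, bits_b in encoded_b.items()
        if dice(bits_a, bits_b) >= threshold
    ]


# Toy encodings that each party would compute locally before sending.
party_a = {"a1": frozenset({1, 4, 9, 12}), "a2": frozenset({2, 3, 5, 8})}
party_b = {"b1": frozenset({1, 4, 9, 13}), "b2": frozenset({20, 21, 22, 23})}

pairs = linkage_unit(party_a, party_b, threshold=0.7)
print(pairs)  # [('a1', 'b1')]
```

Only the matched ID pairs go back to the parties; the raw field values never leave their owners.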
Secure Multi-Party Computation (SMC)
No trusted intermediary required. Parties exchange encrypted similarity computations.
from goldenmatch.pprl.protocol import link_smc
result = link_smc(party_a_data, party_b_data, config)
Auto-configuration
pprl_auto_config profiles your data and selects optimal fields, bloom filter parameters, and threshold.
import goldenmatch as gm
config = gm.pprl_auto_config(df)
print(config.recommended_fields) # ['first_name', 'last_name', 'zip_code', 'birth_year']
print(config.recommended_config) # PPRLConfig with optimal parameters
Auto-config heuristics:
- Penalizes near-unique fields (IDs) – they leak information
- Penalizes long fields (>15 chars) – more bits needed
- Penalizes high-null fields – reduce match quality
- Limits selection to 4 fields (4 fields outperformed 6 in benchmarks)
- Minimum threshold 0.85
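The field-selection heuristics above could be approximated with a simple scoring function. The sketch below is purely illustrative: the penalty sizes, the 0.95 uniqueness cutoff, and the names `score_field` and `pick_fields` are assumptions, not the logic inside `pprl_auto_config`.

```python
def score_field(values: list) -> float:
    """Score a column; higher is better. Penalties and cutoffs are illustrative."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return 0.0
    null_rate = 1 - len(present) / len(values)
    uniqueness = len(set(present)) / len(present)
    avg_len = sum(len(str(v)) for v in present) / len(present)

    score = 1.0
    if uniqueness > 0.95:  # near-unique, ID-like field
        score -= 0.5
    if avg_len > 15:       # long values need more bits
        score -= 0.3
    score -= null_rate     # missing values reduce match quality
    return max(score, 0.0)


def pick_fields(columns: dict[str, list], limit: int = 4) -> list[str]:
    """Return the best-scoring column names, capped at `limit`."""
    ranked = sorted(columns, key=lambda name: score_field(columns[name]), reverse=True)
    return ranked[:limit]


columns = {
    "record_id": ["a1", "a2", "a3", "a4"],
    "first_name": ["ann", "ann", "bob", "bob"],
    "notes": [None, None, None, "x"],
}
best = pick_fields(columns, limit=1)
```

Here `first_name` wins: `record_id` is penalized for being unique per row and `notes` for being mostly empty.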
CLI:
goldenmatch pprl auto-config data.csv
PPRLConfig
gm.PPRLConfig(
    fields: list[str],
    threshold: float = 0.85,
    security_level: str = "high",
    ngram_size: int = 2,
    hash_functions: int = 30,
    bloom_filter_size: int = 1024,
    protocol: str = "trusted_third_party",
)
Low-level API
from goldenmatch.pprl.protocol import run_pprl, compute_bloom_filters, PartyData, LinkageResult
# Compute bloom filters manually
bf_a = compute_bloom_filters(df_a, fields, config)
bf_b = compute_bloom_filters(df_b, fields, config)
# Run matching
result: LinkageResult = run_pprl(df_a, df_b, config)
print(result.clusters)
print(result.match_count)
print(result.total_comparisons)
Vectorized similarity uses a NumPy matrix multiply (mat_a @ mat_b.T) to compute bloom filter Dice for all pairs at once, which is 13x faster than per-pair Python loops.
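The matrix-multiply trick works because the dot product of two 0/1 vectors equals the number of bits they share. A minimal sketch, assuming each row of `mat_a` and `mat_b` is one record's bit array; `dice_matrix` is a hypothetical helper, not a GoldenMatch function:

```python
import numpy as np


def dice_matrix(mat_a: np.ndarray, mat_b: np.ndarray) -> np.ndarray:
    """All-pairs Dice between the rows of two 0/1 matrices."""
    a = mat_a.astype(np.int32)
    b = mat_b.astype(np.int32)
    intersections = a @ b.T  # shape (n_a, n_b): shared set bits per pair
    totals = a.sum(axis=1)[:, None] + b.sum(axis=1)[None, :]
    # Leave the score at 0.0 where both filters are empty (avoids divide-by-zero).
    return np.divide(
        2 * intersections,
        totals,
        out=np.zeros(totals.shape, dtype=float),
        where=totals > 0,
    )


mat_a = np.array([[1, 1, 0, 0], [0, 0, 0, 0]])
mat_b = np.array([[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]])
scores = dice_matrix(mat_a, mat_b)
```

The result has one row per record in `mat_a` and one column per record in `mat_b`, so thresholding is a single `scores >= threshold` comparison.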
PPRL in YAML config
Use the bloom_filter transform and dice/jaccard scorer:
matchkeys:
  - name: pprl_match
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        transforms: [lowercase, strip, "bloom_filter:2:30:1024"]
        scorer: dice
        weight: 0.3
      - field: last_name
        transforms: [lowercase, strip, "bloom_filter:2:30:1024"]
        scorer: dice
        weight: 0.4
      - field: zip
        transforms: ["bloom_filter:2:30:1024"]
        scorer: dice
        weight: 0.3
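To make the weights concrete, here is how a weighted matchkey of this shape might combine per-field Dice scores. The sketch assumes a weight-normalized average compared against the threshold; the source does not spell out the exact combination rule, and `weighted_score` and the sample scores are hypothetical.

```python
def weighted_score(field_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weight-normalized average of per-field similarity scores."""
    total_weight = sum(weights.values())
    return sum(weights[f] * field_scores[f] for f in weights) / total_weight


weights = {"first_name": 0.3, "last_name": 0.4, "zip": 0.3}  # from the YAML above
scores = {"first_name": 0.9, "last_name": 1.0, "zip": 0.7}   # example Dice scores

combined = weighted_score(scores, weights)  # 0.27 + 0.40 + 0.21 = 0.88
is_match = combined >= 0.85
```

With these example scores the pair clears the 0.85 threshold even though the zip score alone would not.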
Benchmarks
FEBRL4 (5K vs 5K synthetic person records)
| Strategy | Precision | Recall | F1 | Privacy |
|---|---|---|---|---|
| Normal fuzzy (baseline) | 56.5% | 74.6% | 64.3% | None |
| PPRL manual tuning | 98.2% | 82.6% | 89.8% | Per-field HMAC |
| PPRL auto-config | 99.7% | 86.1% | 92.4% | Per-field HMAC |
| PPRL paranoid | 98.9% | 76.0% | 86.0% | HMAC + balanced |
NCVR (North Carolina Voter Registration)
| Strategy | Precision | Recall | F1 |
|---|---|---|---|
| PPRL auto-config | 64.0% | 93.8% | 76.1% |
Auto-configuration beats manual tuning on FEBRL4 (92.4% vs. 89.8% F1); the NCVR table above reports auto-config only. Zero-config PPRL profiles your data and picks parameters automatically.
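The F1 column is the harmonic mean of precision and recall, so the auto-config rows can be recomputed directly from the tables. A small check using the reported figures:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


febrl_auto = f1(0.997, 0.861)  # FEBRL4, PPRL auto-config row
ncvr_auto = f1(0.640, 0.938)   # NCVR, PPRL auto-config row

print(f"FEBRL4 auto-config F1: {febrl_auto:.1%}")  # 92.4%
print(f"NCVR auto-config F1:   {ncvr_auto:.1%}")   # 76.1%
```

The NCVR result shows the usual trade-off: recall is high, but the lower precision pulls the harmonic mean down.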
MCP tools
The MCP server exposes PPRL tools for Claude Desktop:
| Tool | Description |
|---|---|
| pprl_auto_config | Analyze data and recommend PPRL parameters |
| pprl_link | Run privacy-preserving linkage |