Match records across organizations without sharing raw data. GoldenMatch uses bloom filter encoding with HMAC salting and supports both trusted third party (TTP) and secure multi-party computation (SMC) protocols.
import goldenmatch as gm
# Auto-configured: profiles data, picks fields and threshold
result = gm.pprl_link("hospital_a.csv", "hospital_b.csv")
print(f"Found {result['match_count']} matches across {len(result['clusters'])} clusters")
# Manual configuration
result = gm.pprl_link(
    "party_a.csv", "party_b.csv",
    fields=["first_name", "last_name", "dob", "zip"],
    threshold=0.85,
    security_level="high",
)
CLI:
goldenmatch pprl link party_a.csv party_b.csv --security-level high
goldenmatch pprl link a.csv b.csv --fields first_name last_name dob zip --threshold 0.85
Party A data -> Bloom filter encoding -> Encoded vectors
Party B data -> Bloom filter encoding -> Encoded vectors
                                              |
                                   Dice/Jaccard similarity
                                              |
                                        Matched pairs
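The flow above can be sketched in a few lines of standard-library Python. This is a minimal illustration of bloom filter encoding and Dice comparison, not GoldenMatch's internal implementation: the helper names (`ngrams`, `encode`, `dice`), the padding character, and the way hash positions are derived from HMAC-SHA256 are all assumptions made for the example.

```python
import hashlib
import hmac


def ngrams(value: str, n: int = 2) -> set[str]:
    """Split a normalized, padded string into character n-grams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def encode(value: str, key: bytes, n: int = 2, k: int = 30, m: int = 1024) -> set[int]:
    """Return the bit positions set by k keyed hashes of every n-gram."""
    bits = set()
    for gram in ngrams(value, n):
        for i in range(k):
            digest = hmac.new(key, f"{i}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # hypothetical key agreed between the two parties

same = dice(encode("Jonathan", key), encode("jonathan ", key))   # same value after normalization
close = dice(encode("Jonathan", key), encode("Jonathon", key))   # one-character typo
far = dice(encode("Jonathan", key), encode("Margaret", key))     # unrelated name
print(same, close, far)
```

Only the encoded bit positions are compared, so a typo that changes one or two n-grams still leaves most positions in common, while unrelated values overlap only by chance.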
| Parameter | Description | Standard | High | Paranoid |
|---|---|---|---|---|
| `ngram_size` | Character n-gram size | 2 | 2 | 3 |
| `hash_functions` | Number of hash functions (k) | 20 | 30 | 40 |
| `bloom_filter_size` | Bit array length | 512 | 1024 | 2048 |
Larger bloom filters and more hash functions increase privacy at the cost of matching precision.
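To get a feel for how these parameters interact, the standard bloom filter estimate gives the expected fraction of set bits after hashing `g` n-grams with `k` hash functions into `m` bits: `1 - (1 - 1/m) ** (k * g)`. The sketch below applies it to the three levels in the table. The fixed count of 10 n-grams per value is a hypothetical simplification (a trigram split of the same string yields a slightly different count), and `expected_fill` is a name introduced here, not part of GoldenMatch.

```python
def expected_fill(m: int, k: int, grams: int) -> float:
    """Expected fraction of bits set after inserting `grams` n-grams with k hashes each."""
    return 1.0 - (1.0 - 1.0 / m) ** (k * grams)


# (bloom_filter_size, hash_functions) per security level, taken from the table above
levels = {"standard": (512, 20), "high": (1024, 30), "paranoid": (2048, 40)}
grams = 10  # hypothetical number of n-grams in one field value

fills = {name: expected_fill(m, k, grams) for name, (m, k) in levels.items()}
for name, fill in fills.items():
    print(f"{name}: {fill:.1%} of bits expected to be set")
```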
Basic bloom filter encoding. Suitable for internal use across trusted departments.
result = gm.pprl_link("a.csv", "b.csv", security_level="standard")
HMAC salting with per-field keys. Prevents frequency analysis attacks.
result = gm.pprl_link("a.csv", "b.csv", security_level="high")
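As an illustration of why per-field keys help, the sketch below derives one key per field from a master secret and shows that the same n-gram lands on different bit positions in different fields. The key-derivation scheme, the `field_key` and `positions` helpers, and the placeholder secret are assumptions for the example; the source does not document how GoldenMatch derives or stores its keys.

```python
import hashlib
import hmac


def field_key(master: bytes, field: str) -> bytes:
    """Derive a per-field key from a master secret (hypothetical scheme)."""
    return hmac.new(master, field.encode(), hashlib.sha256).digest()


def positions(gram: str, key: bytes, k: int = 30, m: int = 1024) -> list[int]:
    """Bit positions for one n-gram under k keyed hashes."""
    out = []
    for i in range(k):
        digest = hmac.new(key, f"{i}:{gram}".encode(), hashlib.sha256).digest()
        out.append(int.from_bytes(digest[:8], "big") % m)
    return out


master = b"example-master-secret"  # placeholder; real keys must be exchanged securely

first = positions("sm", field_key(master, "first_name"))
last = positions("sm", field_key(master, "last_name"))
print(first[:5], last[:5])  # same n-gram, different positions per field
```

Because the mapping depends on a secret key, someone holding only the encoded vectors cannot precompute the positions of common n-grams, and a frequent pattern in one field reveals nothing about the same pattern in another.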
HMAC salting + balanced padding. Padding equalizes bloom filter density to prevent inference from bit population counts.
result = gm.pprl_link("a.csv", "b.csv", security_level="paranoid")
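One way to picture density equalization is to set extra, pseudo-randomly chosen bits until every filter has the same number of set bits. The sketch below does exactly that. It is an assumption about what padding could look like, not a description of GoldenMatch's balanced padding, and `pad_to_density`, the target of 256 bits, and the seeds are all hypothetical.

```python
import random


def pad_to_density(bits: set[int], m: int, target: int, seed: int) -> set[int]:
    """Set extra pseudo-random bits until exactly `target` of the m bits are set."""
    if len(bits) > target:
        raise ValueError("filter is already denser than the target")
    rng = random.Random(seed)
    unset = sorted(set(range(m)) - bits)        # candidate positions for padding
    extra = rng.sample(unset, target - len(bits))
    return bits | set(extra)


short_value = {3, 17, 42}            # sparse filter, e.g. a very short name
long_value = set(range(100, 160))    # denser filter, e.g. a long name

padded_short = pad_to_density(short_value, m=2048, target=256, seed=1)
padded_long = pad_to_density(long_value, m=2048, target=256, seed=2)
print(len(padded_short), len(padded_long))  # identical population counts
```

After padding, the bit count no longer hints at the length of the original value. The padding bits also act as noise in the Dice score, so equalized filters generally need their thresholds tuned separately.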
Both parties send encoded vectors to a trusted intermediary who performs the matching.
from goldenmatch.pprl.protocol import PPRLConfig, link_trusted_third_party
config = PPRLConfig(fields=["name", "dob", "zip"], threshold=0.85)
result = link_trusted_third_party(party_a_data, party_b_data, config)
No trusted intermediary required. Parties exchange encrypted similarity computations.
from goldenmatch.pprl.protocol import link_smc
result = link_smc(party_a_data, party_b_data, config)
pprl_auto_config profiles your data and selects optimal fields, bloom filter parameters, and threshold.
import goldenmatch as gm
config = gm.pprl_auto_config(df)
print(config.recommended_fields) # ['first_name', 'last_name', 'zip_code', 'birth_year']
print(config.recommended_config) # PPRLConfig with optimal parameters
Auto-config heuristics:
CLI:
goldenmatch pprl auto-config data.csv
gm.PPRLConfig(
    fields: list[str],
    threshold: float = 0.85,
    security_level: str = "high",
    ngram_size: int = 2,
    hash_functions: int = 30,
    bloom_filter_size: int = 1024,
    protocol: str = "trusted_third_party",
)
from goldenmatch.pprl.protocol import run_pprl, compute_bloom_filters, PartyData, LinkageResult
# Compute bloom filters manually
bf_a = compute_bloom_filters(df_a, fields, config)
bf_b = compute_bloom_filters(df_b, fields, config)
# Run matching
result: LinkageResult = run_pprl(df_a, df_b, config)
print(result.clusters)
print(result.match_count)
print(result.total_comparisons)
Vectorized similarity computes bloom filter Dice scores with a single NumPy matrix multiply (mat_a @ mat_b.T), which is 13x faster than per-pair Python loops.
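A minimal sketch of the idea, assuming each row of `mat_a` and `mat_b` is one record's bloom filter stored as 0/1 values: the matrix product counts shared set bits for every pair at once, and the row sums supply the denominators. The `dice_matrix` helper is a name introduced for this example, not a GoldenMatch function.

```python
import numpy as np


def dice_matrix(mat_a: np.ndarray, mat_b: np.ndarray) -> np.ndarray:
    """Pairwise Dice scores between the rows of two 0/1 matrices."""
    a = mat_a.astype(np.float32)
    b = mat_b.astype(np.float32)
    inter = a @ b.T                                           # shared set bits per pair
    totals = a.sum(axis=1)[:, None] + b.sum(axis=1)[None, :]  # |A| + |B| per pair
    # Pairs where both filters are empty keep a score of 0 instead of dividing by zero.
    return np.divide(2 * inter, totals, out=np.zeros_like(inter), where=totals > 0)


mat_a = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1]], dtype=np.uint8)
mat_b = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 0, 0, 0]], dtype=np.uint8)

scores = dice_matrix(mat_a, mat_b)
print(scores)  # one row per record in mat_a, one column per record in mat_b
```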
Use the bloom_filter transform and dice/jaccard scorer:
matchkeys:
  - name: pprl_match
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        transforms: [lowercase, strip, "bloom_filter:2:30:1024"]
        scorer: dice
        weight: 0.3
      - field: last_name
        transforms: [lowercase, strip, "bloom_filter:2:30:1024"]
        scorer: dice
        weight: 0.4
      - field: zip
        transforms: ["bloom_filter:2:30:1024"]
        scorer: dice
        weight: 0.3
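To make the weights and threshold concrete, here is a sketch of how a `weighted` matchkey could combine per-field Dice scores. The combination rule (a weight-normalized sum compared against the threshold) is an assumption for illustration, and the per-field scores are made-up values; the source does not spell out GoldenMatch's exact formula.

```python
# Weights and threshold copied from the YAML config above
WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "zip": 0.3}
THRESHOLD = 0.85


def weighted_score(field_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weight-normalized sum of per-field similarity scores (assumed rule)."""
    total = sum(weights.values())
    return sum(weights[f] * field_scores[f] for f in weights) / total


# Hypothetical Dice scores for one candidate pair
pair = {"first_name": 0.80, "last_name": 0.95, "zip": 1.00}

score = weighted_score(pair, WEIGHTS)
is_match = score >= THRESHOLD
print(f"score={score:.2f} match={is_match}")
```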
| Strategy | Precision | Recall | F1 | Privacy |
|---|---|---|---|---|
| Normal fuzzy (baseline) | 56.5% | 74.6% | 64.3% | None |
| PPRL manual tuning | 98.2% | 82.6% | 89.8% | Per-field HMAC |
| PPRL auto-config | 99.7% | 86.1% | 92.4% | Per-field HMAC |
| PPRL paranoid | 98.9% | 76.0% | 86.0% | HMAC + balanced |
| Strategy | Precision | Recall | F1 |
|---|---|---|---|
| PPRL auto-config | 64.0% | 93.8% | 76.1% |
Auto-configuration beats manual tuning on both datasets; on the first, it reaches 92.4% F1 versus 89.8% for manual tuning. Zero-config PPRL profiles your data and picks parameters automatically.
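For reference, the F1 column in the tables above is the harmonic mean of precision and recall. The short sketch below recomputes it from the published precision and recall values; because those inputs are already rounded, a recomputed figure can differ from the table in the last digit.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the first benchmark table
rows = {
    "Normal fuzzy (baseline)": (0.565, 0.746),
    "PPRL manual tuning": (0.982, 0.826),
    "PPRL auto-config": (0.997, 0.861),
    "PPRL paranoid": (0.989, 0.760),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1(p, r):.1%}")
```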
The MCP server exposes PPRL tools for Claude Desktop:
| Tool | Description |
|---|---|
| `pprl_auto_config` | Analyze data and recommend PPRL parameters |
| `pprl_link` | Run privacy-preserving linkage |