Privacy-Preserving Record Linkage (PPRL)

Match records across organizations without sharing raw data. GoldenMatch uses bloom filter encoding with HMAC salting and supports both trusted third party (TTP) and secure multi-party computation (SMC) protocols.

Quick start

import goldenmatch as gm

# Auto-configured: profiles data, picks fields and threshold
result = gm.pprl_link("hospital_a.csv", "hospital_b.csv")
print(f"Found {result['match_count']} matches across {len(result['clusters'])} clusters")

# Manual configuration
result = gm.pprl_link(
    "party_a.csv", "party_b.csv",
    fields=["first_name", "last_name", "dob", "zip"],
    threshold=0.85,
    security_level="high",
)

CLI:

goldenmatch pprl link party_a.csv party_b.csv --security-level high
goldenmatch pprl link a.csv b.csv --fields first_name last_name dob zip --threshold 0.85

How it works

Party A data -> Bloom filter encoding -> Encoded vectors
Party B data -> Bloom filter encoding -> Encoded vectors
                                              |
                                     Dice/Jaccard similarity
                                              |
                                        Matched pairs

1. Each field value is converted to character n-grams (e.g., bigrams).
2. N-grams are hashed with multiple hash functions into a bloom filter bit array.
3. HMAC salting ensures the same value produces different encodings with different keys.
4. Encoded vectors are compared using Dice or Jaccard similarity.
5. Pairs above the threshold are matched.
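
The steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not GoldenMatch's actual encoder: the function names, the way the hash index is mixed into the HMAC message, and the example key are all assumptions made for the sketch.

```python
import hashlib
import hmac


def ngrams(value: str, n: int = 2) -> list[str]:
    """Split a normalized value into overlapping character n-grams."""
    text = value.strip().lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def encode(value: str, key: bytes, k: int = 30, m: int = 1024) -> set[int]:
    """Return the positions of the set bits in an m-bit bloom filter."""
    bits: set[int] = set()
    for gram in ngrams(value):
        for i in range(k):
            # Assumption: one HMAC per (hash index, n-gram) stands in for k hash functions.
            digest = hmac.new(key, f"{i}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient: twice the shared bits over the total set bits."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret-between-parties"  # hypothetical key
same = dice(encode("katherine", key), encode("katherine", key))
typo = dice(encode("katherine", key), encode("katharine", key))
other = dice(encode("katherine", key), encode("robert", key))
rekeyed = encode("katherine", b"a-different-key")
```

Identical values score 1.0, a one-letter variant scores well above an unrelated name, and the same value encoded under a different key sets a different group of bits.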

Bloom filter parameters

Parameter	Description	Standard	High	Paranoid
`ngram_size`	Character n-gram size	2	2	3
`hash_functions`	Number of hash functions (k)	20	30	40
`bloom_filter_size`	Bit array length	512	1024	2048

Larger bloom filters and more hash functions increase privacy at the cost of matching precision.
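
To see how the three parameters interact, the textbook bloom filter estimate for the expected share of set bits can be evaluated for each level. The sketch assumes uniform, independent hash functions and a hypothetical 9-character field value; it is not taken from GoldenMatch's code.

```python
def expected_fill(num_ngrams: int, k: int, m: int) -> float:
    """Expected fraction of set bits after hashing num_ngrams n-grams k times each."""
    return 1.0 - (1.0 - 1.0 / m) ** (k * num_ngrams)


# (ngram_size, hash_functions, bloom_filter_size) from the table above
levels = {
    "standard": (2, 20, 512),
    "high": (2, 30, 1024),
    "paranoid": (3, 40, 2048),
}

value = "katherine"  # hypothetical field value
fills = {}
for name, (n, k, m) in levels.items():
    num_ngrams = len(value) - n + 1
    fills[name] = expected_fill(num_ngrams, k, m)
    print(f"{name:9s} n-grams={num_ngrams} expected fill={fills[name]:.3f}")
```

Under these assumptions the estimate comes out at roughly 27%, 21%, and 13% of bits set for standard, high, and paranoid respectively.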

Security levels

Standard

Basic bloom filter encoding. Suitable for internal use across trusted departments.

result = gm.pprl_link("a.csv", "b.csv", security_level="standard")

High (default)

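HMAC salting with per-field keys. Hardens the encoding against frequency analysis attacks, because a party without the keys cannot reproduce the encodings of common values.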

result = gm.pprl_link("a.csv", "b.csv", security_level="high")
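
A rough sketch of what per-field keying means in practice: each field gets its own HMAC key, so the same n-gram hashes differently in different fields and cannot be reproduced without the secret. The key-derivation scheme and the names `field_key` and `keyed_hash` are hypothetical; the source does not describe how GoldenMatch derives or distributes keys.

```python
import hashlib
import hmac


def field_key(master_key: bytes, field: str) -> bytes:
    """Derive a per-field key from a shared master secret (assumed scheme)."""
    return hmac.new(master_key, field.encode(), hashlib.sha256).digest()


def keyed_hash(key: bytes, token: str) -> str:
    """HMAC-SHA256 of a single n-gram under the given key."""
    return hmac.new(key, token.encode(), hashlib.sha256).hexdigest()


master = b"secret-agreed-by-both-parties"  # hypothetical shared secret
first = keyed_hash(field_key(master, "first_name"), "jo")
last = keyed_hash(field_key(master, "last_name"), "jo")
unkeyed = hashlib.sha256(b"jo").hexdigest()

print(first != last)     # same n-gram, different field, different hash
print(first != unkeyed)  # an attacker hashing without the key gets something else
```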

Paranoid

HMAC salting + balanced padding. Padding equalizes bloom filter density to prevent inference from bit population counts.

result = gm.pprl_link("a.csv", "b.csv", security_level="paranoid")
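
The source does not spell out the padding scheme, so the following is only one hypothetical way to equalize density: add keyed pseudo-random bits until every filter has the same number of set bits. The function name, the target count, and the seeding rule are assumptions for illustration.

```python
import hashlib
import hmac


def pad_to_density(bits: set[int], key: bytes, m: int, target: int) -> set[int]:
    """Add keyed pseudo-random bits until exactly `target` of m bits are set."""
    if len(bits) > target or target > m:
        raise ValueError("need len(bits) <= target <= m")
    padded = set(bits)
    # Seed the padding with the filter's own content so equal inputs pad identically.
    seed = ",".join(map(str, sorted(bits))).encode()
    counter = 0
    while len(padded) < target:
        message = seed + b"|" + str(counter).encode()
        digest = hmac.new(key, message, hashlib.sha256).digest()
        padded.add(int.from_bytes(digest[:8], "big") % m)
        counter += 1
    return padded


key = b"padding-key"  # hypothetical key
short = pad_to_density({3, 17, 42}, key, m=2048, target=600)
long = pad_to_density(set(range(0, 800, 2)), key, m=2048, target=600)
print(len(short), len(long))  # both filters now have the same popcount
```

After padding, a sparse filter from a short value and a dense filter from a long value have the same population count, so the count alone no longer hints at the value's length.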

Protocols

Trusted Third Party (TTP)

Both parties send encoded vectors to a trusted intermediary who performs the matching.

from goldenmatch.pprl.protocol import PPRLConfig, link_trusted_third_party

# party_a_data and party_b_data hold each party's records (loading not shown)
config = PPRLConfig(fields=["name", "dob", "zip"], threshold=0.85)
result = link_trusted_third_party(party_a_data, party_b_data, config)

Secure Multi-Party Computation (SMC)

No trusted intermediary required. Parties exchange encrypted similarity computations.

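from goldenmatch.pprl.protocol import PPRLConfig, link_smc

# Same configuration as in the TTP example
config = PPRLConfig(fields=["name", "dob", "zip"], threshold=0.85)
result = link_smc(party_a_data, party_b_data, config)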

Auto-configuration

pprl_auto_config profiles your data and selects optimal fields, bloom filter parameters, and threshold.

import goldenmatch as gm

config = gm.pprl_auto_config(df)
print(config.recommended_fields)     # ['first_name', 'last_name', 'zip_code', 'birth_year']
print(config.recommended_config)     # PPRLConfig with optimal parameters

Auto-config heuristics:

Penalizes near-unique fields (such as IDs): they leak information.
Penalizes long fields (more than 15 characters): they need more bits to encode.
Penalizes fields with many nulls: they reduce match quality.
Selects at most 4 fields (4 fields outperformed 6 in benchmarks).
Enforces a minimum threshold of 0.85.
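
As an illustration of how such heuristics could rank columns, here is a small pure-Python sketch. The penalty weights, the 0.95 uniqueness cutoff, and the function names are invented for the example; only the 15-character limit and the 4-field cap come from the list above.

```python
from typing import Optional


def score_field(values: list[Optional[str]]) -> float:
    """Score a column for PPRL suitability; higher is better (hypothetical weights)."""
    present = [v for v in values if v]
    if not present:
        return 0.0
    null_rate = 1 - len(present) / len(values)
    uniqueness = len(set(present)) / len(present)
    mean_len = sum(len(v) for v in present) / len(present)

    score = 1.0
    if uniqueness > 0.95:  # near-unique, ID-like column
        score -= 0.5
    if mean_len > 15:      # long values need more bits
        score -= 0.3
    score -= null_rate     # missing values hurt match quality
    return max(score, 0.0)


def pick_fields(columns: dict[str, list[Optional[str]]], limit: int = 4) -> list[str]:
    """Return the `limit` best-scoring column names."""
    ranked = sorted(columns, key=lambda name: score_field(columns[name]), reverse=True)
    return ranked[:limit]


columns = {
    "record_id": ["a1", "a2", "a3", "a4"],
    "first_name": ["ann", "bob", "ann", "cy"],
    "last_name": ["lee", "lee", "kim", "kim"],
    "zip": ["27601", "27601", "27513", "27513"],
    "dob": ["1980-01-02", "1980-01-02", "1975-05-05", "1990-09-09"],
    "middle_name": [None, None, None, "j"],
}
chosen = pick_fields(columns)
print(chosen)
```

In this toy data the ID column and the mostly empty column are left out of the selection.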

CLI:

goldenmatch pprl auto-config data.csv

PPRLConfig

gm.PPRLConfig(
    fields: list[str],
    threshold: float = 0.85,
    security_level: str = "high",
    ngram_size: int = 2,
    hash_functions: int = 30,
    bloom_filter_size: int = 1024,
    protocol: str = "trusted_third_party",
)

Low-level API

from goldenmatch.pprl.protocol import run_pprl, compute_bloom_filters, PartyData, LinkageResult

# Compute bloom filters manually
bf_a = compute_bloom_filters(df_a, fields, config)
bf_b = compute_bloom_filters(df_b, fields, config)

# Run matching
result: LinkageResult = run_pprl(df_a, df_b, config)
print(result.clusters)
print(result.match_count)
print(result.total_comparisons)

Vectorized similarity uses a NumPy matrix multiply (mat_a @ mat_b.T) to compute bloom filter Dice scores for all pairs at once, which is 13x faster than per-pair Python loops.
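
A minimal sketch of the matrix-multiply approach, assuming each bloom filter is stored as a row of 0/1 values. The function name `dice_matrix` and the toy matrices are illustrative and not part of the GoldenMatch API.

```python
import numpy as np


def dice_matrix(mat_a: np.ndarray, mat_b: np.ndarray) -> np.ndarray:
    """All-pairs Dice scores for 0/1 matrices of shape (n_a, m) and (n_b, m)."""
    a = mat_a.astype(np.int32)
    b = mat_b.astype(np.int32)
    intersections = a @ b.T  # shared set bits for every (row of a, row of b) pair
    totals = a.sum(axis=1)[:, None] + b.sum(axis=1)[None, :]
    # Leave the score at 0.0 where both filters are empty to avoid dividing by zero.
    return np.divide(
        2 * intersections,
        totals,
        out=np.zeros(intersections.shape, dtype=float),
        where=totals > 0,
    )


mat_a = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=np.uint8)
mat_b = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]], dtype=np.uint8)
scores = dice_matrix(mat_a, mat_b)
print(scores)
```

Because the dot product of two 0/1 rows counts their shared set bits, a single matrix product yields every pairwise intersection at once.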

PPRL in YAML config

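Use the bloom_filter transform with the dice or jaccard scorer. In the example, bloom_filter:2:30:1024 carries the same three values as the High security level: n-gram size 2, 30 hash functions, and a 1024-bit filter.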

matchkeys:
  - name: pprl_match
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        transforms: [lowercase, strip, "bloom_filter:2:30:1024"]
        scorer: dice
        weight: 0.3
      - field: last_name
        transforms: [lowercase, strip, "bloom_filter:2:30:1024"]
        scorer: dice
        weight: 0.4
      - field: zip
        transforms: ["bloom_filter:2:30:1024"]
        scorer: dice
        weight: 0.3
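
For intuition, a weighted matchkey like the one above can be thought of as a weighted average of per-field Dice scores compared against the threshold. The combination rule and the per-field scores below are assumptions for illustration; the source does not state exactly how GoldenMatch combines them.

```python
# Field weights and threshold copied from the YAML matchkey above.
WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "zip": 0.3}
THRESHOLD = 0.85


def weighted_score(field_scores: dict[str, float]) -> float:
    """Weighted average of per-field Dice scores (assumed combination rule)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[f] * field_scores[f] for f in WEIGHTS) / total_weight


# Hypothetical per-field Dice scores for one candidate pair.
pair = {"first_name": 0.80, "last_name": 0.95, "zip": 1.00}
score = weighted_score(pair)
is_match = score >= THRESHOLD
print(f"score={score:.2f} match={is_match}")
```

Here a weaker first-name score is offset by strong last-name and zip scores, so the pair still clears the 0.85 threshold.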

Benchmarks

FEBRL4 (5K vs 5K synthetic person records)

Strategy	Precision	Recall	F1	Privacy
Normal fuzzy (baseline)	56.5%	74.6%	64.3%	None
PPRL manual tuning	98.2%	82.6%	89.8%	Per-field HMAC
PPRL auto-config	99.7%	86.1%	92.4%	Per-field HMAC
PPRL paranoid	98.9%	76.0%	86.0%	HMAC + balanced

NCVR (North Carolina Voter Registration)

Strategy	Precision	Recall	F1
PPRL auto-config	64.0%	93.8%	76.1%

Auto-configuration beats manual tuning on both datasets; on FEBRL4 it raises F1 from 89.8% to 92.4% while also improving precision and recall. Zero-config PPRL profiles your data and picks parameters automatically.
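
The F1 column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns as a sanity check. The helper below is a plain-Python sketch using the rounded figures from the tables.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as listed in the benchmark tables
rows = {
    "FEBRL4 normal fuzzy": (0.565, 0.746),
    "FEBRL4 PPRL auto-config": (0.997, 0.861),
    "FEBRL4 PPRL paranoid": (0.989, 0.760),
    "NCVR PPRL auto-config": (0.640, 0.938),
}

recomputed = {name: round(f1(p, r) * 100, 1) for name, (p, r) in rows.items()}
for name, value in recomputed.items():
    print(f"{name}: F1 = {value}%")
```

These four rows reproduce the listed F1 values. The manual-tuning row evaluates to 89.7% from its rounded inputs, which is within rounding error of the listed 89.8%.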

MCP tools

The MCP server exposes PPRL tools for Claude Desktop:

Tool	Description
`pprl_auto_config`	Analyze data and recommend PPRL parameters
`pprl_link`	Run privacy-preserving linkage