PostgreSQL Extension
Use GoldenMatch directly from SQL. The PostgreSQL extension provides 18 functions for scoring, deduplication, and pipeline management.
Installation
Pre-built packages
Download from goldenmatch-extensions releases:
# Debian/Ubuntu
sudo dpkg -i goldenmatch-pg-0.1.0-pg16-amd64.deb
sudo systemctl restart postgresql
# RHEL/Fedora
sudo rpm -i goldenmatch-pg-0.1.0-pg16.x86_64.rpm
sudo systemctl restart postgresql
Docker
docker pull ghcr.io/benzsevern/goldenmatch-extensions:latest
docker run -d \
-e POSTGRES_PASSWORD=secret \
-p 5432:5432 \
ghcr.io/benzsevern/goldenmatch-extensions:latest
From source
Requires Rust and pgrx. Build on Linux (pgrx needs libclang/LLVM):
cd goldenmatch-extensions/postgres
cargo pgrx install --release
Enable the extension
CREATE EXTENSION goldenmatch_pg;
All functions live in the goldenmatch schema.
Functions
String scoring
-- Score two strings
SELECT goldenmatch.goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler');
-- 0.884
-- Available scorers: jaro_winkler, levenshtein, exact, token_sort, soundex_match
SELECT goldenmatch.goldenmatch_score('hello', 'helo', 'levenshtein');
-- 0.8
Table deduplication
-- Dedupe a table with JSON config
SELECT goldenmatch.goldenmatch_dedupe_table(
'customers',
'{"exact": ["email"], "fuzzy": {"name": 0.85, "zip": 0.95}}'
);
-- With full YAML-style config
SELECT goldenmatch.goldenmatch_dedupe_table(
'customers',
'{
"matchkeys": [
{"name": "email", "type": "exact", "fields": [{"field": "email"}]},
{"name": "fuzzy", "type": "weighted", "threshold": 0.85,
"fields": [
{"field": "name", "scorer": "jaro_winkler", "weight": 0.7},
{"field": "zip", "scorer": "exact", "weight": 0.3}
]}
]
}'
);
Table matching
-- Match target table against reference
SELECT goldenmatch.goldenmatch_match_tables(
'new_customers',
'master_customers',
'{"fuzzy": {"name": 0.85}}'
);
Pair scoring and explanation
-- Score a pair of records
SELECT goldenmatch.goldenmatch_score_pair(
'{"name": "John Smith", "zip": "10001"}'::jsonb,
'{"name": "Jon Smyth", "zip": "10001"}'::jsonb,
'{"fuzzy": {"name": 0.7, "zip": 0.3}}'
);
-- Explain a match decision
SELECT goldenmatch.goldenmatch_explain_pair(
'{"name": "John Smith"}'::jsonb,
'{"name": "Jon Smyth"}'::jsonb,
'{"fuzzy": {"name": 1.0}}'
);
Pipeline management
-- List pipeline runs
SELECT * FROM goldenmatch.goldenmatch_list_jobs();
-- Get job status
SELECT goldenmatch.goldenmatch_job_status('job_id');
-- Get clusters from a run
SELECT * FROM goldenmatch.goldenmatch_get_clusters('job_id');
-- Get golden records from a run
SELECT * FROM goldenmatch.goldenmatch_get_golden('job_id');
-- Get scored pairs
SELECT * FROM goldenmatch.goldenmatch_get_pairs('job_id');
Pipeline schema
The extension creates metadata tables for pipeline tracking:
| Table | Purpose |
|---|---|
goldenmatch._jobs | Pipeline run tracking (job_id, status, config, timestamps) |
goldenmatch._pairs | Scored pairs (job_id, id_a, id_b, score, matchkey) |
goldenmatch._clusters | Cluster membership (job_id, cluster_id, record_id) |
goldenmatch._golden | Golden records (job_id, cluster_id, record_data) |
Database sync tables
When using goldenmatch sync from the CLI, additional metadata tables are created:
| Table | Purpose |
|---|---|
gm_state | Processing state, watermarks |
gm_clusters | Persistent cluster membership |
gm_golden_records | Versioned golden records (append-only, is_current flag) |
gm_embeddings | Cached embeddings for ANN blocking |
gm_match_log | Audit trail of all match decisions |
Examples
Deduplicate a customer table
-- Create and populate
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT,
email TEXT,
phone TEXT,
zip TEXT
);
INSERT INTO customers (name, email, phone, zip) VALUES
('John Smith', 'john@example.com', '555-1234', '10001'),
('Jon Smyth', 'john@example.com', '5551234', '10001'),
('Jane Doe', 'jane@example.com', '555-5678', '90210');
-- Dedupe
SELECT goldenmatch.goldenmatch_dedupe_table(
'customers',
'{"exact": ["email"]}'
);
-- View clusters
SELECT * FROM goldenmatch._clusters
WHERE job_id = (SELECT job_id FROM goldenmatch._jobs ORDER BY created_at DESC LIMIT 1);
Score columns in a query
SELECT
a.name AS name_a,
b.name AS name_b,
goldenmatch.goldenmatch_score(a.name, b.name, 'jaro_winkler') AS similarity
FROM customers a
CROSS JOIN customers b
WHERE a.id < b.id
AND goldenmatch.goldenmatch_score(a.name, b.name, 'jaro_winkler') > 0.85;
Notes
- Extension functions live in the
goldenmatchschema – usegoldenmatch.function_name()or setsearch_path - Explicit
::TEXTcasts may be needed for some argument types in psql - pgrx does not auto-generate SQL files – the extension uses handwritten SQL at
sql/goldenmatch_pg--0.1.0.sql - See goldenmatch-extensions for full documentation and CI details