DuckDB Extension
Use GoldenMatch as Python UDFs inside DuckDB. Score strings, deduplicate tables, and run pipelines from SQL.
Installation
pip install goldenmatch-duckdb
Requires goldenmatch and duckdb as dependencies. Also needs pyarrow for Polars conversion (.pl()).
Quick start
import duckdb
import goldenmatch_duckdb
con = duckdb.connect()
goldenmatch_duckdb.register(con)
# Score two strings
con.sql("SELECT goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler')").show()
# 0.884
Functions
goldenmatch_score
Score two strings with a named scorer.
SELECT goldenmatch_score('John Smith', 'Jon Smyth', 'jaro_winkler');
-- 0.884
SELECT goldenmatch_score('hello', 'helo', 'levenshtein');
-- 0.8
SELECT goldenmatch_score('Smith', 'Smyth', 'soundex_match');
-- 1.0
Available scorers: jaro_winkler, levenshtein, exact, token_sort, soundex_match.
goldenmatch_dedupe
Deduplicate a DuckDB table.
SELECT * FROM goldenmatch_dedupe('customers', '{"exact": ["email"]}');
con.sql("""
SELECT * FROM goldenmatch_dedupe('customers', '{
"matchkeys": [{
"name": "fuzzy",
"type": "weighted",
"threshold": 0.85,
"fields": [
{"field": "name", "scorer": "jaro_winkler", "weight": 0.7},
{"field": "zip", "scorer": "exact", "weight": 0.3}
]
}]
}')
""").show()
goldenmatch_match
Match a target table against a reference table.
SELECT * FROM goldenmatch_match('new_customers', 'master', '{"fuzzy": {"name": 0.85}}');
goldenmatch_score_pair
Score a pair of JSON records.
SELECT goldenmatch_score_pair(
'{"name": "John Smith", "zip": "10001"}',
'{"name": "Jon Smyth", "zip": "10001"}',
'{"fuzzy": {"name": 0.7, "zip": 0.3}}'
);
goldenmatch_explain
Explain a match decision in natural language.
SELECT goldenmatch_explain(
'{"name": "John Smith"}',
'{"name": "Jon Smyth"}',
'{"fuzzy": {"name": 1.0}}'
);
Pipeline management
-- List completed jobs
SELECT * FROM goldenmatch_list_jobs();
-- Get clusters from a job
SELECT * FROM goldenmatch_get_clusters('job_id');
-- Get golden records
SELECT * FROM goldenmatch_get_golden('job_id');
-- Get scored pairs
SELECT * FROM goldenmatch_get_pairs('job_id');
-- Get job status
SELECT goldenmatch_job_status('job_id');
Utility functions
-- Version
SELECT goldenmatch_version();
-- Available scorers
SELECT * FROM goldenmatch_list_scorers();
Examples
Score columns in a query
import duckdb
import goldenmatch_duckdb
con = duckdb.connect()
goldenmatch_duckdb.register(con)
con.sql("""
CREATE TABLE customers AS
SELECT * FROM read_csv_auto('customers.csv');
SELECT
a.name AS name_a,
b.name AS name_b,
goldenmatch_score(a.name, b.name, 'jaro_winkler') AS similarity
FROM customers a, customers b
WHERE a.rowid < b.rowid
AND goldenmatch_score(a.name, b.name, 'jaro_winkler') > 0.85;
""").show()
Deduplicate and export
con.sql("""
CREATE TABLE customers AS SELECT * FROM read_csv_auto('customers.csv');
""")
result = con.sql("SELECT * FROM goldenmatch_dedupe('customers', '{\"exact\": [\"email\"]}')").pl()
result.write_csv("deduped.csv")
Use with dbt
The dbt-goldenmatch package provides macros for DuckDB-based entity resolution:
pip install dbt-goldenmatch
-- In a dbt model
{{ run_goldenmatch_dedupe('customers', '{"exact": ["email"]}') }}
DuckDB backend
GoldenMatch also supports DuckDB as an out-of-core processing backend for datasets that don’t fit in memory:
import goldenmatch as gm
result = gm.dedupe("huge.csv", exact=["email"], backend="duckdb")
The DuckDB backend provides:
from goldenmatch.backends.duckdb_backend import DuckDBBackend
backend = DuckDBBackend("data.duckdb")
backend.write_table("customers", df)
df = backend.read_table("customers")
tables = backend.list_tables()
Notes
- DuckDB UDFs cannot query the same connection they are called on (deadlock) – the extension uses
con.cursor()for internal table reads .pl()(Polars conversion) requirespyarrowas a dependency- See goldenmatch-extensions for source code and CI details