# ER Agent
GoldenMatch exposes itself as an autonomous entity resolution agent that other AI systems can discover and invoke.
A calling agent says "deduplicate this data," and GoldenMatch handles strategy selection, config generation, pipeline execution, and result explanation, all without human configuration.
## Two Protocols
| Protocol | Port | Best For |
|---|---|---|
| A2A (Agent-to-Agent) | 8200 | AI agent frameworks (LangChain, CrewAI, AutoGen) |
| MCP (Model Context Protocol) | stdio | Claude Desktop, Cursor, Windsurf |
## Quick Start

### A2A Server

```bash
pip install goldenmatch[agent]
goldenmatch agent-serve --port 8200
```
Other agents discover GoldenMatch at:
```
GET http://localhost:8200/.well-known/agent.json
```
### MCP (Claude Desktop)

Add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "goldenmatch": {
      "command": "goldenmatch",
      "args": ["mcp-serve", "--file", "customers.csv"]
    }
  }
}
```
## Agent Capabilities (8 Skills)
| Skill | What It Does |
|---|---|
| `analyze_data` | Profile columns, detect domain, recommend matching strategy |
| `configure` | Generate optimal YAML config from data analysis |
| `deduplicate` | Full pipeline with confidence-gated output and reasoning |
| `match` | Cross-source matching with intelligent strategy selection |
| `explain` | Natural language explanation for any pair or cluster |
| `review` | Present borderline matches for approval |
| `compare_strategies` | Run multiple approaches, report metrics |
| `pprl` | Privacy-preserving mode for sensitive data |
## How It Works
When an agent calls `deduplicate`, GoldenMatch:

1. Profiles the data (column types, cardinality, null rates)
2. Detects the domain (healthcare, financial, retail, people, etc.)
3. Selects the best strategy:
   - Strong ID fields (email, SSN) -> exact matching
   - Fuzzy-matchable fields (name, address) -> fuzzy matching
   - Sensitive fields detected -> recommends PPRL
   - Large datasets (>500K rows) -> recommends Ray backend
4. Generates a config (matchkeys, blocking, scoring)
5. Runs the pipeline with confidence gating
6. Returns results plus reasoning
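The selection rules above can be sketched as a small decision function. This is an illustrative sketch only; `choose_strategy` and the profile field names are assumptions, not GoldenMatch API.

```python
def choose_strategy(profile: dict) -> dict:
    """Map a data profile to a matching plan, mirroring the rules above."""
    plan = {"strategy": "fuzzy", "backend": "local", "pprl": False}

    # Sensitive fields detected -> recommend privacy-preserving linkage
    if profile.get("sensitive_fields"):
        plan["pprl"] = True

    # Strong ID fields (email, SSN) -> exact matching, optionally
    # followed by fuzzy matching on name/address-style fields
    if profile.get("strong_id_fields"):
        plan["strategy"] = (
            "exact_then_fuzzy" if profile.get("fuzzy_fields") else "exact"
        )

    # Large datasets -> distributed backend
    if profile.get("row_count", 0) > 500_000:
        plan["backend"] = "ray"
    return plan
```

The ordering matters: identifier strength picks the matcher, while dataset size only changes the execution backend.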
## Reasoning Output
Every response includes the agent’s reasoning:
```json
{
  "results": {
    "clusters": 42,
    "match_rate": "8.4%"
  },
  "reasoning": {
    "domain_detected": "people",
    "strategy_chosen": "exact_then_fuzzy",
    "why": "Email has 92% uniqueness -- strong exact key. Name has spelling variation -- jaro_winkler at 0.85.",
    "alternatives_considered": [
      {"strategy": "pprl", "why_not": "No sensitive fields detected."},
      {"strategy": "fellegi_sunter", "why_not": "Fuzzy gives better recall for this data."}
    ],
    "confidence_distribution": {
      "auto_merged": 38,
      "review_queue": 4,
      "auto_rejected": 0
    }
  },
  "storage": "memory"
}
```
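A calling agent can consume this payload with plain JSON handling. A minimal sketch, using only field names from the sample response above (trimmed here for brevity):

```python
import json

# A trimmed copy of the sample response shown above
response = json.loads("""{
  "results": {"clusters": 42, "match_rate": "8.4%"},
  "reasoning": {
    "strategy_chosen": "exact_then_fuzzy",
    "confidence_distribution": {"auto_merged": 38, "review_queue": 4, "auto_rejected": 0}
  },
  "storage": "memory"
}""")

dist = response["reasoning"]["confidence_distribution"]
pairs = sum(dist.values())
print(f"{response['results']['clusters']} clusters via {response['reasoning']['strategy_chosen']}")
print(f"{dist['review_queue']} of {pairs} gated pairs need human review")
```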
## Confidence-Gated Review Queue
Not all matches are equal. The agent splits results by confidence:
| Confidence | Action | Which Pairs |
|---|---|---|
| > 0.95 | Auto-merged into golden records | High-confidence pairs |
| 0.75 - 0.95 | Held in review queue for approval | Borderline pairs |
| < 0.75 | Auto-rejected | Low-confidence pairs |
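The gating logic is just two thresholds. A sketch of the routing, with thresholds taken from the table; `gate` itself is illustrative, not part of the GoldenMatch API:

```python
AUTO_MERGE_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.75

def gate(score: float) -> str:
    """Route a scored pair into one of the three confidence buckets."""
    if score > AUTO_MERGE_THRESHOLD:
        return "auto_merged"
    if score >= REVIEW_THRESHOLD:
        return "review_queue"
    return "auto_rejected"
```

Note the boundary behavior: a score of exactly 0.95 lands in the review queue, since only scores strictly above the threshold are auto-merged.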
## Storage Tiers

| Tier | Config | Persists? |
|---|---|---|
| Memory | Default (nothing to configure) | No |
| SQLite | Create a `.goldenmatch/` directory | Yes (local file) |
| Postgres | Set `DATABASE_URL` env var | Yes (shared DB) |
The agent auto-detects which tier is available and reports it in every response.
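The table implies a natural detection order: a shared database beats a local file, which beats memory. A hedged sketch of that order; `detect_tier` is illustrative, and GoldenMatch's actual detection may differ:

```python
import os

def detect_tier(env: dict, cwd: str = ".") -> str:
    """Pick the most durable storage tier available, per the table above."""
    if env.get("DATABASE_URL"):          # shared Postgres wins if configured
        return "postgres"
    if os.path.isdir(os.path.join(cwd, ".goldenmatch")):  # local SQLite dir
        return "sqlite"
    return "memory"                      # default: nothing persists
```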
## Review Queue API

```python
from goldenmatch import AgentSession

session = AgentSession()
result = session.deduplicate("customers.csv")

# Check what needs review
pending = session.review_queue.list_pending("customers")
for item in pending:
    print(f"Pair ({item.id_a}, {item.id_b}): score={item.score}")
    print(f"  Explanation: {item.explanation}")

# Approve or reject
session.review_queue.approve("customers", 0, 1, decided_by="human")
session.review_queue.reject("customers", 2, 3, decided_by="human", reason="Different entities")

# Stats
print(session.review_queue.stats("customers"))
# {"pending": 2, "approved": 1, "rejected": 1}
```
## Python API

```python
from goldenmatch import AgentSession

session = AgentSession()

# Analyze data and get a strategy recommendation
analysis = session.analyze("customers.csv")
print(analysis["strategy"])  # "exact_then_fuzzy"
print(analysis["why"])

# Deduplicate with full reasoning
result = session.deduplicate("customers.csv")
print(result["results"]["clusters"])
print(result["reasoning"]["why"])

# Compare strategies
comparison = session.compare_strategies("customers.csv")
for strategy, metrics in comparison.items():
    print(f"{strategy}: {metrics['clusters']} clusters, {metrics['match_rate']:.1%} match rate")

# Match two sources
matches = session.match_sources("new_customers.csv", "master.csv")
```
## MCP Tools (10 Agent-Level)

| Tool | Description |
|---|---|
| `analyze_data` | Profile data, detect domain, recommend strategy |
| `auto_configure` | Generate optimal config |
| `agent_deduplicate` | Full pipeline with reasoning |
| `agent_match_sources` | Cross-source matching |
| `agent_explain_pair` | Explain a pair match |
| `agent_explain_cluster` | Explain a cluster |
| `agent_review_queue` | Get pending reviews |
| `agent_approve_reject` | Process review decisions |
| `agent_compare_strategies` | Compare ER approaches |
| `suggest_pprl` | Check if PPRL is needed |
These are additive: existing MCP tools (`suggest_config`, `list_domains`, etc.) continue to work.
## A2A Agent Card

```json
{
  "name": "goldenmatch-agent",
  "description": "Autonomous entity resolution agent.",
  "provider": {
    "organization": "GoldenMatch",
    "url": "https://github.com/benzsevern/goldenmatch"
  },
  "capabilities": {
    "streaming": true,
    "pushNotifications": false
  },
  "skills": [...]
}
```
Full card at: `http://localhost:8200/.well-known/agent.json`
## Authentication

Set the `GOLDENMATCH_AGENT_TOKEN` env var to enable bearer-token auth. If it is not set, no auth is required (suitable for local use).
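When a token is set, clients attach it as a standard `Authorization: Bearer` header. A stdlib-only sketch of building such a request against the local server; the `dev-token` fallback is purely illustrative:

```python
import os
import urllib.request

token = os.environ.get("GOLDENMATCH_AGENT_TOKEN", "dev-token")
req = urllib.request.Request("http://localhost:8200/.well-known/agent.json")
req.add_header("Authorization", f"Bearer {token}")

# urllib.request.urlopen(req) would perform the call once the server is running
```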