Domain Packs
GoldenMatch includes 7 built-in YAML rulebooks that extract structured fields from unstructured product descriptions and other domain-specific text.
Built-in packs
| Pack | Domain | Extracted Fields |
|---|---|---|
electronics | Consumer electronics | brand, model, SKU, color, specs |
software | Software products | name, version, edition, platform |
healthcare | Medical records | NPI, CPT codes, drug names, dosages |
financial | Financial instruments | CUSIP, LEI, ticker, account numbers |
real_estate | Property listings | address, MLS number, lot size, year built |
people | Person records | name parts, phone, email, SSN pattern |
retail | General retail | brand, SKU, UPC, size, color |
Using domain packs
Auto-detection
import goldenmatch as gm
rulebooks = gm.discover_rulebooks() # Returns all 7 packs
print(list(rulebooks.keys()))
# ['electronics', 'software', 'healthcare', 'financial', 'real_estate', 'people', 'retail']
Extract fields
import goldenmatch as gm
rulebooks = gm.discover_rulebooks()
enhanced_df, low_confidence = gm.extract_with_rulebook(df, "title", rulebooks["electronics"])
# enhanced_df has new columns: __brand__, __model__, __sku__, etc.
# low_confidence contains records where extraction confidence was low
Auto-detect domain
domain = gm.match_domain(df, "description")
# Returns "electronics", "software", etc., or None
YAML config
Enable domain extraction in your config file:
domain:
enabled: true
pack: electronics
Or let GoldenMatch auto-detect:
domain:
enabled: true
Electronics pack
Extracts brand, model number, SKU, color, and technical specs from product titles.
"Samsung Galaxy S24 Ultra 256GB Titanium Black SM-S928B"
-> brand: Samsung
-> model: Galaxy S24 Ultra
-> sku: SM-S928B
-> color: Titanium Black
-> specs: 256GB
Model normalization strips hyphens, region suffixes, and color suffixes for better matching.
Software pack
Extracts name, version, edition, and platform.
"Microsoft Office 365 Professional Plus - Windows"
-> name: Microsoft Office
-> version: 365
-> edition: Professional Plus
-> platform: Windows
Healthcare pack
Extracts medical identifiers with contextual prefix requirements (e.g., NPI:, CPT:) to avoid false positives on generic numbers.
"Provider NPI:1234567890, CPT:99213 Office Visit"
-> npi: 1234567890
-> cpt_code: 99213
Financial pack
Extracts financial identifiers (CUSIP, LEI, ticker). Contextual prefixes required.
"Bond CUSIP:037833AK6, Issuer LEI:HWUPKR0MPOU8FGXBT394"
-> cusip: 037833AK6
-> lei: HWUPKR0MPOU8FGXBT394
Custom domain packs
Create your own YAML rulebook and place it in one of the search paths:
| Path | Scope |
|---|---|
.goldenmatch/domains/ | Project-local |
~/.goldenmatch/domains/ | Global (user) |
goldenmatch/domains/ | Built-in (read-only) |
Rulebook YAML format
# .goldenmatch/domains/my_domain.yaml
name: my_domain
description: Custom domain for matching widgets
signals:
- pattern: "widget"
weight: 1.0
- pattern: "part_?number"
weight: 0.8
extractors:
- name: part_number
pattern: "PN[:-]?\\s*(\\w{6,12})"
group: 1
- name: manufacturer
pattern: "(Acme|Globex|Initech)"
group: 1
normalizers:
part_number:
strip_chars: "-"
uppercase: true
Create via Python
import goldenmatch as gm
gm.save_rulebook("my_domain", rulebook)
loaded = gm.load_rulebook("my_domain")
Create via MCP
The MCP server provides tools for domain management:
| Tool | Description |
|---|---|
list_domains | List all available domain packs |
create_domain | Create a new custom domain pack |
test_domain | Test a domain pack against sample data |
Domain extraction in the pipeline
Domain extraction runs between the standardize and matchkeys steps. It adds extracted fields as new columns (prefixed with __) that can be used in matchkeys:
matchkeys:
- name: product_match
type: weighted
threshold: 0.85
fields:
- field: __brand__
scorer: exact
weight: 0.3
- field: __model__
scorer: jaro_winkler
weight: 0.5
- field: title
scorer: token_sort
weight: 0.2
Benchmarks
Domain extraction significantly improves product matching:
| Dataset | Without Domain | With Domain | Improvement |
|---|---|---|---|
| Abt-Buy (electronics) | 44.5% F1 | 72.2% F1 | +27.7pp |
| Amazon-Google (software) | 45.3% F1 | 42.1% F1 | -3.2pp |
Domain extraction helps datasets with structured identifiers (brand, model, SKU) but can hurt datasets with unstructured descriptions. For software matching, clean embedding + ANN pipelines perform better.