Domain Packs

GoldenMatch ships with seven built-in domain packs. Each pack is a YAML rulebook that extracts structured fields from unstructured, domain-specific text such as product descriptions.


Built-in packs

| Pack | Domain | Extracted Fields |
|------|--------|------------------|
| electronics | Consumer electronics | brand, model, SKU, color, specs |
| software | Software products | name, version, edition, platform |
| healthcare | Medical records | NPI, CPT codes, drug names, dosages |
| financial | Financial instruments | CUSIP, LEI, ticker, account numbers |
| real_estate | Property listings | address, MLS number, lot size, year built |
| people | Person records | name parts, phone, email, SSN pattern |
| retail | General retail | brand, SKU, UPC, size, color |

Using domain packs

Discover rulebooks

import goldenmatch as gm

rulebooks = gm.discover_rulebooks()  # Returns all 7 packs
print(list(rulebooks.keys()))
# ['electronics', 'software', 'healthcare', 'financial', 'real_estate', 'people', 'retail']

Extract fields

import goldenmatch as gm

rulebooks = gm.discover_rulebooks()
# df is your input DataFrame; "title" is the free-text column to extract from
enhanced_df, low_confidence = gm.extract_with_rulebook(df, "title", rulebooks["electronics"])

# enhanced_df has new columns: __brand__, __model__, __sku__, etc.
# low_confidence contains records where extraction confidence was low

Auto-detect domain

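import goldenmatch as gm

# df is your input DataFrame; "description" is the free-text column to inspect
domain = gm.match_domain(df, "description")
# Returns a pack name such as "electronics" or "software", or None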

YAML config

Enable domain extraction in your config file:

domain:
  enabled: true
  pack: electronics

Or omit pack to let GoldenMatch auto-detect the domain:

domain:
  enabled: true

Electronics pack

Extracts brand, model number, SKU, color, and technical specs from product titles.

"Samsung Galaxy S24 Ultra 256GB Titanium Black SM-S928B"
  -> brand: Samsung
  -> model: Galaxy S24 Ultra
  -> sku: SM-S928B
  -> color: Titanium Black
  -> specs: 256GB

Model normalization strips hyphens, region suffixes, and color suffixes for better matching.
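
The normalization described above can be pictured with a minimal sketch. This is not the pack's actual implementation: the function name normalize_model, the suffix lists, and the uppercasing step are all assumptions made for illustration.

```python
import re

# Hypothetical suffix lists for illustration; the real pack's lists are not shown here.
COLOR_SUFFIXES = ("BLACK", "WHITE", "SILVER", "BLUE")
REGION_SUFFIXES = ("US", "EU", "UK")

def normalize_model(model: str) -> str:
    """Uppercase (an assumption), drop trailing color/region tokens, strip hyphens."""
    tokens = re.split(r"[\s/]+", model.strip().upper())
    # Remove suffix tokens from the end, but always keep at least one token.
    while len(tokens) > 1 and tokens[-1] in COLOR_SUFFIXES + REGION_SUFFIXES:
        tokens.pop()
    return " ".join(tokens).replace("-", "")

print(normalize_model("WH-1000XM4 Black"))  # WH1000XM4
print(normalize_model("wh1000xm4/US"))      # WH1000XM4
```

Both spellings collapse to the same string, so an exact or fuzzy comparison on the normalized value no longer penalizes cosmetic differences.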


Software pack

Extracts name, version, edition, and platform.

"Microsoft Office 365 Professional Plus - Windows"
  -> name: Microsoft Office
  -> version: 365
  -> edition: Professional Plus
  -> platform: Windows

Healthcare pack

Extracts medical identifiers with contextual prefix requirements (e.g., NPI:, CPT:) to avoid false positives on generic numbers.

"Provider NPI:1234567890, CPT:99213 Office Visit"
  -> npi: 1234567890
  -> cpt_code: 99213
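
To see why a required prefix avoids false positives, here is a small sketch using Python's standard re module. The patterns and the extract_ids helper are illustrative assumptions, not the rules shipped in the healthcare pack.

```python
import re

# Illustrative patterns: the prefix is mandatory, so a bare number is never captured.
NPI_RE = re.compile(r"\bNPI[:\s]\s*(\d{10})\b")
CPT_RE = re.compile(r"\bCPT[:\s]\s*(\d{5})\b")

def extract_ids(text: str) -> dict:
    """Return only the identifiers that appear with their contextual prefix."""
    out = {}
    for name, rx in (("npi", NPI_RE), ("cpt_code", CPT_RE)):
        m = rx.search(text)
        if m:
            out[name] = m.group(1)
    return out

print(extract_ids("Provider NPI:1234567890, CPT:99213 Office Visit"))
# {'npi': '1234567890', 'cpt_code': '99213'}
print(extract_ids("Order 1234567890 shipped to 99213"))
# {}
```

The second call contains the same digit strings but no prefixes, so nothing is extracted.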

Financial pack

Extracts financial identifiers (CUSIP, LEI, ticker). As with the healthcare pack, contextual prefixes (e.g., CUSIP:, LEI:) are required.

"Bond CUSIP:037833AK6, Issuer LEI:HWUPKR0MPOU8FGXBT394"
  -> cusip: 037833AK6
  -> lei: HWUPKR0MPOU8FGXBT394

Custom domain packs

Create your own YAML rulebook and place it in one of the search paths:

| Path | Scope |
|------|-------|
| .goldenmatch/domains/ | Project-local |
| ~/.goldenmatch/domains/ | Global (user) |
| goldenmatch/domains/ | Built-in (read-only) |

Rulebook YAML format

# .goldenmatch/domains/my_domain.yaml
name: my_domain
description: Custom domain for matching widgets
signals:
  - pattern: "widget"
    weight: 1.0
  - pattern: "part_?number"
    weight: 0.8
extractors:
  - name: part_number
    pattern: "PN[:-]?\\s*(\\w{6,12})"
    group: 1
  - name: manufacturer
    pattern: "(Acme|Globex|Initech)"
    group: 1
normalizers:
  part_number:
    strip_chars: "-"
    uppercase: true
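
To make the roles of signals, extractors, and normalizers concrete, the following sketch mirrors the rulebook above as a plain Python dict and applies it with the standard re module. How GoldenMatch actually evaluates a rulebook is not documented here, so the signal_score and extract helpers, and the choice to sum signal weights, are assumptions for illustration.

```python
import re

# Mirrors the YAML rulebook as a dict so the sketch needs no YAML parser.
RULEBOOK = {
    "signals": [
        {"pattern": "widget", "weight": 1.0},
        {"pattern": "part_?number", "weight": 0.8},
    ],
    "extractors": [
        {"name": "part_number", "pattern": r"PN[:-]?\s*(\w{6,12})", "group": 1},
        {"name": "manufacturer", "pattern": r"(Acme|Globex|Initech)", "group": 1},
    ],
    "normalizers": {"part_number": {"strip_chars": "-", "uppercase": True}},
}

def signal_score(text: str, rulebook: dict) -> float:
    """Sum the weights of all signal patterns found in the text (assumed scoring)."""
    return sum(
        s["weight"]
        for s in rulebook["signals"]
        if re.search(s["pattern"], text, re.IGNORECASE)
    )

def extract(text: str, rulebook: dict) -> dict:
    """Run each extractor, then apply any normalizer registered for that field."""
    fields = {}
    for ex in rulebook["extractors"]:
        m = re.search(ex["pattern"], text)
        if not m:
            continue
        value = m.group(ex["group"])
        norm = rulebook["normalizers"].get(ex["name"], {})
        for ch in norm.get("strip_chars", ""):
            value = value.replace(ch, "")
        if norm.get("uppercase"):
            value = value.upper()
        fields[ex["name"]] = value
    return fields

title = "Acme widget PN: ab12cd34"
print(signal_score(title, RULEBOOK))  # 1.0
print(extract(title, RULEBOOK))
# {'part_number': 'AB12CD34', 'manufacturer': 'Acme'}
```

Signals decide whether a pack applies to a piece of text, extractors pull out capture groups, and normalizers clean the captured value before it is used for matching.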

Create via Python

import goldenmatch as gm

# rulebook is a rulebook object you have already built or loaded (construction not shown here)
gm.save_rulebook("my_domain", rulebook)
loaded = gm.load_rulebook("my_domain")

Create via MCP

The MCP server provides tools for domain management:

| Tool | Description |
|------|-------------|
| list_domains | List all available domain packs |
| create_domain | Create a new custom domain pack |
| test_domain | Test a domain pack against sample data |

Domain extraction in the pipeline

Domain extraction runs between the standardize and matchkeys steps. It adds the extracted fields as new columns whose names are wrapped in double underscores (e.g., __brand__), and these columns can be referenced in matchkeys:

matchkeys:
  - name: product_match
    type: weighted
    threshold: 0.85
    fields:
      - field: __brand__
        scorer: exact
        weight: 0.3
      - field: __model__
        scorer: jaro_winkler
        weight: 0.5
      - field: title
        scorer: token_sort
        weight: 0.2
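
As a rough illustration of how the weights and threshold in this matchkey interact, the sketch below combines per-field similarity scores into one value. It assumes that a weighted matchkey takes a weighted average of field scores in [0, 1]; the per-field scores are hypothetical inputs, and the real scorers (exact, jaro_winkler, token_sort) are not reimplemented here.

```python
# Weights and threshold copied from the product_match matchkey.
WEIGHTS = {"__brand__": 0.3, "__model__": 0.5, "title": 0.2}
THRESHOLD = 0.85

def weighted_score(field_scores: dict) -> float:
    """Weighted average of per-field similarity scores (assumed combination rule)."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[f] * field_scores.get(f, 0.0) for f in WEIGHTS)
    return weighted_sum / total_weight

def is_match(field_scores: dict) -> bool:
    """A pair matches when its combined score reaches the threshold."""
    return weighted_score(field_scores) >= THRESHOLD

# Hypothetical per-field scores for one candidate pair.
scores = {"__brand__": 1.0, "__model__": 0.9, "title": 0.7}
print(round(weighted_score(scores), 2))  # 0.89
print(is_match(scores))                  # True
```

Under this assumption, a brand mismatch alone (score 0.0 on __brand__) drops the same pair to 0.59, well below the threshold, which is why clean extracted fields carry so much weight in the result.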

Benchmarks

Domain extraction substantially improves matching on the electronics benchmark but slightly reduces it on the software benchmark:

| Dataset | Without Domain | With Domain | Change |
|---------|----------------|-------------|--------|
| Abt-Buy (electronics) | 44.5% F1 | 72.2% F1 | +27.7pp |
| Amazon-Google (software) | 45.3% F1 | 42.1% F1 | -3.2pp |

Domain extraction helps on datasets with structured identifiers (brand, model, SKU) but can hurt on datasets whose descriptions are largely unstructured. For software matching, a clean embedding + ANN (approximate nearest neighbor) pipeline performs better.