23 States, 12 Schemas: ETL Lessons From US Cannabis Pricing

The US cannabis market is the strangest dataset I have worked with. There are roughly 23 adult-use states and a further wave of medical-only programmes. Each state defines product categories differently, regulates labelling and advertising differently, imposes different tax structures, and recognises different licence types. Underneath that, every dispensary chain runs on one of about a dozen point-of-sale systems, each of which exposes prices, inventories and deals in its own dialect. Above it, consumers expect to type a strain name and get an answer.

I have spent the past stretch on CannabisDealsUS turning that surface into something queryable. Most of the lessons translate directly to any multi-source, regulated, consumer-facing data product. A few are specific to the market. Here is what stuck.

“In every multi-source pipeline I have built, the schema decided the ceiling. Everything else was negotiable.”

The hardest decision is the schema for products

A cannabis product has at least four things people want to query: the brand, the strain (or genetics lineage), the product form (flower, vape, edible, concentrate), and the unit. None of those four are clean.

Brand names overlap, change ownership, and operate under different DBA names in different states. Strain names are folk taxonomy — the same genetics ship under different names across producers, and the same name attaches to genetics that have drifted over generations. Product form has at least a dozen sub-categories that matter to consumers (live resin vs. distillate, infused pre-roll vs. classic) and that some POS systems collapse and others do not. Unit is sometimes grams, sometimes count, sometimes per-package, and sometimes a dosing strength.

The first attempt was a wide table that captured everything as it arrived. The second attempt was an aggressively normalised model with a strain catalogue, a brand catalogue, and a join table per source. The second model was right structurally and wrong operationally — it could not absorb new sources without three days of mapping work each.

The third model is a hybrid. Canonical product entities with stable IDs at the centre; per-source observation records as the unit of intake; an explicit resolution layer that takes raw observations and links them to canonical entities with a confidence score. It is the only one that survived contact with a fourth state.
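To make the shape of that third model concrete, here is a minimal sketch of a resolution step. Everything in it is an assumption for illustration rather than the production design: the class names, the equal weighting of brand and strain similarity, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as a stand-in for whatever matching the real pipeline uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterable, Optional


@dataclass(frozen=True)
class CanonicalProduct:
    product_id: str  # stable ID at the centre of the model
    brand: str
    strain: str
    form: str        # e.g. "flower", "vape", "edible", "concentrate"


@dataclass(frozen=True)
class Observation:
    source: str      # which feed produced this row
    raw_brand: str
    raw_name: str
    raw_form: str


@dataclass(frozen=True)
class Resolution:
    observation: Observation
    product_id: Optional[str]  # None means "unresolved, keep for review"
    confidence: float


def _similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def resolve(obs: Observation, catalogue: Iterable[CanonicalProduct],
            threshold: float = 0.8) -> Resolution:
    """Link one raw observation to the best canonical product, if any."""
    best_id, best_score = None, 0.0
    for product in catalogue:
        if product.form != obs.raw_form.strip().lower():
            continue  # assumption: never link across product forms
        # Assumption: brand and strain count equally towards the score.
        score = (0.5 * _similarity(obs.raw_brand, product.brand)
                 + 0.5 * _similarity(obs.raw_name, product.strain))
        if score > best_score:
            best_id, best_score = product.product_id, score
    if best_score < threshold:
        return Resolution(obs, None, best_score)  # below threshold: do not link
    return Resolution(obs, best_id, best_score)
```

The point of the sketch is the separation: observations are stored as they arrive, canonical entities never change ID, and the link between them is a record with a score that can be recomputed when the matching logic improves.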

Address normalisation is half the project

Dispensary addresses look easy and are not. The same physical store appears as “123 Main St, Suite B”, “123 Main Street #B”, and “123 Main St., Ste B” across three sources, plus a sister location at the same suite number in a different city, plus a former location that has not been removed from a directory in eight months.

The pipeline now runs every address through a normalisation pass and a geocoding pass, then deduplicates against an internal dispensary entity store keyed by parcel rather than text. The reduction in duplicate records was roughly 12% — not glamorous, but the difference between a usable dataset and one that misreports market share.
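A rough sketch of the two passes, under stated assumptions: the abbreviation table is a tiny illustrative subset, `geocode` is a hypothetical callable standing in for whatever geocoding service returns a parcel identifier, and the function names are mine.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Illustrative subset only. Note that "st" can also mean "saint";
# a real table needs position-aware rules.
ABBREVIATIONS = {"st": "street", "ste": "suite", "ave": "avenue", "rd": "road"}


def normalise_address(raw: str) -> str:
    """Lower-case, expand common abbreviations, and drop punctuation."""
    text = raw.lower().replace("#", " suite ")
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())


def group_by_parcel(
    records: Iterable[Tuple[str, str, str]],            # (source, raw_address, city)
    geocode: Callable[[str, str], Optional[str]],        # hypothetical: -> parcel ID
) -> Dict[str, List[str]]:
    """Group source records by parcel ID instead of by address text."""
    stores: Dict[str, List[str]] = defaultdict(list)
    for source, raw_address, city in records:
        parcel_id = geocode(normalise_address(raw_address), city.strip().lower())
        if parcel_id is None:
            continue  # in practice this would go to a review queue, not be dropped
        stores[parcel_id].append(source)
    return dict(stores)
```

With this, the three spellings of the Main Street store all normalise to `123 main street suite b`, and the sister location stays separate because the city is part of the geocoding input, so it resolves to a different parcel.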

Stage	Records	Notes
Raw scrape (one week)	1.4M	Inventory rows across all sources
After product resolution	1.1M	21% absorbed as duplicates against canonical products
After dispensary resolution	1.0M	Address-driven dedupe of locations
After price-validity gating	0.92M	Drops impossible prices, stale entries, mislabelled units

Regulatory variance is a feature of the schema, not a footnote

Cannabis regulation is per-state, and the differences are not cosmetic. New York taxes by THC content. Massachusetts taxes by category. California taxes at the cultivation level and again at retail. Some states cap the daily purchase quantity at one ounce, others at 2.5 grams of concentrate, others at a milligram-of-THC equivalence.

The first version of the schema treated tax as a single field on the price record. Every report we tried to build off it ended in a footnote that started with “in California, however…”. The current version models the legal context as a first-class entity attached to the dispensary and the date — a small, slow-moving lookup that tells the rest of the pipeline what rules apply. Reports stop needing footnotes when the schema knows the law.
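As a sketch of what “legal context as a first-class entity” can look like, assume a small table of date-bounded rule sets per state. The field names, the placeholder state code `XX`, and the rule values below are all illustrative assumptions, not real statutes and not the actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class LegalContext:
    state: str
    valid_from: date
    valid_to: Optional[date]  # None = still in force
    tax_basis: str            # e.g. "thc_content", "category", "retail_only"
    daily_limit: str          # kept as text here; a real model would structure it


def context_for(state: str, on: date,
                contexts: Iterable[LegalContext]) -> LegalContext:
    """Return the rule set that applied in `state` on the date `on`."""
    for ctx in contexts:
        if ctx.state != state:
            continue
        if ctx.valid_from <= on and (ctx.valid_to is None or on <= ctx.valid_to):
            return ctx
    raise LookupError(f"no legal context for {state} on {on.isoformat()}")


# Placeholder data: two successive rule sets for a made-up state "XX".
CONTEXTS = [
    LegalContext("XX", date(2020, 1, 1), date(2022, 6, 30),
                 "cultivation_and_retail", "1 oz flower"),
    LegalContext("XX", date(2022, 7, 1), None,
                 "retail_only", "1 oz flower"),
]
```

A price record then carries only the dispensary and the observation date; anything tax-related is derived by calling `context_for` with the dispensary's state and that date, so a rule change becomes one new row in the lookup rather than a new branch in every report.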

This generalises. In any regulated data domain, the temptation is to push compliance to the report layer because the data layer is “just the numbers”. The numbers are not free of the law. Modelling the regulatory context as schema is what lets the system grow without compounding exceptions.

Scrapers fail in the same five ways and you should plan for all of them

Anyone who has run a long-lived scraping pipeline knows the categories. I am writing them down because I had to learn each one twice:

Selector drift. The CSS class moves or the template changes. The fix is to assert against semantic content at extraction time, not just count rows.

Rate enforcement. The source starts returning 429s or shadow-bans your IP. The fix is a per-source budget, not a global rate limit.

Schema additions. A new field appears in the source response and the parser silently drops it. The fix is strict input typing with a known-extra-keys allowlist.

Partial outages. One store goes dark for a week and a stale snapshot lingers. The fix is a freshness signal per dispensary, not just per source.

Truthful failures. The source returns valid empty results because there is genuinely no inventory. The fix is a separate signal for “observed empty” vs. “failed to observe”, because those mean very different things downstream.
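The last two fixes are easiest to see together. The following is a minimal sketch, assuming a fetcher that hands back `None` when it could not observe a store and an empty list when the menu was genuinely empty; the names `ObservationStatus`, `classify_fetch` and `stale_dispensaries` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Dict, Iterable, List, Optional


class ObservationStatus(Enum):
    OBSERVED = "observed"
    OBSERVED_EMPTY = "observed_empty"   # the source truthfully had no inventory
    FAILED = "failed_to_observe"        # we do not know what the source had


@dataclass(frozen=True)
class FetchResult:
    dispensary_id: str
    fetched_at: datetime
    status: ObservationStatus


def classify_fetch(dispensary_id: str, fetched_at: datetime,
                   rows: Optional[List[dict]]) -> FetchResult:
    """rows is None when the fetch or parse failed, [] when the menu was empty."""
    if rows is None:
        status = ObservationStatus.FAILED
    elif not rows:
        status = ObservationStatus.OBSERVED_EMPTY
    else:
        status = ObservationStatus.OBSERVED
    return FetchResult(dispensary_id, fetched_at, status)


def stale_dispensaries(results: Iterable[FetchResult], now: datetime,
                       max_age: timedelta) -> List[str]:
    """Dispensaries whose last successful observation is older than max_age."""
    last_good: Dict[str, Optional[datetime]] = {}
    for r in results:
        previous = last_good.setdefault(r.dispensary_id, None)
        if r.status is ObservationStatus.FAILED:
            continue  # a failed fetch never refreshes the freshness signal
        if previous is None or r.fetched_at > previous:
            last_good[r.dispensary_id] = r.fetched_at
    return sorted(d for d, seen in last_good.items()
                  if seen is None or now - seen > max_age)
```

An empty menu counts as a fresh observation, while a failed fetch does not, so a store that has gone dark surfaces as stale on its own even when the rest of its source is healthy.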

The system is now boring in the right way. Most weeks nothing breaks. In the weeks when something does break, the alerts identify which of the five categories fired and the fix takes under an hour.
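One way an alert can name its category is a single triage function over each fetch. This is a hedged sketch, not the real alerting code: the required and allowlisted keys, the 24-hour threshold, the order of the checks, and the `FetchSignal` shape are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Optional


class FailureCategory(Enum):
    SELECTOR_DRIFT = "selector drift"
    RATE_ENFORCEMENT = "rate enforcement"
    SCHEMA_ADDITION = "schema addition"
    PARTIAL_OUTAGE = "partial outage"
    TRUTHFUL_FAILURE = "truthful failure"


REQUIRED_KEYS = {"name", "price", "unit"}   # hypothetical parsed-row fields
ALLOWED_EXTRA_KEYS = {"promo_text"}         # known extras we choose to ignore


@dataclass(frozen=True)
class FetchSignal:
    http_status: int
    rows: List[Dict[str, Any]]
    hours_since_last_good: float


def diagnose(signal: FetchSignal,
             max_age_hours: float = 24.0) -> Optional[FailureCategory]:
    """Return the category that fired for this fetch, or None if it looks healthy."""
    if signal.http_status == 429:
        return FailureCategory.RATE_ENFORCEMENT
    if signal.http_status != 200:
        # One bad response is treated as noise; a long gap is an outage.
        if signal.hours_since_last_good > max_age_hours:
            return FailureCategory.PARTIAL_OUTAGE
        return None
    if not signal.rows:
        return FailureCategory.TRUTHFUL_FAILURE  # valid response, nothing listed
    for row in signal.rows:
        price = row.get("price")
        # Semantic assertion: the fields must exist and the price must be plausible.
        if (not REQUIRED_KEYS.issubset(row)
                or not isinstance(price, (int, float)) or price <= 0):
            return FailureCategory.SELECTOR_DRIFT
    for row in signal.rows:
        if set(row) - REQUIRED_KEYS - ALLOWED_EXTRA_KEYS:
            return FailureCategory.SCHEMA_ADDITION  # unknown key: surface it
    return None
```

Returning a named category rather than a generic error is what keeps the response time short: each category maps to one known fix.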

What this has to do with everything else

Cannabis is the test surface, but the lessons travel. A normalised schema with an explicit resolution layer; entity stores keyed by something stable; regulatory context as a first-class model; failure modes named and budgeted for individually — these are the moves I use on every consulting engagement, whether the source is a dispensary feed, an ad platform export, or an internal CRM.

The unglamorous infrastructure is what compounds. The dashboards on top are interchangeable. The reason I run a cannabis data platform alongside an AI governance tool is that both are exercises in the same craft: the value lives in the schema and the discipline of keeping it honest as the world changes around it.

Theo Valmis

© Theo Valmis