Aisle 19
Hierarchical demand forecasting, long-tail SKUs
A 1,200-store multi-format retailer was carrying about 25% more stock than it needed because the legacy forecast couldn't tell pasta from imported truffle salt. Top-selling SKUs got accurate predictions; everything in the long tail got rounded up to safety stock, just in case. The CFO saw the inventory line; the merchandising team saw stockouts on the same SKUs the warehouse was overstocked on. Category managers had stopped trusting the system years ago and were overriding orders by gut feel, which moved the bias around without removing it. The fix wasn't a better single model. It was a model that respected the shape of the catalog: the fast-movers learn from themselves, the long tail borrows strength from its category and region, and the signals that move demand (weather, promo, regional events, macro) get a seat at the table instead of being averaged out.
Hierarchical fit at SKU × store × season
Instead of a flat per-SKU model, the forecast learns at three grains at once: SKU, store, season. The fast-movers carry enough history to learn their own seasonality cleanly. The long tail (most of the catalog by SKU count, a small slice of revenue) gets pooled through its category × region group and shrinks toward the group mean when its own signal is too sparse to trust. The shrinkage strength is learned per group, not set globally.
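The pooling idea can be sketched in a few lines. This is a minimal moment-based illustration, not the production model: the function name `shrink_to_group`, the toy histories, and the variance-ratio estimate of the shrinkage constant are all assumptions made for the example. Each SKU's mean is pulled toward its group mean with weight n / (n + k), where k is estimated from that group's own data, so a SKU with two observations moves much further than one with eight.

```python
from statistics import mean, pvariance

def shrink_to_group(sku_history):
    """Shrink each SKU's mean demand toward the mean of its group.

    sku_history maps SKU name -> list of observed demand for one
    category x region group (hypothetical input shape).
    """
    sku_means = {sku: mean(h) for sku, h in sku_history.items()}
    group_mean = mean(list(sku_means.values()))
    # Noise inside SKUs versus real spread between SKUs, both from this group only.
    within = mean([pvariance(h) for h in sku_history.values()])
    between = pvariance(list(sku_means.values()))
    shrunk = {}
    for sku, history in sku_history.items():
        n = len(history)
        if between == 0:
            weight = 0.0  # SKUs are indistinguishable: use the group mean
        else:
            k = within / between  # per-group shrinkage constant
            weight = n / (n + k)  # more history -> trust the SKU's own mean more
        shrunk[sku] = weight * sku_means[sku] + (1 - weight) * group_mean
    return shrunk

# Toy group: two steady sellers and one sparse, noisy long-tail SKU.
history = {
    "sea_salt": [5, 6, 4, 5, 6, 4, 5, 5],
    "smoked_salt": [6, 7, 5, 6, 7, 5, 6, 6],
    "truffle_salt": [0, 14],
}
shrunk = shrink_to_group(history)
```

In the toy group the sparse SKU ends up close to the group mean instead of at its own two-point average, which is the behavior described above. A fitted hierarchical model would learn the constant jointly with seasonality rather than from a simple variance ratio.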
Causal signal layer with lag alignment
Four signal families feed in above the base model: weather (lag 0 to 14 days), macro (CPI, FX, regional unemployment), internal promo calendar, regional events. The lags are not picked by hand; they are searched per category, because a hot weekend lifts beer the same day and ice cream three days later. The signal layer is additive on top of the base, so toggling a signal off at inference time produces a clean delta you can show a category manager.
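A lag search of this kind can be illustrated as a scan over candidate lags that keeps the one where the shifted signal correlates most strongly with demand. The sketch below is an assumption about how such a search might look: `best_lag`, the plain Pearson score, and the synthetic series are invented for the example, and a real search would score lags on held-out forecast error rather than raw correlation.

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def best_lag(signal, demand, max_lag=14):
    """Return the lag (in days) at which signal[t - lag] best tracks demand[t]."""
    scores = {}
    for lag in range(max_lag + 1):
        shifted_signal = signal[: len(signal) - lag]
        aligned_demand = demand[lag:]
        scores[lag] = abs(pearson(shifted_signal, aligned_demand))
    return max(scores, key=scores.get)

# Synthetic data: beer reacts the same day, ice cream three days later.
temperature = [(7 * t) % 11 + (3 * t) % 5 for t in range(60)]
beer = [50 + 2 * temp for temp in temperature]
ice_cream = [30] * 3 + [30 + 3 * temp for temp in temperature[:-3]]

beer_lag = best_lag(temperature, beer)
ice_cream_lag = best_lag(temperature, ice_cream)
```

Running the search once per category and signal family gives each category its own lag table, which is what lets the same weather feed serve beer and ice cream differently.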
Reconciliation for coherence
After the base model emits per-SKU per-store forecasts, a reconciliation step (MinT-style weighted least squares) enforces the hierarchy: store totals must add up to chain totals; category totals must add up to banner totals. Without this step the forecasts disagree with themselves and the planner can't trust any single level. With it, the same number rolls up cleanly from a single jar of olives to the chain Q3 plan.
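For readers who want the mechanics, here is a minimal sketch of weighted-least-squares reconciliation on a toy hierarchy. It assumes NumPy, a diagonal weight matrix built from per-series forecast-error variances, and a three-series bottom level; the summing matrix `S`, the function name `reconcile_wls`, and all the numbers are made up for illustration. Full MinT would use an estimated error covariance rather than a plain diagonal.

```python
import numpy as np

# Hypothetical toy hierarchy: chain = store A + store B.
# Store A carries two SKUs, store B carries one.
# Rows: chain, store A, store B, then the three bottom-level series.
S = np.array(
    [
        [1, 1, 1],
        [1, 1, 0],
        [0, 0, 1],
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
    ],
    dtype=float,
)

def reconcile_wls(S, base, variances):
    """Project incoherent base forecasts onto the coherent subspace.

    base: one forecast per row of S (all levels).
    variances: assumed forecast-error variance per row; larger means less trusted.
    """
    W_inv = np.diag(1.0 / np.asarray(variances, dtype=float))
    # G maps all-level base forecasts to reconciled bottom-level forecasts.
    G = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)
    return S @ (G @ np.asarray(base, dtype=float))

# Base forecasts that disagree: store A says 62, but its SKUs sum to 65.
base = [105.0, 62.0, 40.0, 35.0, 30.0, 38.0]
variances = [25.0, 16.0, 9.0, 4.0, 4.0, 9.0]
coherent = reconcile_wls(S, base, variances)
```

After the projection every aggregate equals the sum of its children, and forecasts that were already coherent pass through unchanged.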
Per-SKU drift, targeted retraining
Supplier substitutions, format changes, and regional remerchandising all show up as drift in a handful of SKUs while the rest of the catalog stays stable. A per-SKU drift score (population stability + residual distribution shift) flags only the SKUs that need a retrain, and a per-SKU retraining job runs without touching the hierarchy weights. Full-hierarchy retraining stays on a quiet cadence; the noisy stuff gets fixed in hours, not weeks.
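One way to picture the per-SKU score: compute a population stability index (PSI) on demand and a standardized shift in mean forecast residual, then flag the SKU if either crosses a limit. Everything specific in this sketch is an assumption for illustration, including the function names, the five-bin PSI, and the limits of 0.25 and 2.0; the production score and its thresholds are not described in that detail here.

```python
import math
from statistics import mean, pstdev

def psi(reference, recent, bins=5, eps=1e-6):
    """Population stability index of `recent` against bins cut on `reference`."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference window

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(recent)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

def needs_retrain(ref_demand, new_demand, ref_resid, new_resid,
                  psi_limit=0.25, shift_limit=2.0):
    """Flag a SKU when demand or forecast residuals have moved (limits are assumed)."""
    shift = abs(mean(new_resid) - mean(ref_resid)) / (pstdev(ref_resid) or 1.0)
    return psi(ref_demand, new_demand) > psi_limit or shift > shift_limit

# Toy SKU: a stable window, then a format change that doubles unit demand.
ref_demand = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
ref_resid = [0.5, -0.5, 1.0, -1.0, 0.0, 0.5, -0.5, 0.0, 1.0, -1.0]
stable = needs_retrain(ref_demand, ref_demand, ref_resid, ref_resid)
drifted = needs_retrain(ref_demand, [2 * v for v in ref_demand],
                        ref_resid, [r + 10 for r in ref_resid])
```

Because the score uses only that SKU's own demand and residuals, a flag can trigger a per-SKU refit without reading or writing any group-level parameters.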
Constraints at the recommendation seam
Order cutoffs, MOQs, case packs, shelf life, and supplier lead times: none of these belongs inside the demand model. The forecast emits a clean expected-demand number per SKU per store per day; a separate recommendation layer applies the constraints and rounds to an actually-orderable quantity. Clean separation means a constraint change (new supplier, new MOQ) doesn't poison the historical model fit.
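The seam can be sketched as a pure function that takes the forecast as input and never feeds anything back to it. The `SupplyConstraints` fields, the `recommend_order` name, and the rule of covering demand only as far as shelf life allows are simplifying assumptions for this example; order cutoffs and lead times are left out to keep it short.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class SupplyConstraints:
    case_pack: int        # units per case
    moq: int              # minimum order quantity, in units
    shelf_life_days: int  # days before unsold stock expires

def recommend_order(daily_demand, on_hand, constraints):
    """Turn an expected-demand forecast into an orderable quantity.

    daily_demand: expected units per day from the forecast (read-only input).
    """
    # Never plan to cover more days than the product can sit on the shelf.
    horizon = min(len(daily_demand), constraints.shelf_life_days)
    need = sum(daily_demand[:horizon]) - on_hand
    if need <= 0:
        return 0
    # Respect the MOQ, then round up to whole cases.
    cases = math.ceil(max(need, constraints.moq) / constraints.case_pack)
    return cases * constraints.case_pack

olives = SupplyConstraints(case_pack=12, moq=24, shelf_life_days=7)
order = recommend_order([3.2] * 14, on_hand=10, constraints=olives)
```

Swapping in a new supplier means building a new `SupplyConstraints` value; the demand history and the fitted model are untouched.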
Signals come in from the top (weather, macro, promo, regional events) and feed a causal signal layer with proper lag alignment. The hierarchy (SKU × store × season) enters from the left and feeds the base hierarchical model, where Prophet-ensemble fits run alongside a BSTS trend decomposition and long-tail shrinkage. Reconciliation enforces coherence across the hierarchy before anything reaches a human. From there the forecast splits: the category-manager dashboard exposes what-if toggles for trust; the order-recommendation engine applies operational constraints (MOQs, cutoffs, shelf life) to convert demand into actually-orderable quantities. A drift loop along the bottom taps realized demand, scores per-SKU drift, and triggers targeted retraining without touching the rest of the hierarchy.
Long-tail SKUs had so little per-SKU signal that the legacy model defaulted to safety stock and stayed there year after year.
Hierarchical shrinkage: long-tail SKUs borrow strength from their category × region group. The shrinkage strength is learned per group, so a niche category with stable buyers shrinks differently than a churn-heavy one. The long tail stops getting rounded up just because it's quiet.
Promo and weather are noisy at the daily level; raw signals dragged the forecast around in ways the team couldn't justify.
A causal signal layer with searched lag features (per category) and a sparsity prior on signal coefficients. Signals that don't carry real predictive weight in a given category get zeroed out; the ones that do show their contribution as an attributable delta the category manager can see.
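As an illustration of how a sparsity prior zeroes out weak signals, the sketch below fits additive signal coefficients with an L1 penalty by coordinate descent. The L1 penalty stands in for the sparsity prior; the function names, the penalty value, and the synthetic signals are assumptions for the example, not the production estimator.

```python
def soft_threshold(value, penalty):
    """Shrink toward zero by `penalty`; anything inside the band becomes exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def fit_sparse_signals(signals, target, penalty, sweeps=50):
    """Coordinate-descent lasso: target ~ sum(coef[name] * signals[name]).

    signals: name -> lag-aligned signal series (hypothetical input shape).
    target: demand left unexplained by the base model.
    """
    n = len(target)
    coef = {name: 0.0 for name in signals}
    for _ in range(sweeps):
        for name, x in signals.items():
            # Residual with every other signal's current contribution removed.
            partial = [
                y - sum(coef[o] * signals[o][t] for o in signals if o != name)
                for t, y in enumerate(target)
            ]
            rho = sum(a * b for a, b in zip(x, partial)) / n
            scale = sum(a * a for a in x) / n
            coef[name] = soft_threshold(rho, penalty) / scale
    return coef

# Synthetic category: weather matters, promo barely, macro not at all.
signals = {
    "weather": [1, -1] * 4,
    "promo": [1, 1, -1, -1] * 2,
    "macro": [1, 1, 1, 1, -1, -1, -1, -1],
}
target = [2 * w + 0.05 * p for w, p in zip(signals["weather"], signals["promo"])]
coef = fit_sparse_signals(signals, target, penalty=0.1)
```

The weather coefficient survives, slightly shrunk, while promo and macro land on exactly zero. Multiplying a surviving coefficient by the day's signal value gives the attributable delta for that signal.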
Drift was constant (supplier substitutions, regional remerchandising, format changes), and a full-hierarchy retrain was expensive enough that it lagged the drift by weeks.
An online per-SKU drift detector (population stability + residual distribution shift) flags only the affected SKUs and triggers a per-SKU retraining job. Hierarchy weights stay stable; the noisy SKUs converge in hours; full retrain stays on a quiet cadence.
Category managers didn't trust the model. The previous system was a black box that had been wrong often enough that they routinely overrode its orders.
A what-if dashboard exposes the signal toggles directly. Managers can see, per SKU, how much weather is adding, how much promo is contributing, what the forecast looks like without macro factored in. Trust came from visibility, not from a leaderboard number. Adoption hit 90% inside six months.
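Because the signal layer is additive, a what-if toggle reduces to summing the deltas of whichever signals are switched on. A minimal sketch, with the function name `what_if` and all numbers invented for illustration:

```python
def what_if(base_forecast, signal_deltas, enabled):
    """Return the forecast and the per-signal deltas for the enabled signals only."""
    active = {name: delta for name, delta in signal_deltas.items() if name in enabled}
    return base_forecast + sum(active.values()), active

# Hypothetical per-signal contributions for one SKU on one day.
deltas = {"weather": 6.0, "promo": 10.0, "macro": -2.0, "events": 0.0}

full, full_parts = what_if(40.0, deltas, enabled={"weather", "promo", "macro", "events"})
no_macro, _ = what_if(40.0, deltas, enabled={"weather", "promo", "events"})
```

The difference between the two forecasts is exactly the macro delta, which is the number a manager would see when flipping that toggle.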
Operational realities (order cutoffs, MOQs, case packs, shelf life) kept leaking into the forecast and corrupting the historical fit.
Hard separation. The demand model emits a clean expected-demand number; a downstream recommendation layer applies the constraints and rounds to orderable quantities. Constraints change all the time; the forecast stays clean across the change.
Forecast
- Prophet ensemble, per-SKU and per-group
- Custom hierarchical reconciliation (MinT-style WLS)
- BSTS for trend decomposition on top categories
Drift + retraining
- Evidently + custom per-SKU drift score
- Online retraining triggers, per-SKU isolation
- Quiet-cadence full-hierarchy refit (monthly)
Signals
- Weather API, provider-agnostic adapter
- Macro feeds (CPI, FX, regional unemployment)
- Internal promo store + regional event calendar
Serving
- FastAPI for on-demand forecast + recommendation
- Redis cache for hot forecasts (planner UI)
- Batch overnight + on-demand re-fit modes
Dashboards
- Next.js category-manager portal
- Recharts for what-if visualization
- Per-SKU signal-contribution view at order time
Forecast Playground
Three SKU profiles (fast-mover, mid, long-tail), each with 12 months of real-shape demand and two overlaid forecasts: the legacy line (dashed, muted, often overshoots) and the new hierarchical line (solid, brand). Toggle the four causal signals on and off; the new forecast visibly re-fits, the legacy line never moves. The three stat tiles recompute live so the signal contribution lands in concrete inventory and MAPE numbers, not abstractions.
- → Inventory holding down 18% across the catalog within two quarters
- → Forecast variance on long-tail SKUs halved against the legacy baseline
- → Category-manager adoption hit 90% inside 6 months, mostly via the what-if dashboard
- → Full rollout across 1,200 stores on a single live banner with two more in pilot
Rolled out to [REDACTED] stores across multiple banners. Per-SKU drift detection runs continuously; the full-hierarchy refit runs on a monthly cadence. Reviewed quarterly against the merchandising plan and the inventory P&L.
Live across 1,200 stores on one banner with two more in pilot. Per-SKU drift retraining continuous; full-hierarchy refit monthly. Anonymized case study available Q1 2026.