How PlateIQ works

This page documents the data pipeline, projection engine, stabilization rules, and calibration metrics behind every number on the live tabs. If a cell on Slate or Matchups seems off, you should be able to trace it back from here.

The model, in one paragraph

Every team-run and prop projection on PlateIQ comes out of bbsim, a Monte-Carlo play-by-play simulator that runs 5,000 simulated games for each scheduled matchup. For each plate appearance, it draws event outcomes from a joint model built on pitcher rates, hitter rates, park factors, weather, and umpire tendencies — then aggregates across the 5,000 sims to produce run means, win probabilities, p10/p50/p90 bands, and per-prop projections. The bbsim cache covers 2020–2024 and feeds the accuracy table below. The bet-sizing layer on top (Platt calibration + Kelly) is trained on 2022–2024 and held-out validated against the full 2025 season — see the Best Bets section for the actual ROI numbers.

Where the data comes from

Feed | Source | Refresh
Schedule, lineups, probable pitchers | MLB Stats API | every 5 min on game day
Hitter + pitcher season stats, Statcast xStats | Baseball Savant | nightly
Park factors (per-side HR, runs, K) | FanGraphs Guts, 3-yr rolling | nightly ingest
Game weather (temp, wind, humidity) | Open-Meteo forecast, 3-hr avg from first pitch | every 5 min on game day
Moneyline / game total / run line | TheOddsAPI (BookMaker.eu primary), h2h+totals+spreads bundle | hourly
Team totals (book-direct prices) | TheOddsAPI per-event team_totals market | hourly
Umpire zone tendencies | UmpScorecards season aggregates (dampened ×0.4) | nightly
Player prop lines (HR, K, TB, etc.) | TheOddsAPI, paused since 2026-04-26 lookahead audit | manual

The projection engine (bbsim)

Each plate-appearance draw is a multinomial over {K, BB, HBP, 1B, 2B, 3B, HR, other BIP} with probabilities shaped by three layers:

  1. Talent layer. Start from the hitter's and pitcher's season rates (K%, BB%, xwOBA, etc.), then apply a Marcel-style regression toward league mean using the stabilization thresholds below. At low sample this pulls extreme rates toward league average; at large sample the raw rate dominates.
  2. Matchup layer. Platoon split (LHB vs RHP etc.), pitch-type mix, batter-vs-pitcher history, recent form (14-day LII).
  3. Environment layer. Park factor (per-side HR, runs, K), weather impact (temp, wind-projection-to-CF, humidity clipped at extremes), umpire dampened K/BB tendencies.

Each game is simulated 5,000 times: every plate appearance for both lineups is drawn from the resulting per-PA probabilities, with base-state progression from a Retrosheet 1974–2024 RE24 matrix. Aggregate stats are averages over the 5,000 simulated outcomes.
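A minimal sketch of a single plate-appearance draw, assuming the three layers have already produced one probability per outcome. The probabilities below are made-up illustrative values, not bbsim output, and `draw_pa` is a hypothetical name.

```python
import random

# Outcome categories from the multinomial described above.
OUTCOMES = ["K", "BB", "HBP", "1B", "2B", "3B", "HR", "other BIP"]

# Hypothetical per-PA probabilities (illustrative only, not bbsim output).
EXAMPLE_PROBS = [0.22, 0.08, 0.01, 0.14, 0.045, 0.005, 0.03, 0.47]


def draw_pa(probs, rng):
    """Draw one plate-appearance outcome from a multinomial over OUTCOMES."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return rng.choices(OUTCOMES, weights=probs, k=1)[0]


rng = random.Random(7)  # seeded so the sketch is reproducible
draws = [draw_pa(EXAMPLE_PROBS, rng) for _ in range(5000)]
hr_rate = draws.count("HR") / len(draws)  # close to the 0.03 input, up to sampling noise
```

A real simulator would also track outs, baserunners, and lineup position between draws; this only shows the per-PA sampling step.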

Why 5,000 sims and not more?
At n=5,000 the Monte-Carlo standard error on a p=0.15 HR-prop probability is √(p(1−p)/n) ≈ 0.5%, down from ~1.6% at n=500. Higher n has diminishing returns (the error shrinks only with √n) and adds proportional latency on cold-start calls. The cache is regenerated nightly at n=5,000.
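The standard-error figures can be checked directly. This is just the binomial formula quoted above; `mc_standard_error` is an illustrative name.

```python
import math


def mc_standard_error(p, n):
    """Binomial standard error of a probability estimated from n simulations."""
    return math.sqrt(p * (1.0 - p) / n)


se_500 = mc_standard_error(0.15, 500)    # about 0.016
se_5000 = mc_standard_error(0.15, 5000)  # about 0.005
```

Going from 500 to 5,000 sims is a 10× increase in work for a √10 ≈ 3.2× reduction in error, which is the diminishing return the paragraph refers to.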

Matchup Score

Matchup Score is a 0–100 percentile of expected per-PA wOBA for tonight's specific batter / pitcher / park / weather combination, ranked against the empirical distribution of every batter-pitcher-park-weather combo over 2023–2025 (n ≈ 597K PAs). It answers "how does tonight's spot rank versus everything we've seen the last three years?"

Composition

Additive in wOBA units — every column on the row corresponds to one term:

E_skill = lg_xwOBA + (bat_xwOBA_vs_hand − lg_xwOBA) + (pit_xwOBA_vs_hand − lg_xwOBA)
E_env  = E_skill + park_delta_woba(venue, stand) + weather_delta_woba(temp, wind_to_CF, humidity)
Matchup Score = percentile_rank(E_env, 2023-2025 distribution)
  • Batter and pitcher xwOBA-vs-hand: point-in-time, Marcel-regressed (200 PA prior toward league mean), using all PAs strictly before the game date — no lookahead.
  • Park delta: hand-split, wOBA-direct factor from a sparse OLS with batter + pitcher fixed effects on 2023–2025 events. Empirical-Bayes shrunk toward 0 with a 1,000-PA prior. Coors RHB comes out at +.039 wOBA (the largest hitter-park delta in MLB); T-Mobile Park RHB at −.031 (the largest pitcher-park delta).
  • Weather delta: empirically fit on 2023–2025 actual wOBA, holding batter / pitcher / venue×stand fixed: +0.30 wOBA pts/°F above 70°F, +0.72 wOBA pts/mph wind-to-CF, −0.07 wOBA pts/% humidity above 60%. Zero for closed-roof venues.
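The composition can be sketched as below. Assumptions, none of them confirmed by this page: a league xwOBA of .320, temperature and humidity terms that are zero below their thresholds, a signed linear wind term, and a percentile rank computed against a pre-sorted reference sample. All function names are illustrative.

```python
from bisect import bisect_right

LG_XWOBA = 0.320  # assumed league baseline for the example


def e_skill(bat_xwoba_vs_hand, pit_xwoba_vs_hand, lg_xwoba=LG_XWOBA):
    """League mean plus the batter's and pitcher's deltas from it."""
    return lg_xwoba + (bat_xwoba_vs_hand - lg_xwoba) + (pit_xwoba_vs_hand - lg_xwoba)


def weather_delta_woba(temp_f, wind_to_cf_mph, humidity_pct, roof_closed=False):
    """Weather delta in wOBA units, using the fitted slopes quoted above.

    Assumption: temperature and humidity contribute nothing below 70 F / 60%;
    wind is signed (negative when blowing in).
    """
    if roof_closed:
        return 0.0
    pts = (0.30 * max(temp_f - 70.0, 0.0)
           + 0.72 * wind_to_cf_mph
           - 0.07 * max(humidity_pct - 60.0, 0.0))
    return pts / 1000.0  # 1 wOBA point = .001


def e_env(skill, park_delta, weather_delta):
    """Skill term plus park and weather deltas, all in wOBA units."""
    return skill + park_delta + weather_delta


def percentile_rank(value, sorted_reference):
    """0-100 percentile of value within an ascending reference sample."""
    return 100.0 * bisect_right(sorted_reference, value) / len(sorted_reference)
```

For a league-average batter and pitcher at Coors (RHB, +.039) on a 78°F night with 8 mph blowing out and 50% humidity, `e_env(e_skill(0.320, 0.320), 0.039, weather_delta_woba(78, 8, 50))` comes to about .367 under these assumptions.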

Two flavors of wOBA

Skill estimates use Statcast xwOBA for batted balls (physics-based, park-and-weather-neutral) — that's the noise-resistant signal we want for batter / pitcher quality. Park and weather deltas use actual wOBA — those layers exist to capture the things xwOBA filters out (warning-track FBs becoming HRs, ball carry on hot days). Mixing the two on a single additive scale is intentional.

Scale calibration

The 2023–2025 E_env distribution: mean .320, σ = .032, 1%ile .251, 50%ile .318, 99%ile .409. A few worked anchors against that distribution:

Matchup | E_env | Score
Aaron Judge vs LHP @ Yankee Stadium, 78°F 8 mph wind out | .455 | 99
Top of Reds lineup vs Kyle Freeland @ Coors, summer night | .39 – .40 | 96 – 98
Median PA across all 2023–2025 matchups | .318 | 50
League-average bat vs Skubal @ T-Mobile, 58°F 12 mph wind in | .20 | 0 – 1

Freeze rule

The skill and park components freeze when the game's lineup flips to confirmed in the daily-lineups feed. After lock-in only the weather delta keeps refreshing (Open-Meteo forecasts update through the day), so users see a stable Matchup Score that drifts at most a few points as the wind / temperature forecast tightens.

Small-sample regression (the ★ marker)

Rate stats need a minimum sample before their observed value is signal. PlateIQ follows Russell Carleton's r=0.5 thresholds: below these, cells render without heatmap color and carry a ★ marker. The displayed value is always Marcel-regressed toward league mean:

X_shown = (PA × X_raw + r × X_league) / (PA + r)

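Here r is the stat's stabilization threshold (in PA or TBF) from the table below, so a rate sitting exactly at its threshold is shown halfway between raw and league. Both the display and the projection engine see the regressed number; they never diverge.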
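A minimal sketch of the regression, assuming r in the formula above is the stat's stabilization threshold. The .22 league K% in the usage note is a placeholder, and both function names are illustrative.

```python
def marcel_regress(x_raw, sample, x_league, threshold):
    """Regress an observed rate toward league mean.

    sample    -- PA (hitters) or TBF (pitchers) behind x_raw
    threshold -- the stat's stabilization threshold, the r in X_shown
    """
    return (sample * x_raw + threshold * x_league) / (sample + threshold)


def is_small_sample(sample, threshold):
    """True when the cell would render without color and carry the marker."""
    return sample < threshold
```

With a hitter k_pct threshold of 60, a raw 30% strikeout rate over exactly 60 PA against a .22 league rate gives `marcel_regress(0.30, 60, 0.22, 60)` = 0.26, halfway between the two.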

Hitter stat | Threshold (PA)
k_pct | 60
bb_pct | 120
iso | 160
xwoba | 100
woba | 100
xba | 100
avg | 100
slg | 120
ops | 120
barrel_pct | 150
hard_pct | 150
chase_pct | 100
whiff_pct | 100

Pitcher stat | Threshold (TBF)
k_pct | 70
k_per_9 | 70
bb_pct | 170
bb_per_9 | 170
xwoba_against | 120
fip | 200
xfip | 200
siera | 200
whip | 200
era | 250
csw_pct | 150
swstr_pct | 150
o_swing_pct | 150
gb_pct | 200
barrel_pct_against | 150
hard_pct_against | 150
avg_fb_velo | 20

Calibration (backtest vs. actual)

Below are the bbsim-engine accuracy metrics on every precomputed projection in the cache: 11,662 games across 5 seasons. Lower is better on every column. This measures the simulator's raw run-level + win-prob accuracy. The bet-sizing layer on top of bbsim (Platt scaling, Kelly sizing, edge gates) is validated separately against the 2025 hold-out season; see the Best Bets section below for those numbers.

Team-run MAE + win-prob Brier
Season | Games | Home R MAE | Away R MAE | Total R MAE | Win Brier
2020 | 951 | 2.514 | 2.535 | 3.743 | 0.2479
2021 | 2,678 | 2.492 | 2.436 | 3.594 | 0.2506
2022 | 2,738 | 2.412 | 2.554 | 3.595 | 0.2476
2023 | 2,666 | 2.478 | 2.570 | 3.691 | 0.2525
2024 | 2,629 | 2.445 | 2.586 | 3.612 | 0.2497
All | 11,662 | 2.461 | 2.536 | 3.633 | 0.2499
  • MAE = mean absolute error in runs. Lower is better.
  • Brier = mean squared error on home-win probability (0 perfect, 0.25 coin-flip, 1 worst).
  • Stats computed across every precomputed backtest in bbsim_cache. Bias is not reported because the cache contains only frozen training years.
  • Generated 4/19/2026, 7:40:31 PM ET. Refresh with python pipeline/compute_calibration.py.
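For reference, the two metrics in the table reduce to the following. This is a generic sketch of the standard definitions, not the contents of pipeline/compute_calibration.py.

```python
def mean_absolute_error(projected, actual):
    """Mean absolute error between projected and actual run totals."""
    return sum(abs(p - a) for p, a in zip(projected, actual)) / len(actual)


def brier_score(home_win_probs, home_won):
    """Mean squared error on home-win probability.

    home_won entries are 1 if the home team won, else 0.
    """
    return sum((p - y) ** 2 for p, y in zip(home_win_probs, home_won)) / len(home_won)
```

A model that always says 50% scores exactly 0.25 on Brier regardless of results, which is why the coin-flip anchor sits there.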

How Best Bets are calculated

The Best Bets tab takes every game on the slate where both lineups are confirmed, asks bbsim for its win probability + run distribution, then compares each market the books are offering against our number. A bet surfaces when the gap is large enough to clear vig and a Kelly check.

1. Edge

For each market (moneyline, total over/under, team total over/under) we convert the offered American price to its raw implied probability — this is the book's breakeven rate including vig — and subtract it from our model's probability:

edge_pp = (our_prob − book_implied_prob) × 100

We compare to the vigged book number, not the de-vigged fair price. That keeps the threshold honest: a 2pp edge means 2pp above the actual price you'd be paying, after the house takes its cut.
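A short sketch of the edge calculation, using the standard American-odds conversion. Function names are illustrative, not the production API.

```python
def american_to_implied(price):
    """Raw implied probability (vig included) from an American price."""
    if price < 0:
        return -price / (-price + 100.0)
    return 100.0 / (price + 100.0)


def edge_pp(our_prob, price):
    """Edge in percentage points over the vigged breakeven rate."""
    return (our_prob - american_to_implied(price)) * 100.0
```

At -110 the breakeven rate is 110/210 ≈ 52.4%, so a 57.4% model probability yields an edge of about 5.0pp.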

2. Calibration

Raw bbsim probabilities are slightly over-confident at the tails — the Monte Carlo will say 70% when reality is closer to 64%. Two layers tame it before sizing:

  • Platt scaling. For game totals and team totals we fit a line-aware logistic on the held-out backtest cache and store the coefficients in lib/models/platt_calibration.json. Predicted probability is passed through that sigmoid before edge math.
  • Linear shrinkage toward 0.5. Anything Platt doesn't cover (moneylines today) gets pulled toward coin-flip with k=0.6: p_shrunk = p_raw × 0.6 + 0.5 × 0.4. The raw probability is preserved on the row as our_prob_raw for audit, but the calibrated value drives edge + Kelly.
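The shrinkage step is simple enough to show in full. The Platt step is omitted because its coefficients and exact functional form live in lib/models/platt_calibration.json and are not reproduced on this page; `shrink_toward_half` is an illustrative name.

```python
def shrink_toward_half(p_raw, k=0.6):
    """Pull a raw probability toward 0.5, keeping a fraction k of its distance."""
    return p_raw * k + 0.5 * (1.0 - k)
```

A raw 70% becomes `shrink_toward_half(0.70)` = 0.62, and 0.5 is a fixed point of the map.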

3. Quarter-Kelly sizing

Full-Kelly is the math-optimal stake but variance is brutal — a 50% drawdown is normal even when the model is right. We stake at one-quarter Kelly, scaled to a 100-unit bankroll where 1 unit = 1% of roll:

b = decimal_payout − 1
kelly_full = (b · p − (1−p)) / b
units = kelly_full × 0.25 × 100, capped at 3.0 units

A bet only surfaces when units > 0 AND edge ≥ 2pp. The cap at 3 units exists because a single mis-calibrated 8% edge at -110 would otherwise call for 4–5 units, more concentration than the model has earned.
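The sizing and gating rules can be sketched as follows. The function names are illustrative; the constants are the ones stated above.

```python
def quarter_kelly_units(p, decimal_payout, fraction=0.25, cap=3.0):
    """Fractional-Kelly stake in units on a 100-unit bankroll, floored at 0."""
    b = decimal_payout - 1.0
    kelly_full = (b * p - (1.0 - p)) / b
    units = kelly_full * fraction * 100.0
    return max(0.0, min(units, cap))


def surfaces(units, edge_pp, min_edge_pp=2.0):
    """A bet is shown only when it has a positive stake and clears the edge gate."""
    return units > 0 and edge_pp >= min_edge_pp
```

With p = 0.574 at decimal 1.91 this returns about 2.65 units; at p = 0.65 the uncapped stake would be over 6 units, so the 3-unit cap binds.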

Worked example: a typical 5pp-edge bet at -110

Suppose bbsim says BOS ML wins 57.4% of the time and the book has BOS at -110 (decimal 1.91, implied 52.4% with vig). On a $10,000 bankroll where 1 unit = $100:

edge_pp = (57.4 − 52.4) = 5.0pp
EV per $1 = 0.574 · 0.91 − 0.426 = +$0.096
kelly_full = (0.91 · 0.574 − 0.426) / 0.91 = 10.6%
units = 10.6% · 0.25 · 100 = 2.65u ($265)

Expected return on this single bet: +$25.50 (2.65u · 9.6% EV). But variance is real: the standard deviation on a single -110 bet is ~0.95× stake, or about $250 on a $265 wager, and the bet can only settle at −$265 or about +$241. One bet tells you almost nothing; the +22% backtest ROI only emerges over thousands of bets where the edges average out.
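The per-bet distribution is a two-point one (win or lose), so its mean and standard deviation follow directly. A sketch, with `single_bet_stats` as an illustrative name:

```python
import math


def single_bet_stats(p, decimal_payout, stake):
    """Outcomes, expected value and standard deviation of one win/lose bet."""
    b = decimal_payout - 1.0
    win, loss = stake * b, -stake
    ev = p * win + (1.0 - p) * loss
    # Standard deviation of a two-point distribution: sqrt(p(1-p)) * spread.
    sd = math.sqrt(p * (1.0 - p)) * (win - loss)
    return {"win": win, "loss": loss, "ev": ev, "sd": sd}


stats = single_bet_stats(0.574, 1.91, 265.0)
```

For the example bet this gives a win of about $241, a loss of $265, an expected value near $25.50, and a standard deviation of roughly $250, about ten times the expected profit.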

This is the math behind every row on the Best Bets table: the Units, EV%, and Edge columns are direct outputs of these formulas applied to that game's current line + price.

4. Lock-on-add

The first time the generator sees a bet that clears the gates — confirmed lineups, edge ≥ 2pp, units > 0 — the row is written and frozen. Re-runs throughout the day update nothing on that row except the grader, which sets status and pnl_units after the game finishes.

Why: if we kept re-pricing pending rows, the displayed edge would drift toward the closing line and the realized ROI column would measure "how well does our number track the market" instead of "how well do we beat the price we actually got." The Added column on the Best Bets table shows the moment of capture.
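The write-once behaviour can be illustrated with an in-memory dict standing in for the production store. The key format, field names, and both functions are hypothetical; only the semantics (first write wins, grader fills status and pnl_units later) come from the description above.

```python
def lock_on_add(store, key, row):
    """Write the row only if this bet has never been seen; return the stored row."""
    return store.setdefault(key, dict(row))


def grade(store, key, status, pnl_units):
    """The grader is the only later writer: it sets status and pnl_units."""
    store[key]["status"] = status
    store[key]["pnl_units"] = pnl_units


store = {}
lock_on_add(store, "game-1:home_ml", {"edge_pp": 5.0, "units": 2.65, "price": -110})
# A later re-run sees a worse price, but the locked row is left untouched.
lock_on_add(store, "game-1:home_ml", {"edge_pp": 1.2, "units": 0.4, "price": -125})
grade(store, "game-1:home_ml", "win", 2.41)
```

After these calls the stored row still carries the original -110 entry price and 5.0pp edge, with only the grading fields added.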

5. Backtesting (2025 hold-out)

Every Platt fit + sizing rule shipped to production was first validated against the 2025 season as a strict hold-out — 2025 games are not in the training set for any calibration coefficient. The harness:

  1. Replays each 2025 game through bbsim.betting.score_bet() with the historical closing line + price, using Platt coefficients fit only on 2022–2024.
  2. Applies the production Kelly + edge-gate pipeline.
  3. Grades each bet against the actual final score and tallies unit-weighted ROI by bet type, edge bucket, and Platt variant.
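Step 3 can be sketched as a unit-weighted tally. This is not the harness in scratch/; the record fields and function names are assumptions, and pushes are assumed to count toward staked units with zero profit.

```python
from collections import defaultdict


def grade_pnl(units, decimal_payout, result):
    """Profit or loss in units for one settled bet ('win', 'loss' or 'push')."""
    if result == "win":
        return units * (decimal_payout - 1.0)
    if result == "loss":
        return -units
    return 0.0  # push: stake returned


def unit_weighted_roi(bets):
    """ROI per bucket = total pnl units / total staked units."""
    staked = defaultdict(float)
    pnl = defaultdict(float)
    for bet in bets:
        staked[bet["bucket"]] += bet["units"]
        pnl[bet["bucket"]] += grade_pnl(bet["units"], bet["decimal_payout"], bet["result"])
    return {bucket: pnl[bucket] / staked[bucket] for bucket in staked if staked[bucket] > 0}


example = [
    {"bucket": "game_total", "units": 2.0, "decimal_payout": 1.91, "result": "win"},
    {"bucket": "game_total", "units": 1.0, "decimal_payout": 1.91, "result": "loss"},
    {"bucket": "moneyline", "units": 1.5, "decimal_payout": 2.10, "result": "push"},
]
roi = unit_weighted_roi(example)
```

Weighting by units means a 3-unit bet moves the ROI three times as much as a 1-unit bet, which matches how the bankroll actually experiences the results.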

Production Platt fits and their 2025 hold-out numbers:

Market | Train cohort | 2025 hold-out n | 2025 ROI | 2025 win rate | Log-loss
Game total (loosened) | 2022–2024 | 7,779 bets | +22.21% | 65.07% | 0.6613
Game total (log-loss optimal) | 2022–2024 | 2,422 bets | +20.84% | 64.47% | 0.6464
Team total | 2022–2024 | 1,575 games | | | 0.6807
Pitcher Ks | 2022–2024 (5,635 SP games × 7 lines) | 39,445 obs | | | 0.3997

Why ship the "loosened" fit over the log-loss-optimal one?
Log-loss is a global metric: it scores every probability prediction equally, including the ~80% of games where we'd never bet. ROI only counts the predictions that actually fire under our edge gate + Kelly filter, which by construction are the high-confidence tail. A calibrator tuned to minimize global log-loss will compress the tails harder than a calibrator tuned for the bet-decision boundary. The two objectives disagree, and the operationally-correct one is ROI.

The loosened fit is a 50/50 blend of the Platt output with the identity mapping (the raw simulator probability). That weight isn't arbitrary: it's a uniform prior over "trust the historical correction" (full Platt) vs "trust the simulator" (no Platt). Halfway between is the maximum-entropy choice when both endpoints have credible support: Platt because the bbsim cache shows real systematic compression, identity because the unfiltered ROI on raw probabilities is genuinely +EV (63–65% win rate vs 52.4% breakeven on 2025 game-totals). Fully discounting either end is overconfident given the data we have.

Practical consequence: 3.2× the bet volume (7,779 vs 2,422), +22.21% ROI vs +20.84%, at a cost of +0.015 log-loss. We'd revisit the blend if 2026 forward results showed ROI degrading below the strict-Platt baseline.

The harness lives at scratch/backtest_platt_roi.py (game totals) and scratch/refit_platts_on_2025.py (Platt refit). Per-config result JSONs are in scratch/ and the production coefficients with full audit trails sit in lib/models/platt_calibration.json — including the previous 2024-cohort fits, kept for diff-the-fit forensics.

6. Live production track record

The most credible number on this page is what the deployed system has actually returned since launch. These are bets the production generator locked in (entry price, units, edge), the grader settled with the final score, and stored in the best_bets table — no backtesting, no replay, no re-pricing.

Settled bets: 2069 (992W · 1050L · 27P)
Win rate: 48.6% (vs ~52.4% breakeven at -110)
Net units: +218.65u (4337.93u staked)
ROI: +5.04% (2025-05-01 → 2026-05-08)

Season-by-season

Season | Settled | W–L–P | Net units | ROI | Range
2026 | 105 | 47–56–2 | -7.05u | -7.55% | 2026-05-01 → 2026-05-08
2025 | 1,964 | 945–994–25 | +225.70u | +5.32% | 2025-05-01 → 2025-10-01
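The headline figures follow from the season rows. The sketch below recomputes them from the numbers on this page, assuming the win rate excludes pushes (that is the convention that reproduces 48.6%).

```python
# Season rows and total stake as reported on this page.
seasons = {
    2026: {"w": 47, "l": 56, "p": 2, "net_units": -7.05},
    2025: {"w": 945, "l": 994, "p": 25, "net_units": 225.70},
}
STAKED_UNITS = 4337.93

wins = sum(s["w"] for s in seasons.values())
losses = sum(s["l"] for s in seasons.values())
pushes = sum(s["p"] for s in seasons.values())
settled = wins + losses + pushes

net_units = sum(s["net_units"] for s in seasons.values())
win_rate = wins / (wins + losses)  # pushes excluded (assumption)
roi = net_units / STAKED_UNITS
```

A sub-breakeven win rate alongside positive ROI is possible because stakes and prices vary from bet to bet; ROI is unit-weighted, while win rate counts every bet equally.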

Best Bets generation pauses every March 1 – April 30 (the early-season cohort runs −10 to −15% ROI on every backtest year — cold weather, unstable rotations, less training signal on fresh starters). The generator un-pauses May 1 each year, so an empty current-season row in March or April is by design, not a feed failure.

What backtest results don't prove
2025 closing-line replays can't tell us how often we'd have actually been able to bet a given line at the displayed price (line shopping, max-bet limits, quick line moves on sharp action). And once 2025 informs a ship/no-ship decision, it's no longer pristine hold-out data for that specific change. Refits are scheduled quarterly or whenever per-line calibration drift exceeds 2pp; the next true hold-out cohort is the live 2026 season as it accumulates.

Pitcher strikeout prop selectors (shipped 2026-05-06)

Pitcher strikeout props are gated by a two-leg cohort filter that runs on top of the standard ingest. Both legs are restricted to the same low-K/9 starter pool and fire on mechanically disjoint pitcher-games: the model can't put +edge on both sides of the same line, so adding the second leg raises Q1-cohort coverage from 30.9% to 52.6% without overlap.

Shared Q1 cohort gate

  • Pitcher current-season K/9 < 7.7. Per-year quartile cuts ranged from 7.62 to 7.92 across 2023–2025; 7.7 is a tight Q1 cut. Source: scratch/pitcher_k_year_quartile_v4.py.
  • Prior-year IP ≥ 40 — drops first-start debutants whose noisy current-year K/9 would mis-classify them.
  • Edge in [2pp, 15pp) — drops the high-edge tail (15pp+ is structurally negative-ROI on both legs per v4 backtest: Q1 Under at −8.67% on n=56; fade at −5.68% on n=35).
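The three gate conditions combine into a single predicate. A minimal sketch; the function and parameter names are illustrative.

```python
def in_q1_cohort(k9_current, prior_year_ip, edge_pp):
    """Shared gate: low-K/9 starter, enough prior-year innings, edge in [2pp, 15pp)."""
    return (
        k9_current < 7.7
        and prior_year_ip >= 40.0
        and 2.0 <= edge_pp < 15.0
    )
```

The edge window is half-open: a 2.0pp edge passes, a 15.0pp edge does not.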

Leg 1 — Q1 Under (model agrees)

When the model says +Under with edge ≥ 2pp on a Q1 cohort pitcher, we bet Under. Stake is half-Kelly on the model's shrunk probability (k=0.75 production routing), capped at 6 units. Bypasses the legacy apply_edge_tail_policy zero-stake band that was set for the older R5 selector cohort.

Leg 2 — Q1 Over fade (model disagrees, we trust the empirical pattern)

When the model says +Over with edge ≥ 2pp on a Q1 cohort pitcher, we fade: take the matching Under at the under price. Mechanism: Q1 Overs are −17 to −22% ROI in every backtest year on the pre-fade pool, so when the model picks an Over in this cohort, the empirical evidence says it's wrong. Stake is half-Kelly on a synthetic probability market_imp + over_edge/100 (the model's own under-side prob would clamp Kelly to 0, since we're betting against the model on this cell), capped at 6 units.
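Both legs' staking can be sketched together. Assumptions: market_imp is the raw implied probability of the Under price, the pitcher has already passed the shared cohort gate, and the function names and return shape are hypothetical.

```python
def half_kelly_units(p, decimal_payout, cap=6.0):
    """Half-Kelly stake in units on a 100-unit bankroll, floored at 0 and capped."""
    b = decimal_payout - 1.0
    kelly_full = (b * p - (1.0 - p)) / b
    return max(0.0, min(kelly_full * 0.5 * 100.0, cap))


def route_q1_bet(model_side, edge_pp, p_under_shrunk, under_decimal):
    """Return (leg, side, units) for a pitcher already inside the Q1 cohort gate.

    model_side     -- 'under' or 'over', the side with positive model edge
    edge_pp        -- that side's edge in percentage points
    p_under_shrunk -- model's shrunk Under probability (used by Leg 1 only)
    """
    if model_side == "under":
        return ("leg1", "under", half_kelly_units(p_under_shrunk, under_decimal))
    # Leg 2: the model likes the Over, so take the Under and size it on a
    # synthetic probability instead of the model's own under-side number.
    market_imp = 1.0 / under_decimal  # assumed: implied prob of the Under price
    p_synth = market_imp + edge_pp / 100.0
    return ("leg2", "under", half_kelly_units(p_synth, under_decimal))
```

Because `model_side` takes exactly one value per (game, pitcher, line), a given line can route to Leg 1 or Leg 2 but never both.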

Why the two legs don't overlap
Per (game, pitcher, line), exactly one side has positive model edge — the other is mechanically negative. Leg 1 fires on Q1 cohort + model picks Under. Leg 2 fires on Q1 cohort + model picks Over. Empirically verified zero overlap on n=2,142 Q1-cohort pitcher-games in the v4 backtest (Leg 1 fires on 662; Leg 2 fires on 465; combined 1,127 = 52.6% coverage of Q1 vs 30.9% Leg-1-only).

Backtest by month (combined 2023-05 → 2026-04)

v4 backtest = production engine (per-pitcher estimateIP, opponent K% × pitcher handedness, park K factor, L5 short-term blend on K/9) at the shipped half-Kelly + 6u-cap sizing for both legs. Closing-line basis from the historical Odds API per-event snapshots in data/backtest/raw_props_historical/.

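Month | L1 n | L1 ROI | L2 n | L2 ROI | Total n | Total stake | Total pnl | Total ROI
2023-05 | 42 | +26.86% | 20 | +27.37% | 62 | 323.36u | +87.36u | +27.02%
2023-06 | 33 | −4.81% | 14 | +38.84% | 47 | 226.65u | +21.79u | +9.62%
2023-07 | 29 | +53.46% | 11 | −6.57% | 40 | 218.76u | +79.15u | +36.18%
2023-08 | 40 | +2.01% | 14 | +12.90% | 54 | 276.11u | +12.77u | +4.63%
2023-09 | 38 | +11.77% | 15 | −24.84% | 53 | 268.69u | +3.04u | +1.13%
2023-10 | 2 | +0.00% | 0 | | 2 | 12.00u | +0.00u | +0.00%
2024-03 | 1 | +71.43% | 7 | +20.82% | 8 | 44.92u | +12.39u | +27.58%
2024-04 | 29 | +6.64% | 21 | −6.38% | 50 | 261.62u | +3.47u | +1.33%
2024-05 | 33 | +24.10% | 20 | +51.70% | 53 | 266.80u | +91.01u | +34.11%
2024-06 | 34 | −3.21% | 22 | +27.14% | 56 | 287.25u | +24.85u | +8.65%
2024-07 | 25 | −15.71% | 19 | −35.82% | 44 | 207.52u | −50.43u | −24.30%
2024-08 | 29 | +17.77% | 27 | +6.24% | 56 | 282.58u | +34.53u | +12.22%
2024-09 | 37 | −5.92% | 17 | +36.89% | 54 | 287.19u | +22.53u | +7.85%
2025-03 | 7 | +57.53% | 6 | −44.15% | 13 | 64.99u | +3.98u | +6.12%
2025-04 | 50 | +13.91% | 33 | +12.57% | 83 | 409.24u | +54.98u | +13.43%
2025-05 | 56 | −8.97% | 36 | +12.90% | 92 | 466.05u | −3.53u | −0.76%
2025-06 | 49 | +21.54% | 47 | −16.73% | 96 | 487.59u | +15.19u | +3.11%
2025-07 | 38 | +3.42% | 37 | +15.51% | 75 | 387.31u | +37.31u | +9.63%
2025-08 | 54 | +15.77% | 42 | +15.93% | 96 | 471.18u | +74.64u | +15.84%
2025-09 | 20 | +15.12% | 38 | −11.59% | 58 | 298.78u | −5.68u | −1.90%
2026-03 | 3 | +98.81% | 4 | −76.77% | 7 | 31.66u | −2.13u | −6.74%
2026-04 | 13 | +0.97% | 15 | +44.83% | 28 | 129.13u | +31.83u | +24.65%
TOTAL | 662 | +10.93% | 465 | +7.71% | 1,127 | 5,709.37u | +549.07u | +9.62%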

22 months, 1,127 bets at ~5u average stake. 17 of 22 months positive ROI on combined. Worst single month: 2024-07 (−24.30% — only month where both legs lost meaningfully; the legs are normally somewhat anti-correlated month-to-month, mid-summer 2024 was the exception). Best single month: 2023-07 (+36.18%).

Known limitations + caveats

  • Pre-game odds only — no live / in-play markets. Best Bets are generated from the closing-ish line a few hours before first pitch and locked at that price (see §4 Lock-on-add). If the line moves significantly post-lock, the displayed edge stays anchored to the entry price, not the current market.
  • Bet sizing assumes independence between bets. Quarter-Kelly is calculated per-bet without a correlation correction. Two bets on the same game (e.g. Home ML + Total Over, or Total Over + Home TT Over) are NOT independent — the same outcomes drive both. The total exposure on a single game can therefore exceed what a correlation-adjusted Kelly would size to. Treat sizing as an upper bound; consider scaling down when multiple recommendations land on one game.
  • Calibration drifts between quarterly refits. The Platt coefficients are fit on a closed cohort (currently 2022–2024) and held until the next scheduled refit or until per-line calibration drift exceeds 2pp. If the league environment shifts mid-season (offense surge, juiced ball, rule change), the coefficients may underfit until the next refit catches up.
  • Openers require manual overrides. bbsim ignores pitching changes: it uses the listed probable starter for the entire game. When a team uses an opener (1-2 IP) in front of a bulk pitcher (4-5 IP), the MLB-listed probable is the opener, so without intervention the projection uses the opener's platoon profile + K%/BB%/HR rate for all 6 IP. We maintain a manual override file (pipeline/opener_overrides.json) that swaps the listed opener for the bulk pitcher before the projection runs; affected games show an "Opener: A → B" tag on Slate. Games where the opener isn't in the override file get the wrong projection silently; proper opener+bulk modeling is a planned upgrade.
  • Projections rely on the daily forward-projection refresh. The bbsim cache (2020–2024) is precomputed and shipped with the build. Current-season games are projected by a daily Python job that writes fresh JSON into the same cache directory; Vercel serves those files at request time. If the daily refresh fails or hasn't run yet, today's games show a "Projection refreshing" placeholder rather than serving stale numbers — and the Best Bets generator skips that day entirely rather than betting against stale projections.

Changelog

v1.5.3 (2026-04-29): Matchup Score back on the Matchups tab; SSR + response cache for sub-100ms tab clicks
  • Matchup Score column restored on the Matchups tab (HitterTableV3 MS pill + breakdown tooltip showing batter / pitcher / park / weather contributions in wOBA points). Backed by the existing bbsim/matchup compute stack and the matchup_scores Supabase table
  • Migration 021 recreates the matchup_scores table dropped in 019
  • Migration 020 drops the unused LII v2 columns from lii_scores added in 014 (xwoba_vs_hand, weighted_z, season_lii, etc.) — slim the schema
  • In-process response cache on /api/leaderboard/hitters, /api/leaderboard/pitchers, /api/matchup/[gamePk] with 30s TTL. Repeat tab clicks on a warm instance: 12-20ms vs 6-7s cold
v1.5.1 (2026-04-28): LII reverted to v1 (3-component model + click-to-explain popover)
  • LII tab is back to the v1 layout — 3 components (contact quality, plate vision, expected production), EV / HH% / Brl% / Chase% / Whiff% / xwOBA / xBA columns, click any LII number for the calculation breakdown popover
  • Reads from the daily-refreshed daily_statcast Supabase table — fresh through yesterday's games, not the stale events parquet that was capping LII PA counts
  • v2 model (5-input weighted z, hand-filtered, Marcel-regressed) parked: the heavy shrinkage was hiding hot streaks behind season baselines and the tab capped at LII ~58 in early April when most batters had < 30 PA in 14 days vs a single hand
  • Matchup Score backend (Supabase matchup_scores table + API) preserved — no UI surfaces it currently but the data layer is intact for future use
v1.5.0 (2026-04-28): LII v2 + Matchup Score, with formal definitions, hand-filtered, percentile-anchored
  • Locked-In Index reformulated: 5-input weighted z (xwOBA · K% · BB% · ISO · HardHit%), hand-filtered to tonight's SP, Marcel-regressed toward season baseline
  • New "Δ vs season" column on the LII tab — LII percentile minus the player's season-baseline percentile (hotter or colder than usual)
  • Matchup Score: 0-100 percentile of expected per-PA wOBA, ranked against the 2023-2025 distribution of all batter × pitcher × park × weather combos
  • Park factors recomputed wOBA-direct + hand-split from 2023-2025 events (no more HR-multiplier-to-wOBA translation fudge)

Earlier versions + full detail: click the v1.5.3 chip next to the PlateIQ title.