This page documents the data pipeline, projection engine, stabilization rules, and calibration metrics behind every number on the live tabs. If a cell on Slate or Matchups seems off, you should be able to trace it back from here.
Every team-run and prop projection on PlateIQ comes out of bbsim, a Monte Carlo play-by-play simulator that runs 5,000 simulated games for each scheduled matchup. For each plate appearance, it draws event outcomes from a joint model built on pitcher rates, hitter rates, park factors, weather, and umpire tendencies, then aggregates across the 5,000 sims to produce run means, win probabilities, p10/p50/p90 bands, and per-prop projections. The bbsim cache covers 2020–2024 and feeds the accuracy table below. The bet-sizing layer on top (Platt calibration + Kelly) is trained on 2022–2024 and validated against the full 2025 season as a hold-out; see the Best Bets section for the actual ROI numbers.
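As a rough illustration of the aggregation step, the sketch below collapses per-sim final scores into run means, a win probability, and p10/p50/p90 bands on the game total. The function names, the nearest-rank percentile rule, and the four-game toy input are assumptions for illustration, not bbsim's actual code.

```python
import math

def nearest_rank(sorted_vals, pct):
    """Nearest-rank percentile of an already-sorted list (pct in 0-100)."""
    k = max(1, math.ceil(pct / 100 * len(sorted_vals)))
    return sorted_vals[k - 1]

def summarize_sims(home_runs, away_runs):
    """Collapse per-sim final scores into headline numbers for one game."""
    if not home_runs or len(home_runs) != len(away_runs):
        raise ValueError("need equal-length, non-empty score lists")
    n = len(home_runs)
    totals = sorted(h + a for h, a in zip(home_runs, away_runs))
    return {
        "home_run_mean": sum(home_runs) / n,
        "away_run_mean": sum(away_runs) / n,
        # Assumes no simulated game ends tied (extra innings resolve it).
        "home_win_prob": sum(h > a for h, a in zip(home_runs, away_runs)) / n,
        "total_p10": nearest_rank(totals, 10),
        "total_p50": nearest_rank(totals, 50),
        "total_p90": nearest_rank(totals, 90),
    }

# Toy input: four simulated games instead of 5,000.
summary = summarize_sims([3, 5, 2, 7], [1, 4, 6, 2])
```

With 5,000 sims per matchup the same function would simply receive two 5,000-element lists.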
| Feed | Source | Refresh |
|---|---|---|
| Schedule, lineups, probable pitchers | MLB Stats API | every 5 min on game day |
| Hitter + pitcher season stats, Statcast xStats | Baseball Savant | nightly |
| Park factors (per-side HR, runs, K) | FanGraphs Guts, 3-yr rolling | nightly ingest |
| Game weather (temp, wind, humidity) | Open-Meteo forecast, 3-hr avg from first pitch | every 5 min on game day |
| Moneyline / game total / run line | TheOddsAPI (BookMaker.eu primary) — h2h+totals+spreads bundle | hourly |
| Team totals (book-direct prices) | TheOddsAPI per-event team_totals market | hourly |
| Umpire zone tendencies | UmpScorecards season aggregates (dampened ×0.4) | nightly |
| Player prop lines (HR, K, TB, etc.) | TheOddsAPI — paused since 2026-04-26 lookahead audit | manual |
Each plate-appearance draw is a multinomial over {K, BB, HBP, 1B, 2B, 3B, HR, other BIP} with probabilities shaped by three layers:
The resulting per-PA probabilities are sampled in each of the 5,000 simulated games, across both lineups, with base-state progression from a Retrosheet 1974–2024 RE24 matrix. Aggregate stats come from averaging the 5,000 sim outcomes.
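A minimal sketch of a single plate-appearance draw over the eight outcome categories. The probabilities are made-up placeholders, not model output, and `draw_pa_outcome` / `simulate_outcome_shares` are hypothetical names; the real engine also adjusts the probabilities per matchup and tracks base state, which this omits.

```python
import random

# Hypothetical per-PA probabilities for one matchup (illustrative only; sums to 1).
PA_PROBS = {
    "K": 0.22, "BB": 0.08, "HBP": 0.01, "1B": 0.15,
    "2B": 0.045, "3B": 0.005, "HR": 0.03, "other_BIP": 0.46,
}

def draw_pa_outcome(probs, rng):
    """Draw one outcome from a categorical (single-trial multinomial)."""
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def simulate_outcome_shares(probs, n_draws=5000, seed=0):
    """Repeat the draw n_draws times and return the observed share of each outcome."""
    rng = random.Random(seed)
    counts = {o: 0 for o in probs}
    for _ in range(n_draws):
        counts[draw_pa_outcome(probs, rng)] += 1
    return {o: c / n_draws for o, c in counts.items()}

shares = simulate_outcome_shares(PA_PROBS)
```

Over 5,000 draws the observed shares land close to the input probabilities, which is why averaging across sims gives stable aggregates.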
Matchup Score is a 0–100 percentile of expected per-PA wOBA for tonight's specific batter / pitcher / park / weather combination, ranked against the empirical distribution of every batter-pitcher-park-weather combo over 2023–2025 (n ≈ 597K PAs). It answers "how does tonight's spot rank versus everything we've seen the last three years?"
Additive in wOBA units — every column on the row corresponds to one term:
Skill estimates use Statcast xwOBA for batted balls (physics-based, park-and-weather-neutral) — that's the noise-resistant signal we want for batter / pitcher quality. Park and weather deltas use actual wOBA — those layers exist to capture the things xwOBA filters out (warning-track FBs becoming HRs, ball carry on hot days). Mixing the two on a single additive scale is intentional.
The 2023–2025 E_env distribution: mean .320, σ = .032, 1%ile .251, 50%ile .318, 99%ile .409. A few worked anchors against that distribution:
| Matchup | E_env | Score |
|---|---|---|
| Aaron Judge vs LHP @ Yankee Stadium, 78°F 8 mph wind out | .455 | 99 |
| Top of Reds lineup vs Kyle Freeland @ Coors, summer night | .39 – .40 | 96 – 98 |
| Median PA across all 2023–2025 matchups | .318 | 50 |
| League-average bat vs Skubal @ T-Mobile, 58°F 12 mph wind in | .20 | 0 – 1 |
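The percentile lookup behind a score like this can be sketched as an empirical rank against a sorted reference sample. The seven-value `reference` list below is a toy stand-in for the ~597K-PA distribution, and the mid-rank tie handling and rounding are assumptions; the production score may clamp or round differently.

```python
from bisect import bisect_left, bisect_right

def matchup_score(e_env, reference_sorted):
    """Percentile (0-100) of e_env within a sorted reference sample, mid-rank for ties."""
    n = len(reference_sorted)
    if n == 0:
        raise ValueError("reference sample is empty")
    below = bisect_left(reference_sorted, e_env)
    ties = bisect_right(reference_sorted, e_env) - below
    return round(100 * (below + 0.5 * ties) / n)

# Toy reference sample (hypothetical values, not the real 2023-2025 distribution).
reference = sorted([0.251, 0.290, 0.305, 0.318, 0.330, 0.350, 0.409])

median_score = matchup_score(0.318, reference)  # middle of the toy sample
high_score = matchup_score(0.455, reference)    # above every reference value
low_score = matchup_score(0.200, reference)     # below every reference value
```

Ranking against the empirical sample rather than a fitted normal matters here, since the quoted 1st and 99th percentiles are not symmetric around the mean.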
The skill and park components freeze when the game's lineup flips to confirmed in the daily-lineups feed. After lock-in only the weather delta keeps refreshing (Open-Meteo forecasts update through the day), so users see a stable Matchup Score that drifts at most a few points as the wind / temperature forecast tightens.
Rate stats need a minimum sample before their observed value is signal. PlateIQ follows Russell Carleton's r=0.5 thresholds. Below these, cells render without heatmap color and carry a ★ marker. The displayed value is always Marcel-regressed toward league mean:
So both display and projection engines see the regressed number; they never diverge.
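One common form of this kind of regression is a sample-size-weighted blend, sketched below. It assumes the r=0.5 threshold doubles as the regression constant, so a sample exactly at the threshold is pulled halfway to the league mean; the function name and the example league rate are hypothetical, and the production formula may differ.

```python
def regress_to_mean(observed, n, league_mean, threshold):
    """Shrink an observed rate toward the league mean.

    Weight on the observed value is n / (n + threshold), so a sample exactly
    at the threshold is regressed halfway.
    """
    if n < 0 or threshold <= 0:
        raise ValueError("n must be >= 0 and threshold must be > 0")
    w = n / (n + threshold)
    return w * observed + (1 - w) * league_mean

# Hitter K% with a threshold of 60 and an assumed league K% of 22%.
at_threshold = regress_to_mean(0.30, 60, 0.22, 60)   # halfway between 0.30 and 0.22
no_sample = regress_to_mean(0.30, 0, 0.22, 60)       # falls back to the league mean
big_sample = regress_to_mean(0.30, 600, 0.22, 60)    # mostly the observed rate
```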
| Hitter stat | Minimum sample |
|---|---|
| k_pct | 60 |
| bb_pct | 120 |
| iso | 160 |
| xwoba | 100 |
| woba | 100 |
| xba | 100 |
| avg | 100 |
| slg | 120 |
| ops | 120 |
| barrel_pct | 150 |
| hard_pct | 150 |
| chase_pct | 100 |
| whiff_pct | 100 |

| Pitcher stat | Minimum sample |
|---|---|
| k_pct | 70 |
| k_per_9 | 70 |
| bb_pct | 170 |
| bb_per_9 | 170 |
| xwoba_against | 120 |
| fip | 200 |
| xfip | 200 |
| siera | 200 |
| whip | 200 |
| era | 250 |
| csw_pct | 150 |
| swstr_pct | 150 |
| o_swing_pct | 150 |
| gb_pct | 200 |
| barrel_pct_against | 150 |
| hard_pct_against | 150 |
| avg_fb_velo | 20 |
Below are the bbsim-engine accuracy metrics on every precomputed projection in the cache: 11,662 games across 5 seasons. Lower is better on every column. This measures the simulator's raw run-level + win-prob accuracy. The bet-sizing layer on top of bbsim (Platt scaling, Kelly sizing, edge gates) is validated separately against the 2025 hold-out season; see the Best Bets section below for those numbers.
The Best Bets tab takes every game on the slate where both lineups are confirmed, asks bbsim for its win probability + run distribution, then compares each market the books are offering against our number. A bet surfaces when the gap is large enough to clear vig and a Kelly check.
For each market (moneyline, total over/under, team total over/under) we convert the offered American price to its raw implied probability — this is the book's breakeven rate including vig — and subtract it from our model's probability:
We compare to the vigged book number, not the de-vigged fair price. That keeps the threshold honest: a 2pp edge means 2pp above the actual price you'd be paying, after the house takes its cut.
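The conversion and subtraction can be sketched as follows. The American-odds formula is standard; the function names are hypothetical, and the model probability would come from bbsim after calibration.

```python
def american_to_implied(price):
    """Raw (vig-included) implied probability of an American price."""
    if price == 0:
        raise ValueError("American odds cannot be 0")
    if price < 0:
        return -price / (-price + 100)
    return 100 / (price + 100)

def edge_pp(model_prob, price):
    """Edge in percentage points versus the vigged book price."""
    return 100 * (model_prob - american_to_implied(price))

implied_fav = american_to_implied(-110)   # about 0.5238
implied_dog = american_to_implied(150)    # 0.40
edge = edge_pp(0.574, -110)               # about 5.0pp
```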
Raw bbsim probabilities are slightly over-confident at the tails — the Monte Carlo will say 70% when reality is closer to 64%. Two layers tame it before sizing:
Full-Kelly is the math-optimal stake but variance is brutal — a 50% drawdown is normal even when the model is right. We stake at one-quarter Kelly, scaled to a 100-unit bankroll where 1 unit = 1% of roll:
A bet only surfaces when units > 0 AND edge ≥ 2pp. The cap at 3 units exists because a single mis-calibrated 8% edge at -110 would otherwise call for 4–5 units, more concentration than the model has earned.
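A minimal sketch of the sizing rule described above, using the textbook Kelly fraction f = (b·p − q) / b with b the net decimal payout. The quarter multiplier, 100-unit bankroll, 3-unit cap, and 2pp gate come from the text; the function names and signatures are assumptions.

```python
def kelly_fraction(p, decimal_odds):
    """Full-Kelly fraction of bankroll for win probability p at decimal odds."""
    b = decimal_odds - 1.0
    if b <= 0:
        raise ValueError("decimal odds must be greater than 1")
    return (b * p - (1.0 - p)) / b

def stake_units(p, decimal_odds, kelly_mult=0.25, cap_units=3.0, bankroll_units=100.0):
    """Quarter-Kelly stake in units, floored at 0 and capped at cap_units."""
    units = kelly_mult * kelly_fraction(p, decimal_odds) * bankroll_units
    return min(max(units, 0.0), cap_units)

def surfaces(p, decimal_odds, implied_prob, min_edge_pp=2.0):
    """A bet surfaces only when units > 0 AND edge >= min_edge_pp."""
    edge = 100 * (p - implied_prob)
    return stake_units(p, decimal_odds) > 0 and edge >= min_edge_pp

normal_bet = stake_units(0.574, 1.91)   # about 2.65 units
capped_bet = stake_units(0.604, 1.91)   # raw quarter-Kelly ~4.2u, capped at 3
no_bet = stake_units(0.50, 1.91)        # negative Kelly, floored at 0
```

The second call shows the cap at work: an 8pp edge at -110 would otherwise size above 4 units.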
Suppose bbsim says BOS ML wins 57.4% of the time and the book has BOS at -110 (decimal 1.91, implied 52.4% with vig). On a $10,000 bankroll where 1 unit = $100:
Expected return on this single bet: +$25.50 (2.65u · 9.6% EV). But variance is real: the standard deviation on a single -110 bet is ~0.95× stake, about $250 on a $265 wager, or roughly ten times the expected return. The bet itself can only finish at −$265 or about +$241. One bet tells you almost nothing; the +22% backtest ROI only emerges over thousands of bets where the edges average out.
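The EV and spread for the BOS example can be checked with a two-outcome calculation: the bet pays +b per unit with probability p and −1 otherwise, so the variance is p·q·(b+1)². This is a sketch with a hypothetical function name, using the decimal price 1.91 quoted above.

```python
import math

def bet_ev_and_sd(p, decimal_odds):
    """Per-unit-staked expected value and standard deviation of one bet."""
    b = decimal_odds - 1.0          # net payout per unit on a win
    q = 1.0 - p                     # probability of losing the stake
    ev = p * b - q
    sd = math.sqrt(p * q) * (b + 1.0)
    return ev, sd

ev, sd = bet_ev_and_sd(0.574, 1.91)
stake_dollars = 265.0
expected_dollars = ev * stake_dollars   # about +$25.50
sd_dollars = sd * stake_dollars         # about $250
```

The standard deviation is roughly ten times the expected return, which is why single-bet results carry so little information.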
This is the math behind every row on the Best Bets table: the Units, EV%, and Edge columns are direct outputs of these formulas applied to that game's current line + price.
The first time the generator sees a bet that clears the gates — confirmed lineups, edge ≥ 2pp, units > 0 — the row is written and frozen. Re-runs throughout the day update nothing on that row except the grader, which sets status and pnl_units after the game finishes.
Why: if we kept re-pricing pending rows, the displayed edge would drift toward the closing line and the realized ROI column would measure "how well does our number track the market" instead of "how well do we beat the price we actually got." The Added column on the Best Bets table shows the moment of capture.
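The freeze-on-first-capture behavior amounts to a write-once row plus a grader that touches only `status` and `pnl_units`. The sketch below uses an in-memory dict in place of the real table; the key format, the other field names, and the omission of pushes are all simplifying assumptions.

```python
def capture(store, bet_key, row):
    """Write the row the first time a bet clears the gates; later calls are no-ops."""
    if bet_key in store:
        return False  # already frozen: keep the original price, units, and edge
    store[bet_key] = dict(row, status="pending", pnl_units=None)
    return True

def grade(store, bet_key, won):
    """Settle a frozen row using the price captured at entry (pushes not handled)."""
    row = store[bet_key]
    row["status"] = "won" if won else "lost"
    if won:
        row["pnl_units"] = row["units"] * (row["decimal_odds"] - 1.0)
    else:
        row["pnl_units"] = -row["units"]

store = {}
first = capture(store, "BOS-ML", {"units": 2.65, "edge_pp": 5.0, "decimal_odds": 1.91})
# A later re-run at a worse price must not overwrite the frozen row.
second = capture(store, "BOS-ML", {"units": 1.0, "edge_pp": 2.1, "decimal_odds": 1.80})
grade(store, "BOS-ML", won=True)
```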
Every Platt fit + sizing rule shipped to production was first validated against the 2025 season as a strict hold-out — 2025 games are not in the training set for any calibration coefficient. The harness:
Production Platt fits and their 2025 hold-out numbers:
Why ship the "loosened" fit over the log-loss-optimal one? Log-loss is a global metric: it scores every probability prediction equally, including the ~80% of games where we'd never bet. ROI only counts the predictions that actually fire under our edge gate + Kelly filter, which by construction are the high-confidence tail. A calibrator tuned to minimize global log-loss will compress the tails harder than a calibrator tuned for the bet-decision boundary. The two objectives disagree, and the operationally-correct one is ROI.
The 50/50 blend with identity isn't arbitrary: it's a uniform prior over "trust the historical correction" (full Platt) vs "trust the simulator" (no Platt). Halfway between is the maximum-entropy choice when both endpoints have credible support — Platt because the bbsim cache shows real systematic compression, identity because the unfiltered ROI on raw probabilities is genuinely +EV (63–65% win rate vs 52.4% breakeven on 2025 game-totals). Fully discounting either end is overconfident given the data we have.
Practical consequence: 3.2× the bet volume (7,779 vs 2,422), +22.21% ROI vs +20.84%, at a cost of +0.015 log-loss. We'd revisit the blend if 2026 forward results showed ROI degrading below the strict-Platt baseline.
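A sketch of a Platt correction blended 50/50 with identity. It assumes the Platt map is a logistic fit on the logit of the raw probability and that the blend is taken in probability space; the coefficients `a=0.7, b=0.0` are invented for illustration and are not the production values in platt_calibration.json.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def platt(p_raw, a, b):
    """Platt-scaled probability: sigmoid(a * logit(p_raw) + b)."""
    return sigmoid(a * logit(p_raw) + b)

def blended(p_raw, a, b, weight=0.5):
    """Blend full Platt with identity; weight=0.5 is the halfway point."""
    return weight * platt(p_raw, a, b) + (1.0 - weight) * p_raw

# Hypothetical coefficients: a < 1 pulls confident probabilities toward 50%.
full = platt(0.70, a=0.7, b=0.0)      # about 0.644
half = blended(0.70, a=0.7, b=0.0)    # about 0.672
```

The blended value sits between the raw and fully corrected probabilities, so more bets clear the edge gate than under the strict fit.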
The harness lives at scratch/backtest_platt_roi.py (game totals) and scratch/refit_platts_on_2025.py (Platt refit). Per-config result JSONs are in scratch/ and the production coefficients with full audit trails sit in lib/models/platt_calibration.json — including the previous 2024-cohort fits, kept for diff-the-fit forensics.
The most credible number on this page is what the deployed system has actually returned since launch. These are bets that the production generator locked in (entry price, units, edge), that the grader settled against the final score, and that are stored in the best_bets table: no backtesting, no replay, no re-pricing.
Season-by-season
Best Bets generation pauses every March 1 – April 30 (the early-season cohort runs −10 to −15% ROI on every backtest year — cold weather, unstable rotations, less training signal on fresh starters). The generator un-pauses May 1 each year, so an empty current-season row in March or April is by design, not a feed failure.
Pitcher strikeout props are gated by a two-leg cohort filter that runs on top of the standard ingest. Both legs are restricted to the same low-K/9 starter pool and fire on mechanically disjoint pitcher-games: the model can't put +edge on both sides of the same line, so adding the second leg roughly doubles slate coverage without overlap.
When the model says +Under with edge ≥ 2pp on a Q1 cohort pitcher, we bet Under. Stake is half-Kelly on the model's shrunk probability (k=0.75 production routing), capped at 6 units. This leg bypasses the legacy apply_edge_tail_policy zero-stake band that was set for the older R5 selector cohort.
When the model says +Over with edge ≥ 2pp on a Q1 cohort pitcher, we fade: take the matching Under at the under price. Mechanism: Q1 Overs are −17 to −22% ROI in every backtest year on the pre-fade pool, so when the model picks an Over in this cohort, the empirical evidence says it's wrong. Stake is half-Kelly on a synthetic probability market_imp + over_edge/100 (the model's own under-side prob would clamp Kelly to 0, since we're betting against the model on this cell), capped at 6 units.
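The fade-leg sizing can be sketched as half-Kelly on the synthetic probability. Several pieces here are assumptions: that `market_imp` is the implied probability of the Under price being bet, that `over_edge` is in percentage points, that the bankroll is 100 units, and the 0.99 clamp; the function name is hypothetical.

```python
def fade_stake_units(under_decimal_odds, market_imp, over_edge_pp,
                     kelly_mult=0.5, cap_units=6.0, bankroll_units=100.0):
    """Half-Kelly stake on the Under using synthetic prob market_imp + over_edge/100."""
    b = under_decimal_odds - 1.0
    if b <= 0:
        raise ValueError("decimal odds must be greater than 1")
    # Synthetic probability from the text; clamped below 1 as a safety assumption.
    p_syn = min(market_imp + over_edge_pp / 100.0, 0.99)
    full_kelly = (b * p_syn - (1.0 - p_syn)) / b
    units = kelly_mult * full_kelly * bankroll_units
    return min(max(units, 0.0), cap_units)

small_fade = fade_stake_units(1.91, 0.524, 3.0)    # about 3.2 units
capped_fade = fade_stake_units(1.91, 0.524, 10.0)  # raw ~10.5u, capped at 6
```

Because the synthetic probability always exceeds the market-implied one whenever the Over edge is positive, Kelly never clamps to zero on this leg, which is the point of the construction.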
v4 backtest = production engine (per-pitcher estimateIP, opponent K% × pitcher handedness, park K factor, L5 short-term blend on K/9) at the shipped half-Kelly + 6u-cap sizing for both legs. Closing-line basis from the historical Odds API per-event snapshots in data/backtest/raw_props_historical/.
22 months, 1,127 bets at ~5u average stake. 17 of 22 months positive ROI on combined. Worst single month: 2024-07 (−24.30% — only month where both legs lost meaningfully; the legs are normally somewhat anti-correlated month-to-month, mid-summer 2024 was the exception). Best single month: 2023-07 (+36.18%).
Earlier versions + full detail: click the v1.5.3 chip next to the PlateIQ title.