Pitch, Hit, Profit

Andrew Cerqui; Jace Higa

Introduction

Sports betting has grown rapidly in popularity, fueled by increased legalization, media coverage, and widespread access to online platforms. For many, it remains a form of entertainment, but could it be something more? With the right data and strategy, is it possible to consistently identify favorable opportunities and gain an edge over the market? Specifically, can a data-driven approach to betting, particularly in Major League Baseball (MLB), serve as a viable income stream?

Baseball is a team sport played between two sides of nine players, where the goal is to score runs by hitting a pitched ball and advancing around four bases. Teams alternate between offense (batting) and defense (pitching and fielding), with each half-inning ending after three outs. A standard game consists of nine innings. While the rules are relatively simple, modern baseball stands out as one of the most analytically advanced professional sports. Its one-on-one matchups and slower pace of play make it easier to isolate and track a wide range of distinct metrics in real time. Teams routinely analyze batter-pitcher matchups, pitching tendencies, and situational probabilities to gain a competitive edge—making baseball particularly appealing for data-driven sports betting.

With the rise of advanced statistics, particularly expected metrics, and real-time data, bettors now have a wide range of wagering options. These include moneylines (who wins the game), over/unders (total runs scored), and player proposition bets (prop bets), which are specific bettable outcomes related to an individual player’s performance in a game. Common props include hits (whether a player records a hit in a given game, yes or no) and total bases, which track how many bases a player earns through hits in a given game. Despite the abundance of available data, many sports bettors rely on intuition and personal biases, an approach that seldom results in sustained success. Compounding this challenge is the fact that sportsbooks maintain a built-in edge through the way odds are structured. For example, in an evenly matched game, both teams are typically listed at -110 odds rather than +100. This means a $100 winning bet would return just $90.91 in profit, rather than $100, creating a margin that favors the sportsbook. Thus, at these odds a bettor would have to win more than 52.4% of their bets just to break even (Naomi 2023).
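The -110 arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names are ours and are used only for illustration.

```python
def profit_on_win(american_odds: int, stake: float) -> float:
    """Profit (excluding the returned stake) when a bet at American odds wins."""
    if american_odds < 0:
        # Negative odds: risk |odds| to win 100.
        return stake * 100 / -american_odds
    # Positive odds: risk 100 to win `odds`.
    return stake * american_odds / 100


def break_even_probability(american_odds: int) -> float:
    """Win rate at which a bet at these odds has zero expected value."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


print(round(profit_on_win(-110, 100), 2))       # profit on a $100 winner at -110
print(round(break_even_probability(-110), 4))   # required win rate at -110
```

At +100 the break-even rate is exactly 50%, so the gap between 50% and the -110 break-even rate is the margin the bettor has to overcome.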

In this context, leveraging the vast amount of granular baseball data becomes crucial. With its rich statistical history and highly measurable in-game actions, baseball stands out as one of the most data-intensive and analytically advanced sports. It offers bettors an opportunity to apply data-driven strategies and models to gain an edge over the market.

We aim to investigate whether recent player performance, historical trends, advanced metrics, and variations in sportsbook odds can be leveraged to identify profitable opportunities in Major League Baseball betting markets.

Background

This background section examines the theoretical and practical foundations for data-driven sports betting research. We first explore how modern sports betting markets operate and where pricing inefficiencies might exist, particularly for individual player propositions. We then review the evolution of predictive modeling in baseball analytics and how machine learning approaches have been applied to sports betting. Finally, we identify key research gaps in integrating statistical modeling with market analysis, setting the stage for our hybrid methodology that combines bottom-up player performance prediction with top-down market opportunity identification.

Sports Betting Market Structure and Efficiency

The legalization of online sports betting across multiple jurisdictions has created increasingly sophisticated and competitive markets. However, research on sports betting market efficiency shows mixed results—different sports and bet types demonstrate varying levels of pricing accuracy (Hubáček, Šourek, and Železný 2019).

Main game betting markets (like who wins the game) tend to be efficiently priced due to high betting volumes and extensive professional analysis. In contrast, individual player proposition bets often show greater pricing disparities. These player prop markets receive less analytical attention and have lower betting volumes, potentially creating opportunities for systematic data-driven approaches.

The fundamental challenge for any betting strategy remains the house edge built into sportsbook pricing. Standard odds structures ensure that sportsbooks maintain mathematical advantages even in competitive markets, requiring bettors to identify opportunities where their expected value exceeds the built-in profit margins. This edge varies across bet types and operators, with player propositions often carrying higher margins than main game markets.
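To make the house edge concrete, the sketch below computes the overround (how far the two implied probabilities of a two-sided market sum above 100%) and the expected value of a bet given a bettor's own win probability. The helper names are hypothetical and the numbers are illustrative, not taken from our data.

```python
def implied_probability(american_odds: int) -> float:
    """Probability implied by American odds, before removing the margin."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def overround(odds_a: int, odds_b: int) -> float:
    """Excess of the two implied probabilities over 1.0 (the book's margin)."""
    return implied_probability(odds_a) + implied_probability(odds_b) - 1.0


def expected_value(win_probability: float, american_odds: int, stake: float = 1.0) -> float:
    """Expected profit of one bet, given the bettor's estimated win probability."""
    if american_odds < 0:
        profit = stake * 100 / -american_odds
    else:
        profit = stake * american_odds / 100
    return win_probability * profit - (1 - win_probability) * stake


print(round(overround(-110, -110), 4))             # margin on a -110 / -110 market
print(round(expected_value(0.50, -110, 100), 2))   # coin-flip bettor loses on average
print(round(expected_value(0.55, -110, 100), 2))   # a 55% bettor clears the margin
```

A bet is only worth placing when the bettor's estimated probability is high enough to push the expected value above zero at the posted price.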

Predictive Modeling in Baseball Analytics

Baseball analytics has evolved dramatically since the introduction of sabermetrics, with modern machine learning applications achieving increasingly sophisticated prediction capabilities. Huang and Li (2021) demonstrated that statistical models could achieve over 94% accuracy in MLB game outcome prediction using advanced feature selection techniques, while Li, Huang, and Li (2022) developed comprehensive frameworks for identifying the most predictive baseball statistics.

However, a critical insight from Walsh and Joshi (2024) is that prediction accuracy and betting profitability represent fundamentally different goals. Their research showed that models optimized for probability calibration achieved +34.69% ROI compared to -35.17% for accuracy-focused approaches. Calibration refers to how well a model’s predicted probabilities match actual outcome frequencies. For example, events predicted with 30% probability should actually occur about 30% of the time. This is crucial for betting because accurate probability estimates are needed to identify when market odds offer value, regardless of whether the model correctly predicts the outcome. This finding suggests that traditional machine learning evaluation metrics such as accuracy, precision, and recall may actually be counterproductive for betting applications.
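One common way to check calibration is to bin predictions and compare the average predicted probability in each bin with the observed hit rate, alongside a summary score such as the Brier score. The sketch below illustrates that idea only; it is not the evaluation procedure used by Walsh and Joshi, and the bin count is an arbitrary choice.

```python
from collections import defaultdict


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def reliability_table(probs, outcomes, n_bins=5):
    """Rows of (mean predicted probability, observed frequency, count) per bin."""
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the top bin
        bins[index].append((p, y))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_predicted = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        table.append((mean_predicted, observed, len(pairs)))
    return table


# Ten events predicted at 30%, three of which occurred: well calibrated.
probs = [0.3] * 10
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(reliability_table(probs, outcomes))
print(round(brier_score(probs, outcomes), 3))
```

For a well-calibrated model, the first two columns of each row are close to each other.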

Recent research has increasingly focused on ensemble methods for sports prediction. Galekwa and colleagues (2024), for example, found that combining multiple algorithms consistently outperforms single-model approaches. This supports developing comprehensive modeling frameworks that leverage multiple analytical techniques rather than relying on individual algorithms.

Strategic Approaches to Sports Betting

Sports betting research traditionally distinguishes between two methodological approaches. Bottom-up strategies involve building statistical models to predict outcomes more accurately than sportsbooks do. These approaches use player performance data, matchup analysis, and advanced metrics to generate probability estimates that might reveal when market odds are mispriced.

Top-down strategies focus on comparing odds across different sportsbooks to identify pricing inefficiencies. Rather than predicting game outcomes, these approaches look for discrepancies in how different operators price the same events, capitalizing on situations where one book offers significantly better odds than others.
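A top-down comparison can be sketched as a simple search for the best available price on each outcome across books. The sportsbook labels and decimal odds below are made up for illustration.

```python
def best_prices(quotes):
    """Return {outcome: (book, decimal_odds)} with the highest price per outcome."""
    best = {}
    for book, outcome, decimal_odds in quotes:
        current = best.get(outcome)
        if current is None or decimal_odds > current[1]:
            best[outcome] = (book, decimal_odds)
    return best


def implied_probability_gap(quotes, outcome):
    """Difference in implied probability between the worst and best price."""
    prices = [price for _, side, price in quotes if side == outcome]
    return 1 / min(prices) - 1 / max(prices)


# Hypothetical quotes for one player prop: (book, outcome, decimal odds).
quotes = [
    ("book_a", "over", 1.87),
    ("book_b", "over", 1.95),
    ("book_c", "over", 1.91),
    ("book_a", "under", 1.95),
    ("book_b", "under", 1.87),
]
print(best_prices(quotes))
print(round(implied_probability_gap(quotes, "over"), 4))
```

A large gap between books on the same outcome is the kind of discrepancy a top-down strategy looks for.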

While both approaches have shown promise independently, existing literature has not rigorously examined whether combining bottom-up predictive modeling with top-down market analysis provides superior results compared to using either methodology alone.

Research Gaps and Methodological Innovation

A significant gap exists in formal examination of hybrid approaches that integrate predictive modeling with market analysis. While practitioners may informally blend these strategies, academic research has not evaluated whether such integration provides superior risk-adjusted returns.

This gap is particularly notable given the complementary nature of these approaches: statistical modeling excels at identifying undervalued performance patterns, while market analysis can identify optimal timing and execution strategies. The combination potentially addresses the primary weaknesses of each individual approach.

Furthermore, much existing research focuses on prediction accuracy rather than practical implementation, often overlooking critical factors such as bankroll management or market access limitations. These practical considerations are essential for translating theoretical success into real-world profitability.

MLB as an Analytical Environment

Baseball presents unique advantages for data-driven betting analysis due to its extensive statistical history and granular performance data. The sport’s discrete, measurable events enable precise statistical modeling, while Statcast technology has created opportunities for advanced analytical approaches incorporating biomechanical measurements and contact quality metrics.

The individual nature of many baseball events—particularly batter versus pitcher matchups—creates natural opportunities for player proposition betting that may be less efficiently priced than team outcomes. Two of the most common player propositions involve hits and total bases. Hit markets typically offer over/under wagers on whether a player will record more than 0.5 hits (at least one hit) or 1.5 hits (multiple hits) in a game. Total bases represent the cumulative value of a player’s hits: singles count as one base, doubles as two, triples as three, and home runs as four. Common total bases markets include over/under 1.5 or 2.5 bases.
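The total-bases rule and the settlement of an over/under line can be written out directly. This is a minimal sketch; the event labels are our own choice of strings.

```python
BASES_PER_HIT = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def total_bases(hits):
    """Sum the base value of each hit a player recorded in a game."""
    return sum(BASES_PER_HIT[hit] for hit in hits)


def settle_over_under(value, line):
    """Half-point lines (0.5, 1.5, 2.5) cannot tie, so the result is over or under."""
    return "over" if value > line else "under"


game_hits = ["single", "double"]                        # one single and one double
print(total_bases(game_hits))                           # 1 + 2 bases
print(settle_over_under(total_bases(game_hits), 2.5))   # total bases market
print(settle_over_under(len(game_hits), 1.5))           # hits market
```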

These individual performance markets offer analytical advantages over team-based betting. Player propositions depend primarily on individual skill and recent form rather than complex team dynamics, creating more predictable statistical patterns. The large number of games throughout the baseball season provides substantial sample sizes for analysis while creating numerous daily opportunities for profitable betting.

Recent research has identified significant variation in sportsbook accuracy across different types of baseball bets, with player propositions showing greater pricing disparities between operators compared to primary game markets (Vandenbruaene, Annaert, and Ceuster 2022).

Research Objective and Contribution

Our investigation addresses the identified research gap by developing and evaluating complementary approaches that demonstrate both top-down and bottom-up methodologies for MLB player proposition betting. The first approach develops interactive tools for situational matchup analysis, enabling detailed examination of pitcher-batter historical performance and contextual factors that may reveal betting opportunities on a case-by-case basis.

The second approach employs calibrated machine learning models trained on comprehensive player performance data to generate probabilistic assessments, which are then systematically compared against real sportsbook odds from multiple operators to identify value opportunities. This hybrid methodology focuses on analysis of individual player propositions across large volumes of daily betting markets.

This dual-methodology investigation provides novel contributions to sports betting literature by rigorously implementing both approaches and examining their respective strengths for different aspects of the sports betting challenge. The research advances understanding of line pricing variations in proposition betting and provides practical insights into where different data-driven approaches may be most effective.

Ethical Considerations

Before going further into our work, we address potential ethical issues. These considerations are important because our analytical findings can influence real-world behavior and financial decisions; acknowledging limits and risks helps prevent harm and misuse. Stating our standards up front also strengthens the integrity and reproducibility of the research and ensures alignment with responsible gambling and legal frameworks.

Data Privacy and Sources

The data utilized in this research consists entirely of publicly available information from professional baseball games, including Statcast measurements, player performance statistics, and sportsbook pricing data from both regulated and offshore operators. No personally identifiable information beyond public player names and performance metrics was collected. All data represents factual, objective measures that are routinely published within the sports analytics community.

Responsible Gambling Considerations

The primary ethical consideration in this research concerns its potential application to sports betting activities. Several critical limitations and risks must be emphasized:

No Guaranteed Profits: Even when positive expected value opportunities are identified, variance inherent in probabilistic outcomes means that substantial losses remain possible over both short and extended periods. Historical performance does not guarantee future results.

Variance and Risk Management: The high variance of sports betting means that even systematically profitable approaches can experience extended losing streaks that individuals may not be able to endure. Anyone considering application of these findings must thoroughly understand proper bankroll management principles and position sizing (deciding how much money to bet on each wager) strategies.

Problem Gambling Risks: Sports betting can lead to significant financial harm and may negatively affect mental health. We strongly advise anyone considering wagering to seek education on responsible gambling practices and never risk money they cannot afford to lose. Betting should be approached cautiously and within personal limits.

Market Evolution: Betting markets continuously evolve as operators adjust to new information. Any identified inefficiencies may diminish as markets adapt, potentially rendering historical findings obsolete.

Disclaimer: This research is conducted for academic purposes only. Any application of these findings to actual betting activities is undertaken entirely at the user’s own risk. The authors accept no responsibility for financial losses that may result from application of this research. Readers are strongly encouraged to thoroughly research proper bankroll management techniques and to seek professional guidance on responsible gambling practices before placing any wagers.

Data

Our dataset was built by combining information from multiple sources, each selected for its relevance to MLB performance and wagering analysis. Together, these sources contributed complementary perspectives and enabled us to perform the kind of matchup-level analysis needed to explore profitability and uncover potential edges in the betting market.

Data Sources

We collected data from five sources, which are outlined in greater detail in the following sections. StatsAPI, pybaseball, and oddsapi were accessed using a custom Python scraper that called their respective APIs directly, allowing us to ingest structured data on player performance, advanced metrics, and betting markets. This approach differed from our Selenium-based web scraper used later for MLB.com, which required automated browser interaction to extract dynamically loaded content not available through the APIs. Finally, we sourced supplemental tracking data from Baseball Savant by downloading static CSV files. Together, these pipelines enabled a comprehensive, multi-source dataset suitable for statistical analysis, machine learning, and real-world validation.

statsapi

The first source was Major League Baseball’s public API, accessed using the statsapi Python package. This provided structured JSON data containing player metadata, team identifiers, game results, and detailed per-game statistics for batters and pitchers. This source served as the foundation for our dataset. To automate collection, we deployed the scraper on Railway, a cloud platform that lets us run and schedule scripts automatically without needing to manage any servers. We scheduled our scripts to run daily at 7:00 AM PST, retrieving the previous day’s game data. The raw JSON was parsed using SQL queries in PostgreSQL and loaded into a set of normalized relational tables.

Our tables included players, teams, games, sides, batter_stats, and pitcher_stats. The players table contains a list of all individuals who appeared in at least one Major League Baseball game from June 21st to the present, each assigned a unique player_id for consistent identification. The teams table includes all 30 MLB teams, each mapped to a unique team_id, enabling team-level joins and analysis. The sides table identifies which team played as the home or away team in each game, an important distinction in baseball since the home team bats second and may have strategic advantages. The batter_stats and pitcher_stats tables store detailed per-game box score data, separated by hitting statistics and pitching statistics. These stats include traditional metrics such as hits, home runs, and strikeouts for batters, and innings pitched, earned runs, and walks for pitchers. The use of primary keys such as player_id, game_id, and team_id allowed for seamless joins across tables. This structure enables efficient querying across player appearances, game contexts, and performance metrics, while remaining flexible enough to incorporate additional data sources such as advanced Statcast metrics and other tracking metrics.
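The join structure described above can be sketched with a small in-memory database. This example uses Python's built-in sqlite3 module in place of PostgreSQL so that it runs anywhere; the column lists are simplified assumptions and the rows are made-up placeholders, not our actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE teams   (team_id   INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE games   (game_id   INTEGER PRIMARY KEY, game_date TEXT);
    CREATE TABLE sides (
        game_id INTEGER REFERENCES games(game_id),
        team_id INTEGER REFERENCES teams(team_id),
        side    TEXT CHECK (side IN ('home', 'away')),
        PRIMARY KEY (game_id, team_id)
    );
    CREATE TABLE batter_stats (
        game_id   INTEGER REFERENCES games(game_id),
        player_id INTEGER REFERENCES players(player_id),
        team_id   INTEGER REFERENCES teams(team_id),
        hits      INTEGER,
        home_runs INTEGER,
        PRIMARY KEY (game_id, player_id)
    );
    -- Placeholder rows for illustration only.
    INSERT INTO players VALUES (1, 'Batter A'), (2, 'Batter B');
    INSERT INTO teams   VALUES (10, 'Team X'), (20, 'Team Y');
    INSERT INTO games   VALUES (100, '2025-06-21');
    INSERT INTO sides   VALUES (100, 10, 'home'), (100, 20, 'away');
    INSERT INTO batter_stats VALUES (100, 1, 10, 2, 1), (100, 2, 20, 0, 0);
    """
)

# Join a batter's box score line to the player, the game, and the home/away side.
rows = conn.execute(
    """
    SELECT p.name, g.game_date, s.side, b.hits
    FROM batter_stats AS b
    JOIN players AS p ON p.player_id = b.player_id
    JOIN games   AS g ON g.game_id = b.game_id
    JOIN sides   AS s ON s.game_id = b.game_id AND s.team_id = b.team_id
    ORDER BY p.name
    """
).fetchall()
print(rows)
```

Because every table shares player_id, game_id, or team_id, adding another source such as Statcast metrics only requires one more join on the same keys.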

mlb.com

Our second data source was MLB’s official website, mlb.com, which we scraped using Selenium, a Python package for automated browser interaction. This approach allowed us to extract pre-game matchup information from MLB.com’s preview pages, data not available through the public Stats API. Specifically, we captured scheduled starting pitchers and projected batter matchups, along with historical hitter performance against the opposing pitcher. It enabled us to factor in pre-game context and generate predictions ahead of each day’s games. The scraper was designed to click into each game’s “Preview” button, capture the relevant information, then return to the main page and repeat this process for all games scheduled on a given day. This scraper was deployed on Railway and scheduled to run slightly earlier than the StatsAPI job, at 6:55 AM PST. Unlike the StatsAPI scraper, it focused on collecting information for games scheduled later that same day.

A major challenge with this source was the absence of standardized identifiers such as player or game IDs. To address this, we created a composite key in the previews table using a combination of game_date, batter_name, and pitcher_name to uniquely identify each record. The JSON data gathered was unstructured, so parsing it required extensive use of regular expressions (regex), which made the process significantly more complex.
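The sketch below shows the general shape of this step: a regular expression that pulls numbers out of free text, and a composite key built from the three fields named above. The input format ("3-for-10, 1 HR") is a hypothetical example; the text actually scraped from the preview pages may be laid out differently.

```python
import re

# Hypothetical matchup text such as "3-for-10, 1 HR"; the HR part is optional.
MATCHUP_PATTERN = re.compile(
    r"(?P<hits>\d+)-for-(?P<at_bats>\d+)(?:,\s*(?P<home_runs>\d+)\s*HR)?"
)


def parse_matchup(text):
    """Extract hits, at-bats, and home runs from a matchup string, or None."""
    match = MATCHUP_PATTERN.search(text)
    if match is None:
        return None
    hits = int(match["hits"])
    at_bats = int(match["at_bats"])
    home_runs = int(match["home_runs"] or 0)
    average = round(hits / at_bats, 3) if at_bats else None
    return {
        "hits": hits,
        "at_bats": at_bats,
        "home_runs": home_runs,
        "batting_average": average,
    }


def preview_key(game_date, batter_name, pitcher_name):
    """Composite key; names are lower-cased and whitespace-normalized."""
    def normalize(name):
        return " ".join(name.lower().split())

    return (game_date, normalize(batter_name), normalize(pitcher_name))


print(parse_matchup("3-for-10, 1 HR"))
print(preview_key("2025-06-21", "  Batter  A ", "Pitcher B"))
```

Normalizing names before building the key reduces the chance that spacing or capitalization differences create duplicate records for the same matchup.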

In the end, we produced a table called previews, which included the game date, projected starting pitcher, and historical performance data of each batter against that pitcher, such as home runs, at-bats, and batting average. This table allowed us to enrich our dataset by linking valuable pre-game context with actual in-game performance, enabling more nuanced analyses of matchups and outcomes.

Baseball Savant

To supplement our scraped data, we incorporated a third source: Baseball Savant, a publicly available website that provides detailed, season-long statistics that go beyond traditional box scores. Unlike our other sources, which offered game-by-game data, Baseball Savant’s data is cumulative and tracks advanced metrics throughout the season. For example, it includes expected batting average, which incorporates exit velocity (how hard the ball is hit) and launch angle (the vertical angle at which the ball leaves the bat), and pitch usage rates, which show how frequently pitchers throw specific pitch types. The site offers the ability to download the data as CSV files, which we did during the All-Star break to capture a midseason snapshot of player performance. For the purposes of our analysis, we chose to download the data once, as the season is already more than halfway complete and player performance is unlikely to vary significantly over the remaining games. After downloading the CSV files, we cleaned the data by removing unnecessary columns and renaming fields to improve clarity and ensure consistency across our dataset.

The processed data was stored in four normalized tables within our PostgreSQL database. The batter_pitches table captured how individual batters performed against specific pitch types, such as fastballs, sliders, and curveballs. It included advanced metrics like Run Value (a measure of how much a specific pitch type increases or decreases run expectancy when thrown to that specific batter) and expected batting average to estimate likely outcomes. Similarly, the pitcher_pitches table reflected how effective each pitcher was with their various pitch types, using the same set of metrics but from the pitcher’s perspective. The pitcher_statcast table provided aggregated performance statistics that were not pitch-specific, offering a broader view of each pitcher’s season-level tendencies. This Statcast data was particularly valuable as it combined pitch-level insights with overall performance metrics, enabling a more complete evaluation of both hitters and pitchers.

pybaseball

Our next source was the pybaseball Python package, which provided extended access to MLB’s Statcast system and allowed us to retrieve season-level statistics. This data enriched our dataset with advanced metrics, including batted ball tracking (which measures factors like exit velocity and hit direction), pitch characteristics, and sabermetric indicators (which provide expected values for more precise performance evaluation). Using pybaseball.statcast(), we collected pitch-level data spanning over a full season, which was then aggregated into player-game summaries. We were also able to extract seasonal statistics for batters, adding broader context around batter performance.
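The aggregation from pitch-level rows to player-game summaries can be sketched as a group-by on batter and game. The example below uses plain dictionaries rather than the DataFrame returned by pybaseball so that it runs without a network call; the field names (batter, game_pk, launch_speed, events) follow Statcast column names, while the summary fields and sample rows are our own illustration.

```python
from collections import defaultdict

HIT_EVENTS = {"single", "double", "triple", "home_run"}


def summarize_player_games(pitches):
    """Collapse pitch-level rows into one summary per (batter, game_pk)."""
    grouped = defaultdict(list)
    for pitch in pitches:
        grouped[(pitch["batter"], pitch["game_pk"])].append(pitch)

    summaries = {}
    for key, rows in grouped.items():
        # launch_speed is missing on pitches that were not put in play.
        speeds = [r["launch_speed"] for r in rows if r["launch_speed"] is not None]
        summaries[key] = {
            "pitches_seen": len(rows),
            "hits": sum(1 for r in rows if r["events"] in HIT_EVENTS),
            "avg_exit_velocity": sum(speeds) / len(speeds) if speeds else None,
        }
    return summaries


# Made-up pitch rows for two batters in one game.
pitches = [
    {"batter": 1, "game_pk": 100, "launch_speed": 100.0, "events": "single"},
    {"batter": 1, "game_pk": 100, "launch_speed": None, "events": None},
    {"batter": 1, "game_pk": 100, "launch_speed": 90.0, "events": "field_out"},
    {"batter": 2, "game_pk": 100, "launch_speed": None, "events": "strikeout"},
]
print(summarize_player_games(pitches))
```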

To unify these sources, we used the players table to bridge the gap between the detailed Statcast tracking data and the rest of our dataset. By matching on player_id, as in earlier integrations, we successfully connected granular tracking metrics with season-level statistics for 537 unique players. This join allowed us to capture both game-level performance details and broader, long-term trends across the season.

Only historical data was needed, so our scraper ran once, and the collected data was stored in the batter_statcast table. This table followed a similar structure to pitcher_statcast, but included season-level summaries instead of data limited to the current year. All data was pushed into PostgreSQL, where this structure supported querying across pitch-level, game-level, and seasonal data, laying the groundwork for downstream modeling and analysis.

oddsapi

Our betting market data was sourced using the OddsAPI feed, which provided structured JSON data on sportsbook offerings for player props such as hits, total bases, and home runs. This information was retrieved using the same custom scraper architecture used in other parts of the project. For consistency and integration with our game-level data, we joined odds records to player performance using a combination of game date and team abbreviations, ensuring alignment across sources despite the absence of shared unique identifiers.

The betting data was parsed and stored in PostgreSQL, using fields such as market_type (type of prop bet), price (decimal odds), and point (over/under threshold). We captured a historical snapshot of the betting odds, preserving how sportsbooks were pricing specific player outcomes at a given moment in time. This static reference point allowed us to compare market expectations to actual player performance, which was especially valuable during model training and evaluation. While the dataset does not yet include full odds histories or line movement over time, the current structure still enables a direct link between sportsbook predictions and individual player-game results.

This integration supports downstream use cases such as model backtesting and value identification. Backtesting refers to the process of applying a predictive model to historical data to evaluate how well it would have performed in real-world scenarios, for example by simulating bets on past games based on model predictions and comparing the results to actual outcomes. Value identification involves comparing the model’s predicted probabilities for specific player outcomes (e.g., getting a hit or hitting a home run) against the market-implied probabilities from sportsbooks. This comparison lets us evaluate model performance not only in terms of accuracy but also potential profitability. This alignment creates a clear path for analysis centered on identifying inefficiencies in publicly posted betting lines.
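Value identification and backtesting can be combined in a short loop: convert each decimal price to an implied probability, bet only when the model's probability exceeds it by some margin, and tally the result. This is a simplified sketch with flat stakes and invented numbers. The implied probability here still contains the sportsbook's margin, and the min_edge threshold is an assumption rather than a value from our analysis.

```python
def implied_probability(decimal_odds):
    """Market-implied probability of a decimal price (margin not removed)."""
    return 1 / decimal_odds


def backtest(bets, min_edge=0.0, stake=1.0):
    """Flat-stake backtest over (model_prob, decimal_odds, won) records.

    Returns (bets placed, total profit, ROI on total amount staked).
    """
    placed = 0
    profit = 0.0
    for model_prob, decimal_odds, won in bets:
        edge = model_prob - implied_probability(decimal_odds)
        if edge <= min_edge:
            continue  # no perceived value at this price
        placed += 1
        profit += stake * (decimal_odds - 1) if won else -stake
    roi = profit / (placed * stake) if placed else 0.0
    return placed, profit, roi


# Made-up historical records: model probability, posted price, actual result.
history = [
    (0.60, 2.0, True),
    (0.58, 1.8, True),
    (0.40, 2.0, True),   # skipped: model sees no value
    (0.55, 1.9, False),
]
print(backtest(history))
print(backtest(history, min_edge=0.05))
```

Raising min_edge trades volume for selectivity: fewer bets are placed, but each one requires a larger disagreement between the model and the market.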

Data Organization

In the end, the data was organized and combined using third normal form (3NF) to minimize redundancy and maintain reliable links across related data. Our entity-relationship diagram (ERD) (Figure 1) maps the relationships among all entities derived from our data acquisition process.

ERD diagram — **Figure 1**: This diagram illustrates how our database schema is organized to support complex querying and analysis. At the center, the players, games, and teams tables form the relational backbone, linking to performance-specific tables like batter_stats, pitcher_stats, and their Statcast counterparts. The previews and player_props tables bring in pre-game context and betting market data, while sides defines home/away team roles per game. This structure, normalized to third normal form (3NF), allows us to seamlessly join historical performance, matchup context, and betting odds—enabling detailed statistical modeling, matchup evaluation, and value detection at both the player and game levels.

See Data Dictionary

Results

Using the full dataset we compiled, we now turn to a focused analysis aimed at uncovering actionable edges for MLB betting. We start by modeling team run scoring to identify the strongest drivers of offense, then evaluate a classifier for over/under outcomes at the 4.5-run threshold. We follow that up with hitter-level prop work that combines matchup histories, pitch-type run values, and calibrated ML models, linking predictions to historical sportsbook odds to test realized profitability.

Over/Under Run Scoring Analysis

Our initial analysis focused on identifying which variables were most predictive of how many runs a team scores in a given game. We used a wide range of features grouped into three main categories:

Opposing starting pitcher statistics (sp.): Metrics such as earned run average (ERA), opponent batting average (BA), innings pitched (IP), and expected stats like xERA and xBA.
Opposing relief pitcher statistics (rp.): Aggregated bullpen performance using the same types of measures as for starters, such as ERA and slugging percentage (SLG) allowed.
Team batter statistics (b.): Both per-game aggregates (e.g., total hits, home runs, strikeouts) and season-long averages for the lineup, such as batting average (BA) and slugging percentage (SLG).

By integrating data from opposing starting pitchers, opposing relievers, and the team’s offensive performance, we aimed to construct a well-rounded model for predicting team run performance.

Exploratory Data Analysis

To gain insight into team scoring behavior, we computed summary statistics for runs scored per game. This helped us examine the distribution of offensive output and understand typical scoring ranges across all team-game observations.

Run Distribution — **Figure 2**: The distribution of runs scored per game across all teams from June 21 to the present is right-skewed, with most teams scoring between 3 and 7 runs. Extreme high-scoring games (10+ runs) are relatively rare, while shutouts and 1-run games are pretty common.

The distribution of runs scored per game, as seen in Figure 2, was right-skewed, with a few high-scoring outliers pulling the mean upward. The median was 4 runs, while the mean was slightly higher at 4.6, reflecting the impact of those outliers. This mean value serves as a useful benchmark when modeling run production in machine learning, especially when deciding how to frame our prediction target.

We then focused on identifying which features were most indicative of how many runs a team scores in a game. To do this, we built a linear regression model using a wide range of pitching and hitting statistics. However, many baseball statistics are derived from overlapping components, making them inherently correlated. For example, batting average is one of several inputs used to calculate weighted on-base average (wOBA), meaning these two variables are mathematically linked. As a result, including both in a predictive model can introduce multicollinearity, where highly correlated inputs distort the model’s ability to accurately estimate the unique contribution of each feature. To address this, we systematically filtered out redundant variables, prioritizing those that captured unique predictive value. After this reduction process, we finalized a simplified linear model using the following variables: starting pitcher earned run average (sp_era), starting pitcher batting average against (sp_ba), starting pitcher slugging percentage against (sp_slg), relief pitcher earned run average (rp_era), relief pitcher slugging percentage against (rp_slg), relief pitcher innings pitched (rp_ip), team batting average (b_ba), total team strikeouts (b_sum_so), and team slugging percentage (b_slg).
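One simple way to screen for redundant predictors is a pairwise correlation filter: keep a feature only if it is not too strongly correlated with any feature already kept. The sketch below illustrates that general idea and is not necessarily the procedure used for the model above; the 0.9 threshold, the b_woba column name, and the sample values are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def drop_correlated(features, threshold=0.9):
    """Greedily keep features whose |r| with every kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[other])) < threshold for other in kept):
            kept.append(name)
    return kept


# Made-up values: wOBA tracks batting average closely, strikeouts do not.
features = {
    "b_ba": [0.240, 0.250, 0.260, 0.270],
    "b_woba": [0.300, 0.312, 0.321, 0.335],
    "b_sum_so": [9, 7, 10, 8],
}
print(drop_correlated(features))
```

The result depends on the order in which features are considered, so in practice the features judged most important would be listed first.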

Predictor	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	6.538	2.377	2.751	0.0061 **
sp_era	-0.135	0.130	-1.042	0.2980
sp_ba	4.744	6.150	0.771	0.4408
sp_slg	-4.021	3.252	-1.236	0.2168
rp_era	-0.220	0.095	-2.323	0.0205 *
rp_slg	3.594	2.329	1.543	0.1232
rp_ip	0.224	0.089	2.527	0.0117 *
b_ba	26.823	11.862	2.261	0.0241 *
b_sum_so	0.050	0.045	1.130	0.2589
b_slg	9.894	5.692	1.738	0.0826 .

Table 1: Coefficient estimates from a linear regression model predicting team runs scored. The model includes variables related to starting and relief pitcher performance (e.g., ERA, innings pitched, opponent batting metrics), as well as team-level offensive statistics such as batting average, slugging percentage, and total strikeouts. Each coefficient reflects the estimated change in runs scored associated with a one-unit increase in the corresponding predictor, holding all other variables constant. Statistically significant predictors (p < 0.05) are marked with an asterisk (*), indicating a meaningful relationship with run production.

From our linear model, we identified three statistically significant predictors of runs scored (p < 0.05):

Relief pitcher innings pitched (rp_ip), with a p-value of 0.0117
Relief pitcher ERA (rp_era), with a p-value of 0.0205
Team batting average (b_ba), with a p-value of 0.0241
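The reading of Table 1 can be reproduced in a few lines: filter the predictors by p-value and translate a coefficient into the change in runs associated with a given change in the predictor, holding the others constant. The estimates and p-values are copied from Table 1; the 0.010 change in batting average is an illustrative choice.

```python
# (estimate, p-value) pairs copied from Table 1, intercept omitted.
TABLE_1 = {
    "sp_era": (-0.135, 0.2980),
    "sp_ba": (4.744, 0.4408),
    "sp_slg": (-4.021, 0.2168),
    "rp_era": (-0.220, 0.0205),
    "rp_slg": (3.594, 0.1232),
    "rp_ip": (0.224, 0.0117),
    "b_ba": (26.823, 0.0241),
    "b_sum_so": (0.050, 0.2589),
    "b_slg": (9.894, 0.0826),
}


def significant_predictors(table, alpha=0.05):
    """Predictor names with p-value below alpha, ordered by p-value."""
    names = [name for name, (_, p_value) in table.items() if p_value < alpha]
    return sorted(names, key=lambda name: table[name][1])


def marginal_effect(table, predictor, delta):
    """Estimated change in runs for a change of `delta` in one predictor."""
    estimate, _ = table[predictor]
    return estimate * delta


print(significant_predictors(TABLE_1))
# Effect of a 10-point (0.010) rise in team batting average.
print(round(marginal_effect(TABLE_1, "b_ba", 0.010), 3))
```

The large b_ba coefficient reflects the scale of the variable: batting average moves in thousandths, so a realistic change corresponds to a fraction of a run.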

Batting average stood out as something worth exploring further, especially as the season progresses, since we also have data on expected batting average (xBA) available.

Actual vs. Expected Batting Average

The law of averages suggests that after a large number of trials, outcomes tend to stabilize and move closer to their long-run tendencies. In the context of baseball, this implies that a team’s observed batting average may begin to reflect its underlying hitting ability over time. While not a mathematical guarantee, this principle provides context for using modeled statistics. A model-derived metric is not a true statistical expectation, but it does serve as a useful benchmark for identifying under- or over-performance across the season.

Expected batting average is a Statcast metric designed to estimate the likelihood that a batted ball will result in a hit, meaning any batted ball on which the batter safely reaches at least first base without the benefit of an error. The metric is based on two key factors: exit velocity (how hard the ball is hit) and launch angle (the vertical angle, in degrees, at which the ball leaves the bat relative to the ground). Baseball results can be unpredictable and are influenced by defensive alignments. As a result, traditional stats do not always measure how well a player is actually performing. That is where expected statistics come in: by relying on batted-ball data, they aim to provide a more accurate picture of a player’s performance. For example, a hitter might crush a ball at 115 MPH, but if it’s hit directly at a defender, it results in an out. On the other hand, a softly hit ball at just 40 MPH could drop into the perfect spot for a hit. xBA accounts for these inconsistencies by using historical data to estimate how often similar batted balls have gone for hits. While not perfect, it offers a more context-aware view of hitting performance. The following graph (Figure 3) compares each team’s actual batting average to its xBA, helping us see who has been overperforming or underperforming based on quality of contact.

Since we are trying to understand team run production, it makes sense to focus on batting average. A higher team batting average means the team is collecting more hits on average, which is essential for generating offense. In baseball, scoring runs typically requires advancing base runners, and hits are one of the primary ways to move players around the bases. The more often players get on base via hits, the more opportunities a team has to bring runners home. While not every hit results in a run, consistent hitting increases the chances of building rallies and ultimately scoring. When a team’s actual batting average is lower than its expected batting average (xBA), it may suggest underperformance, assuming xBA reasonably reflects the quality of contact. While xBA is not a perfect estimate, it offers a standardized way to evaluate whether a team is hitting into bad luck or failing to convert quality contact into results. Over time, if xBA is a reliable reflection of a team’s underlying hitting quality, we would expect actual outcomes to begin aligning with those expectations. Teams underperforming their xBA may see an uptick in batting average and, consequently, run production. Conversely, teams overperforming may experience a decline in their runs as fewer batted balls result in hits over time. To explore whether these trends actually play out over time, we examined how average run production shifted after MLB’s All-Star Break (ASB), as shown in Figure 4.

Run Differences — **Figure 4**: Change in average runs scored per game for MLB teams after the 2024 All-Star Break (ASB) compared to before the break. Positive values (green bars) indicate teams that increased their run production, with the Chicago White Sox (+5.15) showing the largest improvement. Negative values (red bars) indicate teams that scored fewer runs on average after the break, with the Boston Red Sox (-4.33) experiencing the steepest decline. The figure highlights the five teams with the largest increases and decreases, illustrating which offenses surged or struggled to start the second half of the season.

A pattern emerges when comparing team-level batting average (BA) and expected batting average (xBA) to changes in run production after the All-Star Break (Figure 5). Among the largest movers, several teams that had been underperforming their xBA, such as CLE, CWS, KC, and HOU, saw significant increases in runs scored (Figure 4), consistent with the idea of positive regression. Conversely, some overperforming teams like BOS and SEA experienced declines in run production, aligning with expectations of regression toward the mean. However, the relationship was far from universal: teams like DET and BAL deviated from the expected pattern. This indicates that while xBA may highlight potential for regression, its predictive power appears strongest for certain teams at the extremes and less reliable across the league as a whole.
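The comparison described above can be sketched in a few lines. The team labels and numbers below are hypothetical placeholders, not values from our data, and `fits_regression_pattern` is an illustrative helper name.

```python
def ba_gap(ba, xba):
    """Actual minus expected batting average.
    Negative = underperforming xBA, positive = overperforming."""
    return ba - xba


def fits_regression_pattern(ba, xba, run_change):
    """True when the post-break change in runs moves in the direction
    regression would suggest: underperformers score more, overperformers less."""
    gap = ba_gap(ba, xba)
    if gap == 0 or run_change == 0:
        return False
    return (gap < 0) == (run_change > 0)


# Hypothetical teams: (BA, xBA, change in runs per game after the break).
teams = {
    "Team A": (0.240, 0.255, +0.6),   # underperformed, then scored more
    "Team B": (0.265, 0.250, -0.4),   # overperformed, then scored less
    "Team C": (0.245, 0.250, -0.3),   # underperformed, yet scored less
}

pattern = {name: fits_regression_pattern(*vals) for name, vals in teams.items()}
share_consistent = sum(pattern.values()) / len(pattern)
print(pattern, f"{share_consistent:.0%} consistent")
```

Counting the share of teams that fit the pattern is a simple way to check whether the relationship holds league-wide or only at the extremes.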

The Colorado Rockies (COL) appear to be overperforming their expected batting average (xBA), but this is likely a product of Coors Field, MLB’s most hitter-friendly park. The high altitude and thin air in Denver inflate offensive stats by reducing air resistance and allowing balls to travel farther. This “Coors Field Effect” regularly skews batting metrics, making Colorado’s apparent overperformance more a reflection of park conditions than unsustainable hitting. This highlights the importance of considering contextual factors, such as ballpark effects, strength of schedule, and injuries, when interpreting performance metrics. In future work, it would be valuable to incorporate such factors directly to add depth to the analysis.

Batting Average Overlaid — **Figure 5**: Relationship between changes in average runs scored after the All-Star Break and each team’s degree of overperformance or underperformance relative to their expected batting average (xBA). Green indicates teams whose run production has increased, while red indicates teams whose run production has declined.

Machine Learning for Over/Unders

Following the xBA and run production analysis, we were curious whether setting a clear scoring threshold could reveal additional predictive signals. We framed over/under outcomes in MLB games as a binary classification task– predicting whether a team would score more or fewer than 4.5 runs. This threshold closely mirrors the historical league average of 4.6 runs, providing a relevant and interpretable baseline for evaluation.

We selected a Random Forest algorithm for our classification model due to its strong performance and interpretability. Random Forests are ensemble models that reduce overfitting by aggregating the results of multiple decision trees, leading to more accurate predictions. Additionally, they offer clear insights into feature importance, allowing us to understand which variables most influence model outcomes.

Feature Engineering

We engineered two key features: the average number of runs a team has scored over its last five games (rolling_runs_5) and the team’s batting average over the same span (team_rolling_ba_5). These metrics serve as short-term performance indicators, capturing both a team’s ability to generate runs and its overall hitting effectiveness in recent matchups. By focusing on a five-game window, they provide a timely snapshot of offensive form that can reflect hot streaks, slumps, or the impact of recent roster changes.

Additionally, we incorporated several existing metrics into our model, including the opponent starting pitcher’s expected batting average (sp_x_ba), opponent starting pitcher’s earned run average (sp_era), the starting lineup’s average expected batting average (b_x_ba), and team slugging percentage (b_slg).
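A minimal sketch of how the two rolling features and the 4.5-run label could be built for one team is shown below. It makes two assumptions that the text does not specify: the five-game window uses only games played before the one being predicted, and the rolling batting average is total hits divided by total at-bats over the window. The input field names (`runs`, `hits`, `at_bats`) are hypothetical.

```python
from collections import deque


def add_rolling_features(games, window=5, line=4.5):
    """games: one team's games in date order, each a dict with
    'runs', 'hits', and 'at_bats'. Returns one row per game that
    has a full window of earlier games behind it."""
    recent = deque(maxlen=window)   # the previous `window` games only
    rows = []
    for game in games:
        if len(recent) == window:
            rows.append({
                "rolling_runs_5": sum(g["runs"] for g in recent) / window,
                "team_rolling_ba_5": sum(g["hits"] for g in recent)
                / sum(g["at_bats"] for g in recent),
                "over": int(game["runs"] > line),   # 1 = scored more than 4.5 runs
            })
        recent.append(game)   # add the current game only after building its row
    return rows


# Hypothetical six-game stretch.
games = [
    {"runs": r, "hits": h, "at_bats": ab}
    for r, h, ab in [(3, 8, 33), (5, 10, 35), (4, 9, 34),
                     (6, 11, 36), (2, 6, 32), (7, 12, 37)]
]
rows = add_rolling_features(games)
print(rows)
```

Appending the current game after its row is built keeps the target game's own result out of its features.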

Model Performance

To evaluate the effectiveness of our Random Forest classifier in predicting whether a team would score over or under 4.5 runs, we examined several performance metrics that offer complementary insights into model quality and reliability.

The Area Under the Receiver Operating Characteristic Curve (AUC) was 0.756, reflecting strong discriminative performance. This indicates that, when comparing a randomly selected game in which a team scored over 4.5 runs to one in which it did not, the model assigns a higher probability to the over outcome approximately 76% of the time. Because the outcome we are predicting can be influenced by a wide range of dynamic and interrelated variables, an AUC above 0.75 suggests that the model is capturing underlying patterns that differentiate high-scoring from low-scoring team performances. This performance is further illustrated by the ROC curve in Figure 6, which visualizes the model’s trade-off between sensitivity and specificity across all classification thresholds. The steep initial rise and bowed shape of the curve indicate effective separation between the two classes (teams that scored over 4.5 runs and those that did not), culminating in an AUC of 0.756. This reinforces the model’s reliability in identifying scoring patterns relevant to over/under predictions.

ROC Graph — **Figure 6**: ROC curve for the Random Forest model predicting whether a team scores over 4.5 runs. The curve illustrates the trade-off between sensitivity (true positive rate) and specificity (true negative rate). The model shows strong performance, with the curve rising well above the diagonal line.
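The pairwise interpretation of AUC given above can be computed directly, as in the sketch below. The scores and labels are hypothetical, and `pairwise_auc` is an illustrative helper rather than the routine used to produce Figure 6.

```python
def pairwise_auc(scores, labels):
    """Share of (over, under) game pairs in which the over game received
    the higher predicted probability; ties count as half."""
    overs = [s for s, y in zip(scores, labels) if y == 1]
    unders = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for o in overs:
        for u in unders:
            if o > u:
                wins += 1.0
            elif o == u:
                wins += 0.5
    return wins / (len(overs) * len(unders))


# Hypothetical predicted probabilities of "over" and true outcomes.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

auc = pairwise_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect ordering of over games ahead of under games.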

In addition to AUC, overall accuracy provides a more intuitive sense of model correctness. The model achieved an overall accuracy of 72.5%, correctly predicting the outcome in nearly three-quarters of all games. This far exceeds the No Information Rate of 55.8%, which reflects the accuracy one would achieve by always predicting the majority class (in this case teams scoring under 4.5 runs). The substantial lift over this baseline demonstrates that the model is capturing meaningful patterns in the data rather than simply echoing the most frequent outcome. This improvement is particularly noteworthy given the variability of baseball scoring, where small changes in game context, player performance, or even weather conditions can swing results.

We also evaluated sensitivity, which measures how well the model identifies games where a team scored over 4.5 runs. It achieved a rate of 76.6%, meaning that when a team did go over, the model predicted it correctly more than 75% of the time. For context, a bettor at -110 odds needs to win more than 52.4% of wagers just to break even, although sensitivity is not itself a betting win rate, which depends on how often the model’s over predictions turn out to be correct. The model’s specificity was 67.2%, indicating that it also performed well at identifying games that went under.

Finally, we considered Cohen’s Kappa, which was 0.44. Unlike raw accuracy, Kappa adjusts for the possibility of correct predictions occurring by chance, offering a more rigorous measure of model agreement. A value of 0.44 indicates moderate agreement beyond chance, reinforcing that the model captures real predictive signals. While not exceptionally high, this level of Kappa still demonstrates that the model performs meaningfully better than random guessing—an important outcome given the inherent variability when predicting team run totals.
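For readers less familiar with these metrics, the sketch below derives accuracy, the No Information Rate, sensitivity, specificity, and Cohen’s Kappa from a single confusion matrix. The counts are hypothetical and chosen for easy arithmetic; they are not our model’s confusion matrix.

```python
def classification_summary(tp, fn, fp, tn):
    """Metrics from a 2x2 confusion matrix (positive class = 'over')."""
    n = tp + fn + fp + tn
    actual_pos, actual_neg = tp + fn, fp + tn
    pred_pos, pred_neg = tp + fp, fn + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, given the row and column totals.
    expected = (actual_pos * pred_pos + actual_neg * pred_neg) / n ** 2
    return {
        "accuracy": accuracy,
        "no_information_rate": max(actual_pos, actual_neg) / n,
        "sensitivity": tp / actual_pos,
        "specificity": tn / actual_neg,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Hypothetical counts for 100 team-games.
summary = classification_summary(tp=40, fn=10, fp=15, tn=35)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Kappa rescales accuracy so that 0 means no better than chance agreement and 1 means perfect agreement.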

Together, these results show that our Random Forest model performs well across both interpretability and predictive accuracy dimensions, making it a valuable tool for forecasting team run production in MLB games.

Feature Importance

Important Variables for O/U ML — **Figure 7**: This figure illustrates the relative importance of each feature in predicting whether a team will score over 4.5 runs. Higher values indicate greater contribution to the model’s predictive accuracy.

The variable importance plot highlights the relative contribution of each feature to the model’s predictive accuracy. Among all inputs, the engineered feature rolling_runs_5 overwhelmingly emerged as the most influential. This feature, which captures a team’s recent scoring momentum, proved to be the strongest indicator of whether a team would surpass the 4.5 run threshold. Its high importance reinforces the value of short-term performance trends in forecasting offensive output.

Following this, team_rolling_ba_5, which represents the team’s batting average over the last five games, also ranked highly. This measure captures the overall quality of a team’s hitting during recent matchups, serving as a gauge of a team’s ability to consistently get hits. Traditional pitching and matchup-based metrics also contributed meaningfully. For instance, sp_x_ba and sp_era, which reflect the expected batting average and earned run average of the opposing starting pitcher, respectively, helped capture the quality of pitching faced by the offense. These features provide critical context for how difficult it might be for a team to score in a given game. Similarly, b_x_ba (expected batting average of the lineup) and b_slg (slugging percentage) contributed additional insight into the underlying hitting power and contact quality of the offensive side.

Overall, the combination of short-term offensive momentum, recent team batting metrics, and opposing pitcher quality formed the backbone of the model’s predictive framework. These findings suggest that blending engineered features with traditional matchup statistics enhances the model’s ability to predict over/under outcomes accurately.

Prop Betting Analysis for Batters

We then explored hitter-focused prop bets, specifically whether a batter would record at least one hit or go over/under 1.5 total bases in a given game. Total bases are calculated as one for a single, two for a double, three for a triple, and four for a home run. To hit the over, a player must total at least two bases– for example, by hitting a double, triple, home run, or combining multiple hits such as a single and a double.

Pitcher-Batter Preview

To gain an edge in predicting favorable prop outcomes, we began by analyzing game preview data for each day, including scheduled starting pitchers and historical batter-vs-pitcher matchup statistics.

The first area we explored was how specific batters had performed against certain pitchers in the past. In baseball, it is often said that some hitters “have a pitcher’s number,” meaning they see a pitcher very well, a phenomenon that may not always be captured by traditional stats. This can stem from a batter’s ability to pick up the ball exceptionally well out of a pitcher’s hand or from a strong sense of that pitcher’s tendencies. Thus, when a batter is able to consistently succeed against a particular pitcher, it can signal a meaningful matchup advantage. With that in mind, we prioritized these historical trends as a way to identify potentially favorable prop bet opportunities.

We analyzed discrepancies in run value per 100 pitches (RV/100), a metric measuring how each pitch changes a team’s run expectancy, with positive values favoring hitters and negative values favoring pitchers. To identify potential pitcher-hitter mismatches, we calculated the difference in RV/100 by comparing a batter’s performance against specific pitch types to the different pitches thrown by the scheduled opposing pitcher. This allowed us to estimate how well a batter might match up based on the types of pitches they were likely to face in a given game. In other words, we cross-referenced each batter’s strengths with a pitcher’s weaknesses to uncover matchups where a hitter may be especially well-suited to succeed. To ensure the matchup was meaningful, we also applied a minimum usage threshold, filtering out pitch types that a pitcher rarely throws.

Interactive Hitter Matchup App

To translate our findings into a practical, game-day resource, we developed an interactive dashboard designed to point out the most advantageous hitter matchups for each day’s slate of games. The tool identifies situations where a hitter has a statistical edge and allows users to examine supporting statistics, such as matchup histories or run-value differentials.

The top table displays advantageous batter–pitcher matchup histories, filtered to highlight hitters with the strongest track records. We included only hitters with a batting average of .300 or higher against the opposing pitcher– a mark generally considered very good in baseball, indicating the hitter records a hit in at least 30% of their past at-bats. To ensure meaningful results, we also required a minimum of four at-bats in the matchup to provide a more reliable sample size.

The table below highlights batters with a significant advantage based on run value against specific pitch types. It is sorted from highest to lowest run value difference, showcasing hitters who are most likely to succeed against the opposing pitcher’s arsenal. To qualify for this table, a batter must have a run value differential of at least 3 in their favor. If a batter appears in both this table and the matchup history table above, they are highlighted in yellow– indicating that both analyses suggest a strong likelihood of success in that day’s game.
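The two table filters and the yellow-highlight rule can be expressed compactly, as in the sketch below. The thresholds come from the text; the record layout, the field names, and the batter labels are hypothetical, and the run value differential is taken as a precomputed input.

```python
MIN_BA = 0.300       # minimum batting average vs. the opposing pitcher
MIN_AB = 4           # minimum at-bats in the matchup
MIN_RV_DIFF = 3.0    # minimum run value differential in the batter's favor


def history_table(matchups):
    """Batters with a strong track record against today's pitcher."""
    return [m for m in matchups
            if m["at_bats"] >= MIN_AB and m["hits"] / m["at_bats"] >= MIN_BA]


def run_value_table(rows):
    """Batters with a large run value edge, sorted from highest to lowest."""
    keep = [r for r in rows if r["rv_diff"] >= MIN_RV_DIFF]
    return sorted(keep, key=lambda r: r["rv_diff"], reverse=True)


def highlighted(history, rv_rows):
    """Batters appearing in both tables (shown in yellow on the dashboard)."""
    return {m["batter"] for m in history} & {r["batter"] for r in rv_rows}


# Hypothetical slate.
matchups = [
    {"batter": "Hitter A", "hits": 5, "at_bats": 12},
    {"batter": "Hitter B", "hits": 2, "at_bats": 3},    # too few at-bats
    {"batter": "Hitter C", "hits": 1, "at_bats": 10},   # average too low
]
rv_rows = [
    {"batter": "Hitter A", "rv_diff": 4.2},
    {"batter": "Hitter D", "rv_diff": 5.1},
    {"batter": "Hitter C", "rv_diff": 1.0},             # edge too small
]

top = history_table(matchups)
bottom = run_value_table(rv_rows)
both = highlighted(top, bottom)
print([m["batter"] for m in top], [r["batter"] for r in bottom], both)
```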

Users can sort the dashboard by date, with the default view automatically displaying matchups with supporting stats from the selected day’s slate of games. If a previous date is selected, historical matchups and their corresponding results will appear. A color-coded key is provided below the date selector to help interpret the meaning of each highlighted row.

At the bottom of the dashboard is a simulated betting analysis. The prop for a player to record a hit is typically priced around -200 odds, meaning a $200 wager would win $100 in profit. For the purpose of our analysis, we assume a $10 wager per player. A successful bet yields a $5 profit, while a loss forfeits the $10 stake. For total bases (2+), where odds are usually closer to +100, we again simulate $10 wagers. In this case, a win returns $10 in profit, while a loss results in a $10 loss. The dashboard aggregates daily results based on these assumptions and displays the cumulative outcomes for easy tracking.
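The payout arithmetic behind these assumptions, along with the break-even win rate implied by a given price, can be sketched as follows. The function names are illustrative, and the sample results list is hypothetical.

```python
def profit_if_win(stake, american_odds):
    """Profit on a winning bet at American odds (e.g. -200 or +100)."""
    if american_odds < 0:
        return stake * 100 / -american_odds
    return stake * american_odds / 100


def break_even_rate(american_odds):
    """Win rate at which expected profit is zero, ignoring pushes."""
    win = profit_if_win(1.0, american_odds)
    return 1.0 / (1.0 + win)


def simulate(results, stake, american_odds):
    """Cumulative profit for a list of True/False bet outcomes."""
    win = profit_if_win(stake, american_odds)
    return sum(win if hit else -stake for hit in results)


hit_prop_profit = profit_if_win(10, -200)          # the $10 hit prop at -200
total_bases_profit = profit_if_win(10, 100)        # the $10 total bases prop at +100
sample_total = simulate([True, True, False], 10, -200)

print(f"-110 break-even: {break_even_rate(-110):.1%}")
print(f"-200 break-even: {break_even_rate(-200):.1%}")
print(hit_prop_profit, total_bases_profit, sample_total)
```

At -200, two wins are needed to offset each loss, so the hit prop must cash roughly two times out of three before it turns a profit.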

Figure 8: Interactive dashboard showing advantageous MLB batter props, with hypothetical results displayed as well.

Machine Learning for Individual Player Props

While the interactive matchup analysis provided valuable insights into daily opportunities through pitcher-batter historical data and pitch-type analysis, we recognized the need for a scalable, data-driven approach that could systematically evaluate large volumes of prop opportunities. Rather than integrating these approaches directly—which would have been ideal but was not feasible due to the higher cost of real-time odds data required for such integration—we developed a complementary machine learning pipeline that could operate independently while addressing similar prop betting markets. This ML system focused on predicting specific binary outcomes corresponding to common betting markets: whether a player would record at least one hit, at least two hits, and at least two total bases in a game.

Model Development and Methodology

We framed individual player prop prediction as a binary classification problem, then applied probability calibration to ensure outputs would be suitable for betting applications. While classification accuracy optimizes prediction correctness, calibration adjusts probability estimates to align with observed outcome frequencies, which is essential for comparing model predictions against betting market prices. This approach follows research by Walsh & Joshi (2024) showing that calibration-optimized models achieve better ROI in sports betting applications.
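To illustrate what calibration does, the sketch below implements a simplified isotonic regression using the pool-adjacent-violators idea: raw scores are sorted, and neighboring groups are merged until the observed hit rate never decreases as the score increases. This is a teaching sketch under stated assumptions (distinct scores, step-function lookup, hypothetical data), not the calibration code used in our pipeline.

```python
def fit_isotonic(scores, outcomes):
    """Return a list of (upper_score_bound, calibrated_probability) blocks
    whose probabilities are non-decreasing in the score."""
    blocks = []   # each block: [sum of outcomes, count, highest score]
    for score, y in sorted(zip(scores, outcomes)):
        blocks.append([float(y), 1, score])
        # Merge backwards while an earlier block has a higher mean outcome.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def calibrate(model, score):
    """Look up the calibrated probability for a new raw score."""
    for upper, prob in model:
        if score <= upper:
            return prob
    return model[-1][1]   # scores above the training range use the last block


# Hypothetical raw model scores and whether the player actually got a hit.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
got_hit = [0, 1, 0, 0, 1, 1]

model = fit_isotonic(raw_scores, got_hit)
print(model, calibrate(model, 0.25))
```

The calibrated output for a score is the observed frequency within its block, which is what makes it comparable to a market-implied probability.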

Our feature engineering used only information available before game time to avoid data leakage and artificially inflated model performance. We incorporated two types of player metrics: seasonal performance rates and volume-based historical patterns. Seasonal rates included batting average (hits per at-bat), on-base percentage, and slugging percentage, which measure hitting efficiency and quality. Volume-based features like hits per game and home runs per game captured typical production levels. Experience indicators such as games played and total plate appearances (individual batting opportunities) captured player usage patterns.

The dataset comprised 19,587 player-game records with strict temporal validation: training on April-May 2025 data (12,000 records), validation on June 2025 (4,000 records), and testing on July 2025 (3,587 records). This temporal structure prevents the model from seeing future information during training, simulating realistic betting scenarios where only past performance can be used to inform future betting decisions. We implemented Random Forest, XGBoost, and ensemble methods with isotonic regression calibration applied to all base models to optimize probability estimates for betting applications.
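A temporal split of this kind can be sketched as below. The record layout and the `game_date` field name are hypothetical; only the month boundaries come from the text.

```python
from datetime import date


def temporal_split(records):
    """Assign each 2025 player-game record to a split by calendar month:
    April-May -> train, June -> validation, July -> test."""
    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        game_date = record["game_date"]
        if game_date.year != 2025:
            continue
        if game_date.month in (4, 5):
            splits["train"].append(record)
        elif game_date.month == 6:
            splits["validation"].append(record)
        elif game_date.month == 7:
            splits["test"].append(record)
    return splits


# Hypothetical player-game records.
records = [
    {"player": "Player A", "game_date": date(2025, 4, 15)},
    {"player": "Player A", "game_date": date(2025, 5, 30)},
    {"player": "Player B", "game_date": date(2025, 6, 10)},
    {"player": "Player B", "game_date": date(2025, 7, 4)},
]
splits = temporal_split(records)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting by date rather than at random means every validation and test game occurs after every training game.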

Model Performance and Validation

Before testing market profitability, we validated model performance on historical MLB outcomes using standard machine learning metrics. This sequential approach follows academic best practices: establish predictive capability on known outcomes before attempting market-based validation where multiple factors beyond model accuracy affect results.

The Random Forest model achieved 0.549 test AUC, closely aligning with an established academic baseline of 0.552 AUC (Elfrink, 2018) for MLB prediction tasks. This alignment demonstrates realistic performance expectations rather than artificially inflated metrics which are common in sports betting literature. Feature importance analysis revealed seasonal performance rates provided the primary predictive signal, with batting average, OPS, and wOBA ranking as top contributors. The absence of same-game features from importance rankings confirms successful data leakage prevention, as shown in Figure 9.

Model Feature Importance — **Figure 9**: Feature importance analysis aggregated across all models and prediction targets shows that seasonal performance metrics dominate predictive capability. Hit/Game and Batting Average emerge as the strongest predictors, with traditional sabermetric statistics (Slg, OPS, wOBA) providing additional signal. The absence of same-game features confirms successful data leakage prevention.

Integration with Real Sportsbook Odds

Having validated model performance on historical outcomes, we integrated predictions with historical odds data from nine major sportsbooks including DraftKings, FanDuel, BetMGM, and Caesars. This integration successfully matched 86.5% of our model predictions with corresponding market odds across 1.2 million player prop records. Complete integration was not possible because some players did not have odds posted for a given day or market.

Our value detection algorithm identified betting opportunities where our generated model probabilities exceeded best available odds probabilities by at least 5%. This edge threshold ensures meaningful statistical advantage while accounting for the built-in profit margins that sportsbooks incorporate into their pricing.
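A simplified version of this value check is sketched below. It assumes the 5% edge is measured as an absolute difference in probability and that implied probabilities are taken directly from the posted price without removing the sportsbook margin; neither detail is specified above. The book labels and prices are hypothetical.

```python
def implied_probability(american_odds):
    """Break-even probability implied by American odds."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def best_price(odds_by_book):
    """The book offering the highest payout, i.e. the lowest implied probability."""
    return min(odds_by_book.items(), key=lambda kv: implied_probability(kv[1]))


def is_value_bet(model_prob, american_odds, min_edge=0.05):
    """True when the model probability beats the implied probability by min_edge."""
    return model_prob - implied_probability(american_odds) >= min_edge


# Hypothetical prices for one player prop.
odds_by_book = {"Book A": 140, "Book B": 150, "Book C": 125}
book, odds = best_price(odds_by_book)

print(book, odds, f"{implied_probability(odds):.3f}")
print(is_value_bet(0.46, odds), is_value_bet(0.43, odds))
```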

Position sizing employed Kelly criterion methodology, which calculates optimal bet size based on the magnitude of the statistical edge and odds offered to maximize long-term growth while controlling risk. We capped all positions at 5% of a $10,000 simulated bankroll to prevent excessive risk from model estimation errors.
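The standard Kelly formula with a cap can be sketched as follows. The sketch assumes full Kelly sizing and a cap applied to a fixed $10,000 bankroll; whether the backtest used a fractional Kelly multiplier or a running bankroll is not stated above, so treat those choices as assumptions.

```python
def kelly_fraction(win_prob, american_odds):
    """Kelly fraction f = (b*p - q) / b, where b is profit per $1 staked,
    p the win probability, and q = 1 - p. Never negative."""
    b = american_odds / 100 if american_odds > 0 else 100 / -american_odds
    fraction = (b * win_prob - (1 - win_prob)) / b
    return max(fraction, 0.0)


def stake(win_prob, american_odds, bankroll=10_000, cap=0.05):
    """Dollar stake: Kelly fraction of the bankroll, capped at `cap`."""
    return bankroll * min(kelly_fraction(win_prob, american_odds), cap)


large_edge = stake(0.46, 150)   # Kelly suggests about 10%, so the 5% cap binds
small_edge = stake(0.41, 150)   # Kelly suggests under 2%, so the cap is not reached
no_edge = stake(0.30, 150)      # negative edge, so no bet is placed

print(f"{large_edge:.2f} {small_edge:.2f} {no_edge:.2f}")
```

The cap matters most when the model is overconfident, since an inflated probability estimate translates directly into an oversized Kelly stake.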

Financial Performance and Market Testing

Our backtesting analysis covered the period through July 12, 2025, stopping before the All-Star Game. Through this period, our model identified 342 value betting opportunities, generating $2,579.21 in profit on $132,518.08 wagered for a 1.95% ROI. The system achieved a 41.5% win rate with a mean return per bet of $7.54 and a standard deviation of $460.80, reflecting the high variance inherent in individual bet outcomes. Figure 10 illustrates the volatile profit trajectory characteristic of sports betting applications.

Test Period Performance — **Figure 10**: Cumulative profit performance across 342 value betting opportunities through July 12, 2025. The high volatility pattern illustrates the inherent variance in sports betting outcomes, with significant drawdown periods followed by profitable streaks. Final profit of $2,579.21 represents 1.95% ROI despite substantial intermediate fluctuations.

Given this high variance environment, we conducted comprehensive statistical validation to assess whether our results represent genuine predictive capability and profitable edge or could have occurred by chance. Bootstrap analysis—repeatedly sampling our betting results with replacement—with 10,000 resamples generated a 95% confidence interval of [-$40.92, $56.78] for average profit per bet. This wide interval, spanning both positive and negative values, reflects the substantial uncertainty inherent in our relatively small sample size.
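A percentile bootstrap of this kind can be sketched with the standard library alone. The profit values below are hypothetical stand-ins for per-bet results, and the sketch uses fewer resamples than the 10,000 described above to keep it quick to run.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean profit per bet."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the bets with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-bet profits in dollars: wins are positive, losses negative.
profits = [150.0, -100.0] * 20 + [220.0, -100.0, -100.0] * 10

ci_low, ci_high = bootstrap_mean_ci(profits, n_resamples=2_000)
print(f"mean = {statistics.fmean(profits):.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

An interval that includes zero means the data are compatible with no edge at all.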

Our effect size analysis revealed a standardized effect of 0.016 (Cohen’s d), indicating a very small magnitude that falls well below conventional thresholds for meaningful effects in other fields. However, even a small edge in sports betting can be very profitable over time, assuming this edge is true and did not occur by chance. Permutation testing, which randomly shuffles our profit/loss results to simulate the null hypothesis of no predictive edge, yielded a p-value of 0.765. This result fails to reach traditional significance levels (p < 0.05), indicating we cannot confidently rule out that our profits occurred by random chance rather than systematic edge.
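The effect size can be reproduced from the summary statistics reported earlier (342 bets, mean return of $7.54, standard deviation of $460.80). The standard error and z-score lines are an additional normal-approximation check that we include for illustration; they are not the permutation test itself.

```python
import math

# Summary statistics reported for the backtest.
n_bets = 342
mean_profit = 7.54    # dollars per bet
sd_profit = 460.80    # dollars per bet

# Cohen's d for a one-sample comparison against zero profit.
cohens_d = mean_profit / sd_profit

# Normal-approximation check: how many standard errors is the mean from zero?
standard_error = sd_profit / math.sqrt(n_bets)
z_score = mean_profit / standard_error

print(f"d = {cohens_d:.3f}, SE = {standard_error:.2f}, z = {z_score:.2f}")
```

A mean that sits well under one standard error from zero is in line with the non-significant permutation result.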

Power analysis suggests our 342-bet sample achieved approximately 62% statistical power, substantially below standards recommended for reliable conclusions. Based on our observed effect size and variance, we would aim for over 1000 bets to achieve adequate statistical power for definitive validation. These findings highlight a fundamental challenge in sports betting research: the high variance environment demands much larger sample sizes than typical analytical applications to distinguish consistent edges from random variation.

Market Efficiency and Sportsbook Performance

Our analysis revealed significant profit variations across both prop types and individual sportsbooks. Different betting markets showed substantially different results:

hits_over_1.5 (at least 2 hits): 18.2% ROI (112 bets)
hits_over_0.5 (at least 1 hit): -0.5% ROI (178 bets)
total_bases_over_1.5 (at least 2 total bases): -6.9% ROI (52 bets)

This segmented performance pattern across prop types is noteworthy given that our underlying ML model achieved similar predictive accuracy across all three target markets when validated against actual game outcomes. The dramatic differences in betting profitability, despite comparable model performance, hint at varying levels of pricing efficiency across these markets. This suggests certain prop markets may have more opportunity for profit than others, potentially due to differences in betting volume, public attention, or line pricing errors.

Performance also varied substantially across individual sportsbooks. The most profitable books included Fanatics ($7,252 profit, 72 bets), DraftKings ($2,121 profit, 14 bets), and FanDuel ($1,032 profit, 82 bets).

Practical Implementation and Future Directions

The machine learning approach offers a scalable way to identify props with favorable expected value (EV), complementing the situational matchup analysis found in the dashboard. While manual approaches excel at identifying specific advantages through contextual factors, systematic ML evaluation provides consistent statistical criteria across hundreds of daily opportunities, suggesting promising directions for future research integrating algorithmic and human expertise in sports betting applications.

Conclusions

The following findings summarize the most important insights from our analyses. They are organized into two categories, Over/Under Runs Bets and Prop Bets, to distinguish between team-level and player-level perspectives.

Over/Under Runs Bets

The first key finding is that bullpen performance is a major driver of scoring, with relief pitcher metrics, particularly ERA and innings pitched, proving to be strong predictors of team run totals, often surpassing the predictive value of starter statistics. This is intuitive: when a starting pitcher struggles, they are removed early, increasing reliance on the bullpen and creating opportunities to exploit fatigue or depth mismatches. In future work, incorporating contextual bullpen data—such as relievers’ recent usage or rest days—could further enhance predictive power.

Next, we found that expected statistics can signal potential shifts in scoring. Differences between a team’s actual batting average and its expected batting average (xBA) often serve as early indicators of offensive changes. Teams outperforming their xBA may be benefiting from favorable luck or other unsustainable factors, suggesting a potential decline in future production. Conversely, teams underperforming their xBA may be making quality contact without seeing results, indicating a likely increase in their number of hits, and thus run production, as performance normalizes. However, these signals should always be considered in context, accounting for factors such as ballpark effects, strength of schedule, and injuries. Tracking these gaps offers a valuable tool for anticipating both upward and downward scoring trends.

Lastly, recent team performance emerged as a significant factor when evaluating the over/under metric. In particular, trends in run production and hit totals over the last five games provide valuable context for predicting future scoring outcomes. A team averaging a high number of runs and hits in this short span may be in the midst of an offensive surge, benefiting from hot hitters, favorable matchups, or advantageous ballpark conditions. Conversely, a recent slump in both categories can signal an offense struggling to generate and capitalize on consistent scoring opportunities. Incorporating these short-term trends alongside season-long metrics allows for a more context-driven assessment of a team’s current scoring potential.

Prop Bets

In our prop bet analysis, we examined two key outcomes: whether a player would record at least one hit and whether they would surpass 1.5 total bases. To approach this from multiple angles, we combined domain knowledge with statistical modeling– evaluating matchup history as a potential predictor while also incorporating run value, a more advanced metric of offensive contribution. The findings from these analyses are outlined below.

Hit Prop Bets: A $10 wager on every player featured in the dashboard to record at least one hit would have yielded a total profit of $70.
Total Bases Prop Bets: A $10 wager on every player featured in the dashboard to exceed 1.5 total bases would have resulted in a total loss of $120.

The dashboard appeared to predict the hits prop bets reasonably well but was less effective in forecasting the over 1.5 total bases outcome. Thus, incorporating slugging percentage (SLG) into our over 1.5 total bases prop could be highly beneficial. SLG is calculated by dividing a player’s total bases by their at-bats, with singles counting as one base, doubles as two, triples as three, and home runs as four. Unlike batting average, which values all hits equally, SLG places greater weight on extra-base hits, making it a stronger indicator of a player’s potential to exceed 1.5 total bases in a single game.
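The SLG calculation described above is simple enough to sketch directly. The season line used here is hypothetical and serves only to show how SLG and batting average diverge for a hitter with extra-base power.

```python
def total_bases(singles, doubles, triples, home_runs):
    """One base per single, two per double, three per triple, four per home run."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def slugging(singles, doubles, triples, home_runs, at_bats):
    """Slugging percentage: total bases divided by at-bats."""
    return total_bases(singles, doubles, triples, home_runs) / at_bats


def batting_average(singles, doubles, triples, home_runs, at_bats):
    """Batting average: every hit counts the same."""
    return (singles + doubles + triples + home_runs) / at_bats


# Hypothetical season line: 90 singles, 25 doubles, 3 triples, 20 home runs in 500 at-bats.
season = dict(singles=90, doubles=25, triples=3, home_runs=20, at_bats=500)

slg = slugging(**season)
avg = batting_average(**season)
print(f"BA = {avg:.3f}, SLG = {slg:.3f}")

# A single game with one double clears the 1.5 total bases line; one single does not.
double_game = total_bases(0, 1, 0, 0) > 1.5
single_game = total_bases(1, 0, 0, 0) > 1.5
```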

Incorporating additional context could improve overall accuracy for both props as well. For example, a player’s position in the batting order can be important– batting first versus ninth significantly impacts the number of plate appearances, with higher spots in the lineup typically guaranteeing more opportunities. Home and away status also matters: an away team is assured nine full innings of offense, while a home team leading after the top of the ninth will not bat in the bottom half, losing a potential at-bat. In betting, every opportunity matters, and maximizing them can improve results.

Ballpark-specific performance is another valuable consideration. Some hitters excel in certain parks due to their dimensions; for instance, Fenway Park’s short distance and tall wall in left field can favor hitters who frequently pull the ball to that side. Conversely, some players may struggle in certain stadiums if background elements make it harder to track pitches. Considering these situational factors, along with player statistics at specific venues, the number of variables we could account for is virtually endless. It ranges from the home plate umpire’s strike zone tendencies to weather conditions such as wind speed, temperature, and humidity– all of which can influence offensive performance. Factoring in any combination of these elements could further refine and enhance the dashboard’s predictive power in future versions.

Our systematic approach achieved a positive return, generating a 1.95% ROI ($2,579.21 profit on $132,518.08 wagered) across 342 betting opportunities. This suggests that data-driven approaches can identify profitable opportunities in MLB player prop markets.

The analysis revealed dramatic profitability differences across prop types, with hits_over_1.5 bets generating an 18.2% ROI, while hits_over_0.5 and total_bases_over_1.5 markets were near breakeven. These results suggest that certain player prop markets are less efficiently priced.
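
The ROI figures above are profit divided by the amount wagered, computed overall and then per prop type. The sketch below reproduces the overall figure from the reported totals and shows one way a per-market breakdown could be aggregated. The `(market, stake, profit)` record layout and the sample bets are hypothetical, chosen only for illustration; they are not the study's bet log.

```python
from collections import defaultdict


def roi(profit: float, wagered: float) -> float:
    """Return on investment as a fraction of the amount wagered."""
    return profit / wagered if wagered else 0.0


def roi_by_market(bets):
    """Aggregate (market, stake, profit) records into ROI per market."""
    staked = defaultdict(float)
    profit = defaultdict(float)
    for market, stake, pnl in bets:
        staked[market] += stake
        profit[market] += pnl
    return {m: roi(profit[m], staked[m]) for m in staked}


# Overall figure reported above: $2,579.21 profit on $132,518.08 wagered.
overall = roi(2579.21, 132518.08)
print(f"{overall:.2%}")  # 1.95%

# Hypothetical bet records, for illustration only.
sample_bets = [
    ("hits_over_0.5", 100.0, 80.0),
    ("hits_over_0.5", 100.0, -100.0),
    ("hits_over_1.5", 100.0, 150.0),
    ("hits_over_1.5", 100.0, -100.0),
]
per_market = roi_by_market(sample_bets)
print(per_market)
```

Weighting by stake rather than averaging per-bet returns keeps the per-market numbers consistent with the overall ROI when bet sizes differ.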

Despite these positive returns, comprehensive statistical testing, including a bootstrap analysis with a p-value of 0.765, could not rule out that the profits occurred by chance. This underscores the substantial sample sizes required for reliable validation in high-variance sports betting environments.
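
To illustrate why a positive ROI can still fail a significance test, the sketch below implements one common bootstrap variant: a one-sided test of whether the mean per-bet return exceeds zero. This is an assumption about the general technique, not a reproduction of the exact procedure behind the reported p-value; the function name, resample count, and sample returns are all hypothetical.

```python
import random


def bootstrap_p_value(returns, n_resamples=10_000, seed=0):
    """One-sided bootstrap p-value for H0: mean per-bet return <= 0.

    The returns are shifted to have mean zero (imposing the null), then
    resampled with replacement; the p-value is the share of resampled
    means at least as large as the observed mean.
    """
    rng = random.Random(seed)
    n = len(returns)
    observed = sum(returns) / n
    centered = [r - observed for r in returns]
    hits = 0
    for _ in range(n_resamples):
        sample = rng.choices(centered, k=n)
        if sum(sample) / n >= observed:
            hits += 1
    return hits / n_resamples


# Hypothetical per-bet returns per $1 staked: a win at -110 pays about
# +0.909, a loss costs -1.0. Illustrative only, not the study's bets.
sample_returns = [0.909] * 55 + [-1.0] * 45
observed_mean = sum(sample_returns) / len(sample_returns)
p = bootstrap_p_value(sample_returns, n_resamples=2_000)
print(f"observed mean = {observed_mean:.3f}, p = {p:.3f}")
```

Even a 55% win rate over 100 hypothetical bets leaves a large p-value here, which mirrors the point that high-variance outcomes need far more bets before an edge can be distinguished from luck.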

Implications

Our research demonstrates that analytical approaches to MLB betting market analysis can identify potential value opportunities across both team-level and individual player markets. The combination of team scoring models, individual prop prediction, and interactive matchup analysis creates a comprehensive framework for market evaluation that addresses multiple betting contexts within professional baseball.

The hybrid methodology successfully bridges traditional statistical analysis with modern machine learning techniques while maintaining academic rigor through proper validation and data leakage prevention. This approach provides a replicable framework for sports betting research that prioritizes methodological soundness over inflated performance claims, contributing to more reliable research standards in this emerging field.

The solid performance achieved by our models, clustering around established academic baselines, is consistent with market efficiency research suggesting that sustainable edges in professional sports betting are necessarily modest. Rather than claiming exceptional results, our work establishes realistic expectations for legitimate MLB prediction models and provides tools for comprehensive market analysis.

References

Galekwa, René Manassé, and colleagues. 2024. “A Systematic Review of Machine Learning in Sports Betting: Techniques, Challenges, and Future Directions.” arXiv Preprint arXiv:2410.21484. https://arxiv.org/html/2410.21484v1.

Huang, Mei-Ling, and Yi-Zhen Li. 2021. “Use of Machine Learning and Deep Learning to Predict the Outcomes of Major League Baseball Matches.” Applied Sciences 11 (10): 4499. https://doi.org/10.3390/app11104499.

Hubáček, Ondřej, Gustav Šourek, and Filip Železný. 2019. “Exploiting Sports-Betting Market Using Machine Learning.” Applied Soft Computing 83: 105654. https://doi.org/10.1016/j.asoc.2019.105654.

Li, Shuo-Fang, Mei-Ling Huang, and Yi-Zhen Li. 2022. “Exploring and Selecting Features to Predict the Next Outcomes of MLB Games.” Entropy 24 (2): 264. https://doi.org/10.3390/e24020264.

Naomi. 2023. “The Sportsbook Edge: Understanding How the House Always Wins.” https://allthepicks.com/betting-edge/.

Vandenbruaene, Jonas, Jan Annaert, and Marc De Ceuster. 2022. “Efficient Spread Betting Markets: A Literature Review.” Journal of Sports Economics 23 (4): 442–79. https://doi.org/10.1177/15270025211071042.

Walsh, Sean, and Mayukh Joshi. 2024. “Machine Learning for Sports Betting: Should Model Selection Be Based on Accuracy or Calibration?” Finance Research Letters 60: 104814. https://doi.org/10.1016/j.frl.2024.104814.

Appendix

The following appendix provides supporting materials that complement the main analysis.

Data Dictionaries

batter_pitches

This table contains underlying aggregated statistical data, which will not show up in the typical box score. (Baseball Savant)

Variable	Name	Description
`batter_id`	Batter ID	Unique identifier for the batter.
`pitch_name`	Pitch Type	Name of the pitch type (e.g., fastball, slider).
`rv_100`	Run Value / 100 Pitches	Average run value per 100 pitches seen. Measures how well batters hit each pitch.
`rv`	Run Value	Total run value assigned to different pitches the batter has seen.
`pitches`	Pitches	Total number of pitches thrown or seen.
`pitch_usage`	Pitch Usage	Percentage of time a specific pitch type is used.
`pa`	Plate Appearances	Number of completed batting appearances, including walks and hit-by-pitches.
`ba`	Batting Average	Hits divided by at-bats.
`x_ba`	Expected Batting Average	Likelihood a batted ball becomes a hit, based on exit velocity and launch angle.
`slg`	Slugging Percentage	Total bases divided by at-bats; measures power.
`x_slg`	Expected Slugging	Predicted slugging based on batted ball quality.
`woba`	Weighted On-Base Average	Gives more weight to extra-base hits than OBP (On Base Percentage).
`x_woba`	Expected wOBA	Predicted wOBA using quality-of-contact data.
`whiff_percent`	Whiff Percentage	Percentage of swings that result in misses.
`k_percent`	Strikeout Percentage	Strikeouts divided by plate appearances.
`put_away`	Put Away Percentage	2-strike pitches resulting in strikeouts divided by total 2-strike pitches.
`hard_hit`	Hard Hit Percentage	Percentage of batted balls hit at 95+ mph.

batter_statcast

This table contains aggregated Statcast metrics for batters across different pitch types, including expected and actual performance differentials across batting average, slugging, and wOBA. (Baseball Savant)

Variable	Name	Description
`batter_id`	Batter ID	Unique identifier for the batter.
`season`	Season	Year the statistics are from.
`pa`	Plate Appearances	Number of completed batting appearances, including walks and hit-by-pitches.
`bip`	Balls In Play	Number of batted balls put into play (excluding strikeouts, walks, etc.).
`ba`	Batting Average	Hits divided by at-bats.
`x_ba`	Expected Batting Average	Likelihood a batted ball becomes a hit, based on exit velocity and launch angle.
`diff_ba`	BA - xBA	Difference between actual and expected batting average.
`slg`	Slugging Percentage	Total bases divided by at-bats; measures power.
`x_slg`	Expected Slugging	Predicted slugging based on batted ball quality.
`diff_slg`	SLG - xSLG	Difference between actual and expected slugging percentage.
`woba`	Weighted On-Base Average	Gives more weight to extra-base hits than OBP.
`x_woba`	Expected wOBA	Predicted wOBA using quality-of-contact data.
`diff_woba`	wOBA - xwOBA	Difference between actual and expected wOBA.
`ops`	OPS	On-base Plus Slugging — combines on-base % and slugging % to summarize hitter performance.
`wrc_plus`	wRC+	Weighted Runs Created Plus — measures a player’s total offensive value, adjusted for park and league. 100 is league average.
`home_runs`	Home Runs	Number of home runs hit by the player.

batter_stats

This table captures traditional game-level statistics for each batter, derived from box score data. (statsapi)

Variable	Name	Description
`game_id`	Game ID	Unique identifier for the game.
`batter_id`	Batter ID	Unique identifier for the batter.
`team_id`	Team ID	Identifier for the team the batter played for in the game.
`position`	Position	Fielding position the batter played during the game.
`ab`	At-Bats	Number of official at-bats (excludes walks, HBP, sacrifices).
`h`	Hits	Number of hits.
`bb`	Walks (BB)	Number of base on balls (walks).
`r`	Runs	Number of runs scored.
`rbi`	Runs Batted In	Number of runs batted in.
`so`	Strikeouts	Number of times the batter struck out.
`double`	Doubles	Number of doubles hit.
`triple`	Triples	Number of triples hit.
`hr`	Home Runs	Number of home runs hit.
`sb`	Stolen Bases	Number of bases stolen during the game.

previews

This table stores pre-game matchup context, such as historical batter-vs-pitcher performance, scraped from MLB.com preview pages. (Selenium)

Variable	Name	Description
`game_date`	Game Date	Date of the game.
`batter_name`	Batter Name	Name of the batter (no ID available in this table).
`pitcher_name`	Pitcher Name	Name of the opposing pitcher.
`hr`	Home Runs	Total number of home runs prior to the game against the projected pitcher.
`rbi`	Runs Batted In	Total number of RBIs prior to the game against the projected pitcher.
`ab`	At-Bats	Number of official at-bats prior to the game against the projected pitcher.
`avg`	Batting Average	Batting average prior to the game against the projected pitcher.
`ops`	On-Base Plus Slugging	OPS (OBP + SLG) prior to the game against the projected pitcher.

players

This table contains player metadata, including full names and unique identifiers used to join across tables. (statsapi)

Variable	Name	Description
`player_id`	Player ID	Unique identifier for the player.
`player_name`	Player Name	Full name of the player.
`bats`	Bats	Which side they hit from (Left or Right).
`throws`	Throws	Which side they pitch from (Left or Right).
`birth_date`	Birth Date	Date the player was born.

sides

This table maps each team to its role in a given game (home or away), used to differentiate sides during analysis. (statsapi)

Variable	Name	Description
`game_id`	Game ID	Identifier for the game.
`team_id`	Team ID	Identifier for the team.
`side`	Side	Side of the team in the game (`home` or `away`).

teams

This table contains metadata about each team, including their unique identifier and full name. (statsapi)

Variable	Name	Description
`team_id`	Team ID	Unique identifier for the team.
`team_name`	Team Name	Full name of the team.

games

This table holds game-level information such as date and final scores for both home and away teams. (statsapi)

Variable	Name	Description
`game_id`	Game ID	Unique identifier for the game.
`date`	Date	Date the game was played.
`home_team_id`	Home Team ID	Identifier for the home team.
`home_score`	Home Team Score	Final score of the home team.
`away_team_id`	Away Team ID	Identifier for the away team.
`away_score`	Away Team Score	Final score of the away team.

pitches

This table contains pitch-level Statcast data used for machine learning feature engineering, derived from individual pitch outcomes. (pybaseball)

Variable	Name	Description
`game_id`	Game ID	Identifier linking to the games table.
`batter_id`	Batter ID	Unique identifier for batters.
`pitcher_id`	Pitcher ID	Unique identifier for pitchers.
`events`	Pitch Outcome	Result of the plate appearance (single, home_run, strikeout, etc.).
`launch_speed`	Exit Velocity	Speed of batted ball off the bat in mph.
`launch_angle`	Launch Angle	Vertical angle of batted ball trajectory in degrees.
`release_speed`	Pitch Velocity	Speed of pitch when released in mph.
`pitch_type`	Pitch Type Code	Abbreviated pitch type (FF=four-seam fastball, SL=slider, etc.).
`woba_value`	Weighted On-Base Value	wOBA weight assigned to the plate appearance outcome.
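
As a sketch of how pitch-level `events` could be turned into prop outcomes for feature engineering, the function below maps one batter's plate-appearance results in a game to hits, total bases, and over/under labels. It assumes the Statcast event strings `single`, `double`, `triple`, and `home_run`, and that rows with no event (pitches that did not end a plate appearance) have already been dropped. The function name and the sample game are hypothetical.

```python
# Bases credited for each hit type in the Statcast `events` column.
BASES_PER_EVENT = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def game_prop_labels(events):
    """Summarize one batter's game from a list of plate-appearance events."""
    hits = sum(1 for e in events if e in BASES_PER_EVENT)
    total_bases = sum(BASES_PER_EVENT.get(e, 0) for e in events)
    return {
        "hits": hits,
        "total_bases": total_bases,
        "hits_over_0.5": hits > 0.5,
        "hits_over_1.5": hits > 1.5,
        "total_bases_over_1.5": total_bases > 1.5,
    }


# Hypothetical game: strikeout, double, walk, field out.
labels = game_prop_labels(["strikeout", "double", "walk", "field_out"])
print(labels)
```

A single double clears the over 1.5 total bases line without clearing over 1.5 hits, which shows how the two markets can diverge for the same game.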

pitcher_stats

This table contains game-level performance stats for each pitcher, including innings pitched, runs allowed, and ERA. (statsapi)

Variable	Name	Description
`game_id`	Game ID	Unique identifier for the game.
`pitcher_id`	Pitcher ID	Unique identifier for the pitcher.
`team_id`	Team ID	Identifier for the team the pitcher played for.
`type`	Pitcher Type	Type of pitcher (e.g., starter, reliever).
`ip`	Innings Pitched	Total number of innings pitched in the game.
`h`	Hits Allowed	Total hits allowed.
`r`	Runs Allowed	Total runs allowed.
`er`	Earned Runs	Number of earned runs allowed.
`bb`	Walks	Number of walks issued.
`so`	Strikeouts	Number of batters struck out.
`hr`	Home Runs Allowed	Number of home runs allowed.
`pitches`	Pitches Thrown	Total number of pitches thrown.
`strikes`	Strikes Thrown	Number of strikes thrown.
`era`	Earned Run Average	ERA for the game.

pitcher_statcast

This table includes Statcast-derived advanced metrics for pitchers, highlighting differences between actual and expected outcomes like ERA, BA, SLG, and wOBA. (Baseball Savant)

Variable	Name	Description
`pitcher_id`	Pitcher ID	Unique identifier for the pitcher.
`pa`	Plate Appearances	Number of completed batting appearances against the pitcher.
`bip`	Balls In Play	Number of batted balls put into play against pitcher.
`ba`	Batting Average	Batting average against the pitcher.
`x_ba`	Expected BA	Expected batting average based on batted ball data.
`diff_ba`	BA - xBA	Difference between actual and expected batting average.
`slg`	Slugging Percentage	Slugging percentage against the pitcher.
`x_slg`	Expected SLG	Expected slugging based on contact quality.
`diff_slg`	SLG - xSLG	Difference between actual and expected slugging.
`woba`	Weighted On-Base Avg	Weighted on-base average against the pitcher.
`x_woba`	Expected wOBA	Predicted wOBA using Statcast data.
`diff_woba`	wOBA - xwOBA	Difference between actual and expected wOBA.
`era`	ERA	Earned run average.
`x_era`	Expected ERA	Expected ERA based on contact and strikeout/walk profile.
`diff_era`	ERA - xERA	Difference between actual and expected ERA.

pitcher_pitches

This table contains pitch-type-level metrics for pitchers, including run value, whiff rate, usage rate, and contact quality outcomes. (Baseball Savant)

Variable	Name	Description
`pitcher_id`	Pitcher ID	Unique identifier for the pitcher.
`pitch_name`	Pitch Type	Name of the pitch (e.g., fastball, slider).
`rv_100`	Run Value / 100 Pitches	Average run value per 100 pitches thrown.
`rv`	Run Value	Total run value assigned to the pitch.
`pitches`	Pitches Thrown	Number of times this pitch type was thrown.
`pitch_usage`	Pitch Usage	Percentage of time this pitch type is used.
`pa`	Plate Appearances	Total plate appearances ending with this pitch type.
`ba`	Batting Average	Batting average against this pitch.
`x_ba`	Expected BA	Expected batting average against this pitch.
`slg`	Slugging Percentage	Slugging percentage against this pitch.
`x_slg`	Expected SLG	Expected slugging against this pitch.
`woba`	Weighted On-Base Average	wOBA against this pitch.
`x_woba`	Expected wOBA	Expected wOBA against this pitch.
`whiff_percent`	Whiff Percentage	Percent of swings against this pitch that missed.
`k_percent`	Strikeout Percentage	Percent of plate appearances ending in strikeouts on this pitch.
`put_away`	Put Away Percentage	Percent of 2-strike counts resulting in strikeouts with this pitch.
`hard_hit`	Hard Hit Percentage	Percent of batted balls hit 95+ mph off this pitch.

betting_events

This table links betting markets to scheduled or completed games. (oddsapi)

Variable	Name	Description
`event_id`	Event ID	Unique identifier for each betting event.
`commence_date`	Start Time	Scheduled start date of the game.
`home_team`	Home Team	Name of the home team.
`away_team`	Away Team	Name of the away team.
`completed`	Completed Flag	Boolean indicating whether the event has finished.

player_props

This table represents the planned betting market integration, designed to support model validation and value identification once populated with historical odds data. (oddsapi)

Variable	Name	Description
`prop_id`	Prop ID	Primary key for individual prop bet offerings.
`event_id`	Event ID	Links to specific games through events table.
`bookmaker_key`	Sportsbook	Identifier for which bookmaker offered the odds.
`market_type`	Bet Type	Type of prop bet (hits, total_bases, home_runs, etc.).
`player_name`	Player Name	Name of player for the prop bet.
`price`	Betting Odds	Decimal odds offered by bookmaker.
`point`	Over/Under Line	Threshold value for over/under bets (0.5, 1.5, etc.).
`last_update`	Last Update	Time at which the prop was last updated.
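
Because `price` is stored as decimal odds, a model probability can be compared against the market directly. The sketch below converts American odds to decimal, computes the break-even win rate, and estimates expected profit for a given model probability. The function names are illustrative assumptions; the -110 example matches the break-even figure of roughly 52.4% discussed in the introduction.

```python
def american_to_decimal(american: int) -> float:
    """Convert American odds (e.g. -110, +150) to decimal odds."""
    if american > 0:
        return 1 + american / 100
    return 1 + 100 / abs(american)


def break_even_probability(decimal_odds: float) -> float:
    """Win rate at which a bet at these decimal odds has zero expected profit."""
    return 1 / decimal_odds


def expected_profit(p_win: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit of a bet given a model win probability."""
    return stake * (p_win * (decimal_odds - 1) - (1 - p_win))


price = american_to_decimal(-110)
print(round(price, 3), round(break_even_probability(price), 3))  # 1.909 0.524

# Hypothetical model probability of 55% on a -110 line, $100 stake.
edge = expected_profit(0.55, price, stake=100.0)
print(round(edge, 2))
```

A bet is only worth flagging when the model probability exceeds the break-even rate implied by `price`, so this comparison is the natural filter once the table is populated.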

bookmakers

This lookup table provides sportsbook identifiers used in the player_props table.

Variable	Name	Description
`bookmaker_key`	Sportsbook Key	Unique identifier for the bookmaker.
`title`	Sportsbook Name	Display name of the sportsbook (e.g., “DraftKings”, “FanDuel”).