Big Green Analytics - (Almost) Winning a March Madness Pool with Analytics

Background

Every year, I enter a large (100+ entrants) player selection pool with my dad and brother. Normally, we pick players haphazardly: I am the only one of the three of us who watches any regular-season college basketball, and I only watch a few games per year before the NCAA tournament starts in March. Since none of us wants to invest more time into watching teams full of transfers try to figure out how to pass to each other during the regular season, I thought we could take a different approach this year and use analytics to make our picks. We also enlisted the help of Evan, a USC economics major who played basketball in high school and knows basketball better than the three Gottesmans combined. Before we get too far, here are the rules for the competition:

Each entrant picks 15 players. Each player should be from a team playing in the NCAA tournament.
An entrant’s score is the sum of the points their selected players score over the course of the NCAA tournament, weighted by seed: players from 1-5 seeds get a 1x multiplier on the points they score, 6-12 seeds get 2x, and 13+ seeds get 3x.
The three entrants with the highest scores will win prizes at the end of the tournament.

More rigorous description

To be more mathematically rigorous about the above rules:

A player \(i\) on a team with seed \(\text{seed}(i)\) gets a multiplier \(\lambda(\text{seed}(i))\) on their score (i.e., if player \(i\) scores \(p_i\) points during the tournament, they will contribute \(\lambda(\text{seed}(i)) * p_i\) to the score of an entrant who selects player \(i\)).

For an entrant \(e\) who selects a set of players \(P_e\): \[\begin{equation} \text{score}_e = \sum_{i \in P_e} \lambda(\text{seed}(i)) * p_i \end{equation}\]

For a player from a team with seed \(s\), \(\lambda\) is defined as follows: \[\begin{equation} \lambda(s) = \begin{cases} 1 & s\leq 5 \\ 2 & 6 \leq s\leq 12 \\ 3 & 13 \leq s \end{cases} \end{equation}\]
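As a minimal sketch of the scoring rules above, the multiplier and an entrant's score could be computed as follows. The function names and the example picks are hypothetical and only illustrate the formulas.

```python
def seed_multiplier(seed: int) -> int:
    """Multiplier lambda(s) from the pool rules: 1x, 2x, or 3x by seed."""
    if seed <= 5:
        return 1
    if seed <= 12:
        return 2
    return 3


def entrant_score(picks) -> int:
    """Sum of multiplier * points over an entrant's picks.

    picks: iterable of (seed, tournament_points) pairs.
    """
    return sum(seed_multiplier(seed) * points for seed, points in picks)


# Hypothetical picks as (seed, points scored in the tournament).
example_picks = [(1, 80), (6, 40), (13, 25)]
print(entrant_score(example_picks))  # 1*80 + 2*40 + 3*25 = 235
```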

Finally, for the set of entrants to the pool \(E\), the set of winners \(W \subset E\) will be:

\[\begin{equation} W = \text{argmax}_{E' \subset E,\, |E'|=3} \sum_{e \in E'} \text{score}_e \end{equation}\]

https://math.stackexchange.com/questions/463502/mathematical-expression-for-n-largest-values-in-a-set
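The winner definition above can be sketched in a few lines. The entrant names and scores here are made up; the point is only that sorting and taking the top three gives the same set as the argmax over all three-entrant subsets.

```python
from itertools import combinations


def top_three(scores):
    """Three entrants with the highest scores, found by sorting."""
    return sorted(scores, key=scores.get, reverse=True)[:3]


def top_three_argmax(scores):
    """Same set, found literally as the 3-subset with the largest total score."""
    return max(combinations(scores, 3), key=lambda trio: sum(scores[e] for e in trio))


# Hypothetical entrants and scores.
scores = {"ann": 310, "bo": 295, "cy": 402, "di": 288, "ed": 350}
print(set(top_three(scores)) == set(top_three_argmax(scores)))
```

With tied scores the two functions may break ties differently, so the comparison is only guaranteed when all scores are distinct.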

Initial Approach

Our first (naive) thought was to maximize \(\mathbb{E}[\text{score}_e]\) for our entry. In other words, we wanted to just find the collection of 15 players to form our team that had the highest number of expected points. To do this, we estimated \(\mathbb{E}[\text{score}_e]\) using stats from the 2022-2023 regular season and the DraftKings odds for how far various teams would make it in the tournament. We were on a 24-hour time constraint, and the best estimate we could make was to multiply each player’s points per game (ppg) by the number of games we expected them to play (let’s denote this \(\mathbb{E}[\text{g}_i]\) for player \(i\)).
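A rough sketch of this ranking, assuming each player's expected contribution is the seed multiplier times ppg times expected games. The player records, field names, and numbers are hypothetical, not our actual data.

```python
def seed_multiplier(seed: int) -> int:
    """Pool multiplier by seed: 1x for 1-5, 2x for 6-12, 3x for 13+."""
    return 1 if seed <= 5 else 2 if seed <= 12 else 3


def expected_points(player: dict) -> float:
    """Naive estimate: multiplier * ppg * expected games played."""
    return seed_multiplier(player["seed"]) * player["ppg"] * player["expected_games"]


def pick_roster(players, size=15):
    """Take the `size` players with the highest naive expected contribution."""
    return sorted(players, key=expected_points, reverse=True)[:size]


# Hypothetical players, not real stats.
players = [
    {"name": "Player A", "seed": 1, "ppg": 20.0, "expected_games": 3.5},
    {"name": "Player B", "seed": 6, "ppg": 15.0, "expected_games": 1.8},
    {"name": "Player C", "seed": 13, "ppg": 18.0, "expected_games": 1.1},
]
for p in pick_roster(players, size=2):
    print(p["name"], round(expected_points(p), 1))
```

Note how the 3x multiplier lifts a 13 seed's scorer above a 6 seed's scorer even though the 13 seed is expected to play fewer games.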

To estimate \(\mathbb{E}[\text{g}_i]\), our preferred approach was to estimate each game using a win probability model. Evan painstakingly transcribed the win probability estimates from ESPN for the first-round games, but we ran into an issue: we couldn’t get WP estimates after the first round. To solve this, we decided to trust the experts in Vegas and scrape published odds. DraftKings offers futures for each team in the tournament making it to each round (up to the final), which can be turned into \(\mathbb{E}[\text{g}_i]\): a team plays one game in every round it reaches, so its expected number of games is the sum of its probabilities of reaching each round. Then, we can get a naive estimate for the number of points each player will score over the course of the tournament by multiplying their ppg by \(\mathbb{E}[\text{g}_i]\). Here is the list we generated from this strategy:¹

¹ Satisfyingly, our algorithm picked a number of Creighton players, and Creighton was a team we felt (based on intuition) was underseeded and had a good chance of making a run.

Kendrick Davis, Memphis

Zach Edey, Purdue

Ryan Kalkbrenner, Creighton

Sincere Carry, Kent State

Oscar Tshiebwe, Kentucky

Brandon Miller, Alabama

Wade Taylor, Texas A&M

Kris Murray, Iowa

Deandre Williams, Memphis

Trey Alexander, Creighton

Walter Clayton, Iona

Raequan Battle, Montana St.

Antonio Reeves, Kentucky

Baylor Scheierman, Creighton

Marcus Sasser, Houston
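The expected-games estimate behind this list can be sketched as follows. This is an illustration under stated assumptions, not our scraping code: the odds are hypothetical, they are assumed to be quoted in American format, and the bookmaker's margin is not removed, so the implied probabilities run slightly high.

```python
def implied_probability(american_odds: int) -> float:
    """Convert American odds to an implied probability (margin not removed)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def expected_games(reach_round_odds) -> float:
    """Expected games played by a team.

    reach_round_odds: American odds of reaching each round after the first.
    Every team plays its first game, then one more per round it reaches.
    """
    return 1 + sum(implied_probability(o) for o in reach_round_odds)


# Hypothetical futures for one team:
# round of 32, Sweet 16, Elite Eight, Final Four, championship game.
odds = [-400, +100, +300, +900, +1900]
print(round(expected_games(odds), 2))  # 1 + 0.8 + 0.5 + 0.25 + 0.1 + 0.05 = 2.7
```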

Could we have done better?

This approach got us fourth out of 24 entrants to the player pool,² but how could we improve the method?

² Pretty good and almost enough to earn some prize money!

Optimizing for Pool Constructs

Our method of maximizing the expected value of ppg * expected games played has one immediately apparent flaw: maximizing the expected value is not the same as maximizing your chance to win the pool. Take a look at the plot below. Let the red line represent the score needed to win (finish in the top three). Now look at the distributions. The orange distribution has a higher expected value (green line) than the blue distribution (purple line). However, the blue distribution has more area to the right of the “winning” line than the orange distribution, so it would be the better choice for our player picks.

```{python}
import numpy as np
import matplotlib.pyplot as plt
```

```{python}
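rng = np.random.default_rng(0)  # fixed seed so the figure is reproducible

# Two candidate score distributions: blue is wider, orange has the higher mean.
plt.hist(rng.normal(0, 1, 100000), bins=1000, color="tab:blue", alpha=0.7, label="blue: mean 0, sd 1")
plt.hist(rng.normal(0.5, 0.5, 100000), bins=1000, color="tab:orange", alpha=0.7, label="orange: mean 0.5, sd 0.5")

# Vertical reference lines for the two expected values and the winning score.
plt.axvline(0.5, color="tab:green", label="EV of orange")
plt.axvline(1.6, color="tab:red", label="score needed to win")
plt.axvline(0, color="tab:purple", label="EV of blue")
plt.legend()
plt.show()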
```
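To put numbers on the plot, here is a small sketch that computes the area to the right of the winning line for each distribution. It reuses the plot's parameters (means 0 and 0.5, standard deviations 1 and 0.5, winning score 1.6); the helper name is made up for this example.

```python
from math import erf, sqrt


def prob_above(threshold: float, mean: float, sd: float) -> float:
    """P(X > threshold) for X ~ Normal(mean, sd), via the error function."""
    z = (threshold - mean) / sd
    return 0.5 * (1 - erf(z / sqrt(2)))


blue = prob_above(1.6, mean=0.0, sd=1.0)
orange = prob_above(1.6, mean=0.5, sd=0.5)
print(f"blue: {blue:.3f}, orange: {orange:.3f}")  # blue is about 0.055, orange about 0.014
```

Under these illustrative parameters, the lower-mean blue distribution clears the winning score roughly four times as often as the higher-mean orange one.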

Accounting for Conditional Factors

We implicitly assumed that all the agents (players) could be picked independently, but that ignores the nature of the tournament format. For example, consider the extreme case where the best 15 players are all on two teams that play each other in the first round, but the next five best players are on a team on the other side of the bracket. One of those two teams is guaranteed to be eliminated after a single game, so to maximize your potential points, it is probably best to pick some players from teams on opposite sides of the bracket.
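A toy example makes the dependence concrete. This sketch assumes a hypothetical four-team bracket (A plays B, C plays D, winners meet in the final) where every game is a coin flip, and enumerates the combined number of games played by two picked teams.

```python
from itertools import product


def combined_games(pair: str) -> dict:
    """Distribution of total games played by two teams in the toy bracket.

    Semifinals are A vs B and C vs D; each outcome has probability 1/4
    because every game is assumed to be a coin flip.
    """
    dist = {}
    for winner_ab, winner_cd in product("AB", "CD"):
        games = {team: 1 for team in "ABCD"}  # everyone plays a semifinal
        games[winner_ab] += 1                 # semifinal winners also play the final
        games[winner_cd] += 1
        total = games[pair[0]] + games[pair[1]]
        dist[total] = dist.get(total, 0) + 0.25
    return dist


print(combined_games("AB"))  # first-round opponents: {3: 1.0}
print(combined_games("AC"))  # opposite sides: {4: 0.25, 3: 0.5, 2: 0.25}
```

Both pairs have the same expected total of three games, but only the pair from opposite sides of the bracket can reach four. When the goal is to clear a high winning score rather than to maximize the average, that extra upside matters.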
