Creating a New WP Model
While NFLWin ships with a fairly robust default model, there is always room for improvement. Maybe there’s a new dataset you want to use to train the model, a new feature you want to add, or a new machine learning model you want to evaluate.
Good news! NFLWin makes it easy to train a new model, whether you just want to refresh the data or to do an entire refit from scratch. We’ll start with the simplest case:
Default Model, New Data
Refreshing the data with NFLWin is a snap. If you want to change the data used by the default model but keep the source as nfldb, all you have to do is override the default keyword arguments when calling the train_model() and validate_model() methods. For instance, if for some insane reason you wanted to train on the 2009 and 2010 regular seasons and validate on the 2011 and 2012 playoffs, you would do the following:
>>> from nflwin.model import WPModel
>>> new_data_model = WPModel()
>>> new_data_model.train_model(training_seasons=[2009, 2010], training_season_types=["Regular"])
>>> new_data_model.validate_model(validation_seasons=[2011, 2012], validation_season_types=["Postseason"])
(21.355462918011327, 565.56909036318007)
If you want to supply your own data, that's easy too - simply set the source_data kwarg of train_model() and validate_model() to a Pandas DataFrame of your training and validation data (respectively):
>>> from nflwin.model import WPModel
>>> new_data_model = WPModel()
>>> training_data.head()
gsis_id drive_id play_id offense_team yardline down yards_to_go \
0 2012090500 1 35 DAL -15.0 0 0
1 2012090500 1 57 NYG -34.0 1 10
2 2012090500 1 79 NYG -34.0 2 10
3 2012090500 1 103 NYG -29.0 3 5
4 2012090500 1 125 NYG -29.0 4 5
home_team away_team offense_won quarter seconds_elapsed curr_home_score \
0 NYG DAL True Q1 0.0 0
1 NYG DAL False Q1 4.0 0
2 NYG DAL False Q1 11.0 0
3 NYG DAL False Q1 55.0 0
4 NYG DAL False Q1 62.0 0
curr_away_score
0 0
1 0
2 0
3 0
4 0
>>> new_data_model.train_model(source_data=training_data)
>>> validation_data.head()
gsis_id drive_id play_id offense_team yardline down yards_to_go \
0 2014090400 1 36 SEA -15.0 0 0
1 2014090400 1 58 GB -37.0 1 10
2 2014090400 1 79 GB -31.0 2 4
3 2014090400 1 111 GB -26.0 1 10
4 2014090400 1 132 GB -11.0 1 10
home_team away_team offense_won quarter seconds_elapsed curr_home_score \
0 SEA GB True Q1 0.0 0
1 SEA GB False Q1 4.0 0
2 SEA GB False Q1 30.0 0
3 SEA GB False Q1 49.0 0
4 SEA GB False Q1 88.0 0
curr_away_score
0 0
1 0
2 0
3 0
4 0
>>> new_data_model.validate_model(source_data=validation_data)
(8.9344062502671591, 265.7971863696315)
Building a New Model
If you want to construct a totally new model, that's possible too. Just instantiate WPModel, then replace the model attribute with either a scikit-learn classifier or Pipeline. From that point train_model() and validate_model() should work as normal.
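As a sketch, swapping in your own estimator might look like the following. The Pipeline shown here is purely illustrative (it is not NFLWin's default pipeline), and the assignment to the model attribute is shown in comments since it requires a working nflwin install:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Build a replacement estimator. Any scikit-learn classifier or
# Pipeline implementing fit/predict_proba should work here.
new_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Then swap it onto a freshly instantiated WPModel (requires nflwin):
# from nflwin.model import WPModel
# custom_model = WPModel()
# custom_model.model = new_pipeline
# custom_model.train_model(...)  # train/validate as usual
```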
Note

If you create your own model, the column_descriptions attribute will no longer be accurate unless you update it manually.
Note

If your model uses a data structure other than a Pandas DataFrame, you will not be able to use the source_data="nfldb" default kwarg of train_model() and validate_model(). If you want to use nfldb data, query it through nflwin.utilities.get_nfldb_play_data() first and convert it from a DataFrame to the format required by your model.
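That conversion step might look like the sketch below. The column names are taken from the example data shown earlier, and the nfldb query is left as a comment since it requires a working nfldb install:

```python
import pandas as pd

# With nfldb installed, you would pull the data first:
# from nflwin import utilities
# play_df = utilities.get_nfldb_play_data(...)
# Here we stand in a toy DataFrame with the same shape of data:
play_df = pd.DataFrame({
    "yardline": [-15.0, -34.0],
    "down": [0, 1],
    "yards_to_go": [0, 10],
})

# Convert from a DataFrame to whatever structure your model
# expects - for example, a plain numpy feature array:
feature_array = play_df[["yardline", "down", "yards_to_go"]].to_numpy()
```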
Using NFLWin’s Preprocessors
While you can completely roll your own WP model from scratch, NFLWin comes with several classes designed to aid in preprocessing your data. These can be found in the appropriately named preprocessing module. Each of these preprocessors inherits from scikit-learn's BaseEstimator class, and is therefore fully compatible with scikit-learn Pipelines. Available preprocessors include:
ComputeElapsedTime: Convert the time elapsed in a quarter into the total seconds elapsed in the game.
ComputeIfOffenseIsHome: Create an indicator variable for whether or not the offense is the home team.
CreateScoreDifferential: Create a column giving the difference between the offense and defense point totals (offense minus defense), computed from the home team score, the away team score, and the indicator for whether the offense is the home team.
MapToInt: Map a column of values to integers. Useful for string columns (e.g. a quarter column with "Q1", "Q2", etc.).
CheckColumnNames: Ensure that only the desired data gets passed to the model, in the right order. Useful to guarantee that the underlying numpy arrays in a Pandas DataFrame used for model validation are in the same order as they were when the model was trained.
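Because every preprocessor follows the scikit-learn estimator API, they can be chained together in a Pipeline. The toy transformer below mimics what MapToInt does - it is not NFLWin's implementation - just to show the fit/transform pattern these preprocessors share:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ToyMapToInt(BaseEstimator, TransformerMixin):
    """Illustrative stand-in for nflwin.preprocessing.MapToInt."""

    def __init__(self, colname):
        self.colname = colname

    def fit(self, X, y=None):
        # Learn a stable string -> integer mapping from the data.
        uniques = sorted(X[self.colname].unique())
        self.mapping_ = {value: i for i, value in enumerate(uniques)}
        return self

    def transform(self, X):
        X = X.copy()
        X[self.colname] = X[self.colname].map(self.mapping_)
        return X

# Map a string quarter column ("Q1", "Q2", ...) to integers:
quarters = pd.DataFrame({"quarter": ["Q1", "Q2", "Q1", "Q4"]})
mapped = ToyMapToInt("quarter").fit_transform(quarters)
```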
To see examples of these preprocessors in use to build a model, look at nflwin.model.WPModel.create_default_pipeline().
Model I/O
To save a model to disk, use the nflwin.model.WPModel.save_model() method.
Note

If you do not provide a filename, the default model will be overwritten. Recovering it requires reinstalling NFLWin, which will in turn overwrite any non-default models you have saved to the default location.
To load a model from disk, use the nflwin.model.WPModel.load_model() class method. By default this will load the standard model that comes bundled with pip installs of NFLWin. Simply specify the filename kwarg to load a non-standard model.
Note

By default, models are saved to and loaded from the path given by nflwin.model.WPModel.model_directory, which itself defaults to a location inside your NFLWin install.
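A minimal save/load round trip might look like the sketch below. The nflwin calls use the filename kwarg described above but are left as comments since they require nflwin; the runnable part shows the same idea with a bare scikit-learn estimator and joblib (the filenames are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# With nflwin installed, the round trip is just:
# model.save_model(filename="my_model.nflwin")
# restored = WPModel.load_model(filename="my_model.nflwin")
# (omit filename to use the bundled default model's path)

# The same idea with a plain scikit-learn estimator and joblib,
# runnable without nflwin:
clf = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
path = os.path.join(tempfile.mkdtemp(), "my_model.joblib")
joblib.dump(clf, path)
restored = joblib.load(path)
```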
Estimating Quality of Fit
Estimating the quality of a classification model is challenging when you care about its predicted probabilities rather than a simple yes/no prediction. This is an area I'm actively looking to improve, but for now NFLWin does the following.
First, it takes the probabilities given by the model for each play in the validation set, then produces a kernel density estimate (KDE) of all the plays as well as of just the ones that were predicted correctly. The ratio of these two KDEs is the actual WP measured from the test data set at a given predicted WP. While all of this is computed in validate_model(), you can plot it for yourself by calling the plot_validation() method, which will generate a plot like the one shown on the home page.
From there NFLWin computes both the maximum deviation at any given percentage and the total area between the WP estimated from the model and what would be expected if the model were perfect - that's what is actually returned by validate_model(). This is obviously not ideal, given that it doesn't directly estimate uncertainties in the model, but it's the best I've been able to come up with so far. If anyone has an idea for how to do this better, I would welcome it enthusiastically.
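The procedure above can be sketched with scipy's gaussian_kde. This is a simplified illustration of the idea on synthetic data - not NFLWin's exact implementation - using the plays where the offense went on to win as the "correct" set:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Stand-in for model output on a validation set: predicted win
# probabilities, and whether the offense actually won each play
# (drawn so the toy model is well calibrated).
predicted_wp = rng.uniform(0.01, 0.99, size=2000)
actually_won = rng.uniform(size=2000) < predicted_wp

# KDE over all plays, and over just the winning plays.
kde_all = gaussian_kde(predicted_wp)
kde_won = gaussian_kde(predicted_wp[actually_won])

# Scaled ratio of the two densities ~ empirical WP at each
# predicted WP (Bayes: P(win | p) = P(p | win) P(win) / P(p)).
grid = np.linspace(0.01, 0.99, 99)
win_fraction = actually_won.mean()
empirical_wp = win_fraction * kde_won(grid) / kde_all(grid)

# Summary numbers in the spirit of what validate_model() returns:
# worst deviation from perfect calibration, and the area between
# the empirical curve and the ideal diagonal.
max_deviation = np.max(np.abs(empirical_wp - grid))
area_between = trapezoid(np.abs(empirical_wp - grid), grid)
```

For a well-calibrated model the empirical curve hugs the diagonal, so both numbers stay small; a badly calibrated model pushes both up.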