nflwin package

Submodules

nflwin.model module

Tools for creating and running the model.

class nflwin.model.WPModel(copy_data=True)[source]

Bases: object

The object that computes win probabilities.

In addition to holding the model itself, it defines some column names likely to be used in the model as parameters, to allow other users to more easily figure out which columns go into the model.

Parameters:

copy_data : boolean (default=``True``)

Whether or not to copy data when fitting and applying the model. Running the model in-place (copy_data=False) will be faster and have a smaller memory footprint, but if not done carefully can lead to data integrity issues.

Attributes

model (A Scikit-learn pipeline (or equivalent)) The actual model used to compute WP. Upon initialization it will be set to a default model, but can be overridden by the user.
column_descriptions (dictionary) A dictionary whose keys are the names of the columns used in the model, and the values are string descriptions of what the columns mean. Set at initialization to be the default model, if you create your own model you’ll need to update this attribute manually.
training_seasons (A list of ints, or None (default=``None``)) If the model was trained using data downloaded from nfldb, a list of the seasons used to train the model. If nfldb was not used, an empty list. If no model has been trained yet, None.
training_season_types (A list of strings or None (default=``None``)) Same as training_seasons, except for the portions of the seasons used in training the model (“Preseason”, “Regular”, and/or “Postseason”).
validation_seasons (same as training_seasons, but for validation data.)
validation_season_types (same as training_season_types, but for validation data.)
sample_probabilities (A numpy array of floats or None (default=``None``)) After the model has been validated, contains the sampled predicted probabilities used to compute the validation statistic.
predicted_win_percents (A numpy array of floats or None (default=``None``)) After the model has been validated, contains the actual probabilities in the test set at each probability in sample_probabilities.
num_plays_used (A numpy array of floats or None (default=``None``)) After the model has been validated, contains the number of plays used to compute each element of predicted_win_percents.
model_directory (string) The directory where all models will be saved to or loaded from.
create_default_pipeline()[source]

Create the default win probability estimation pipeline.

Returns:

Scikit-learn pipeline

The default pipeline, suitable for computing win probabilities but by no means the best possible model.

This can be run any time a new default pipeline is required, and either set to the model attribute or used independently.
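The default pipeline is a standard Scikit-learn construction: preprocessing steps followed by a probabilistic classifier. The sketch below is illustrative only, assuming toy numeric features; nflwin's actual default pipeline uses its own preprocessing steps and feature set.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# A minimal stand-in for the kind of pipeline create_default_pipeline() returns.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Toy training data standing in for play-by-play features.
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

pipeline.fit(X, y)
win_probs = pipeline.predict_proba(X)[:, 1]  # column 1 = P(offense wins)
```

A pipeline built this way can either be assigned to the model attribute of a WPModel or used on its own.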

classmethod load_model(filename=None)[source]

Load a saved WPModel.

Parameters: Same as save_model.
Returns: nflwin.WPModel instance.
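save_model and load_model together form a simple serialize-to-disk round trip inside model_directory. The sketch below mirrors that pattern with pickle, a plain dict standing in for a fitted WPModel, and a temporary directory; the actual serialization mechanism and default filename in nflwin may differ.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted WPModel instance (a plain dict keeps the sketch simple).
model = {"training_seasons": [2009, 2010], "pipeline": "fitted-pipeline-here"}

with tempfile.TemporaryDirectory() as model_directory:
    filename = os.path.join(model_directory, "default_model.nflwin")
    with open(filename, "wb") as f:   # analogous to save_model(filename=...)
        pickle.dump(model, f)
    with open(filename, "rb") as f:   # analogous to load_model(filename=...)
        restored = pickle.load(f)

print(restored["training_seasons"])  # [2009, 2010]
```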
model_directory = '/home/docs/checkouts/readthedocs.org/user_builds/nflwin/checkouts/stable/nflwin/models'
num_plays_used
plot_validation(axis=None, **kwargs)[source]

Plot the validation data.

Parameters:

axis : matplotlib.pyplot.axis object or None (default=``None``)

If provided, the validation line will be overlaid on axis. Otherwise, a new figure and axis will be generated and plotted on.

**kwargs

Arguments to axis.plot.

Returns:

matplotlib.pyplot.axis

The axis the plot was made on.

Raises:

NotFittedError

If the model hasn’t been fit and validated.

predict_wp(plays)[source]

Estimate the win probability for a set of plays.

Basically a simple wrapper around WPModel.model.predict_proba, takes in a DataFrame and then spits out an array of predicted win probabilities.

Parameters:

plays : Pandas DataFrame

The input data to use to make the predictions.

Returns:

Numpy array, of length len(plays)

Predicted probability that the offensive team in each play will go on to win the game.

Raises:

NotFittedError

If the model hasn’t been fit.
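Since predict_wp is described as a thin wrapper around the fitted pipeline's predict_proba, its core logic can be sketched as below. The column index 1 for the "offense wins" class and the fitted-check via check_is_fitted are assumptions for illustration; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils.validation import check_is_fitted

def predict_wp(model, plays: pd.DataFrame) -> np.ndarray:
    """Return P(offense wins) for each play in the DataFrame."""
    check_is_fitted(model)  # raises NotFittedError if fit was never called
    return model.predict_proba(plays)[:, 1]

plays = pd.DataFrame({"score_differential": [-7, 0, 7],
                      "seconds_elapsed": [100, 1800, 3500]})
clf = LogisticRegression().fit(plays, [0, 1, 1])
wp = predict_wp(clf, plays)
print(wp.shape)  # (3,)
```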

predicted_win_percents
sample_probabilities
save_model(filename=None)[source]

Save the WPModel instance to disk.

All models are saved to the same place, with the installed NFLWin library (given by WPModel.model_directory).

Parameters:

filename : string (default=None)

The filename to use for the saved model. If this parameter is not specified, save to the default filename. Note that if a model already exists with this filename, it will be overwritten. Note also that this is a filename only, not a full path. If a full path is specified it is likely (albeit not guaranteed) to cause errors.

Returns:

None

train_model(source_data='nfldb', training_seasons=[2009, 2010, 2011, 2012, 2013, 2014], training_season_types=['Regular', 'Postseason'], target_colname='offense_won')[source]

Train the model.

Once a modeling pipeline is set up (either the default or something custom-generated), historical data needs to be fed into it in order to “fit” the model so that it can then be used to predict future results. This method implements a simple wrapper around the core Scikit-learn functionality which does this.

The default is to use data from the nfldb database, however that can be changed to a simple Pandas DataFrame if desired (for instance if you wish to use data from another source).

There is no particular output from this function, rather the parameters governing the fit of the model are saved inside the model object itself. If you want to get an estimate of the quality of the fit, use the validate_model method after running this method.

Parameters:

source_data : the string "nfldb" or a Pandas DataFrame (default="nfldb")

The data to be used to train the model. If "nfldb", will query the nfldb database for the training data (note that this requires a correctly configured installation of nfldb’s database).

training_seasons : list of ints (default=[2009, 2010, 2011, 2012, 2013, 2014])

What seasons to use to train the model if getting data from the nfldb database. If source_data is not "nfldb", this argument will be ignored. NOTE: it is critical not to use all possible data in order to train the model - some will need to be reserved for a final validation (see the validate_model method). A good dataset to reserve for validation is the most recent one or two NFL seasons.

training_season_types : list of strings (default=["Regular", "Postseason"])

If querying from the nfldb database, what parts of the seasons to use. Options are “Preseason”, “Regular”, and “Postseason”. If source_data is not "nfldb", this argument will be ignored.

target_colname : string or integer (default="offense_won")

The name of the target variable column.

Returns:

None

Notes

If you are loading in the default model, there is no need to re-run this method. In fact, doing so will likely result in weird errors and could corrupt the model if you were to try to save it back to disk.

training_seasons
training_season_types
validate_model(source_data='nfldb', validation_seasons=[2015], validation_season_types=['Regular', 'Postseason'], target_colname='offense_won')[source]

Validate the model.

Once a modeling pipeline is trained, a different dataset must be fed into the trained model to validate the quality of the fit. This method implements a simple wrapper around the core Scikit-learn functionality which does this.

The default is to use data from the nfldb database, however that can be changed to a simple Pandas DataFrame if desired (for instance if you wish to use data from another source).

The output of this method is a p value which represents the confidence at which we can reject the null hypothesis that the model predicts the appropriate win probabilities. This number is computed by first smoothing the predicted win probabilities of both all test data and just the data where the offense won with a Gaussian kernel density estimate (standard deviation = 0.01). Once the data is smoothed, ratios at each percentage point from 1% to 99% are computed (i.e. what fraction of the time did the offense win when the model says they have a 1% chance of winning, a 2% chance, etc.). Each of these ratios should be well approximated by the binomial distribution, since they are essentially independent (not perfectly, but hopefully close enough) weighted coin flips, giving a p value. From there Fisher's method is used to combine the individual p values into a global p value. A p value close to zero means that the model is unlikely to be properly predicting the correct win probabilities. A p value close to one, while not proof that the model is correct, means that the model is at least not inconsistent with the hypothesis that it predicts good win probabilities.

Parameters:

source_data : the string "nfldb" or a Pandas DataFrame (default="nfldb")

The data to be used to validate the model. If "nfldb", will query the nfldb database for the validation data (note that this requires a correctly configured installation of nfldb’s database).

validation_seasons : list of ints (default=[2015])

What seasons to use to validate the model if getting data from the nfldb database. If source_data is not "nfldb", this argument will be ignored. NOTE: it is critical not to use the same data to validate the model as was used in the fit. Generally a good data set to use for validation is one from a time period more recent than was used to train the model. For instance, if the model was trained on data from 2009-2014, data from the 2015 season would be a sensible choice to validate the model.

validation_season_types : list of strings (default=["Regular", "Postseason"])

If querying from the nfldb database, what parts of the seasons to use. Options are “Preseason”, “Regular”, and “Postseason”. If source_data is not "nfldb", this argument will be ignored.

target_colname : string or integer (default="offense_won")

The name of the target variable column.

Returns:

float, between 0 and 1

The combined p value, where smaller values indicate that the model is not accurately predicting win probabilities.

Raises:

NotFittedError

If the model hasn’t been fit.

Notes

Probabilities are computed between 1 and 99 percent because a single incorrect prediction at 100% or 0% automatically drives the global p value to zero. Since the model is being smoothed this situation can occur even when there are no model predictions at those extreme values, and therefore leads to erroneous p values.

While it seems reasonable (to me at least), I am not totally certain that this approach is entirely correct. It’s certainly sub-optimal in that you would ideally reject the null hypothesis that the model predictions aren’t appropriate, but that seems to be a much harder problem (and one that would need much more test data to beat down the uncertainties involved). I’m also not sure if using Fisher’s method is appropriate here, and I wonder if it might be necessary to Monte Carlo this. I would welcome input from others on better ways to do this.
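The final step described above, combining the per-percentage-point p values into one global p value with Fisher's method, can be sketched in pure Python. For k p values the statistic -2 * sum(ln p) follows a chi-squared distribution with 2k degrees of freedom, and for even degrees of freedom the survival function has a closed form, so no external statistics library is needed for this illustration.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p values via Fisher's method."""
    k = len(pvalues)
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    half = statistic / 2.0
    # chi2 survival function with 2k df: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

# Uniform p values (consistent with the null) combine to a moderate global
# p value, while many small p values drive the result toward zero.
consistent = fisher_combined_pvalue([0.5] * 10)
inconsistent = fisher_combined_pvalue([0.001] * 10)
```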

validation_seasons
validation_season_types

nflwin.preprocessing module

Tools to get raw data ready for modeling.

class nflwin.preprocessing.CheckColumnNames(column_names=None, copy=True)[source]

Bases: sklearn.base.BaseEstimator

Make sure user has the right column names, in the right order.

This is a useful first step to make sure that nothing is going to break downstream, but can also be used effectively to drop columns that are no longer necessary.

Parameters:

column_names : None, or list of strings

A list of column names that need to be present in the scoring data. All other columns will be stripped out. The order of the columns will be applied to any scoring data as well, in order to handle the fact that pandas lets you play fast and loose with column order. If None, will obtain every column in the DataFrame passed to the fit method.

copy : boolean (default=``True``)

If False, apply the column selection and reordering in place.

fit(X, y=None)[source]

Grab the column names from a Pandas DataFrame.

Parameters:

X : Pandas DataFrame, of shape(number of plays, number of features)

NFL play data.

y : Numpy array, with length = number of plays, or None

1 if the home team won, 0 if not. (Used as part of Scikit-learn’s Pipeline)

Returns:

self : For compatibility with Scikit-learn’s Pipeline.

transform(X, y=None)[source]

Apply the column ordering to the data.

Parameters:

X : Pandas DataFrame, of shape(number of plays, number of features)

NFL play data.

y : Numpy array, with length = number of plays, or None

1 if the home team won, 0 if not. (Used as part of Scikit-learn’s Pipeline)

Returns:

X : Pandas DataFrame, of shape(number of plays, len(column_names))

The input DataFrame, properly ordered and with extraneous columns dropped.

Raises:

KeyError

If the input data frame doesn’t have all the columns specified by column_names.

NotFittedError

If transform is called before fit.
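CheckColumnNames boils down to recording a column list at fit time and re-applying it, order included, at transform time. A minimal pandas sketch of that behavior (the column names are hypothetical):

```python
import pandas as pd

fit_df = pd.DataFrame({"down": [1], "yards_to_go": [10], "extra": [0]})
column_names = list(fit_df.columns)  # what fit() records when column_names=None

# Scoring data may arrive with extra columns and a different column order...
scoring_df = pd.DataFrame({"yards_to_go": [7], "junk": [99],
                           "down": [3], "extra": [1]})
# ...selecting by the recorded list restores the order and drops the rest,
# and raises KeyError if any recorded column is missing.
ordered = scoring_df[column_names]
print(list(ordered.columns))  # ['down', 'yards_to_go', 'extra']
```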

class nflwin.preprocessing.ComputeElapsedTime(quarter_colname, quarter_time_colname, quarter_to_second_mapping={'Q1': 0, 'Q3': 1800, 'Q2': 900, 'Q4': 2700, 'OT3': 5400, 'OT2': 4500, 'OT': 3600}, total_time_colname='total_elapsed_time', copy=True)[source]

Bases: sklearn.base.BaseEstimator

Compute the total elapsed time from the start of the game.

Parameters:

quarter_colname : string

Which column indicates what quarter it is.

quarter_time_colname : string

Which column indicates how much time has elapsed in the current quarter.

quarter_to_second_mapping : dict (default={"Q1": 0, "Q2": 900, "Q3": 1800, "Q4": 2700, "OT": 3600, "OT2": 4500, "OT3": 5400})

What mapping to use between the string values in the quarter column and the seconds they correspond to. Mostly useful if your data had quarters listed as something like “Quarter 1” or “q1” instead of the values from nfldb.

total_time_colname : string (default=”total_elapsed_time”)

What column name to store the total elapsed time under.

copy : boolean (default=True)

If False, add the new column in place.

fit(X, y=None)[source]
transform(X, y=None)[source]

Create the new column.

Parameters:

X : Pandas DataFrame, of shape(number of plays, number of features)

NFL play data.

y : Numpy array, with length = number of plays, or None

1 if the home team won, 0 if not. (Used as part of Scikit-learn’s Pipeline)

Returns:

X : Pandas DataFrame, of shape(number of plays, number of features + 1)

The input DataFrame, with the new column added.

Raises:

KeyError

If quarter_colname or quarter_time_colname don’t exist, or if total_time_colname does exist.

TypeError

If the total time elapsed is not a numeric column, which typically indicates that the mapping did not apply to every row.
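The elapsed-time computation itself is a mapping from the quarter string to that quarter's starting second, plus the seconds already elapsed within the quarter. A pandas sketch (column names are illustrative):

```python
import pandas as pd

quarter_to_second = {"Q1": 0, "Q2": 900, "Q3": 1800, "Q4": 2700,
                     "OT": 3600, "OT2": 4500, "OT3": 5400}

plays = pd.DataFrame({"quarter": ["Q1", "Q2", "OT"],
                      "seconds_elapsed": [30, 120, 15]})
# Map each quarter to its starting second, then add the within-quarter time.
plays["total_elapsed_time"] = (
    plays["quarter"].map(quarter_to_second) + plays["seconds_elapsed"]
)
print(plays["total_elapsed_time"].tolist())  # [30, 1020, 3615]
```

Note that a quarter value missing from the mapping produces NaN here, which is the kind of failure the TypeError check above is guarding against.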

class nflwin.preprocessing.ComputeIfOffenseIsHome(offense_team_colname, home_team_colname, offense_home_team_colname='is_offense_home', copy=True)[source]

Bases: sklearn.base.BaseEstimator

Determine if the team currently with possession is the home team.

Parameters:

offense_team_colname : string

Which column indicates what team was on offense.

home_team_colname : string

Which column indicates what team was the home team.

offense_home_team_colname : string (default=”is_offense_home”)

What column to store whether or not the offense was the home team.

copy : boolean (default=True)

If False, add the new column in place.

fit(X, y=None)[source]
transform(X, y=None)[source]

Create the new column.

Parameters:

X : Pandas DataFrame, of shape(number of plays, number of features)

NFL play data.

y : Numpy array, with length = number of plays, or None

1 if the home team won, 0 if not. (Used as part of Scikit-learn’s Pipeline)

Returns:

X : Pandas DataFrame, of shape(number of plays, number of features + 1)

The input DataFrame, with the new column added.

Raises:

KeyError

If offense_team_colname or home_team_colname don’t exist, or if offense_home_team_colname does exist.
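The new column is just an equality test between the possession team and the home team. A one-line pandas sketch (team abbreviations are illustrative):

```python
import pandas as pd

plays = pd.DataFrame({"offense_team": ["NE", "NYJ"],
                      "home_team": ["NE", "MIA"]})
# True where the team with possession is also the home team.
plays["is_offense_home"] = plays["offense_team"] == plays["home_team"]
print(plays["is_offense_home"].tolist())  # [True, False]
```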

class nflwin.preprocessing.CreateScoreDifferential(home_score_colname, away_score_colname, offense_home_colname, score_differential_colname='score_differential', copy=True)[source]

Bases: sklearn.base.BaseEstimator

Convert offense and defense scores into a differential (offense - defense).

Parameters:

home_score_colname : string

The name of the column containing the score of the home team.

away_score_colname : string

The name of the column containing the score of the away team.

offense_home_colname : string

The name of the column indicating if the offense is home.

score_differential_colname : string (default="score_differential")

The name of column containing the score differential. Must not already exist in the DataFrame.

copy : boolean (default = True)

If False, add the score differential in place.

fit(X, y=None)[source]
transform(X, y=None)[source]

Create the score differential column.

Parameters:

X : Pandas DataFrame, of shape(number of plays, number of features)

NFL play data.

y : Numpy array, with length = number of plays, or None

1 if the home team won, 0 if not. (Used as part of Scikit-learn’s Pipeline)

Returns:

X : Pandas DataFrame, of shape(number of plays, number of features + 1)

The input DataFrame, with the score differential column added.
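Because the differential is offense minus defense, the home and away scores must be swapped whenever the offense is the away team. A numpy/pandas sketch of that logic, using column names consistent with the rest of this document:

```python
import numpy as np
import pandas as pd

plays = pd.DataFrame({"curr_home_score": [14, 14],
                      "curr_away_score": [7, 7],
                      "is_offense_home": [True, False]})
# offense - defense: home - away when the offense is home, else away - home.
plays["score_differential"] = np.where(
    plays["is_offense_home"],
    plays["curr_home_score"] - plays["curr_away_score"],
    plays["curr_away_score"] - plays["curr_home_score"],
)
print(plays["score_differential"].tolist())  # [7, -7]
```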

class nflwin.preprocessing.MapToInt(colname, copy=True)[source]

Bases: sklearn.base.BaseEstimator

Map a column of values to integers.

Mapping to integer is nice if you know a column only has a few specific values in it, but you need to convert it to integers before one-hot encoding it.

Parameters:

colname : string

The name of the column to perform the mapping on.

copy : boolean (default=True)

If False, apply the mapping in-place.

Attributes

mapping (dict) Keys are the unique values of the column, values are the integers those values will be mapped to.
fit(X, y=None)[source]

Find all unique strings and construct the mapping.

Parameters:

X : Pandas DataFrame, of shape(number of plays, number of features)

NFL play data.

y : Numpy array, with length = number of plays, or None

1 if the home team won, 0 if not. (Used as part of Scikit-learn’s Pipeline)

Returns:

self : For compatibility with Scikit-learn’s Pipeline.

Raises:

KeyError

If colname is not in X.

transform(X, y=None)[source]

Apply the mapping to the data.

Parameters:

X : Pandas DataFrame, of shape(number of plays, number of features)

NFL play data.

y : Numpy array, with length = number of plays, or None

1 if the home team won, 0 if not. (Used as part of Scikit-learn’s Pipeline)

Returns:

X : Pandas DataFrame, of shape(number of plays, number of features)

The input DataFrame, with the mapping applied.

Raises:

NotFittedError

If transform is called before fit.

KeyError

If colname is not in X.
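MapToInt's fit builds a dictionary from each unique value in the column to an integer, and transform applies it. A pandas sketch of the same idea:

```python
import pandas as pd

column = pd.Series(["Q1", "Q2", "Q1", "OT"])
# fit(): assign each unique value an integer in order of first appearance.
mapping = {value: i for i, value in enumerate(column.unique())}
# transform(): apply the mapping to the column.
mapped = column.map(mapping)
print(mapped.tolist())  # [0, 1, 0, 2]
```

The resulting integer column is then safe to feed into a one-hot encoder.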

class nflwin.preprocessing.OneHotEncoderFromDataFrame(categorical_feature_names='all', dtype=<type 'float'>, handle_unknown='error', copy=True)[source]

Bases: sklearn.base.BaseEstimator

One-hot encode a DataFrame.

This cleaner wraps the standard scikit-learn OneHotEncoder, handling the transfer between column name and column index.

Parameters:

categorical_feature_names : "all" or array of column names

Specify which features are treated as categorical:

  • "all" (default): all features are treated as categorical.
  • array of column names: only the named features are treated as categorical.

dtype : number type, default=np.float.

Desired dtype of output.

handle_unknown : str, “error” (default) or “ignore”.

Whether to raise an error or ignore if an unknown categorical feature is present during transform.

copy : boolean (default=True)

If False, apply the encoding in-place.

dtype
fit(X, y=None)[source]

Convert the column names to indices, then compute the one hot encoding.

Parameters:

X : Pandas DataFrame, of shape(number of plays, number of features)

NFL play data.

y : Numpy array, with length = number of plays, or None

1 if the home team won, 0 if not. (Used as part of Scikit-learn’s Pipeline)

Returns:

self : For compatibility with Scikit-learn’s Pipeline.

handle_unknown
transform(X, y=None)[source]

Apply the encoding to the data.

Parameters:

X : Pandas DataFrame, of shape(number of plays, number of features)

NFL play data.

y : Numpy array, with length = number of plays, or None

1 if the home team won, 0 if not. (Used as part of Scikit-learn’s Pipeline)

Returns:

X : Pandas DataFrame, of shape(number of plays, number of new features)

The input DataFrame, with the encoding applied.
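Modern scikit-learn can select columns by name directly, which is essentially the bridging this wrapper provided. A sketch encoding one categorical column while passing the rest through, with hypothetical column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

plays = pd.DataFrame({"quarter": ["Q1", "Q2", "Q1"], "down": [1, 2, 3]})
# One-hot encode "quarter" by name; leave the numeric "down" column untouched.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["quarter"])],
    remainder="passthrough",
)
encoded = encoder.fit_transform(plays)
print(encoded.shape)  # (3, 3): two quarter columns plus the down column
```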

nflwin.utilities module

Utility functions that don’t fit in the main modules.

nflwin.utilities.connect_nfldb()[source]

Connect to the nfldb database.

Rather than using the builtin method we make our own, since we’re going to use SQLAlchemy as the engine. However, we can still make use of the information in the nfldb config file to get information like username and password, which means this function doesn’t need any arguments.

Parameters:

None

Returns:

SQLAlchemy engine object

A connected engine, ready to be used to query the DB.

Raises:

IOError

If it can’t find the config file.
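The idea of connect_nfldb is to read the connection details from nfldb's config file instead of taking arguments, then build an SQLAlchemy engine from them. A stdlib sketch of that flow; the section and key names below are assumptions for illustration, not a guaranteed match for nfldb's actual config schema.

```python
import configparser
import io

# Stand-in for the on-disk nfldb config file (section/key names assumed).
config_text = """
[pgsql]
user = nfldb
password = secret
host = localhost
port = 5432
database = nfldb
"""
config = configparser.ConfigParser()
config.read_file(io.StringIO(config_text))

db = config["pgsql"]
url = "postgresql://{user}:{password}@{host}:{port}/{database}".format(**db)
print(url)  # postgresql://nfldb:secret@localhost:5432/nfldb
# An SQLAlchemy engine would then be built with create_engine(url).
```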

nflwin.utilities.get_nfldb_play_data(season_years=None, season_types=['Regular', 'Postseason'])[source]

Get play-by-play data from the nfldb database.

We use a specialized query and then postprocessing because, while possible to do using the objects created by nfldb, it is orders of magnitude slower. This is due to the more general nature of nfldb, which is not really designed for this kind of data mining. Since we need to get a lot of data in a single way, it’s much simpler to interact at a lower level with the underlying postgres database.

Parameters:

season_years : list (default=None)

A list of all years to get data for (earliest year in nfldb is 2009). If None, get data from all available seasons.

season_types : list (default=[“Regular”, “Postseason”])

A list of all parts of seasons to get data for (acceptable values are “Preseason”, “Regular”, and “Postseason”). If None, get data from all three season types.

Returns:

Pandas DataFrame

The play by play data, with the following columns:

  • gsis_id: The official NFL GSIS_ID for the game.
  • drive_id: The id of the drive, starts at 1 and increases by 1 for each new drive.
  • play_id: The id of the play in nfldb. Note that sequential plays have increasing but not necessarily sequential values. Together with drive_id and gsis_id, it works as a unique identifier for a given play.
  • quarter: The quarter, prepended with “Q” (e.g. Q1 means the first quarter). Overtime periods are denoted as OT, OT2, and theoretically OT3 if one were to ever be played.
  • seconds_elapsed: seconds elapsed since the start of the quarter.
  • offense_team: The abbreviation of the team currently with possession of the ball.
  • yardline: The current field position. Goes from -49 to 49, where negative numbers indicate that the team with possession is on its own side of the field.
  • down: The down. Kickoffs, extra points, and similar plays have a down of 0.
  • yards_to_go: How many yards needed in order to get a first down (or touchdown).
  • home_team: The abbreviation of the home team.
  • away_team: The abbreviation of the away team.
  • curr_home_score: The home team’s score at the start of the play.
  • curr_away_score: The away team’s score at the start of the play.
  • offense_won: A boolean - True if the offense won the game, False otherwise. (The database query skips tied games.)

Notes

gsis_id, drive_id, and play_id are not necessary to make the model, but are included because they can be useful for computing things like WPA.

Module contents