Using Data From nfldb

NFLWin comes with robust support for querying data from nfldb, a package designed to facilitate downloading and accessing play-by-play data. There are functions to query the nfldb database in nflwin.utilities, and nflwin.model.WPModel has keyword arguments that allow you to directly use nfldb data to fit and validate a WP model. Using nfldb is totally optional: a default model is already fit and ready to use, and NFLWin is fully compatible with any source for play-by-play data. However, nfldb is one of the few free sources of up-to-date NFL data and so it may be a useful resource to have.

Installing nfldb

nfldb is pip-installable, and can be installed as an extra dependency (pip install nflwin[nfldb]). Without setting up the nfldb Postgres database first, however, the pip install will succeed but nfldb will be unuseable. What’s more, trying to set up the database after installing nfldb may fail as well.

The nfldb wiki has fairly decent installation instructions, but I know that when I went through the installation process I had to interpret and adjust several steps. I’d at least recommend reading through the wiki first, but in case it’s useful I’ve listed the steps I followed below (for reference I was on Mac OS 10.10).

Installing Postgres

I had an old install kicking around, so I first had to clean that up. Since I was using Homebrew:

$ brew uninstall -force postgresql
$ rm -rf /usr/local/var/postgres/ # where I'd installed the prior DB

Then install a fresh version:

$ brew update
$ brew install postgresql

Start Postgres and Create a Default DB

You can choose to run Postgres at startup, but I don’t use it that often so I choose not to do those steps - I just run it in the foreground with this command:

$ postgres -D /usr/local/var/postgres

Or in the background with this command:

$ pg_ctl -D /usr/local/var/postgres -l logfile start

If you don’t create a default database based on your username, launching Postgres will fail with a psql: FATAL:  database "USERNAME" does not exist error:

$ createdb `whoami`

Check that the install and configuration went well by launching Postgres as your default user:

$ psql
psql (9.5.2)
Type "help" for help.

USERNAME=#

Next, add a password:

USERNAME=# ALTER ROLE "USERNAME" WITH ENCRYPTED PASSWORD 'choose a
superuser password';
USERNAME=# \q;

Edit the pg_hba.conf``file found in your database (in my case the file was ``/usr/local/var/postgres/pg_hba.conf), and change all instances of trust to md5.

Create nfldb Postgres User and Database

Start by making a user:

$ createuser -U USERNAME -E -P nfldb

where you replace USERNAME with your actual username. Make up a new password. Then make the nfldb database:

$ createdb -U USERNAME -O nfldb nfldb

You’ll need to enter the password for the USERNAME account. Next, add the fuzzy string matching extension:

$ psql -U USERNAME -c 'CREATE EXTENSION fuzzystrmatch;' nfldb

You should now be able to connect the nfldb user to the nfldb database:

$ psql -U nfldb nfldb

From this point you should be able to follow along with the instructions from nfldb.

Using nfldb

Once nfldb is properly installed, you can use it with NFLwin in a couple of different ways.

Querying Data

nfldb comes with a robust set of options to query its database, but they tend to be designed more for ad hoc querying of small amounts of data or computing aggregate statistics. It’s possible to use built-in nfldb queries to get the data NFLWin needs, but it’s slow. So NFLWin has built in support for bulk queries of nfldb in the nflwin.utilities module:

>>> from nflwin import utilities
>>> data = utilities.get_nfldb_play_data(season_years=[2010],
... season_types=["Regular", "Postseason"])
>>> data.head()
      gsis_id  drive_id  play_id offense_team  yardline  down  yards_to_go  \
0  2010090900         1       35          MIN     -20.0     0            0
1  2010090900         1       57           NO     -27.0     1           10
2  2010090900         1       81           NO       1.0     1           10
3  2010090900         1      109           NO      13.0     1           10
4  2010090900         1      135           NO      13.0     2           10

  home_team away_team offense_won quarter  seconds_elapsed  curr_home_score  \
0        NO       MIN       False      Q1              0.0                0
1        NO       MIN        True      Q1              4.0                0
2        NO       MIN        True      Q1             39.0                0
3        NO       MIN        True      Q1             79.0                0
4        NO       MIN        True      Q1             84.0                0

   curr_away_score
0                0
1                0
2                0
3                0
4                0

You can see the docstring for more details, but basically get_nfldb_play_data queries the nfldb database directly for columns relevant to estimating WP, does some simple parsing/preprocessing to get them in the right format, then returns them as a dataframe. Keyword arguments control what parts of seasons are queried.

Integration with WPModel

While you can train NFLWin’s win probability model (nflwin.model.WPModel) with whatever data you want, it comes with keyword arguments that allow you to query nfldb directly. For instance, to train the default model on the 2009 and 2010 regular seasons from nfldb, you’d enter the following:

>>> from nflwin.model import WPModel
>>> model = WPModel()
>>> model.create_default_pipeline() 
Pipeline(...)
>>> model.train_model(source_data="nfldb",
... training_seasons=[2009, 2010],
... training_season_types=["Regular"])