Using Data From nfldb¶
NFLWin comes with robust support for querying data from nfldb, a package designed to
facilitate downloading and accessing play-by-play data. There are
functions to query the nfldb database in nflwin.utilities
,
and nflwin.model.WPModel
has keyword arguments that allow
you to directly use nfldb data to fit and validate a WP model. Using
nfldb is totally optional: a default model is already fit and ready to
use, and NFLWin is fully compatible with any source for play-by-play
data. However, nfldb is one of the few free sources of up-to-date NFL
data and so it may be a useful resource to have.
Installing nfldb¶
nfldb is pip-installable, and can be installed as an extra dependency
(pip install nflwin[nfldb]
). Without setting up the nfldb
Postgres database first, however, the pip install will succeed but
nfldb will be unuseable. What’s more, trying to set up the database
after installing nfldb may fail as well.
The nfldb wiki has fairly decent installation instructions, but I know that when I went through the installation process I had to interpret and adjust several steps. I’d at least recommend reading through the wiki first, but in case it’s useful I’ve listed the steps I followed below (for reference I was on Mac OS 10.10).
Installing Postgres¶
I had an old install kicking around, so I first had to clean that up. Since I was using Homebrew:
$ brew uninstall -force postgresql
$ rm -rf /usr/local/var/postgres/ # where I'd installed the prior DB
Then install a fresh version:
$ brew update
$ brew install postgresql
Start Postgres and Create a Default DB¶
You can choose to run Postgres at startup, but I don’t use it that often so I choose not to do those steps - I just run it in the foreground with this command:
$ postgres -D /usr/local/var/postgres
Or in the background with this command:
$ pg_ctl -D /usr/local/var/postgres -l logfile start
If you don’t create a default database based on your username,
launching Postgres will fail with a psql: FATAL: database
"USERNAME" does not exist
error:
$ createdb `whoami`
Check that the install and configuration went well by launching Postgres as your default user:
$ psql
psql (9.5.2)
Type "help" for help.
USERNAME=#
Next, add a password:
USERNAME=# ALTER ROLE "USERNAME" WITH ENCRYPTED PASSWORD 'choose a
superuser password';
USERNAME=# \q;
Edit the pg_hba.conf``file found in your database (in my case the
file was
``/usr/local/var/postgres/pg_hba.conf
), and change all instances of
trust
to md5
.
Create nfldb Postgres User and Database¶
Start by making a user:
$ createuser -U USERNAME -E -P nfldb
where you replace USERNAME
with your actual username. Make up a
new password. Then make the nfldb database:
$ createdb -U USERNAME -O nfldb nfldb
You’ll need to enter the password for the USERNAME account. Next, add the fuzzy string matching extension:
$ psql -U USERNAME -c 'CREATE EXTENSION fuzzystrmatch;' nfldb
You should now be able to connect the nfldb user to the nfldb database:
$ psql -U nfldb nfldb
From this point you should be able to follow along with the instructions from nfldb.
Using nfldb¶
Once nfldb is properly installed, you can use it with NFLwin in a couple of different ways.
Querying Data¶
nfldb comes with a robust set of options to query its database, but
they tend to be designed more for ad hoc querying of small amounts of
data or computing aggregate statistics. It’s possible to use built-in
nfldb queries to get the data NFLWin needs, but it’s slow. So NFLWin
has built in support for bulk queries of nfldb in the
nflwin.utilities
module:
>>> from nflwin import utilities
>>> data = utilities.get_nfldb_play_data(season_years=[2010],
... season_types=["Regular", "Postseason"])
>>> data.head()
gsis_id drive_id play_id offense_team yardline down yards_to_go \
0 2010090900 1 35 MIN -20.0 0 0
1 2010090900 1 57 NO -27.0 1 10
2 2010090900 1 81 NO 1.0 1 10
3 2010090900 1 109 NO 13.0 1 10
4 2010090900 1 135 NO 13.0 2 10
home_team away_team offense_won quarter seconds_elapsed curr_home_score \
0 NO MIN False Q1 0.0 0
1 NO MIN True Q1 4.0 0
2 NO MIN True Q1 39.0 0
3 NO MIN True Q1 79.0 0
4 NO MIN True Q1 84.0 0
curr_away_score
0 0
1 0
2 0
3 0
4 0
You can see the docstring for more details, but basically get_nfldb_play_data
queries
the nfldb database directly for columns relevant to estimating WP,
does some simple parsing/preprocessing to get them in the right format,
then returns them as a dataframe. Keyword arguments control what parts
of seasons are queried.
Integration with WPModel¶
While you can train NFLWin’s win probability model
(nflwin.model.WPModel
) with whatever data you want, it
comes with keyword arguments that allow you to query nfldb
directly. For instance, to train the default model on the 2009 and 2010
regular seasons from nfldb, you’d enter the following:
>>> from nflwin.model import WPModel
>>> model = WPModel()
>>> model.create_default_pipeline()
Pipeline(...)
>>> model.train_model(source_data="nfldb",
... training_seasons=[2009, 2010],
... training_season_types=["Regular"])