February 5, 2013

Building a Stock Options Historical Database

I am interested in doing econometric analysis on financial derivatives. The main hangup I have faced is that there are no good free resources (at least that I know of) for historical options data. For that reason I want to create my own personal database of historical options prices. I have broken this project down into three main hurdles:

  1. Figure out how to get options data from within python
  2. Pick a data storage format
  3. Automate the collection of daily data

Getting options data in python

Over the summer I had some free time and teamed up with my dad to create an investment model. While it is a very simple model, this post is about building a database so I won't go into details here. It suffices to say that I needed to find a way to get options data from yahoo! Finance. This was a unique challenge because unlike equity data or data from other sources like FRED, options data doesn't have a convenient "download to csv" button anywhere on the website. 

At the time I was reading the excellent book "Python for Data Analysis" by Wes McKinney and got an idea for how to implement a basic web crawler to parse the html on yahoo and return the data in a python friendly format. Long story short, I wrote some code to do just that and it made its way into version 0.9 of the pandas library (if you aren't familiar with pandas and you work with data in python you should definitely check it out).

Now only these few commands are needed to get options data from yahoo Finance!:
from pandas.io.data import Options

aapl = Options('AAPL')
puts, calls = appl.get_options_data()
The calls and puts objects are pandas DataFrames that contain the same information you would find on the yahoo! Finance page for Apple Inc. options.

Picking the file format

In picking a file format I had two main considerations: size of the file and speed at which it can be written/read. To test this out I wrote a simple script that generated a random 4000 by 4000 numpy array and defined functions for writing and reading that data in different file formats. The formats I chose to work with were csv, hdf5 (.h5), and MatLab (.mat). Below is the script I used to run the test:
import numpy as np
from scipy.io import savemat, loadmat
import pandas as pd
import h5py

data = np.random.randn(4000, 4000)

file_path = '/Users/spencerlyon2/Desktop/test_data.'

def store_csv():
    pd_data = pd.DataFrame(data)
    pd_data.to_csv(file_path + 'csv')

def store_h5():
    f = h5py.File(file_path + 'h5')
    dset = f.create_dataset('data', (4000, 4000), 'f')
    dset[...] = data

def store_mat():
    sp_dat = {'data': data}
    savemat(file_path + 'mat', sp_dat)

def read_csv():
    pd_data_in = pd.DataFrame().from_csv(file_path + 'csv')
    return pd_data_in

def read_h5():
    f_in = h5py.File(file_path + 'h5', 'r')
    d_in = f_in['data']
    return d_in

def read_mat():
    mat_dict = loadmat(file_path + 'mat')
    return mat_dict['data']
After I had this code I simply fired up iPython and ran the file (file_test.py) and used the %timeit magic to see how long it took each of the three methods to read and write the data. The timing results, along with the final file sizes are summarized in the table below:

csv hdf5 mat
Write time 59.9 s 101 ms 836 ms
Read Time 6.07 s 409 us 82.5 ms
File size 322.6 MB 64 MB 128 MB

It is easy to see that the hdf5 file type is the best one to choose for my purposes. I would like to note here that the reason the hdf5 file format is 1/2 the size of the .mat file, is because the dtype in the .h5 file is a 32 bit float whereas the .mat dtype is a 64 bit float. However, for stock options we only generally have/care about data out two decimal places so the 32 bit precision is plenty. 

Automating the data retrieval

The final step in getting this database started was to automate the data retrieval process. To do this I used the popular UNIX scheduling tool cron. I run OSX 10.8 Mountain Lion, and by default in 10.8 the cron tool is disabled. To fix this I simply ran the following command in the terminal:
sudo touch /etc/crontab
This command creates the /etc/crontab file (if it doesn't already exist) and gets it ready for use by cron. I am not going to give a detailed explanation for how to use cron here (as I am still fairly new at it myself), but googling for it will give you lots of examples and tutorials. I will, however give the line in my crontab file that executes the script:
50 23 * * 1,2,3,4,5 python ~/Documents/Python/pytools/Finance/db_cron.py
The next step was to write the script I would have cron call. This appears below.

Created Feb 6, 2013

Author: Spencer Lyon

File to be run automatically by cron each day to gather options data
import datetime as dt
import pandas as pd
import numpy as np
from options import Options
import h5py as h5

# ticker_list.csv contains all tickers on nyse and nasdaq stock exchanges
tickers = pd.Series.from_csv('ticker_list.csv')
num_ticks = tickers.size  # used to print status

months = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
          7: 'Jul', 8: 'Aug', 9: 'Se[', 10: 'Oct', 11: 'Nov', 12: 'Dec'}

now = dt.datetime.now()  # Get current time
c_month = months[now.month]  # Get current month
c_day = str(now.day)  # Get current day
c_year = str(now.year)  # Get current year

f = h5.File('/Volumes/Secondary HD/options_db.h5')  # open database file
year = f.require_group(c_year)  # Make hdf5 group for year
month = year.require_group(c_month)  # Make hdf5 group for month
day = month.require_group(c_day)  # Make hdf5 group for day

num = 0
for i in tickers:
    option = Options(i)

    # NOTE: this functionality is forthcoming in pandas 0.11
    raw_calls = option.get_forward_data(months=3, call=1, put=0,
                                        near=1, above_below=6)
    raw_puts = option.get_forward_data(months=3, call=0, put=1,
                                        near=1, above_below=6)

    if raw_calls.values.any():  # Check if any calls were returned
        try:  # Try to add item to file.
            # This block (and below for puts) does the following:
            #   - Get unique expiry dates
            #   - make hdf5 group for ticker
            #   - Get options data for each expiry
            #   - Put each set of expiry data in unique hdf5 dataset

            expiries = raw_calls.Expiry.unique().astype(str)
            tick = day.require_group(i)

            for ex in expiries:
                data = raw_calls[raw_calls.Expiry == ex]
                i_calls = data[['Strike', 'Last', 'Vol']]
                i_calls.Vol = i_calls.Vol.str.replace(',', '')

                ex_m_y = ex[:2] + ex[-3:]
                call_ds = tick.require_dataset('C' + i + ex_m_y,
                                               i_calls.shape, float)
                call_ds[...] = i_calls.astype(np.float32)
        except:  # If it doesn't work just pass

    if raw_puts.values.any():  # Check if any puts were returned
            expiries = raw_puts.Expiry.unique().astype(str)
            tick = day.require_group(i)

            for ex in expiries:
                data = raw_puts[raw_puts.Expiry == ex]
                i_puts = data[['Strike', 'Last', 'Vol']]
                i_puts.Vol = i_puts.Vol.str.replace(',', '')
                ex_m_y = ex[:2] + ex[-3:]
                put_ds = tick.require_dataset('P' + i + ex_m_y,
                                              i_puts.shape, float)
                put_ds[...] = i_puts.astype(np.float32)

    # status update
    num += 1
    if num % 500 == 0:
        print "just finished %s of %s" % (str(num), str(num_ticks))

f.close()  # Close file
I have cron run this script at a specified time each week-day and populate the hdf5 file. The resultant file will have a nested structure like this:
/Year {
    / Month{
        /Day {
            /Ticker {
The notation CTICKmm-yy stands for a call option (C), a given ticker (TICK), and the expiry of the option (mm-yy). Inside each of the datasets there are three columns: strike price, last price on option contract, and volume in last trading day.

After running this script for one night, the resultant hdf5 data file was 7.648648 MB. If I were to allow this file to run each business day for a year, the final file size would be under 2 GB. Not bad!

If you would like more information on how I collect ticker names or what Options functionality is in pandas 0.10 or earlier leave a comment and I'll do my best to respond.