October 22, 2013

Rouwenhorst (1995) Markov Transition Matrix

If you're going to approximate an AR(1) process with a finite-state Markov chain, many people will use the method given in either Tauchen (1986) or Tauchen and Hussey (1991). For a highly persistent process, however, Rouwenhorst (1995) outperforms both of these methods considerably. A rigorous comparison is given by Kopecky and Suen (2010).

Python code for Tauchen and Hussey is in a previous post. An implementation of Rouwenhorst is given here (and shown below).


import numpy as np

def rouwen(rho, mu, step, num):
    '''
    Adapted from Lu Zhang and Karen Kopecky. Python by Ben Tengelsen.
    Construct transition probability matrix for discretizing an AR(1)
    process. This procedure is from Rouwenhorst (1995), which works
    well for very persistent processes.

    INPUTS:
    rho  - persistence (close to one)
    mu   - mean and the middle point of the discrete state space
    step - step size of the even-spaced grid
    num  - number of grid points on the discretized process

    OUTPUT:
    dscSp  - discrete state space (num by 1 vector)
    transP - transition probability matrix over the grid
    '''

    # discrete state space
    dscSp = np.linspace(mu - (num - 1) / 2. * step, mu + (num - 1) / 2. * step, num)

    # transition probability matrix
    q = p = (rho + 1)/2.
    transP = np.array([[p**2, p*(1-q), (1-q)**2], \
                    [2*p*(1-p), p*q+(1-p)*(1-q), 2*q*(1-q)], \
                    [(1-p)**2, (1-p)*q, q**2]]).T

    while transP.shape[0] <= num - 1:

        # see Rouwenhorst 1995
        len_P = transP.shape[0]
        transP = p*np.vstack((np.hstack((transP,np.zeros((len_P,1)))), np.zeros((1, len_P+1)))) \
        + (1-p)*np.vstack((np.hstack((np.zeros((len_P, 1)), transP)), np.zeros((1, len_P+1)))) \
        + (1-q)*np.vstack((np.zeros((1, len_P+1)), np.hstack((transP, np.zeros((len_P, 1)))))) \
        + q*np.vstack((np.zeros((1, len_P+1)), np.hstack((np.zeros((len_P, 1)), transP))))

        transP[1:-1] /= 2.

    # ensure rows sum to 1 (so columns of the returned transpose do too)
    if np.max(np.abs(np.sum(transP, axis=1) - np.ones(num))) >= 1e-12:
        print('Problem in rouwen routine!')
        return None
    else:
        return transP.T, dscSp
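As a quick sanity check on the recursion, here is a condensed, self-contained sketch of the same construction for the symmetric case (p = q). The function name and parameter values are mine; note that rows sum to one here, whereas rouwen returns the transpose (columns sum to one).

```python
import numpy as np

def rouwen_sketch(rho, num):
    # symmetric AR(1): p = q = (1 + rho) / 2
    p = (1 + rho) / 2.
    P = np.array([[p, 1 - p], [1 - p, p]])   # 2-state base case
    for n in range(3, num + 1):
        Pn = np.zeros((n, n))
        Pn[:-1, :-1] += p * P          # top-left block
        Pn[:-1, 1:] += (1 - p) * P     # top-right block
        Pn[1:, :-1] += (1 - p) * P     # bottom-left block
        Pn[1:, 1:] += p * P            # bottom-right block
        Pn[1:-1] /= 2.                 # interior rows were double-counted
        P = Pn
    return P

P = rouwen_sketch(0.95, 5)
```

A useful property to check: the probability of staying at the lowest grid point is p**(num - 1), and every row sums to one.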

May 23, 2013

Supergametools

Many people are familiar with the introductory game-theory scenario of the prisoner's dilemma. When the prisoners make their decision only once, the Nash equilibrium strategy is to testify against their partner. Things change, however, when the prisoners make their decision infinitely many times. A very neat folk theorem says that infinitely repeated games can give rise to many more Nash equilibria.

Analytically finding the set of Nash equilibria in an infinitely repeated game is usually not possible. A paper by Judd, Yeltekin, and Conklin (2003) gives some nice numerical algorithms to approximate the set.

I coded these algorithms up as functions in a library called supergametools. It is available for download here, and includes instructions for installation and usage. I give an example of how you would use the library for a simple prisoner's dilemma below:

import numpy as np
import supergametools as sgt

# determine payoff matrices
p1 = np.array([[4, 0], [6, 2]])
p2 = p1.T

# center and radius of initial guess
cen = np.array([3, 3], ndmin=2)
rad = 5

# find outer hull of set of supergame equilibria
outer_hull = sgt.outerbound(p1, p2, cen, rad)

# display results
print outer_hull


The output is the following image. The red lines give the convex hull of the equilibrium set. The routine also returns the convex hull in matrix form.
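The one-shot claim from the intro — that testifying is the Nash equilibrium of the stage game — can be verified directly from these payoffs. This check is a sketch of mine using the same p1 and p2 as the example (action 0 = stay silent, action 1 = testify):

```python
import numpy as np

# stage-game payoffs from the example: row = player 1's action, col = player 2's
p1 = np.array([[4, 0], [6, 2]])
p2 = p1.T   # symmetric game

def is_nash(a1, a2):
    # a profile is a Nash equilibrium if neither player gains by deviating
    best1 = p1[:, a2].max()   # player 1's best payoff holding a2 fixed
    best2 = p2[a1, :].max()   # player 2's best payoff holding a1 fixed
    return p1[a1, a2] == best1 and p2[a1, a2] == best2

nash = [(a1, a2) for a1 in (0, 1) for a2 in (0, 1) if is_nash(a1, a2)]
```

Only (1, 1) — mutual testimony — survives the deviation check, which is exactly why the folk theorem's expansion of the equilibrium set in the repeated game is interesting.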



March 12, 2013

Matlab's Cylinder Command in Python

Matlab has a "cylinder" command that will give you n coordinates of a cylinder whose profile is given by R, a vector of radii (not necessarily all the same). You can think of R as a line or curve; the resulting cylinder comes from rotating that curve around the z-axis.

Matplotlib has commands to plot certain shapes, such as circles, but none that give you n evenly spaced coordinates approximating the shape. This admittedly has little to do with economics; I use it to form an initial guess for the set of supergame equilibria in a repeated game (see the supergametools library). Mostly it belongs to the occasional Matlab-to-Python translation project.

It's not hard to program up a method that does this, but to save you all some time I'm posting a version of the command in Python, seen below or downloadable on GitHub.

import numpy as np

def cylinder(r,n):
    '''
    Returns the unit cylinder that corresponds to the curve r.
    INPUTS:  r - a vector of radii
             n - number of coordinates to return for each element in r

    OUTPUTS: x,y,z - coordinates of points
    '''

    # ensure that r is a column vector
    r = np.atleast_2d(r)
    r_rows,r_cols = r.shape
    
    if r_cols > r_rows:
        r = r.T

    # find points along x and y axes
    points  = np.linspace(0,2*np.pi,n+1)
    x = np.cos(points)*r
    y = np.sin(points)*r

    # find points along z axis
    rpoints = np.atleast_2d(np.linspace(0,1,len(r)))
    z = np.ones((1,n+1))*rpoints.T
    
    return x,y,z

See below for an example.

import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3

# set curve vector r and set n to 20
r = np.linspace(2,.5,100)
n = 20

# get points from cylinder and plot
[x,y,z] = cylinder(r,n)

# make plots
fig=plt.figure()
ax = p3.Axes3D(fig)
ax.plot_wireframe(x,y,z)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

February 5, 2013

Building a Stock Options Historical Database

I am interested in doing econometric analysis on financial derivatives. The main hangup I have faced is that there are no good free resources (at least that I know of) for historical options data. For that reason I want to create my own personal database of historical options prices. I have broken this project down into three main hurdles:

  1. Figure out how to get options data from within python
  2. Pick a data storage format
  3. Automate the collection of daily data

Getting options data in python


Over the summer I had some free time and teamed up with my dad to create an investment model. While it is a very simple model, this post is about building a database, so I won't go into details here. It suffices to say that I needed a way to get options data from Yahoo! Finance. This was a unique challenge because, unlike equity data or data from other sources like FRED, options data doesn't have a convenient "download to csv" button anywhere on the website.

At the time I was reading the excellent book "Python for Data Analysis" by Wes McKinney and got an idea for how to implement a basic web crawler to parse the HTML on Yahoo! and return the data in a Python-friendly format. Long story short, I wrote some code to do just that and it made its way into version 0.9 of the pandas library (if you aren't familiar with pandas and you work with data in Python, you should definitely check it out).

Now only these few commands are needed to get options data from Yahoo! Finance:

from pandas.io.data import Options

aapl = Options('AAPL')
puts, calls = aapl.get_options_data()

The calls and puts objects are pandas DataFrames that contain the same information you would find on the Yahoo! Finance page for Apple Inc. options.

Picking the file format


In picking a file format I had two main considerations: the size of the file and the speed at which it can be written and read. To test this out I wrote a simple script that generated a random 4000 by 4000 numpy array and defined functions for writing and reading that data in different file formats. The formats I chose to work with were csv, hdf5 (.h5), and Matlab (.mat). Below is the script I used to run the test:
import numpy as np
from scipy.io import savemat, loadmat
import pandas as pd
import h5py

data = np.random.randn(4000, 4000)

file_path = '/Users/spencerlyon2/Desktop/test_data.'


def store_csv():
    pd_data = pd.DataFrame(data)
    pd_data.to_csv(file_path + 'csv')


def store_h5():
    f = h5py.File(file_path + 'h5', 'w')  # explicit mode; newer h5py requires one
    dset = f.create_dataset('data', (4000, 4000), 'f')
    dset[...] = data
    f.close()


def store_mat():
    sp_dat = {'data': data}
    savemat(file_path + 'mat', sp_dat)


def read_csv():
    pd_data_in = pd.DataFrame.from_csv(file_path + 'csv')
    return pd_data_in


def read_h5():
    f_in = h5py.File(file_path + 'h5', 'r')
    d_in = f_in['data'][...]  # copy into memory before closing the file
    f_in.close()
    return d_in


def read_mat():
    mat_dict = loadmat(file_path + 'mat')
    return mat_dict['data']

After I had this code I simply fired up IPython, ran the file (file_test.py), and used the %timeit magic to see how long each of the three methods took to read and write the data. The timing results, along with the final file sizes, are summarized in the table below:

             csv        hdf5     mat
Write time   59.9 s     101 ms   836 ms
Read time    6.07 s     409 µs   82.5 ms
File size    322.6 MB   64 MB    128 MB

It is easy to see that the hdf5 file type is the best one to choose for my purposes. I would like to note that the reason the hdf5 file is half the size of the .mat file is that the dtype in the .h5 file is a 32-bit float, whereas the .mat dtype is a 64-bit float. However, for stock options we generally only have (and care about) data out to two decimal places, so 32-bit precision is plenty.
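The dtype explanation checks out with simple arithmetic (ignoring container overhead):

```python
# raw storage for a 4000 x 4000 numeric array
n = 4000 * 4000
float32_mb = n * 4 / 1e6   # the .h5 dataset was created with dtype 'f' (32-bit)
float64_mb = n * 8 / 1e6   # savemat writes the float64 array as-is
```

That comes out to exactly 64 MB and 128 MB, matching the table.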

Automating the data retrieval


The final step in getting this database started was to automate the data retrieval process. To do this I used the popular UNIX scheduling tool cron. I run OS X 10.8 Mountain Lion, where cron is disabled by default. To fix this I simply ran the following command in the terminal:
sudo touch /etc/crontab
This command creates the /etc/crontab file (if it doesn't already exist) and gets it ready for use by cron. I am not going to give a detailed explanation of how to use cron here (I am still fairly new at it myself), but googling will turn up plenty of examples and tutorials. I will, however, give the line in my crontab file that executes the script; its five time fields mean it runs at 23:50 on weekdays (Monday through Friday):
50 23 * * 1,2,3,4,5 python ~/Documents/Python/pytools/Finance/db_cron.py
The next step was to write the script I would have cron call. This appears below.
"""
db_cron.py

Created Feb 6, 2013

Author: Spencer Lyon

File to be run automatically by cron each day to gather options data
"""
import datetime as dt
import pandas as pd
import numpy as np
from options import Options
import h5py as h5

# ticker_list.csv contains all tickers on nyse and nasdaq stock exchanges
tickers = pd.Series.from_csv('ticker_list.csv')
num_ticks = tickers.size  # used to print status

months = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
          7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}

now = dt.datetime.now()  # Get current time
c_month = months[now.month]  # Get current month
c_day = str(now.day)  # Get current day
c_year = str(now.year)  # Get current year

f = h5.File('/Volumes/Secondary HD/options_db.h5', 'a')  # open (or create) database file
year = f.require_group(c_year)  # Make hdf5 group for year
month = year.require_group(c_month)  # Make hdf5 group for month
day = month.require_group(c_day)  # Make hdf5 group for day

num = 0
for i in tickers:
    option = Options(i)

    # NOTE: this functionality is forthcoming in pandas 0.11
    raw_calls = option.get_forward_data(months=3, call=1, put=0,
                                        near=1, above_below=6)
    raw_puts = option.get_forward_data(months=3, call=0, put=1,
                                        near=1, above_below=6)

    if raw_calls.values.any():  # Check if any calls were returned
        try:  # Try to add item to file.
            # This block (and below for puts) does the following:
            #   - Get unique expiry dates
            #   - make hdf5 group for ticker
            #   - Get options data for each expiry
            #   - Put each set of expiry data in unique hdf5 dataset

            expiries = raw_calls.Expiry.unique().astype(str)
            tick = day.require_group(i)

            for ex in expiries:
                data = raw_calls[raw_calls.Expiry == ex]
                i_calls = data[['Strike', 'Last', 'Vol']]
                i_calls.Vol = i_calls.Vol.str.replace(',', '')

                ex_m_y = ex[:2] + ex[-3:]
                call_ds = tick.require_dataset('C' + i + ex_m_y,
                                               i_calls.shape, float)
                call_ds[...] = i_calls.astype(np.float32)
        except:  # If it doesn't work just pass
            pass

    if raw_puts.values.any():  # Check if any puts were returned
        try:
            expiries = raw_puts.Expiry.unique().astype(str)
            tick = day.require_group(i)

            for ex in expiries:
                data = raw_puts[raw_puts.Expiry == ex]
                i_puts = data[['Strike', 'Last', 'Vol']]
                i_puts.Vol = i_puts.Vol.str.replace(',', '')
                ex_m_y = ex[:2] + ex[-3:]
                put_ds = tick.require_dataset('P' + i + ex_m_y,
                                              i_puts.shape, float)
                put_ds[...] = i_puts.astype(np.float32)
        except:
            pass

    # status update
    num += 1
    if num % 500 == 0:
        print "just finished %s of %s" % (str(num), str(num_ticks))

f.close()  # Close file

I have cron run this script at a specified time each weekday to populate the hdf5 file. The resulting file has a nested structure like this:
/Year {
    /Month {
        /Day {
            /Ticker {
                CTICKmm-yy
                PTICKmm-yy
            }
        }
    }
}
The notation CTICKmm-yy stands for a call option (C), a given ticker (TICK), and the expiry of the option (mm-yy). Each dataset has three columns: strike price, last price on the option contract, and volume over the last trading day.
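As an illustration of that layout, the following sketch builds a toy file with the same nesting and walks it; the file name and dummy data here are mine, not part of the database script:

```python
import h5py
import numpy as np

# build a tiny file with the same Year/Month/Day/Ticker nesting
with h5py.File('demo_options_db.h5', 'w') as f:
    tick = f.require_group('2013/Feb/6/AAPL')   # nested groups in one call
    tick.create_dataset('CAAPL02-13', data=np.zeros((3, 3), dtype='f4'))

# walk the hierarchy; visit() calls us with every group/dataset path
with h5py.File('demo_options_db.h5', 'r') as f:
    paths = []
    f.visit(paths.append)
```

visit() reports paths like '2013/Feb/6/AAPL/CAAPL02-13', which is exactly the nesting shown above.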

After running this script for one night, the resulting hdf5 data file was 7.648648 MB. If I let this script run each business day for a year, the final file size would be under 2 GB. Not bad!
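The back-of-the-envelope projection, with an assumed 252 trading days per year:

```python
daily_mb = 7.65      # approximate growth per nightly run, from above
trading_days = 252   # rough count of U.S. trading days in a year
yearly_gb = daily_mb * trading_days / 1000.
```

That is roughly 1.9 GB, consistent with "under 2 GB".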

If you would like more information on how I collect ticker names or what Options functionality is in pandas 0.10 or earlier leave a comment and I'll do my best to respond.

January 21, 2013

Interpolation

One of the key tools in an economist's toolbox is interpolation. There are many types: linear, polynomial, Chebyshev, and splines. One way interpolation is employed in macroeconomics is to estimate policy rules. Given the values of the state variables and a shock, the policy rule tells you the value each state variable will take in the next period. This is then used in simulations. The following is an implementation of the cubic spline described by Christian Habermann and Fabian Kindermann in "Multidimensional Spline Interpolation: Theory and Applications" (2007).
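As a toy illustration of the policy-rule idea (the grid and policy function below are made up), linear interpolation with numpy looks like this:

```python
import numpy as np

# a hypothetical "policy rule" sampled on a coarse grid of the state (capital)
k_grid = np.linspace(0.1, 10.0, 25)
policy = 0.9 * k_grid ** 0.3        # made-up next-period choice at each grid point

# linear interpolation recovers the rule at an off-grid state
k_next = np.interp(3.7, k_grid, policy)
```

Cubic splines like the one below do the same job with a smoother, more accurate fit.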

Here are links to the implemented code. The 2-dimensional code calls the 1-dimensional code. We now use la.solve in the method.

1 dimensional interpolation code
2 dimensional interpolation code

The files below are for testing the code. They give you an example of how you could implement it. (Obviously you won't know the functions beforehand, but if you have x and y, or x, y, and z in this form, it should work.) The current test for the 2-dimensional spline calculates it over a 100x100 grid. This means it is solving for 10,000 points and so takes a little while, but solving for any one point is much faster.

The 1d test takes a vector of x values and a vector of y values such that y = sin(x) for each given x. It then solves for the interpolation coefficients. feval then evaluates the spline at any x within the range of the original x's.
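The posted 1d code isn't reproduced here, but scipy's CubicSpline gives an analogous sketch of the same workflow (scipy is my substitution, not the files above):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# fit a cubic spline through sin(x) sampled on a coarse grid
x = np.linspace(0, 2 * np.pi, 30)
y = np.sin(x)
spline = CubicSpline(x, y)

# evaluating off the grid stays close to the true function
err = abs(float(spline(1.234)) - np.sin(1.234))
```

With 30 points the off-grid error is already well below 1e-3, which is the same qualitative behavior feval exhibits.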

The 2d code takes two vectors x and y and a matrix of z's such that f(x_i, y_j) = z_ij. In a similar fashion it implements interpolation on the grid created by the Cartesian product. Then for any pair (x, y) in the domain of x and the domain of y we can calculate its value z = f(x, y); feval2 implements this evaluation.

testspline1d
testspline2d
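Similarly, an analogous 2d sketch with scipy's RectBivariateSpline, using the same test function as the surfaces below (again a substitution, not the posted code):

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

x = np.linspace(-2, 2, 50)
y = np.linspace(-2, 2, 50)
z = np.sin(x)[:, None] - np.cos(y ** 2)[None, :]   # z[i, j] = f(x_i, y_j)

f = RectBivariateSpline(x, y, z)   # cubic in each direction by default

# evaluate off the grid and compare against the true function
err = abs(f(0.3, 0.7)[0, 0] - (np.sin(0.3) - np.cos(0.7 ** 2)))
```

As in the 1d case, off-grid evaluation at a single point is cheap once the coefficients are fit.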

It can be made even more accurate using the rescaling method presented in the paper. Below are the surface obtained from the true values of the function f(x,y) = sin(x) - cos(y^2) and the surface obtained by evaluating our spline function on a 100x100 grid.




If you have questions, let us know. We recently updated the spline code so that it is a class; the documentation should be sufficient. Interp_Class.