January 23, 2012

Python function for "collapsing" panel data (the hard way)

[Note: If you're using Python for data work, you can do this super easily via Pandas. I wrote this back in the stone age before knowing about Pandas. I leave this post up as a demonstration of using matrix algebra to keep loops out of your code. But really, Pandas is awesome so go learn it!]

Suppose we have data in the following form:

PersonID    Bread     Meat          Amount
0             1         1             10
0             1         1             14
0             2         1             30
0             3         0             6
1             1         0             20
1             2         0             15

We call this panel data (or longitudinal data). Many statistical packages (e.g. STATA) have a "collapse" function that will reduce the panel data to a unique entry for each member of the panel with the corresponding column sum/average/min/max for the entries of the panel member. Using such a command to sum over panel members would reduce the table given above to the following:

PersonID    Bread     Meat        Amount
0            7        3            30
1            3        0            35

Looping over panel members works, but could become very cumbersome when the dataset
is large. We can speed things up considerably by thinking about this operation in terms of matrix multiplication (you will want to assign numeric values to any strings). Consider the following matrix:

This gives us exactly the matrix we want for all columns except our PersonID column, but replacing the PersonID with the correct values is straightforward. Call the left-most matrix in the figure above C and the data matrix A. The column with the panel member identifying info is the "index" column. To generate C in python, we use:

C = A[:,index] == Unique values of A[:,index]

It is important to note that this only makes C if "A[:,index]" is a row vector and "Unique values of A[:,index]" is a column vector.

We can now collapse our matrix with the following (PS, sort your data first!):

def sumcollapse(A,index):
    A1 = sp.atleast_2d(sp.unique(A[:,index]))
    C = A[:,index] ==A1.T
    B  = sp.dot(C,A)
    #replace the index row with the true panel member values
    B[:,index] = A1 
    return B       

Similarly to average over panel members:

def meancollapse(A,index):
    A1 = sp.atleast_2d(sp.unique(A[:,index]))
    C = A[:,index] ==A1.T
    # make a "dividing vector" based on the number of entries for each panel member
    C1 = sp.atleast_2d(sp.sum(C,axis=1))*1.0
    B  = sp.dot(C,A)
    B  = sp.divide(B,C1.T)
    return B

To take the first row for each panel member:

def firstcollapse(A,index):
    A1 = sp.atleast_2d(sp.unique(A[:,index]))
    C = A[:,index] == A1.T
    B  = sp.argmax(C, axis=1)
    return A[B]

Hopefully you see the pattern.  Manipulating this to find minima/maxima is straightforward. Don't forget to sort your data first!

A Basic Object Oriented Programming Exercise

A primary advantages of using Python over other computational programs (i.e. Matlab, Maple, etc.) is the ease of  object oriented programming.  Thus Python combines the computational muscle of your favorite number crunching program with the flexibility of languages like C or Java. This post is not meant to be a step-by-step guide to object oriented programming, but rather a quick overview.

Classes: I think of classes as the machines that make objects. It specifies the methods and attributes that you want all of your objects to have. The object that we make below is a "player" in a game of rock-paper-scissors. We create new player objects with the first of our methods that begins with the line:

def __init__(self,name):

Every "self.xxxx" line included in the method specifies an "attribute", which can be thought of as a "rule for being an object of this sort." In the example below we require that each player has a name, strategy, and a tally of their wins and losses. If I make and object with the name Josh

Josh = player('Josh')

and I want to see how many wins he has, I input:


Because this is an attribute instead of a method, I do not need parenthesis [i.e. Josh.wincount( )].

The example below includes three methods that all player objects. Players can add to their win-count, they can train themselves so that their strategy is more competitive, and they can (based on the probabilities given in their strategy attribute) randomly generate rock, paper, or scissors. A few comments on these:
  • The players "rock,paper,scissors" strategy is represented by an array of three numbers between zero and one that sum to one. Players are initially given random strategy, but if you really think that rock will win every time you can give your player that strategy after he/she has been created. Each entry in the array represents the relative frequency for rock, paper, or scissors respectively. The train method improves the strategy by bringing it closer to ([1/3, 1/3, 1/3]) (this represents a Nash equilibrium for rock, paper, scissors).
  • To determine what "weapon" a player will choose to use based on a random number, we use the following line of code:

    choice         = sp.argmax(sp.cumsum(self.strategy)>k)

    Because our three strategy numbers sum to one, the sp.cumsum() command gives us an array that divides the interval (0,1) into three "buckets" of sizes equal to the three strategy numbers. The >k command turns this array into "true" and "false" entries, depending on whether k exceeded the upper limit of the bucket. The sp.argmax command gives the index for the first true entry.
    Inheritance: One advantage of creating classes to make our objects is that we can very quickly create special cases. In our program, we want to make a player that will always win. To do this, we include a "chuckNorris" class that inherits the methods of the "playerClass" class. To make the "chuckNorris" class unique, we override the "fight" method. Thus a player of the "chuckNorris" class will still have a random strategy and will keep track of his/her wins and losses. The neat part about inheritance is that we don't have to specify any of those common elements.

    The entire code for making our players is below:

    class playerClass:
        # This is called whenever a new player is created
        def __init__(self):
            # Create and initiallized attributes.
            # Strategy is mostly random at first.
            # Strategy can improve their via the "train" method.
            # Strategy is a length three array that sums to one
            # Each entry in the array corresponds to the  
            # probability of choosing rock, paper, or scissors. 
            r = sp.random.rand()
            self.strategy   = sp.array([r,(1-r)/2.,(1-r)/2.])
            self.wincount   = 0
            self.losscount  = 0
        def addwin(self,w):
            #This method keeps tracks of wins and losses
            if w == 1:
                self.wincount    +=1
                if self.wincount ==5:
                    print "The winner is a Jedi Master!"
            if w == 0:
                self.losscount   += 1
                if self.losscount > 5:
                    print "Maybe the loser should get some training"
        def train(self):
            #This method improves the players strategy
            print('Hitting the gym!')
            correction = (self.strategy[sp.argmax(self.strategy)] -self.strategy[sp.argmin(self.strategy)] )/3.
            self.strategy[sp.argmin(self.strategy)] += correction
            self.strategy[sp.argmax(self.strategy)] -= correction
        def fight(self):
            #This method chooses rock, paper, or scissors 
            #It is determined probabilistically based on the players strategy
            k     = sp.random.rand()
            choice         = sp.argmax(sp.cumsum(self.strategy)>k)
            weapon         = "scissors"
            if choice     ==0:
                weapon     = "rock"
            if choice     ==1:
                weapon     = "paper"
            elif choice ==2:
                weapon     = "scissors"
            return weapon
    class chuckNorris(playerClass):
        def train(self):
            print self.strategy
        def fight(self):
            weapon = "Roundhouse kick to the face"
            return weapon

    To actually make players and pit them against each other, use the following code as a separate file ("rpsFile.py" if you want to run this code as is). One neat thing we do here is use recursion within the rockpaperscissors( ) method in the event of a draw. That reduces the number of rock/paper/scissors combinations we need to hard code.

    # This file creates player objects and pits them
    # against one another in fierce games of rock-paper-scissors
    # Those who win can become jedi masters
    import scipy as sp
    from rpsFile import playerClass as player
    from rpsFile import chuckNorris as killer
    ben      = player()
    jeff     = player()
    jared    = player()
    ryan     = player()
    jonathan = player()
    matt     = player()
    jess     = killer()
    # Now we hold the rockpaperscissors contest
    def rockpaperscissors(player1, player2):
        p1 = player1.fight(); print p1
        p2 = player2.fight(); print p2
        if p1 == "Roundhouse kick to the face":
            print "player1 wins"
        if p2 == "Roundhouse kick to the face":
            print "player2 wins"
        if p1 == "rock" and p2 =="scissors":
            print "player1 wins"
        if p1 == "rock" and p2 =="paper":
            print "player2 wins"
        if p1 == "paper" and p2 =="scissors":
            print "player2 wins"
        if p1 == "paper" and p2 =="rock":
            print "player1 wins"
        if p1 == "scissors" and p2 =="paper":
            print "player1 wins"
        if p1 == "scissors" and p2 =="rock":
            print "player2 wins"
        if p1 == p2:
            print "draw"

    It really is basic, and could be cleaner I guess, but here it is.

    Estimating and Sampling from KDE's

    Some popular statistical packages have functionality to plot Kernel Density Estimates, but lack a way to randomly sample from the resulting distribution. While visualizing the distribution of your variable is important, any worthwhile statistical analysis requires an ability to sample from the distribution.

    To estimate and then sample from your KDE, import the following:

    import scipy as sp
    from scipy import stats as st
    import matplotlib.pyplot as plt

    Then load your data as a Python array. A one line way to do this is the Scipy "loadtxt" command, which works if the data file is a .txt file and consists entirely of numeric values.

    maindat = sp.loadtxt('datafile.txt')

    The resulting array "maindat" can be any dimension (m x n), but I assume for this example that is a single column vector (m x 1). Estimating and sampling from the KDE is simple. Here I estimate a KDE and resample 10000 points from the distribution (stored as a two-dimensional array).

    my_kde = st.gaussian_kde(maindat)
    sample = my_kde.resample(10000)

    To plot the resulting KDE, one might use something akin to this:

    x    = sp.linspace(sample[0,0], sample[0,-1],400)
    fig  = plt.figure()
    fig1 = fig.add_subplot(111)
    plt.plot(x, my_kde(x),'b--')
    plt.xlabel('Title Here ')

    Log-normal Distributions: reconciling MATLAB and Python

    MATLAB users are familiar with the log-normal functions for calculating a pdf/cdf at a given point. The MATLAB syntax is lognpdf(x,mean,sigma) and logncdf(x,mean,sigma) for the pdf and cdf values respectively. Python users will find that the lognorm.pdf and lognorm.cdf functions in the scipy.stats library follow a different set of conventions, and consequently will give different outputs for the same pair of inputs (Python's can be transformed to get the same outputs as MATLAB). Here I have some simple values that generate pdf/cdf values for scalar inputs following the MATLAB conventions.

    Information relating how to transform the MATLAB inputs so that Python and MATLAB give the same output are given in the lognorm.pdf/cdf doc-strings.

    First for the pdf function...

    import scipy as sp
    def lognpdf(x,mean,sig):
        if x<0 or not sp.isfinite(x):
            pdf = 0
            a   = 1./(x*sig*sp.sqrt(2*sp.pi))
            pdf =  a*sp.exp(-(sp.log(x)-mean)**2/(2.*sig**2)) 
        return pdf

    and then the cdf function.

    import scipy as sp
    import math
    def logncdf(x,mean,sig):
        if x<0:
            cdf  = 0.
        elif sp.isposinf(x):
            cdf  = 1.
            z    = (sp.log(x)-mean)/float(sig)
            cdf  = .5*(math.erfc(-z/sp.sqrt(2)))
        return cdf

    You'll find that these give the same results as MATLAB.