January 23, 2012

Python function for "collapsing" panel data (the hard way)

[Note: If you're using Python for data work, you can do this super easily via Pandas. I wrote this back in the stone age before knowing about Pandas. I leave this post up as a demonstration of using matrix algebra to keep loops out of your code. But really, Pandas is awesome so go learn it!]

Suppose we have data in the following form:

PersonID    Bread     Meat          Amount
0             1         1             10
0             1         1             14
0             2         1             30
0             3         0             6
1             1         0             20
1             2         0             15

We call this panel data (or longitudinal data). Many statistical packages (e.g. Stata) have a "collapse" command that reduces panel data to a single entry per panel member, with each column replaced by the sum/average/min/max over that member's entries. Using such a command to sum over panel members would reduce the table given above to the following:

PersonID    Bread     Meat        Amount
0            7        3            60
1            3        0            35

Looping over panel members works, but can become very slow when the dataset is large. We can speed things up considerably by thinking about this operation in terms of matrix multiplication (you will want to assign numeric values to any strings first). Consider the following matrix product:

[1 1 1 1 0 0]   [0 1 1 10]   [0 7 3 60]
[0 0 0 0 1 1] x [0 1 1 14] = [2 3 0 35]
                [0 2 1 30]
                [0 3 0  6]
                [1 1 0 20]
                [1 2 0 15]

This gives us exactly the matrix we want for all columns except our PersonID column, but replacing the PersonID with the correct values is straightforward. Call the left-most matrix above C and the data matrix A. The column with the panel member identifying info is the "index" column. To generate C in Python, we use:

C = (A[:,index] == unique values of A[:,index])

It is important to note that this comparison only produces the matrix C if "A[:,index]" enters as a row vector and the unique values enter as a column vector: broadcasting then expands the comparison into a 2-D boolean matrix with one row per panel member.
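As a concrete, runnable sketch of this broadcasting trick on the sample data above (written in plain numpy, which exposes the same array functions scipy re-exported back then):

```python
import numpy as np

# Sample panel data: columns are PersonID, Bread, Meat, Amount
A = np.array([[0, 1, 1, 10],
              [0, 1, 1, 14],
              [0, 2, 1, 30],
              [0, 3, 0,  6],
              [1, 1, 0, 20],
              [1, 2, 0, 15]])
index = 0

ids = A[:, index]                 # IDs as a row vector, shape (6,)
uniq = np.unique(ids)[:, None]    # unique IDs as a column vector, shape (2, 1)
C = (ids == uniq).astype(int)     # broadcasting yields shape (2, 6)
print(C)
# [[1 1 1 1 0 0]
#  [0 0 0 0 1 1]]
```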

We can now collapse our matrix with the following (PS, sort your data first!):

import scipy as sp  # scipy's top-level namespace mirrored numpy's at the time; plain numpy works the same way

def sumcollapse(A, index):
    # one row of booleans per unique panel member
    A1 = sp.atleast_2d(sp.unique(A[:, index]))
    C = A[:, index] == A1.T
    B = sp.dot(C, A)
    # replace the index column with the true panel member values
    B[:, index] = A1
    return B
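Running the same logic on the sample table (written here with numpy so it is self-contained) reproduces the collapsed table from the top of the post:

```python
import numpy as np

A = np.array([[0, 1, 1, 10],
              [0, 1, 1, 14],
              [0, 2, 1, 30],
              [0, 3, 0,  6],
              [1, 1, 0, 20],
              [1, 2, 0, 15]])
index = 0

uniq = np.unique(A[:, index])
C = (A[:, index] == uniq[:, None]).astype(int)
B = C @ A            # column sums within each panel member
B[:, index] = uniq   # restore the true PersonID values
print(B)
# [[ 0  7  3 60]
#  [ 1  3  0 35]]
```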

Similarly to average over panel members:

def meancollapse(A, index):
    A1 = sp.atleast_2d(sp.unique(A[:, index]))
    C = A[:, index] == A1.T
    # a "dividing vector": the number of entries for each panel member
    C1 = sp.atleast_2d(sp.sum(C, axis=1)) * 1.0
    B = sp.dot(C, A)
    B = sp.divide(B, C1.T)
    return B
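Pushing the sample data through the averaging version (again a plain-numpy sketch; note that averaging the index column already recovers the IDs, so no fix-up step is needed):

```python
import numpy as np

A = np.array([[0, 1, 1, 10],
              [0, 1, 1, 14],
              [0, 2, 1, 30],
              [0, 3, 0,  6],
              [1, 1, 0, 20],
              [1, 2, 0, 15]], dtype=float)
index = 0

uniq = np.unique(A[:, index])
C = (A[:, index] == uniq[:, None]).astype(float)
counts = C.sum(axis=1, keepdims=True)   # entries per panel member
B = (C @ A) / counts
# B holds [[0, 1.75, 0.75, 15.0],
#          [1, 1.5,  0.0,  17.5]]
```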

To take the first row for each panel member:

def firstcollapse(A, index):
    A1 = sp.atleast_2d(sp.unique(A[:, index]))
    C = A[:, index] == A1.T
    # argmax returns the position of the FIRST True in each row
    B = sp.argmax(C, axis=1)
    return A[B]

Hopefully you see the pattern. Adapting this to find minima or maxima is straightforward. Don't forget to sort your data first!
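For instance, a hypothetical maxcollapse (my name, not from the post) can reuse the boolean mask C, swapping the dot product for a masked maximum; here is a sketch in plain numpy:

```python
import numpy as np

def maxcollapse(A, index):
    """Collapse to the column-wise maximum for each panel member."""
    A1 = np.unique(A[:, index])
    C = A[:, index] == A1[:, None]   # (members, rows) boolean mask
    # Hide other members' rows with -inf, then take column maxima
    masked = np.where(C[:, :, None], A[None, :, :].astype(float), -np.inf)
    B = masked.max(axis=1)
    B[:, index] = A1                 # restore the true IDs
    return B

A = np.array([[0, 1, 1, 10],
              [0, 1, 1, 14],
              [0, 2, 1, 30],
              [0, 3, 0,  6],
              [1, 1, 0, 20],
              [1, 2, 0, 15]])
print(maxcollapse(A, 0))
# person 0: Bread 3, Meat 1, Amount 30; person 1: Bread 2, Meat 0, Amount 20
```

Minima work the same way with +inf and .min(axis=1). No sorting is needed for this one, since the mask does the grouping.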


  1. This can be done with pandas!

    These commands will give you powerful/flexible panel data features like the ones described.

    First, import pandas and create the data set (sorry I don't know how to format code in a comment):

    In [1]: import pandas as pd

    In[2]: data = pd.DataFrame([[0,1,1,10],[0,1,1,14],[0,2,1,30],[0,3,0,6],[1,1,0,20],[1,2,0,15]], columns = ['PersonID', 'Bread', 'Meat', 'Amount'])

    In[3]: data
    PersonID Bread Meat Amount
    0 0 1 1 10
    1 0 1 1 14
    2 0 2 1 30
    3 0 3 0 6
    4 1 1 0 20
    5 1 2 0 15

    In[4]: grouped = data.groupby('PersonID')

In[5]: grouped.sum()
          Bread  Meat  Amount
PersonID
0             7     3      60
1             3     0      35

In[6]: grouped.mean()
          Bread  Meat  Amount
PersonID
0          1.75  0.75    15.0
1          1.50  0.00    17.5

In[7]: grouped.std()
             Bread  Meat     Amount
PersonID
0         0.957427   0.5  10.519823
1         0.707107   0.0   3.535534

In[8]: grouped.first()
          Bread  Meat  Amount
PersonID
0             1     1      10
1             1     0      20

    The grouped object has over 25 standard functions implemented. It also has the instance method 'apply' that allows the user to pass in any arbitrary function for more complex comparisons.
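For example, a custom per-group reduction such as the per-column range (max minus min) can be passed in; this sketch uses agg, which applies the function column by column within each group (the example is mine, not from the original comment):

```python
import pandas as pd

data = pd.DataFrame([[0, 1, 1, 10], [0, 1, 1, 14], [0, 2, 1, 30],
                     [0, 3, 0, 6], [1, 1, 0, 20], [1, 2, 0, 15]],
                    columns=['PersonID', 'Bread', 'Meat', 'Amount'])

# agg runs the lambda on each column within each PersonID group
ranges = data.groupby('PersonID').agg(lambda s: s.max() - s.min())
# ranges.loc[0] -> Bread 2, Meat 1, Amount 24
# ranges.loc[1] -> Bread 1, Meat 0, Amount 5
```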

    If you are doing any kind of data analysis in python and you aren't using pandas you are working too hard!

    1. One more thing, the objects all look much prettier when printed at the command line or in an ipython notebook. blogspot removed all the formatting/spacing so it mushed it all together.