January 23, 2012

Estimating and Sampling from KDE's

Some popular statistical packages have functionality to plot Kernel Density Estimates, but lack a way to randomly sample from the resulting distribution. While visualizing the distribution of your variable is important, any worthwhile statistical analysis requires an ability to sample from the distribution.

To estimate and then sample from your KDE, import the following:

import scipy as sp
from scipy import stats as st
import matplotlib.pyplot as plt

Then load your data as a Python array. A one line way to do this is the Scipy "loadtxt" command, which works if the data file is a .txt file and consists entirely of numeric values.

maindat = sp.loadtxt('datafile.txt')

The resulting array "maindat" can be any dimension (m x n), but I assume for this example that is a single column vector (m x 1). Estimating and sampling from the KDE is simple. Here I estimate a KDE and resample 10000 points from the distribution (stored as a two-dimensional array).

my_kde = st.gaussian_kde(maindat)
sample = my_kde.resample(10000)

To plot the resulting KDE, one might use something akin to this:

x    = sp.linspace(sample[0,0], sample[0,-1],400)
fig  = plt.figure()
fig1 = fig.add_subplot(111)
plt.plot(x, my_kde(x),'b--')
plt.xlabel('Title Here ')
plt.show()

No comments:

Post a Comment