### Kernel Density Estimation

As mentioned above, histograms are very sensitive to:

* your choice of bin size
* bin placement

Another alternative to histograms and binning is to place a kernel at each data point and sum the contributions of all kernels at each point of an evaluation grid. This results in a much smoother estimate of the distribution that does not depend on bin size or placement. The width of the kernel, however, does affect the shape of the estimate (a comparison sketch appears at the end of this section). One of the most commonly used kernels is the Gaussian kernel, and we will see how it can be used to get a visual estimate of the data distribution, as shown below.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# x is the 1-D array of data points introduced earlier
x_range = np.linspace(0, 5, 100)
# Place a Gaussian at each data point and sum the contributions;
# dividing by len(x) normalizes the sum into a proper density estimate
density_estimate = sum(norm(elem, scale=0.2).pdf(x_range) for elem in x) / len(x)
plt.plot(x_range, density_estimate)
plt.yscale('log')  # try linear and log scale
plt.title('KDE in log scale vs. x')
```

### Scikit-learn to Estimate the Density

You can also use the Scikit-learn implementation of the kernel density estimator. Valid kernels are `'gaussian'`, `'tophat'`, `'epanechnikov'`, `'exponential'`, `'linear'`, and `'cosine'`.

```python
from sklearn.neighbors import KernelDensity

kde = KernelDensity(bandwidth=0.2, kernel='gaussian')
kde.fit(x[:, None])  # scikit-learn expects a 2-D array of shape (n_samples, n_features)
logprob = kde.score_samples(x_range[:, None])
plt.plot(x_range, logprob)
plt.title('Log probability of KDE vs. x')
```
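Since `score_samples` returns log densities, exponentiating recovers the density itself. This short follow-up, a sketch continuing the block above with the same `x_range` and `logprob`, plots it on a linear scale so it can be compared directly with the scipy.stats version:

```python
# score_samples returns log density; exponentiate to recover the density
density = np.exp(logprob)
plt.plot(x_range, density)
plt.title('KDE density (scikit-learn) vs. x')
```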
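To make the earlier point about kernel width concrete, here is a minimal sketch (assuming the same `x` and `x_range` as before; the bandwidth values are illustrative, not tuned) that overlays estimates for a few bandwidths. Small values produce spiky, undersmoothed curves, while large values wash out detail:

```python
# Overlay scikit-learn KDEs for several bandwidths to see the smoothing effect
for bw in [0.05, 0.2, 1.0]:  # illustrative values, not tuned for any particular data
    kde = KernelDensity(bandwidth=bw, kernel='gaussian').fit(x[:, None])
    plt.plot(x_range, np.exp(kde.score_samples(x_range[:, None])),
             label=f'bandwidth={bw}')
plt.legend()
plt.title('Effect of bandwidth on the density estimate')
```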
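If you would rather not pick the bandwidth by eye, one common approach (a sketch, not part of the original walkthrough) is a cross-validated grid search over candidate bandwidths using scikit-learn's `GridSearchCV`, which scores each candidate by held-out log-likelihood; the candidate grid `np.logspace(-1, 0, 20)` here is an arbitrary illustrative choice:

```python
from sklearn.model_selection import GridSearchCV

# Search candidate bandwidths by cross-validated log-likelihood
params = {'bandwidth': np.logspace(-1, 0, 20)}  # candidate grid is an assumption
grid = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=5)
grid.fit(x[:, None])
print(grid.best_params_['bandwidth'])
```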