Descriptive Statistics - Vivek's Digital Garden

# Descriptive Statistics

Descriptive statistics summarize data so that we have a quantitative way to understand it. They also let us understand and visualize the data qualitatively and draw conclusions about its nature. Descriptive statistics are applied to the data in hand as if it were the whole population, so they can report measures such as the mean and variance of that data. They do not allow us to make predictions about data that we have not analyzed.

## Summary statistics

1. Frequency table (`value_counts` in Python, `table` in R)
    1. count/percentage
2. Bar chart/Pie chart
3. Histograms/Kernel density plots

# What to look for

1. Outliers
2. Correlations

## Relationship between variables

Categorical vs numerical
- side-by-side box plot

Categorical vs categorical
1. bar charts
2. mosaic plot

# Central Tendency

$\mu$ - population mean

$\hat{\mu}$ or $\bar{X}$ - sample mean

## Mean and Median

1. Mean (arithmetic)
    1. sensitive to outliers
    2. trimmed mean
        1. `mean(col, trim=0.1)` in R
        2. `scipy.stats.trim_mean(a, proportiontocut=0.1)` in Python
2. Median (the 2nd quartile, 50%)
    1. 1st quartile (25%)
    2. 3rd quartile (75%)
3. trimmed mean (e.g. drop the top 5% and bottom 5%)
4. geometric mean

## Variance

- $\mathrm{Var}(aX)=a^2\,\mathrm{Var}(X)$
- $\mathrm{Var}(a+X)=\mathrm{Var}(X)$
- $\mathrm{Var}(X_1+X_2)=\mathrm{Var}(X_1)+\mathrm{Var}(X_2)$ if $X_1$ and $X_2$ are independent
- $\mathrm{Var}(X_1-X_2)=\mathrm{Var}(X_1)+\mathrm{Var}(X_2)$ if $X_1$ and $X_2$ are independent (not a typo: $\mathrm{Var}(-X_2)=(-1)^2\,\mathrm{Var}(X_2)$, so the variances still add)

# Standard Deviation

The standard deviation is the square root of the variance. The sample standard deviation (an estimate of the population $\sigma$) is

$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{X})^2}{n-1}}$

Why $(n-1)$?
- Because of degrees of freedom.
- The deviations are measured from the sample mean $\bar{X}$, which was itself calculated from the same $n$ data points. Once the mean and any $n-1$ of the points are known, the last point is fixed: it equals $n\bar{X}$ minus the sum of the other points. For example, if three points have mean 4 and two of them are 2 and 5, the third must be $3 \times 4 - 2 - 5 = 5$.
- So only $n-1$ of the values are free to vary, i.e. there are only $n-1$ degrees of freedom.

# Correlation

Outliers distort Pearson's correlation; Spearman's is somewhat less sensitive to them.

## Pearson's Correlation

Measures the strength and direction of a linear association between two variables.

Uses the actual values (their deviations from the means), so a single extreme point can dominate the result.

## Spearman's Correlation

Measures the strength and direction of a *monotonic* association between two *ranked* variables.

Ranks the values of each variable and computes the correlation coefficient on the ranks instead of the raw values.

## Kendall's Correlation

# Covariance

Less interesting in applied statistics: its magnitude depends on the units of the variables, whereas correlation is the unit-free version.
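
The remark in the Correlation section that an outlier affects Pearson's correlation more than Spearman's can be checked with a small sketch. The helper names (`pearson`, `ranks`, `spearman`) and the data are made up for illustration, and the code uses only the standard library; in practice `scipy.stats.pearsonr` and `scipy.stats.spearmanr` would be the usual choice.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r: co-variation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


def ranks(values):
    """Rank from 1 (smallest) to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks instead of the raw values."""
    return pearson(ranks(x), ranks(y))


# Made-up data: y = 2x exactly, then a single outlier point is appended.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 16]
x_out = x + [9]
y_out = y + [-40]

r_clean, rho_clean = pearson(x, y), spearman(x, y)
r_out, rho_out = pearson(x_out, y_out), spearman(x_out, y_out)

print(f"clean:        pearson={r_clean:.3f}  spearman={rho_clean:.3f}")
print(f"with outlier: pearson={r_out:.3f}  spearman={rho_out:.3f}")
```

On this toy data both coefficients are 1.0 before the outlier is added. With the outlier, Pearson's r falls to about -0.30 because the size of the extreme value enters the calculation directly, while Spearman's rho falls only to 0.40 because the outlier contributes just its rank.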