# Descriptive Statistics
Descriptive statistics summarize data so that we have a quantitative way to understand it. They also let us understand and visualize the data qualitatively and draw conclusions about its nature. Descriptive statistics are applied to a population (the data we actually have) and hence can provide measures such as the mean and variance of that data. They do not allow us to make predictions about data that we have not analyzed.
## Summary statistics
1. Frequency table (`value_counts` in Python, `table` in R)
    1. count/percentage
2. Bar chart/pie chart
3. Histograms/kernel density plots
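
A frequency table with counts and percentages can be sketched with the standard library alone. The `colors` data below is made up for illustration; with pandas, `Series.value_counts()` and `Series.value_counts(normalize=True)` should give the same counts and proportions.

```python
from collections import Counter

# Hypothetical categorical column, made up for illustration.
colors = ["red", "blue", "red", "green", "blue", "red"]

counts = Counter(colors)        # frequency table: value -> count
total = sum(counts.values())    # number of observations

# Percentage of observations that fall in each category.
percentages = {value: 100 * n / total for value, n in counts.items()}

# Print the table, most frequent category first.
for value, n in counts.most_common():
    print(f"{value:<6} {n:>3} {percentages[value]:5.1f}%")
```

The same counts are what a bar chart or pie chart would display.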
# What to look for
1. Outliers
2. Correlations
## Relationship between variables
Categorical vs numerical
- side-by-side box plot

Categorical vs categorical
1. bar charts
2. mosaic plot
# Central Tendency
$\mu$ - population mean
$\hat{\mu}$ or $\bar{X}$ - sample mean
## Mean
1. Mean (arithmetic)
    1. sensitive to outliers
        1. trimmed mean
            1. `mean(col, trim=0.1)` in R
            2. `scipy.stats.trim_mean(a, proportiontocut=0.1)` in Python
2. Median (the 2nd quartile, 50%)
    1. 1st quartile (25%)
    2. 3rd quartile (75%)
3. trimmed mean (e.g. drop top 5% and bottom 5%)
4. geometric mean
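
To see how these measures react to an outlier, here is a minimal sketch using Python's `statistics` module. The sample is made up, and `trimmed_mean` is a hand-written illustrative helper (not a library function) that drops the same proportion of points from each end.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the points from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)    # points to cut per side
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Made-up sample with one obvious outlier (1000).
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

mean = statistics.mean(data)              # pulled far up by the outlier
median = statistics.median(data)          # barely affected
trimmed = trimmed_mean(data, 0.1)         # drops the 2 and the 1000
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
geo = statistics.geometric_mean(data)     # requires positive values

print(f"mean={mean}, median={median}, trimmed={trimmed}")
print(f"quartiles: {q1}, {q2}, {q3}; geometric mean: {geo:.2f}")
```

The arithmetic mean lands above 100 even though nine of the ten values are at most 10, while the median and the trimmed mean stay near the bulk of the data.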
## Variance
- $VAR(aX)=a^2VAR(X)$
- $VAR(a+X)=VAR(X)$
- $VAR(X_1+X_2)=VAR(X_1)+VAR(X_2)$ if $X_1$ and $X_2$ are independent
- $VAR(X_1-X_2)=VAR(X_1)+VAR(X_2)$ if $X_1$ and $X_2$ are independent (not a typo: $VAR(-X_2)=(-1)^2VAR(X_2)=VAR(X_2)$)
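
These identities can be checked exactly on small discrete distributions. The sketch below assumes two made-up variables whose listed values are equally likely; enumerating every pair with `itertools.product` gives the joint distribution of two independent variables, and `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction
from itertools import product
from statistics import pvariance

# Two small made-up distributions; each listed value is equally likely.
x1 = [Fraction(v) for v in (1, 2, 3, 6)]
x2 = [Fraction(v) for v in (0, 10)]
a = Fraction(3)

var_x1 = pvariance(x1)
var_x2 = pvariance(x2)

# Scaling multiplies the variance by a^2; shifting leaves it unchanged.
var_scaled = pvariance([a * v for v in x1])
var_shifted = pvariance([a + v for v in x1])

# Independence: every (x1, x2) pair is equally likely, so the Cartesian
# product is the exact joint distribution.
pairs = list(product(x1, x2))
var_sum = pvariance([u + v for u, v in pairs])
var_diff = pvariance([u - v for u, v in pairs])

print(var_x1, var_x2, var_scaled, var_shifted, var_sum, var_diff)
```

Both the sum and the difference have variance `var_x1 + var_x2`; subtracting a variable never reduces the spread when the two are independent.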
# Standard Deviation
$s = \sqrt{\frac{\sum_{i=1}^{n}{(x_i-\bar{X})^2}}{n-1}}$ (sample standard deviation; the population $\sigma$ divides by $n$ and uses $\mu$)
Why $(n-1)$?
- because of degrees of freedom
- the deviations are measured from the sample mean, which was already calculated from the same $n$ data points. Once the mean and any $n-1$ of the points are known, the last data point is fixed: it equals mean × count minus the sum of the other data points. So only $n-1$ degrees of freedom remain.
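
A small numeric sketch of the degrees-of-freedom argument, on a made-up sample: the deviations from the sample mean always sum to zero, so the last one can be recovered from the others. The hand-computed values are compared against `statistics.stdev` (divides by $n-1$) and `statistics.pstdev` (divides by $n$).

```python
import math
import statistics

data = [4, 8, 6, 5, 3]    # made-up sample
n = len(data)
mean = sum(data) / n

# Deviations from the sample mean sum to zero, so the last deviation
# is fully determined by the first n-1 of them.
deviations = [x - mean for x in data]
last_from_rest = -sum(deviations[:-1])

sum_sq = sum(d ** 2 for d in deviations)
sample_sd = math.sqrt(sum_sq / (n - 1))    # divides by n-1
population_sd = math.sqrt(sum_sq / n)      # divides by n

print(f"last deviation: {deviations[-1]:.3f}, recovered: {last_from_rest:.3f}")
print(f"sample sd: {sample_sd:.4f}, population sd: {population_sd:.4f}")
```

Dividing by $n-1$ always gives the larger of the two values; the gap shrinks as $n$ grows.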
# Correlation
Outliers distort Pearson's correlation; Spearman's is somewhat less sensitive because it works on ranks rather than raw values.
## Pearson's Correlation
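Measures the strength and direction of a linear association between two variables.
It is computed from the actual values (each point's deviation from its mean), so a single extreme point can dominate the result.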
## Spearman's Correlation
Measures the strength and direction of a *monotonic* association between two *ranked* variables.
Each variable is replaced by its ranks (its positions in sorted order), and Pearson's correlation is then calculated on the ranks.
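
The difference between the two coefficients can be sketched in plain Python. The helper names `pearson`, `ranks`, and `spearman` and the data are illustrative assumptions; in practice `scipy.stats.pearsonr` and `scipy.stats.spearmanr` would be the usual choice.

```python
import math

def pearson(x, y):
    """Pearson's r from deviations around each mean."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y_cubic = [v ** 3 for v in x]             # monotonic but not linear
y_outlier = [1, 2, 3, 4, 5, 6, 7, -50]    # linear except for one outlier

print("cubic:  ", pearson(x, y_cubic), spearman(x, y_cubic))
print("outlier:", pearson(x, y_outlier), spearman(x, y_outlier))
```

On the cubic data Spearman's coefficient is 1 while Pearson's is below 1. On the outlier data a single point turns Pearson's coefficient negative, whereas Spearman's stays positive (about 0.33), though it is still affected.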
## Kendall's Correlation
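
Kendall's tau is also rank-based: it compares every pair of observations and counts how many pairs are ordered the same way in both variables (concordant) versus the opposite way (discordant). The sketch below is the simple tau-a variant, which assumes no tied values; `kendall_tau` is an illustrative helper name and the data is made up. `scipy.stats.kendalltau` offers a tie-aware version.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumes no tied values in x or y.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1    # pair ordered the same way in x and y
        elif sign < 0:
            discordant += 1    # pair ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

x = [1, 2, 3, 4, 5, 6, 7, 8]
y_outlier = [1, 2, 3, 4, 5, 6, 7, -50]    # one outlier at the end

print(kendall_tau(x, y_outlier))
```

Here the outlier makes 7 of the 28 pairs discordant, so tau is (21 - 7) / 28 = 0.5.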
# Covariance
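Less interesting in applied statistics: its magnitude depends on the units of the variables, so it is hard to interpret on its own. Correlation is covariance rescaled to lie between -1 and 1.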
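
A short sketch of why covariance is harder to interpret than correlation: changing the units of one variable rescales the covariance but leaves the correlation untouched. The height and weight values and the helper names `covariance` and `correlation` are made up for illustration.

```python
import math

def covariance(x, y):
    """Sample covariance (divides by n-1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson's r: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

height_cm = [150, 160, 170, 180]           # made-up data
height_m = [h / 100 for h in height_cm]    # same heights in metres
weight_kg = [50, 58, 63, 75]

cov_cm = covariance(height_cm, weight_kg)
cov_m = covariance(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)

print(f"covariance: {cov_cm:.2f} (cm) vs {cov_m:.4f} (m)")
print(f"correlation: {corr_cm:.4f} (cm) vs {corr_m:.4f} (m)")
```

The covariance differs by a factor of 100 between the two unit choices, while the correlation is the same in both.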