Stochastic Process - a process that produces results that vary by chance
## Terminology
1. outcomes - mutually exclusive and exhaustive list of possible results in a model
2. events - sets containing zero or more outcomes; we define the events we are interested in
3. sample space - set of all possible outcomes (must be collectively exhaustive and mutually exclusive).
4. event space - the set of events under consideration: a collection of subsets of the sample space that includes the sample space itself and the empty event
5. probability space - the combination of a sample space, an event space, and a probability function; the probability function is a real-valued function mapping events to the interval $[0,1]$
## Axioms of Probability (Kolmogorov Axioms)
1. the probability of an event is a real number on the interval [0,1]
$0 \le P(E) \le 1$
2. the probability that some outcome in the sample space occurs is 1: $P(S) = 1$, where $S$ is the sample space
3. any countable collection of mutually exclusive events satisfies $P(\bigcup_{i=1}^\infty E_i) = \sum_{i=1}^\infty P(E_i)$
# Addition Rule
https://www.varsitytutors.com/hotmath/hotmath_help/topics/addition-rule-of-probability
$P(A\text{ or }B) = P(A\cup B) = P(A) + P(B) - P(A\cap B)$
## Multiplicative Rule
https://www.varsitytutors.com/hotmath/hotmath_help/topics/multiplication-rule-of-probability
$P(A\text{ and }B) = P(A,B) = P(A\cap B) = P(A)\cdot P(B|A)$, which reduces to $P(A)\cdot P(B)$ if $A$ and $B$ are independent
# Conditional Rule
$P(A|B) = \frac {P(A,B)} {P(B)}$
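A minimal sketch tying the addition, multiplication, and conditional rules together, using a made-up joint distribution over two binary events $A$ and $B$ (all numbers are purely illustrative):

```python
# Hypothetical joint distribution over two binary events A and B.
P = {  # P[(a, b)] = P(A=a, B=b)
    (1, 1): 0.10, (1, 0): 0.20,
    (0, 1): 0.30, (0, 0): 0.40,
}

P_A = P[(1, 1)] + P[(1, 0)]        # marginal P(A)   = 0.30
P_B = P[(1, 1)] + P[(0, 1)]        # marginal P(B)   = 0.40
P_A_and_B = P[(1, 1)]              # joint    P(A,B) = 0.10

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
P_A_or_B = P_A + P_B - P_A_and_B   # 0.60

# Conditional rule: P(A|B) = P(A,B) / P(B)
P_A_given_B = P_A_and_B / P_B      # 0.25

# Multiplication rule: P(A,B) = P(B) * P(A|B)
assert abs(P_A_and_B - P_B * P_A_given_B) < 1e-12
print(P_A_or_B, P_A_given_B)
```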
If you knew the population distribution and you sampled $n$ items from it, what is the probability that the sample mean takes a particular value? (This question is picked up in the sampling-distribution sections below.)
[[Probability Distributions]]
## Discrete Random Variables
discrete random variables are characterized by countable outcome spaces. A discrete random variable is associated with a **probability mass function (pmf)** $f_X$, defined on a countable subset of $\mathbb{R}$, with probability values in the range $[0,1]$. Properties of the pmf include:
1. $f_X(x) \ge 0$ for all $x \in \mathbb{R}_X$, i.e. the probability of any outcome is $\ge 0$
2. $\sum_{x \in \mathbb{R}_X} f_X(x) = 1$, i.e. the probabilities sum to one
3. $F_X(b) - F_X(a) = \sum_{a < x \le b} f_X(x)$ for $a < b$, i.e. differences of the CDF give interval probabilities
CDF:
$F_X(b) = \sum_{x \le b} f_X(x)$
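A minimal sketch of a pmf and its CDF for a fair six-sided die, checking the properties above (the die is an assumed example, not from the notes):

```python
# pmf of a fair six-sided die; checks the pmf properties listed above.
pmf = {x: 1/6 for x in range(1, 7)}

assert all(p >= 0 for p in pmf.values())        # property 1: f_X(x) >= 0
assert abs(sum(pmf.values()) - 1.0) < 1e-12     # property 2: probabilities sum to 1

def cdf(b):
    """F_X(b) = sum of f_X(x) over all x <= b."""
    return sum(p for x, p in pmf.items() if x <= b)

# Property 3: P(2 < X <= 5) = F_X(5) - F_X(2) = 3/6
print(cdf(5) - cdf(2))   # 0.5
```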
# Continuous random variables
The outcome space is continuous (uncountable). Continuous random variables are characterized by a **probability density function (pdf)** $f_X$, and probabilities are obtained by integrating the pdf: $P(a < X \le b) = \int_a^b f_X(x)\,dx$.
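A minimal sketch of that integral relationship, assuming a standard normal and using scipy as a sanity check:

```python
# For a continuous random variable, probabilities come from integrating the pdf.
# Here the standard normal pdf is integrated over [a, b] and compared with the
# difference of CDF values.
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)            # integral of the pdf over [a, b]
print(area, norm.cdf(b) - norm.cdf(a))    # both ~0.6827
```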
# Moment, Expectation, etc
## Expectation
The expectation of a function is the average value of the function under a probability distribution. For discrete distributions this is computed as a weighted average, where the weights are the probabilities $p(x)$ given by the pmf (the exponent $r$ below gives the $r$-th moment of $f$; $r=1$ is the ordinary expectation):
$E[f^r] = \sum_x f(x)^r\, p(x)$
For continuous distributions, this looks like
$E[f^r] = \int f(x)^r\, p(x)\, dx.$
## Mean
If $f(x) = x$ and $r = 1$, in both cases this is called the mean of the distribution. For a standard normal distribution, computing the expectation gives $E[x]=0$, which matches the intuition of where the bulk of the probability mass is centered.
1st (raw) moment - Mean
2nd central moment - Variance
3rd standardized moment - Skewness
4th standardized moment - Kurtosis
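A minimal sketch estimating these four quantities from simulated data with numpy/scipy (the standard-normal sample is an assumption for illustration):

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

print(np.mean(x))     # ~0  (mean)
print(np.var(x))      # ~1  (variance, the 2nd central moment)
print(skew(x))        # ~0  (standardized 3rd moment)
print(kurtosis(x))    # ~0  (scipy reports excess kurtosis, i.e. kurtosis - 3)
```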
## Independence
It can be shown that $X$ and $Y$ are independent if and only if there exist functions $g(x)$ and $h(y)$ such that:
$f(x,y) = g(x)\,h(y) \text{ for all } (x,y)$
How do we use this? In the discrete case, if we can find a pair (x,y) that violate the product rule, the random variables are dependent.
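A minimal sketch of that check on a made-up 2×2 joint pmf (the table values are invented so that $X$ and $Y$ come out dependent):

```python
import numpy as np

joint = np.array([[0.25, 0.25],     # rows: x = 0, 1
                  [0.10, 0.40]])    # cols: y = 0, 1

px = joint.sum(axis=1)              # marginal of X
py = joint.sum(axis=0)              # marginal of Y

product = np.outer(px, py)          # the joint implied by independence
print(np.allclose(joint, product))  # False -> some (x, y) violates the product
                                    # rule, so X and Y are dependent
```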
## Covariance
Covariance measures how two variables vary together: $\mathrm{Cov}(X,Y) = E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]E[Y]$. Independent variables have a covariance of 0, but the converse is not true: zero covariance does not imply independence.
## Correlation
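Correlation is the covariance scaled by the two standard deviations, $\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$, so it always lies in $[-1,1]$. A minimal sketch of both quantities with numpy (the simulated linear relationship is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)   # y depends linearly on x plus noise

print(np.cov(x, y)[0, 1])        # covariance, ~2
print(np.corrcoef(x, y)[0, 1])   # correlation in [-1, 1], ~0.89
```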
# Point Estimation
Point estimation is a type of statistical inference that consists of producing a guess or approximation of an unknown parameter.
Given a representative sample of data from some population, how do we estimate the parameters of the distribution?
1. Method of Moments
2. [[Maximum likelihood estimation]]
3. Maximum a posteriori probability estimate
Properties of an estimator
- Consistency - as N gets large the estimator converges to the true parameter value
- Bias - an estimator is unbiased if its expected value equals the true parameter value
- Efficiency - has the lowest possible variance among unbiased estimators
## Method of Moments
It works by equating sample moments to population moments, e.g. the estimate for the population mean is the sample mean.
Method of moments can be shown to be consistent but not necessarily efficient and can give estimates outside the parameter space.
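A minimal sketch of method-of-moments estimation, assuming gamma-distributed data: matching the first two moments ($\text{mean} = k\theta$, $\text{var} = k\theta^2$) gives $\hat{\theta} = \text{var}/\text{mean}$ and $\hat{k} = \text{mean}^2/\text{var}$.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.gamma(shape=3.0, scale=2.0, size=50_000)   # true k = 3, theta = 2

m = data.mean()   # sample 1st moment
v = data.var()    # sample 2nd central moment

theta_hat = v / m      # ~2
k_hat = m ** 2 / v     # ~3
print(k_hat, theta_hat)
```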
## Maximum Likelihood estimation
MLE follows from the assumption that our data consist of independent and identically distributed (i.i.d.) observations from a population. The goal is to find the $\theta$ that maximizes the likelihood of observing our data.
The likelihood is the joint probability of observing the data given a specific value of the parameter.
It is often convenient to work with the log-likelihood instead of the likelihood for computational simplicity (products of probabilities become sums).
MLE can be shown to be a consistent estimator, but it may be biased (e.g. the MLE of the variance of a normal distribution). Operationally, it can be computationally expensive to calculate, but it offers a useful invariance property: the MLE of any function of the parameters is that function of the MLE.
For a normal distribution, MLE and MoM give the same results.
https://www.statlect.com/fundamentals-of-statistics/normal-distribution-maximum-likelihood
![[Pasted image 20220130194405.png]]
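A minimal sketch of MLE for a normal distribution, done numerically by minimizing the negative log-likelihood with scipy and compared against the closed-form answer (the data and optimizer settings are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:                 # keep the optimizer in the valid region
        return np.inf
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x

print(mu_hat, sigma_hat)              # ~5, ~2
print(data.mean(), data.std(ddof=0))  # closed-form MLE (note the biased, n-denominator std)
```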
## Maximum a posteriori estimate
MLE augmented with a prior over the parameters: maximize the posterior (prior × likelihood) rather than the likelihood alone; otherwise the procedure is similar to MLE.
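A minimal sketch of a MAP estimate, assuming a coin-flip model with a Beta prior on the success probability $p$: the posterior is Beta($\alpha + k$, $\beta + n - k$), and its mode is the MAP estimate.

```python
alpha, beta = 2.0, 2.0   # assumed Beta prior, for illustration
n, k = 10, 7             # 10 flips, 7 heads

p_mle = k / n                                      # 0.70 (likelihood only)
p_map = (alpha + k - 1) / (alpha + beta + n - 2)   # ~0.67, pulled toward the prior mean 0.5
print(p_mle, p_map)
```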
## Z-score
$Z=\frac{\text{observation}-\mu}{\sigma}$
## Central Limit Theorem
- If
- we sample individuals from a population
- and we take a large sample size (>20)
		- the more symmetric the population distribution, the smaller the sample size can be; the more skewed the population, the more samples are needed
	- or the distribution of the individuals is approximately normal
- then
	- the sampling distribution of the sample mean will be approximately normal (see the sketch below)
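A minimal sketch of the CLT using a deliberately skewed (exponential) population: the skewness of the sampling distribution of the mean shrinks toward 0 as the sample size grows (the population and sizes are assumptions for illustration).

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
population = rng.exponential(scale=1.0, size=1_000_000)   # skewed population

for n in (5, 30, 200):
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(n, skew(means))   # skewness of the sample means shrinks toward 0
```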
## Sampling Distribution
- It is the probability distribution of the sample mean obtained from repeated samples from a population
- Theoretically, it is the set of all possible $\bar{X}$ we could get
### Standard error of mean
Also called the standard deviation of the sampling distribution of the mean
$SE_{\bar{X}}=\frac{\sigma}{\sqrt{n}}$
where n is the sample size (the number of observations in each sample)
The sample mean varies according to an approximately normal distribution
If you know the maximum and minimum possible values and you know that the population is approximately normally distributed, then you can roughly estimate the standard deviation (since $\pm 3\sigma$ covers nearly all of a normal distribution):
sd ≈ (max - min) / 6
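A minimal sketch checking that the standard deviation of simulated sample means matches $\sigma/\sqrt{n}$ (the population parameters are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 10.0, 3.0, 25

means = rng.normal(loc=mu, scale=sigma, size=(50_000, n)).mean(axis=1)
print(means.std())           # empirical standard error, ~0.6
print(sigma / np.sqrt(n))    # theoretical SE = 3 / 5 = 0.6
```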
## Confidence Interval
The true population mean will be within roughly $\pm 2$ standard errors of the sample mean for about 95% of samples. Because we only have the sample's standard deviation $s$ (not the population's $\sigma$), we use a t-based interval:
$\mu=\bar{X}\pm t\frac{s}{\sqrt{n}}$
$t$ - critical t-value for the chosen confidence level with $n-1$ degrees of freedom
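A minimal sketch of a 95% confidence interval for the mean, using the t critical value discussed in the next section (the sample itself is simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=3.0, size=20)

n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)                   # sample standard deviation
t_crit = stats.t.ppf(0.975, df=n - 1)    # ~2.09 for n = 20

half_width = t_crit * s / np.sqrt(n)
print(x_bar - half_width, x_bar + half_width)   # interval that covers mu ~95% of the time
```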
## t-distribution
Use t-distribution instead of Z-distribution (standard normal) because
- we don't know the true population std. dev., but we do know the sample std. dev.
- invented by William Gosset while he was working at the Guinness brewery
- as the sample size becomes larger and larger, the t-distribution converges to the standard normal distribution
- as the sample size becomes smaller, the tails of the distribution become fatter compared to the standard normal
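A minimal sketch of that convergence, comparing the 97.5% quantile of the t-distribution with the standard normal as the degrees of freedom grow:

```python
from scipy import stats

z = stats.norm.ppf(0.975)                   # ~1.96
for df in (3, 10, 30, 1000):
    print(df, stats.t.ppf(0.975, df), z)    # the t quantile approaches the normal quantile
```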
## Bootstrap
Why? - what if you do not have a large sample and cannot assume the sampling distribution is normal?
It can also be difficult to work out the standard error of an estimate analytically.
![[Pasted image 20211210223535.png]]
The bootstrap is resampling with replacement from a sample
The distribution of the $\bar{X}^*$ values is the bootstrap sampling distribution
The SD of all $\bar{X}^*$ is the bootstrap standard error
Use a large number of bootstrap resamples B (at least 1,000 or 10,000)
Increasing B cannot increase the amount of information; the analysis is still based on the original n observations
- but it gives a more reliable estimate of the standard error
Results from a bootstrap approach are almost identical to large-sample theory when its assumptions hold. If the assumptions for large-sample theory aren't met, the sample-mean approach might not work, but the bootstrap might.
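A minimal sketch of a bootstrap standard error for the mean: resample the original sample with replacement B times and take the SD of the resampled means (the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=50)   # one smallish observed sample

B = 10_000
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(B)
])

print(boot_means.std())                           # bootstrap standard error
print(sample.std(ddof=1) / np.sqrt(len(sample)))  # large-sample SE, for comparison
```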