# Common Data Distributions
## Parametric vs Non-parametric methods
Parametric methods represent the data with a finite, fixed number of parameters. Non-parametric methods do not assume a fixed model for the data. Non-parametric does not mean there are no parameters; rather, the number of parameters grows as the data grows.
# Discrete Distributions
## Bernoulli distribution
x can be 1 or 0
Model event that either occurs or not
$X \sim Bern(\theta)$, where
$f(x;\theta) = P(X=x;\theta) =
\begin{cases}
\theta, \text{for x = 1} \\
1 - \theta, \text{for x = 0} \\
\end{cases}
$
or
$f(x;\theta) = \theta^x(1-\theta)^{(1-x)}$
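A quick check of the PMF formula in Python (not part of the original notes; $\theta = 0.7$ is an arbitrary example value):
```python
# Bernoulli PMF: f(x; theta) = theta^x * (1 - theta)^(1 - x), for x in {0, 1}
def bernoulli_pmf(x: int, theta: float) -> float:
    assert x in (0, 1)
    return theta**x * (1 - theta) ** (1 - x)

theta = 0.7                     # example success probability (arbitrary)
print(bernoulli_pmf(1, theta))  # 0.7 -> P(X = 1)
print(bernoulli_pmf(0, theta))  # 0.3 -> P(X = 0)
```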
## Binomial distribution
Used to model the number of successes 'y' in 'n' total trials. Each trial is a Bernoulli event
Each trial is independent
$\theta$ is probability of success of an event (Bernoulli)
$P(y | \theta) = \binom{n}{y} \cdot \theta^y \cdot (1 - \theta)^{n-y}$
where $\binom{n}{y} = C(n,y) = \frac{n!}{y! (n - y)!}$
$Mean = n\theta$
$Variance = n \theta (1 - \theta) $
The Binomial has no separate "spread" parameter: n is the total number of trials and $\theta$ shifts the mean, with the variance $n\theta(1-\theta)$ determined by both.
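A short sketch, assuming scipy is available, that checks the PMF, mean, and variance formulas above against `scipy.stats.binom`; n = 10 and $\theta$ = 0.3 are arbitrary example values:
```python
from math import comb
from scipy import stats

n, theta = 10, 0.3   # example values (arbitrary)
y = 4                # number of successes to evaluate

# PMF from the formula: C(n, y) * theta^y * (1 - theta)^(n - y)
pmf_manual = comb(n, y) * theta**y * (1 - theta) ** (n - y)
print(pmf_manual, stats.binom.pmf(y, n, theta))            # both ~0.2001

# Mean = n*theta, Variance = n*theta*(1 - theta)
print(n * theta, stats.binom.mean(n, theta))               # 3.0 3.0
print(n * theta * (1 - theta), stats.binom.var(n, theta))  # 2.1 2.1
```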
## Negative Binomial Distribution
Models the number of failures 'x' seen before the 'r'th success: the first x + r - 1 trials contain r - 1 successes and x failures, and the (x + r)th trial is a success. In other words, it is the number of failures (x) you would have to see before you see a certain number of successes (r).
"What is the probability of seeing 'x' failures before the 'r'th success, given that the probability of success is $\theta$?"
Example:
Number of days your car works (x) before it breaks down for the third time (r = 3)
$P(x| \theta) = {x + r - 1 \choose r - 1} \: \theta^r (1- \theta)^x$
where $\theta$ is the probability of success
$mean = r ( 1- \theta) / \theta $
$variance = r ( 1- \theta) / \theta^2 $
### Similarity to a Poisson Distribution
Can be used to model phenomena that a Poisson distribution can, but where the mean = variance criterion no longer holds. Hence this distribution is sometimes called the overdispersed Poisson. There is an alternate parameterization where the inverse dispersion parameter $\kappa$ is used. If the Poisson distribution is parameterized by the rate parameter $\lambda$, the negative binomial distribution can be shown to have a variance given by
$\lambda + \dfrac{\lambda^2}{\kappa}$
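A sketch of the PMF and the overdispersion (variance > mean), assuming scipy; `scipy.stats.nbinom` counts failures before the r-th success with p as the success probability, which matches the parameterization above. r = 3 and $\theta$ = 0.4 are arbitrary example values:
```python
from math import comb
from scipy import stats

r, theta = 3, 0.4   # successes to wait for, success probability (example values)
x = 5               # number of failures

# PMF from the formula: C(x + r - 1, r - 1) * theta^r * (1 - theta)^x
pmf_manual = comb(x + r - 1, r - 1) * theta**r * (1 - theta) ** x
print(pmf_manual, stats.nbinom.pmf(x, r, theta))  # both ~0.1045

# Overdispersion: the variance exceeds the mean, unlike a Poisson
mean = r * (1 - theta) / theta
var = r * (1 - theta) / theta**2
print(mean, var)                                  # 4.5 11.25
```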
## Poisson Distribution
Discrete distribution that indicates the probability of a number of events (a count of events)
$P(y|\theta) = \theta^y e^{-\theta} / y!$
The mean and variance of a Poisson distribution are both given by the rate parameter $\theta$, resulting in the distribution growing wider as the mean moves away from zero.
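A quick check of the PMF and the mean = variance property, assuming scipy; $\theta$ = 3 is an arbitrary example rate:
```python
from math import exp, factorial
from scipy import stats

theta = 3.0   # rate parameter (example value)
y = 2         # count of events to evaluate

# PMF from the formula: theta^y * exp(-theta) / y!
pmf_manual = theta**y * exp(-theta) / factorial(y)
print(pmf_manual, stats.poisson.pmf(y, theta))              # both ~0.2240

# Mean and variance are both equal to the rate parameter
print(stats.poisson.mean(theta), stats.poisson.var(theta))  # 3.0 3.0
```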
# Continuous Distributions
## Exponential Distribution
It is a special case of a [[Probability Distributions#Gamma Distribution|Gamma Distribution]]
An exponential distribution (continuous distribution) has the PDF given by
$P(x) = \lambda e^{-\lambda x}$
$\lambda$ is the rate parameter. The exponential distribution can be used to model the duration of events, e.g. the length of a phone call. There is a relation between $\lambda$ and the rate parameter of the Poisson distribution: if the number of epidemics per year is modeled by a Poisson distribution with rate $\lambda_{1}$, the time between epidemics can be modeled by an exponential distribution with the same rate $\lambda_{1}$, i.e. a mean waiting time of $\dfrac{1}{\lambda_{1}}$.
$ mean = \dfrac{1}{\lambda} $
$ variance = \dfrac{1}{\lambda^2} $
A larger $\lambda$ concentrates the mass close to zero; a smaller $\lambda$ spreads the distribution out.
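A sketch of this, assuming scipy; note that `scipy.stats.expon` is parameterized by scale = 1/$\lambda$. The rates 2 and 0.5 are arbitrary example values:
```python
from scipy import stats

lam = 2.0                                   # rate parameter (example value)
dist = stats.expon(scale=1 / lam)           # scipy uses scale = 1/lambda

print(dist.mean(), 1 / lam)                 # 0.5 0.5
print(dist.var(), 1 / lam**2)               # 0.25 0.25

# Larger lambda concentrates mass near zero: compare P(X <= 0.5)
print(dist.cdf(0.5))                        # ~0.632 for lambda = 2
print(stats.expon(scale=1 / 0.5).cdf(0.5))  # ~0.221 for lambda = 0.5
```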
## Gamma Distribution
The Gamma distribution is used to model the time taken for 'n' independent events to occur. It can be derived from the CDF of a Poisson distribution by computing the probability of a given number of events 'a' occurring in time 'x'. The Gamma distribution models the wait time 'x' (a positive value) and is given by the PDF $P(x) = \dfrac{b^a}{\Gamma(a)} \cdot x^{a - 1} e^{-b x} $
Here 'a' is the shape parameter and 'b' is the rate parameter, or the inverse scale parameter, of the distribution (same as the rate parameter of the Poisson distribution from which this distribution is derived) such that
$ a> 0$
and
$b > 0$
Here $\Gamma$ is the gamma function, which generalizes the factorial beyond the positive integers (for a positive integer n, $\Gamma(n) = (n-1)!$)
#### Interpretation
The shape parameter 'a' can be interpreted as the number of events that we are waiting on to happen. The rate parameter 'b', as the name indicates, is the rate at which events happen. In the case where a = 1, or we are waiting for the first event to happen, we get the exponential distribution
$ mean = a/b $
$ variance = a/b^2 $
An exponential distribution models the time to the first event, the Gamma distribution models the time to the 'n'th event.
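A sketch with scipy (`scipy.stats.gamma` takes the shape a and scale = 1/b); a = 3 and b = 2 are arbitrary example values, and a = 1 recovers the exponential distribution:
```python
import numpy as np
from scipy import stats

a, b = 3, 2.0                        # shape (events waited for) and rate (example values)
gamma = stats.gamma(a, scale=1 / b)  # scipy uses scale = 1/rate

print(gamma.mean(), a / b)           # 1.5 1.5
print(gamma.var(), a / b**2)         # 0.75 0.75

# With a = 1 the Gamma PDF coincides with the Exponential PDF
x = np.linspace(0.1, 5, 50)
print(np.allclose(stats.gamma(1, scale=1 / b).pdf(x),
                  stats.expon(scale=1 / b).pdf(x)))  # True
```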
## Normal Distribution
- parameters - mean $\mu$ and variance $\sigma^2$
$ mean = \mu$
$standard\;deviation = \sigma$
and its PDF is given by
$P(x) = \dfrac{1}{\sqrt{2 \pi \sigma^2}} e^{- (x - \mu)^2 / 2 \sigma^2}$
This is also called a Gaussian distribution. A normal distribution is found in many physical phenomena and has significance as a result of the **Central Limit Theorem**, which states that the distribution of the means of different sets of samples $X_1, X_2, \ldots, X_n$ (each with k samples) converges to a normal distribution as k grows. The conditions for the Central Limit Theorem to hold are:
1. The samples must come from random variables that have finite mean and variance
2. The sample size must be large enough, i.e. $(k \rightarrow \infty)$
The implication is that a physical process represented by a random variable $A$ that is the sum of random variables $B_1, B_2, \ldots, B_n$ will have an approximately normal distribution irrespective of the distributions of $B_1, B_2, \ldots, B_n$.
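A small simulation sketch of the Central Limit Theorem (not part of the original notes): sample means drawn from a skewed exponential distribution still end up approximately normal. The set sizes are arbitrary choices:
```python
import numpy as np

rng = np.random.default_rng(0)
k, n_sets = 500, 10_000   # samples per set, number of sets (arbitrary choices)

# Draw from an exponential distribution (skewed, clearly non-normal)
# and take the mean of each set of k samples
sample_means = rng.exponential(scale=1.0, size=(n_sets, k)).mean(axis=1)

# The means cluster around the true mean (1.0) with std ~ 1/sqrt(k),
# and their histogram is approximately bell-shaped
print(sample_means.mean())  # ~1.0
print(sample_means.std())   # ~0.045 (= 1/sqrt(500))
```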
The square of the standard deviation is called the variance of the normal distribution, and the inverse of the variance is called the precision. Three standard deviations from the mean in either direction cover 99.73% of the mass of the PDF. This may not be a good distribution for data with a lot of outliers; a heavy-tailed distribution such as the Student's t-distribution is a better fit for such data. The normal distribution is also the maximum-entropy distribution for a given mean and variance, i.e. it encodes the most uncertainty.
- 1/2/3 std dev - roughly 68/95/99.7% of the mass
- say the height of individuals is normally distributed with mean 175 cm and std dev 10 cm. What is the probability that someone is taller than 190 cm? (about 6.7%; see the sketch below)
Example
- number of accidents at an intersection
- number of salmonella outbreaks in a year
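The height example above, worked out with scipy (mean 175 cm, std dev 10 cm):
```python
from scipy import stats

mu, sigma = 175, 10  # cm, from the height example above

# P(height > 190 cm); sf is the survival function, 1 - cdf
print(stats.norm.sf(190, loc=mu, scale=sigma))  # ~0.0668, i.e. about 6.7%

# 1/2/3 standard deviations cover ~68.3% / 95.4% / 99.73% of the mass
for k in (1, 2, 3):
    print(stats.norm.cdf(k) - stats.norm.cdf(-k))
```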
## Log-Normal distribution
If Y is normally distributed with mean $\mu$ and standard deviation $\sigma$, then X given by
$X = e^Y$
is a Lognormal distribution. In other words, X can be considered a lognormal distribution if the log of X is normally distributed.
If a mean $\mu_X$ and standard deviation $\sigma_X$ are desired for the lognormal distribution, we can set the parameters $\mu$ and $\sigma$ to be
$\mu = \ln \left( \dfrac{\mu_X^2}{\sqrt{\mu_X^2 + \sigma_X^2}} \right)$
$\sigma^2 = \ln \left( 1 + \dfrac{\sigma_X^2}{\mu_X^2} \right)$
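A sketch checking this conversion with scipy; note that `scipy.stats.lognorm` is parameterized by s = $\sigma$ and scale = $e^{\mu}$. The target mean 2.0 and standard deviation 1.0 are arbitrary example values:
```python
import numpy as np
from scipy import stats

mu_x, sigma_x = 2.0, 1.0  # desired lognormal mean and std dev (example values)

# Convert to the parameters of the underlying normal, as above
sigma2 = np.log(1 + sigma_x**2 / mu_x**2)
mu = np.log(mu_x**2 / np.sqrt(mu_x**2 + sigma_x**2))

# scipy parameterizes lognorm with s = sigma and scale = exp(mu)
dist = stats.lognorm(s=np.sqrt(sigma2), scale=np.exp(mu))
print(dist.mean(), dist.std())  # ~2.0 ~1.0
```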
The mode of the lognormal distribution is given by
$mode = e^{\mu - \sigma^2}$
If you would like to design a lognormal distribution to have a certain mode, you can set the parameter $\mu$ for a selected $\sigma$ as described below
$\mu = ln(mode) + \sigma^2$
The PDF for a lognormal distribution is given by
$P(x) = \dfrac{1}{x \sigma \sqrt{2 \pi }} e^{- (ln(x) - \mu)^2 / 2 \sigma^2}$
**Example**
This can be used to model disease parameters such as the reproduction number for epidemics.
**Conditions**
1. Continuous non-negative values
2. Asymmetric unlike the Normal distribution
## Student's t-distribution
Parameterized using the mean $\mu$ and the precision $\tau$, which is the inverse of the variance.
We can compute the marginal distribution for x by using a Gamma prior on the precision and integrating out the dependence of the Normal distribution on its precision over all values of the precision from 0 to $\infty$.
$ p(x | \mu, a, b) = \int_0^{\infty} N(x | \mu, \tau^{-1}) \; G(\tau | a,b) \; d \tau $
$ p(x | \mu, a, b) = \dfrac{b^a}{\Gamma(a)} (\dfrac{1}{2 \pi})^{1/2} [ b + \dfrac{(x - \mu)^2}{2} ]^{-a - 1/2} \; \Gamma(a + 1/2) $
The integral above has the following interpretation: a Student's t-distribution is made up of an infinite number of Gaussians with the same mean and differing variances, with the variances varying from 0 to infinity.
The following new parameters are defined for the t-distribution in terms of the Gamma distribution parameters a and b
1. $\nu = 2a$
2. $\lambda = a/b$
The PDF for the Generalized Student's t-distribution can hence be written as
$ p(x | \mu, a, b) = \dfrac{\Gamma\left(\dfrac{\nu+1}{2}\right)}{\Gamma (\nu/2)} \sqrt{\dfrac{\lambda}{\nu \pi}} \left(1 + \dfrac{\lambda(x - \mu)^2}{\nu}\right)^{-(\nu+1)/2}$
where
* μ corresponds to the mean of the distribution
* $\lambda$ corresponds to the precision of the distribution though it is generally not the same as the inverse of the variance of the distribution
* $\nu$ is the degrees of freedom and takes values between 0 and $\infty$. The degrees of freedom corresponds to the number of independent observations minus 1: when the sample size is 8, the distribution used to model this would have the degrees of freedom set to 7. A value of 1 corresponds to the Cauchy distribution and indicates heavy tails, while $\nu \rightarrow \infty$ recovers the Normal distribution.
$ mean = \mu $
$ variance = \dfrac{\nu}{(\nu - 2)\lambda} \;\; (\nu > 2)$
The simplest form of the probability density function for a zero-centered t-distribution is given by:
$P(x) = \dfrac{\Gamma\left(\dfrac{\nu+1}{2}\right)}{\sqrt{\nu \pi}\, \Gamma (\nu/2)} \left(1 + \dfrac{x^2}{\nu}\right)^{-(\nu+1)/2}$
For the form of the distribution as shown above
$mean = 0$
$variance = \nu/(\nu - 2) \;\; (\nu > 2)$
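A sketch comparing the tails and the variance with scipy; $\nu$ = 5 is an arbitrary example value:
```python
from scipy import stats

nu = 5                            # degrees of freedom (example value)
t = stats.t(df=nu)

# Heavier tails than the standard Normal: more mass beyond 3
print(t.sf(3), stats.norm.sf(3))  # ~0.0150 vs ~0.00135

# Variance of the zero-centered form is nu / (nu - 2) for nu > 2
print(t.var(), nu / (nu - 2))     # 1.666... 1.666...
```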
##### Alternative Interpretation
If we draw samples from a population that has a normal distribution, but whose standard deviation $\sigma$ is not known to us, then we can use a Student's t-distribution to describe the standardized distances of the sample mean (M) to the population mean $\mu$. A Student's t-distribution is used where one would be inclined to use a Normal distribution, but a Normal distribution is susceptible to outliers, whereas a t-distribution is more robust.
If we sample from a population, there are two ways in which the sample mean (M) could be different from the true mean $\mu$
* M is too high
* M is too low
M can be close to $\mu$, in which case we were lucky
The estimate of the standard deviation $\sigma$ can be obtained from the samples as $s_m$.
* $s_m$ tends to be smaller than $\sigma$ unless the sample is extensive and covers the full range of values of the distribution (support)
The Student's t-distribution can be used to quantify how much smaller this is.
**Example**
* A distribution of test scores from an exam which has a significant number of outliers and would not be appropriate for a Normal distribution
* Use a Cauchy distribution with ν = 1 to model extreme observations for rare events (black swan, Nassim Taleb)
**Conditions**
* Continuous data
* Unbounded distribution
* Considered an overdispersed Normal distribution, mixture of individual normal distributions with different variances
## Beta distribution
The Beta distribution is a distribution over a probability; therefore the range of values it takes as input is from 0 to 1. The Beta distribution is the conjugate prior of a binomial distribution. A class of conjugate priors for a sampling model $P(y | \theta)$ is one that makes the posterior $P(\theta | y)$ have the same form as the prior. The PDF for a Beta distribution is shown below.
$P(\theta | a, b) = \dfrac{\Gamma(a + b)}{\Gamma(a) \Gamma(b)} \cdot \theta^{a - 1} \cdot (1 - \theta)^{b -1}$
This means that if we have a beta prior and a likelihood that has a binomial form, the posterior will have a beta distribution. Not only do we know the form, we can compute the posterior exactly in closed form
$ Prior\;for\;\theta = P(\theta) = Beta(a,b) $
$ Sampling\; distribution = P(y|\theta) = Binomial(num_p, total) $
$ Posterior\;for\;\theta = P(\theta|y) = Beta(num_p + a, total - num_p +b) $
where the observed data 'y' is represented by $num_p$ and $total$. The parameter $\theta$ is associated with the success event represented by $num_p$: $num_p$ is the number of positive events and $total$ is the total number of events. Here 'a' and 'b' can be considered pseudo-counts. Setting a = 1 and b = 1 gives you a uniform distribution.
The mean and variance for Beta(a,b) is given by
$ Mean = \dfrac{a}{a +b} $
$ Variance = \dfrac{ab}{(a + b)^2(a + b + 1)} $
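A sketch of the conjugate update above with scipy; the prior pseudo-counts a = b = 2 and the observed data (7 successes out of 10) are arbitrary example values:
```python
from scipy import stats

a, b = 2, 2            # pseudo-count prior (example values)
num_p, total = 7, 10   # observed successes and total trials (example values)

# Conjugate update: Beta prior + Binomial likelihood -> Beta posterior
posterior = stats.beta(num_p + a, total - num_p + b)

print(posterior.mean())          # (num_p + a) / (total + a + b) = 9/14 ~ 0.643
print(posterior.interval(0.95))  # central 95% credible interval for theta
```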
**Example**
* [Baseball batting averages](http://varianceexplained.org/statistics/beta_distribution_and_baseball/) The Beta distribution can be used as a prior distribution, as shown above, for modeling the batting average $\theta$ in baseball. For a new player, since we have no information, we can rely on historical batting averages to form a prior. Since batting averages are known to be in a certain range, we use that information to define a Beta prior for $\theta$. The number of hits and misses can be represented by a binomial distribution. After every game, the posterior for $\theta$ can be computed from these values. The posterior for $\theta$ can then be used as the prior for the next game.
**Conditions**
* Takes positive values between 0 and 1 as input
* Setting a and b to 1 gives you a uniform distribution
Weibull
Cauchy
Exponential