Probability theory provides the mathematical foundation for quantifying uncertainty. Statistics uses data to make inferences about the world. Together, they form the backbone of data science, machine learning, and scientific research.

Probability Foundations

Sample Spaces and Events

A sample space Ω is the set of all possible outcomes of an experiment.

  • Events are subsets of the sample space
  • Complement: Aᶜ = {ω ∈ Ω : ω ∉ A}
  • Union: A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}
  • Intersection: A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}

Probability Axioms (Kolmogorov)

  1. Non-negativity: P(A) ≥ 0 for any event A
  2. Normalisation: P(Ω) = 1
  3. Additivity: P(A ∪ B) = P(A) + P(B) for disjoint events (A ∩ B = ∅)

From these axioms, we derive:

  • P(Aᶜ) = 1 - P(A)
  • P(∅) = 0
  • P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
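The inclusion-exclusion identity is easy to verify numerically. A minimal sketch (assuming Python with numpy), using a fair six-sided die with A = "roll is even" and B = "roll is at least 4"; the events and seed are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

A = (rolls % 2 == 0)   # event A: roll is even
B = (rolls >= 4)       # event B: roll is at least 4

# Empirical check of P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
lhs = np.mean(A | B)
rhs = np.mean(A) + np.mean(B) - np.mean(A & B)
print(lhs, rhs)  # both ≈ 4/6, since A ∪ B = {2, 4, 5, 6}
```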

Conditional Probability

The probability of A given that B has occurred:

P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0

Independence: Two events A and B are independent if and only if:

P(A ∩ B) = P(A)P(B)

Bayes’ Theorem

P(A|B) = P(B|A)P(A) / P(B)

This allows us to update our beliefs about A after observing evidence B.

Law of Total Probability

For a partition {B₁, B₂, …, Bₙ} of Ω:

P(A) = Σᵢ P(A|Bᵢ)P(Bᵢ)
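A classic worked example ties Bayes' theorem and the law of total probability together: the probability of disease given a positive test, where the denominator P(+) is expanded over the partition {disease, no disease}. A minimal sketch; the prevalence and test accuracies below are illustrative numbers, not taken from the text:

```python
# Hypothetical numbers for illustration only.
p_disease = 0.01          # prior P(D)
p_pos_given_d = 0.95      # sensitivity P(+|D)
p_pos_given_not_d = 0.05  # false positive rate P(+|Dᶜ)

# Law of total probability: P(+) = P(+|D)P(D) + P(+|Dᶜ)P(Dᶜ)
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D)P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(f"P(disease | positive) = {p_d_given_pos:.3f}")  # ≈ 0.161
```

Despite a 95% sensitive test, the posterior is only about 16% because the prior is so low, which is exactly the kind of update Bayes' theorem formalises.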

Random Variables

A random variable is a function X: Ω → ℝ that assigns a numerical value to each outcome.

Discrete Random Variables

A discrete random variable takes countably many values.

  • Probability mass function (PMF): p(x) = P(X = x)
  • Cumulative distribution function (CDF): F(x) = P(X ≤ x) = Σ p(k) for k ≤ x

Properties of PMF:

  • p(x) ≥ 0 for all x
  • Σ p(x) = 1
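To make the PMF and CDF definitions concrete, here is a minimal sketch using scipy.stats (assumed installed) for a Binomial(10, 0.3) variable; the parameters are illustrative:

```python
import numpy as np
from scipy.stats import binom

n, p = 10, 0.3
X = binom(n, p)
ks = np.arange(n + 1)

pmf = X.pmf(ks)
print(pmf.sum())       # Σ p(k) = 1, up to floating point
print(X.cdf(3))        # F(3) = P(X ≤ 3)
print(pmf[:4].sum())   # same value, summing the PMF directly
```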

Continuous Random Variables

A continuous random variable takes uncountably many values, typically in an interval.

  • Probability density function (PDF): f(x) where P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx
  • CDF: F(x) = P(X ≤ x) = ∫₋∞ˣ f(t)dt

Properties of PDF:

  • f(x) ≥ 0 for all x
  • ∫₋∞^∞ f(x)dx = 1
  • P(X = x) = 0 for any specific value
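These properties can be checked numerically. A minimal sketch with scipy (assumed installed), integrating the standard normal density and comparing the integral of the PDF against the CDF:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

X = norm(loc=0, scale=1)  # standard normal

# The density integrates to 1 over the real line
area, _ = quad(X.pdf, -np.inf, np.inf)
print(area)  # ≈ 1.0

# P(a ≤ X ≤ b) as an integral of the PDF, and via the CDF
a, b = -1, 1
prob, _ = quad(X.pdf, a, b)
print(prob, X.cdf(b) - X.cdf(a))  # both ≈ 0.6827
```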

Expected Value

The expected value (mean) is the weighted average of all possible values.

Discrete: E[X] = Σ x · P(X = x)

Continuous: E[X] = ∫₋∞^∞ x · f(x)dx

Properties:

  • Linearity: E[aX + bY] = aE[X] + bE[Y]
  • E[c] = c for any constant
  • E[g(X)] = Σ g(x)P(X = x) or ∫ g(x)f(x)dx
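A short sketch (assuming numpy) computing E[X] for a fair die from the definition and checking linearity by simulation; the constants a = 2, b = 3 are arbitrary:

```python
import numpy as np

# Fair six-sided die: E[X] = Σ x · P(X = x)
x = np.arange(1, 7)
p = np.full(6, 1 / 6)
EX = np.sum(x * p)
print(EX)  # 3.5

# Linearity: E[aX + b] = aE[X] + b, checked by simulation
rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=200_000)
print(np.mean(2 * rolls + 3), 2 * EX + 3)  # both ≈ 10.0
```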

Variance and Standard Deviation

Variance measures the spread of a distribution around its mean:

Var(X) = E[(X - μ)²] = E[X²] - (E[X])²

Standard deviation: σ = √Var(X)

Properties:

  • Var(aX + b) = a²Var(X)
  • Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)
  • For independent X, Y: Var(X + Y) = Var(X) + Var(Y)
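The first and third properties are easy to confirm by simulation. A minimal sketch (assuming numpy); the distributions and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5, scale=2, size=500_000)  # Var(X) = 4

# Var(aX + b) = a²Var(X): the shift b drops out
a, b = 3, 7
print(np.var(a * x + b), a**2 * np.var(x))  # both ≈ 36

# For independent X, Y: Var(X + Y) = Var(X) + Var(Y)
y = rng.normal(loc=0, scale=1, size=500_000)  # Var(Y) = 1
print(np.var(x + y), np.var(x) + np.var(y))  # both ≈ 5
```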

Covariance and Correlation

Covariance measures how two variables vary together:

Cov(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]

Correlation is the normalised covariance:

ρ(X,Y) = Cov(X,Y) / (σₓσᵧ), which always lies in [-1, 1]
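A minimal sketch (assuming numpy) computing ρ by hand from the covariance and comparing it with numpy's built-in; the data-generating process is an illustrative choice with true ρ ≈ 0.8:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.8 * x + rng.normal(scale=0.6, size=100_000)  # correlated with x

# ρ = Cov(X,Y) / (σₓσᵧ), using the population (ddof=0) normalisation
cov_xy = np.cov(x, y, ddof=0)[0, 1]
rho = cov_xy / (np.std(x) * np.std(y))
print(rho)                      # ≈ 0.8
print(np.corrcoef(x, y)[0, 1])  # numpy's built-in agrees
```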

Common Distributions

Discrete Distributions

| Distribution | PMF | E[X] | Var(X) | Use Case |
|---|---|---|---|---|
| Bernoulli(p) | pˣ(1-p)¹⁻ˣ | p | p(1-p) | Single trial |
| Binomial(n,p) | C(n,k)pᵏ(1-p)ⁿ⁻ᵏ | np | np(1-p) | Number of successes in n trials |
| Poisson(λ) | λᵏe⁻λ/k! | λ | λ | Count of rare events |
| Geometric(p) | p(1-p)ᵏ⁻¹ | 1/p | (1-p)/p² | Trials until first success |
| Negative Binomial(r,p) | C(k-1,r-1)pʳ(1-p)ᵏ⁻ʳ | r/p | r(1-p)/p² | Trials until r successes |

Continuous Distributions

| Distribution | PDF | E[X] | Var(X) | Use Case |
|---|---|---|---|---|
| Uniform(a,b) | 1/(b-a) | (a+b)/2 | (b-a)²/12 | Equal likelihood |
| Exponential(λ) | λe⁻λˣ | 1/λ | 1/λ² | Wait times, memoryless |
| Normal(μ,σ²) | (1/(σ√(2π)))e^(-(x-μ)²/(2σ²)) | μ | σ² | Natural phenomena |
| Gamma(α,β) | βᵅxᵅ⁻¹e⁻βˣ/Γ(α) | α/β | α/β² | Sum of exponentials |
| Beta(α,β) | xᵅ⁻¹(1-x)ᵝ⁻¹/B(α,β) | α/(α+β) | αβ/((α+β)²(α+β+1)) | Probabilities |

The Normal Distribution

The normal (Gaussian) distribution is ubiquitous due to the Central Limit Theorem.

Standard normal: Z ~ N(0,1)

Standardisation: If X ~ N(μ,σ²), then Z = (X - μ)/σ ~ N(0,1)

68-95-99.7 Rule:

  • 68% of values fall within 1σ of μ
  • 95% of values fall within 2σ of μ
  • 99.7% of values fall within 3σ of μ
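These percentages follow directly from the standard normal CDF, and since standardisation maps any N(μ,σ²) to N(0,1), one computation covers every normal. A minimal sketch with scipy (assumed installed):

```python
from scipy.stats import norm

# P(μ - kσ ≤ X ≤ μ + kσ) is the same for every normal, so use N(0,1)
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k}σ: {prob:.4f}")
# within 1σ: 0.6827, within 2σ: 0.9545, within 3σ: 0.9973
```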

Limit Theorems

Law of Large Numbers (LLN)

As the sample size n → ∞, the sample mean X̄ₙ converges to the population mean μ:

X̄ₙ = (1/n)Σxᵢ → μ as n → ∞
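A minimal sketch (assuming numpy) watching the running sample mean settle toward the true mean; the exponential distribution and its mean μ = 2 are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=1_000_000)  # true mean μ = 2

# Running sample mean X̄ₙ as n grows
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 1_000, 1_000_000):
    print(n, running_mean[n - 1])  # drifts toward 2 as n → ∞
```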

Central Limit Theorem (CLT)

For independent, identically distributed random variables X₁, …, Xₙ with mean μ and variance σ²:

√n(X̄ₙ - μ)/σ → N(0, 1) as n → ∞

The sum/average of many independent random variables is approximately normal, regardless of the underlying distribution.
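To see this empirically, one can standardise means of uniform samples, which are nowhere near normal individually, and check that the result behaves like N(0,1). A minimal sketch (assuming numpy); n and the repetition count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Means of n Uniform(0,1) samples; Uniform has μ = 0.5, σ² = 1/12
n, reps = 50, 100_000
means = rng.uniform(size=(reps, n)).mean(axis=1)

# Standardised means should be approximately N(0, 1)
z = (means - 0.5) / np.sqrt(1 / 12 / n)
print(z.mean(), z.std())           # ≈ 0 and ≈ 1
print(np.mean(np.abs(z) <= 1.96))  # ≈ 0.95, matching the normal
```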

Statistical Inference

Point Estimation

Point estimation uses the data to produce a single value for a population parameter.

  • Sample mean: X̄ = (1/n)Σxᵢ estimates μ
  • Sample variance: s² = (1/(n-1))Σ(xᵢ - X̄)² estimates σ²
  • Maximum Likelihood Estimation (MLE): Find θ that maximises P(data|θ)
  • Method of Moments: Equate sample moments to population moments

Desirable properties: Unbiasedness, consistency, efficiency
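For a concrete case, the exponential distribution has a closed-form MLE: maximising the likelihood Π λe^(−λxᵢ) gives λ̂ = 1/x̄, which here coincides with the method-of-moments estimate. A minimal sketch (assuming numpy); the true λ = 0.5 is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(6)
true_lambda = 0.5
x = rng.exponential(scale=1 / true_lambda, size=10_000)

# MLE for Exponential(λ): λ̂ = 1/x̄
lambda_hat = 1 / x.mean()
print(lambda_hat)  # ≈ 0.5

# Unbiased sample variance with the n−1 denominator
print(np.var(x, ddof=1), 1 / true_lambda**2)  # both ≈ 4
```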

Interval Estimation

A confidence interval provides a range of plausible values.

95% CI for μ (known σ): X̄ ± 1.96(σ/√n)

95% CI for μ (unknown σ): X̄ ± t₀.₀₂₅,ₙ₋₁(s/√n)

Interpretation: If we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter.
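A minimal sketch (assuming numpy and scipy) constructing the unknown-σ interval from the formula above; the sample is simulated from N(10, 9) purely for illustration:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
x = rng.normal(loc=10, scale=3, size=25)

n, xbar, s = x.size, x.mean(), x.std(ddof=1)
t_crit = t.ppf(0.975, df=n - 1)  # t₀.₀₂₅,ₙ₋₁

lo, hi = xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n)
print(f"95% CI for μ: ({lo:.2f}, {hi:.2f})")  # should usually cover 10
```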

Hypothesis Testing

  1. State null hypothesis H₀ and alternative H₁
  2. Choose significance level α (typically 0.05)
  3. Calculate test statistic from data
  4. Find p-value or critical region
  5. Reject H₀ if p-value < α

Errors:

  • Type I error (α): Rejecting H₀ when it’s true (false positive)
  • Type II error (β): Failing to reject H₀ when it’s false (false negative)
  • Power = 1 - β: Probability of correctly rejecting false H₀
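The five steps map directly onto a one-sample t-test. A minimal sketch with scipy (assumed installed); the simulated data, with true mean 10.5, make rejection of H₀: μ = 10 likely:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(8)
x = rng.normal(loc=10.5, scale=2, size=40)

# H₀: μ = 10 vs H₁: μ ≠ 10, at α = 0.05
result = ttest_1samp(x, popmean=10)
print(result.statistic, result.pvalue)
if result.pvalue < 0.05:
    print("reject H₀")
else:
    print("fail to reject H₀")
```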

Common Statistical Tests

| Test | Purpose | Assumptions |
|---|---|---|
| One-sample t-test | Compare a mean to a value | Normality |
| Two-sample t-test | Compare two means | Normality, equal variance |
| Paired t-test | Compare paired observations | Normality of differences |
| χ² test | Test categorical associations | Expected counts ≥ 5 |
| F-test | Compare variances | Normality |
| ANOVA | Compare multiple means | Normality, equal variance |

Bayesian Statistics

The Bayesian Approach

  • Prior P(θ): Initial belief about parameter
  • Likelihood P(data|θ): Probability of data given parameter
  • Posterior P(θ|data): Updated belief after observing data
  • Evidence P(data): Normalising constant
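The cleanest illustration is a conjugate update, where the posterior has the same family as the prior. A minimal sketch (assuming scipy) for a Beta prior on a coin's heads probability updated with binomial data; the prior and data are illustrative:

```python
from scipy.stats import beta

# Prior: Beta(2, 2) belief about a coin's heads probability θ
a, b = 2, 2

# Observe 7 heads in 10 flips; Beta is conjugate to the binomial,
# so the posterior is Beta(a + heads, b + tails)
heads, flips = 7, 10
a_post, b_post = a + heads, b + (flips - heads)

posterior = beta(a_post, b_post)
print(posterior.mean())          # posterior mean ≈ 0.643
print(posterior.interval(0.95))  # 95% credible interval for θ
```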

Bayesian vs Frequentist

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability | Long-run frequency | Degree of belief |
| Parameters | Fixed unknown constants | Random variables |
| Prior information | Not formally used | Explicitly incorporated |
| Inference | Point estimates, CIs | Posterior distribution |
| Interpretation | Objective | Subjective/epistemic |

Regression

Simple Linear Regression

Models the relationship between two variables:

y = β₀ + β₁x + ε

where ε is a zero-mean error term.

Least squares estimates:

  • β̂₁ = Cov(x,y)/Var(x)
  • β̂₀ = ȳ - β̂₁x̄

Coefficient of determination R² = 1 - SS_res/SS_tot measures the proportion of variance explained.
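A minimal sketch (assuming numpy) applying the least squares formulas above to simulated data with known coefficients β₀ = 1.5 and β₁ = 2, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 2.0 * x + rng.normal(scale=1.0, size=200)  # true β₀=1.5, β₁=2

# Least squares estimates from the formulas above
beta1 = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
beta0 = y.mean() - beta1 * x.mean()

# R² = 1 - SS_res/SS_tot
resid = y - (beta0 + beta1 * x)
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(beta0, beta1, r2)  # ≈ 1.5, 2.0, and R² close to 1
```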

Multiple Linear Regression

In matrix form: y = Xβ + ε

Least squares solution: β̂ = (XᵀX)⁻¹Xᵀy

Assumptions: Linearity, independence, homoscedasticity, normality of errors

Multicollinearity: When predictors are highly correlated, estimates become unstable.
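A minimal sketch (assuming numpy) solving the least squares problem in matrix form; np.linalg.lstsq solves the same normal equations as (XᵀX)⁻¹Xᵀy but more stably. The final lines show how a near-duplicate predictor makes XᵀX nearly singular, which is the numerical face of multicollinearity:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500
X = np.column_stack([np.ones(n),             # intercept column
                     rng.normal(size=n),
                     rng.normal(size=n)])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# β̂ = (XᵀX)⁻¹Xᵀy, computed via a stable least squares solver
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # ≈ [1.0, -2.0, 0.5]

# Multicollinearity: a near-duplicate column makes XᵀX nearly singular
X_bad = np.column_stack([X, X[:, 1] + 1e-8 * rng.normal(size=n)])
print(np.linalg.cond(X_bad.T @ X_bad))  # huge condition number
```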

Learning Resources

Books

  • A First Course in Probability by Sheldon Ross
  • Statistical Inference by Casella and Berger
  • All of Statistics by Larry Wasserman
  • Probability Theory: The Logic of Science by E.T. Jaynes

Online Courses