Probability theory provides the mathematical foundation for quantifying uncertainty. Statistics uses data to make inferences about the world. Together, they form the backbone of data science, machine learning, and scientific research.

Probability Foundations

Sample Spaces and Events

A sample space Ω is the set of all possible outcomes of an experiment.

  • Events are subsets of the sample space
  • Complement: Aᶜ = {ω ∈ Ω : ω ∉ A}
  • Union: A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}
  • Intersection: A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}

Probability Axioms (Kolmogorov)

  1. Non-negativity: P(A) ≥ 0 for any event A
  2. Normalisation: P(Ω) = 1
  3. Additivity: P(A ∪ B) = P(A) + P(B) for disjoint events (A ∩ B = ∅)

From these axioms, we derive:

  • P(Aᶜ) = 1 - P(A)
  • P(∅) = 0
  • P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
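The inclusion-exclusion identity is easy to verify numerically. A minimal sketch (assuming Python with numpy), using a fair six-sided die with A = "roll is even" and B = "roll is at least 4"; the events and seed are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

A = (rolls % 2 == 0)   # event A: roll is even
B = (rolls >= 4)       # event B: roll is at least 4

# Empirical check of P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
lhs = np.mean(A | B)
rhs = np.mean(A) + np.mean(B) - np.mean(A & B)
print(lhs, rhs)  # both ≈ 4/6, since A ∪ B = {2, 4, 5, 6}
```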

Conditional Probability

The probability of A given that B has occurred:

P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0

Independence: Two events A and B are independent if and only if:

P(A ∩ B) = P(A)P(B)

Bayes’ Theorem

P(A|B) = P(B|A)P(A) / P(B)

This allows us to update our beliefs about A after observing evidence B.

Law of Total Probability

For a partition {B₁, B₂, …, Bₙ} of Ω:

P(A) = Σᵢ P(A|Bᵢ)P(Bᵢ)
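A classic worked example ties Bayes' theorem and the law of total probability together: the probability of disease given a positive test, where the denominator P(+) is expanded over the partition {disease, no disease}. A minimal sketch; the prevalence and test accuracies below are illustrative numbers, not taken from the text:

```python
# Hypothetical numbers for illustration only.
p_disease = 0.01          # prior P(D)
p_pos_given_d = 0.95      # sensitivity P(+|D)
p_pos_given_not_d = 0.05  # false positive rate P(+|Dᶜ)

# Law of total probability: P(+) = P(+|D)P(D) + P(+|Dᶜ)P(Dᶜ)
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D)P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(f"P(disease | positive) = {p_d_given_pos:.3f}")  # ≈ 0.161
```

Despite a 95% sensitive test, the posterior is only about 16% because the prior is so low, which is exactly the kind of update Bayes' theorem formalises.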

Random Variables

A random variable is a function X: Ω → ℝ that assigns a numerical value to each outcome.

Discrete Random Variables

A discrete random variable takes countably many values.

  • Probability mass function (PMF): p(x) = P(X = x)
  • Cumulative distribution function (CDF): F(x) = P(X ≤ x) = Σ p(k) for k ≤ x

Properties of PMF:

  • p(x) ≥ 0 for all x
  • Σ p(x) = 1
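To make the PMF and CDF definitions concrete, here is a minimal sketch using scipy.stats (assumed installed) for a Binomial(10, 0.3) variable; the parameters are illustrative:

```python
import numpy as np
from scipy.stats import binom

n, p = 10, 0.3
X = binom(n, p)
ks = np.arange(n + 1)

pmf = X.pmf(ks)
print(pmf.sum())       # Σ p(k) = 1, up to floating point
print(X.cdf(3))        # F(3) = P(X ≤ 3)
print(pmf[:4].sum())   # same value, summing the PMF directly
```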

Continuous Random Variables

A continuous random variable takes uncountably many values, typically in an interval.

  • Probability density function (PDF): f(x) where P(a ≤ X ≤ b) = ∫ₐᵇ f(x)dx
  • CDF: F(x) = P(X ≤ x) = ∫₋∞ˣ f(t)dt

Properties of PDF:

  • f(x) ≥ 0 for all x
  • ∫₋∞^∞ f(x)dx = 1
  • P(X = x) = 0 for any specific value
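These properties can be checked numerically. A minimal sketch with scipy (assumed installed), integrating the standard normal density and comparing the integral of the PDF against the CDF:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

X = norm(loc=0, scale=1)  # standard normal

# The density integrates to 1 over the real line
area, _ = quad(X.pdf, -np.inf, np.inf)
print(area)  # ≈ 1.0

# P(a ≤ X ≤ b) as an integral of the PDF, and via the CDF
a, b = -1, 1
prob, _ = quad(X.pdf, a, b)
print(prob, X.cdf(b) - X.cdf(a))  # both ≈ 0.6827
```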

Expected Value

The expected value (mean) is the weighted average of all possible values.

Discrete: E[X] = Σ x · P(X = x)

Continuous: E[X] = ∫₋∞^∞ x · f(x)dx

Properties:

  • Linearity: E[aX + bY] = aE[X] + bE[Y]
  • E[c] = c for any constant
  • E[g(X)] = Σ g(x)P(X = x) or ∫ g(x)f(x)dx
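A short sketch (assuming numpy) computing E[X] for a fair die from the definition and checking linearity by simulation; the constants a = 2, b = 3 are arbitrary:

```python
import numpy as np

# Fair six-sided die: E[X] = Σ x · P(X = x)
x = np.arange(1, 7)
p = np.full(6, 1 / 6)
EX = np.sum(x * p)
print(EX)  # 3.5

# Linearity: E[aX + b] = aE[X] + b, checked by simulation
rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=200_000)
print(np.mean(2 * rolls + 3), 2 * EX + 3)  # both ≈ 10.0
```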

Variance and Standard Deviation

Variance measures the spread of a distribution around its mean:

Var(X) = E[(X - μ)²] = E[X²] - (E[X])²

Standard deviation: σ = √Var(X)

Properties:

  • Var(aX + b) = a²Var(X)
  • Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)
  • For independent X, Y: Var(X + Y) = Var(X) + Var(Y)
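The first and third properties are easy to confirm by simulation. A minimal sketch (assuming numpy); the distributions and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5, scale=2, size=500_000)  # Var(X) = 4

# Var(aX + b) = a²Var(X): the shift b drops out
a, b = 3, 7
print(np.var(a * x + b), a**2 * np.var(x))  # both ≈ 36

# For independent X, Y: Var(X + Y) = Var(X) + Var(Y)
y = rng.normal(loc=0, scale=1, size=500_000)  # Var(Y) = 1
print(np.var(x + y), np.var(x) + np.var(y))  # both ≈ 5
```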

Covariance and Correlation

Covariance measures how two variables vary together:

Cov(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]

Correlation is the normalised covariance:

ρ(X,Y) = Cov(X,Y) / (σₓσᵧ), which always lies in [-1, 1]
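A minimal sketch (assuming numpy) computing ρ by hand from the covariance and comparing it with numpy's built-in; the data-generating process is an illustrative choice with true ρ ≈ 0.8:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.8 * x + rng.normal(scale=0.6, size=100_000)  # correlated with x

# ρ = Cov(X,Y) / (σₓσᵧ), using the population (ddof=0) normalisation
cov_xy = np.cov(x, y, ddof=0)[0, 1]
rho = cov_xy / (np.std(x) * np.std(y))
print(rho)                      # ≈ 0.8
print(np.corrcoef(x, y)[0, 1])  # numpy's built-in agrees
```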

Common Distributions

Discrete Distributions

| Distribution | PMF | E[X] | Var(X) | Use Case |
|---|---|---|---|---|
| Bernoulli(p) | pˣ(1-p)¹⁻ˣ | p | p(1-p) | Single trial |
| Binomial(n,p) | C(n,k)pᵏ(1-p)ⁿ⁻ᵏ | np | np(1-p) | Number of successes in n trials |
| Poisson(λ) | λᵏe⁻λ/k! | λ | λ | Count of rare events |
| Geometric(p) | p(1-p)ᵏ⁻¹ | 1/p | (1-p)/p² | Trials until first success |
| Negative Binomial(r,p) | C(k-1,r-1)pʳ(1-p)ᵏ⁻ʳ | r/p | r(1-p)/p² | Trials until r successes |

Continuous Distributions

| Distribution | PDF | E[X] | Var(X) | Use Case |
|---|---|---|---|---|
| Uniform(a,b) | 1/(b-a) | (a+b)/2 | (b-a)²/12 | Equal likelihood |
| Exponential(λ) | λe⁻λˣ | 1/λ | 1/λ² | Wait times, memoryless |
| Normal(μ,σ²) | (1/(σ√(2π)))e^(-(x-μ)²/(2σ²)) | μ | σ² | Natural phenomena |
| Gamma(α,β) | βᵅxᵅ⁻¹e⁻βˣ/Γ(α) | α/β | α/β² | Sum of exponentials |
| Beta(α,β) | xᵅ⁻¹(1-x)ᵝ⁻¹/B(α,β) | α/(α+β) | αβ/((α+β)²(α+β+1)) | Probabilities |

The Normal Distribution

The normal (Gaussian) distribution is ubiquitous due to the Central Limit Theorem.

Standard normal: Z ~ N(0,1)

Standardisation: If X ~ N(μ,σ²), then Z = (X - μ)/σ ~ N(0,1)

68-95-99.7 Rule:

  • 68% of values fall within 1σ of μ
  • 95% of values fall within 2σ of μ
  • 99.7% of values fall within 3σ of μ
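These percentages follow directly from the standard normal CDF, and since standardisation maps any N(μ,σ²) to N(0,1), one computation covers every normal. A minimal sketch with scipy (assumed installed):

```python
from scipy.stats import norm

# P(μ - kσ ≤ X ≤ μ + kσ) is the same for every normal, so use N(0,1)
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k}σ: {prob:.4f}")
# within 1σ: 0.6827, within 2σ: 0.9545, within 3σ: 0.9973
```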

Limit Theorems

Law of Large Numbers (LLN)

As the sample size n → ∞, the sample mean X̄ₙ converges to the population mean μ:

X̄ₙ = (1/n)Σxᵢ → μ as n → ∞
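A minimal sketch (assuming numpy) watching the running sample mean settle toward the true mean; the exponential distribution and its mean μ = 2 are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=1_000_000)  # true mean μ = 2

# Running sample mean X̄ₙ as n grows
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 1_000, 1_000_000):
    print(n, running_mean[n - 1])  # drifts toward 2 as n → ∞
```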

Central Limit Theorem (CLT)

For independent, identically distributed random variables X₁, …, Xₙ with mean μ and variance σ²:

√n(X̄ₙ - μ)/σ → N(0, 1) as n → ∞

The sum/average of many independent random variables is approximately normal, regardless of the underlying distribution.
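To see this empirically, one can standardise means of uniform samples, which are nowhere near normal individually, and check that the result behaves like N(0,1). A minimal sketch (assuming numpy); n and the repetition count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Means of n Uniform(0,1) samples; Uniform has μ = 0.5, σ² = 1/12
n, reps = 50, 100_000
means = rng.uniform(size=(reps, n)).mean(axis=1)

# Standardised means should be approximately N(0, 1)
z = (means - 0.5) / np.sqrt(1 / 12 / n)
print(z.mean(), z.std())           # ≈ 0 and ≈ 1
print(np.mean(np.abs(z) <= 1.96))  # ≈ 0.95, matching the normal
```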

Statistical Inference

Point Estimation

Point estimation uses the data to produce a single value for a population parameter.

  • Sample mean: X̄ = (1/n)Σxᵢ estimates μ
  • Sample variance: s² = (1/(n-1))Σ(xᵢ - X̄)² estimates σ²
  • Maximum Likelihood Estimation (MLE): Find θ that maximises P(data|θ)
  • Method of Moments: Equate sample moments to population moments

Desirable properties: Unbiasedness, consistency, efficiency
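For a concrete case, the exponential distribution has a closed-form MLE: maximising the likelihood Π λe^(−λxᵢ) gives λ̂ = 1/x̄, which here coincides with the method-of-moments estimate. A minimal sketch (assuming numpy); the true λ = 0.5 is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(6)
true_lambda = 0.5
x = rng.exponential(scale=1 / true_lambda, size=10_000)

# MLE for Exponential(λ): λ̂ = 1/x̄
lambda_hat = 1 / x.mean()
print(lambda_hat)  # ≈ 0.5

# Unbiased sample variance with the n−1 denominator
print(np.var(x, ddof=1), 1 / true_lambda**2)  # both ≈ 4
```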

Interval Estimation

A confidence interval provides a range of plausible values.

95% CI for μ (known σ): X̄ ± 1.96(σ/√n)

95% CI for μ (unknown σ): X̄ ± t₀.₀₂₅,ₙ₋₁(s/√n)

Interpretation: If we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter.
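A minimal sketch (assuming numpy and scipy) constructing the unknown-σ interval from the formula above; the sample is simulated from N(10, 9) purely for illustration:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
x = rng.normal(loc=10, scale=3, size=25)

n, xbar, s = x.size, x.mean(), x.std(ddof=1)
t_crit = t.ppf(0.975, df=n - 1)  # t₀.₀₂₅,ₙ₋₁

lo, hi = xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n)
print(f"95% CI for μ: ({lo:.2f}, {hi:.2f})")  # should usually cover 10
```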

Hypothesis Testing

  1. State null hypothesis H₀ and alternative H₁
  2. Choose significance level α (typically 0.05)
  3. Calculate test statistic from data
  4. Find p-value or critical region
  5. Reject H₀ if p-value < α

Errors:

  • Type I error (α): Rejecting H₀ when it’s true (false positive)
  • Type II error (β): Failing to reject H₀ when it’s false (false negative)
  • Power = 1 - β: Probability of correctly rejecting false H₀
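The five steps map directly onto a one-sample t-test. A minimal sketch with scipy (assumed installed); the simulated data, with true mean 10.5, make rejection of H₀: μ = 10 likely:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(8)
x = rng.normal(loc=10.5, scale=2, size=40)

# H₀: μ = 10 vs H₁: μ ≠ 10, at α = 0.05
result = ttest_1samp(x, popmean=10)
print(result.statistic, result.pvalue)
if result.pvalue < 0.05:
    print("reject H₀")
else:
    print("fail to reject H₀")
```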

Common Statistical Tests

| Test | Purpose | Assumptions |
|---|---|---|
| One-sample t-test | Compare a mean to a value | Normality |
| Two-sample t-test | Compare two means | Normality, equal variance |
| Paired t-test | Compare paired observations | Normality of differences |
| χ² test | Test categorical associations | Expected counts ≥ 5 |
| F-test | Compare variances | Normality |
| ANOVA | Compare multiple means | Normality, equal variance |

Bayesian Statistics

The Bayesian Approach

  • Prior P(θ): Initial belief about parameter
  • Likelihood P(data|θ): Probability of data given parameter
  • Posterior P(θ|data): Updated belief after observing data
  • Evidence P(data): Normalising constant
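The cleanest illustration is a conjugate update, where the posterior has the same family as the prior. A minimal sketch (assuming scipy) for a Beta prior on a coin's heads probability updated with binomial data; the prior and data are illustrative:

```python
from scipy.stats import beta

# Prior: Beta(2, 2) belief about a coin's heads probability θ
a, b = 2, 2

# Observe 7 heads in 10 flips; Beta is conjugate to the binomial,
# so the posterior is Beta(a + heads, b + tails)
heads, flips = 7, 10
a_post, b_post = a + heads, b + (flips - heads)

posterior = beta(a_post, b_post)
print(posterior.mean())          # posterior mean ≈ 0.643
print(posterior.interval(0.95))  # 95% credible interval for θ
```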

Bayesian vs Frequentist

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability | Long-run frequency | Degree of belief |
| Parameters | Fixed unknown constants | Random variables |
| Prior information | Not formally used | Explicitly incorporated |
| Inference | Point estimates, CIs | Posterior distribution |
| Interpretation | Objective | Subjective/epistemic |

Regression

Simple Linear Regression

Models the relationship between two variables:

y = β₀ + β₁x + ε

where ε is a zero-mean error term.

Least squares estimates:

  • β̂₁ = Cov(x,y)/Var(x)
  • β̂₀ = ȳ - β̂₁x̄

Coefficient of determination R² = 1 - SS_res/SS_tot measures the proportion of variance explained.
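A minimal sketch (assuming numpy) applying the least squares formulas above to simulated data with known coefficients β₀ = 1.5 and β₁ = 2, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 2.0 * x + rng.normal(scale=1.0, size=200)  # true β₀=1.5, β₁=2

# Least squares estimates from the formulas above
beta1 = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
beta0 = y.mean() - beta1 * x.mean()

# R² = 1 - SS_res/SS_tot
resid = y - (beta0 + beta1 * x)
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
print(beta0, beta1, r2)  # ≈ 1.5, 2.0, and R² close to 1
```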

Multiple Linear Regression

In matrix form: y = Xβ + ε

Least squares solution: β̂ = (XᵀX)⁻¹Xᵀy

Assumptions: Linearity, independence, homoscedasticity, normality of errors

Multicollinearity: When predictors are highly correlated, estimates become unstable.
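A minimal sketch (assuming numpy) solving the least squares problem in matrix form; np.linalg.lstsq solves the same normal equations as (XᵀX)⁻¹Xᵀy but more stably. The final lines show how a near-duplicate predictor makes XᵀX nearly singular, which is the numerical face of multicollinearity:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500
X = np.column_stack([np.ones(n),             # intercept column
                     rng.normal(size=n),
                     rng.normal(size=n)])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# β̂ = (XᵀX)⁻¹Xᵀy, computed via a stable least squares solver
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # ≈ [1.0, -2.0, 0.5]

# Multicollinearity: a near-duplicate column makes XᵀX nearly singular
X_bad = np.column_stack([X, X[:, 1] + 1e-8 * rng.normal(size=n)])
print(np.linalg.cond(X_bad.T @ X_bad))  # huge condition number
```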

Learning Resources

Books

  • A First Course in Probability by Sheldon Ross
  • Statistical Inference by Casella and Berger
  • All of Statistics by Larry Wasserman
  • Probability Theory: The Logic of Science by E.T. Jaynes

Online Courses