# Understanding the Foundations of Probability in Machine Learning


## Introduction

Probability theory serves as a mathematical framework to quantify uncertainty in various contexts, forming a cornerstone for machine learning studies. This article aims to equip readers with the necessary terminology and mathematical concepts required to apply probability theory effectively in machine learning tasks.

## The Mathematics of Probability

In this section, we will provide a brief overview of fundamental probability concepts, ensuring readers have a solid grasp of the essential vocabulary and axioms necessary for utilizing this theory in machine learning.

Probability revolves around the potential outcomes of events. The complete set of possible outcomes is known as the **sample space**, denoted as S. For instance, the sample space for a coin toss is {heads, tails}.

For probability definitions to align with the relative frequency interpretation, they must adhere to specific properties. Modern probability theory begins with a set of axioms that specify the following properties:

- A sample space S is defined. Axiom I states that the probability of an event A must be non-negative and confined between 0 and 1: 0 ≤ P(A) ≤ 1.
- Axiom II posits that the total probability mass equals 1: P(S) = 1.
- Axiom III states that for two mutually exclusive events A and B, the probability that either occurs is the sum of their individual probabilities: P(A ∪ B) = P(A) + P(B).

It's important to note that probability theory does not concern itself with how probabilities are assigned or their underlying meanings. Any probability assignment that meets the aforementioned axioms is considered valid.

A **random variable** (denoted as X) is a variable that assumes values randomly selected from a sample space. For example, if x represents an outcome of a coin flip, we may refer to it as *x* = heads. Random variables can be either discrete (like the coin) or continuous (capable of taking on an infinite number of values).

To articulate the likelihood of each potential value of a random variable X, we define a **probability distribution function** (PDF). We denote it as X ~ P(x), indicating that X is a random variable drawn from the probability distribution P(x). The nature of PDFs varies between discrete and continuous random variables.

### Discrete Distributions

Discrete random variables are characterized by a **probability mass function** (PMF), which assigns a probability to each value within the variable's sample space. For example, the PMF of a uniform distribution across *n* outcomes is represented as P(X=*x*) = 1/*n*. This indicates that "the probability of X attaining the value *x* is 1 divided by the number of potential values." This distribution is termed uniform because each outcome has an equal likelihood. A fair die roll is an example of a uniform distribution, while a biased die is modeled using a categorical distribution, assigning different probabilities to each outcome.
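
As a minimal sketch (in Python, with a hypothetical biased die), a discrete PMF can be represented as a dictionary mapping each outcome to its probability:

```python
import random

# Uniform PMF for a fair six-sided die: P(X = x) = 1/n for each face.
fair_die = {face: 1 / 6 for face in range(1, 7)}

# Categorical PMF for a hypothetical biased die that favors six.
biased_die = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

# Any valid PMF must sum to 1 over the sample space (Axiom II).
assert abs(sum(fair_die.values()) - 1.0) < 1e-9
assert abs(sum(biased_die.values()) - 1.0) < 1e-9

# Draw one sample from the categorical distribution.
sample = random.choices(list(biased_die), weights=list(biased_die.values()), k=1)[0]
print(sample)
```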

### Continuous Distributions

Continuous random variables are described using PDFs, which can be more challenging to comprehend. The PDF of a random variable *x* is generally denoted as *f*(*x*). PDFs map an infinite sample space to relative likelihood values. Unlike discrete distributions, the value of the PDF at X = *x* does not represent the actual probability of *x*. This common misconception arises as, due to the infinite values that *x* can assume, the probability of *x* taking on any specific value is effectively 0.

### Joint Probability Distributions

A joint probability distribution encompasses multiple random variables. For example, for random variables X and Y, the joint probability is expressed as P(X=*x*, Y=*y*) or simply P(*x*, *y*). This represents "the probability that X results in *x* and Y results in *y*." If X denotes the outcome of a coin toss and Y indicates the result of a dice roll, then P(heads, 6) signifies the likelihood of the coin showing heads and the die landing on 6. When both variables are discrete, the joint distribution can be illustrated as a straightforward probability table.
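
For the coin-and-die example, the joint table for two independent fair experiments can be sketched as follows (using exact fractions):

```python
from fractions import Fraction

# Joint PMF for an independent fair coin X and fair die Y:
# independence means P(x, y) = P(x) * P(y).
coin = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
die = {face: Fraction(1, 6) for face in range(1, 7)}

joint = {(x, y): coin[x] * die[y] for x in coin for y in die}

print(joint[("heads", 6)])  # 1/12

# The twelve entries of the table sum to 1 (Axiom II).
assert sum(joint.values()) == 1
```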

### Marginal Probability Distributions

The joint PMF/PDF reveals insights into the combined behavior of X and Y. However, there are instances where we seek the probabilities of events involving each random variable in isolation. For two discrete random variables X and Y, the marginal probability mass function of X is obtained by summing the joint PMF over all values of Y:

P(X = *x*) = Σ_*y* P(X = *x*, Y = *y*)

Similarly for Y:

P(Y = *y*) = Σ_*x* P(X = *x*, Y = *y*)

Marginal PMFs uphold all the properties of single-variable PMFs and provide the data needed to calculate the probabilities of events concerning the respective random variable.

It is generally impossible to derive the relative frequencies of paired values X and Y solely from the individual frequencies of X and Y. The same applies to PMFs: typically, knowledge of marginal PMFs is inadequate to define the joint PMF.
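
Marginalization can be sketched over a small, hypothetical joint table; note how the marginals alone would not let us rebuild the joint:

```python
from collections import defaultdict
from fractions import Fraction

# A hypothetical, non-independent joint PMF over X in {heads, tails}
# and Y in {1, ..., 6}: heads only occurs with low faces.
joint = {("heads", y): Fraction(1, 8) for y in (1, 2, 3, 4)}
joint.update({("tails", y): Fraction(1, 4) for y in (5, 6)})

# Marginalize: P_X(x) = sum over y of P(x, y), and symmetrically for Y.
p_x = defaultdict(Fraction)
p_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

print(p_x["heads"], p_x["tails"])  # 1/2 1/2

# The product of these marginals would wrongly assign P(heads, 6) = 1/12,
# whereas the true joint probability of that pair is 0.
assert ("heads", 6) not in joint
```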

### Conditional Probability Distributions

Often, we need to ascertain whether two events, A and B, are interrelated in a manner such that the occurrence of one event, say B, affects the likelihood of the other event, A. This involves computing the conditional probability P(A|B), which can be interpreted as "the probability of event A given that event B has occurred." Conditional probability can be expressed mathematically as follows:

P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0

Knowing that event B has taken place indicates that the experiment's outcome lies within the set B. Thus, we can view the experiment as having a reduced sample space of B when analyzing P(A|B).

In the random-variable notation used earlier, the same relationship reads:

P(*x*|*y*) = P(*x*, *y*) / P(*y*)

Additionally, by multiplying both sides of the preceding equation by P(*y*), we derive the **chain rule** of probability: P(*x*, *y*) = P(*x*|*y*) × P(*y*).
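
Using hypothetical numbers, conditional probability and the chain rule can be checked on a small joint table:

```python
from fractions import Fraction

# Conditional probability from a joint table: P(x | y) = P(x, y) / P(y).
# Hypothetical joint over weather X and umbrella use Y.
joint = {
    ("rain", "umbrella"): Fraction(3, 10),
    ("rain", "no_umbrella"): Fraction(1, 10),
    ("sun", "umbrella"): Fraction(1, 10),
    ("sun", "no_umbrella"): Fraction(5, 10),
}

# Marginal P(umbrella) = 3/10 + 1/10 = 2/5.
p_umbrella = sum(p for (x, y), p in joint.items() if y == "umbrella")
p_rain_given_umbrella = joint[("rain", "umbrella")] / p_umbrella

print(p_rain_given_umbrella)  # 3/4

# Chain rule: P(x, y) = P(x | y) * P(y).
assert p_rain_given_umbrella * p_umbrella == joint[("rain", "umbrella")]
```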

### Bayes’ Rule

From our exploration of conditional probability, we can express the chain rule for two variables in two equivalent forms:

- P(*x*, *y*) = P(*x*|*y*) × P(*y*)
- P(*x*, *y*) = P(*y*|*x*) × P(*x*)

Equating both right-hand sides and dividing by P(*y*), we arrive at Bayes’ rule:

P(*x*|*y*) = P(*y*|*x*) × P(*x*) / P(*y*)

Bayes’ rule is of paramount importance in statistics and machine learning. It is commonly applied in scenarios where events of interest form a partition. The "a priori probabilities" of these events, P(*x*), represent their probabilities before the experiment is conducted. Once the experiment is performed and we learn that y is the outcome, the "a posteriori probabilities" become the probabilities of the events in the partition, P(*x*|*y*), considering this new information. Bayes’ rule underpins Bayesian statistics, facilitating updates to our beliefs about quantities as we collect more data.
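
As an illustration, Bayes' rule applied to a disease-screening example with entirely hypothetical numbers:

```python
from fractions import Fraction

# Hypothetical screening test: 1% prevalence, 95% sensitivity,
# 10% false-positive rate. What is P(disease | positive test)?
p_disease = Fraction(1, 100)             # a priori probability P(x)
p_pos_given_disease = Fraction(95, 100)  # likelihood P(y | x)
p_pos_given_healthy = Fraction(10, 100)  # false-positive rate

# Total probability of a positive test, P(y).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# A posteriori probability P(x | y) = P(y | x) * P(x) / P(y).
posterior = p_pos_given_disease * p_disease / p_pos
print(float(posterior))  # ~0.088: still unlikely, despite the positive test
```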

## Useful Probability Distributions

In data modeling, several collections of probability distributions frequently appear. We will examine these distributions to provide readers with insights when encountering them in practical scenarios.

### Distribution over Integers

**Binomial Distribution**
The binomial distribution describes the number of "successes" in a fixed number of independent repetitions of an experiment, where each repetition results in either "success" or "failure."

A typical example is a biased coin toss, where we flip a coin that lands heads with probability f a total of N times and observe the number of heads r. The probability of observing exactly r heads is:

P(r | f, N) = C(N, r) f^r (1 − f)^(N−r)

where C(N, r) is the binomial coefficient counting the ways to choose which r of the N flips come up heads.

Binomial distributions frequently arise in real life. For instance, when a new medication is administered to a series of patients, each treatment either succeeds or fails, and the number of successes is binomially distributed. Buying lottery tickets is similar: each ticket either wins or loses, so the number of winning tickets among N purchases follows a binomial distribution.
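
A minimal sketch of the binomial PMF, C(N, r) · f^r · (1 − f)^(N−r), using Python's standard library:

```python
from math import comb

def binomial_pmf(r, n, f):
    """P(r heads | f, n) = C(n, r) * f^r * (1 - f)^(n - r)."""
    return comb(n, r) * f**r * (1 - f) ** (n - r)

# Probability of exactly 3 heads in 10 flips of a coin biased to land
# heads with probability f = 0.3.
p = binomial_pmf(3, 10, 0.3)
print(round(p, 4))  # 0.2668

# The PMF sums to 1 over all possible counts r = 0..n.
assert abs(sum(binomial_pmf(r, 10, 0.3) for r in range(11)) - 1.0) < 1e-9
```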

**Poisson Distribution**
Next, we consider the **Poisson distribution** with parameter λ > 0, used to model the number of events occurring within a specified time frame. We interpret λ as the average number of events within that time interval, expressed as follows:

P(r) = λ^r e^(−λ) / r!,  for r = 0, 1, 2, …

This formula represents the probability of exactly r occurrences in a Poisson experiment with a mean success rate of λ.

For example, consider a hospital where births occur randomly at an average rate of 1.8 births per hour. To find the probability of observing 4 births in one hour, we set r = 4 and λ = 1.8 in the formula:

P(4) = 1.8⁴ e^(−1.8) / 4! ≈ 0.0723
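
This calculation can be verified in a few lines:

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(r events | lambda) = lambda^r * e^(-lambda) / r!"""
    return lam**r * exp(-lam) / factorial(r)

# Births at an average rate of 1.8 per hour: probability of exactly
# 4 births in one hour.
p = poisson_pmf(4, 1.8)
print(round(p, 4))  # 0.0723
```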

Next, we will examine the **exponential distribution over integers**. This distribution pertains to the number of time steps until a specific event occurs, for instance, the number of consecutive heads before the first tail when flipping a coin that lands heads with probability f:

P(r | f) = f^r (1 − f),  for r = 0, 1, 2, …

which can equivalently be written as P(r) ∝ e^(−λr), where λ = ln(1/f).

### Distribution over Unbounded Real Numbers

**Gaussian Distribution, Student-t Distribution**
In practice, many datasets consist of real numbers. Unlike discrete probability distributions, continuous data do not have distinct values. Instead, we utilize a curve or function to describe the probability density across the distribution range.

The **Gaussian** (or **Normal**) **distribution** describes a class of distributions that are symmetric about their mean and characterized by two parameters: mean μ and standard deviation σ. It can be expressed as:

P(x | μ, σ) = (1/Z) exp(−(x − μ)² / 2σ²)

where

Z = √(2πσ²)

The parameter β = 1/σ², known as the precision parameter of the Gaussian, is often useful. Widely regarded as a common distribution in real-world applications, the Gaussian distribution is crucial; however, caution is warranted when modeling data with it.

The Gaussian distribution is unimodal, possessing a single pronounced peak. It is an extreme example of a unimodal distribution, characterized by light tails: its log-probability-density decreases quadratically. The typical deviation of x from μ is σ, with the probabilities of x deviating from μ by more than 2σ, 3σ, 4σ, and 5σ being 0.046, 0.003, 6×10⁻⁵, and 6×10⁻⁷, respectively. In real data, however, deviations of four or five times the typical deviation may be rare, but they are usually not as rare as 6×10⁻⁵ would suggest, so the Gaussian tends to understate the probability of outliers. The Gaussian distribution nonetheless remains prevalent in machine learning data analysis (which we will address later).

Often, data is modeled as a mixture of Gaussians. A mixture of two Gaussians is defined by two means, two standard deviations, and two mixing coefficients π₁ and π₂, satisfying π₁ + π₂ = 1 and πᵢ ≥ 0. This is represented as:

P(x | μ₁, σ₁, π₁, μ₂, σ₂, π₂) = π₁ N(x; μ₁, σ₁²) + π₂ N(x; μ₂, σ₂²)

where N(x; μ, σ²) denotes the Gaussian density described above.

An appropriately weighted mixture of an infinite number of Gaussians, all with mean μ, results in a **Student-t distribution**:

P(x | μ, s, n) = (1/Z) · 1 / (1 + (x − μ)² / (ns²))^((n+1)/2)

where

Z = √(πns²) · Γ(n/2) / Γ((n+1)/2)

and n denotes the degrees of freedom, while Γ is the gamma function. The Student-t distribution is often employed in hypothesis testing: estimating the mean of a normally distributed population from small samples, and testing the statistical significance of the difference between two sample means or constructing confidence intervals for small samples.

If n > 1, the Student distribution has a mean, which equals μ. If n > 2, it also has a finite variance, σ² = ns²/(n − 2). As n approaches infinity, the Student distribution converges to the normal distribution with mean μ and standard deviation s. The Student distribution is relevant in both classical statistics and Bayesian inference. From a Bayesian perspective, it is beneficial to consider the t-distribution as a continuous mixture of normal distributions with varying variances. In the specific case where n = 1, the Student distribution is termed the **Cauchy distribution**.

### Distribution over Positive Real Numbers

**Exponential Distribution**
We will first discuss the **exponential distribution**. In continuous-time stochastic process studies, this distribution is typically employed to model the time until a specific event occurs. It is expressed as follows:

P(x | s) = (1/Z) e^(−x/s),  for x ≥ 0

where

Z = s

In classical probability studies, the exponential distribution is the most significant continuous distribution for constructing and understanding continuous-time Markov chains. Based on the distribution equation, as 1/s increases, events in the process tend to occur more rapidly, leading us to view 1/s as a rate. In some nomenclature, it is common to substitute 1/s with λ in the equation.

A notable characteristic of the exponential distribution is its "memoryless" property, signifying that the future lifetime of an object follows the same distribution, irrespective of its existing duration.

This concept can be illustrated with the following example. Suppose we start an alarm clock at time 0, which will ring after a time X that follows an exponential distribution with rate λ. Denoting X as the clock's lifetime, for any t > 0 we have:

P(X > t) = e^(−λt)

Upon returning at time s to find that the alarm has not yet rung, we observe the event {X > s}. If we let Y represent the remaining lifetime of the clock given {X > s}, then:

P(Y > t | X > s) = P(X > s + t) / P(X > s) = e^(−λ(s + t)) / e^(−λs) = e^(−λt)

This indicates that the remaining lifetime after we observe that the alarm has yet to ring at time s retains the same distribution as the original lifetime X. Crucially, this means that the distribution of the remaining lifetime is independent of s. The memoryless property is essential for studying continuous-time Markov chains.
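
The memoryless property can also be checked empirically by simulation (the rate and observation time below are arbitrary choices):

```python
import random

random.seed(0)
rate = 0.5  # rate lambda; the mean lifetime is 1 / rate = 2

# Draw many exponential lifetimes and condition on survival past s = 1.
lifetimes = [random.expovariate(rate) for _ in range(200_000)]
survivors = [x - 1.0 for x in lifetimes if x > 1.0]

# Memorylessness: the remaining lifetime of a survivor has the same
# mean as a fresh lifetime, regardless of how long it has lasted.
mean_fresh = sum(lifetimes) / len(lifetimes)
mean_remaining = sum(survivors) / len(survivors)
print(round(mean_fresh, 2), round(mean_remaining, 2))  # both ≈ 2
```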

The exponential distribution is widely utilized to characterize events that occur randomly over time, such as the duration between failures of electronic devices or the time intervals between arrivals at a service counter. It is related to the Poisson distribution, which describes the frequency of event occurrences within a defined time period.

**Gamma Distribution**
The **gamma distribution** resembles a Gaussian distribution but differs in that while the Gaussian spans from −∞ to ∞, gamma distributions extend from 0 to ∞. Just as the Gaussian has two parameters, μ and σ, that determine the mean and spread of the distribution, the gamma distribution also has two parameters: a scale parameter s and a shape parameter c. It is generated by multiplying the one-parameter exponential distribution by a polynomial, x^(c−1). The probability density function is expressed as:

P(x | s, c) = (1/Z) (x/s)^(c−1) e^(−x/s),  for 0 ≤ x < ∞

where

Z = Γ(c) · s

An interesting fact is that the gamma distribution derives its name from its normalizing constant, the gamma function Γ(c).

One useful intuition for the exponential distribution is that it predicts the wait time until the **first** event occurs. Conversely, the gamma distribution predicts the wait time until the k-th event occurs. The gamma distribution finds application in wait time modeling, reliability (failure) modeling, and service time modeling (Queuing Theory).
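
This wait-time intuition can be checked by simulation: summing k exponential inter-arrival times yields a gamma-distributed wait whose mean and variance match the theoretical k·s and k·s²:

```python
import random

random.seed(1)
rate, k = 1.0, 3  # wait for the 3rd event; scale s = 1 / rate

# The wait until the k-th event of a Poisson process is the sum of k
# independent exponential inter-arrival times, i.e. gamma-distributed
# with shape k and scale 1/rate.
waits = [sum(random.expovariate(rate) for _ in range(k))
         for _ in range(100_000)]

# A gamma distribution with shape k and scale s has mean k*s and
# variance k*s^2; here both should be close to 3.
mean = sum(waits) / len(waits)
var = sum((w - mean) ** 2 for w in waits) / len(waits)
print(round(mean, 2), round(var, 2))  # ≈ 3 and ≈ 3
```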

## Applications of General Probability in Machine Learning

This article's initial portion provided a brief introduction to essential probability terminology, enabling us to frame machine learning inquiries within a probabilistic context. This section will explore some applications made possible by the concepts discussed.

### Supervised Learning

In supervised machine learning, the objective is to learn from labeled data. Labeled data means that for specific inputs X, we know the corresponding outputs Y. Some potential tasks include:

- Classifying or identifying an image
- Detecting whether an email or file is spam or malicious
- Predicting stock prices based on specific company features

How do we achieve these tasks? We can learn the parameters that map X to Y through various methods. For instance, we could learn the conditional probability P(Y|X), which denotes a probability distribution over possible Y values given a new sample X. Alternatively, we might aim to learn P(X|Y), the probability distribution over inputs given the labels.

### Unsupervised Learning

Unsupervised learning encompasses a wide array of techniques for learning from unlabeled data, where we only have some samples X without corresponding outputs Y. Understanding the distribution of unlabeled data is valuable for many tasks, such as anomaly detection. If we learn P(X), with X representing normal bank transactions, we can utilize P(X) to gauge the likelihood of future transactions. When a transaction shows low probability, we can flag it as suspicious or potentially fraudulent.
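
A toy sketch of this idea, fitting a single Gaussian P(X) to hypothetical "normal" transaction amounts and flagging low-density amounts (the data and threshold are illustrative; real systems would use far richer density models):

```python
from math import exp, pi, sqrt

# Hypothetical amounts from "normal" bank transactions.
normal_amounts = [20.0, 35.5, 18.2, 42.0, 27.3, 31.1, 25.8, 29.9]

# Fit a Gaussian P(X) by estimating its mean and variance.
mu = sum(normal_amounts) / len(normal_amounts)
var = sum((a - mu) ** 2 for a in normal_amounts) / len(normal_amounts)

def density(x):
    """Gaussian density of the fitted P(X)."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def is_suspicious(amount, threshold=1e-4):
    """Flag a transaction whose likelihood under P(X) is very low."""
    return density(amount) < threshold

print(is_suspicious(30.0), is_suspicious(500.0))  # False True
```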

Common unsupervised tasks include:

**Clustering**
Clustering is a fundamental unsupervised learning problem. Given data points from distinct groups, how do we ascertain which group each point belongs to? One approach is to assume that each group originates from a different probability distribution. The task then becomes identifying the most probable configuration of these distributions.

**Dimension Reduction, Embedding**
This category encompasses tasks that involve projecting high-dimensional data into a meaningful lower-dimensional space. High-dimensional data consumes memory, hampers computations, and complicates visualization and interpretation. Thus, we seek methods to reduce data dimensions while preserving as much information as possible. This problem can be framed as finding a distribution in a lower-dimensional space that mirrors the original data distribution.

### Reinforcement Learning

Reinforcement learning focuses on training artificial agents to complete specific tasks. These agents learn by acting within their environments and receiving reward signals based on their decisions and behaviors. The agent's goal is to maximize its expected long-term reward. Probability plays a crucial role in various aspects of the reinforcement learning process, as the agent's learning often involves quantifying the "reward" associated with one action versus another.

## Applications of Common Distributions in Machine Learning

In this section, we will explore common use cases where probability distributions manifest in machine learning.

**Binomial Distribution**
The binomial distribution is frequently encountered in binary classification problems in machine learning. In this scenario, the objective is to train and validate an algorithm that predicts whether a specific observation belongs to one class or another (0 or 1 scenarios). The most basic algorithm employed for this purpose is logistic regression.

For instance, if you want to determine whether an animal in an image is a dog, the algorithm ultimately provides a probability value indicating the likelihood that the animal is a dog based on input characteristics (pixel values).

Another example is predicting the risk of hospital readmission within 30 days post-discharge. Here, "risk" signifies the probability of being readmitted within 30 days based on patient characteristics (demographics, medical history, biochemical profile, etc.).

In all these instances, the dependent variable (also known as the target variable or response variable) is binary, taking values of either 0 or 1. In this context, the **objective of a logistic regression model is to estimate the probability of success given the input characteristics**.

**Exponential Distribution**
The previously discussed exponential distribution is integral to constructing Markov chains. Briefly, a Markov chain is a mathematical system that transitions between states according to specific probabilistic rules. The defining feature of a Markov chain is that the future state probabilities depend solely on the current state and the time elapsed (recalling the memoryless property). We will not delve into the mathematical specifics of Markov chains but will discuss their applications.

In recent years, Markov chains have gained traction as popular graph-based models in data mining and machine learning. As a widely used graph model, Markov chains have been integrated into various semi-supervised learning algorithms and employed in classification tasks. For example, a Markov random walk over a graph of data points can estimate the posterior probability that an unlabelled point belongs to each class by computing the probability that a walk starting from that point reaches labeled points of that class.

**Gaussian Distribution**
Gaussian distributions are regarded as the most "natural" distributions, appearing ubiquitously in our daily lives. They possess numerous properties that facilitate mathematical analysis, making the Gaussian distribution prevalent in ML algorithm evaluation.

For instance, we often assume that data or signal errors follow a Gaussian distribution. This is because machine learning (and statistics) treats data as a combination of deterministic and random components. Typically, the random component of data adheres to a Gaussian distribution or is presumed to follow one. This assumption is supported by the Central Limit Theorem (CLT), which posits that the sum of a large number of variables, each exerting a small effect on the outcome, approximates a normal distribution.

In machine learning, expressing a dependent variable as a function of multiple independent variables (for mathematical simplicity) is common. If this function is summed (or represented as a sum of other functions), and if we propose that the number of independent variables is substantial, the dependent variable should exhibit a normal distribution (due to the CLT).
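
The CLT effect is easy to see by simulation: sums of many independent uniform variables have the mean and variance predicted by theory and an approximately Gaussian distribution:

```python
import random

random.seed(2)

# Sum of n independent Uniform(0, 1) variables.
def sum_of_uniforms(n):
    return sum(random.random() for _ in range(n))

sums = [sum_of_uniforms(30) for _ in range(50_000)]

# Each Uniform(0, 1) has mean 0.5 and variance 1/12, so the sum of 30
# should have mean 15 and variance 2.5, with a roughly bell-shaped law.
mean = sum(sums) / len(sums)
var = sum((s - mean) ** 2 for s in sums) / len(sums)
print(round(mean, 1), round(var, 1))  # ≈ 15.0 and ≈ 2.5
```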

Regarding the treatment of error signals as Gaussian, in a typical machine learning scenario, one may encounter errors stemming from diverse sources (e.g., measurement errors, data entry errors, classification errors, data corruption, etc.). It is reasonable to deduce that the cumulative effect of these errors approximates a normal distribution (though it is always prudent to verify!).

**Gamma Distribution**
Gamma distributions frequently emerge in inference problems where a positive quantity is derived from data. Examples include inferring the variance of Gaussian noise from noise samples and estimating the rate parameter of a Poisson distribution based on counts.

Moreover, the gamma function's significance should not be overlooked. Beyond its role in defining the gamma distribution, the gamma function is integral in various other distributions, including the Beta distribution, Dirichlet distribution, Chi-squared distribution, and Student’s t-distribution.

For data scientists, machine learning engineers, and researchers, the gamma function is among the most widely utilized functions due to its application across numerous distributions. These distributions are subsequently employed in Bayesian inference and stochastic processes (such as queueing models).

## Conclusion

This article aimed to introduce and utilize the language of probability to enable a probabilistic perspective on machine learning challenges. We commenced with a review of fundamental terminology and notation associated with key probability concepts. Subsequently, we discussed several important distributions commonly encountered in practice. Finally, we explored the applications of general probability and probability distributions within the realm of machine learning.

## Reference

*Information Theory, Inference and Learning Algorithms*, David J. C. MacKay.

**Author**: Joshua Chou | **Editor**: H4O & Michael Sarazen
