# Understanding Machine Learning: What it Means for Machines

In today's world dominated by artificial intelligence, the phrase "machine learning" has gained significant traction, extending its influence far beyond data science and technological fields. From curating playlists to identifying medical conditions, machine learning is transforming our interactions with the environment. Interestingly, the term AI has become overused, often referring specifically to machine learning rather than true artificial intelligence. **But what does it mean for a machine to "learn"?**

This article seeks to clarify the concept of machine learning by examining it from various angles. We will begin with a straightforward overview that encapsulates the core of machine learning in accessible terms. Following that, we will explore more complex explanations, illuminating the principles and theories that drive this groundbreaking technology. By the conclusion, you will possess a thorough understanding of what it entails for a machine to "learn" and the importance of this process in the evolution of artificial intelligence.

## The Fundamentals of Learning

Grasping the concept of learning is no easy task. Humans acquire knowledge through diverse experiences, gradually shaping our perception of the world around us. In contrast, machines approach learning differently — they analyze data. For machines, learning consists of enhancing performance on a specific task through data interpretation. The primary goal is for the machine to proficiently perform this task with new, previously unseen data after the learning phase.

The key distinction between traditional algorithms and machine learning algorithms is that the latter are not pre-programmed to solve tasks. Instead, they are designed to discover the optimal solution by iteratively refining their parameters based on the provided data.

## A Mathematical Perspective

Mathematically, we can simplify the problem. Given known values of *x* and *y* such that *f(x) = y*, how can we approximate the function *f*? The learning process for *f* involves adjusting a set of parameters *θ* iteratively, ensuring that our current estimate of *f* is as accurate as possible based on the data we have encountered.

While variations of this definition exist — as sometimes only *x* is provided without *y* — we will adhere to this interpretation for the purposes of this article.

## A Practical Example: Linear Regression

Consider a scenario where we have gathered data about two variables, such as house prices and their sizes, or the income levels of individuals relative to their education. Our objective is to uncover the relationship between these variables. Assuming this relationship is linear, we aim to find the straight line that best represents the connection between them. Thus, referring to our previous equation, we seek the best approximation of *f(x)* such that:

$$f(x) = ax + b$$

In this context, *x* could represent house price while *y* corresponds to its size. What are our parameters *θ*? Here, we aim to determine the values of *a* and *b* that best fit the data, so *θ = {a, b}*. Our next step is to find a method to **learn** these parameters.
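As a minimal sketch of this model, the function below evaluates the line for a single input. The name `predict` and the parameter values in the call are assumptions made for illustration only.

```python
def predict(x, a, b):
    """Evaluate the linear model f(x) = a * x + b for one input."""
    return a * x + b


# Hypothetical parameter values, chosen only to illustrate the call.
print(predict(3.0, a=2.0, b=1.0))  # 2.0 * 3.0 + 1.0 = 7.0
```

Learning then amounts to finding values of `a` and `b` for which these predictions match the observed data as closely as possible.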

## An Iterative Learning Approach

As previously mentioned, the main difference from classical algorithms is that we won’t provide explicit instructions to solve this problem. Instead, we will present subsets of data at each stage of the learning process and refine our estimates iteratively. Although this might initially seem abstract, it will become clearer as we define the associated concepts. It is crucial to recognize that this process is iterative: our estimate at step *i+1* should be better than our estimate at step *i*.

## The Loss Function

Our aim is to identify the best approximation of this function. To assert that it is indeed the best, we must establish a value that indicates how closely our estimate aligns with the actual data. Ultimately, we will seek to minimize this value to ensure our approximation closely resembles the data. This criterion is referred to as the **error**, denoted by *J*. Our objective will be to identify the parameters *a* and *b* for which *J* reaches its minimum. Various functions can serve as the error metric (also known as the **loss or cost function**). A common choice for linear regression is the mean squared error, defined as follows:

$$J(a, b) = \frac{1}{m} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2$$

where *m* is the number of data points and *ŷ* represents our current approximation of *y*, that is, *ŷ = ax + b*. The goal is to minimize this function over the known *x* and *y* values so that our approximation reliably predicts previously unseen values of *y* when given new *x* inputs. But how can we ensure that we choose our parameters in a way that reduces this value over time? This question leads us to a fundamental concept in machine learning, known as gradient descent.
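The mean squared error can be sketched in a few lines of plain Python. The function name `mean_squared_error` and the sample values are illustrative assumptions, not data from the article.

```python
def mean_squared_error(y, y_hat):
    """Average of the squared differences between targets and predictions."""
    m = len(y)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / m


# Illustrative values only (not taken from a real dataset).
y = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.5, 2.0]
print(mean_squared_error(y, y_hat))  # (0 + 0.25 + 1) / 3
```

A perfect fit gives an error of 0, and squaring the differences means that large mistakes are penalized more heavily than small ones.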

## Gradient Descent Explained

One last piece is needed before we can formulate the algorithm. While calculating the error using our estimate of *f* and the actual data is straightforward, how do we determine the direction in which to adjust our parameters *θ = {a, b}* to reduce the error? This is where gradient descent comes into play. Gradient descent is an optimization algorithm designed to minimize the loss function by iteratively moving in the direction of steepest descent, which is given by the negative of the gradient. In practice, we compute the partial derivative of the loss function with respect to each learnable parameter *θ*, scale it by a coefficient *α*, and subtract the result from the current parameter value:

$$\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$$

The parameter *α* is typically chosen to be relatively small to avoid overshooting when updating the parameters, which could prevent the algorithm from reaching a minimum of *J*. This parameter is referred to as the **learning rate**, as it controls how large a step the algorithm takes at each update. If the learning rate is too small, the algorithm converges slowly. Conversely, if it is too large, the updates may overshoot the minimum, causing the error to oscillate or even diverge.
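To make the effect of the learning rate concrete, here is a minimal sketch that repeatedly subtracts the learning rate times the derivative from the parameter. It uses a toy loss J(θ) = θ², chosen only because its derivative 2θ is easy to follow. The function name `descend` and the three learning rates are illustrative assumptions.

```python
def descend(theta, alpha, steps):
    """Run gradient descent on J(theta) = theta ** 2, whose derivative is 2 * theta."""
    for _ in range(steps):
        gradient = 2 * theta
        theta = theta - alpha * gradient
    return theta


start = 1.0
small = descend(start, alpha=0.001, steps=20)   # barely moves away from the start
moderate = descend(start, alpha=0.1, steps=20)  # approaches the minimum at 0
large = descend(start, alpha=1.1, steps=20)     # overshoots and moves away from 0
print(small, moderate, large)
```

With the small rate the parameter is still close to its starting value after 20 steps, with the moderate rate it is close to the minimum, and with the large rate each step jumps past the minimum and lands farther away than before.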

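By deriving the values of the partial derivatives using the previously defined loss, with *m* data points, we obtain:

$$\frac{\partial J}{\partial a} = \frac{1}{m} \sum_{i=1}^{m} 2\left(y_i - (a x_i + b)\right)(-x_i)$$

$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} 2\left(y_i - (a x_i + b)\right)(-1)$$

If we refactor the last term of both equations, writing $\hat{y}_i = a x_i + b$ for the current prediction:

$$\frac{\partial J}{\partial a} = -\frac{2}{m} \sum_{i=1}^{m} x_i \left(y_i - \hat{y}_i\right)$$

$$\frac{\partial J}{\partial b} = -\frac{2}{m} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)$$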

At each iteration of the algorithm, we recalculate the error *J* to assess our current estimate. If we are satisfied with the value, we conclude the process; if not, we continue to the next step and update the parameters.
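The partial derivatives of the mean squared error with respect to *a* and *b* can be checked numerically. The sketch below computes them analytically for the line *ŷ = ax + b* and compares the result with central finite differences. The helper names `mse` and `gradients`, the data, and the parameter values are all assumptions made for illustration.

```python
def mse(a, b, xs, ys):
    """Mean squared error of the line a * x + b on the data."""
    m = len(xs)
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / m


def gradients(a, b, xs, ys):
    """Analytic partial derivatives of the MSE with respect to a and b."""
    m = len(xs)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    partial_a = (-2 / m) * sum(x * r for x, r in zip(xs, residuals))
    partial_b = (-2 / m) * sum(residuals)
    return partial_a, partial_b


# Illustrative data and parameter values.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = 0.5, 0.0

partial_a, partial_b = gradients(a, b, xs, ys)

# Central finite differences as an independent numerical check.
eps = 1e-6
numeric_a = (mse(a + eps, b, xs, ys) - mse(a - eps, b, xs, ys)) / (2 * eps)
numeric_b = (mse(a, b + eps, xs, ys) - mse(a, b - eps, xs, ys)) / (2 * eps)
print(partial_a, numeric_a)
print(partial_b, numeric_b)
```

Both derivatives are negative here, so subtracting them from *a* and *b* increases both parameters, which moves the line toward the data.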

With this addition, we have all we need to define the learning algorithm.

## The Algorithm in Action

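Let’s look at some code to see how all the components fit together.

```python
import numpy as np


def fit_line(X, y, alpha=0.1, threshold=1e-3, max_iters=100_000):
    """Learn a and b such that y ≈ a * X + b, using gradient descent on the MSE."""
    # Data is given
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(X)

    # We start by initializing our parameters to 0 and the error to infinity
    a, b = 0.0, 0.0
    error = np.inf

    # alpha can be chosen differently depending on the task, and the threshold
    # can be as small as we want; max_iters is a safety cap in case the
    # threshold is never reached
    for _ in range(max_iters):
        if error <= threshold:
            break

        y_hat = a * X + b  # vector of same dimension as X and y

        partial_a = (-2 / m) * np.sum(X * (y - y_hat))
        partial_b = (-2 / m) * np.sum(y - y_hat)

        a = a - alpha * partial_a
        b = b - alpha * partial_b

        error = np.mean((y - y_hat) ** 2)  # MSE of the estimate before this update

    return a, b, error
```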

## Learning vs Traditional Algorithms

We have outlined how the learning process works for linear regression. In contrast, if we were to use a classical algorithm for this task, we would treat it as an optimization problem and minimize the residual sum of squares directly, by setting its gradient to zero and solving the resulting equations.

In the single-variable case, these equations are relatively simple and likely more straightforward to solve directly than with gradient descent. However, as the number of dimensions grows, the direct solution requires solving an increasingly large linear system, which becomes expensive to compute. Gradient descent tends to scale more gracefully, since each iteration only requires computing the gradient.
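For comparison, the direct solution in the single-variable case can be sketched as follows. It uses the standard least-squares formulas for the slope and intercept of a line. The function name `closed_form_fit` and the data are illustrative assumptions.

```python
def closed_form_fit(xs, ys):
    """Least-squares line for one input variable, computed directly."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    # Sums of cross-deviations and squared deviations from the means.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    a = s_xy / s_xx
    b = y_mean - a * x_mean
    return a, b


# Illustrative data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = closed_form_fit(xs, ys)
print(a, b)  # recovers a = 2.0 and b = 1.0
```

There is no loop over estimates here: the parameters are computed in a single pass, with no intermediate guesses to refine.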

Why do we say that one method learns while the other does not? Essentially, gradient descent **learns** the parameters *a* and *b* by continuously refining them to provide a fitting approximation of the actual data. The linear algebra solution, in contrast, simply computes the optimal result without any iterative learning process. The iterative learning process for linear regression is depicted in the figure below.

While this example effectively illustrates the learning process, the strong assumptions inherent in this model often fall short for more intricate cases. Problems involving nonlinear relationships, interactions between features, or high-dimensional data necessitate more advanced techniques. In the next section, we will explore how we can extend the principles we've discussed to model more complex functions.

## A More Complex Scenario: Neural Networks

Unlike linear regression, which presupposes a straightforward linear relationship between inputs and outputs, neural networks are engineered to identify intricate, nonlinear patterns in data. This capability arises from their architecture, which is loosely inspired by the structure of the human brain, enabling them to **learn** and generalize from extensive datasets.

## The Neuron

Within a neural network, the fundamental unit is the neuron, often referred to as a node or perceptron. Each neuron performs a basic computation that, when integrated with many others, allows the network to tackle complex issues. A neuron consists of **inputs**, **weights**, a **bias** term, and an **activation** function. Together, these elements yield the neuron's final output.

Suppose we aim to classify images of dogs and cats. In this instance, the input would be an array of numbers, with each element corresponding to a pixel in the image. For grayscale images, each number can range between 0 and 1 based on pixel brightness.

Each input *x* is paired with a weight *w*. These weights are parameters learned during the training process and dictate how much each input influences the neuron's output. The neuron computes a weighted sum of its inputs, often called the pre-activation and denoted *z*. Mathematically, this can be expressed as:

$$z = \sum_{i=1}^{n} w_i x_i + b$$

Here, *n* is the number of inputs and *b* is the bias term, an additional learned parameter that allows the neuron to capture patterns that do not pass through the origin. In the earlier example of linear regression, *a* represented a weight while *b* functioned as a bias.

The weighted sum *z* is subsequently passed through an activation function *g(z)*. The activation function introduces non-linearity into the model, enabling the network to learn and represent complex patterns. There are numerous activation functions available, such as the sigmoid function, tanh, and more. For this discussion, we will focus on the Rectified Linear Unit (ReLU) function, which is widely used and simple to implement:

$$g(z) = \max(0, z)$$

This setup produces the output for our neuron. While the mathematical formulation allows us to perform simple nonlinear transformations of the input, it is insufficient for capturing the complex, hierarchical patterns that frequently appear in real-world data. Hence, we must integrate these neurons within a **network**.
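A single neuron with a ReLU activation can be sketched in plain Python as follows. The names `relu` and `neuron`, as well as the pixel values, weights, and biases, are assumptions chosen for illustration.

```python
def relu(z):
    """Rectified Linear Unit: passes positive values through, clips the rest to 0."""
    return max(0.0, z)


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, followed by the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(z)


# Illustrative pixel intensities, weights, and biases.
pixels = [0.0, 0.5, 1.0]
weights = [0.2, -0.4, 0.6]
print(neuron(pixels, weights, bias=0.1))   # z is about 0.5, so ReLU passes it through
print(neuron(pixels, weights, bias=-1.0))  # z is negative, so ReLU outputs 0.0
```

The second call shows the nonlinearity at work: changing only the bias pushes the weighted sum below zero, and the neuron stops responding to this input entirely.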

## The Network

Transitioning from individual neurons to neural networks greatly enhances a model’s capacity to identify complex patterns in data. But how does this operate in practice? Two primary actions are involved:

**Layering**: We group multiple neurons into layers, with each layer possessing its own set of weights and biases. All neurons in a layer share the same activation function.

**Stacking**: Once our layers are defined, we stack them sequentially. This means that the outputs of one layer serve as the inputs for the subsequent layer. The first layer receives the input data, while the output of the final layer represents the model's output, with all intermediate layers acting as hidden layers. The output of each neuron is fed into every neuron of the following layer. This stacking results in what is termed a fully connected (or dense) neural network.

The output layer of a neural network may consist of a single neuron or multiple neurons, depending on the task at hand. For instance, if we aim to identify whether an image contains a dog or a cat, the output layer will consist of a neuron that outputs 1 for a dog and 0 for a cat. If we are predicting the next frame in a video based on the previous one, the output will be a series of neurons corresponding to each pixel of the next frame.
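Layering and stacking can be sketched as a forward pass through a small fully connected network. The 2-3-1 architecture, the hand-picked weights, and the helper names `dense_layer` and `forward` below are assumptions made for illustration. A real network would learn its weights from data.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def dense_layer(inputs, weights, biases, activation):
    """Each row of `weights` holds the weights of one neuron in the layer."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense_layer(activations, weights, biases, activation)
    return activations


# A hypothetical network: 2 inputs, 3 hidden neurons, 1 output neuron.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),  # hidden layer
    ([[1.0, 2.0, 1.0]], [0.5], identity),  # output layer
]
print(forward([2.0, 1.0], layers))  # [4.5]
```

For the input `[2.0, 1.0]`, the hidden layer produces `[1.0, 1.5, 0.0]` (the third neuron is clipped by ReLU), and the output neuron combines these into `1.0 + 3.0 + 0.0 + 0.5 = 4.5`.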

We previously discussed how to learn two parameters for linear regression. However, how can we apply gradient descent to such a complex model with potentially thousands of parameters? The solution lies in the concept of **back-propagation**.

## Learning Through Back-Propagation

Back-propagation is an algorithm that calculates the gradient of the loss function with respect to each weight using the chain rule, allowing the network to learn from its mistakes and progressively improve its predictions. This process is similar to what we discussed earlier, except that now we must propagate this gradient backwards, starting from the loss function at the output layer and moving through all layers back to the input.

The algorithm is divided into two phases: the forward pass and the backward pass. In the forward pass, the input data moves through the network layer by layer. Each neuron computes a weighted sum of its inputs, adds a bias, and applies a nonlinear activation function. This process continues until a final output is generated, which is then compared to the actual target values to determine the loss, quantifying the prediction error of the network.

The backward pass, as the name suggests, proceeds from the output layer back to the input. Its purpose is to adjust the weights of the network to reduce the error on the input and output that were just processed. Essentially, it seeks to modify the network's parameters such that when the same input reappears, the network's output will be closer to the correct one. We have previously seen how this operates in the straightforward case of linear regression. The challenge here lies in calculating the partial derivatives of the loss with respect to each parameter. The relationship between most network parameters and the error is defined by function composition, as one layer’s output serves as another’s input. Thus, we must employ the chain rule to perform this computation, as illustrated for one parameter in the image below.

Once we have determined the partial derivatives of the loss function with respect to each network parameter, we can update the weights by taking a step towards steepest descent. This entails subtracting the partial derivative multiplied by a coefficient (the learning rate) from the original weight value, mirroring the approach we used for linear regression.
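The forward pass, backward pass, and weight update can be sketched on the smallest possible network: one hidden neuron with a ReLU activation feeding one linear output neuron, trained on a single example with a squared-error loss. The function names, the parameter values, the training example, and the learning rate are all assumptions made for illustration.

```python
def forward(params, x):
    """Forward pass: hidden neuron with ReLU, then a linear output neuron."""
    w1, b1, w2, b2 = params
    z1 = w1 * x + b1
    h = max(0.0, z1)
    y_hat = w2 * h + b2
    return z1, h, y_hat


def loss(params, x, y):
    """Squared error on a single example."""
    _, _, y_hat = forward(params, x)
    return (y - y_hat) ** 2


def backward(params, x, y):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    z1, h, y_hat = forward(params, x)
    d_y_hat = -2.0 * (y - y_hat)            # dL/dy_hat
    d_w2 = d_y_hat * h                      # y_hat = w2 * h + b2
    d_b2 = d_y_hat
    d_h = d_y_hat * w2                      # gradient flowing into the hidden neuron
    d_z1 = d_h * (1.0 if z1 > 0 else 0.0)   # derivative of ReLU
    d_w1 = d_z1 * x                         # z1 = w1 * x + b1
    d_b1 = d_z1
    return [d_w1, d_b1, d_w2, d_b2]


# Illustrative parameters [w1, b1, w2, b2] and one training example.
params = [0.5, 0.1, -1.5, 0.2]
x, y = 2.0, 1.0
grads = backward(params, x, y)

# One gradient descent step with a hypothetical learning rate.
alpha = 0.01
new_params = [p - alpha * g for p, g in zip(params, grads)]
print(loss(params, x, y), loss(new_params, x, y))
```

Each line of `backward` reuses a gradient computed on the line before it, which is why the computation has to run from the output toward the input. After the update, the loss on this example is lower than before.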

And that's it! While there is a vast amount more to explore in machine learning beyond what has been covered here, these core principles represent the foundation of **what it means for a machine to learn**.

## Conclusion

In this exploration of the essence of machine learning, we have examined the fundamental concepts that empower machines to learn from data. From the basics of linear regression to the intricacies of neural networks and back-propagation, we've witnessed how machines can iteratively enhance their performance on designated tasks.

To succinctly address the initial question: what we, as humans, define as “learning” for a machine is the enhancement of its task performance over time by adjusting internal parameters to minimize errors based on the data it processes, thereby improving its predictions or decisions. The depth of this field is profound, and numerous additional topics remain to be explored, such as the connection between human learning and machine learning, hyper-parameter tuning, underfitting, and overfitting. However, those discussions are reserved for another time.
