
How to explain gradient descent to your mom


Gradient descent is at the heart of most AI/ML methods. It sounds kind of weird and intimidating. Descent? Damn, I hope I don't have to skydive...πŸ˜’

Don't worry, you may still have to, but only if you want to. Even your 10-year-old nephew is capable of understanding this explanation.

Let's imagine you're trying to learn a new skill. Say you live in an ancient tribe, and your task is to determine people's weight from their height. Scales haven't been invented yet. I know it sounds ridiculous, but for simplicity's sake, let's assume the survival of the tribe depends on it. So, one of your tribesmen comes and names his height, and you, based on experience, have to determine his weight - say, in sacks of potatoes.

So your tribesman Andrei tells you: my height is 174 centimeters. Now, you don't like Andrei very much, so you tell him he weighs 100500 sacks of potatoes, and he runs away in tears... 😂 But actually, based on your experience, you're pretty sure this guy weighs about 70 kilograms, or 7 sacks of potatoes (10 kilograms each).

How do you know that? Well, you've met quite a few people with the same height and body type as Andrei, so you figure it must be about right. You already have experience.

Now we want to train a system that can do that. To do this, we need to simulate the prior experience of meeting people of different heights and learning their weight. We do this with training data, which is as follows:

[Table: approximate height-to-weight training data]

If we plot this data, it looks like this:

[Plot: the height-to-weight training data]
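For concreteness, here is a minimal sketch of such training data in Python. The numbers are made up for illustration; they roughly follow the pattern described in the article (157 cm is about 40 kg, 174 cm is 70 kg):

```python
import matplotlib.pyplot as plt

# Made-up training data: each pair is (height in cm, weight in kg).
training_data = [(157, 40), (162, 49), (168, 59),
                 (174, 70), (180, 81), (185, 89)]

heights = [h for h, _ in training_data]
weights = [w for _, w in training_data]
plt.scatter(heights, weights)
plt.xlabel("height, cm")
plt.ylabel("weight, kg")
plt.show()
```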

There is clearly a pattern here: the greater the height, the greater the weight. Okay, let's try to draw a line representing this pattern. Using this line, we can estimate the weight for any given height:

[Plot: the height-weight function, a line through the data]

For example, for a person with a height of about 157 cm, the weight is about 40 kilograms. The function that represents this line is essentially our AI model. Sounds suspiciously simple, right? It is.😎

All the complexity arises from complex dependencies. The height-weight dependency is simple and can be represented by a simple function. But in the real world, many dependencies (or patterns) are very complex and are represented by very complicated functions. Let's focus on ours for now.

A linear function like ours is represented by the following equation:

f(x) = a * x + b

or

y = a * x + b

where a is the slope and b is the intercept, the point of intersection with the y-axis. Even more definitions 🤯. But they are very simple:

[Diagram: the slope and the intercept of a line]

The intercept is the point where the line crosses the y-axis, and the slope measures how steeply the line rises relative to the x-axis. From simple 7th grade geometry, the slope here is A/B (or dy/dx), the ratio of the legs of the right triangle, as shown below:

[Diagram: the slope as the ratio dy/dx of a right triangle]

These two parameters, a and b, are the coefficients or parameters of our model. In large models like OpenAI's, there are billions of parameters and the underlying function is different, but the principle is the same.
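In code, this entire "model" is a one-line function; a and b are the parameters we will be learning (a sketch, not any particular library's API):

```python
def model(x, a, b):
    # Our entire "AI model": a line. a is the slope, b is the intercept.
    return a * x + b
```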

So, let's move on to the learning process. We decided that in our case we will use a simple line to represent the correlation between height and weight. That is, the function is linear. That is why it is also called linear regression. The "regression" part of the name is historical, I'll explain that at the end.

How do we find the correct slope and intercept? The same way we master any skill: we start with a random guess, observe the result, and correct the guess if it's too far from the truth. It's like a game of hot and cold. So, we place our line in a random spot, like this:

[Plot: a randomly placed initial line over the data points]

We measure the distances from the line to all our points; the sum of these distances is the total error for the current position of the line. It's called the loss. Our goal is to minimize the loss, that is, to fit the line so that the sum of the distances is as close to zero as possible. Ideally, the distance would be zero, i.e. all points lie on the line - but with our linear function this is not feasible. We just want it to be as small as possible.

Now we have formalized our learning problem - we want to find coefficients a and b such that the error (loss) is as small as possible, or in other words:

f(x) = a * x + b

sum(distances_to_line(a, b)) -> 0

The formula for the distance D between two points (x1, y1) and (x2, y2) is:

D = sqrt((x2 - x1)² + (y2 - y1)²)

Let us randomly choose a = 0.3125 and b = -14.375.

With this in mind, let's take a point from our training data, say 174 cm, 70 kg. Let's compute our model's prediction by substituting x = 174 into y = a * x + b, with the a and b we chose randomly above:

y = 174 * 0.3125 - 14.375 = 40. So the calculated point is (174 cm, 40 kg).

The distance is D = sqrt((174-174)² + (70-40)²) = 30. So we're wrong by 30 kg. Not the best model; we need to find better-fitting a and b.
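As a quick Python check of the arithmetic above (using the randomly chosen a and b):

```python
import math

a, b = 0.3125, -14.375    # the random guess from above

x, y_expected = 174, 70   # training point: 174 cm, 70 kg
y_estimated = a * x + b   # 174 * 0.3125 - 14.375 = 40.0

# Distance between (174, 70) and (174, 40); the x terms cancel out.
D = math.sqrt((x - x) ** 2 + (y_expected - y_estimated) ** 2)
print(y_estimated, D)     # 40.0 30.0
```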

In the above formula, we don't actually need the square root. If you think about it, we use the distance only to estimate the error. We can just as well use:

error = (estimated - expected)²

where the estimated value is computed by our function and the expected value is taken from the training set. Less computation, and it works just as well. This is what is commonly used; it is called the least squares method.
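A sketch of the same idea in code:

```python
def squared_error(estimated, expected):
    # The distance calculation without the square root:
    # cheaper to compute and just as good for comparing fits.
    return (estimated - expected) ** 2

print(squared_error(40, 70))  # 900, instead of the distance of 30
```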

If we take all the points one by one and sum the distances, we get the total error. Now we change a and b slightly and calculate the error again. If it comes out smaller than before, we are moving in the right direction. When do we stop changing the coefficients, that is, how do we know the a and b we found are good enough? That decision is up to us, but I would say: if the error for each point is small, that's good enough. If our model guesses 70.2 kg for a height of 174 cm, that's very close to 70 kg. If the error is that small for every point, we don't need to improve the model further.

Thus, if:

average error = (sum of errors) / (number of points) ≤ 0.2

we stop training.
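One reasonable reading of this stopping rule in code, using the squared per-point error from the least-squares discussion above (the 0.2 threshold is the arbitrary choice from the text):

```python
def average_error(data, a, b):
    # Average squared error per point (least squares, as above).
    return sum((a * h + b - w) ** 2 for h, w in data) / len(data)

THRESHOLD = 0.2  # "good enough", chosen by us

def good_enough(data, a, b):
    return average_error(data, a, b) <= THRESHOLD
```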

Now we just need a way to move the line, changing a and b, in a direction that minimizes the total error! If we changed them randomly, it would take too long. Finally, we come to gradient descent. Such a long way to get here!

Gradient descent

Gradient descent is a method for finding the minimum of a function. This is exactly what we need. How convenient! 😂

In our case, the total error that we want to minimize is a function of a and b, that is:

total_error = f(a, b) = sum(distances_to_line(a, b)) = distance((x1, y1_train), (x1, y1)) + distance((x2, y2_train), (x2, y2)) + ... + distance((xn, yn_train), (xn, yn))

If we try a few more random values of a and leave b constant for now, we see that the total error function looks as follows:

[Plot: the error for randomly chosen values of the coefficient a]
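We can sketch this scan in code, reusing the made-up training data from earlier; the printed errors dip and then rise again as a passes the bottom of the valley:

```python
training_data = [(157, 40), (162, 49), (168, 59),
                 (174, 70), (180, 81), (185, 89)]

def total_error(data, a, b):
    # Sum of squared errors over every training point: our loss.
    return sum((a * h + b - w) ** 2 for h, w in data)

b = -14.375  # hold the intercept at our random guess
for a in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(a, round(total_error(training_data, a, b)))
```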

For certain values of a, the total error is smaller. A gradient is a slope, as in a valley, and since we need a minimum, we need to move down the slope. In other words, we need to go downhill. We are at some random point that we chose at the beginning, and the minimum we are looking for is the treasure at the bottom of the valley, and to find it, we need to go downhill.

[Image: strive for the minimum like Buddhists]

The thing is, we don't actually know where the treasure is at the beginning - otherwise we'd just pick that point instead of a random one 🤔. So it looks something like this:

[Image: and in the beginning, nothing is clear...]

We know we are on a slope somewhere, we know the treasure is somewhere at the bottom, but we don't know where that bottom is. What we do know is that we need to go down and look around for treasure along the way. So we slowly descend and look. That's what gradient descent is all about.

It works like this: we take the derivative of the error function at our random point. The derivative is the slope of the tangent line to the function's graph at that point. More simply: its sign indicates whether the function is rising or falling, and its magnitude tells us how steep the slope is. At a minimum, the derivative is zero, because there is no slope there (like at the bottom of a valley - the ground is flat):

[Plot: the tangent to the graph of a function shows the direction and steepness of the slope]
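Even without the school formula, a derivative is easy to estimate numerically: nudge the parameter a tiny bit and see how the error responds. A sketch, reusing the loss from the earlier snippets:

```python
def total_error(data, a, b):
    return sum((a * h + b - w) ** 2 for h, w in data)

def d_error_da(data, a, b, eps=1e-6):
    # Finite-difference estimate of the derivative of the loss with
    # respect to a. The sign says which way is uphill; the
    # magnitude says how steep the slope is there.
    return (total_error(data, a + eps, b)
            - total_error(data, a - eps, b)) / (2 * eps)
```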

So we need to move in the direction of decreasing slope. We subtract the derivative, multiplied by a step size (called the learning rate), from our coefficient, calculate the new total error, and repeat this process until the total error is below our threshold. The learning rate is an arbitrary number picked by trial and error, but it's usually small enough that we don't jump too far, or else this might happen:

[Image: jumped right over the minimum 😂]

This is the process of gradient descent:

1. Take the derivative of the error function.
2. Multiply it by the learning rate.
3. Subtract the result from the parameter we are trying to optimize.
4. Use the updated parameter to calculate the new error.
5. If the desired accuracy threshold is not reached, repeat the process.
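Putting it all together, here is a minimal sketch of the whole training loop in Python, under the same made-up data and an assumed learning rate found by trial and error. One practical tweak: plain gradient descent is numerically touchy when the inputs are large (heights around 170), so the heights are centered first - a standard trick that doesn't change the idea:

```python
training_data = [(157, 40), (162, 49), (168, 59),
                 (174, 70), (180, 81), (185, 89)]  # made-up points

# Center the heights so that plain gradient descent stays stable.
mean_h = sum(h for h, _ in training_data) / len(training_data)
data = [(h - mean_h, w) for h, w in training_data]

def total_error(data, a, b):
    return sum((a * x + b - y) ** 2 for x, y in data)

def grad(data, a, b, eps=1e-6):
    # Finite-difference estimates of both partial derivatives.
    da = (total_error(data, a + eps, b) - total_error(data, a - eps, b)) / (2 * eps)
    db = (total_error(data, a, b + eps) - total_error(data, a, b - eps)) / (2 * eps)
    return da, db

a, b = 0.0, 0.0          # starting guess
learning_rate = 0.001    # small step size, found by trial and error
for step in range(10_000):
    da, db = grad(data, a, b)
    a -= learning_rate * da   # step *against* the slope: downhill
    b -= learning_rate * db
    if total_error(data, a, b) / len(data) <= 0.2:
        break                 # good enough, stop training

print(f"a = {a:.3f}, b = {b:.3f}")
# Sanity check: the prediction for 174 cm should be close to 70 kg.
print(round(a * (174 - mean_h) + b, 1))
```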

I won't walk through how to calculate the derivative - it's a basic school formula, and it's not that important as long as you understand what a derivative is. The derivative shows how steep the function is at a point and in which direction it is heading:

Rate of change: Geometrically, the derivative shows how steep the graph is at any point. A large absolute value of the derivative indicates a steep slope (rapid rise or fall), and a small absolute value indicates a gentle slope (slow rise or fall).

Positive or negative slope: If the derivative at a point is positive, the function is increasing at that point, that is, the slope of the tangent line is upward. Conversely, if the derivative is negative, the function is decreasing and the slope of the tangent line is downward.

Zero derivative: When the derivative at a point is zero, it means that the slope of the tangent line is horizontal. Such points are local maxima, local minima, or inflection points of the function.

Notes on the features of gradient descent

[Plot: how gradient descent behaves in the presence of local minima]

It should be noted that gradient descent does not guarantee finding the global minimum. It may find a local minimum and stop there if the threshold conditions are met. This depends on factors such as the shape of the function, the randomly chosen initial parameters, the learning rate, and so on.
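A tiny illustration of this, using an assumed toy function with two valleys: the same descent procedure lands in different minima depending on where it starts:

```python
def f(x):
    # A toy function with two valleys: a shallow local minimum
    # near x ~ 1.13 and a deeper global one near x ~ -1.30.
    return x**4 - 3 * x**2 + x

def df(x, eps=1e-6):
    # Numerical derivative of f at x.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def descend(x, learning_rate=0.01, steps=5_000):
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x

print(descend(x=+2.0))  # lands in the shallow local minimum
print(descend(x=-2.0))  # lands in the deeper global minimum
```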

Historical background on regression

As promised, here are the historical roots of the word "regression" in "linear regression". It was first used by Sir Francis Galton, an English researcher, in the context of his studies of heredity. In the late 19th century, Galton studied the relationship between the heights of parents and their children. He observed a phenomenon he called "regression to mediocrity" (now more commonly referred to as "regression to the mean").

Galton observed that while tall parents tended to have children of above average height, these children were generally not as tall as their parents. Similarly, children of short parents tended to be below average height, but not as short as their parents. In other words, the height of children tended to "regress" or move toward the average height of the population.

Galton's work was later mathematically formalized by Karl Pearson and other scientists who developed the least squares method for finding the best line through a set of data points. This method was used to quantify the strength and nature of the relationship between two variables, a process they called "regression analysis."

Over time, the term "regression" became associated with statistical techniques used to model and analyze the relationship between variables.

Translated from here.