## Lesson 11

### Brief overview

• Study time: 0 minutes
• Level: very hard


### English lesson text

We have reached the last piece of the puzzle before we can start building our first machine learning algorithm.

So far we have learned, at least conceptually, how to input data into a model and how to measure, through the objective function, how close the outputs we obtain are to the targets.

However, the actual optimization happens when the optimization algorithm varies the model's parameters until the loss function has been minimized.

In the context of the linear model, this implies varying W and B.

OK, the simplest and most fundamental optimization algorithm is the gradient descent.

I would like to remind you that the gradient is the multivariate generalization of the derivative concept.

Let's first consider a non-machine-learning example to understand the logic behind the gradient descent.

Here's a function: f(x) = 5x² + 3x - 4.

Our goal is to find the minimum of this function using the gradient descent methodology.

The first step is to find the first derivative of the function.

In our case, it is f'(x) = 10x + 3.

The second step is to choose an arbitrary starting point, for example x₀ = 4 ("x naught" is the proper way to say x₀).

Then we calculate the next number, x₁, following the update rule xᵢ₊₁ = xᵢ - η · f'(xᵢ).

That is, x₁ = 4 - η × (10 × 4 + 3), or 4 - η × 43.
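As a quick illustration (a sketch, not the course's own code), the first update step can be computed like this; the function, its derivative, and x₀ = 4 come from the lesson, while the value of η here is just a placeholder:

```python
# Sketch of one gradient-descent step for f(x) = 5x^2 + 3x - 4.
def f_prime(x):
    """First derivative of f: f'(x) = 10x + 3."""
    return 10 * x + 3

eta = 0.01                    # placeholder learning rate (explained below)
x0 = 4                        # arbitrary starting point from the lesson
x1 = x0 - eta * f_prime(x0)   # update rule: x1 = x0 - eta * f'(x0)
print(x1)                     # 4 - 0.01 * 43, i.e. approximately 3.57
```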

So, what is η (eta)? This is the learning rate.

It is the rate at which the machine learning algorithm forgets old beliefs for new ones.

How we choose the learning rate for each case will become clear by the end of this lecture.

The concept of η will become clearer as we go. Using the update rule, we can find x₂, x₃, and so on.

After conducting the update operation long enough, the values will eventually stop updating.

That is the point at which we know we have reached the minimum of the function.

This is because the first derivative of the function is zero when we have reached the minimum.

So the update rule xᵢ₊₁ = xᵢ - η · f'(xᵢ) will become xᵢ₊₁ = xᵢ - 0, or xᵢ₊₁ = xᵢ.

Therefore, the update rule will no longer update.

Let’s illustrate this with an example.

Let's take an η of 0.01.

We start descending: x₁ = 3.57, x₂ = 3.18, and so on. Around the eighty-fifth iteration, we see that our sequence doesn't change any more.

It has converged to -0.3.

Once the minimum is reached, all subsequent values are equal to it, since our update rule has become xᵢ₊₁ = xᵢ - 0.

Graphically, the gradient descent looks like this: we start from an arbitrary point and descend to the minimum.
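The whole descent is easy to sketch in code; the following is a minimal reproduction of the η = 0.01 example (not the course's own notebook):

```python
# Gradient descent on f(x) = 5x^2 + 3x - 4 with eta = 0.01.
def f_prime(x):
    return 10 * x + 3

eta = 0.01
x = 4.0                          # arbitrary starting point x0
for i in range(85):              # around 85 iterations, as in the lesson
    x = x - eta * f_prime(x)     # x_{i+1} = x_i - eta * f'(x_i)
print(round(x, 3))               # very close to the minimum at -0.3
```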

All right. The speed of minimization depends on η.

Let's try with an η of 0.1: we converge to the minimum of -0.3 after the first iteration.

Now, knowing the minimum is -0.3, let's see an η of 0.001. This step is so small that we need approximately 900 iterations before we reach the desired value.

We descend to the same extremum, but in a much slower manner.
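To compare the speeds directly, a small helper (hypothetical, with an arbitrarily chosen tolerance) can count how many iterations each learning rate needs before the updates become negligible:

```python
# Count iterations until consecutive updates become negligible.
def f_prime(x):
    return 10 * x + 3

def iterations_to_converge(eta, x=4.0, tol=1e-4, max_iter=100_000):
    for i in range(max_iter):
        x_new = x - eta * f_prime(x)
        if abs(x_new - x) < tol:     # updates have effectively stopped
            return i + 1
        x = x_new
    return max_iter

for eta in (0.001, 0.01, 0.1):
    print(eta, iterations_to_converge(eta))  # smaller eta -> more iterations
```

The exact counts depend on the tolerance chosen, but the ordering matches the lesson: the smaller the learning rate, the more iterations are needed.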

Finally, I'll try with an η of 0.2. We obtain a sequence of 4 and -4.6 until infinity: no matter how many iterations we execute, our sequence will never reach -0.3.

We already know -0.3 is the desired value, but if we didn't, we would be deceived.

This situation is called oscillation: we bounce around the minimum value, but we never reach it.

We could take 4 or -4.6 from the algorithm, but neither would be its true minimum. Graphically, we are stuck at these two points, never reaching the minimum.
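The oscillation described above can be reproduced with the same loop; a sketch:

```python
# With eta = 0.2 the step overshoots, so the sequence bounces between
# 4 and -4.6 forever and never reaches the minimum at -0.3.
def f_prime(x):
    return 10 * x + 3

eta = 0.2
x = 4.0
seq = []
for _ in range(6):
    x = x - eta * f_prime(x)
    seq.append(round(x, 1))
print(seq)   # [-4.6, 4.0, -4.6, 4.0, -4.6, 4.0]
```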

Now that we have seen different learning rates and their performance, let's state the rule generally: we want the learning rate to be high enough that we can reach the closest minimum after repeating the operation a reasonable number of times. So perhaps 0.001 was too small for this function.

At the same time, we want it to be low enough that we are sure we reach the minimum and don't oscillate around it, as in the case where we chose an η of 0.2.

In the sections in which we study deep learning, we will discuss a few smarter techniques for choosing the right rate.

All right.

There are several key takeaways from this lesson.

First, using gradient descent, we can find the minimum value of a function through a trial-and-error method.

That's just how computers think.

Second, there is an update rule that allows us to cherry-pick the trials so we can reach the minimum faster: with a good update rule, each consecutive trial is better than the previous one.

Third, we must think about the learning rate, which has to be high enough so we don't iterate forever, and low enough so we don't oscillate forever.

Finally, once we have converged, we should stop updating, or, as we will see in the coding example, we should break the loop.

One way to know we have converged is when the difference between the term at place i + 1 and the term at place i is smaller than some tiny number, for example 0.001.

Once again, that's a topic we'll see in more detail later.
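A minimal sketch of that stopping rule, assuming the same function and a 0.001 tolerance (the coding example in the course may differ):

```python
# Break the loop once consecutive iterates differ by less than a tolerance.
def f_prime(x):
    return 10 * x + 3

eta = 0.01
x = 4.0
for i in range(10_000):
    x_new = x - eta * f_prime(x)
    if abs(x_new - x) < 0.001:   # converged: updates are negligible
        break
    x = x_new
print(round(x_new, 2), "after", i, "iterations")  # close to -0.3, far fewer than 10,000 iterations
```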

In the course resources section, we encourage you to play around with the learning rate or the arbitrarily chosen number x₀ and see what happens.

This will give you a good intuition about the learning rate, which is central to teaching the algorithm.

In the next lesson, we will generalize this concept to gradient descent with n parameters.