# Momentum

### English text of the lesson

As we promised, in this lecture we will explore additional ways to improve our chances of reaching the global minimum rather than getting stuck in a local one.

Gradient descent and stochastic gradient descent are good ways to train our models.

We need not change them.

We should simply extend them.

The simplest extension we should apply is called momentum.

What is momentum?

An easy way to explain momentum is through a physics analogy.

Imagine the gradient descent as rolling a ball down a hill.

The faster the ball rolls, the higher its momentum.

A small dip in the grass would not stop the ball; it would rather continue rolling until it reaches a flat surface that it cannot roll out of.

The small dip is the local minimum, while the big valley is the global minimum.

If there wasn’t any momentum, the ball would never reach the desired final destination.

It would have rolled with some non-increasing speed and would have stopped in the dip.

The momentum accounts for the fact that the ball is actually going downhill.

Now, in our gradient descent framework so far, we didn’t consider momentum.

There is no reason to ignore it.

Therefore, we created algorithms that will likely fall into a dip, if there is one, instead of descending to the optimal solution.

So how do we add momentum to the algorithm?

The rule so far was: w equals w minus eta times the gradient of the loss with respect to w, that is, $w \leftarrow w - \eta \, \nabla_w L(w)$.

Including momentum, we will consider the speed with which we’ve been descending so far.
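As a reference point before momentum is added, the plain update rule can be sketched in a few lines of Python. The loss function here is a hypothetical one-dimensional example chosen for illustration (it is not from the lecture): it has a shallow dip near w ≈ 0.93 and a deeper valley near w ≈ -1.06. The starting point, `eta`, and the step count are likewise assumptions.

```python
def loss(w):
    # Hypothetical 1-D loss (an assumption for illustration):
    # shallow dip near w ~ 0.93, deeper valley near w ~ -1.06.
    return w**4 - 2 * w**2 + 0.5 * w


def grad(w):
    # Derivative of the loss above with respect to w.
    return 4 * w**3 - 4 * w + 0.5


def gradient_descent(w, eta=0.01, steps=1000):
    # Plain rule: w <- w - eta * gradient of the loss at w.
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


w_final = gradient_descent(w=2.0)
print(f"plain gradient descent stops at w = {w_final:.3f}, loss = {loss(w_final):.3f}")
```

Starting from w = 2.0 with these settings, the run settles in the shallow dip rather than the deeper valley, which is the behaviour the ball analogy warns about.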

For instance, if the ball is rolling fast, the momentum is high; otherwise, the momentum is low.

The best way to find out how fast the ball rolls is to check how fast it rolled a moment ago.

That’s also the method adopted in machine learning.

We add the previous update step to the formula, multiplied by some coefficient.

Otherwise, we would assign the same importance to the current update and the previous one.
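The lecture states this update only in words. One common way to write it out is shown below, where the step index $t$ and the symbol $\Delta w$ for an update step are notation assumed here, $\eta$ is the learning rate, $L$ is the loss, and $\alpha$ is the coefficient applied to the previous update:

$$
\Delta w_t = -\eta \, \nabla_w L(w_t) + \alpha \, \Delta w_{t-1}, \qquad w_{t+1} = w_t + \Delta w_t
$$

With $\alpha = 0$ this reduces to the rule without momentum, and with $\alpha = 1$ the previous update would count as much as the current one.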

Usually, we use an alpha of 0.9 to adjust the previous update. Alpha is a hyperparameter, and we can play around with it for better results.

0.9 is the conventional rule of thumb, and this is what I also use.
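To make the effect of alpha concrete, here is a minimal Python sketch that runs the same descent loop twice, once with alpha = 0 (no momentum) and once with alpha = 0.9. The one-dimensional loss, the starting point, the learning rate, and the function name `descend` are all assumptions made for illustration and do not come from the lecture.

```python
def loss(w):
    # Hypothetical 1-D loss (an assumption for illustration):
    # shallow dip near w ~ 0.93, deeper valley near w ~ -1.06.
    return w**4 - 2 * w**2 + 0.5 * w


def grad(w):
    # Derivative of the loss above with respect to w.
    return 4 * w**3 - 4 * w + 0.5


def descend(w, eta=0.01, alpha=0.9, steps=1000):
    update = 0.0  # previous update step; nothing has moved before step one
    for _ in range(steps):
        # Current gradient step plus alpha times the previous update.
        update = -eta * grad(w) + alpha * update
        w = w + update
    return w


w_plain = descend(2.0, alpha=0.0)     # alpha = 0 is plain gradient descent
w_momentum = descend(2.0, alpha=0.9)  # the rule-of-thumb value

print(f"without momentum: w = {w_plain:.3f}, loss = {loss(w_plain):.3f}")
print(f"with momentum:    w = {w_momentum:.3f}, loss = {loss(w_momentum):.3f}")
```

In this particular setup the run without momentum settles in the shallow dip, while the run with alpha = 0.9 carries through it and ends in the deeper valley. That outcome depends on the starting point, eta, and alpha, so momentum should be read as improving the odds rather than guaranteeing the global minimum.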

All right, we considered momentum. In the next lesson, we will look at the learning rate, one of the key hyperparameters of the algorithm.

Thanks for watching.
