Adaptive learning rate schedules

Course: Deep Learning with TensorFlow / Chapter: Gradient descent and learning rates / Lesson 6


Brief description

  • Study time: 0 minutes
  • Level: Very hard


Video file

English transcript of the lesson

As with other things in deep learning, we can dig deeper into learning rate schedules.

Instead of using a simple rule to adjust the learning rate, we can draw on advanced academic research on this topic.

We will consider two types of adaptive learning rate schedules: AdaGrad and RMSprop. They build on each other, so it won't be too much to take in.

Be prepared to see many formulas, although it is unnecessary to learn them by heart in TensorFlow.

You can simply say "use AdaGrad" and the magic will happen as usual.
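In Keras, that "magic" really is a single line: you name the optimizer when compiling the model. A minimal sketch, assuming `tf.keras` (the toy one-layer model below is hypothetical; only the optimizer choice matters):

```python
import tensorflow as tf

# Hypothetical toy model with 4 input features and 1 output.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Swapping in AdaGrad (or RMSprop) is just a different optimizer object here.
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
              loss="mse")
```

All the per-weight bookkeeping described in this lecture happens inside the optimizer object.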

We recommend you go through the entire video, as what we will see is cutting-edge machine learning.

When you practice machine learning, you must be able to choose the best methods for training your models.

All right let’s begin.

First, we have AdaGrad. AdaGrad is short for "adaptive gradient algorithm".

It was proposed in 2011, so it is relatively new.

It dynamically varies the learning rate at each update and for every weight individually.

So, the original update rule, stated in words: the change in w equals minus the learning rate (eta) times the partial derivative of the loss with respect to w,

Δw = −η · ∂L/∂w
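To make the rule concrete, here is a tiny sketch of plain gradient descent on a hypothetical one-dimensional loss L(w) = w², whose derivative is 2w:

```python
eta = 0.1   # learning rate (eta)
w = 5.0     # initial weight

for _ in range(50):
    grad = 2.0 * w    # dL/dw for the toy loss L(w) = w^2
    w -= eta * grad   # Delta w = -eta * dL/dw
```

After the loop, w has moved very close to the minimum at w = 0.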

Nothing new so far. When we consider the adaptive gradient algorithm, the change in the weight is given by the same expression, but in addition it is divided by the square root of G at time t, plus epsilon:

Δw(t) = −( η / √(G(t) + ε) ) · ∂L/∂w(t)

So, what is G here? G is the adaptation magic. G at epoch t equals G at epoch t − 1 plus the square of the gradient at epoch t:

G(t) = G(t−1) + ( ∂L/∂w(t) )²

We begin at G(0) = 0.

At each step, G increases, because we are adding two non-negative numbers.

Thus G is a monotonically increasing function.

Since we are dividing the learning rate eta by a monotonically increasing function, eta divided by that function is obviously monotonically decreasing.

If you are wondering, the epsilon is some small number we need to put there because, if G were zero, we would not be able to perform the division.
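Putting the pieces together, here is an illustrative AdaGrad update for a single weight, again on the hypothetical toy loss L(w) = w² (the names eta, G, and eps follow the lecture's notation):

```python
import numpy as np

eta, eps = 0.1, 1e-8
w, G = 5.0, 0.0
effective_lrs = []

for t in range(20):
    grad = 2.0 * w                  # dL/dw for the toy loss L(w) = w^2
    G = G + grad ** 2               # G(t) = G(t-1) + gradient^2, starting from G = 0
    step = eta / np.sqrt(G + eps)   # effective learning rate, shrinks as G grows
    effective_lrs.append(step)
    w = w - step * grad             # Delta w = -(eta / sqrt(G + eps)) * dL/dw
```

Because G only ever grows, each entry of `effective_lrs` is no larger than the one before it, which is exactly the monotonic decrease described above.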

OK, that was AdaGrad.

It is basically a smart, adaptive learning rate scheduler. "Adaptive" stands for the fact that the effective learning rate is based on the training itself.

It is not a pre-set learning schedule like the exponential one, where all the eta values are calculated regardless of the training process. Another very important point is that the adaptation is per weight.

This means every individual weight in the whole network keeps track of its own G function to normalize its own steps.

It's an important observation, as different weights do not reach their optimal values simultaneously.

The second method is RMSprop, or root mean square propagation.

It is very similar to AdaGrad: the update rule is defined in the same way, but the G function is a bit different:

G(t) = β · G(t−1) + (1 − β) · ( ∂L/∂w(t) )²

The two terms are assigned weights beta and one minus beta, respectively, effectively keeping track of a moving average of the G values. This new hyperparameter beta is a number between 0 and 1; the value 0.9 is very typical.

We had a similar situation with the alpha in the momentum section.

The implication here is that the G function is no longer monotonically increasing.

Hence, eta divided by the square root of G is not monotonically decreasing.

Empirical evidence shows that in this way the rate adapts much more efficiently. OK.
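A quick numerical sketch of the RMSprop accumulator, using a hypothetical gradient sequence, shows G rising while gradients are large and then falling once they shrink, so it is indeed not monotonic:

```python
beta = 0.9                             # typical value from the lecture
grads = [10.0, 10.0, 0.1, 0.1, 0.1]    # hypothetical gradient sequence
G = 0.0
G_history = []

for g in grads:
    G = beta * G + (1 - beta) * g ** 2   # moving average of squared gradients
    G_history.append(G)
```

With AdaGrad's plain sum, G could only ever grow; here the 1 − β weight on new squared gradients lets old, large gradients fade away, so the effective learning rate can recover.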

Both methods are very logical and smart.

However there is a third method based on these two which is superior.

We will explore it in our next lecture.

Thanks for watching.
