Learning rate schedules
English transcript of the lesson
Hi again. We introduced the concept of hyperparameters 20 or 30 lessons ago.
I'm sure you remember: parameters are the weights and the biases, while hyperparameters are things like the width
and depth of the algorithm, that is, the number of hidden units and the number of hidden layers. It is up to us
to choose their values.
We mentioned we should play around with hyperparameters to find the best ones for our algorithm and
the data at hand. This lesson will focus on another hyperparameter: the learning rate, eta (η).
What do we know so far?
It must be small enough that we gently descend through the loss function, instead of oscillating wildly
around the minimum and never reaching it, or diverging to infinity.
It also has to be big enough so that the optimization takes place in a reasonable amount of time.
In the Excel file we provided on gradient descent for one parameter, you can play around with the learning rate.
Moreover, in one exercise coming with the minimal example, you had the same chance, but for the linear
model. OK, we're doing science here.
So these phrases, "small enough" and "big enough", are too vague. A smart way to deal with choosing the proper
learning rate is adopting a so-called learning rate schedule. Learning rate schedules get the best of
"small enough" and "big enough".
The rationale is the following.
We start from a high initial learning rate.
This leads to faster training.
In this way we approach the minimum faster. Then we want to lower the rate gradually as training goes on.
Around the end of the training,
we want a very small learning rate,
so we get an accurate solution.
How are learning rate schedules implemented in practice?
There are two basic ways to do that.
The simplest one is setting a pre-determined piecewise constant learning rate.
For example, we can use a learning rate of 0.1 for the first five epochs, then 0.01 for the next five,
and 0.001 until the end.
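As a quick sketch (the helper name `piecewise_lr` and the bare training loop are my own illustration, not the course's code), the schedule just described simply maps the epoch number to one of three constant rates:

```python
def piecewise_lr(epoch):
    # Pre-determined piecewise constant schedule from the lesson:
    # 0.1 for the first five epochs, 0.01 for the next five, 0.001 afterwards.
    if epoch < 5:
        return 0.1
    elif epoch < 10:
        return 0.01
    return 0.001

# In a training loop we would look up the rate at the start of each epoch:
for epoch in range(12):
    eta = piecewise_lr(epoch)
    # ... perform one epoch of gradient descent using learning rate eta ...
```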
This causes the loss function to converge much faster to the minimum and will give us an accurate result.
However, considering what we've learned so far, this seems too simple to be the norm, right?
Indeed, it is crude, as it requires us to know approximately how many epochs it will take the loss to
converge. Still, beginners may want to use it, as it makes a great difference compared to a constant
learning rate. OK, a second, much smarter approach is the exponential schedule. The exponential schedule
is a much better alternative, as it smoothly reduces, or decays, the learning rate.
We usually start from a high value, such as eta-zero (η₀) equal to 0.1.
Then we update the learning rate at each epoch using the rule in this expression, where n is the current
epoch and c is a constant.
Here's the sequence of learning rates that would follow for a c equal to 20.
There is no rule for the constant c, but usually it should be of the same order of magnitude as the number
of epochs needed to minimize the loss.
For example, if we need 100 epochs, values of c from 50 to 500 are all fine.
If we need 1000, values from 500 to 5000 are alright.
Usually, we'll need many fewer epochs than that,
so a value of c around 20 or 30 works well.
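The exponential schedule can be sketched in a few lines of Python. A hedged assumption here: the rule shown on screen is taken to be η = η₀ · e^(−n/c), the standard exponential decay consistent with the lesson's description; the function name `exponential_lr` is mine, not the course's.

```python
import math

def exponential_lr(n, eta0=0.1, c=20):
    # Assumed rule from the slide: eta = eta0 * exp(-n / c),
    # where n is the current epoch and c is a constant.
    return eta0 * math.exp(-n / c)

# First few learning rates for eta0 = 0.1 and c = 20,
# mirroring the sequence shown in the lesson:
for n in range(5):
    print(f"epoch {n}: eta = {exponential_lr(n):.4f}")
```

Note that the rate shrinks smoothly every epoch, rather than dropping in jumps as in the piecewise constant schedule.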
However, from my personal experience, the exact value of c doesn't matter as much.
What makes a big difference is the presence of the learning rate schedule itself.
Still, c is also a hyperparameter.
As with all hyperparameters, it may make a difference for your particular problem.
You can try different values of c and see if this affects the results you obtain.
It's worth pointing out that all those cool new improvements, such as learning rate schedules and momentum,
come at a price: we pay it by increasing the number of hyperparameters for which we must pick values.
Generally, the rule-of-thumb values work well, but bear in mind that for some specific problem of yours,
they may not.
It's always worth it to explore several hyperparameter values before sticking with one. OK.
This will do for now.
Thanks for watching.