
### English Text of the Lesson

Hi and welcome back.

This is our section about optimization, which in this context refers to the algorithms we all use to

vary our model's parameters.

So far, we've seen only the gradient descent, and now it is time to discuss improvements that will lead

to enhanced algorithms.

Most of what we have learned is invaluable from a theoretical viewpoint but slow when it comes to practical

execution.

However there are simple steps to take to turn things around.

The gradient descent, GD for short, iterates over the whole training set before updating

the weights.

Each update is very small.

That's due to the whole concept of the gradient descent, which is driven by the small value of the learning rate.

As you remember, we couldn't use a value that is too high, as this jeopardises the algorithm.

Therefore we have many epochs over many points using a very small learning rate.

This is slow.

It's not descending.

It's basically crawling down the gradient.
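The "many epochs, tiny updates" point above can be sketched in a few lines. This is a minimal illustration on a made-up one-parameter loss, not the lesson's own model; the learning rate and epoch count are illustrative choices.

```python
# Plain (batch) gradient descent on a toy loss L(w) = (w - 3)**2,
# whose minimum is at w = 3. The learning rate eta is deliberately
# small, so each update barely moves the weight -- hence the need
# for many epochs, exactly as the lesson describes.

def gradient(w):
    # dL/dw for L(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0      # initial weight
eta = 0.01   # a small learning rate
for epoch in range(1000):
    w = w - eta * gradient(w)  # one tiny update per full pass

print(round(w, 4))  # w has crept close to the minimum at 3
```

With a thousand passes the weight finally settles near the minimum; raising `eta` too much would make the updates overshoot instead, which is the jeopardy mentioned above.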

Fortunately for us there is a simple solution to the problem.

It's a similar algorithm called the SGD, or the stochastic gradient descent.

It works in the exact same way but instead of updating the weights once per epoch it updates them in

real time inside a single epoch.

Let’s elaborate on that.

The stochastic gradient descent is closely related to the concept of batching. Batching is the process

of splitting the data into n batches, often called mini-batches.

We update the weights after every batch instead of every epoch.

Let’s say we have 10000 training points.

If we choose a batch size of 1000 then we have 10 batches per epoch.

So, for every full iteration over the training data set, we would update the weights 10 times instead

of once.
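The arithmetic above is simple enough to verify directly; the variable names here are illustrative:

```python
# Sanity check of the lesson's example: 10,000 training points split
# into batches of 1,000 gives 10 batches, so the weights are updated
# 10 times per epoch instead of once.

n_points = 10_000
batch_size = 1_000

updates_per_epoch = n_points // batch_size
print(updates_per_epoch)  # 10
```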

This is by no means a new method.

It is the same as the gradient descent but much faster.

As all good things go, the SGD comes at a cost.

It approximates things a bit.

So we lose a bit of accuracy but the tradeoff is worth it.

That is confirmed by the fact that virtually everyone in the industry uses the stochastic gradient descent.

So why does this speed up the algorithm so drastically?

Was it worth talking about it for three minutes?

There are a couple of reasons, but one of the finest is related to hardware. Splitting the training set

into batches allows the CPU cores or the GPU cores to train on different batches in parallel.

This gives an incredible speed boost which is why practitioners rely on it.

OK one last thing.

Actually stochastic gradient descent is when you update after every input.

So your batch size is one.

What we have been talking about was technically called mini-batch

gradient descent. However, more often than not, practitioners refer to the mini-batch GD

as stochastic GD.

If you are wondering, the plain gradient descent we talked about at the beginning of the course is called

batch GD, as it has a single batch.
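The whole lesson can be condensed into one short loop. This is a minimal sketch of mini-batch gradient descent on a synthetic 1-D linear model, not the course's own implementation; the data, learning rate, batch size, and epoch count are all illustrative assumptions. Note how setting `batch_size = 1` would give "true" stochastic GD, while `batch_size = len(x)` recovers plain batch GD.

```python
# Mini-batch gradient descent for a 1-D linear model y = w * x,
# fit to synthetic data generated with the true weight w = 2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = 2.0 * x                  # targets from the true weight

w = 0.0                      # initial weight
eta = 0.1                    # learning rate
batch_size = 100             # 1000 points / 100 => 10 updates per epoch

for epoch in range(20):
    perm = rng.permutation(len(x))            # reshuffle every epoch
    for start in range(0, len(x), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        # d/dw of the mean squared error on this mini-batch only
        grad = 2 * np.mean((w * xb - yb) * xb)
        w -= eta * grad                       # one update per mini-batch

print(round(w, 3))  # close to the true weight 2
```

Each epoch performs 10 weight updates instead of one, which is exactly the speed-up discussed above; the per-batch gradient is only an approximation of the full gradient, which is the small accuracy trade-off the lesson mentions.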

Okay perfect.

Now we can confidently close this topic.

Thanks for watching.
