Stochastic gradient descent
Hi and welcome back.
This is our section about optimization, which in this context refers to the algorithms we use to
vary our model's parameters.
So far we've seen only the gradient descent, and now it is time to discuss improvements that will lead
to enhanced algorithms.
Most of what we have learned is invaluable from a theoretical viewpoint, but slow when it comes to practical application.
However, there are simple steps we can take to turn things around.
Let’s start with the clumsiest optimizer.
The gradient descent, or GD for short, iterates over the whole training set before updating the weights.
Each update is very small.
That's due to the whole concept of the gradient descent, driven by the small value of the learning rate.
As you remember, we couldn't use a value that is too high, as this jeopardises the algorithm.
Therefore, we have many epochs over many points, using a very small learning rate.
This is slow.
It's not so much descending.
It's basically crawling down the gradient.
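To make the bottleneck concrete, here is a minimal sketch of the plain (batch) gradient descent for a simple linear model. The function name, the data, and the learning rate are illustrative, not from the lesson:

```python
import numpy as np

def batch_gradient_descent(x, y, lr=0.1, epochs=500):
    """Plain (batch) GD: ONE small weight update per pass over ALL the data."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        y_hat = w * x + b
        # Gradients of the mean squared error over the whole training set
        grad_w = 2 * np.mean((y_hat - y) * x)
        grad_b = 2 * np.mean(y_hat - y)
        # One tiny step per epoch -- this is why plain GD is slow
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Illustrative data: y is roughly 3x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3 * x + 1 + rng.normal(scale=0.1, size=200)
w, b = batch_gradient_descent(x, y)
print(round(w, 1), round(b, 1))
```

Note that 500 epochs over 200 points produce only 500 weight updates in total, one per full pass.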
Fortunately for us, there is a simple solution to the problem.
It's a similar algorithm, called the SGD, or the stochastic gradient descent.
It works in the exact same way, but instead of updating the weights once per epoch, it updates them in
real time inside a single epoch.
Let’s elaborate on that.
The stochastic gradient descent is closely related to the concept of batching. Batching is the process
of splitting the data into n batches, often called mini-batches.
We update the weights after every batch instead of every epoch.
Let's say we have 10,000 training points.
If we choose a batch size of 1,000, then we have 10 batches per epoch.
So for every full iteration over the training data set, we would update the weights 10 times instead of just once.
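The 10,000-point example can be sketched in code. This is a hypothetical mini-batch loop; the per-epoch reshuffle is a common practice assumed here, not something stated in the lesson:

```python
import numpy as np

def minibatch_gradient_descent(x, y, lr=0.1, epochs=20, batch_size=1000):
    """Mini-batch GD: one weight update per BATCH, so many updates per epoch."""
    w, b = 0.0, 0.0
    n = len(x)
    updates = 0
    for _ in range(epochs):
        idx = np.random.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            xb, yb = x[batch], y[batch]
            y_hat = w * xb + b
            # Gradients estimated from the batch only, not the full data set
            grad_w = 2 * np.mean((y_hat - yb) * xb)
            grad_b = 2 * np.mean(y_hat - yb)
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1
    return w, b, updates

x = np.random.default_rng(1).normal(size=10_000)
y = 3 * x + 1
w, b, updates = minibatch_gradient_descent(x, y)
print(updates)  # 10 batches per epoch x 20 epochs = 200 updates
```

With 10,000 points and a batch size of 1,000, every epoch produces 10 updates instead of 1, exactly as described above.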
This is by no means a new method.
It is the same as the gradient descent, but much faster.
As all good things go, the SGD comes at a cost.
It approximates things a bit.
So we lose a bit of accuracy, but the tradeoff is worth it.
That is confirmed by the fact that virtually everyone in the industry uses the stochastic gradient descent,
not the gradient descent.
So why does this speed up the algorithm so drastically?
Was it worth talking about it for three minutes?
There are a couple of reasons, but one of the finest is related to hardware. Splitting the training set
into batches allows the CPU cores or the GPU cores to train on different batches in parallel.
This gives an incredible speed boost which is why practitioners rely on it.
OK, one last thing.
Actually, stochastic gradient descent is when you update after every single input.
So your batch size is one.
What we have been talking about was technically called mini-batch gradient descent.
However, more often than not, practitioners refer to the mini-batch GD simply as SGD.
If you are wondering, the plain gradient descent we talked about at the beginning of the course is called
batch GD, as it has a single batch.
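The naming boils down to how many updates one epoch produces. A tiny sketch of that arithmetic, using the illustrative 10,000-point example from earlier:

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """How many weight updates one full pass over the data produces."""
    return math.ceil(n_samples / batch_size)

n = 10_000
print(updates_per_epoch(n, n))      # batch GD (single batch): 1 update
print(updates_per_epoch(n, 1000))   # mini-batch GD: 10 updates
print(updates_per_epoch(n, 1))      # "true" SGD (batch size one): 10000 updates
```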
Now we can confidently close this topic.
Thanks for watching.