# Early stopping

Chapter: Overfitting / Lesson 6



### Lesson transcript

There is one last detail related to overfitting that we should examine more carefully.

We said a dozen times that we train the model until the loss function is minimized.

We can go on doing that forever, but at some point we will overfit.

That is why we introduced the validation dataset and said a thing or two about breaking off the training process.

In this lesson we will explore additional rules that will indicate our model has been trained.

The proper term is early stopping. Generally, early stopping is a technique to prevent overfitting.

It is called early stopping as we want to stop early, before we overfit. That said, let's explore the most common ways to do that.

The simplest one is to train for a pre-set number of epochs. In the minimal example after the first section, we trained for 100 epochs.
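As a rough sketch of what this naive rule looks like in code, here is full-batch gradient descent on a simple linear model run for exactly 100 epochs. The data, model, and hyperparameters are made up for illustration; this is not the course's actual minimal example.

```python
import numpy as np

# Illustrative sketch of the naive rule: train a simple linear model
# (y = x*w + b) with full-batch gradient descent for a fixed 100 epochs.
# The data and hyperparameters are made up for the example.
np.random.seed(0)
x = np.random.uniform(-10, 10, (1000, 1))
targets = 2 * x + 3 + np.random.normal(0, 1, (1000, 1))

w, b = 0.0, 0.0
learning_rate = 0.02

for epoch in range(100):                      # pre-set number of epochs
    deltas = x * w + b - targets              # prediction errors
    loss = np.mean(deltas ** 2) / 2           # halved mean squared error
    w -= learning_rate * np.mean(deltas * x)  # gradient step for the weight
    b -= learning_rate * np.mean(deltas)      # gradient step for the bias

print(w, b, loss)
```

Whether 100 epochs is enough, or far too many, is exactly what this rule cannot tell us.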

This gave us no guarantee that the minimum had been reached or passed. A high enough learning rate would even cause the loss to diverge to infinity.

That’s something you should have tried for homework.

Still, the problem was so simple that, rookie mistakes aside, very few epochs would yield a satisfactory result.

However, our machine learning skills have improved so much that we shouldn't even consider using this naive method.

A bit more sophisticated technique is to stop when the loss function updates become sufficiently small.

We even had a note on that when we introduced gradient descent: a common rule of thumb is to stop when the relative decrease in the loss function becomes less than 0.001, or 0.1 percent.
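That rule of thumb can be sketched like this, using the same kind of toy linear model (all names and numbers here are illustrative): keep training until one epoch improves the loss by less than 0.1 percent.

```python
import numpy as np

# Illustrative sketch of the relative-decrease stopping rule: iterate
# until the loss improves by less than 0.1% in one epoch. The data and
# model are made up for the example.
np.random.seed(1)
x = np.random.uniform(-10, 10, (1000, 1))
targets = 2 * x + 3 + np.random.normal(0, 1, (1000, 1))

w, b, learning_rate = 0.0, 0.0, 0.02
previous_loss = 1e30                          # "infinitely" bad starting loss

for epoch in range(100_000):                  # hard cap as a safety net
    deltas = x * w + b - targets
    loss = np.mean(deltas ** 2) / 2
    if (previous_loss - loss) / previous_loss < 1e-3:
        break                                 # relative decrease below 0.1%
    previous_loss = loss
    w -= learning_rate * np.mean(deltas * x)
    b -= learning_rate * np.mean(deltas)

print(f"stopped after {epoch} epochs")
```

Note that the loop ends on its own, long before the hard cap, once the loss has flattened out.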

This simple rule has two underlying ideas.

First we are sure we won’t stop before we have reached a minimum.

That’s because of the way gradient descent works.

It will descend until a minimum is reached.

The loss function will stop changing, making the update rule yield the same weights.

In this way, we'll be stuck in the minimum.

The second idea is that we want to save computing power by using as few iterations as possible.

As we said, once we have reached the minimum or diverged to infinity, we will be stuck there, knowing that a gazillion more epochs won't change a thing.

We can just stop there.

This saves us the trouble of iterating uselessly without updating anything.

That's obviously a level up from the previous method. While the pre-set number of epochs approach may ultimately minimize the loss, chances are we can't guess the number of required epochs.

Probably the algorithm would have performed thousands of iterations that did not update the weights.

Obviously each epoch that changes nothing is useless and should be dropped.

Alright, so the first technique didn't deal with any problems except for minimizing the loss.

The second technique optimized the cost and saved computing power.

But both can lead to tremendous overfitting.

It's only natural that we need a more advanced technique. Yes, we are talking about the validation set strategy.

This is the simplest clever technique for early stopping that prevents overfitting.

Let me state the rule once again using the proper figure. A typical training occurs this way: as time goes by, the error becomes smaller.

The decrease is roughly exponential, as initially we are finding better weights quickly.

The more we train the model the harder it gets to achieve an improvement.

At some point it becomes almost flat.

Now, if we put the validation curve on the same graph, it would start together with the training cost. At the point when we start overfitting, the validation cost will start increasing.

Here's the point at which the two functions begin diverging.

That's our red flag.

We should stop the algorithm before we do more damage to the model.
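The validation strategy can be sketched like this, again with made-up toy data and names: split off a validation set, track its loss every epoch, keep the best weights seen so far, and stop as soon as the validation loss rises.

```python
import numpy as np

# Illustrative sketch of validation-set early stopping: monitor the loss
# on held-out data and stop the moment it starts increasing. Data, split,
# and hyperparameters are made up for the example.
np.random.seed(2)
x = np.random.uniform(-10, 10, (1000, 1))
targets = 2 * x + 3 + np.random.normal(0, 1, (1000, 1))
x_train, x_val = x[:800], x[800:]             # 80/20 train/validation split
t_train, t_val = targets[:800], targets[800:]

w, b, learning_rate = 0.0, 0.0, 0.02
best_val_loss = np.inf
best_weights = (w, b)

for epoch in range(1000):
    deltas = x_train * w + b - t_train
    w -= learning_rate * np.mean(deltas * x_train)
    b -= learning_rate * np.mean(deltas)

    val_loss = np.mean((x_val * w + b - t_val) ** 2) / 2
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_weights = (w, b)                 # remember the best model so far
    else:
        break                                 # validation loss rose: red flag

w, b = best_weights                           # roll back to the best model
```

Rolling back to the best weights matters: by the time the red flag appears, the latest weights are already slightly worse than the best ones we saw.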

Now, depending on the case, different types of early stopping can be used. The pre-set number of iterations method was used in the minimal example.

That wasn’t by chance.

The problem was linear and super simple.

A more complicated method for early stopping would be a stretch.

The second method, which monitors the relative change, is simple and clever, but it doesn't address overfitting. The validation set strategy is simple and clever, and it prevents overfitting.

However, it may take our algorithm a really long time to overfit. It is possible that the weights are barely moving and we still haven't started overfitting.

That’s why I like to use a combination of both methods.

So my rule would be: stop when the validation loss starts increasing, or when the training loss becomes very small.
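The combined rule can be sketched as a small helper called at the end of each epoch; the function name and the default tolerance are my own illustration, not from the course.

```python
# Illustrative helper for the combined rule: the name and the default
# tolerance are my own choices, not from the course.
def should_stop(train_losses, val_losses, rel_tol=1e-3):
    """Return True once either early-stopping condition fires."""
    # Red flag 1: the validation loss started increasing (overfitting).
    if len(val_losses) >= 2 and val_losses[-1] > val_losses[-2]:
        return True
    # Red flag 2: the training loss has practically stopped decreasing.
    if len(train_losses) >= 2:
        prev, curr = train_losses[-2], train_losses[-1]
        if (prev - curr) / prev < rel_tol:
            return True
    return False
```

The training loop would append the freshly computed losses each epoch and break as soon as this helper returns `True`.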