# N-fold cross validation

Chapter: Overfitting / Lesson 5


### English transcript

In the last few lessons, we explained why we should split the data into three parts: training, validation, and test.

This is a standard mechanism, and usually, when machine learning is appropriate, we have enough data to apply it.

What if we have a small dataset?

We can't afford to split it into three datasets, as we will lose some of the underlying relationships; or worse, we can have so little data left for training that the algorithm cannot learn anything.

There is another answer to this issue, and it's called N-fold cross-validation.

This is a strategy that resembles the general one but combines the train and validation datasets in a clever way.

However, it still requires a test subset.

We're combining the training and validation steps, but we can't avoid the test stage.

All right, let's say we have a dataset containing 11,000 observations. We'll save 1,000 observations for the test.

What we are left with are 10,000 samples.

Please notice this dataset is not very big: in data science, we often deal with ginormous datasets, or at least that's the hope.

Side note: ginormous datasets have their own big problem. Being so large, they often have a lot of missing values. Such data is usually referred to as being sparse. This introduces a whole new spectrum of issues.

In any case, we want to train on 9,000 data points and validate on 1,000.

We'll split the remaining data into 10 subsets containing 1,000 observations each. We fold it 10 times.
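As a quick sketch (not from the lesson itself), the split just described could be written in Python, using plain index lists as placeholders for real observations:

```python
# Split the 10,000 remaining samples into 10 folds of 1,000 each.
# The "samples" here are placeholder indices standing in for real observations.
n_samples = 10_000
n_folds = 10
fold_size = n_samples // n_folds  # 1,000 observations per fold

samples = list(range(n_samples))
folds = [samples[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

print(len(folds), len(folds[0]))  # 10 1000
```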

So this is a 10-fold cross-validation. Ten is also a commonly used value, and that's why we picked it for this illustration.

We treat one subset as a validation set, while the other nine, combined, serve as a training set.

Visually, it looks this way: we have 10 combinations. The orange set is the validation one, while the blue ones are the training sets.

During the first epoch, the first chunk of data serves as validation. Then, in the second epoch, the second chunk of data serves as validation, and so on.

In this way, for each epoch, we don't overlap training and validation, as it should be.

Moreover, we managed to use the whole dataset, except for the test part.
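The rotation can be sketched as follows (a toy illustration with placeholder indices, not the lesson's actual code): in each epoch, one fold is held out for validation and the other nine are concatenated for training.

```python
# 10-fold rotation over 10,000 placeholder sample indices (test set already removed).
n_samples, n_folds = 10_000, 10
fold_size = n_samples // n_folds
samples = list(range(n_samples))
folds = [samples[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

for epoch in range(n_folds):
    validation = folds[epoch]  # the "orange" chunk: held out this epoch
    training = [x for i, fold in enumerate(folds) if i != epoch for x in fold]
    # Within an epoch, training and validation never overlap...
    assert not set(training) & set(validation)
    # ...and together they cover the whole (non-test) dataset.
    assert len(training) == 9_000 and len(validation) == 1_000
```

In practice, you would fit the model on `training` and evaluate it on `validation` inside the loop; libraries such as scikit-learn provide this rotation out of the box.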

As with all good things, this comes at a price: we have still trained on the validation set, which was not a good idea.

It is less likely that the overfitting flag is raised, and it is possible that we overfitted a bit.

The tradeoff is between not having a model and having a model that's a bit overfitted. N-fold cross-validation solves the scarce data issue, but should by no means be used as the norm.

Whenever you can, divide your data into three parts: training, validation, and test. Only if the model doesn't manage to learn much because of data scarcity should you try N-fold cross-validation.

OK, this will do for now.

Thanks for watching.
