# Balancing the dataset

/ فصل: Business case / درس 3

### توضیح مختصر

• زمان مطالعه 0 دقیقه
• سطح خیلی سخت

### دانلود اپلیکیشن «زوم»

این درس را می‌توانید به بهترین شکل و با امکانات عالی در اپلیکیشن «زوم» بخوانید

### متن انگلیسی درس

Before we start pre processing I’ll take a minute to talk about the importance of balancing your data

set let’s think about a photo classification problem with two classes cats and dogs.

What accuracy do you expect from a good model.

Well if we correctly classify 70 percent of the photos and that’s not too bad.

80 percent accuracy is good while 90 percent is very good for beginners right.

percent accuracy 90 percent accuracy for most problems is an impressive accomplishment OK.

Now imagine a model that takes animal photos and outputs only cats no matter what you feed to the algorithm.

It will always output cat as the answer.

Is this machine learning you may ask.

It’s definitely not the result we want from a machine learning algorithm but is a common result imagine

that in our dataset 90 percent of the photos are of cats and 10 percent dogs the same awfully bad model

applied to this dataset would classify all photos as cats but 90 percent of the photos in that dataset

are cats.

What’s the accuracy of the algorithm.

It is 90 percent incredible.

Why does this problem arise.

Well since the machine learning algorithm tries to optimize the loss it quickly realizes that if so

many targets are cats the output should most likely be cats to achieve a great result and therefore

it comes up with the same prediction at all times.

Cats OK.

Now if the distribution of photos is 90 percent cats and 10 percent dogs a model with an 80 percent

Why.

Because the dumb model that outputs only cats will do better than it.

Therefore only a result above 90 percent is a more favorable one we refer to the initial probability

of picking a photo of some class as a prior the priors are zero point nine for cats and zero point one

for dogs the priors are balance when 50 percent of the photos are cats and 50 percent or dogs or zero

point five and zero point five.

Examples of unbalanced priors are zero point nine and zero point one zero point seven and zero point

three zero point six zero point four and so on each pair is prone to the issue discussed a minute ago

a machine learning algorithm may quickly learn that one class is much more common than the other and

decide all the ways to output the value with the higher prior OK.

If we have three classes cats dogs and horses balancing the data set would imply picking a data set

where each class amounts to approximately one third of the data set.

If we have four classes twenty five percent each.

You get the gist.

In our business case by exploring the targets we quickly realize most customers did not convert in the

given time span.

We must surely balance the data set to proceed.

That’s done by counting the total number of target ones and matching the same number of zeros to them.

All right we’ll do that in the next lecture.

See you there.

### مشارکت کنندگان در این صفحه

تا کنون فردی در بازسازی این صفحه مشارکت نداشته است.

🖊 شما نیز می‌توانید برای مشارکت در ترجمه‌ی این صفحه یا اصلاح متن انگلیسی، به این لینک مراجعه بفرمایید.