Balancing the dataset
- زمان مطالعه 0 دقیقه
- سطح خیلی سخت
دانلود اپلیکیشن «زوم»
این درس را میتوانید به بهترین شکل و با امکانات عالی در اپلیکیشن «زوم» بخوانید
متن انگلیسی درس
Before we start pre processing I’ll take a minute to talk about the importance of balancing your data
set let’s think about a photo classification problem with two classes cats and dogs.
What accuracy do you expect from a good model.
Well if we correctly classify 70 percent of the photos and that’s not too bad.
80 percent accuracy is good while 90 percent is very good for beginners right.
I’m not talking about Google and Facebook classifications which achieve ninety nine point nine nine
percent accuracy 90 percent accuracy for most problems is an impressive accomplishment OK.
Now imagine a model that takes animal photos and outputs only cats no matter what you feed to the algorithm.
It will always output cat as the answer.
A bad model isn’t it.
Is this machine learning you may ask.
It’s definitely not the result we want from a machine learning algorithm but is a common result imagine
that in our dataset 90 percent of the photos are of cats and 10 percent dogs the same awfully bad model
applied to this dataset would classify all photos as cats but 90 percent of the photos in that dataset
What’s the accuracy of the algorithm.
It is 90 percent incredible.
Why does this problem arise.
Well since the machine learning algorithm tries to optimize the loss it quickly realizes that if so
many targets are cats the output should most likely be cats to achieve a great result and therefore
it comes up with the same prediction at all times.
Now if the distribution of photos is 90 percent cats and 10 percent dogs a model with an 80 percent
accuracy is a bad model.
Because the dumb model that outputs only cats will do better than it.
Therefore only a result above 90 percent is a more favorable one we refer to the initial probability
of picking a photo of some class as a prior the priors are zero point nine for cats and zero point one
for dogs the priors are balance when 50 percent of the photos are cats and 50 percent or dogs or zero
point five and zero point five.
Examples of unbalanced priors are zero point nine and zero point one zero point seven and zero point
three zero point six zero point four and so on each pair is prone to the issue discussed a minute ago
a machine learning algorithm may quickly learn that one class is much more common than the other and
decide all the ways to output the value with the higher prior OK.
If we have three classes cats dogs and horses balancing the data set would imply picking a data set
where each class amounts to approximately one third of the data set.
If we have four classes twenty five percent each.
You get the gist.
In our business case by exploring the targets we quickly realize most customers did not convert in the
given time span.
We must surely balance the data set to proceed.
That’s done by counting the total number of target ones and matching the same number of zeros to them.
All right we’ll do that in the next lecture.
See you there.
مشارکت کنندگان در این صفحه
تا کنون فردی در بازسازی این صفحه مشارکت نداشته است.
🖊 شما نیز میتوانید برای مشارکت در ترجمهی این صفحه یا اصلاح متن انگلیسی، به این لینک مراجعه بفرمایید.