Key topics
Preprocessing the data
Brief description
- Study time: 0 minutes
- Level: very hard
English transcript of the lesson
Let's start preprocessing.
I won't dive into the code too much, as we want to focus on machine learning as much as possible. You can later examine the code with its comments, or ask a question in the Q&A section to get a better understanding, as usual.
I'll import numpy as np. Next, from sklearn, I'll import preprocessing. Let me stop for a second: I'll use the sklearn capabilities for standardizing the inputs. It's one line of code which drastically improves the accuracy of the model. Almost always, we standardize all inputs, as the quality of the algorithm improves significantly. Without standardizing the inputs, we reach 10 percent less accuracy for the model we will build here, and you remember how a 0.5 percent improvement in the MNIST example was a great success. We suggested installing scikit-learn when we were setting up the environment in the third section of the course; it was also one of the exercises. If you don't have it installed yet, I suggest installing the package following the well-known methodology.

Let's get to work. First, I'll load the CSV file: raw_csv_data equals np.loadtxt of the audiobooks data CSV, with a comma delimiter. OK, now we have the data in a variable. Our inputs are all the variables in the CSV except for the first and the last column. The first column is the arbitrarily chosen ID, while the last is the targets. Let's put the inputs in a new variable called unscaled_inputs_all, which takes all columns excluding the ID and the targets, so the 0th column and the last one (or minus first). Let's record the targets in a variable targets_all using the same method; they are the last column of the CSV. We have extracted the inputs and the targets.
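As a rough sketch, this first step might look like the code below. The variable names follow the ones mentioned in the lecture, but the exact CSV file name is an assumption, so adjust it to match your copy of the data.

import numpy as np
from sklearn import preprocessing

# Load the raw CSV; the file name here is assumed
raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')

# Inputs: every column except the first (the arbitrary ID) and the last (the targets)
unscaled_inputs_all = raw_csv_data[:, 1:-1]
# Targets: the last column
targets_all = raw_csv_data[:, -1]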
The next section of code deals with balancing the dataset. We will count the number of targets that are 1s; as we know, they are fewer than the 0s. Then we will keep only as many 0s as there are 1s.

OK, let's count the targets that are 1s. If we sum all targets, which can take only 0 and 1 as values, we will get the number of targets that are 1s. So the number of one targets is equal to np.sum of targets_all. I'll declare it as an integer, as targets_all may be a Boolean, depending on the programming language. Then I'll just keep as many 0s as 1s. Let's set a counter for the zero targets, equal to zero.
Next, we need a variable that records the indices to be removed. For now it is empty, but we want it to be a list (or a tuple), so we put empty brackets. Let's iterate over the dataset and balance it: for i in range of targets_all.shape[0]. targets_all contains all targets; its shape on the 0 axis is basically the length of the vector, so it gives us the number of all targets. In the loop, we want to increase the counter by one if the target at position i is zero. Inside that same if, we put another if, which will add an index to the variable indices_to_remove if the zero counter is over the number of ones. I'll use the append method, which simply adds an element to a list. So once the counter for zeros exceeds the number of ones, I'll note all further indices to be removed. OK.
So after we run the code, the variable indices_to_remove will contain the indices of all targets we won't need. Deleting these entries will balance the dataset.

OK, let's create a new variable, unscaled_inputs_equal_priors, which is equal to np.delete of unscaled_inputs_all, from which I want to delete the entries with indices from indices_to_remove, along axis 0 of the array. Similarly, the targets with equal priors are equal to np.delete of targets_all, indices_to_remove, along axis 0. Done.

All right, we have a balanced dataset.
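Put together, the balancing step described above could be sketched like this, again using the variable names mentioned in the lecture:

# Count the targets that are 1s; cast to int in case the targets are Booleans
num_one_targets = int(np.sum(targets_all))

zero_targets_counter = 0
indices_to_remove = []  # must be a list (or tuple) so we can append to it

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        # Once we already have as many 0s as 1s, mark any further 0s for removal
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# Delete the marked rows to obtain a dataset with (roughly) equal priors
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)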
Next, we want to standardize, or scale, the inputs. The inputs are currently unscaled, and we noted that standardizing them will greatly improve the algorithm.

A simple line of code takes care of that: scaled_inputs is equal to preprocessing.scale of unscaled_inputs_equal_priors. That's the preprocessing library we imported from sklearn; the scale method standardizes the dataset along each variable. So, basically, all inputs will be standardized.
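In code, that single line would look roughly like this, using the preprocessing module imported earlier:

# Standardize each input column (subtract its mean, divide by its standard deviation)
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)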
Cool, so we have a balanced dataset which is also scaled. Our preprocessing is almost finished.

A little trick is to shuffle the inputs and the targets. We are basically keeping the same information, but in a different order. It's possible that the original dataset was collected in order of date. Since we will be batching, we must shuffle the data; it should be as randomly spread as possible so that batching works well.
Let me provide a counterexample. Imagine the data is ordered, so each batch represents approximately a different day of purchases. Inside a batch the data is homogeneous, while between batches it is very heterogeneous, due to promotions, day-of-the-week effects, and so on. This will confuse the stochastic gradient descent when we average the loss across batches. Overall, we want the data shuffled.

Let's see the code. First, we take the indices from axis 0 of the scaled inputs' shape and place them into a variable. Then we use the np.random.shuffle method to shuffle them. Finally, we create shuffled_inputs and shuffled_targets variables, equal to the scaled inputs and the targets indexed by the shuffled indices.
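A sketch of that shuffling step:

# Take the indices 0..N-1, shuffle them, then reorder inputs and targets the same way
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]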
All right, so far we have preprocessed the data, shuffled it, and balanced the dataset. What we have left is to split it into training, validation, and test.

Let's count the total number of samples; it is equal to the shape of the shuffled inputs on the 0 axis. Next, we must determine the size of the three datasets. I'll use the 80-10-10 split. The train samples count is equal to 0.8 times the total number of samples; naturally, we want to make sure the number is an integer. Next, the validation samples count is 0.1 times the total number of samples. Finally, the test samples count is equal to the total number of samples minus the training and the validation counts. Cool, we have the sizes of the train, validation, and test sets.
Let's extract them from the big dataset. The train inputs are given by the first train_samples_count entries of the preprocessed inputs, and the train targets are the first train_samples_count entries of the targets. Logically, the validation inputs are the inputs in the interval from train_samples_count to train_samples_count plus validation_samples_count, and the validation targets are the targets in the same interval. Finally, the test set is everything that is left.
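The 80-10-10 split described above could be sketched like this:

samples_count = shuffled_inputs.shape[0]

# 80% train, 10% validation, and whatever is left (about 10%) for test
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count + validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]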
Cool, we have split the dataset into training, validation, and test. It is useful to check whether we have balanced the dataset; moreover, we may have balanced the whole dataset but not the training, validation, and test sets individually. I'll print, for each dataset, the number of 1s, the total number of samples, and the proportion of 1s as a part of the total. They should all be around 50 percent.
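That check could be as simple as printing, for each set, the number of 1s, the total count, and their ratio:

# For each set: number of 1s, total samples, and the proportion of 1s (should be close to 0.5)
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)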
Let's quickly run that code. We can see the training set is considerably larger than the validation and the test sets; this is how we wanted it to be. The total number of observations is around 4,500, which is a relatively good amount of data, although we started with around 15,000 samples in the CSV. The proportions, or should I say the priors, look OK, as they are almost 50 percent. Note that 52 percent or 55 percent for two classes is also fine; however, we want to be as close to 50 percent as possible. OK.
Finally, we save the three datasets using the np.savez method. I name them in a very semantic way so we can easily use them later.

All right, our data is now preprocessed.
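A sketch of the saving step; the exact .npz file names below are only illustrative, since the lecture doesn't spell them out, so any similarly "semantic" names will do:

# Save each set in its own .npz file; the file names here are assumed
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)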
Each time we run the code we will get slightly different proportions, as we shuffle the indices randomly, so the training, validation, and test datasets will contain different samples.

You can use the same code to preprocess any dataset where you have two classes. The code will skip the first column of the data, as here we skip the ID, and the last column will be treated as the targets. If you want to customize the code for a problem with more classes, you must balance the dataset across those classes instead of two; everything else should be the same.

The preprocessing is over. Henceforth, we will only work with the .npz files. I will save this Jupyter notebook and continue with the machine learning in a separate one. Make sure that you have the raw CSV when you run the code on your computer; in this way you will create the .npz files which we will use for the machine learning part.
There are some additional adjustments you can make to the code to improve the preprocessing. Check out the exercises and work on them if you'd like. Great work.
Thanks for watching.