Preprocessing the data

Course: Deep Learning with TensorFlow / Chapter: Business case / Lesson 4


English transcript of the lesson

Let's start preprocessing.

I won't dive into the code too much, as we want to focus on machine learning as much as possible. You can later examine the code with comments, or ask a question in the Q&A section to get a better understanding, as usual.

I'll import numpy as np. Next, from sklearn I'll import preprocessing. Let me stop for a second. I'll use the sklearn capabilities for standardizing the inputs. It's one line of code which drastically improves the accuracy of the model. Almost always, we standardize all inputs, as the quality of the algorithm improves significantly. Without standardizing the inputs, we reach 10 percent less accuracy for the model we will build here, and you remember how a 0.5 percent improvement in the MNIST example was a great success. We suggested installing sklearn when we were setting up the environment in the third section of the course; it was also one of the exercises. If you don't have it installed yet, I suggest installing the package following the well-known methodology.

Let's get to work. First, I'll load the CSV file: raw_csv_data equals np.loadtxt of 'Audiobooks_data.csv' with a comma delimiter. OK, now we have the data in a variable. Our inputs are all the variables in the CSV except for the first and the last column. The first column is the arbitrarily chosen ID, while the last is the targets. Let's put that in a new variable called unscaled_inputs_all, which takes all columns excluding the ID and the targets, so the 0 column and the last one, or minus first. Let's record the targets in the variable targets_all using the same method; they are the last column of the CSV.

We have extracted the inputs and the targets. The next section of code will deal with balancing the dataset. We will count the number of targets that are 1s; as we know, they are fewer than the 0s. Next, we will keep as many 0s as there are 1s. OK, let's count the targets that are 1s. If we sum all targets, which can take only 0 and 1 as values, we will get the number of targets that are 1s. So the number of 1-targets is equal to np.sum of targets_all. I'll declare it as an integer, as targets_all may be a Boolean, depending on the programming language. Then I'll just keep as many 0s as 1s. Let's set a counter for the zero targets equal to zero.

Next, we will need a variable that records the indices to be removed. For now it is empty, but we want it to be a list or a tuple, so we put empty brackets. Let's iterate over the dataset and balance it: for i in range of targets_all.shape[0]. targets_all contains all targets; its shape on the 0 axis is basically the length of the vector, so it will show us the number of all targets. In the loop, we want to increase the counter by 1 if the target at position i is 0. Inside that same if, we put another if, which will add an index to the variable indices_to_remove if the zeros counter is over the number of 1s. I'll use the append method, which simply adds an element to a list. So after the counter for zeros matches the number of 1s, I'll note all indices to be removed. OK.

So after we run the code, the variable indices_to_remove will contain the indices of all targets we won't need. Deleting these entries will balance the dataset. OK, let's create a new variable, unscaled_inputs_equal_priors, which is equal to np.delete of unscaled_inputs_all, from which I want to delete the entries with indices from indices_to_remove, on axis 0 of the vector. Similarly, targets_equal_priors is equal to np.delete of targets_all, indices_to_remove, on axis 0. Done.
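The loading, extraction, and balancing steps described above can be sketched as follows. Since the actual Audiobooks_data.csv is not reproduced here, a small synthetic array with the same layout (first column an ID, last column the target) stands in for it; the variable names follow the ones mentioned in the lesson.

```python
import numpy as np

# Stand-in for the CSV: first column = ID, middle columns = inputs,
# last column = target. In the lesson this would instead be:
# raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')
raw_csv_data = np.array([
    [994.,  1620., 19.73, 0.],
    [1143., 2160.,  5.33, 0.],
    [2059., 2160.,  5.33, 1.],
    [2882., 1620.,  5.96, 0.],
    [3342., 2160.,  5.33, 1.],
    [3416., 1620., 19.73, 0.],
])

# Inputs: everything except the ID (column 0) and the targets (last column)
unscaled_inputs_all = raw_csv_data[:, 1:-1]
targets_all = raw_csv_data[:, -1]

# Balancing: keep only as many 0-targets as there are 1-targets
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        # once the 0s counter exceeds the number of 1s, mark the row for removal
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
```

With this stand-in data, two of the four 0-rows are dropped, leaving two 0s and two 1s, i.e. equal priors.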

All right we have a balanced dataset.

Next, we want to standardize, or scale, the inputs. The inputs are currently unscaled, and we noted that standardizing them will greatly improve the algorithm. A simple line of code takes care of that: scaled_inputs is equal to preprocessing.scale of unscaled_inputs_equal_priors. That's the preprocessing library we imported from sklearn. The scale method standardizes the dataset along each variable. So, basically, all inputs will be standardized.
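A minimal sketch of that scaling line, using a stand-in array in place of the real balanced inputs:

```python
import numpy as np
from sklearn import preprocessing

# Stand-in for unscaled_inputs_equal_priors: two columns on very
# different scales
unscaled_inputs_equal_priors = np.array([[1., 200.],
                                         [2., 400.],
                                         [3., 600.]])

# preprocessing.scale standardizes each column to mean 0 and
# standard deviation 1
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
```

After this one line, every input variable is on the same scale, regardless of its original units.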

Cool. So, we have a balanced dataset which is also scaled; our preprocessing is almost finished.

A little trick is to shuffle the inputs and the targets.

We are basically keeping the same information but in a different order.

It's possible that the original dataset was collected in order of date. Since we will be batching, we must shuffle the data. It should be as randomly spread as possible, so batching works fine.

Let me provide a counter example.

Imagine the data is ordered.

So each batch represents approximately a different day of purchases. Inside the batch the data is homogeneous, while between batches it is very heterogeneous, due to promotions, day-of-the-week effects, and so on.

This will confuse the stochastic gradient descent.

When we average the loss across batches, overall we want them shuffled. Let's see the code. First, we take the indices from axis 0 of the scaled inputs' shape and place them into a variable. Then we use the np.random.shuffle method to shuffle them. Finally, we create the shuffled_inputs and shuffled_targets variables, equal to the scaled inputs and the targets where the indices are the shuffled indices.
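The shuffling step can be sketched like this. The arrays are stand-ins; column 0 of the inputs carries the original row index purely so that the input-target pairing can be verified afterwards:

```python
import numpy as np

# Stand-in data: row i of the inputs starts with i, and is paired
# with targets_equal_priors[i]
scaled_inputs = np.array([[0., 10.], [1., 11.], [2., 12.], [3., 13.]])
targets_equal_priors = np.array([0., 1., 0., 1.])

# Take the indices along axis 0 and shuffle them in place
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Index both arrays with the SAME shuffled indices, so each input
# keeps its own target
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
```

The key point is that one set of shuffled indices is applied to both arrays, so the rows move together.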

All right. So far, we have preprocessed the data, shuffled it, and balanced the dataset. What we have left is to split it into training, validation, and test.

Let's count the total number of samples; it is equal to the shape of the shuffled inputs on the 0 axis. Next, we must determine the size of the three datasets. I'll use the 80-10-10 split. The train samples count is equal to 0.8 times the total number of samples. Naturally, we want to make sure the number is an integer. Next, the validation samples count is 0.1 times the total number of samples. Finally, the test samples count is equal to the total number of samples, minus the training and the validation.

Cool, we have the sizes of the train, validation, and test sets. Let's extract them from the big dataset. The train inputs are given by the first train_samples_count of the preprocessed inputs; the train targets are the first train_samples_count of the targets. Logically, the validation inputs are the inputs in the interval from train_samples_count to train_samples_count plus validation_samples_count; the validation targets are the targets in the same interval. Finally, the test is everything that is left.
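A sketch of the 80-10-10 split on stand-in data (100 synthetic samples rather than the real dataset):

```python
import numpy as np

# Stand-in for the shuffled data: 100 samples, 2 features
shuffled_inputs = np.arange(200.).reshape(100, 2)
shuffled_targets = np.tile([0., 1.], 50)

samples_count = shuffled_inputs.shape[0]

# 80-10-10 split; the counts must be integers
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Training: the first train_samples_count observations
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Validation: the next validation_samples_count observations
validation_inputs = shuffled_inputs[train_samples_count:
                                    train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:
                                      train_samples_count + validation_samples_count]

# Test: everything that is left
test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]
```

Defining the test count as the remainder guarantees the three parts always add up to the whole dataset, even when the total is not divisible by 10.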

Cool, we have split the dataset into training, validation, and test. It is useful to check if we have balanced the dataset. Moreover, we may have balanced the whole dataset, but not the training, validation, and test sets. I'll print the number of 1s for each dataset, the total number of samples, and the proportion of 1s as a part of the total. They should all be around 50 percent. Let's quickly run that code. We can see the training set is considerably larger than the validation and the test. This is how we wanted it to be. The total number of observations is around 4,500, which is a relatively good amount of data, although we started with around 15,000 samples in the CSV. The proportions, or should I say the priors, look OK, as they are almost 50 percent. Note that 52 percent or 55 percent for two classes are also fine; however, we want to be as close to 50 percent as possible. OK.
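The check itself is a one-line print per dataset; here is a sketch for one of them, with stand-in targets:

```python
import numpy as np

# Stand-in targets for one of the three datasets
train_targets = np.array([0., 1., 1., 0., 1., 0., 0., 1.])

# Number of 1s, total samples, and the share of 1s (should be near 0.5)
num_ones = int(np.sum(train_targets))
total = train_targets.shape[0]
print(num_ones, total, num_ones / total)
```

The same three values would be printed for the validation and test targets as well.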

Finally, we save the three datasets using the np.savez method. I name them in a very semantic way, so we can easily use them later. All right. Our data is preprocessed now.
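A sketch of the saving step. The exact file names are not spelled out in this transcript, so the name below ('Audiobooks_data_train.npz') and the temp-directory path are illustrative assumptions; the arrays are stand-ins too:

```python
import os
import tempfile
import numpy as np

# Stand-in arrays for one of the three datasets
train_inputs = np.ones((8, 3))
train_targets = np.zeros(8)

# Semantic file name (assumed for illustration); a temp directory is
# used so the sketch runs anywhere
path = os.path.join(tempfile.gettempdir(), 'Audiobooks_data_train.npz')
np.savez(path, inputs=train_inputs, targets=train_targets)

# The machine-learning notebook will later load the arrays back by name
npz = np.load(path)
```

Saving with keyword names (inputs=..., targets=...) is what lets the next notebook retrieve each array by the same name.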

Each time we run the code, we will get different proportions, as we shuffle the indices randomly; so the training, validation, and test datasets will contain different samples. You can use the same code to preprocess any dataset where you have two classes. The code will skip the first column of the data, as here we skip the ID, and the last column will be treated as targets. If you want to customize the code for a problem with more classes, you must balance the dataset's classes, instead of two; everything else should be the same. The preprocessing is over. Henceforth, we will only work with the .npz files. I will save this Jupyter notebook and continue with the machine learning in a separate one. Make sure that you have the raw CSV when you run the code on your computer; in this way, you will create the .npz files which we will use for the machine learning part.

There are some additional adjustments you can make to the code to improve the preprocessing. Check out the exercises, and work on them if you'd like. Great work.

Thanks for watching.
