# Preprocess the data - shuffle and batch the data


### Brief description

• Reading time: 0 minutes
• Level: very hard

### English text of the lesson

Welcome back.

In this lecture, we’ll first shuffle our data and then create the validation dataset. Shuffling is a little trick we like to apply in the preprocessing stage. When shuffling, we are basically keeping the same information but in a different order.

It’s possible that the targets are stored in ascending order, resulting in the first X batches having only zeros as targets and the following batches having only ones as targets. Since we’ll be batching, we’d better shuffle the data. It should be as randomly spread as possible so that batching works as intended.

Let me give you an unrelated example.

Imagine the data is ordered and we have 10 batches. Each batch contains only a given digit, so the first batch has only zeros, the second only ones, the third only twos, etc. This will confuse the stochastic gradient descent algorithm, because each batch is homogeneous inside but completely different from all other batches, causing the loss to differ greatly from batch to batch. In other words, we want the data shuffled.
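To make the problem concrete, here is a minimal pure-Python sketch (not TensorFlow code). It assumes a toy set of 1,000 digit labels stored in ascending order and a hypothetical helper `make_batches`. It shows that ordered data yields batches with a single digit each, while shuffled data mixes the digits within each batch.

```python
import random


def make_batches(labels, batch_size):
    """Split a list of labels into consecutive batches (hypothetical helper)."""
    return [labels[i:i + batch_size] for i in range(0, len(labels), batch_size)]


# Toy targets: 100 zeros, then 100 ones, ..., then 100 nines (ascending order).
ordered_labels = [digit for digit in range(10) for _ in range(100)]

# Without shuffling, every batch of 100 contains exactly one distinct digit.
ordered_batches = make_batches(ordered_labels, batch_size=100)
distinct_per_ordered_batch = [len(set(batch)) for batch in ordered_batches]

# After shuffling (fixed seed for reproducibility), digits are mixed in each batch.
shuffled_labels = ordered_labels[:]
random.Random(0).shuffle(shuffled_labels)
shuffled_batches = make_batches(shuffled_labels, batch_size=100)
distinct_per_shuffled_batch = [len(set(batch)) for batch in shuffled_batches]

print(distinct_per_ordered_batch)   # one distinct digit per batch
print(distinct_per_shuffled_batch)  # several distinct digits per batch
```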

OK, we should start by defining a buffer size, say 10,000. This buffer size parameter is used in cases when we are dealing with enormous datasets. In such cases, we can’t shuffle the whole dataset in one go, because we can’t possibly fit it all in the memory of the computer. So instead, we must instruct TensorFlow to work through the data with a buffer of ten thousand samples at a time, shuffling within that buffer. Logically, if we set the buffer size to twenty thousand, it will work with twenty thousand samples at once.

Note that if the buffer size is equal to 1, no shuffling will actually happen. If the buffer size is equal to or bigger than the total number of samples, shuffling will take place at once and shuffle them uniformly. Finally, if we have a buffer size that’s between one and the total sample size, we’ll be making a compromise that optimizes the use of our computer’s resources: the shuffle is less uniform, but it fits in memory.
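The effect of the buffer size can be pictured with a small pure-Python model. TensorFlow’s documentation describes `shuffle` as filling a buffer with `buffer_size` elements, sampling one at random, and replacing it with the next element of the dataset. The sketch below mimics that description. The function `buffered_shuffle` and its `seed` parameter are hypothetical names used only for illustration, not TensorFlow’s implementation.

```python
import random


def buffered_shuffle(samples, buffer_size, seed=0):
    """Pure-Python model of a buffered shuffle (illustrative only).

    Keeps at most `buffer_size` samples in memory, emits one of them at
    random, and refills the buffer from the stream.
    """
    rng = random.Random(seed)
    buffer = []
    output = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == buffer_size:
            # Buffer is full: emit a random element to make room.
            output.append(buffer.pop(rng.randrange(len(buffer))))
    # Stream exhausted: drain what is left in random order.
    while buffer:
        output.append(buffer.pop(rng.randrange(len(buffer))))
    return output


data = list(range(10))
print(buffered_shuffle(data, buffer_size=1))   # order unchanged: no shuffling
print(buffered_shuffle(data, buffer_size=3))   # elements only move locally
print(buffered_shuffle(data, buffer_size=10))  # shuffled across the whole list
```

With a buffer of 3, the element emitted at position `k` can come from no further than position `k + 2` of the input, which is why a small buffer only shuffles locally.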

All right, time to do the shuffle. Luckily for us, there is a `shuffle` method readily available, and we just need to specify the buffer size. Let `shuffled_train_and_validation_data` be equal to `scaled_train_and_validation_data.shuffle` with the buffer size as an argument, and that’s it.

Once we have scaled and shuffled the data, we can proceed to actually extracting the train and validation datasets. Our validation data will be equal to 10 percent of the training set, which we have already calculated and stored in `num_validation_samples`. We can use the method `take` to extract that many samples, so `validation_data` equals `shuffled_train_and_validation_data.take(num_validation_samples)`.

Good.

We have successfully created a validation dataset. In the same way, we can create the train data by extracting all elements but the first X validation samples. An appropriate method here is `skip`, so `train_data` equals `shuffled_train_and_validation_data.skip(num_validation_samples)`.
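The split can be sketched in plain Python to show what the two steps do. The functions `take` and `skip` below are stand-ins built on `itertools.islice` that model the behaviour described in the lecture; they are not the TensorFlow methods themselves, and the toy data of 20 pairs is an assumption for illustration.

```python
from itertools import islice


def take(dataset, count):
    """Return the first `count` elements (models the `take` step)."""
    return list(islice(dataset, count))


def skip(dataset, count):
    """Return everything after the first `count` elements (models `skip`)."""
    return list(islice(dataset, count, None))


# Toy stand-in for the shuffled data: 20 (input, target) pairs.
shuffled_train_and_validation_data = [(i, i % 10) for i in range(20)]
num_validation_samples = len(shuffled_train_and_validation_data) // 10  # 10 percent

validation_data = take(shuffled_train_and_validation_data, num_validation_samples)
train_data = skip(shuffled_train_and_validation_data, num_validation_samples)

print(len(validation_data), len(train_data))  # 2 18
```

Because `take` and `skip` use the same count on the same ordering, the two parts do not overlap and together cover every sample.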

OK, so far so good. We will be using mini-batch gradient descent to train our model. As we explained before, this is the most efficient way to perform deep learning, as the trade-off between accuracy and speed is optimal. To do that, we must set a batch size and prepare our data for batching. Just a quick memory refresh: a batch size of one equals stochastic gradient descent, while a batch size equal to the number of samples equals the gradient descent we’ve seen until now.
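A quick bit of arithmetic shows how the batch size changes the number of weight updates in one pass over the data. The training-set size of 54,000 and the helper `updates_per_epoch` are assumptions made for illustration, not values taken from the lecture.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data (hypothetical helper)."""
    return math.ceil(num_samples / batch_size)


num_samples = 54_000  # assumed training-set size, for illustration only

print(updates_per_epoch(num_samples, 1))            # 54000: stochastic gradient descent
print(updates_per_epoch(num_samples, 100))          # 540: mini-batch gradient descent
print(updates_per_epoch(num_samples, num_samples))  # 1: gradient descent on the full set
```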

So we want a number relatively small with regard to the dataset but reasonably high, so that it would allow us to preserve the underlying dependencies. I’ll set the batch size to 100. That’s yet another hyperparameter that you may play with when you fine-tune the algorithm.

There is a method `batch` we can use on the dataset to combine its consecutive elements into batches. Let’s start with the train data we just created. I’ll simply overwrite it, as there is no need to preserve a version of this data that is not batched. So `train_data` equals `train_data.batch`, and in brackets we specify the batch size variable, and that’s all. This will add a new dimension to our tensors that indicates to the model how many samples it should take in each batch.
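What batching does to the data can be modelled in a few lines of plain Python. The `batch` function below is an illustrative stand-in, not the TensorFlow method. The name `BATCH_SIZE` and the toy dataset of 250 samples are assumptions.

```python
def batch(dataset, batch_size):
    """Group consecutive elements into lists of at most `batch_size` items."""
    batches = []
    current = []
    for element in dataset:
        current.append(element)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        # The last batch may be smaller if the sizes do not divide evenly.
        batches.append(current)
    return batches


train_data = list(range(250))  # toy stand-in for the unbatched training samples
BATCH_SIZE = 100  # assumed name for the batch size variable

train_data = batch(train_data, BATCH_SIZE)  # overwrite, as in the lecture
print([len(b) for b in train_data])  # [100, 100, 50]
```

The samples themselves are unchanged. They are only grouped, so each element of the dataset is now a batch rather than a single sample.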

Great.

What about the validation data? Well, since we won’t be backpropagating on the validation data but only forward propagating, we don’t really need to batch. Remember that batching was useful in updating weights only once per batch, which is like 100 samples, rather than at every sample, hence reducing noise in the training updates. So whenever we validate or test, we simply forward propagate once. When batching, we usually find the average loss and average accuracy. During validation and testing, we want the exact values. Therefore, we should take all the data at once. Moreover, when forward propagating, we don’t use that much computational power, so it’s not expensive to calculate the exact values.

However, the model expects our validation set in batch form too. That’s why we should overwrite `validation_data` with `validation_data.batch`. Here we’ll have a single batch with a batch size equal to the total number of validation samples, or `num_validation_samples`. In this way, we’ll create a new dimension in our tensors indicating that the model should take the whole validation dataset at once when it utilizes it.

OK, great. To handle our test data, we don’t need to batch it either. We’ll take the same approach we used with the validation set. This time, though, you’ll have the chance to do it on your own for homework.

All right.

Finally, our validation data must have the same shape and object properties as the train and test data. The MNIST data is iterable and in 2-tuple format, as we set the argument `as_supervised` to `True`. Therefore, we must extract and convert the validation inputs and targets appropriately. Let’s store them in `validation_inputs` and `validation_targets` and set them to be equal to `next(iter(validation_data))`. `iter` is the Python syntax for making the validation data an iterator. By default, that will make the dataset iterable but will not load any data. `next` loads the next batch. Since there is only one batch, it will load the inputs and the targets.
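The `next(iter(...))` pattern works on any Python iterable, so it can be tried without TensorFlow. The sketch below assumes a toy dataset made of one batch that holds all inputs and all targets as a 2-tuple. The values and sizes are made up for illustration.

```python
# Toy stand-in: a dataset with one batch holding all inputs and all targets
# as a 2-tuple (the structure assumed from the lecture's description).
num_validation_samples = 5
inputs = [[0.1 * i, 0.2 * i] for i in range(num_validation_samples)]
targets = [i % 10 for i in range(num_validation_samples)]
validation_data = [(inputs, targets)]  # a single batch

iterator = iter(validation_data)  # creates an iterator; nothing is consumed yet
validation_inputs, validation_targets = next(iterator)  # loads the only batch

print(len(validation_inputs), len(validation_targets))  # 5 5

# A further call to next(iterator) would raise StopIteration: no batches remain.
```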

Thanks for watching.
