N-parameter gradient descent

Course: Deep Learning with TensorFlow / Chapter: Introduction to neural networks / Lesson 12


Brief description

  • Study time: 0 minutes
  • Level: very hard


English transcript of the lesson

If we want to create working models that can be easily adapted to different problems, we must understand

the drivers of a machine learning algorithm.

That is why we’ve covered several theoretical steps and this is where the introductory part ends.

We will build on the one-dimensional gradient descent concept and explain the gradient

descent used in machine learning.

In addition we’ll apply what we’ve learned about linear models and loss functions.

It will all fall into place, promise.

Let’s consider the linear model we have discussed so far.

The inputs x times the weights w plus the biases b are equal to the outputs y.

Each output y_i can be represented using the linear model equation, where the input is the corresponding

x_i; the weights and the bias remain unchanged.

Using our apartment size and price example, y_i would be the price of a single apartment, and the corresponding

x_i would be the information we have about this apartment.

In essence, we are taking a single observation.

Therefore, the output y_i is a scalar and is equal to the corresponding x_i times w plus the bias b.

Naturally, we are also interested in the target:

t_i is the target to which we will compare the output y_i.
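As a quick illustration, the single-observation model y_i = x_i * w + b can be written in a few lines of NumPy. The numbers here (apartment features, weights, bias, target) are made up for the sketch, not figures from the lesson:

```python
import numpy as np

# Hypothetical single observation: an apartment described by two features
# (size in square meters, distance to the city center in km).
x_i = np.array([85.0, 3.2])

# Hypothetical weights (one per feature) and bias.
w = np.array([1500.0, -2000.0])
b = 40000.0

# Linear model for one observation: y_i = x_i . w + b, a scalar.
y_i = np.dot(x_i, w) + b

# The target t_i is the true price we will compare y_i against.
t_i = 170000.0
print(y_i)
```

For a single observation the output is just a dot product plus the bias; with many observations, x_i becomes a row of an input matrix.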

All right, time to pick the loss function we'll use.

Usually we denote the loss function with L, and in brackets we put the outputs and the targets, as the

loss function depends on these arguments.

L is for loss, but we can have C for cost, E for error, and so on. Depending on the framework you are

using, notations may differ, but they carry the same meaning.

OK, since we've only discussed two types of loss functions, the L2-norm loss and the cross-entropy,

our choice is obviously limited to them.

We will look into a regression example.

So let's take the L2-norm loss and augment it a bit by dividing it by 2.

This is conventional and we will see why in just a minute.

Division by the constant 2 does not change the nature of the loss function, as it is still lower

for better predictions, so the machine learning algorithm will not be affected.

We emphasized this in the objective function lecture:

every function holding the general property of being lower for higher accuracy is a loss function.

Division by some constant changes nothing.
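A minimal sketch of this halved L2-norm loss, with hypothetical output and target arrays:

```python
import numpy as np

def l2_loss(y, t):
    """Halved L2-norm loss: L(y, t) = 1/2 * sum over i of (y_i - t_i)^2."""
    return 0.5 * np.sum((y - t) ** 2)

# Hypothetical outputs and targets for two observations.
y = np.array([2.0, 3.0])
t = np.array([1.0, 1.0])

print(l2_loss(y, t))  # 0.5 * (1 + 4) = 2.5
```

Whether or not we include the 1/2, a better prediction still gives a lower value, so the minimizer is unchanged.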

Make sure you remember what the gradient is, and let's start working in the multidimensional space. To

perform gradient descent we need weights and biases, which will be updated on each step.

Remember, the one-dimensional update rule x_{i+1} = x_i - eta * f'(x_i) becomes

w_{i+1} = w_i - eta * (gradient of the loss function with respect to w) for the

weights, and b_{i+1} = b_i - eta * (gradient of the loss function with respect to b)

for the biases. It is basically the same,

but for a matrix w and a vector b instead of a single number x. OK, we want to minimize the loss function by varying

the weights and the biases.

This means we are trying to optimize the loss function with respect to w and b.

Mathematically, it looks like this:

the gradient with respect to w of the loss function is equal to the sum over i of the gradient of 1/2 times

(y_i - t_i) squared with respect to w.

From the linear model, y_i is equal to x_i times w plus the bias,

where w and x are matrices.

And this is why we've applied bold formatting.

So let's plug that into the formula. Carrying out the differentiation, we obtain the sum over i of x_i times (y_i - t_i).

Please take a more detailed look in the course notes.

It is useful to combine y_i - t_i into a new variable, delta_i; delta is often

used to measure differences.

This notation will come in handy when we start coding in Python and when we start dealing with deeper

neural networks. The final result becomes the sum over i of x_i times delta_i.

So we calculate that expression for each observation and then sum them all. OK, and logically, the gradient

of the loss function with respect to the bias is the sum over i of delta_i.

Notice that the one half we introduced cancelled out the 2 we obtained when differentiating the squared term.

That's why we included it: to get a neater result.
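The two gradient formulas above can be sketched in a few lines of NumPy. The data here is a hypothetical toy set of three observations with two features each:

```python
import numpy as np

# Hypothetical toy data: 3 observations, 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [0.0, 4.0]])
t = np.array([5.0, 4.0, 9.0])  # targets
w = np.array([1.0, 2.0])       # current weights
b = 0.0                        # current bias

y = X @ w + b      # outputs y_i for every observation at once
delta = y - t      # delta_i = y_i - t_i

# Gradients of L = 1/2 * sum((y_i - t_i)^2):
grad_w = X.T @ delta   # sum over i of x_i * delta_i
grad_b = delta.sum()   # sum over i of delta_i

print(grad_w, grad_b)
```

The matrix product X.T @ delta performs the "multiply each x_i by its delta_i and sum over i" step in one operation, which is exactly how it is usually written when coding this in Python.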

Finally, let's go back to our update rule.

We said the generalized rule is w_{i+1} = w_i - eta * (gradient of the loss function

with respect to w). By replacing the gradient with what we found here, we obtain

w_{i+1} = w_i - eta * (sum over i of x_i times delta_i), and, logically, the update rule for

the bias is b_{i+1} = b_i - eta * (sum over i of delta_i).


All right, this was the generalized gradient descent for a linear model.

We can use it to minimize the loss function and train our model, enabling it to produce valuable insights

from our data.
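Putting the update rules together, here is a minimal gradient-descent training loop for the linear model. This is a sketch: the learning rate, step count, and synthetic data are my own choices for illustration, not values from the lesson.

```python
import numpy as np

def train(X, t, eta=0.01, steps=1000):
    """Gradient descent for a linear model y = Xw + b with halved L2 loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        delta = X @ w + b - t       # delta_i = y_i - t_i for every observation
        w = w - eta * (X.T @ delta) # w <- w - eta * sum(x_i * delta_i)
        b = b - eta * delta.sum()   # b <- b - eta * sum(delta_i)
    return w, b

# Synthetic check: data generated from hypothetical true parameters
# w = [2, -1], b = 3, which the loop should recover.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = X @ np.array([2.0, -1.0]) + 3.0

w, b = train(X, t)
print(w, b)
```

Since the gradient sums over all observations, the learning rate has to shrink as the dataset grows; averaging instead of summing (dividing by the number of observations) is a common way to make eta less sensitive to dataset size.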

This is all we promised you at the beginning of this section, and we've delivered: maybe a bit more, but

nothing less.

Okay great.

We are in good shape to create our first machine learning algorithm in Python.

See you in our next lesson.
