N-parameter gradient descent
If we want to create working models that can be easily adapted to different problems, we must understand
the drivers of a machine learning algorithm.
That is why we've covered several theoretical steps, and this is where the introductory part ends.
We will build on the one-dimensional gradient descent concept and explain the gradient
descent used in machine learning.
In addition, we'll apply what we've learned about linear models and loss functions.
It will all fall into place, promise.
Let’s consider the linear model we have discussed so far.
The inputs x times the weights w, plus the biases b, are equal to the outputs.
Each output y_i can be represented using the linear model equation, where the input is just the corresponding
x_i; the weights and the bias remain unchanged.
Using our apartment size and price example, y_i would be the price of a single apartment, and the corresponding
x_i would be the information we have about this apartment.
In essence, we are taking a single observation.
Therefore, the output y_i is a scalar and is equal to the corresponding x_i times w, plus the bias.
Naturally, we are also interested in a target for this output.
So, t_i: this will be the target to which we will compare the output y_i.
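To make the single-observation setup concrete, here is a minimal NumPy sketch. The feature values, weights, bias, and target below are made-up numbers for illustration, not figures from the course.

```python
import numpy as np

# One observation from the apartment example (made-up numbers):
# x_i holds the features of a single apartment, e.g. size and floor.
x_i = np.array([74.0, 5.0])   # features of one apartment
w = np.array([1.1, 0.4])      # weights, one per feature
b = 2.5                       # bias

# The linear model for a single observation: y_i = x_i . w + b (a scalar).
y_i = x_i @ w + b

# t_i is the target (the actual price) we will compare y_i against.
t_i = 85.0
print(y_i, t_i - y_i)
```

The output y_i is a single number, and t_i − y_i is the difference the loss function will penalize.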
All right, time to pick the loss function we'll use.
Usually we denote the loss function with L, and in brackets we put the outputs and the targets, as the
loss function depends on these arguments.
L is for loss, but we can have C for cost, E for error, and so on; depending on the framework you are
using, notations could differ, but they carry the same meaning.
OK, since we've only discussed two types of loss functions, the L2-norm loss and the cross-entropy,
our choice is obviously limited to them.
We will look into a regression example.
So let's take the L2-norm loss and augment it a bit by dividing it by 2.
This is conventional, and we will see why in just a minute.
A division by the constant 2 does not change the nature of the loss function, as it is still lower
for better predictions, so the machine learning algorithm will not be affected.
We emphasized this in the objective function lecture:
every function holding the general property of being lower for higher accuracy is a loss function.
Division by some constant changes nothing.
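As a sketch, the halved L2-norm loss takes a couple of lines of NumPy; the outputs and targets below are illustrative values, not course data.

```python
import numpy as np

def halved_l2_loss(y, t):
    """Halved L2-norm loss: L = 1/2 * sum((y - t)**2).

    The 1/2 is purely conventional; it will cancel the 2 that
    appears when we differentiate the squared term."""
    return 0.5 * np.sum((y - t) ** 2)

y = np.array([2.0, 3.0, 5.0])   # model outputs (illustrative)
t = np.array([2.5, 3.0, 4.0])   # targets (illustrative)
print(halved_l2_loss(y, t))
```

Dividing by 2 rescales every loss value by the same constant, so the minimizer, and hence the trained model, is unchanged.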
Make sure you remember what the gradient is, and let's start working in the multidimensional space. To
perform the gradient descent, we need weights and biases, which will be updated on each step.
Remember well the update rule: x_{i+1} = x_i - eta * f'(x_i). It becomes
w_{i+1} = w_i - eta * the gradient of the loss function with respect to w for the
weights, and b_{i+1} = b_i - eta * the gradient of the loss function with respect to b
for the biases. It is basically the same,
but with a matrix w and a vector b instead of a single number x. OK, we want to minimize the loss function by varying
the weights and the biases.
This means we are trying to optimize the loss function with respect to w and b.
Mathematically, it looks like this.
The gradient with respect to w of the loss function is equal to the sum over the observations of the gradient of 1/2 times
(y_i - t_i) squared with respect to w.
From the linear model, y_i is equal to x_i times w, plus the bias,
where w and x are matrices.
And this is why we've applied bold formatting.
So let's plug that into the formula. Running the operations, we obtain the sum of x_i times (y_i - t_i).
Please take a more detailed look in the course notes.
It is useful to combine y_i - t_i into a new variable called delta_i; delta is often
used to measure differences.
This notion will come in handy when we start coding in Python and when we start dealing with deeper
neural networks. The final result becomes the sum over i of x_i times delta_i.
So we calculate that expression for each observation and then sum them all. OK, and logically, the gradient
of the loss function with respect to the bias is the sum of delta_i.
Notice that the one half we introduced cancelled out the 2 we obtained when differentiating the squared term.
That's why we included it: to get a neater result.
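The two gradients derived above can be sketched in NumPy as follows. The batch of observations and targets here is a made-up toy example, and `delta` stands for the y_i − t_i differences.

```python
import numpy as np

# Toy batch: 4 observations, 2 features each (illustrative values).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [0.5, 1.5]])
t = np.array([3.0, 2.0, 4.0, 2.5])   # targets
w = np.zeros(2)                       # weights (initialized to zero)
b = 0.0                               # bias

y = X @ w + b          # outputs for every observation at once
delta = y - t          # delta_i = y_i - t_i

# Gradients of the halved L2-norm loss:
grad_w = X.T @ delta   # sum over i of x_i * delta_i
grad_b = delta.sum()   # sum over i of delta_i
print(grad_w, grad_b)  # grad_w = [-20.25, -14.75], grad_b = -11.5
```

Writing the sum over observations as a matrix product, X.T @ delta, is exactly the vectorized form we will rely on when coding in Python.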
Finally, let's go back to our update rule.
We said the generalized rule is w_{i+1} = w_i - eta * the gradient of the loss function
with respect to w. By replacing the gradient with what we found here, we obtain
w_{i+1} = w_i - eta * the sum over i of x_i times delta_i. Logically, the update rule for
the bias is b_{i+1} = b_i - eta * the sum over i of delta_i.
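Putting the update rules together, a minimal gradient-descent loop might look like the sketch below. The synthetic data, learning rate, and iteration count are assumptions chosen for illustration, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from known parameters, so we can check the fit.
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
true_b = 0.5
t = X @ true_w + true_b          # targets from the "true" linear model

w = np.zeros(2)                  # initial weights
b = 0.0                          # initial bias
eta = 0.005                      # learning rate (a small, assumed value)

for _ in range(500):
    y = X @ w + b                # outputs for all observations
    delta = y - t                # delta_i = y_i - t_i
    w = w - eta * (X.T @ delta)  # w <- w - eta * sum_i x_i * delta_i
    b = b - eta * delta.sum()    # b <- b - eta * sum_i delta_i

print(w, b)                      # should recover roughly [2.0, -1.0] and 0.5
```

Because the data is noiseless and the loss is quadratic, the loop drives w and b essentially all the way to the generating parameters; with real data it would converge to the best-fitting values instead.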
All right, this was the generalized gradient descent for a linear model.
We can use it to minimize the loss function and train our model, enabling it to produce valuable insights
from our data.
This is all we promised you at the beginning of this section, and we've delivered, maybe even a bit more.
We are in good shape to create our first machine learning algorithm in Python.
See you in our next lesson.