Exploring the dataset and identifying predictors
I am super happy to have the opportunity to take you through the business case.
In a way, this is the peak of the course, but it is simply the application of everything you already know.
Here’s the problem.
You were given data from an audiobook app.
Logically, it relates to the audio versions of books only. Each customer in the database has made a purchase at least once.
That’s the condition to be included.
We want to create a machine learning algorithm based on our data that can predict if a customer will buy again from the audiobook company.
The main idea is that the company shouldn’t spend its advertising budget targeting individuals who are unlikely to come back.
If we can focus our efforts on customers likely to convert again, we can obtain improved sales and profitability.
So our model will take several metrics and we’ll try to predict human behavior.
A side effect of our study is that the model will show us which are the most important metrics for a customer to come back.
Having the data and the technology to identify prospective customers creates a lot of value and growth.
It is one of the better applications of data science.
All right, here’s our data.
This CSV file is included in the lecture resources.
When you download it, the column headers won’t be included, as we want no text in the data when training the model.
Each row represents a person. Let’s go through the columns and see why each one of them could be of use.
First, we have customer ID.
An ID is like a name. Whether the ID is 1, 2, 3 or John 1, John 2, John 3 makes no difference, as no information is contained in the ID.
We will skip it in our algorithm. OK.
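The two points above (a file with no header row, and an ID column that is dropped) can be sketched as follows. This is only an illustration: the three sample rows, their values, the column count, and the helper name `load_inputs` are assumptions, not the actual lecture file.

```python
import csv
import io

# Hypothetical three-row sample standing in for the real CSV file.
# The values are made up; only the shape (ID first, no header) matters.
sample_csv = io.StringIO(
    "101,1620,1620,19.73,19.73,1\n"
    "102,2160,1080,10.26,5.13,0\n"
    "103,648,648,5.33,5.33,0\n"
)

def load_inputs(handle):
    """Read a headerless CSV and drop the first column (the customer ID)."""
    rows = []
    for record in csv.reader(handle):
        # record[0] is the ID; it carries no information, so it is skipped.
        rows.append([float(value) for value in record[1:]])
    return rows

inputs = load_inputs(sample_csv)
print(len(inputs), "customers,", len(inputs[0]), "input columns each")
```

Because there is no header row, every line can be converted to numbers directly, which is the reason given for leaving the headers out.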
Next, we have book length. The overall book length is the sum of the lengths of all purchases.
We also have the average book length. The average book length is basically the sum divided by the number of purchases.
So if somebody has bought a single audiobook, the average length and the overall length for this person will be equal. All right.
There is no need to include the number of purchases, as it is contained in the two variables we just described.
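To see why the number of purchases is redundant, note that it can be recovered by dividing the overall length by the average length. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def purchases_from_lengths(overall_length, average_length):
    """Recover the purchase count from the two book-length variables.

    average = overall / count, so count = overall / average.
    round() guards against small floating-point errors.
    """
    return round(overall_length / average_length)

# One book: overall and average length are equal, so the count is 1.
print(purchases_from_lengths(1620, 1620))
# Two books totalling 2160 with an average of 1080.
print(purchases_from_lengths(2160, 1080))
```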
Then we have the overall price paid and the average price paid.
These variables were constructed in the same way as those for book length.
The price is in dollars although it makes no difference to the algorithm.
By the way, the price variable is almost always a good predictor of behavior.
The next variable is review.
Review is a boolean.
It shows if the customer left a review.
This is a metric that shows engagement with the platform.
Our assumption is that people who leave reviews are more likely to convert again.
Then we have review out of 10.
This is a different variable.
It measures the review of a customer on a scale from 1 to 10.
Pay attention here, as we will show you the first preprocessing trick. Here it comes.
Logically, we will only have a value for people who left a review.
By examining the table, we quickly see most people leave no review, as in most marketplaces.
That’s bad for our data set and bad in general.
Side note: if you like our course, don’t forget to leave a review. Just saying.
We have decided to leave the reviews posted on the platform as they are and substitute all missing values with the average review.
The average is 8.91.
For our machine learning algorithm, 8.91 would mean the status quo. A review bigger than 8.91 would indicate above-average feelings, while a review less than 8.91 would indicate below-average feelings.
Notice I am saying feelings.
Review is yet another variable that is an average.
A customer may have bought two or three books on the platform.
The average review she left indicates her feelings towards the content on the medium or, better, the medium as a whole.
An average of 2 out of 10 indicates the person did not have a pleasant experience with audiobooks, especially when the average is 8.91.
It is logical that such a customer is not likely to buy again.
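The substitution of missing reviews with the average can be sketched like this. The list of reviews is invented for illustration, `None` is an assumed marker for "no review left", and `fill_missing_reviews` is a hypothetical helper. In the lecture data the average is 8.91; here it is simply whatever the sample produces.

```python
from statistics import mean

def fill_missing_reviews(reviews):
    """Replace missing reviews (None) with the mean of the observed ones.

    Observed reviews are left untouched. Raises statistics.StatisticsError
    if nobody left a review, since no average exists in that case.
    """
    observed = [r for r in reviews if r is not None]
    average = mean(observed)
    return [average if r is None else r for r in reviews]

# Hypothetical sample: two of the five customers left no review.
reviews = [10, None, 8, None, 9]
filled = fill_missing_reviews(reviews)
print(filled)
```

After this step, a customer with no review sits exactly at the average, which is the status quo reading described above.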
All right, done here.
Then we have total minutes listened, which is a measure of engagement.
Next to it, we have completion. Completion is the total minutes listened divided by the total length of books a person has purchased, assuming people don’t re-listen to books.
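As a small sketch of the completion variable, assuming both quantities are expressed in minutes (the unit of the length column and the function name are assumptions):

```python
def completion(minutes_listened, total_length_minutes):
    """Fraction of purchased content that was listened to.

    Stays between 0 and 1 only under the stated assumption that
    people do not re-listen to books.
    """
    if total_length_minutes <= 0:
        raise ValueError("total length must be positive")
    return minutes_listened / total_length_minutes

# A customer who listened to 810 of 1620 purchased minutes.
print(completion(810, 1620))
```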
Both variables are self-explanatory.
The next variable is support requests. It is numerical and shows the total number of support requests the person has opened.
Support is anything from a forgotten password to assistance on using the platform.
Once more, this is a measure of engagement.
It may turn out that the more support a person needed, the more he or she got fed up with the platform and abandoned it.
Or he or she likes it so much that, by using it, he or she stumbles upon different issues, unlike someone who never opens the app.
Finally, we have a variable measuring the difference between the last time a person interacted with the platform and their first purchase date.
That’s yet another measure of engagement.
The bigger the difference, the better.
If a person engages regularly with a platform, this difference will be bigger.
Thus the customer is likely to convert again.
If the value of this variable is zero, we are sure the customer has never accessed what he or she has bought, or perhaps did so on the first day only.
So it is unlikely he or she will convert again.
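This last variable can be sketched as a simple date difference. Measuring it in days is an assumption, as the lecture does not state the unit, and the function name and dates are hypothetical.

```python
from datetime import date

def days_between_first_purchase_and_last_visit(first_purchase, last_visit):
    """Difference, in days, between the last interaction and the first purchase."""
    return (last_visit - first_purchase).days

# Never came back after the day of purchase: the value is zero.
print(days_between_first_purchase_and_last_visit(date(2020, 1, 1), date(2020, 1, 1)))
# Still active two months later: the value is larger.
print(days_between_first_purchase_and_last_visit(date(2020, 1, 1), date(2020, 3, 1)))
```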
These are our inputs.
It is always necessary to ask how the data was gathered.
This piece of information is valuable for any analysis.
The data was gathered from the audiobook app.
As we said, it represents two years’ worth of engagement.
Now, we are doing supervised learning, so we need targets, right?
The targets will be a Boolean: 1 if a person converted and 0 if he or she didn’t.
But what does it mean to convert?
That’s the big question here.
We have taken an extra six months of data after the two-year period to check if a user converted.
So we took two years and six months of data. The first two years are contained in the data set you have.
The next six months will show us if a person converted, in other words, if he or she bought another book.
If that happened, we can count them as a conversion and the target will be 1.
Otherwise, it is 0.
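The way the targets are built can be sketched as below. The customer IDs and the set of buyers are invented for illustration, and `make_targets` is a hypothetical helper, not the course’s own code.

```python
def make_targets(customer_ids, buyers_in_following_six_months):
    """Return 1 for customers who bought again in the follow-up window, else 0."""
    buyers = set(buyers_in_following_six_months)
    return [1 if customer_id in buyers else 0 for customer_id in customer_ids]

# Customers present in the two-year data set (hypothetical IDs).
customer_ids = [101, 102, 103, 104]
# IDs seen purchasing during the extra six months (hypothetical).
buyers = [102, 104]

targets = make_targets(customer_ids, buyers)
print(targets)
```

The targets list lines up row by row with the inputs, so each customer’s two-year metrics are paired with what that customer did in the following six months.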
That’s how we created the targets column. Six months sounded reasonable enough for us.
If one buys no new audiobook in that period, chances are they’ve gone to a competitor or didn’t like the audiobook way of digesting information. OK.
That said, our task is simple: create a machine learning algorithm that can predict if a customer will buy again.
This is a classification problem with two classes, won’t buy and will buy, represented by zeros and ones.
In the next lesson, we will outline the solution.
You can try solving the problem on your own or check out the next few lessons.
Thanks for watching.