Quartiles and Boxplots

Brief description

  • Reading time: 13 minutes
  • Level: very hard

English text of the lesson

Quartiles and Boxplots

Quartiles and Boxplots. First, we need to discuss quartiles: what exactly are quartiles in a distribution? Imagine putting a list of numbers in order from lowest to highest. Of course, the middle number in that list would be the median; we already talked about that. The quartiles are three numbers that divide this list into four smaller, equally sized lists.

The first of these numbers, the first quartile Q1, divides the bottom 25% from the rest of the list, so it marks off the lower quarter from everyone else. That’s the first quartile. You might suppose that the second number would be called the second quartile, but think about it. Since this number divides the lower 50% from the upper 50%, it is exactly in the middle of the list: that’s the median.

The second number in the set is the median, so the median is one of the quartiles. Technically it is the second quartile, but nobody calls it that. The third number is the third quartile, which divides the lower 75% from the upper 25%. These three numbers divide any list into four equal subsets.

So from the min to the first quartile is 25%, from the first quartile to the median is 25%. From the median to the third quartile is 25%. And from the third quartile to the maximum is 25%. Now how do we calculate them? Well, we have already discussed how to find the median.

If N is odd, then the median is the middle number on the list. If N is even, then the median is the average of the two middle numbers. In other words, in that case it need not be equal to a specific number on the list. Suppose we have a list of numbers and we found the median. The median divides the whole list into two smaller lists, the upper list and the lower list.

If the median is a number in the set, it is not included on either of these lists. So, imagine where the median is. You might draw a circle around it, or mark it off in some way. Everything above the median, that’s the upper list. Everything below the median, that’s the lower list. Q1 is merely the median of the lower list, and Q3 is the median of the upper list.
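The procedure just described can be sketched in a few lines of Python. This is only an illustration of the convention explained here, not an official implementation; the function name `gre_quartiles` and the sample list are made up for the example.

```python
from statistics import median

def gre_quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    When the list has an odd length, the median itself is left out of
    both the lower list and the upper list. Assumes at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # everything below the median
    upper = data[half + (n % 2):]  # everything above the median
    return median(lower), median(data), median(upper)

# Hypothetical sample list with 7 numbers: the median, 7, is excluded
# from both halves, leaving [1, 3, 5] and [9, 11, 13].
sample = [1, 3, 5, 7, 9, 11, 13]
print(gre_quartiles(sample))  # (3, 7, 11)
```

The slice `data[half + (n % 2):]` skips one extra position only when the length is odd, which is what leaves the median out of the upper list.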

So, that’s the convention we’re gonna be following. Now, I’ll mention there are other conventions. There are other software packages that do things differently, calculators sometimes do it differently, other textbooks will explain it differently. I am explaining the convention used on the GRE. So you may get confused if you look somewhere else.
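As a rough illustration of how conventions can differ, the sketch below compares the median-of-halves rule with `statistics.quantiles` from Python's standard library, which interpolates between data points. The helper `gre_quartiles` is a hypothetical name, and the ten numbers are the small list worked through later in this lesson.

```python
from statistics import median, quantiles

def gre_quartiles(values):
    """(Q1, median, Q3) with the median excluded from both half lists."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + (n % 2):]
    return median(lower), median(data), median(upper)

data = [2, 4, 6, 7, 9, 13, 13, 13, 14, 14]

# Same data, three ways of locating the quartiles
print("median of halves:", gre_quartiles(data))
print("stdlib exclusive:", quantiles(data, n=4, method="exclusive"))
print("stdlib inclusive:", quantiles(data, n=4, method="inclusive"))
```

On a list this small the methods agree on the median but need not agree on Q1 and Q3, which is the kind of mismatch that can cause confusion when checking answers against software.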

This is the convention that the GRE follows. In a moment, I will show some numerical examples with a couple small sets. I just wanna make this point first. Quartiles exist as tools to make sense of large data sets, ultimately, population-sized data sets. For example, we might wanna know the quartiles of household income in the United States for every single household in the United States; we’re talking about tens of millions, hundreds of millions of points.

There’s something a little silly about calculating quartiles for a set that has, say, 10 or 12 members, but we do it for pedagogical purposes, so you can get a sense of what the quartiles are. So just keep that in mind. What we have here really is a tool for understanding how population-sized sets operate.

But in order to understand how it works, we’re gonna need to look at a couple of tiny sets. So it’s a bit hokey what we’re doing, but just suspend disbelief on that point so we can talk about how this works. All right. Suppose we have this set, a not very big set, and we need to find the quartiles: 2, 4, 6, 7, 9, 13, 13, 13, 14, 14.

Well, the median is the average of the two middle numbers. There are ten numbers on the list. The fifth and the sixth numbers are the two middle numbers, 9 and 13; we average them, and we get a median of 11. All right. So imagine we draw a line between the 9 and the 13. That’s where the median would be, and that divides the set into an upper list and a lower list.

The lower list is the five lowest numbers, 2, 4, 6, 7, 9. The median is clearly the middle of that list, 6; that’s Q1. The upper list is 13, 13, 13, 14, 14. The median, the middle number, is 13; that has to be Q3. Now a totally new set; we’re going to find the quartiles again. The median number is 11.

This number is excluded from both the upper list and the lower list. So what we have here is a total of 13 numbers. We exclude the 11 in the middle, and then the lower list is all the numbers below 11. Its median is 4; technically it’s the average of the two middle 4s. For the upper list, the median is 15.

Technically it’s an average of those two middle 15s. And so we have, Q1 is 4, median is 11, and Q3 is 15. Once again, for a relatively small set, one in which we can see all the numbers at a single glance, the quartiles don’t serve much of a purpose. The purpose of quartiles is to make sense of larger sets of numbers. Once again, there are many other ways to define Q1 and Q3.

But what I presented here is the convention that the GRE follows. If you look in other sources, they may define it differently, and you may be confused on these points. So just keep that in mind, there are other ways to define this, I’m showing you the way it’s used on the GRE. Here’s a practice question.

Pause the video and then we’ll talk about this. A set of 1203 alumni from a set of colleges take the GMAT, and each gets a score. And as you may know, the score on the GMAT goes from 200 to 800. If the first quartile of these scores is 510 (incidentally, not a very impressive GMAT score there), and only one student got exactly that grade, then how many of the students scored higher than that?

Well, let’s think about this. There are 1203 alumni in this population. The whole set is odd, so the median is on that list. And so that means if we put a circle around the median and exclude it, we’d be left with 1,202 numbers. Divide that number by 2: 601.

That’s each half list, so there are 601 people on the lower list and 601 people on the upper list. Now let’s look at the lower list. It’s also odd, so the median is on this list. So we remove the median of that lower list, and we’re left with 600, and then we can divide that easily by 2: that’s 300 above, 300 below.

So that means that there are 300 below the first quartile, and 300 between the first quartile and the median. So let’s think about this now. There are 300 below the first quartile, then there’s the single first quartile score. Then there’s 300 above the first quartile, then there’s the single median score, and then there’s the 601 on the upper list.
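As a rough check of this bucket breakdown, a short Python sketch can split the count the same way and tally everything above Q1. The function name `count_above_q1` is hypothetical, and the sketch assumes, as in this question, that both the whole list and each half list have an odd length and that only one score equals Q1.

```python
def count_above_q1(n):
    """Count scores strictly above Q1 for a list of n scores.

    Assumes n is odd and each half list is odd too, so the median and
    Q1 are single scores on the list (as in the 1203-score question).
    """
    half = (n - 1) // 2                 # size of each half list, median removed
    if n % 2 == 0 or half % 2 == 0:
        raise ValueError("sketch only covers odd n with odd half lists")
    below_q1 = (half - 1) // 2          # lower list with Q1 removed, split in two
    between_q1_and_median = half - 1 - below_q1
    single_median = 1
    upper_list = half
    return between_q1_and_median + single_median + upper_list

print(count_above_q1(1203))  # 902
```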

Well, of these five buckets, who is above Q1? Well, of course, the last three: the 300 between Q1 and the median, the single median score, and the 601 on the upper list. So if we add up the 300 plus the single median plus 601, they add up to 902; 902 scores are above that score. Often the median, Q1, and Q3 are cited with the max and the min, to give a full sense of the distribution.

These five numbers are sometimes called the five-number summary of a distribution, that’s a term you don’t need to know for the test. But sometimes you’ll see those five given. And they are often summed up conveniently in a graphical form, known as a boxplot. So now, we’re ready to talk about boxplots, now that we know what quartiles are.
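A minimal sketch of the five-number summary, assuming the median-of-halves convention described in this lesson, might look like the following. The function name `five_number_summary` is made up for illustration; the data is the ten-number list from the earlier example.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Q1 and Q3 are the medians of the lower and upper lists, with the
    overall median excluded when the list length is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + (n % 2):]
    return data[0], median(lower), median(data), median(upper), data[-1]

scores = [2, 4, 6, 7, 9, 13, 13, 13, 14, 14]
print(five_number_summary(scores))  # (2, 6, 11.0, 13, 14)
```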

This is the general boxplot shape. So there’s always this rectangle in the middle with a vertical line somewhere in the middle of that rectangle, and then these two kind of arms on each side that consist of just a horizontal line going to an isolated vertical line at each end. Notice that this shape contains 5 vertical segments. So here we have the shape, and I’ve marked the 5 vertical segments in red.

These represent from left to right the min, Q1, the median, Q3, and the maximum. So, in other words, the vertical segments show the position of the numbers in the 5-number summary. The boxplot is displayed over the numerical scale of a variable to demonstrate the distribution of that variable. So for example, here we have a boxplot.

We don’t actually know what this is. This might be, for example, the GMAT scores that we were just talking about, because the GMAT is something that goes from 200 to 800. So let’s just pretend that these are GMAT scores. Here, we can approximate that the minimum in the set is 400, Q1 is around 450, the median appears to be right at 500, Q3 appears to be a little over 600, say 620, and the max is just below 800, so let’s say 790.
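To make the spacing of those five markers concrete, the approximate readings can be written down and the distance between neighboring markers computed. This is only a sketch; the values are the rough readings given above, not exact data, and the variable names are arbitrary.

```python
# Approximate readings from the boxplot (estimates, not exact data)
summary = {"min": 400, "Q1": 450, "median": 500, "Q3": 620, "max": 790}

items = list(summary.items())
gaps = []
# Walk over neighboring markers: min->Q1, Q1->median, median->Q3, Q3->max
for (left_name, left_val), (right_name, right_val) in zip(items, items[1:]):
    gap = right_val - left_val
    gaps.append(gap)
    print(f"{left_name} to {right_name}: {gap} points")

print(gaps)  # [50, 50, 120, 170]
```

Each of the four intervals holds about a quarter of the scores, even though the score ranges they cover are quite different in width.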

And so that is the five-number summary of this distribution of scores. Here’s a question: if the three quartiles divide the population into four equal parts, why aren’t the gaps between the vertical lines always the same size? So what we have here, we see completely different size gaps. Between the minimum and Q1, and between Q1 and the median, those are relatively small.

From the median to Q3 is very large. And Q3 to the max is the largest. Why are they different sizes if there’s 25% of the population in each one? Well, first of all, think about it. There’s never any guarantee that the median will be the average of the max and the min, so right there, there’s no guarantee.

If you know your max and you know your min, there’s no guarantee that if you average those two, you’ll get the median; the median could be anywhere between the max and the min. More generally, the gaps from one bar to the next represent differences in score, not differences in the number of people or points. So from the minimum to Q1 is a narrow gap because the score difference from the minimum score to the score at Q1 is very small.

25% of the population is in that very narrow range of scores. Meanwhile, at the top of the distribution, the gap from Q3 to the max is gigantic; that means that the top 25% have scores that are really spread out. The bottom 25%, in fact the bottom 50%, all have scores that are very close together.

Whereas the top 25% are really, really spread out. Here’s a practice question. Pause the video, and then we’ll talk about this. Okay, at a large university, 2,800 students take Biochemistry 101. The boxplot shows the percentage grades on the first midterm. What can you say about the mean?

Well, here’s the thing, we know the median. The median is right here, around 30. The boxplot does not directly show the mean, but we can make inferences about it. Notice that we could ask the question, where are the outliers? On the left side, everything is pretty close together. But on the right side, holy smokes, it seems to go on forever.

The outliers are clearly on the right in this distribution. And so, if you remember, comparing mean to median, the mean gets pulled in the direction of outliers. We talked about this way back in the lesson on mean and median. So, with the mean pulled in the direction of outliers, that means the mean has to be pulled above the median, and therefore the mean is higher than 30, and that is the answer.
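The pull of outliers on the mean can be seen with a tiny made-up example. The grades below are hypothetical and are not the data behind the boxplot in the question; they are only chosen so that most values sit near 30 and one value sits far to the right.

```python
from statistics import mean, median

# Hypothetical percentage grades: bunched low, with one high outlier
grades = [20, 25, 28, 30, 32, 35, 95]

print("median:", median(grades))   # the middle value, 30
print("mean:", mean(grades))       # pulled upward by the 95
print(mean(grades) > median(grades))  # True
```

Removing the 95 from the list would bring the mean back down close to the median, which is the same reasoning used to place the mean above 30 in the practice question.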

In summary, the quartiles, Q1, the median, and Q3, divide the whole list into four equal lists. Q1 is the median of the lower list, and Q3 is the median of the upper list. That’s how we calculate it on the GRE. Again, other sources will do it differently, but this is the GRE convention.

The max and the min and the three quartile numbers constitute the five-number summary, and these determine the five vertical lines on a boxplot. The general shape of a boxplot is a box with a line at the median and an arm on each side. We’ll talk more about boxplots in the next video.
