Gmane Introduction

Course: Capstone - Retrieving, Processing, and Visualizing Data with Python / Chapter: Spidering and Modeling Email Data / Lesson 1

So now we're going to do our last visualization, and it's interesting that we're kind of coming full circle: we're back to email. Instead of a few thousand lines of email, we're going to do a gigabyte of email, and you're going to spider a gigabyte. Now actually, if you look at the README in gmane.zip, it tells you how you can get a head start by getting the first 675 megabytes with one statement, and then you can fill in the rest.

The idea is that we have an API out there that will give us a mailing list, given a URL that we just hit over and over again, changing this little number. We pull in this raw data, then we have an analysis and cleanup phase, and then we visualize the data in a couple of different ways.

Now, this is a lot of data, about a gigabyte, and it originally came from a place called www.gmane.org. We have had problems because when too many students start using www.gmane.org to pull the data, we've actually kind of hurt their servers. They don't have rate limits. They're nice folks; if we hurt them, they're hurt, and they're not defending themselves. Unlike Twitter and Google, www.gmane.org is just some nice people who are doing this, so don't be bogus, don't be uncool.

I've got this http://mbox.dr-chuck.net that has this data, and it's on super fast servers that are cached all around the world using this thing called Cloudflare. So they're super awesome, and you can beat the heck out of dr-chuck.net. I guarantee you're not going to hurt it. You can't take it down. Good luck trying to take it down, because it is a beast. So make sure that when you're testing, you use dr-chuck.net. Don't use www.gmane.org. Even though it would work, please don't do that. I've got my own copy, and okay, enough said.

Okay, so this is basically the data flow that's going to happen. We go to this dr-chuck.net, which has got all the data. It's got an API, and the messages are basically numbered in sequence: message one, message two, message three. So we can store message one, message two, message three, and we know how much we've retrieved. When this program starts up, it asks how much is in the database, goes down, down, down, down, and says, okay, number four. So then it calls the API to get message number four, brings it back, and puts it in. It calls the API for message number five, six, seven, eight, nine, 100, 150, 200, 300, crash. Again, this is a slow but restartable process. So then you start it back up, and it's like, we're at 51, so we go 51, 52, 53, 54. If you were really going to spider all of this, well, I think when I spidered it the first time, it took me three days to get all of it.

And so it's kind of fun, right? Unless of course you're using a network connection you're paying for. Do not do that, because you're going to spend a lot of money on your network connection. If you're on an unlimited network, or you're at a university that's got a good connection, then have fun. Run it for hours and watch what it does. It just grinds and grinds and grinds.

Now, it turns out that this data is a little funky, and it's all talked about in the README. This is almost ten years of data from the Sakai developer list, and people even changed their email addresses. So there's this little bit of extra patch-up code called gmodel, which has some more data that configures it, and it reads all this stuff and cleans up the data. So the spidered database ends up being really large, and if you recall from the database chapter, it's not well normalized. It's unindexed, it's very raw, and it's only there for spidering and making sure we can restart our spider.
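The restartable behavior described above can be sketched roughly as follows. This is a minimal illustration, not the course's actual spider: the `Messages` table, the function names, and the `fetch` callable are all assumptions made for the example. A real `fetch` would request a URL containing the message number; the exact URL pattern is not given in the transcript, so a stand-in is used here.

```python
import sqlite3

def next_message_id(conn):
    """Return the first message number not yet stored (1 if the table is empty)."""
    row = conn.execute("SELECT MAX(id) FROM Messages").fetchone()
    return (row[0] or 0) + 1

def spider(conn, fetch, how_many):
    """Fetch up to how_many messages, resuming after the last stored one.

    fetch is any callable that takes a message number and returns its text,
    or None when there is nothing more to retrieve (hypothetical interface).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Messages (id INTEGER PRIMARY KEY, raw TEXT)"
    )
    start = next_message_id(conn)
    stored = 0
    for number in range(start, start + how_many):
        text = fetch(number)
        if text is None:
            break
        conn.execute("INSERT INTO Messages (id, raw) VALUES (?, ?)", (number, text))
        conn.commit()  # commit per message so a crash keeps everything stored so far
        stored += 1
    return stored

def fake_fetch(number):
    # Stand-in for a real HTTP request; no network is involved.
    return f"message {number}"

conn = sqlite3.connect(":memory:")
spider(conn, fake_fetch, 3)  # stores messages 1 through 3
spider(conn, fake_fetch, 2)  # picks up at message 4 and stores 4 and 5
```

Because the starting point is always read back from the database, stopping the program (or crashing) and running it again simply continues from the next unretrieved message.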
If you want to actually make sense of this data, we clean it up by running a process that reads the raw data completely, wipes out the cleaned-up database, and then writes brand new content. If you look at the sizes, the raw one is really large and the cleaned-up one is really small. If you have the whole thing, it can take minutes to read the raw data, depending on how fast your computer is, because it's so big.

This is a good example of normalized versus non-normalized data. Let's just say it takes two minutes to write the cleaned-up version, because it's reading the raw data slowly, because the raw data is not normalized. The cleaned-up version is nicely normalized. It's using indexes and foreign keys and primary keys and all that stuff we taught you in the database chapter. So it's small when you look at the size of the file. It's got roughly the same information, but it's represented in a much more efficient way. So then this produces content.sqlite, and the rest of the programs read content.sqlite, because this is the cleanup phase.

Now, what you can do is run the spider for a while, then stop it, then run the cleanup, and that's fine, because every time the cleanup runs, it throws the old output away and rebuilds it. Maybe you look at some stuff and say, I want to run some more. That's okay, because you can start the spider back up, and as soon as you're done with however far you went, you stop it and run the cleanup again. It reads the whole thing and rebuilds the whole thing.

Once the data has been processed the right way, you run gbasic.py, and it dumps out some stuff, but it's really just doing database analysis. If you want to visualize it with a line, you run gline.py, and again that loops through all the data and produces the data in gline.js. Then you can visualize it with an HTML file using d3.js. And if you want to make a word cloud, you run gword, which loops through all that data and produces some JavaScript, which is then combined with some more HTML to produce a nice word cloud.

The README tells you all of this and gets you through it: what to run, how to run it, and roughly how long it's going to take, so you can work your way through all of these things.

So in summary, with these three examples, we're really writing somewhat more sophisticated applications. I give you most of the source code for these applications, but you can see what a more sophisticated application looks like, and based on these, you can probably build your own data pulling and maybe even a data visualization technique and adapt some of this stuff. So, thanks for listening, and we'll see you on the net.
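As a rough illustration of the cleanup phase described in this lesson, the sketch below reads a raw table, throws away any previous cleaned-up tables, and rebuilds them in normalized form. It is not the course's actual cleanup code: the table names (`Raw`, `Senders`, `Messages`), the `rebuild` function, the `aliases` mapping for changed email addresses, and the sample addresses are all assumptions made for the example.

```python
import sqlite3

def rebuild(raw, clean, aliases):
    """Throw away the cleaned tables and rebuild them from the raw messages."""
    clean.executescript("""
        DROP TABLE IF EXISTS Messages;
        DROP TABLE IF EXISTS Senders;
        CREATE TABLE Senders (id INTEGER PRIMARY KEY, email TEXT UNIQUE);
        CREATE TABLE Messages (
            id INTEGER PRIMARY KEY,
            sender_id INTEGER REFERENCES Senders(id),
            subject TEXT
        );
    """)
    sender_ids = {}  # email address -> primary key in Senders
    for msg_id, email, subject in raw.execute(
        "SELECT id, sender, subject FROM Raw ORDER BY id"
    ):
        email = email.strip().lower()
        email = aliases.get(email, email)  # map an old address to its current one
        if email not in sender_ids:
            cur = clean.execute("INSERT INTO Senders (email) VALUES (?)", (email,))
            sender_ids[email] = cur.lastrowid
        # Store the sender once and refer to it by key instead of repeating the text.
        clean.execute(
            "INSERT INTO Messages (id, sender_id, subject) VALUES (?, ?, ?)",
            (msg_id, sender_ids[email], subject),
        )
    clean.commit()

# Hypothetical raw data standing in for the spidered messages.
raw = sqlite3.connect(":memory:")
raw.execute("CREATE TABLE Raw (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT)")
raw.executemany(
    "INSERT INTO Raw VALUES (?, ?, ?)",
    [
        (1, "Ann@old.example.com", "Hello"),
        (2, "bob@example.com", "Re: Hello"),
        (3, "ann@new.example.com", "Re: Hello"),
    ],
)
aliases = {"ann@old.example.com": "ann@new.example.com"}

clean = sqlite3.connect(":memory:")
rebuild(raw, clean, aliases)
rebuild(raw, clean, aliases)  # safe to rerun: the output is rebuilt from scratch
```

Dropping and recreating the output on every run is what makes it safe to alternate between spidering more messages and rerunning the cleanup.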
