Worked Example: Gmane / Mail - Retrieval (Chapter 16)
Course: Capstone: Retrieving, Processing, and Visualizing Data with Python / Chapter: Spidering and Modeling Email Data / Lesson 2
Hello everybody, welcome to Python for Everybody. We’re doing some code walkthroughs. If you want to get the source code, you can take a look at the sample code, download it, and work through it. What we’re working on now is retrieval and visualization of email data. It’s kind of ironic: we’re going to look at the same kind of email data that we started with, the developer list email data. There’s a service called Gmane that archives developer lists and various other email lists. I’ve made a copy of their data, because if all the students in my class hit their server and their API, it would crush it. So in order to be a nice guy, I put up a much more powerful server with just the data from this one list. It’s about a gigabyte of data, so be really careful if you’re paying for network.
The basic process we’re going to go through is this. First we have a spidering process that’s simple, restartable, and focused on the network problems; it pulls the data into a database, content.sqlite. This database is going to get large, about a gigabyte. Then we have a cleanup process that grinds through this data, which takes a while. It reads a mapping, which I’ll show you when we get to it, because things like people’s names have changed over all these years. It does a cleanup and makes a really nice, highly relational version of this data. And then we visualize from there. The spidering could take you several days to finish, the cleanup will take a few minutes to run, and the visualization will just take seconds. This is a multi-step process because if you ran something for two days to produce a visualization and it blew up three quarters of the way through, it wouldn’t do you any good. That’s why we break it into simple parts.
But right now we’re just going to focus on the first part, the mail retrieval, and then we’ll have another video to talk about the rest of this stuff, okay? So let’s take a look at the code. Here is gmane.py, which is the basic code, and hopefully this stuff is starting to look familiar. The thing that’s weird here is that we’re going to do some date-time parsing. There is code out there for that, but you may have to install it, and I had to write my code in a way that didn’t assume you could install the date-time parser. So if that parser isn’t there, it uses my own date-time parser. That’s what this code is; don’t worry too much about it. And of course we have to deal with the lack of certificates inside of Python.
Then we start things out, and this is really a simple table. We’ve got a messages table with a primary key, the email address, when it was sent, the subject, the headers, and the body, okay? Because we have to pick up where we left off, we select the largest primary key from the messages table and retrieve that, and then we go to the one after it. So we know what the ID is, and we pick up where we left off. We have a starting point: start is zero if the table is empty, otherwise it’s the highest number we’ve seen so far. Then we ask how many messages to retrieve. We’ve got some counters. Then we say SELECT id FROM messages WHERE id equals our next starting point. If the row is not None, that means we’ve already retrieved this particular email message, so we skip it; otherwise we keep going, we’re in good shape, and this is one that we want to retrieve. We subtract one from the number of messages left to fetch, so we know when to stop. And this is the base URL, the URL of our API, the one where I have a nice copy of all this data on a server that’s accessible worldwide and won’t crash.
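The table and the pick-up-where-we-left-off logic described above can be sketched as follows. This is a minimal illustration under assumptions, not the actual gmane.py: the helper names (`open_store`, `last_stored_id`, `already_stored`) and the exact column types are hypothetical, and an in-memory database stands in for content.sqlite.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the database and create the Messages table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS Messages
           (id INTEGER PRIMARY KEY, email TEXT, sent_at TEXT,
            subject TEXT, headers TEXT, body TEXT)"""
    )
    return conn

def last_stored_id(conn):
    """Return the largest id stored so far, or 0 for an empty table."""
    row = conn.execute("SELECT max(id) FROM Messages").fetchone()
    # max() over an empty table yields one row containing NULL (None).
    if row is None or row[0] is None:
        return 0
    return row[0]

def already_stored(conn, message_id):
    """True if this message id was retrieved on an earlier run."""
    row = conn.execute(
        "SELECT id FROM Messages WHERE id = ?", (message_id,)
    ).fetchone()
    return row is not None

conn = open_store()
print(last_stored_id(conn))  # empty table, so the starting point is 0

conn.execute("INSERT INTO Messages (id, email) VALUES (?, ?)", (1, "a@example.com"))
conn.execute("INSERT INTO Messages (id, email) VALUES (?, ?)", (2, "b@example.com"))
conn.commit()

start = last_stored_id(conn) + 1  # the next message to ask for
print(start, already_stored(conn, 2), already_stored(conn, start))
```

Because the starting point is recomputed from the table on every run, stopping the program at any moment loses nothing beyond the rows that were not yet committed.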
The format of this is that you say, I would like the email messages from one to two, or from message 101 to 102, and we can walk through these things. So that’s the message ID. To make the URL, we take the base URL, add the starting number, add a slash, and then add the starting number plus one. That’s how we form those. We retrieve that and decode it; we’ve seen this in some other programs. We check to see if we got legit data: if I get a 404 Not Found or something else, we quit. If someone hits Ctrl+C (or Ctrl+Z), the program is interrupted and we stop. If there’s some other problem, we complain and keep going, and if we have five failures in a row we quit. Otherwise it just keeps on going, because these things do have glitchy bits.
At this point, if we’ve made it this far, we’ve retrieved the URL and we’ve got the number of characters we retrieved. Then we check for bad data. This is a mail message, right? They all start with “From ” (From followed by a space) if they’re right. If it doesn’t start that way, we tolerate up to five such failures, because the data could be bad. Then I find the blank line, which is the newline at the end of one line followed by an empty line. We break the message into the mail headers, which is everything up to but not including the blank line, and the body, which is everything after that, okay? If we can’t find that break, we complain and again tolerate up to five failures. Then we use a regular expression, like the ones from the regular expressions chapter, to pull an email address out of the From: line somewhere in the headers.
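The URL construction, the “From ” check, the header/body split, and the five-failures-in-a-row rule described above can be sketched like this. It is a hedged illustration, not the course code: `BASE_URL` is a made-up placeholder rather than the real server, `collect` and `fake_fetch` are hypothetical names, and the network call is replaced by a function you pass in so the logic can be tried offline.

```python
BASE_URL = "http://example.com/list/"  # hypothetical placeholder, not the real server

def message_url(base_url, start):
    """Build the URL for one message: base + start + '/' + (start + 1)."""
    return base_url + str(start) + "/" + str(start + 1)

def split_message(text):
    """Return (headers, body), or None if the text is not a usable message."""
    if not text.startswith("From "):
        return None
    pos = text.find("\n\n")  # newline ending the last header, then a blank line
    if pos <= 0:
        return None
    return text[:pos], text[pos + 2:]

def collect(fetch, first_id, how_many, max_failures=5):
    """Fetch messages, quitting after max_failures bad ones in a row."""
    good, failures, message_id = [], 0, first_id
    while len(good) < how_many:
        try:
            parts = split_message(fetch(message_url(BASE_URL, message_id)))
        except KeyboardInterrupt:
            break  # the user asked to stop
        except Exception as err:
            print("Unable to retrieve", message_id, err)
            parts = None
        if parts is None:
            failures += 1
            if failures >= max_failures:
                break
        else:
            failures = 0  # a good message resets the counter
            good.append((message_id, *parts))
        message_id += 1
    return good

def fake_fetch(url):
    """Stand-in for the network: message 2 (URL ending in /3) is garbage."""
    if url.endswith("/3"):
        return "<html>not a message</html>"
    return "From a@example.com Sat Jan  5 09:14:16 2008\nSubject: hi\n\nbody text\n"

messages = collect(fake_fetch, first_id=1, how_many=2)
print([m[0] for m in messages])
```

Counting only consecutive failures lets an occasional glitchy message through without stopping a long run, while a server that is really down still ends the loop quickly.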
It’s going to find a less-than sign and then pull this stuff out up to the greater-than sign. So you’ve got the less-than, you’ve got the parentheses, and inside them one or more non-blank characters, followed by an at sign, followed by one or more non-blank characters. We get back a list of those, and we should only get one. If we find one, we grab the email address, strip it, and lowercase it. And if we’ve got some nasty little less-than sign in there, we clean that up as well. This is kind of cleanup, and you get used to this, where you’re like, how come all these email addresses have this other stuff in them? Then we also look for the case where there are no less-than signs, because some mail messages have it that way. Again, you write this code after you watch it run for a while and it’s cracked out and giving you bad stuff. I make them all lowercase so they match better, and get rid of bad characters. So now I’ve got an email address.
Then I look for the date, because I’m going to graph these by date. I look for the Date: line and use a regular expression to pull that out, right? I’m looking for Date:, followed by a blank, followed by any number of characters, followed by a comma. I’m not interested in the day-of-week bit, like Wednesday, so I skip that and grab everything after the comma and space, up to the newline at the end of the line. That’s the text it pulls out. It’s a kind of funky-looking date, and we want to standardize it. So we chop it off to 26 characters. I don’t remember why we care about 26 characters, but we chop it off to 26. Then we parse it, and that gives us a nice clean date, sent_at.
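The two extractions described above can be sketched as follows. This is an approximation, not the course code: the patterns are reconstructed from the spoken description, the function names are hypothetical, and the date is parsed with the standard library’s `email.utils.parsedate_to_datetime` rather than the parser the lecture uses. As a side observation, a date such as `05 Jan 2008 09:14:16 -0500` is exactly 26 characters long, so the truncation appears to drop trailing text such as `(EST)`.

```python
import re
from email.utils import parsedate_to_datetime

def extract_email(headers):
    """Pull the sender address from the From: line, lowercased."""
    # Preferred form:  From: Some Name <user@host>
    found = re.findall(r"\nFrom: .* <(\S+@\S+)>\n", headers)
    if len(found) != 1:
        # Fallback form with no angle brackets:  From: user@host
        found = re.findall(r"\nFrom: (\S+@\S+)\n", headers)
    if len(found) != 1:
        return None
    return found[0].strip().lower().replace("<", "")

def extract_sent_at(headers):
    """Return the Date: header as an ISO 8601 string, or None."""
    # Skip the day of week by capturing what follows the comma and space.
    found = re.findall(r"\nDate: .*, (.*)\n", headers)
    if len(found) != 1:
        return None
    text = found[0][:26]  # keep only the first 26 characters
    try:
        return parsedate_to_datetime(text).isoformat()
    except (TypeError, ValueError):
        return None  # the caller would count this as a failure

headers = (
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n"
    "Date: Sat, 05 Jan 2008 09:14:16 -0500 (EST)\n"
    "From: Stephen Marquard <Stephen.Marquard@UCT.ac.za>\n"
    "Subject: [sakai] svn commit\n"
)
print(extract_email(headers))    # stephen.marquard@uct.ac.za
print(extract_sent_at(headers))  # 2008-01-05T09:14:16-05:00
```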
If we can’t parse it, we complain, and we tolerate five bad messages in a row before we quit. Then we look for the subject line using another regular expression. That one is pretty easy: Subject:, a blank, and then everything up to but not including the newline. We pull that out and we’ve got the subject. At this point we’ve parsed it and we’ve got good stuff, so we reset the fail counter, because, as I kept saying, if you fail five straight times you quit. We print it out and then we just insert that stuff: the ID of the message, the email address it came from, the time it was sent, the subject, and then the headers and the body. Then every 50th message we commit, which speeds things up, and every 100th we wait a second. So count is going up, up, up, up; every 50th you’ll see a short pause, and every 100th it’ll pause for a second. Mostly that’s to let me hit Ctrl+C and to not overload any server.
Okay, so that’s the simple one. The problem is that the data just gets ugly, and you’ll find yourself wanting to reset this and start it over. This one’s going to work, of course, but these are hard to build. So let’s run it: python3 gmane.py. How many messages? Well, let’s just do one. Okay, so it went and grabbed... do I have this already? 51 through 52. Let me start over: ls -l *sqlite. Okay, rm content.sqlite. I must have run it to test it. So let’s run it again, python3 gmane.py, and ask for one message. Okay, so it went and got message one, from one to two, and we got 2662 characters. And we printed out the email address, the time (which we got after all that hacking), and the subject line. That’s what we’ve got.
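The subject extraction and the commit-every-50, pause-every-100 pattern described above can be sketched as below. This is a simplified illustration with hypothetical names (`extract_subject`, `store_all`); the rows are fabricated placeholders, and the pause is set to zero in the demo so it finishes immediately.

```python
import re
import sqlite3
import time

def extract_subject(headers):
    """Return the text after 'Subject: ' up to the newline, or None."""
    found = re.findall(r"\nSubject: (.*)\n", headers)
    return found[0].strip() if len(found) == 1 else None

def store_all(conn, parsed, commit_every=50, pause_every=100, pause_seconds=1.0):
    """Insert parsed messages, committing in batches. Returns commits made."""
    commits = 0
    for count, row in enumerate(parsed, start=1):
        conn.execute(
            """INSERT OR IGNORE INTO Messages
               (id, email, sent_at, subject, headers, body)
               VALUES (?, ?, ?, ?, ?, ?)""",
            row,
        )
        if count % commit_every == 0:
            conn.commit()  # batching commits is faster than one per row
            commits += 1
        if count % pause_every == 0:
            time.sleep(pause_seconds)  # be gentle with the server
    conn.commit()  # make sure the final partial batch is saved
    return commits + 1

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Messages (id INTEGER PRIMARY KEY, email TEXT,
       sent_at TEXT, subject TEXT, headers TEXT, body TEXT)"""
)
# Placeholder rows standing in for parsed messages.
rows = [
    (i, "a@example.com", "2008-01-05T09:14:16-05:00", "hi", "hdr", "body")
    for i in range(1, 121)
]
commits = store_all(conn, rows, pause_seconds=0)
stored = conn.execute("SELECT count(*) FROM Messages").fetchone()[0]
print(commits, stored)
```

With 120 rows this commits after rows 50 and 100 and once more at the end, so at most 49 rows are ever pending if the program is interrupted between commits.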
So if we take a look at the database, and we go into the gmane folder: any time you see the content.sqlite-journal file, that means it needs to run a COMMIT and hasn’t run it yet. I’ll hit Enter, and that will do the commit, and you see the journal vanish. So now I can open the database and take a look. How come there are no messages? Did that one not get stored in there for some reason? It needs to refresh. Let’s run it again. Maybe it didn’t commit. Maybe I’ve got a bug in it. Let’s change the code. See this connection.commit? I’m going to commit there, and the other thing I’m going to do is commit right before every time I stop to read input. I hope that doesn’t blow up; we’ll see. The idea is, if I want to stop, I want to commit. So let’s do this. Let’s do one message. Now that I’ve put the commits in, I think it will look better. Okay, refresh, and there it is, because I committed it. And I don’t have the journal file, so that’s good. It’s a good idea to put those commits there, so I’ll just leave them in. When you download it, it’ll have these commits in there. So again, I put a commit here and a commit at the very, very end to make sure. I’d missed that.
But now we’ve got message 1, right? So let’s just run it again, and you’ll see how, by selecting the max of the ID, it takes that max and adds 1 to it, so it goes on to the next one. If I run it again and say give me one message, it goes two to three. And then give me two messages. Then I hit Enter, do a refresh, and I see we’ve got four messages. So let’s just fire this baby up. Tell it to get 100. Run, run, run, run, run. Right? It just goes and goes, and it pauses once in a while to do a commit. Oop, it just paused there. Now it’s finished. So this will run and we will get a bunch of data.
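The reason the message did not show up until a commit was added can be reproduced with a small experiment. The sketch below is illustrative only: it uses a temporary file rather than the real content.sqlite, a cut-down table, and a hypothetical helper `visible_rows` that plays the role of the separate database browser.

```python
import os
import sqlite3
import tempfile

def visible_rows(path):
    """Count rows as a second, independent connection would see them."""
    reader = sqlite3.connect(path)
    try:
        return reader.execute("SELECT count(*) FROM Messages").fetchone()[0]
    finally:
        reader.close()

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "content.sqlite")
    writer = sqlite3.connect(path)
    writer.execute("CREATE TABLE Messages (id INTEGER PRIMARY KEY, subject TEXT)")
    writer.commit()

    # The insert opens a transaction that stays pending until commit().
    writer.execute("INSERT INTO Messages (id, subject) VALUES (1, 'hello')")
    before_commit = visible_rows(path)

    writer.commit()
    after_commit = visible_rows(path)
    writer.close()

print(before_commit, after_commit)  # prints: 0 1
```

Committing right before prompting for input, and once more when the loop ends, means that whenever the program is sitting idle or has exited, everything it retrieved is visible to other tools.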
The problem is, if I just run this, it will take about five hours to get everything, okay? And I’ve got a really fast connection. So I’ve got a file that you can download and save. Let’s go find it and see how long it takes me to download. I’m going to use the command-line curl; wget is another command that we Linux and Mac people can use. You might have to use your browser to do it. Let’s see how long this is going to take. It’s retrieving... a minute thirty. Okay, well, I’ll just wait and come back. Okay, so now that’s done. I was averaging ten megabits a second, and I downloaded about 600 megabytes. That will probably be slower for you. So now if I take a look, you’ll find that content.sqlite is 624 megabytes. What happened is that I’ve pre-spidered this. So now if you run gmane.py and ask for five more messages, it will pick up where I left off. It’s up to message 59,000. And I think we saw an error there, a bug in that one; I don’t know what’s wrong with that one. So at this point we have most of the data, and it might find its way to the very end. Once you’ve got this file, there should not be too much more. I don’t know, maybe it goes to 63,000 or something. So what we’ll do is let that run, come back when it’s finished, and then run the next phase after it’s got all of its data, okay? So thanks for listening.