Worked Example- Gmane / Mail - Model (Chapter 16)

دوره: Capstone- Retrieving, Processing, and Visualizing Data with Python / فصل: Spidering and Modeling Email Data / درس 3

Worked Example- Gmane / Mail - Model (Chapter 16)

توضیح مختصر

The work that we're doing right now is we are in the process of building a spider and visual- visualization tool for email data that came originally from this website gmane. So remember I split the body into- in half and then the headers and so that's- I made this as raw as I possibly could because, as you saw, I had to spend so much time in the gmane just putting the data successfully retrieved. And so, this is just sort of like a- I pulled this in really quick, and I read all this stuff from the DNS mapping and I- other than stripping and making this lower case etc.

  • زمان مطالعه 12 دقیقه
  • سطح ساده

دانلود اپلیکیشن «زوم»

این درس را می‌توانید به بهترین شکل و با امکانات عالی در اپلیکیشن «زوم» بخوانید

دانلود اپلیکیشن «زوم»

فایل ویدیویی

متن انگلیسی درس

Hello everybody and welcome to Python for Everybody. We are going to do, be doing some code walk throughs. If you want the sample code though you can download the zip from our website. The work that we’re doing right now is we are in the process of building a spider and visual- visualization tool for email data that came originally from this website gmane. But I’ve got my own copy of it. And so, what we’ve done before is we ran gmane.py and I grabbed a url. I have a url that has all this data, and I downloaded that and then I ran gmane again to catch up. And so, it took quite a bit of catching up but by the time I get to- Remember how I said it run, tries to fails five times. Well, it ran out of data at 60,421 and then it started failing and then it quit. And so we pretty much have all of our data now. We have all- We have finished this process and S content SQLite, okay? And if I take a look in the database browser, we can see we got 59,823 e-mail messages. And so, if I look at any of these things, you see the headers, you see the subject line, you see the email address, you see the body of it. So remember I split the body into- in half and then the headers and so that’s- I made this as raw as I possibly could because, as you saw, I had to spend so much time in the gmane just putting the data successfully retrieved. And so, I don’t like cleaning the data up too much. And so, what we’re going to look at next is the data cleaning process. Okay? And so, this is gmodel.py is what we’re going to take a look at now. So let’s get rid of those guys and look at gmodel.py. I don’t think I need url lib in this code. Do I have any urllib? No. So I don’t need that. Sorry. Fixed that. Okay. So, it’s going to read from the database, it’s got a call reg, it’s going to use regular expressions. And zlib is a way to do some compressions so I’m going to do- In this one I’m going to compress some of the data to make it so that I have less data to- some of the text fields are going to be compressed. I wanted to keep these fields uncompressed inside of messages. And so, we have some just cleanup messages and cleans things up and it turns out that the way e-mail addresses, in this particular mail corpus, they changed over time and we- there’s certain kinds of things. Sometimes the gmane.org is the e-mail address when people want to hide their address. And I made all kinds of stuff, and I split it, and checked to see if it ended with this, and I cleaned up things, just that and nothing. And so, I have all kinds of cleanup stuff going on in here. And I’m just mapping and DNS mapping, that I’ll talk about in a bit, where organizations sometimes sent e-mail with different addresses over time, and people sent e-mail from different part time. And we’re going to do the parsing of the date and that is the code for that. We’re going to pull out the header information. This is sort of borrowed from the other code. We’ll clean up the e-mail addresses and the domain names. And we’ll pull the data out, pull the subject out, pull out the message ID, various things. So, here’s the main body of the code. We’re going to go from content.SQLite to index.SQLite. And what I’m going to do every time is I’m gonna wipe out index.SQLite and drop it to the messages, senders, subjects, and replies. So this is a normalized database and that it has foreign keys. So, there’s a messages table here with the integer primary key, the guid for it. GUID stands for Global Unique ID, sen times, sender ID, and then it’s going to have a blob. These are blobs, binary large objects, for the headers and the body because I’m going to compress them in this database to make them. And then the senders, each sender has a key, and then subjects, each subject line is going to have a key, and then replies are a connection from one message to another, and so this is like a many to many. Now, I also have this file called mapping.SQLite. And so, we can take a look at that one: Mapping.SQLite. And so, what happened is this has two tables that I hand deal with. And so sometimes Indiana- this was a e-mail address that mapped to that. That’s, so Indiana.edu, that’s a way to take- that’s the e-mail address. And then these were a bunch of people that had e-mail address changing throughout the project and I sort of, kind of, mapped them in a way. And so, this is just sort of like a- I pulled this in really quick, and I read all this stuff from the DNS mapping and I- other than stripping and making this lower case etc. I just am going to make a dictionary. DNS mapping, which is the old name to the new name, and the e-mail address mapping from the old name to the new name, and then fixsender. Fixsender is because the e-mail address is even within gmane were kind of funky. So, don’t worry so much about this. Okay. And so, now what I’m gonna do is I opened up a connection just to read all that stuff in and now I’m going to actually open the main content. And I’m asking it to open, this is a little trickier. I open that read only. That was so that I could potentially be running the spider and running this at the same time. I get a cursor. And so, I’m going to read through so in the content file, this is the big one, I’m going read through and go through every one and write all of these things in. And I’m going to take all the e-mail addresses and I’m going to put those in a list. So I loaded that. I’ve got the mappings loaded. And so, now I’m going to go through every single message. I’ve got all the senders, all the subjects, and all the global unique IDs. So I read in each message, so now I’m going through content one at a time. I parse the headers. I check to see if the sender’s name, email address, after it’s been cleaned up, is in the, is in my mapping, mapping.getsender, and the default is like a back sender. So that’s what that’s saying. Look up sender, if it’s in there, give me the entry of that key. Otherwise, give me send her back. We’re going to print every 250 things we do. We’ll complain if this is true. We’re going to go get the mapping between the senders which is a way to look up the primary key. I could have done this with a database thing but I wanted it to be fast. So that’s part of the reason I read all these things in so I can have those mappings to be really fast. You’ll see this takes a little while even though it, you know, even though I’ve got all the stuff cached. And so then, if I don’t have a sender ID, meaning that I haven’t seen it yet, then I’m going to do an insert or ignore into senders. And then I’m gonna do a select, and then you’ve seen this where I grab the row back and I’m really just trying to look at the recently assigned ID. And then I’m going to, not only, set the sender ID for this iteration loop, but I’m also going to store it in the dictionary and so that builds this dictionary up. And you’ll see the same thing is true for subject ID. I’m going to insert it into the subject table and get a primary key if I don’t know what it is. And then I’m going to put it into, not only am I going to put it into the database, but I’m also going to put it into my dictionary. And the same thing… I guess I didn’t do it for the guid. Okay. So now what I have is the sender ID and the subject ID which are foreign keys into the center table and the subject table, and I’m going to insert the message with the sender ID, subject ID, the sent that headers and body. And the values here are the guid sender ID, subject, sent at. Now, this here, zlib.compress. So what I’m taking is the message, the header, and the body, and this little bit ends up with the compressed version of this stuff and you’ll see it in a second. And this keeps the size of these text things down at the cost of the computation of- We have to look at the cost of the computation to compress and decompress when we want to read it. Okay? And then I pull the guids out, the ID, which is the guid, and I pull out the primary key for this thing based on the guid. And I update this dictionary. Okay? So let me run that code. It is doing a lot of cleanup, and I’ll tell you it took me a long time to make this work. So just- so this code that I’m running now, oh, hahaha. Don’t forget to take a Python3, Chuck. So this is going to run every 250. So it did all this pre-caching. So that’s how long it takes to do 250. Another 60,000 in here. And so, this is really busy. The reason it’s bouncing back and forth is that every time it makes this journal file that’s and then does it commit. So you kind of see that it’s busy making journal files and committing and there’s a lot of activity going on here. It so happens that Adam shows me these files. Okay, so it’s finished. It took about three minutes to finish that. Right? And so, if we take a look at the size of the files, we will see that the index is much smaller. It’s fully normalized. It’s still 263 megabytes. It’s all compressed. So let’s take a look at that in the browser. So, it’s 200 megabytes but it loads up a lot faster. There we go. So, we have a sender’s table, right? Which is just kind of a many-to-one table. We have a subjects to table which is a many-to-one table. And we have messages which has foreign keys and it takes a little bit to load that up. Okay. And so, so we see the foreign keys for sender and subject and that saves us, all those foreign keys save us. And so we have, you can kind of see that I can’t see the headers in the body because now they’re compressed. That saves me a whole bunch of stuff, right? It saved me a whole bunch of stuff. You know, and so, so that’s what’s in that file. And that we’ve finished this process. Okay? And we’ve finished modeling the data and making it really clean and we’ll pick back up and the rest of the stuff we will do is actually visualizing, pulling data out of index.SQLite. The idea is this can be restarted. This can be run over and over and over even though it takes like three minutes to run this. That’s way better than five hours to run this. So three minutes, five hours and then you’ll see, and we’ll see now reading this, as in seconds, because we got it all nice and normalized in a quite pretty way. So I hope this has been useful. In the next one we’ll actually do the visualization.

مشارکت کنندگان در این صفحه

تا کنون فردی در بازسازی این صفحه مشارکت نداشته است.

🖊 شما نیز می‌توانید برای مشارکت در ترجمه‌ی این صفحه یا اصلاح متن انگلیسی، به این لینک مراجعه بفرمایید.