Worked Example: Gmane / Mail - Visualization (Chapter 16)

Course: Capstone: Retrieving, Processing, and Visualizing Data with Python / Chapter: Visualizing Email Data / Lesson 1

English transcript of the lesson

Hello everybody, welcome to Python for Everybody. We’re doing a bit of code walkthrough. If you want to follow along, you can download the ZIP file for the code from our website. We are in the process of retrieving data from this gmane server, one that I made a copy of. So far we have spidered it all and ended up with 600 megabytes of spidered information. We have run a rather complex cleanup process that you probably don’t need to fully understand. You can look at it for patterns, but in general, the cleanup process will be very sensitive to the data. And then we have this index.sqlite, which is 260 MB right now. And now we’re going to do the fun, easy bits, where we run little queries that just pull data out. So these are much simpler.

So part of what I wrote when I was doing this was some simple basic calculations on the data, because I was really looking for anomalies, right? What was working or what wasn’t working. So I wrote a series of really simple things like this gbasic.py, just to give me some basic data, right? So I wrote things down, and I counted things. And so, do I need urllib in this one? I don’t think so, let’s fix that bug. It’s not used, no reason to put any of that stuff in there.

So it reads that index.sqlite, which is our cleaned up data. It reads through and makes a dictionary. This is a pattern you are going to see a lot, where I make a dictionary of id to sender, to save myself repeatedly looking things up. I’m going to grab the subjects too, so I’ve cached them all. I could have done this all with SQL, but I just wanted to do things faster. And now I’m going to go through each of these messages and make a dictionary of them. I’m going to put a lot of stuff in memory. And then I’m going to do some counts. I’m going to see who has sent the most, right, the organizations. And so now I’ve gotta go through all the messages.
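The caching pattern described above can be sketched roughly as follows. This is a minimal illustration, not the course's gbasic.py: it builds a tiny in-memory database instead of opening index.sqlite, and the table names, column names, and sample rows are all assumptions made for the example.

```python
import sqlite3

# Assumption: a tiny in-memory stand-in for index.sqlite. The table and
# column names are illustrative guesses, not the course's exact schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Senders  (id INTEGER PRIMARY KEY, sender TEXT);
CREATE TABLE Subjects (id INTEGER PRIMARY KEY, subject TEXT);
CREATE TABLE Messages (id INTEGER PRIMARY KEY, sender_id INTEGER,
                       subject_id INTEGER, sent_at TEXT);
INSERT INTO Senders  VALUES (1, 'ann@example.edu'), (2, 'bob@example.org');
INSERT INTO Subjects VALUES (1, 'Build failure'), (2, 'Release plan');
INSERT INTO Messages VALUES
    (1, 1, 1, '2006-01-05T10:00:00'),
    (2, 2, 1, '2006-01-06T11:30:00'),
    (3, 1, 2, '2006-02-01T09:15:00');
""")

# Cache the small lookup tables once, keyed by id, so later code can do
# a dictionary lookup instead of another query per message.
senders = dict(cur.execute("SELECT id, sender FROM Senders"))
subjects = dict(cur.execute("SELECT id, subject FROM Subjects"))

# Load only the light columns of each message into memory.
messages = {}
for msg_id, sender_id, subject_id, sent_at in cur.execute(
        "SELECT id, sender_id, subject_id, sent_at FROM Messages"):
    messages[msg_id] = (sender_id, subject_id, sent_at)

conn.close()

print("Loaded messages=", len(messages),
      "subjects=", len(subjects), "senders=", len(senders))
```

With the lookups in memory, resolving a message's sender is just senders[sender_id], which is the repeated lookup the dictionaries are meant to save.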
So you’ll notice that I am not selecting the body or the headers here. I am just getting the sender id and subject id. I probably could have done this with a join, it would have been cleaner. You can make that change yourself. So I’m going through all the messages, but without the body, so this is going to be really quick. I’m pulling out the sender’s id and breaking the sender into pieces at the at sign. See, my data’s clean now, I cleaned it all up in the previous processes. If I don’t have two pieces, I continue, and otherwise I get the domain name. So I have the person and the domain. I’m doing a basic dictionary histogram for the people and the domains. And then I’m going to sort them, right, with sorted. We sort by how many there are, in reverse, and then print out the top few of the organizations and the top few of the people, okay?

And so we’ll just run that code: python gbasic.py, and let’s tell it to dump out the top 10. So we loaded 59,000 messages, 29,000 subjects, and 1,800 senders, and figured out the Top 10 people and the Top 10 organizations. You can write various things like that, that just sort of screen through your data. It’s good to get sanity checking on your data, okay?

So that’s gbasic. Now I want to do gword.py, because that’s kind of fun. In gword.py, I don’t need urllib either. Why do I keep putting urllib in all these things? So I’ll get rid of that. This is really simple, because I’m just going for the words in the subject line. I go through index.sqlite, I read in all of the subjects, and I make a dictionary of those. Then I go and find the subject of every message. I’m pulling out the subject based on the message, and I’m doing this so that when a subject is used more than once, I count its words more than once. This str.maketrans, I talked about that in an earlier chapter.
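The gbasic.py counting step described above (split each sender at the at sign, build two dictionary histograms, sort by count in reverse) might look something like this. It is a sketch only: the addresses are made up, and the helper name top is hypothetical.

```python
# Hypothetical sender addresses, one per message, standing in for the
# values looked up from the cleaned database.
message_senders = [
    "ann@example.edu", "bob@example.org", "ann@example.edu",
    "cat@example.edu", "malformed-address", "bob@example.org",
    "ann@example.edu",
]

person_counts = {}
org_counts = {}
for sender in message_senders:
    pieces = sender.split("@")
    if len(pieces) != 2:
        continue  # skip anything that is not name@domain
    domain = pieces[1]
    # Basic dictionary histograms for people and for domains.
    person_counts[sender] = person_counts.get(sender, 0) + 1
    org_counts[domain] = org_counts.get(domain, 0) + 1


def top(counts, how_many):
    """Return the keys with the largest counts, biggest first."""
    return sorted(counts, key=counts.get, reverse=True)[:how_many]


how_many = 2  # the walkthrough dumps the top 10; 2 is enough here
top_people = top(person_counts, how_many)
top_orgs = top(org_counts, how_many)

print("Top", how_many, "people")
for person in top_people:
    print(person, person_counts[person])

print("Top", how_many, "organizations")
for org in top_orgs:
    print(org, org_counts[org])
```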
The str.maketrans call basically throws away punctuation and numbers, so that when I make my words I don’t end up with words that are just dashes. It compresses them down. Then I strip it, and I convert everything to lowercase. This is basically just to keep too many words from showing up. Then I do a split, and then I’ve got counts, a dictionary. So this is a no-punctuation, no-numbers dictionary count. Then I sort the words in reverse order of count. And then I figure out what the highest and lowest counts are by running through a loop. I could’ve probably done this with a max and a min if I felt like it. You know, I should’ve done a max and a min on that one. Why did I do that? But, well. So now I have the highest and the lowest.

And now I’ve gotta spread out the size. I’m going to produce this file gword.js, which is needed by the visualization, because it’s going to use d3.js, a word cloud visualizer, and gword.js. I have to tell it how big the text is, and so I’m doing some normalization of the text size. Took me a little experimentation. So if I run this now, I say python gword.py. No, I say python3 gword.py, which is a lot better. Not python. Okay, so now I can go look at gword.js, wherever that is. Yup. So basically it normalized all the frequencies and turned them into font sizes. These are font sizes now, okay? And this is just the data that’s needed by gword.htm, which uses this d3 word cloud visualization code. So this pulls in all my data, and then this is just some JavaScript that draws the picture on the page, okay?

And so the easy part now is to just open gword.htm in a browser. It just so happens on a Mac I can do this. And that gives me a word cloud based on that data, and it kind of randomizes it. It shows different stuff each time, but it’s using this data to decide how big those things are, and then using a bit of randomness and simulated annealing to lay it out.
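The word counting and size normalization described above could be sketched like this. Everything specific here is an assumption for illustration: the sample subjects, the font size range of 20 to 100, the linear scaling, and the text/size field names in the output. The real gword.py may differ in all of these.

```python
import json
import string

# Hypothetical subject lines, one entry per message, so a subject used
# by two messages contributes its words twice.
subjects_per_message = [
    "Re: Build failure in trunk (#1234)",
    "Re: Build failure in trunk (#1234)",
    "Release plan -- 2.1",
    "Build server down?",
]

# Translation table that deletes punctuation and digits.
drop = str.maketrans("", "", string.punctuation + "0123456789")

counts = {}
for subject in subjects_per_message:
    text = subject.translate(drop).strip().lower()
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1

# Most frequent words first.
words = sorted(counts, key=counts.get, reverse=True)

# max and min replace a hand-written loop over the counts.
highest = max(counts.values())
lowest = min(counts.values())

SMALL, BIG = 20, 100  # assumed font size range
spread = (highest - lowest) or 1  # avoid dividing by zero if all equal

sizes = {}
for word in words:
    fraction = (counts[word] - lowest) / spread
    sizes[word] = int(SMALL + fraction * (BIG - SMALL))

# Build the JavaScript text; writing it to gword.js is left out here.
js_text = "gword = " + json.dumps(
    [{"text": word, "size": sizes[word]} for word in words]) + ";\n"
print(js_text)
```

Scaling between the lowest and highest count keeps the rarest word readable and the most common word from filling the page, which is the kind of tuning the walkthrough says took some experimentation.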
The layout is not stuff that we actually have to worry about, okay? So that’s how we get to the point where we’re seeing a word cloud from this.

Now we’re going to do another visualization, and this time it’s a line visualization. We’re going to create a file called gline.js and display it with another HTML file. We’re going to use d3 and produce that output. So let’s say goodbye here. Goodbye, goodbye, goodbye, goodbye. So, gline.py. Get rid of that. So again, I’m going to preload all of the senders in this case. And again, I could have done this with a join, and probably should have. I’m going to preload all the messages: the sender id, subject id, etc. Now I’m going to read through them, and I’m going to have the sending organizations and the senders. I split each sender’s address at the at sign to get the sending organization, and I accumulate a count per organization in a simple dictionary. Then I sort them and pull out the Top 10 organizations. I print those out, and then I’m going to break this down into months.

I’ll show you what this looks like in a second, let’s go to gline.js. So the month looks like this. That’s the first seven characters of the date. So if we look at the date, the date looks like that, and the month is the first seven characters. And this is the data that I’ve got to give it. That data will look better in a moment. Go back to gline.py. The key is a tuple, which is the month and which organization did it, and we only count the Top 10 organizations. So we’re basically doing a dictionary where the key is a tuple. And then we’re going to sort it, by key in this case, not by value.
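The two passes described above, first finding the top organizations and then counting by a (month, organization) tuple, might be sketched as follows. The sample messages are made up, the ISO-style date format is an assumption, and the example keeps only the top 2 organizations rather than 10.

```python
# Hypothetical (sender, date) pairs. The date format is an assumption;
# the only property used is that its first seven characters are the month.
messages = [
    ("ann@example.edu", "2006-01-05T10:00:00"),
    ("bob@example.org", "2006-01-06T11:30:00"),
    ("ann@example.edu", "2006-02-01T09:15:00"),
    ("cat@example.edu", "2006-02-03T14:45:00"),
    ("dan@example.net", "2006-02-04T08:00:00"),
]


def organization(sender):
    """Return the part after the at sign, or None if malformed."""
    pieces = sender.split("@")
    return pieces[1] if len(pieces) == 2 else None


# First pass: total messages per organization.
org_totals = {}
for sender, sent_at in messages:
    org = organization(sender)
    if org is None:
        continue
    org_totals[org] = org_totals.get(org, 0) + 1

top_orgs = sorted(org_totals, key=org_totals.get, reverse=True)[:2]

# Second pass: count per (month, organization), top organizations only.
counts = {}
for sender, sent_at in messages:
    org = organization(sender)
    if org not in top_orgs:
        continue
    month = sent_at[:7]  # e.g. "2006-01"
    key = (month, org)
    counts[key] = counts.get(key, 0) + 1

# Sorting the dictionary sorts by key: month first, then organization.
for key in sorted(counts):
    print(key, counts[key])
```

Because tuples compare element by element, sorting the keys orders the data by month and then by organization without any extra work.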
We also collect the months, and we’re going to sort those. And then we’re going to write all this data out into gline.js, so let’s go ahead and run this. Again, this is just the data, written in a way that the JavaScript can understand. python3 gline.py, okay, so there are the Top 10 organizations.

So now let’s take a look at that JavaScript. This is what it looks like. It just so happens that you gotta tell it which are the data points and which are the lines. So this is the Year, then the line for University of Michigan, gmail.com, swinsborg.com. So this first column is that line’s points, and the next is the next line’s points. So all this code was to get the data in such a way that I could produce this JavaScript file, because if I look at gline.htm, I need that data in that particular format. And I’ve got all this stuff: I make a line chart, and I draw it with this data. I had to go read all the documentation to figure this stuff out, and I had to transform the data and make it pretty. It took me quite a while to get this to work. This is not a JavaScript class, nor a class on how to visualize in D3. But basically, we pulled all that stuff in, and here’s the gline that came from the JavaScript file, and then it makes an arrayToDataTable. And then that data table is what gets drawn.

So with no further ado, let’s open gline.htm to show that data. There you go, that’s the Sakai developer participation from 2005 through 2015, based on which organizations did the most commits in Sakai.

And so I know that I haven’t done all this code full justice, there’s a lot of code here. The fun is just to kind of run it and see it. And then when the time comes, come back and see the techniques that are used when you’re trying to build your own visualization pipeline. So I hope that you found this useful. This is a lot of code, hard to explain in 15, 20 minutes, but I hope you take some time and look it over.
And I hope you found all these videos useful. This is kind of the last walkthrough video for Chapter 16 of the book, and so I hope that I will see you on the net.
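As a closing illustration of the gline.js step from the walkthrough, here is one way the (month, organization) counts could be turned into the rows that an arrayToDataTable call consumes: a header row, then one row per month with one column per organization. This is a sketch under assumptions, not the course's gline.py. The sample counts are invented, filling missing months with 0 is a choice made for the example, and the text is built in memory rather than written to a file.

```python
import json

# Hypothetical inputs: the top organizations and the counts keyed by
# (month, organization), as produced by an earlier counting step.
top_orgs = ["example.edu", "example.org"]
counts = {
    ("2006-01", "example.edu"): 1,
    ("2006-01", "example.org"): 1,
    ("2006-02", "example.edu"): 2,
}

# Collect the distinct months and sort them.
months = sorted({month for month, org in counts})

# Header row: a label for the x axis, then one column per organization.
rows = [["Year"] + top_orgs]

# One row per month; an organization with no messages that month gets 0
# (an assumption made here so every row has the same length).
for month in months:
    rows.append([month] + [counts.get((month, org), 0) for org in top_orgs])

# JSON list syntax is also valid JavaScript, so json.dumps does the
# quoting. Saving js_text as gline.js would be the final step.
js_text = "gline = " + json.dumps(rows) + ";\n"
print(js_text)
```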
