Worked Example: Page Rank - Spidering (Chapter 16)

Course: Capstone - Retrieving, Processing, and Visualizing Data with Python / Chapter: Building a Search Engine / Lesson 2


Brief description

I provide this bs4.zip as a quick and dirty way to use BeautifulSoup if you can't install it for all of the Python users on your system. The link-handling code is just nasty chopping and throwing away of URLs: we go through a page, find a bunch of links we don't like, and clean up the rest. But then, finally, at line 132, we're ready to put the URL and the HTML into Pages, and it's all good.


English transcript of the lesson

Hello, and welcome to Python for Everybody. We're doing a bit of a code walk-through, and if you want to, you can get the sample code and download it so that you can walk through the code yourself. What we're walking through today is the page rank code. Let me get the picture of the page rank code up here. Here's that picture. The page rank code has five chunks of code that are going to run. The first one we're going to look at is the spidering code, and then we'll do a separate look at the other pieces later. So the first one we'll look at is spidering, and again it's sort of the same pattern: we've got some stuff on the web, in this case web pages, and we're going to have a database that just captures the stuff. It's not really trying to be particularly intelligent, but it is going to parse these pages with BeautifulSoup and add things to the database, okay? And then we'll talk about how we run the page rank algorithm, and how we visualize it, in a bit.

Now, the first thing to notice is that I put the BeautifulSoup code in right here, okay? You can get this from the bs4.zip file; there's a README somewhere. To use BeautifulSoup, you've got to put this bs4.zip in place, or you have to install BeautifulSoup yourself. So I provide this bs4.zip as a quick and dirty way in case you can't install something for all of the Python users on your system. That's what it's supposed to look like: you're supposed to have it unzipped right here next to these files. And I don't know what damnit.py means. That came from Beautiful Soup; if you look, it's in their source code. So I'm not swearing, the Beautiful Soup people are swearing. I'm sorry, I apologize, okay.

So the code we're going to play with the most in this first part is called spider.py. We're going to do databases, we're going to read URLs, and we're going to parse them with Beautiful Soup, okay? And what we're going to do is make a file. Again, this will make spider.sqlite, and here we are in the pagerank folder; ls -l shows spider.sqlite is not there, so this is going to create the database. We do CREATE TABLE IF NOT EXISTS, and we're going to have an INTEGER PRIMARY KEY, because we're going to do foreign keys here. We're going to have a URL, which is unique, the HTML, and whether we got an error. And then, for the second half, when we start doing page rank, we're going to have old rank and new rank, because the way page rank works is it takes the old rank, computes the new rank, then replaces the old rank with the new rank, and does that over and over again. Then we're going to have a many-to-many table which points pages back to pages, so I call the columns from_id and to_id. We did this with some of the Twitter stuff. And then this Webs table is just in case I have more than one web; it doesn't really make much difference.

Okay, so what we're going to do is SELECT id, url FROM Pages WHERE html is NULL, which is our indicator that a page has not yet been retrieved, and error is NULL, ORDER BY RANDOM. Not all of this SQL is completely standard, but this ORDER BY RANDOM is really quite nice in SQLite. LIMIT 1 says: of all the records in this database where this condition is true, just pick one, at random.
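To make the schema he is describing concrete, here is a minimal sketch of the table creation and the random pick of an unretrieved page. It follows the names used in the lecture; the real spider.py may differ in small details.

    import sqlite3

    conn = sqlite3.connect('spider.sqlite')
    cur = conn.cursor()

    # Pages holds each URL, its retrieved HTML, an error flag, and the two rank columns
    cur.execute('''CREATE TABLE IF NOT EXISTS Pages
        (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT,
         error INTEGER, old_rank REAL, new_rank REAL)''')

    # Links is the many-to-many table: which page links to which page
    cur.execute('''CREATE TABLE IF NOT EXISTS Links
        (from_id INTEGER, to_id INTEGER)''')

    # Webs restricts the crawl to the site(s) we care about
    cur.execute('''CREATE TABLE IF NOT EXISTS Webs (url TEXT UNIQUE)''')

    # Pick one unretrieved page at random: html IS NULL means "not fetched yet",
    # error IS NULL means "we haven't already given up on it"
    cur.execute('''SELECT id, url FROM Pages
        WHERE html is NULL and error is NULL
        ORDER BY RANDOM() LIMIT 1''')
    row = cur.fetchone()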
Then we're going to fetch a row, and if that row is None, we're going to ask for a new web, a starting URL, and that's going to fire things up: we insert this new URL. Otherwise, we have a row to start with and we can just restart; we sort of prime things by inserting the URL we start with. If you just hit Enter, it goes to dr-chuck.com, which is a fine place to start. And then what this Webs table does is limit the links: the spider only follows links to the sites that you tell it to, and it's probably best for your page rank to stick with one site; otherwise, if you let this wander the web aimlessly, you'll just never find the same site again. So I generally run with one web; the table probably should have been called websites. I pull all that data, read it in, and just make myself a list of the legit URLs, and you'll see how we use that. The webs are the legit places we're going to go.

Then we're going to go through a loop, ask for how many pages, and look for a null page, again using that ORDER BY RANDOM() LIMIT 1, and grab one. We get the fromid, which is the page we're linking from, and then the url; otherwise, there's nothing left to retrieve. The fromid matters because when we start adding rows to Links, we've got to know the page we started from, and that's its primary key; we'll see how that primary key is set in a second. We print the fromid and the URL that we're working with. Then, just to make sure, we're going to wipe out all of the links for this page, because it's unretrieved; Links is the connection table that connects pages back to pages, so we delete its rows there.

So we're going to go grab this URL and read it. We're not decoding it, because we're using BeautifulSoup, which compensates for the UTF-8 encoding. Then we can ask for the HTTP status code, and we check it: 200 is a good status, and if we get a bad status, we're going to say there's an error on that page, set the error column, and UPDATE Pages; that way we don't retrieve it ever again. We also check to see if the content type is text/html. Remember, in HTTP you get a content type, and we only want to look for links on HTML pages, so we wipe that row out if we get a JPEG or something like that; we're not going to retrieve JPEGs. Then we commit and continue. Those are the pages we didn't want to mess with.

Then we print out how many characters we got and parse it. We do this whole thing in a try/except block, because a lot of things can go wrong here; it's a bit of a long try/except block. KeyboardInterrupt is what happens when I hit Ctrl+C at my keyboard, or Ctrl+Z on Windows. Some other exception probably means BeautifulSoup blew up, or something else blew up; we indicate that with error = -1 for that URL so we don't retrieve it again. At this point, at line 103, we have got the HTML for that URL. So we're going to insert it, and we're going to set the page rank to 1. The way page rank works is it gives all the pages some initial value and then alters that; we'll see that in a bit. So it sets it to one. We do an INSERT OR IGNORE, which is just in case the page is already there.
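A rough sketch of that retrieval step is below. It assumes the tables from the previous sketch already exist in spider.sqlite and that url would normally come from the random SELECT; the error handling is simplified compared with the real spider.py.

    import ssl
    import sqlite3
    import urllib.request, urllib.error
    from bs4 import BeautifulSoup

    conn = sqlite3.connect('spider.sqlite')
    cur = conn.cursor()

    # Ignore SSL certificate errors so https sites don't stop the crawl
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    url = 'http://www.dr-chuck.com/'   # would normally come from the random SELECT

    try:
        document = urllib.request.urlopen(url, context=ctx)
        html = document.read()         # bytes; BeautifulSoup copes with the encoding
        if document.getcode() != 200:
            # Bad HTTP status: record it so we never try this URL again
            cur.execute('UPDATE Pages SET error=? WHERE url=?',
                        (document.getcode(), url))
        elif document.info().get_content_type() != 'text/html':
            # Not an HTML page (a JPEG, say): drop it, there are no links to find
            cur.execute('DELETE FROM Pages WHERE url=?', (url,))
        else:
            soup = BeautifulSoup(html, 'html.parser')
            # Store the HTML and seed the page with a rank of 1.0
            cur.execute('''INSERT OR IGNORE INTO Pages (url, html, new_rank)
                           VALUES (?, NULL, 1.0)''', (url,))
            cur.execute('UPDATE Pages SET html=? WHERE url=?', (html, url))
        conn.commit()
    except KeyboardInterrupt:
        print('Program interrupted by user...')
    except Exception:
        # BeautifulSoup (or the network) blew up: mark error=-1 so we skip it next time
        cur.execute('UPDATE Pages SET error=-1 WHERE url=?', (url,))
        conn.commit()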
And then we're going to do an UPDATE, which is kind of doing the same thing twice, just doubly making sure: if the row is already there, the INSERT OR IGNORE causes us to do nothing, and the UPDATE causes us to retain the new data; then we commit it so that if we do a SELECT later, we get that information.

Now this next code is similar to things you've seen before. Remember, we use BeautifulSoup to pull out all of the anchor tags. We have a for loop, and we pull out the href. You'll see this code is a little more complex than some of the earlier stuff, because it has to deal with the real nastiness, the imperfection, of the web. So we're going to use urlparse, which is actually part of the urllib code, and that's going to break the URL into pieces. We have the scheme, which is http or https. If it's a relative reference, we make it absolute by taking the current URL and hooking it up with urljoin; urljoin knows about slashes and all those other things. We check to see if there's an anchor, the pound sign near the end of a URL, and we throw everything away from the anchor onward. If we have a JPEG, or a PNG, or a GIF, we're going to skip it; we don't want to bother with those. We're looking through all the links now, and if we have a slash at the end, we're going to chop off the slash by slicing off the last character. So this is just kind of nasty choppage and throwing away of URLs: we're going through a page, and we have a bunch of links we don't like, or have to clean up, or whatever. And now we've made them absolute by doing this; each one is an absolute URL. You write this kind of code slowly but surely: your code blows up, and you start it over, and start it over, and start it over.

Then we check against all of the webs. Remember, those are the URLs that we're willing to stay within, and usually it's just one. If this link would take us off the sites we're interested in, we're going to skip it; we are not interested in links that leave the site. So: link that leaves the site, skip it. But now, finally, here at line 132, we're ready to put this into Pages, the URL and the HTML, and it's all good, right? The HTML is going to be NULL right there, because we haven't retrieved it yet; this is a page we're going to retrieve. We give it a page rank of one and no HTML, and that way it will be retrieved later, and then we commit that, okay? And then we want to get the id. We could have done this one way or another, but we're going to do a SELECT to say, hey, what was the id that was either already there or was just created? We grab that with fetchone, call it toid, and now we're going to put a link in: INSERT OR IGNORE INTO Links. from_id is the primary key of the page that we're going through looking for links, and to_id is the page we just created. And away we run; it's going to go and go and go and go. Let's go look at the CREATE statement up here: from_id and to_id, right there, okay.

So, let's run it: python3 spider.py. It's fresh, so it wants a URL with which to start, and I'll just start with my favorite website, www.dr-chuck.com. This first one you put in, it's going to stay on this website for a while, okay? So I'll hit Enter, and let's just grab one page, just for yuks. Okay, so it grabbed that and printed out that it got 8545 characters and
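Here is one way the link clean-up loop he is describing could look, written as a self-contained helper. The function name harvest_links is made up for illustration, and cur, url, soup, webs, and fromid are assumed to come from the earlier sketches.

    from urllib.parse import urljoin, urlparse

    def harvest_links(cur, url, soup, webs, fromid):
        """Clean up every href on the page and record it in Pages and Links.

        cur is the sqlite3 cursor, url the page just retrieved, soup its parsed
        BeautifulSoup tree, webs the list of allowed site prefixes, and fromid
        the primary key of the page being scanned. The caller commits afterwards.
        """
        for tag in soup('a'):
            href = tag.get('href', None)
            if href is None:
                continue
            # Relative reference? Resolve it against the current page
            if len(urlparse(href).scheme) < 1:
                href = urljoin(url, href)
            # Throw away the anchor (#...) and everything after it
            ipos = href.find('#')
            if ipos > 1:
                href = href[:ipos]
            # Skip images: no links to be found there
            if href.lower().endswith(('.png', '.jpg', '.jpeg', '.gif')):
                continue
            # Chop a trailing slash
            if href.endswith('/'):
                href = href[:-1]
            if len(href) < 1:
                continue
            # Stay inside the sites we were told to crawl
            if not any(href.startswith(web) for web in webs):
                continue
            # Record the page (unretrieved, rank 1.0) and the link pointing at it
            cur.execute('''INSERT OR IGNORE INTO Pages (url, html, new_rank)
                           VALUES (?, NULL, 1.0)''', (href,))
            cur.execute('SELECT id FROM Pages WHERE url=? LIMIT 1', (href,))
            toid = cur.fetchone()[0]
            cur.execute('INSERT OR IGNORE INTO Links (from_id, to_id) VALUES (?, ?)',
                        (fromid, toid))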
six links. So if I go to my SQLite browser, Open Database, go into code3 and then pagerank, and look at this. Let me get out of the spider so it closes. Notice this sqlite journal file; that means it's not done closing, so I'm going to get out of the program by pressing Enter, and you'll notice that the journal file went away. Otherwise, we would not be seeing the final data. There we go, okay.

So, let's take a look at the data. Webs has just one URL; that's the URL that we're allowing ourselves to look at. You can put more than one in here if you want, but most people just leave this as one. Pages: we got this first one, and we retrieved it, and this is the HTML of it. And we found six other URLs in there that are dr-chuck.com URLs, right? There were lots of other URLs in there, but those were the only ones we kept, okay? And what we'll find is that if we go to Links, we'll see that page one links to two, links to three, links to four, links to five, links to six, because Links is just a many-to-many table. So page one points to page two, page one to three, page one to five, and so on. Okay, so that's what happens with the first page.

So let's retrieve one more page. Now, we could have started a new crawl, but it's just going to stay on dr-chuck.com, and I'll just ask for one more page. So now it went and grabbed another one; it randomly picked among these null guys, and I'm going to hit Enter to close it. Then I'll refresh this, and it looks like we retrieved OBI sample and we didn't get any new links. So that page, whatever that OBI sample page was, had no links we could use, so let's do another one, one more page. That one had 15 links, so let's take a look now. It picked this one to do, right, and now it added 15 more pages. And if you look at Links, you will see that page four, which is the one it just retrieved, links back to page one. So now we're seeing where the page rank is going to be cool: four links to one, away we go, right? One goes to four, four goes to one. I should probably have put a uniqueness constraint on that; it's not supposed to have duplicates in there (a possible fix is sketched below).

Okay, so let's run this a bunch of times now. Let's just run it for 100 pages; it'll take a minute. You'll see it's freaking out on certain pages, not parsing them. It's found its way into my blog; it's finding like 27 links. This table is growing wildly at this point, and it's going to take a while before we get to 100, it's kind of slow. Now, the interesting thing is, I can hit Ctrl+C at any point in time, right? And so that blew up, but it's okay, because the data is still there. If you go back to Pages, for example, and refresh the data, we see we got a ton of stuff, and this will restart and all those things. If we sort this by the HTML column, you see that there are lots of pages that we've already got, and it's never going to retrieve those again, because they have HTML. So then I can run this thing again and start it up. And it's not just Ctrl+C: your computer might go down, your network might go down, all kinds of things might happen, and it just picks up where it left off. It just picks up where it left off, and that's what's nice about this. Okay, so that's pretty much how this works. We've got this part running, and we're seeing it flow into spider.sqlite.
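If you did want the uniqueness constraint he mentions, one option (my own suggestion, not what the course code does) is to declare the (from_id, to_id) pair unique when Links is created, so that the INSERT OR IGNORE really does skip duplicate links:

    import sqlite3

    conn = sqlite3.connect('spider.sqlite')
    cur = conn.cursor()

    # Run this before the first crawl: IF NOT EXISTS means it will not
    # change a Links table that already exists without the constraint
    cur.execute('''CREATE TABLE IF NOT EXISTS Links
        (from_id INTEGER, to_id INTEGER, UNIQUE(from_id, to_id))''')
    conn.commit()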
We're seeing that we can start it, stop it, and restart it. So what I'll do is come back in the next video and show you how all these things work together, and then how we actually do the page rank. So, thanks again for listening, and see you in the next video.
