Worked Example: BeautifulSoup (Chapter 12)

Course: Using Python to Access Web Data / Chapter: Programs that Surf the Web (Chapter 12) / Lesson 5

Brief summary

Someone has gone through and figured out all the bad things that could possibly happen when you're reading and parsing HTML. And it's not that the website had a bad URL; it has a certificate that's not in Python's official list. So that gives you a quick summary of using the BeautifulSoup library in Python along with urllib.

English transcript of the lesson

Hello everybody, welcome to Python for Everybody. We’re going to do a little bit of sample code. If you’re interested in getting the sample code, you can download the zip at pythonforeverybody.com/materials.php. You will get all the files, including all the files that I’m looking at here. The one I’m going to play with today is called urllinks.py.
The first thing you’ve got to do before urllinks.py works is install BeautifulSoup, and I’ve got some simple instructions at the beginning of the file. One way to do it is to use Python’s install process to install BeautifulSoup for all Python applications. If you are the owner of your computer and you’re going to use BeautifulSoup a lot, it’s a fine idea to do that. But I want to show you a simpler way, for when you don’t own your own computer and you just want to make BeautifulSoup work: you can download this file right here, bs4.zip, unzip it, and put it in the same folder as the code. If you look in this folder, I have a subfolder called bs4, and that’s the unzipped version of bs4.zip. I didn’t write this code, so I’m sorry if the name is bad. But this is the code for bs4, and it’s in the same folder as urllinks.py.
So what happens is, when you do this from bs4 import BeautifulSoup, Python either goes to sort of this global magic place where it installs stuff and pulls in the BeautifulSoup object, or it goes to the folder bs4 and pulls it in, okay? That’s how that works. So you have to do one of these two things. I prefer to keep it simple: download and unzip this file, put it in the same folder as this code, and away you go.
So, as in the previous example, we’re going to use urllib, of course, and then we’re going to pull in BeautifulSoup from the BeautifulSoup4 library. We’re going to get the BeautifulSoup object.
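To see which of the two places the import would use, a small check like the one below can help. This is only a sketch: locate_package is a hypothetical helper name, not something from urllinks.py, and it relies on the standard library's importlib.util.find_spec to report where a package would be loaded from without importing it.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be imported from, or None.

    Hypothetical helper for illustration; it does not import the package.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


location = locate_package('bs4')
if location is None:
    print('bs4 not found: install it, or unzip bs4.zip next to your script')
else:
    print('bs4 would be imported from', location)
```

If the printed path sits inside the folder that holds urllinks.py, the unzipped bs4 folder is being used; otherwise Python found an installed copy.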
Now, if the websites we’re going to play with use SSL, you pretty much have to do this little hack. Don’t worry too much about these three lines. You can Google it on Stack Overflow and figure it out, but this is the way that you ignore SSL certificate errors. We also have to add this parameter, context=ctx, where ctx is the variable that we create. So this part and this part, sort of just do them. If you aren’t reading HTTPS sites, you can take them out, actually. Otherwise you won’t be able to do HTTPS sites whose certificates Python doesn’t accept.
So let’s take a look at what we’re doing other than dealing with the HTTPS problem. We’re going to ask the user for a URL and retrieve all the HTML. We’re going to do a urlopen, just like we did before. Now, this would return us something we could loop through line by line with a for loop, but instead we’re going to say, hey, read the whole thing. That basically returns us the entire document at that web page in a single big string with newlines at the end of each line. This is not in Unicode; it’s raw bytes, probably UTF-8. But it turns out BeautifulSoup knows how to deal with UTF-8, and it also knows how to deal with Unicode strings. So what we’re saying is, BeautifulSoup, read through this and deal with all the nasty bits, right?
HTML is very, very flexible. Take dr-chuck.com/page1.htm. If we take a look at the source of this (View Page Source, and make this bigger), you might be able to do regular expressions. But HTML does things like break stuff across lines. There could be a line break here. There could be all kinds of things, right? So writing regular expressions or splits or whatever is really hard for HTML. So someone has written this library called BeautifulSoup. This is the code, and the name is based on a joke from a children’s story.
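The SSL hack and the read-the-whole-thing step might look roughly like this. It is a sketch reconstructed from the description above, not a copy of urllinks.py: the variable name ctx and the context=ctx parameter come from the lesson, while fetch_html is a hypothetical helper added so that nothing touches the network until you call it.

```python
import ssl
import urllib.request

# The "three lines": build a context that ignores SSL certificate errors.
ctx = ssl.create_default_context()
ctx.check_hostname = False       # must be turned off before verify_mode
ctx.verify_mode = ssl.CERT_NONE  # do not verify the site's certificate


def fetch_html(url, context=ctx):
    """Return the whole page at url as bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, context=context) as response:
        # read() returns one bytes object instead of looping line by line
        return response.read()
```

Calling fetch_html on the URL the user types in gives back bytes that can be handed straight to BeautifulSoup. Skipping certificate checks is fine for a classroom exercise, but it removes a real safety check, so it is not something to copy into production code.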
Basically, someone has gone through and figured out all the bad things that could possibly happen when you’re reading and parsing HTML. So either you use it, or you will slowly but surely discover all the cases that don’t work. When we look at this line right here, at a high level it is saying: we’re giving you ugly, nasty HTML that could make no sense whatsoever. Please read it. Use all the brains that you have to handle all the weird stuff. Figure that out for us and give us back an object. I happen to call it soup; you don’t have to call it soup. That object is a proxy for the HTML, and this soup object is clean.
So what we can do is retrieve all the anchor tags. We can talk to this object and ask it, give me the anchor tags. What’s an anchor tag? Well, if we take a look at this source, the anchor tag is the a through the /a. That is the tag: it is the tag itself, it is the attributes that are on the tag, it is the text within the tag, and everything. So that’s what we’re going to get. Now, I called it tags, plural, not because plural matters at all, but because we’re going to get a list of tags. A webpage can have lots and lots of tags. If we look at, say, dr-chuck.com and view source (whoa, that’s kind of small; View Page Source, right) and go look for anchor tags, we’ve got 45 of them. And they all kind of have weird stuff in them, right?
So this line will give us back a list of tags. It will give us all the anchor tags in this document. Each tag goes from there to there. Then what we’re going to do is write a loop to loop through all the tags. So it’s basically hopping through the document, sort of like this: hop, hop, hop, hop, hop, hop, and it’s pulling out the text of the href attributes. So it’s going to pull out this bit right here. Whoops, darn, that is so cool, because that’s a flaw. Look at that. This is my own page, and there is no closing quote here.
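The parse, find-the-anchors, and loop steps can be sketched as follows. The HTML here is a small stand-in written for this example (it only imitates a page with one link broken across lines), and 'html.parser' is Python's built-in parser, chosen as an assumption so that nothing beyond bs4 itself is needed.

```python
from bs4 import BeautifulSoup

# Stand-in for the bytes returned by urlopen(...).read(); note the
# anchor tag is split across lines, which trips up simple regex code.
html = b"""<h1>The First Page</h1>
<p>If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.</p>"""

# Hand the messy HTML to BeautifulSoup and get back a clean object.
soup = BeautifulSoup(html, 'html.parser')

# Calling the soup object with a tag name returns a list of those tags.
tags = soup('a')

# Hop through the anchor tags and pull out each href attribute.
hrefs = []
for tag in tags:
    hrefs.append(tag.get('href', None))

print(hrefs)
```

The same loop works unchanged on a real page: only the html variable would come from the network instead of a literal.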
But it’s going to work, because BeautifulSoup is like, I know what to do about that. I can deal with that. So let’s check to see if that one works, because that’s a mistake. But that’s one of the things we like about BeautifulSoup. So we’re going to read through, and then we’re going to pull out all the hrefs. Behind this is probably thousands of lines of code that you really don’t want to write.
So, python3 urllinks.py. Let’s start with a simple one: http://www.dr-chuck.com, and it reads it. No, that’s actually the hard one, because we got a whole bunch. So let’s see if the tsugi one worked. It found that one. It’s right after sekaiproject.org. Where is that? Is there another tsugi? No, it didn’t find that one. That’s kind of funky. Look, it found it wrong, but that’s okay. So you see, it found all these and did a lot of nice stuff for us.
If we run python3 urllinks.py and do the easy one, http://www.dr-chuck.com/page1.htm, we will only see one link, and there we go.
Now, the SSL part matters if you are looking at a page that uses SSL. So, python3 urllinks.py again, and I’ll go to https://www.si.umich.edu/, and that will get a bunch of links. You’ll see all kinds of stuff coming back. If it wasn’t for this bit right here and this bit right here, this HTTPS wouldn’t have worked. And it’s not that the website had a bad URL; it has a certificate that’s not in Python’s official list. So the URL is okay.
So that gives you a quick summary of using the BeautifulSoup library in Python along with urllib.
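To make the difference between a bad URL and a rejected certificate concrete, here is a small sketch. describe_failure is a hypothetical helper, not part of urllinks.py, and the two errors are built by hand so the example runs without the network. It assumes the usual behaviour that urlopen wraps low-level failures in urllib.error.URLError and keeps the cause in its reason attribute.

```python
import ssl
import urllib.error


def describe_failure(exc):
    """Say whether a urlopen failure came from the certificate check.

    Hypothetical helper for illustration only.
    """
    # URLError keeps the underlying cause in .reason
    reason = getattr(exc, 'reason', exc)
    if isinstance(reason, ssl.SSLCertVerificationError):
        return 'certificate problem'
    return 'some other problem'


# Simulated failures, so the sketch runs without touching the network.
cert_error = urllib.error.URLError(
    ssl.SSLCertVerificationError('certificate verify failed'))
name_error = urllib.error.URLError(OSError('name not known'))

print(describe_failure(cert_error))
print(describe_failure(name_error))
```

In a real program the exception would come from a try/except around urlopen with certificate checking left on.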
