12.4 - Retrieving Web Pages
So it might be amazing that we can use sockets to write a ten-line program that retrieves a web page, but, hey, this is Python. We’re trying to make this as easy as possible, and we don’t want to repeat ourselves. DRY: don’t repeat yourself. So there’s another library, called urllib, that wraps that socket stuff and does it for us automatically.
It’s a bit longer to type. We import some library bits and then we say urllib.request.urlopen. The URL we pass in is a string, not bytes, if you remember from the last one. It’s just a plain old URL. urlopen is going to parse the URL. It’s going to figure out what server to talk to, what document to retrieve, what HTTP version, the GET request, all that stuff. That’s all inside it because it turns out that it’s the same every time you do a web request, right? So why not write code to do it once.
And so this opens it up and returns us a file handle. It’s kind of like the normal old file handle that you would get if you opened a file. So this line right here is almost the same as open for a file, right? It gives us back a handle. It doesn’t actually read the data yet, but it gives us sort of an opening so that we can run the reading.
Now a common thing to do, certainly not the only thing we can do, is to run a for loop: for line in this handle. That’s going to open the URL, read the data, and iterate with a for loop once through each line. Now, each line this iteration gives us is actually bytes, not a string. So we do have to do the decode to get ourselves a string version of it. And then we’ll do something like rstrip or strip or whatever it is we want to do. But at this point we’ve got each line of this stuff as a string. And so that’s really simple, right? This is what the output would be.
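To make the pattern concrete, here is a minimal sketch of the open, loop, decode, strip sequence just described. It is not the slide's code: the helper name read_lines, the sample text, and the data: URL used so the sketch runs without a network connection are all assumptions made for illustration.

```python
import urllib.request
from urllib.parse import quote


def read_lines(url):
    """Open url like a file and return its lines as stripped strings."""
    lines = []
    with urllib.request.urlopen(url) as fhand:
        for line in fhand:                       # each line arrives as bytes
            lines.append(line.decode().strip())  # decode to str, drop the newline
    return lines


# Hypothetical stand-in document, packed into a data: URL so nothing
# has to be fetched over the network.
SAMPLE_TEXT = "first line\nsecond line\nthird line\nfourth line\n"
SAMPLE_URL = "data:text/plain;charset=utf-8," + quote(SAMPLE_TEXT)

for text in read_lines(SAMPLE_URL):
    print(text)
```

Passing the http:// URL of a plain text file to read_lines should behave the same way, since urlopen hands back a file-like handle in both cases.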
It’s going to open this thing up. Now you’ll notice the headers aren’t here and it just shows you the data. You know that there were headers, but it turns out urlopen eats them and remembers them, and you can actually ask for them later. You can say, “hey, give me the headers” if you want, but most of the time you’re not interested in the headers. It just skips down to the data, goes through the for loop four times, and prints out the four lines that are in that file.
So that’s pretty impressive because it’s like four lines of code and we’re reading a web page in Python. And that’s probably the easiest way in any language that I know of to actually get the data off of a web page and just treat it like a file. So that’s, I think, really and truly amazingly beautiful and simple: to take this whole Internet, knowledge, architecture, HTTP and all that stuff, and roll it into one import statement and three lines of code.
And so it treats this like a file. If you start thinking about this, instead of doing open statements you do urlopen statements, then you write whatever loop you would write, and then you can handle data on the Internet. So this is no longer a web page, even though you can hit that URL. This is like a file on the Internet. It’s an Internet file, as it were.
And then what we do is we open this thing up, and we make a dictionary like we did before. Then we read through it and get each line. The only thing we have to do a little differently, because this is coming from that scary outside UTF-8 world, is decode it before we use it. The line is a byte string as compared to a character string. So the decode says, take the byte string and turn it into a Unicode character string. After that all the string methods work on it because decode gives us a string. So we just split it, and the pieces are all strings. You only do the decode once.
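A rough sketch of the word-counting loop being described, under a few assumptions: the function name count_words, the sample sentence, and the offline data: URL are made up for illustration. It also shows one way to ask the handle for a remembered header through its headers attribute.

```python
import urllib.request
from urllib.parse import quote


def count_words(url):
    """Count the words at url; also return the Content-type header."""
    counts = dict()
    with urllib.request.urlopen(url) as fhand:
        # The headers are not part of the loop; ask the handle for them.
        content_type = fhand.headers["Content-type"]
        for line in fhand:
            words = line.decode().split()   # decode once, then use str methods
            for word in words:
                counts[word] = counts.get(word, 0) + 1
    return counts, content_type


# Hypothetical sample text in a data: URL so the sketch runs offline.
SAMPLE_TEXT = "the cat saw the dog\nthe dog ran\n"
SAMPLE_URL = "data:text/plain;charset=utf-8," + quote(SAMPLE_TEXT)

counts, content_type = count_words(SAMPLE_URL)
print(counts)
print(content_type)
```

Apart from urlopen and the decode call, the loop is the same one you would write for a local file.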
The first moment you touch that data and pull it in, you decode it, and then pretty much you work with it as a string from that point on. Now you have an array of words, you iterate through the words in the array, you do the count thing, and then you print it out. So the only real lines that changed were these two lines, and we had this little decode. But other than that, this is identical to opening a file, reading through all the words in the file, and counting them up. And again, that’s why we like to use Python to go get web data.
Now we don’t have to just read text files, we can read HTML files. Just tell it, go give me that HTML file, and then write a loop. So now we have basically built a web browser in four lines of code, and this will print out the content of the web page. Again, remember, the headers are there but they don’t come out in this for loop; you have to ask for them in a different way. And so that’s it.
And we could write another loop where we actually read the stuff, go look for href equals quote, and then pull this out, perhaps with a regular expression or a split statement or some kind of find operation, because we’re good at strings by now. We’re good at tearing things apart. I could give you an assignment to go look through pages like this: look for href equals double quote, then some stuff, then the closing double quote, and pull that stuff out. And then what you do is you go back up and ask for that link, right? So here’s another link. Go ask for that link. OK? So you could write a program that pulled down a web page, looked for all the links in that web page, and then pulled down all those web pages, and on and on and on. And so then you have written Google. You’ve written a Python program.
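The href hunt described above could be sketched with a regular expression like this. The function name find_links, the sample HTML, and the example.com links are hypothetical, and the page is served from a data: URL so the sketch runs offline. A simple pattern like this only catches double-quoted href values.

```python
import re
import urllib.request
from urllib.parse import quote


def find_links(url):
    """Return the values of double-quoted href attributes found at url."""
    with urllib.request.urlopen(url) as fhand:
        html = fhand.read().decode()        # bytes in, one str out
    # Non-greedy match: everything between href=" and the next double quote.
    return re.findall(r'href="(.+?)"', html)


# Hypothetical page; the links are placeholders, not real course pages.
SAMPLE_HTML = (
    '<h1>First Page</h1>\n'
    '<p>See the <a href="http://example.com/page2.htm">second page</a>\n'
    'or the <a href="http://example.com/page3.htm">third page</a>.</p>\n'
)
SAMPLE_URL = "data:text/html;charset=utf-8," + quote(SAMPLE_HTML)

for link in find_links(SAMPLE_URL):
    print(link)
```

Feeding each returned link back into find_links is the "on and on" step, though a real crawler would also need to remember which pages it has already visited.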
I don’t know if it’s true, but because Python is a very popular language at Google, I’m going to guess the first web crawler was, you know, a few hundred lines beyond this. And by the end of this class, if you keep on going, we’ll actually write a Python web crawler using a database and a whole bunch of other stuff. That’s a really tiny, light version of what Google is doing when it’s trying to make a full copy of the entire Internet on its own servers.
So up next we’re going to talk about how you can more efficiently take apart the HTML, look for various things, and print those things out. Because it turns out that HTML is so ugly and so inconsistent that things like regular expressions don’t always work very well with it. So we’ll talk about some techniques for parsing HTML next.