12.2 - Hypertext Transfer Protocol (HTTP)

Course: Using Python to Access Web Data / Chapter: Networks and Sockets (Chapter 12) / Lesson 2

Brief summary

HTTP is just a set of rules, so that we know what to do first and what syntax we expect to produce and consume, and so that different vendors can work together. It's a form of standard. Many years ago, in the 90s, you had to know the protocol, the host, and the document separately, but then we concatenated them all together, and that became the Uniform Resource Locator, or, hey, type this URL into your browser. The browser reads the HTML that comes back, parses it, and follows a bunch of rules about where to add blank lines and so on, so that the page looks the way you want.

  • Reading time: 14 minutes
  • Level: very hard

English text of the lesson

So we just took a look at the transport layer, the lowest end-to-end layer in the TCP/IP stack. We wrote a Python program, and in that program we made a socket and connected it to a particular port on a far-away computer. Now that we've made the connection, we're going to start sending data back and forth, and that moves us from the transport layer up to the application layer. The application layer means that talking to a mail server is different from talking to a web server. There are rules that describe how we talk to each of them. They are the rules of the road.
Think of a telephone call. At least in Western cultures, when the phone rings, the person who picks it up is supposed to say hello first. You may not notice this, but that's how it works. The person who picks up the phone says hello, then the caller says hello, and then hopefully you start talking. Sometimes, if things don't work well, you don't hear it and you're like, "Are you there?" It gets kind of confusing if the phones, especially cell phones, aren't reliable. What we say into phones to get our conversation started is like an application protocol.
The protocol that we're going to play with in this segment is the Hypertext Transfer Protocol, or HTTP. It's the dominant application-layer protocol on the Internet, and it really was invented to retrieve web pages. At the moment of its inception it wasn't thought of as the greatest protocol ever, but it has evolved into an amazing protocol. What happened was that it was so simple that we could just layer new ideas on top of it. So we started with this really basic, simple application protocol, and away we go. It's a set of rules that allows browsers to retrieve documents from the Web.
If you wanted to write a browser, you could. You would just have to read the specification for what the servers are going to feed you. And if you're going to write a server that talks to browsers, you would read that same HTTP specification. So again, it's just a set of rules so that we know what we're going to do first and what syntax we expect to produce and consume, and so that different vendors can work together. It's just a form of standard.
One of the things that came with HTTP, which was really cool, was this idea of Uniform Resource Locators, or URLs. We type them so much that we just think of them as, oh, you type this thing in to get this thing in your browser. But they actually contain some information: "http://" says use the HTTP protocol, "www.dr-chuck.com" says go to this host, and "/page1.htm" says go get this document. Many years ago, in the 90s, you had to know all these things separately, but then we just kind of concatenated them all together, and that became the Uniform Resource Locator, or, hey, type this URL into your browser.
So you have a document, it's got some links in it, and in the HTML each link has this thing called an href value. Every time you click on a link, you're telling your browser, get me a different page. That's the hypertext bit: in any document there are links that go to other documents. These links are the magic of the Web. There were ways to access data from servers before, but the notion that the document you have has links to other documents is a powerful one. We take it for granted now, but when it first came out in the mid 1990s, '93, '94, '95, it was like, whoa, this is better than the things we've been doing. I mean, otherwise we learned these weird commands and did this stuff.
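To make the three pieces of a URL concrete, here is a minimal sketch that pulls apart the example URL from the lecture using Python's standard urllib.parse module. The fallback to port 80 is an assumption based on HTTP's usual default, since this URL does not name a port.

```python
from urllib.parse import urlsplit

# The example URL from the lecture.
url = "http://www.dr-chuck.com/page1.htm"
parts = urlsplit(url)

print(parts.scheme)      # which protocol to use
print(parts.hostname)    # which host to contact
print(parts.path)        # which document to ask for

# The URL names no port, so parts.port is None; assume HTTP's default of 80.
port = parts.port or 80
print(port)
```

These are the same three things a browser works out from a link before it opens a socket.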
So when you click, the browser sends what's called a GET request to get the document, retrieves the document, parses it, and displays it for you. This is a little bit of a diagram of how that works. You're sitting there looking at a web page, and you click on a link. I made it the blue link; it says Second Page. The browser is a piece of software running on your computer. It intercepts that click and says, "Oh, you've clicked on something," and it looks at the HTML of the page you're coming from to figure out what web server to connect to, what port to connect to on that web server, and what document to retrieve.
Your browser then makes a socket connection to port 80 and sends a GET request. The web server parses that request and figures out what document you're looking for. It might run a little bit of software, but when it's all said and done, it produces a response on that same socket and sends it back. The response is in the form of HTML, the HyperText Markup Language, which is really tags inside less-than and greater-than pairs: "h1" says this is a level-1 header, "p" says it's the start of a paragraph, and the "a" tag says this is an anchor, so it's supposed to be clickable text that leads to the next page. That comes back, and your browser reads the HTML, parses it, and follows a bunch of rules about where to add blank lines and all these other things so that the page looks the way you want.
That is called the request/response cycle: you click, the request goes to the server, data comes back, and the browser shows it to you. You basically see click, new page. But there's a lot going on behind the scenes when that happens.
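As a rough illustration of how a browser finds the link you clicked, the sketch below uses Python's standard html.parser module to collect the href value of every "a" tag. The page text and the page2.htm address are made up for this example; only the h1, p, and a tags come from the lecture.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Remembers the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# A hypothetical first page with one link to a second page.
page = """<h1>The First Page</h1>
<p>If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p>"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)   # ['http://www.dr-chuck.com/page2.htm']
```

A real browser does far more than this, but the href it extracts is what tells it which server, port, and document to request next.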
All of the rules about exactly what is sent, exactly how it is sent, and how those strings are put together are written down in standards. There's a whole series of them, and thankfully they're free and open and available for you to read. They're long and complex, but you can look at them. A group was formed many years ago to start building these standards. Each one is numbered, and they're called RFCs, Requests for Comments. The name is a bit tongue-in-cheek: even though there is an RFC that guides how millions or billions of browsers work with hundreds of thousands or millions of servers, and these are pretty solid ideas, there can always be room for improvement. That's what Request for Comments means: no matter how perfect we think these engineering standards for the Internet are, they could always be improved.
If you read long enough, you would find the one called RFC 2616, which describes the HTTP protocol. So if you're writing a browser, you're going to read the HTTP specification, and you've got well over a hundred pages to read. It'll take you a while; it's probably easier to just download a free browser than to make your own. But let's hypothetically say you're going to do this. You'll be paging through and paging through, and you get to this section and you're like, oh, this is the syntax of a request from the client to the server: the first line of that message includes the method to be applied to the resource, the identifier of the resource, and the protocol version in use. And then we look and we see, oh, here's a sample of one of those, right? GET, G-E-T in capital letters, then a space, then a URL, a Uniform Resource Locator, then a space, and then a protocol version. So we connect, and then this is the line that we send.
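The three-part request line just described (method, resource identifier, protocol version, separated by single spaces) can be sketched in a few lines of Python. The function names here are hypothetical helpers for illustration, and HTTP/1.0 is assumed as the version.

```python
def build_request_line(method, url, version="HTTP/1.0"):
    """Join the three parts of a request line with single spaces."""
    return f"{method} {url} {version}"


def parse_request_line(line):
    """Split a request line back into (method, target, version)."""
    method, target, version = line.split(" ", 2)
    return method, target, version


line = build_request_line("GET", "http://www.dr-chuck.com/page1.htm")
print(line)                      # GET http://www.dr-chuck.com/page1.htm HTTP/1.0
print(parse_request_line(line))
```

A web server does the parsing half of this when your request arrives.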
That's a request for a document. It turns out you can send it by hand if you have the program telnet. Macintosh people have this, Linux people have this, and Windows people can install it; go find out how to make telnet work on Windows. Telnet is like a prehistoric piece of software. The reason they don't have it on Windows is that they think it's probably a security hole. They might be right, but they took it off. It's prehistoric because it's a way to connect to any port on any server, in an insecure manner, and send stuff to it. What you type on your computer is telnet, then the host, then the port. By picking this port I'm saying I want to connect to the web server, and it connects up.
Now, some web servers are impatient because they expect to talk to browsers, so if you take too long to type, they will say something like, "You took too long to type. You're just a human, you're cheating." But if you type fast enough (it might help to cut and paste), you type this exact HTTP command with exactly this syntax, hit enter, and then hit enter once more on an empty line. Those two lines are enough to convince the web server to send you a page back.
It sends you back two chunks of information. First come the headers, which are metadata about the file you're about to get, including what kind of file it is; here it says, oh, this is just a text/html file. Then comes a blank line, which separates the headers from the content, then the content of the file, and then the connection is closed. The "connection closed" message is not part of the response; it's just telnet saying the connection got closed. The content is the page that gets shown, with some stuff on it and some more links, et cetera.
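To show how the blank line divides a response, here is a small sketch that splits a response into its status line, headers, and body. The response text is a made-up sample, not real server output; only the text/html content type comes from the lecture.

```python
# A hypothetical response, written out the way it travels over the socket.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Connection: close\r\n"
    "\r\n"
    "<h1>The First Page</h1>\r\n"
)

# The first blank line separates the headers from the content.
head, body = raw.split("\r\n\r\n", 1)

# The first line of the head is the status line; the rest are headers.
status_line, *header_lines = head.split("\r\n")

headers = {}
for header_line in header_lines:
    name, value = header_line.split(":", 1)
    headers[name.strip().lower()] = value.strip()

print(status_line)
print(headers["content-type"])
print(body)
```

Everything before the blank line is metadata about the document, and everything after it is the document itself.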
So this is the request/response cycle, except that normally it's a browser making the socket connection, sending a GET, getting headers back, getting the body back, and then making a pretty page out of that body. And this is how real people hack into real computers: they actually make connections and they send stuff on those connections. There's this famous scene in Matrix 2, I think, where Trinity is hacking into the back of the power grid. Most security movies up to that point postulated that when security people break in, they do it with these really cool user interfaces. But it turns out that in the real world they usually have really lousy user interfaces. It's kind of like the command line that I keep trying to tell you to use in this class. So this actually is an interesting scene. You can go to this URL and take a look at it, and it actually uses real security cracking software. It was the first of its kind to show in a movie how people really come in the back door of computers and do stuff. So it's kind of just an interesting thing. I'm trying to show you how to become an expert in all this stuff, and all this sneaky, clever, highly sophisticated stuff often has very simple user interfaces.
So if we're going to do that same thing, which is make a connection to a port, send a GET request, and then get some data back, we can do it in Python. We start with those first three lines: import socket, create the socket, and connect the socket. When you first create the socket, it's sort of like a porthole that lets you out. It's like a doorway out of your computer, but the doorway's not open and it's not connected to anything yet. That's kind of like a Matrix thing too, right? There's a doorway, but what's the doorway connected to? Hmmm.
There's a couple of Matrix scenes that come to mind all of a sudden. Okay. Well, whatever. That's what the socket call does: it makes the doorway, but there is nothing connected. Then the connect call reaches out of your computer. It goes and finds the server, connects to port 80, and establishes the connection. This could fail if the server doesn't exist. When this line is done, we have a socket and it's connected to a server. You know that the server's there and that there's software on the other end of it; otherwise, the connect would have failed. But even if the connect works, you haven't sent any data yet.
Now that the socket is connected, you can call methods on the socket object, like send and receive, to send data across it or receive data from it. Part of the application protocol is what you do first: do you send or receive? It turns out that with HTTP, the server does a receive first and you do a send first. Those are the rules. So the first thing you do is make up a request, and this is just a string. We have to prepare it for sending, and I'll talk in the next section about how this encode works. You'll notice there are two newlines at the end of the string. That's the enter, enter you did in telnet: GET blah, blah, blah, enter, enter. We encode it, and then we send it. That means the server receives the request, goes and reads some files and does some stuff, and then it starts sending data back.
Receive is a method on the socket object, and it might take several receives to get all the data, so we use a while loop. We're just going to print this stuff out onto our screen. Each time we receive up to 512 bytes. If we get no data, that means end of file or end of transmission, so we break out.
If we did get data, we decode it, and we'll talk about that in a second. Decoding is taking data from the outside world and interpreting what it means internally for us. This loop runs a bunch of times until it hits end of file, and then we close the socket, which tears all this stuff down, because a connection takes up resources on your computer and on the far end's computer as well. So mysock.close() closes that, and that's kind of it. That basically is the request/response cycle in Python, and it's only like, what, ten lines of code. It's really impressive that Python is capable of doing that.
What we'll get for the output is the same kind of stuff we got from telnet. The loop reads this stuff, decodes it, and prints it, and it'll be header, header, header, header, however many there are (that's the metadata), then a blank line, and then the text. Okay? So it's the exact same thing. Those Python commands did the same steps: make a connection to port 80, send a GET request, send a blank line, wait and read data, and then print that out onto our screen. Next I want to talk to you about that encode and decode bit, because it's kind of important.
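The program walked through above is not reproduced in this transcript, so what follows is a reconstruction from the description, not necessarily the exact code shown in the lecture. Wrapping the steps in functions named build_request and fetch is a choice made here, as are the use of sendall instead of send (so the whole request is guaranteed to go out) and decoding once at the end instead of inside the loop.

```python
import socket


def build_request(url):
    """Return the GET request as bytes, ready to send.

    The request line is followed by a blank line, which is the
    "enter, enter" from the telnet session.
    """
    return f"GET {url} HTTP/1.0\r\n\r\n".encode()


def fetch(host, url, port=80):
    """Run one request/response cycle and return the raw response text."""
    # Make the doorway out of the computer; nothing is connected yet.
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Reach out to the server; this raises an error if the connect fails.
    mysock.connect((host, port))
    # With HTTP the client sends first and the server receives first.
    mysock.sendall(build_request(url))

    chunks = []
    while True:
        data = mysock.recv(512)      # receive up to 512 bytes at a time
        if len(data) < 1:            # no data means the server closed the connection
            break
        chunks.append(data)

    # Tear down the connection to free resources on both ends.
    mysock.close()
    # Headers, a blank line, then the document.
    return b"".join(chunks).decode(errors="replace")
```

Assuming the example host from the lecture is still reachable and serving that page, calling print(fetch("www.dr-chuck.com", "http://www.dr-chuck.com/page1.htm")) would print the headers, a blank line, and then the HTML, much like the telnet session.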
