13.4 - Parsing XML

Course: Using Python to Access Web Data / Chapter: Web Services and XML (Chapter 13) / Lesson 4

Brief description

stuff is a tree of information that's parsed and gives us methods and attributes that we can use to go through the data. Go find the id tag and then grab the text field, so that's going to print out 001, and then from the item itself we can get the attribute that is directly on it. Up next, we're going to talk about a more lightweight way to store data called JSON, or JavaScript Object Notation.

  • Reading time: 10 minutes
  • Level: intermediate

English transcript of the lesson

So now we're going to move into writing code in Python to deal with XML. It's not too difficult because, like most of the things we do in Python, the first thing we do is a really clever import statement that does most of the work for us. This imports the library xml.etree.ElementTree, and ET becomes an alias for it. The "as" syntax creates that alias, a short form so we don't have to type the long name every time.

Normally we would be reading all of this data with urllib and read and then we would parse it, but to fit everything on one screen I've kept it simple, and so I have a string. Now this is a syntax that you probably haven't seen before: the triple-quoted string. A triple-quoted string in Python is a potentially multi-line string. That's the beginning of the string, and the string ends down here. The newlines in between are part of the string. So this is as if we had read this bit of stuff, from here to here, in from a file or in from the web. It's just my way of emulating a urllib and then a read so we can look at it all on one screen.

So here's our XML. You can see that it's well-formed XML: we've got a beginning tag and an ending tag, the same kind of stuff I've been showing. Now we have to parse it. This is kind of like what we do with HTML and Beautiful Soup: we take the string data, get an object back, and then work with that object. So we take this string data and pass it in to ET.fromstring. What fromstring says is: take this string and give us back, basically, a nice tree. Think back to those tree pictures: give us back that tree and make sense of it. It's still got the same things, like Chuck and the phone number 303. All that stuff is in there, right? It has just been constructed as an internal memory structure inside of Python.
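As a minimal sketch of the step just described, the snippet below parses a triple-quoted string with ET.fromstring. The XML document is reconstructed from the lecture's spoken description, so its exact layout, tag spelling, and the full phone number are assumptions rather than the slide's actual content.

```python
import xml.etree.ElementTree as ET

# Sample XML reconstructed from the lecture's description (an assumption,
# not a copy of the slide). The triple quotes let the string span lines.
data = '''<person>
  <name>Chuck</name>
  <phone>+1 734 303 4456</phone>
  <email hide="yes"/>
</person>'''

# fromstring parses the text and returns the root Element of the tree.
tree = ET.fromstring(data)

print(tree.tag)            # tag name of the root element
print(len(tree))           # number of direct child elements
print('\n' in data)        # the newlines really are part of the string
```

The returned object is the root element itself, so the rest of the tree is reached by calling methods on it.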
And that's what we get back from ET, and it goes into this tree variable right here. So we've got this tree of information that's properly parsed. Now, this could blow up. This could traceback. If you have a syntax error, like you didn't put the slash in or something, this would fail. Say you got bad HTML or bad XML. So that's something you have to watch for. But when it's all said and done, if this line of code succeeds, then you have good XML and you can make sense of it, okay?

So what we can do is say: within that XML data, go find me the tag name. That's what tree.find('name') does. If you think of this as a little picture, there is the tag named name, and remember its child text node was Chuck. The whole tag is this. To get down to just the Chuck bit, we say .text. So if you want the text that's in between the name tag and the end name tag, you say tree.find, the tag name, and then .text. That text is an attribute of this particular node that gives me back that thing.

And if you want to get an attribute, not the text node, you can say, okay, go find me the email, which is this. If you look at the email, email is the node, and it has an attribute of hide with the value yes. And there is no text, right? Because this is a self-closing tag, so the text doesn't exist. So we say tree.find('email'). That gets us this whole thing, and then we call the get method on it and say get('hide'). That says go find me the attribute named hide within the email tag, and that gives us back yes. So this whole expression, tree.find('email').get('hide'), gives us yes. That allows us to work our way down through some XML and pull stuff out of it, and that's what we have to do when we're in a program. Okay? And so that's the syntax for parsing XML.
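The find, .text, and get calls described above might look like the sketch below. The sample XML is again an assumed reconstruction of the lecture's example, and the malformed string at the end is a hypothetical input added only to show how a parse failure surfaces.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the lecture's sample document.
data = '''<person>
  <name>Chuck</name>
  <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)

name = tree.find('name').text           # text between <name> and </name>
hide = tree.find('email').get('hide')   # value of the hide attribute
email_text = tree.find('email').text    # self-closing tag has no text: None

print('Name:', name)
print('Attr:', hide)

# Hypothetical bad input: the closing tag is missing its slash,
# so fromstring raises ET.ParseError instead of returning a tree.
try:
    ET.fromstring('<person><name>Chuck<name></person>')
    parsed_ok = True
except ET.ParseError as err:
    parsed_ok = False
    print('Bad XML:', err)
```

Catching ET.ParseError is one way to avoid the traceback when the data comes from a source you do not control.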
And if you go online you'll see lots and lots of examples of how to pull data out of XML. When I'm writing code like this I tend to have to print out tree.find('email') and then I get a few things back. These expressions kind of get long as you're working your way down a tree of XML, and then you find the thing. So don't expect that you can necessarily write this code perfectly the first time. You sort of write a little bit, then add a little bit more, then add a little bit more, and then finally you see the thing that you want to get out of your tree of data.

So you can have either a simple tag that has a single child, or you can have a tag that has multiple child tags. And we use a different approach if there are multiple child tags. Here we have, again, the triple-quoted technique, where the big outer tag is stuff, there is a users tag below that, and then there are a number of user tags. The idea is that there could be many users: user, user, user, dot dot dot. Each user has an x attribute, like x equals 7, an id, and a little bit of data, etc., etc.

Now we want to write code that's going to go through each of these user tags, and so we're going to use the findall method. Again, we take all of the text, we pass it into fromstring, and we get back an object. stuff is a tree of information that's parsed and gives us methods and attributes that we can use to go through the data. That's what this does: we take it from the outside world to the inside world in Python. Now what we're going to do is call the findall method on it. We're going to search for all of the user tags below the users tag. That's what findall means: there's a bunch of user tags under users, so find all of them and give them all to me. What you basically get is these tags, except in a list. Right? So these tags are in a list.
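A rough sketch of the findall step follows. The document is reconstructed from the values mentioned in the lecture (x of 2 and 7, ids 001 and 009, names Chuck and Brent), and both the lowercase tag spelling and the path string 'users/user' are assumptions about what the slide shows.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the lecture's second sample document.
xml_text = '''<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(xml_text)

# The path is relative to the root: every <user> directly under <users>.
lst = stuff.findall('users/user')

print('User count:', len(lst))
print([elem.tag for elem in lst])   # a list of elements, not strings

# A bare 'user' only matches direct children of <stuff>, so it finds none.
print(len(stuff.findall('user')))
```

The contrast with the bare 'user' path is a common stumbling block: findall with a plain tag name does not search the whole tree.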
Not just the words Chuck and Brent, but the whole tag. In a sense, it's a list whose items are themselves little trees, right? That's a tree. That's a tree. So this is a list of tags, which are little mini trees of information. It's not a list of strings, it's a list of tags. But we can ask how long it is. So we'll print how many there are, and then we can loop through them. It's just a little list, right? It's a list of two things: a little tree here, a comma, and another little tree.

So we're going to write a for loop to go through that, and we're going to have a variable, item, that iterates through each of these things. I could have called it tag, for tag in the list of tags; that would make sense too. So item is going to take on the successive values of this list. This for loop is going to run twice, once for this and once for that. And lst is that data structure.

So now when we come in here we're going to have a little tag. The first user tag looks like this, with x equals 2, then a child tag of id and a child tag of name, and the first one is going to be 001 and Chuck. So item is going to point to this, and we can do those same kinds of things. We can say within item find the name tag, so that grabs this bit out, and then go grab the text. So this bit prints out Chuck. Same thing for the id tag: go find the id tag and then grab the text field, so that's going to print out 001. And then we can go to the item itself and get the attribute that is directly on it.

And then the loop runs again, and item is now pointing at this bit right here, the second user tag. It says go find the name tag and then find the text within the name tag, and that prints out Brent. Then it runs this line: go find me the id tag.
Find the id tag and then grab the text out of that, so 009 is going to print out there. And then from the original tag go find the x attribute. That's what item.get("x") is, and so that's going to be 7. So that's going to pull that out.

So in those two examples, I've shown you how you dig through a tree or loop through a list of trees. Those are the two basic things that you tend to do when you're parsing XML: either cruise down a tree, or get a list of trees and then cruise down those trees. Sometimes you have lists within lists and trees within trees, and it can be very complex. Your programs can get complex too, but sooner or later you get them working, and this is why the schema is so important. Because once your code works, if they change that structure your code tends to blow up badly, and so you want to yell at the other person. You say, why did you change the XML? And they say, well, I didn't change the XML. And you say, wait, here is the schema that proves that you changed the XML.

That kind of gets us through XML. There are a lot of challenges to XML, but XML is a very rich way to serialize data. Up next, we're going to talk about a more lightweight way to store data called JSON, or JavaScript Object Notation.
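To recap the second example from this lesson, the loop over the user tags could be sketched as follows. The XML is an assumed reconstruction from the values spoken in the lecture, and the rows list is a hypothetical addition used only to collect the results for inspection.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the lecture's second sample document.
xml_text = '''<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(xml_text)
lst = stuff.findall('users/user')
print('User count:', len(lst))

rows = []  # hypothetical helper list, not part of the lecture's code
for item in lst:
    name = item.find('name').text   # text inside the <name> child
    user_id = item.find('id').text  # text inside the <id> child
    x = item.get('x')               # attribute on the <user> tag itself
    print('Name', name)
    print('Id', user_id)
    print('Attribute', x)
    rows.append((name, user_id, x))
```

Note that text and attribute values come back as strings, so the ids keep their leading zeros and x would need int() before any arithmetic.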
