11.2 - Extracting Data

Course: Using Python to Access Web Data / Chapter: Regular Expressions (Chapter 11) / Lesson 2

Brief summary

Up to now we've just been playing with search, which gives us back a true or a false depending on whether it matches or not, but now we're going to actually pull stuff out. This is kind of like a split and a for loop and checking to see if it's a number, a whole bunch of stuff all rolled into one little program. And here's a little bit of code that uses regular expressions to both pick lines and extract data.

  • Reading time: 15 minutes
  • Level: very hard

English transcript of the lesson

So now we’re going to talk about extracting data. Up to now we’ve just been playing with search, which gives us back a true or a false depending on whether it matches or not, but now we’re going to actually pull stuff out. And so we’re going to start by looking at a new regular expression, the square bracket. The square bracket is kind of weird in that it is one character: what’s in between the square brackets describes what we mean by a single character. We can have a range in here, or we can have a list of things, like A, E, I, O, U would be vowels. 0 through 9 is digits, so [0-9] is a single digit. But then we added a + to it, and that says one or more digits. Now if we put a star, that’s zero or more digits, which is kind of silly. But one or more digits.
And now we’re going to use a function in the regular expression library called findall. What we’re saying here is, this is the string we’re looking through, x, and we’re looking for the pattern one or more digits. And so it’s going to look and say, oh, let me see, one or more digits. That looks good, I like that one. Let’s keep looking. That’s good. And let’s keep looking, and that’s good. It may find zero, it may find one, or it may find more than one. What it does is run all the way through the text that you’ve asked it to look through, checking to see where this matches, and it gives us back a list of the matches. So it extracts out the pieces. This is kind of like a split and a for loop and checking to see if it’s a number, a whole bunch of stuff all rolled into one little program. And if findall finds nothing, it will give us an empty list. But in this case, it’s given us three strings. Now they’re not numbers. This is the string 2, this is the string 19, and that’s the string 42. But that’s what we get back. We get back a list from findall of all the matches. Okay?
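A minimal sketch of the findall call described above. The sample string is an assumption, chosen so that it contains the numbers 2, 19 and 42 mentioned in the lecture; `re.findall` is the standard-library function.

```python
import re

# Sample text (assumed for illustration)
x = 'My 2 favorite numbers are 19 and 42'

# [0-9]+ means one or more digits; findall returns every match as a string
y = re.findall('[0-9]+', x)
print(y)  # ['2', '19', '42']

# The results are strings; convert them if you need numbers
numbers = [int(s) for s in y]
print(numbers)  # [2, 19, 42]
```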
Pretty powerful. Okay, and so, it returns zero or more things. In the last case we asked for one or more digits. In this case, I’m saying one or more of a single character that is an uppercase vowel, AEIOU all upper case. Plus means greater than or equal to one, so there’s got to be at least one. And you say, are there any uppercase vowels in here? No, no, no, no, no, so it doesn’t find any, and I get back nothing. It has to give me a list: find all of the substrings that match that regular expression and give them back to me. There were none, so you have an empty list. So you do have to check to see how many things you got back, because you might get 1, you might get 0, you might get 25 things back from a particular regular expression when you give it a line.
Now, as you are thinking about this, you can think of the regular expression as almost like a stamp, where it’s going stamp, stamp, stamp: does this piece work, does this piece work, does this piece match, does this piece match. And the problem is, there is this notion in the matching called greedy matching: unless you say otherwise, the regular expression library attempts to give you the largest possible version of the string that you’re matching. So here we have: the first character is an F, then any character, one or more times, and then a stop with a colon. And if this is the text that we’re looking at, you would say, yeah, there’s the beginning F and there’s characters and there’s a colon, we’re done. The problem is that it doesn’t stop there. It’s like, oh wait a sec, technically this also matches. So what do we get back? Do we get back the From, or do we get back the whole thing? And greedy matching says you are going to get back the larger thing, and that’s exactly what you get. And so, all else being equal, you’ve got to be careful when you construct these things.
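The two behaviors just described, an empty list when nothing matches and the greedy default, can be sketched as follows. Both sample strings are assumptions made up for illustration.

```python
import re

# Assumed sample text: it contains no uppercase vowels
x = 'My 2 favorite numbers are 19 and 42'
vowels = re.findall('[AEIOU]+', x)
print(vowels)  # []

# Assumed sample text with two colons
line = 'From: Using the : character'
# ^F.+: is greedy, so .+ stretches out to the last colon
greedy = re.findall('^F.+:', line)
print(greedy)  # ['From: Using the :']
```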
Now I could’ve put non-blank in there, but I’m doing this to make the point that, in a sense, this is pushing. That’s the greediness: this wants to be as big as it can possibly be and still match the entire expression. So if you’re thinking of stamping this expression on that string, you can stamp it on the small thing or you can stamp it on the big thing, and it says I’ll take the big thing. Now you can override this, but basically you can think of these wildcards as very pushy, very pushy outwards, greedy, as large a string as possible. And that’s what we mean by greedy. Both the asterisk and the plus push outwards as wide as they can.
But, just like everything in regular expressions, you can fix that with another character. So now we have a three-character sequence: to the plus or the asterisk, we can add a question mark. So this says, any character, one or more times, but don’t be greedy. So now it looks at it and says, okay, I’ve got a beginning F and I can stop here, or I can stop here, but I am not greedy. The greedy prefers the longest, the not greedy prefers the shortest, and so this is what we get.
And again, when you are writing code using regular expressions, it’s really important that you test your code so that you see kind of weird anomalies like this, like whoa, why did I get that? Huh? What’s going on? Why not that? And then you run it and you realize, oh yeah, it’s greedy matching, it pushed really hard. Usually it doesn’t take too long to figure that out, but you do have to sometimes check it. And so, sometimes you’ve got to do something like add this question mark. Don’t be greedy, okay? So it’s just a fascinating thing: you’re coding. The question mark, it’s like an if statement. Hey, do the shortest one. And you communicate that in a single character. That’s why they’re kind of fun. They’re like a whole programming language in characters.
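A small side-by-side sketch of greedy versus non-greedy matching as described above. The sample line is an assumption; the only difference between the two patterns is the added question mark.

```python
import re

line = 'From: Using the : character'  # assumed sample text

greedy = re.findall('^F.+:', line)  # .+ takes as much as it can
lazy = re.findall('^F.+?:', line)   # .+? takes as little as it can

print(greedy)  # ['From: Using the :']
print(lazy)    # ['From:']
```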
Okay, so here we’re looking for the email address. One of the common things we’re trying to do is take those From lines and tear them apart, right? And so what we say is, hey, let’s go find everything that matches one or more non-blank characters, followed by an at sign, followed by one or more non-blank characters. So these are non-blank characters, but there’s no at sign. These are non-blank characters and, oh yay, there is an at sign followed by some non-blank characters. So that’s a match. And then none of these other sets of non-blank characters match that, right? And so that comes out, and we get exactly what you would expect: the non-blank characters followed by an at sign, followed by some more non-blank characters, and I’ve got the pluses to make them be one or more. Backslash capital S is a non-blank character; if you go back to the cheat sheet, that was the non-blank character, okay?
And you can think of this as also greedy, meaning they’re kind of pushing. Technically d@u would be one or more non-blank characters, followed by an at sign, followed by one or more non-blank characters. But with greediness, it pushes outward, and so it goes as far as it can. In this one, you do want to be greedy so you get this. If you made this non-greedy, the part after the at sign would stop at the u. So that also kind of helps you understand what greedy and non-greedy do.
Now, we can adjust how findall works by using parentheses. This one is not really using parentheses, so we’ll do that next. We can fine-tune the string extraction, and match more than we’re extracting. And so if we look at this particular example, where we add caret From space and then one or more non-blank characters followed by an at sign and one or more non-blank characters, this matches this, right?
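Before moving on to parentheses, here is a sketch of the plain email pattern from this passage. The sample line is an assumption in the usual "From " mail format. The second call shows what happens when both quantifiers are made non-greedy: Python's `re` still starts the match at the leftmost possible position, so only the part after the at sign gets shorter.

```python
import re

# Assumed sample line in the mail "From " format
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Greedy: non-blank characters, an at sign, more non-blank characters
emails = re.findall(r'\S+@\S+', line)
print(emails)  # ['stephen.marquard@uct.ac.za']

# Non-greedy on both sides: the left side still has to reach the at sign
short = re.findall(r'\S+?@\S+?', line)
print(short)  # ['stephen.marquard@u']
```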
So it’s a From followed by a space, followed by one or more non-blank characters, followed by an at sign, followed by one or more non-blank characters. This part here is a match. But we don’t actually want to get back the whole thing. And so we can add parentheses. What I’m saying is to start extracting after the space, so that From space is part of the match, but the extracted part starts here and ends here. So that says, this is the part that I want extracted, even though I demand all of this to match. So I’m extracting less than what I’m matching. I’m using the matching to be very precise as to the lines I want, and then I’m using the parentheses that I add to pull out what I want. And so here I get back exactly the email address, even though in this one thing I’m also making sure it’s from lines that have a prefix of From space. So: for lines with a prefix of From space, extract the second thing. And now it’s not just anything, it’s got to be From space and then immediately non-blank characters followed by an at sign followed by non-blank characters. So again, this is really fine-tuning.
Okay, so let’s take a look at this thing that we were doing a long time ago, but without regular expressions. The idea is we want to pull this little bit out, right? And here’s the old one. We find the at sign, which is at position 21, so that gives us 21. We start at that position and look for the next space, and we get 31. And so we say we want to do a string slice from one beyond the at position up to but not including the space, remember, up to but not including. And that prints out this little piece. But we can do a similar kind of thing with regular expressions, and we’ve seen this with dual split, right? So this is the find way of pulling that out.
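The two techniques in this passage can be sketched as below: parentheses to extract less than what is matched, and the older find-and-slice approach. The sample line is an assumption, picked so that the at sign falls at position 21 and the following space at position 31, as in the lecture.

```python
import re

# Assumed sample line
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# The whole pattern must match, but only the parenthesized part is returned
emails = re.findall(r'^From (\S+@\S+)', line)
print(emails)  # ['stephen.marquard@uct.ac.za']

# The older approach with find and a slice
atpos = line.find('@')         # 21
sppos = line.find(' ', atpos)  # 31
host = line[atpos + 1:sppos]   # up to but not including the space
print(host)  # uct.ac.za
```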
Dual split is: we split the line into words at the spaces, then we grab the second word, we split that by the at sign, and then we grab the second piece of that. And then we get this. So we were able to do that with four lines, a little more elegant.
But if we do regular expressions, we can say, hey, go find me an at sign, followed by some number of non-blank characters. And I don’t want to extract the at sign, see where I put the parenthesis. I want to start extracting after the at sign and go up to the end of those non-blank characters. So that says what I want in one little expression, right? The non-blank character here is written with a bracket, so that’s another syntax. The bracket is a single character, but if the first character of the set inside is the caret, that means not: everything but. So this means everything but a space, that is, non-blank. So that’s everything but a space, asterisk. There are other ways to do that, but that’s what this is saying: a single non-blank character, zero or more times, and that’s what I want to extract, and again out comes this little bit.
And we can fine-tune this by saying I want the line to start with From. I want a space, then any number of characters up to an at sign, and then I want to begin extracting all the non-blank characters and then end extracting. So it’s really just adding this bit to it, and this fine-tuning in a way can also be used to filter the lines. So if you didn’t have a From line, you would get nothing back, and you’re not finding email addresses in the middle of text, you’re just finding email addresses on lines that start with From space. And so you just sort of build these things up, you tell the regular expression what you want back, and you get back a list.
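The three ways of pulling out the host name discussed above, sketched next to each other. The sample lines and the variable names are assumptions for illustration; the patterns follow the description in the lecture.

```python
import re

# Assumed sample line
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Dual split: take the second word, then the piece after the at sign
words = line.split()
email = words[1]
pieces = email.split('@')
print(pieces[1])  # uct.ac.za

# Regular expression: extract the run of non-space characters after the at sign
host = re.findall('@([^ ]*)', line)
print(host)  # ['uct.ac.za']

# Same extraction, but only for lines that start with "From "
host_from = re.findall('^From .*@([^ ]*)', line)
print(host_from)  # ['uct.ac.za']

# A line without the "From " prefix gives an empty list
other = re.findall('^From .*@([^ ]*)', 'cc: someone@example.com')
print(other)  # []
```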
And like I said, you’ve got to check to see if the list is empty, because that is your way of knowing that it didn’t match anything. And so here’s a little bit of code that uses regular expressions to both pick lines and extract data. This is similar to one of the assignments, where you’re going to look for lines that have a form like this: they say X-DSPAM-Confidence: and then a floating point number. So we’re going to run through this: we’ll open the data, we’ll read through the lines, we’ll strip each line, and then we’re going to use findall to look for lines that start with X-DSPAM-Confidence:, which is quite a bit of stuff. It’s got to match every one of those characters, followed by a blank, and then start extracting. Take 0-9 and period, because we’re looking for floating point numbers so we want to get the period, close bracket, one or more times, and that’s what we’re interested in.
Now, here’s the part where you kind of have to check. If this is one of the lines we are looking for, there is going to be exactly one successful extraction. If you don’t have the prefix or don’t have the number, then you’re going to get zero extractions. And so what I’m basically saying is, this stuff is a list, a list of the matches. If its length is not 1, we skip that line. It would also be bad if it were 2, because that would mean there are more floating point numbers out here, and how did that happen? But it’s unlikely that this is going to match more than one. If one floating point number is what we got on the line, then we’re in good shape; otherwise we’re going to skip that line. So this is both a filter, like an if blah, blah, blah, continue or an if not startswith continue, and, if it finds it, it has also parsed the line and done a split and pulled the pieces out. So that’s how with regular expressions you can make programs more succinct.
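A sketch of the pick-and-extract loop described above. To keep it self-contained it loops over a short list of made-up lines instead of opening a mail file, and the name `numlist` and the final `max` call are assumptions added for illustration; `stuff` is the list of matches mentioned in the lecture.

```python
import re

# Hypothetical sample lines standing in for the mail file used in the lecture
lines = [
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.6178',
    'Subject: no number here',
]

numlist = []
for line in lines:
    line = line.rstrip()
    # Match the full prefix, extract only the digits and periods after it
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue  # not a line we want, skip it
    numlist.append(float(stuff[0]))

print(numlist)       # [0.8475, 0.6178]
print(max(numlist))  # 0.8475
```

The single findall call does the work of a startswith check, a split and an index lookup, which is the succinctness the lecture is pointing at.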
And when you see someone else’s regular expression, it might take you a little while to figure out what the heck it is doing, right? You have to read it. But the nice thing is, it’s not a bunch of lines. So it’s a way to make your program shorter. Don’t overuse it. Put a few comments in: pound sign, blah, blah, blah. This is looking for a line that’s of this particular syntax, and blah, blah, blah. Some kind of comment that helps your reader out. But once you get used to them, you will start to see them. They’re often used for data validation, and for searching and extracting.
Now we’ve got all these weird little characters, dollar signs, carets, etc., and sometimes we actually want to match those characters. So we have one more special character, and it’s the backslash. The backslash can be prefixed onto an otherwise active character. So dollar sign has meaning, but backslash dollar sign means it’s really a dollar sign. So if I’m looking for strings that start with a dollar sign followed by numbers and dots, this says give me the strings that start with a dollar sign and then one or more numbers or dots, and that matches this bit right here and pulls it out. So escape characters when you really want one of those characters, like a bracket or an asterisk or a plus or a dot.
So, that’s a quick zoom through regular expressions. They’re fun. They’re fascinating. They lead to elegant code when used appropriately. I would suggest you don’t overuse them, but there are some times that they just are the right thing to do. And so, don’t try to confuse the reader of your code, because the reader of the code might be you in the future. But they’re really interesting and powerful, and you’ll probably see code that uses them. So thanks a lot.
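As a recap of the escaping point in this lecture, here is a minimal sketch. The sample sentence is an assumption; the pattern is a literal dollar sign followed by one or more digits or dots.

```python
import re

x = 'We just received $10.00 for cookies.'  # assumed sample text

# \$ is a literal dollar sign; [0-9.]+ is one or more digits or periods
y = re.findall(r'\$[0-9.]+', x)
print(y)  # ['$10.00']
```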
