16.1 - Geocoding

Course: Using Databases with Python / Chapter: Databases and Visualization / Lesson 1


Brief Summary

This chapter starts building multi-step programs that combine the networking and database skills from earlier chapters; if you don't understand that material, you really need to go back and review it, because the pace picks up from here. The data-gathering process is slow, unreliable, and subject to external limits, so it must be able to stop and restart. Because the problem is harder to solve and involves unreliability and other external factors, it is broken into multiple steps, each handled by its own small program, with a database holding the data in between.

  • Reading time: 9 minutes
  • Difficulty: very hard



English Transcript

Welcome to Chapter 15. We're going to have a little bit of a different take in Chapter 15: we're going to make more complex, multi-step programs, really just applying all of the skills we gained in the earlier chapters. And if you don't understand that material, you really need to go back and review those chapters, because we're going to start moving pretty fast. What we're really going to work on is combining working across the network (we wrote a program that read some stuff off the network; we've done that) with databases and how Python programs can put stuff in databases. But now we're going to use a database as an intermediate step while gathering from some kind of a data source. Increasingly, data is found on the Web, often in the wild. You're pulling something from a Twitter API, like we did previously, or a GeoJSON API, but there are rules about it: you might have to have an API key, you might have a rate limit, or the service might even be unreliable. So you have this gathering process that basically says, look, this is slow, yucky, unreliable, dangerous, and you might want to start it up and then restart it. This is a hard process. It might take hours, it might even take days, and it might break and get fixed, and break and get fixed again. So this gathering process is something we want to be real careful with, and that's why we tend to put the data in a database. Databases are really good at not losing data: if you're halfway through and something blows up, the data you already have is safe in the database. So the gathering, in many ways, goes like this: okay, how much data do I have? Now I'm going to go get some more data and put it in the database. Oop, we blew up.
So now we're going to run again, we're going to start up again. Okay, start up. Let's look at how much we got. We've got more now. We're going to start at a different place in the data, gather some more, and add it into the database. And we might have to do this many, many times. As I mentioned with the GeoJSON data, because you only get 2500 requests per day, it took me several days to get through 10,000 bits of data from the Geodata API. In this gathering process we tend not to do any analysis of the data. We keep these programs relatively simple: they read something, they stick it in the database, read, stick it in the database, and deal with the fact that you've been blown up and have to start halfway through. So we keep these really very simple. And sometimes we'll have very raw data in here, because we're really focusing this database on handling the complex management problems you have while you're gathering the data. So at some point you've got your raw data, and you may have a separate step, a Python program, that reads all the data in this database and might even write to another database. Frankly, you could have more databases here, and so on. But there's some process that reads the raw data and then might write another database. Some of these will just go straight to analysis or visualization, as in our earlier examples. But later we'll have this pretty data, the clean data, the data that makes sense, right? So each one of these is a Python program, and now we're going to run maybe a couple of other Python programs: one that reads from the clean database and does some analysis and prints us up some data, or one that reads from the clean database and tries to visualize the results.
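The restart-and-resume behavior described above can be sketched in a few lines. This is an illustrative toy, not code from the course: the table and function names are made up, and a raised exception stands in for the network blowing up. The key idea is committing after every record, so a crash loses nothing already fetched, and the next run skips past what is cached.

```python
import sqlite3

# Toy sketch of a restart-safe gathering loop (all names illustrative).
# Commit after every record so a crash mid-run loses at most the item
# in flight; a later run resumes past everything already stored.

def gather(conn, items, fail_after=None):
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS Raw (item TEXT UNIQUE, data TEXT)")
    done = 0
    for item in items:
        cur.execute("SELECT data FROM Raw WHERE item = ?", (item,))
        if cur.fetchone() is not None:
            continue                                # gathered on a prior run
        if fail_after is not None and done >= fail_after:
            raise RuntimeError("network blew up")   # simulated failure
        cur.execute("INSERT INTO Raw VALUES (?, ?)", (item, "fetched:" + item))
        conn.commit()                               # per-record commit: restart-safe
        done += 1

conn = sqlite3.connect(":memory:")
items = ["a", "b", "c", "d"]
try:
    gather(conn, items, fail_after=2)   # first run dies after two fetches
except RuntimeError:
    pass
gather(conn, items)                     # second run picks up where we left off
print(conn.execute("SELECT COUNT(*) FROM Raw").fetchone()[0])  # -> 4
```

Note that the failed first run still leaves two rows safely in the database; that is exactly why the gathering step commits per record instead of once at the end.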
And so these are separate steps, and each of these boxes is a separate Python program. Now, in a way, everything we've done up to this point has been: write one Python program to produce some result, right? We write a loop, we read the stuff, we make an array, and then we print the array out. But here, because the problem is harder to solve and there's unreliability and other external factors, we basically break it into multiple steps, and we write a little Python program for each of those steps. Now, what we're working on is not exactly data mining. It is and it isn't. I don't call this data mining, because that would be overstating what we're doing. There are many very complex data mining technologies, and that's not what we're going to cover in this course. There are other places you can learn about data mining, and I'd like to think that this course is good preparation for learning about data mining technology. There are open-source things like Hadoop and Spark, Amazon has a whole data mining operation called Redshift, and there are many other commercial and community tools besides. So don't assume that this is all there is to data mining. This is a particular style of data mining that I call "Personal Data Mining", and it is not to say that once you're done with this you're a data mining expert, because that would be a gross overstatement. In this chapter we're really more interested in making you better Python programmers: solving some simple, rudimentary data mining problems with Python programs, and then looking at those programs and becoming better programmers. So the first thing we're going to do is build on something we did in the last chapter: talk to Google's Geocoding API, pull some data into a database, and then visualize something out of that database using the Google Maps API.
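The "each box is a separate Python program" idea can be made concrete with a tiny sketch. Here each function stands in for what would be its own program in practice: one gathers raw data, one cleans it into a second table, and one reads only the clean table for analysis. All names and sample data here are illustrative, not from the course code.

```python
import sqlite3

# Toy pipeline: gather -> clean -> analyze, with a database between
# steps. In the real pattern each function would be a separate script
# run independently, possibly against separate database files.

def gather(conn):
    # Step 1: slow, restartable collection of messy raw data
    conn.execute("CREATE TABLE IF NOT EXISTS Raw (line TEXT)")
    for line in ["  Ann Arbor ", "LANSING", "  Ann Arbor "]:
        conn.execute("INSERT INTO Raw VALUES (?)", (line,))
    conn.commit()

def clean(conn):
    # Step 2: read the raw table, normalize and dedupe into a clean table
    conn.execute("CREATE TABLE IF NOT EXISTS Clean (city TEXT UNIQUE)")
    for (line,) in conn.execute("SELECT line FROM Raw"):
        conn.execute("INSERT OR IGNORE INTO Clean VALUES (?)",
                     (line.strip().title(),))
    conn.commit()

def analyze(conn):
    # Step 3: analysis/visualization touches only the clean data
    return [row[0] for row in
            conn.execute("SELECT city FROM Clean ORDER BY city")]

conn = sqlite3.connect(":memory:")
gather(conn)
clean(conn)
print(analyze(conn))  # -> ['Ann Arbor', 'Lansing']
```

Keeping the steps separate means a bug in the cleanup or analysis never forces you to re-run the slow, rate-limited gathering step.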
So you do need to be connected to the Internet when you do this. And of course, whenever you're doing any of these things, I will generally give you URLs to use other than the official ones. You can use the official URLs, but at some point we don't want the providers to get annoyed with so many students taking the class and pounding all of these APIs, or I'll get some kind of email that says, quit talking about our API. So whenever possible I'll give you my own API to use for these kinds of things, and I'll give you a whole video showing these programs in action. But right now I just want to show you the general outline of how these things work. So we have the Google Geodata API here; we have played with this before. If you look at this program, geoload.py (by the way, you download all this stuff from right there, and these are just files in there), it's a lot like the thing you saw before: it hits a URL, reads some JSON, parses the JSON, and then writes it into a database. But this one takes a list of locations as input. Where the earlier program asked you for each location, this one reads where.data, which is a list of locations, and that list can have thousands or even hundreds of thousands of entries. As we retrieve each location, we put lines in our database, and this ends up in a file called geodata.sqlite (.sqlite is the usual suffix for an SQLite database file). So this will run, and it can start and stop, and start and stop. Remember, you only get 2500 of these per day. Start and stop, and slowly but surely we build this up. Now, the interesting thing is, even if you haven't got all the data, you can still run the other steps. Let's say you've got the first 500 records: you can still make a pretty picture of 500 records, and then the next day you can go get 500 more, or 1000 more, depending on your network connection, and so on.
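A simplified sketch of the geoload.py pattern might look like the following. The real geoload.py ships with the course download; this version substitutes a `geocode()` stub for the actual HTTP request, and the quota value and table layout are illustrative. The two behaviors it shows are the ones the lecture describes: skip addresses already cached in geodata.sqlite, and stop after a per-run quota so the program can simply be run again the next day.

```python
import sqlite3, json

# Simplified geoload.py-style loader (geocode() is a stand-in for the
# real rate-limited HTTP call; QUOTA and names are illustrative).

QUOTA = 2  # the lecture mentions a limit of about 2500 lookups per day

def geocode(address):
    # Placeholder for the slow, unreliable network request
    return json.dumps({"status": "OK", "query": address})

def load_some(conn, lines):
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS "
                "Locations (address TEXT UNIQUE, geodata TEXT)")
    fetched = 0
    for line in lines:                      # lines as read from where.data
        address = line.strip()
        cur.execute("SELECT geodata FROM Locations WHERE address = ?",
                    (address,))
        if cur.fetchone() is not None:
            continue                        # cached from an earlier run
        if fetched >= QUOTA:
            break                           # quota hit: run again tomorrow
        cur.execute("INSERT INTO Locations VALUES (?, ?)",
                    (address, geocode(address)))
        conn.commit()                       # commit per row: restart-safe
        fetched += 1
    return fetched

conn = sqlite3.connect(":memory:")
lines = ["Ann Arbor, MI", "Lansing, MI", "Detroit, MI"]
print(load_some(conn, lines))  # first run stops at the quota of 2
print(load_some(conn, lines))  # second run fetches only the remaining 1
```

Run repeatedly, this slowly fills the database without ever re-fetching an address, which is exactly the start-and-stop rhythm described above.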
Also, don't get yourself in trouble with your network service provider by running these things 24 hours a day and downloading gigabytes of data, especially if you're on some mobile device; just be careful how much data you download. So at some point you have the data cached. We use the word cache for a local copy of something that lives elsewhere. We've got a nice copy, so now we don't need to talk to Google anymore; all our data is sitting in this database. So we write a little program called geodump.py, and it runs a loop through all the records in this database, loop, loop, loop. It prints them out on the screen, and as a side effect it also writes a bunch of the data into a file called where.js. This is a JavaScript file, and you can take a look at it. This is not a JavaScript class, so I've given you a whole bunch of HTML and JavaScript that takes care of all this: the HTML file reads that JavaScript file and then calls the Google API to make all the little dots on the map for you, right? So if you pull more data in, run this program again, and then hit refresh on the screen, new little dots will start popping up. Okay? Now, the screen doesn't go straight to the database; you have to run geodump.py. But now we're seeing this multi-step process, where you do this for a while, you get your data filled up, and then you say, oh, I've got myself some nice raw data here that's been cached, and now I'm going to run it, see what's going on, and then visualize it. Okay? Like I said, I'm not going to teach you in this class exactly how to write each one of these things, although in the Capstone some of you may play around a bit with doing those kinds of things.
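In outline, geodump.py reads the raw JSON cached in geodata.sqlite, extracts each latitude/longitude, and writes where.js for the bundled HTML page to read. The sketch below is a condensed illustration, not the course's actual geodump.py: the `myData` variable name and the field paths follow the general shape of a Google geocoding response, but the real files may differ in detail.

```python
import sqlite3, json

# Condensed geodump.py-style exporter: loop over cached JSON, pull out
# lat/lng and an address, and emit a JavaScript array for the map page.
# Variable name and JSON field paths are assumptions, not verified
# against the course files.

def dump(conn, js_path):
    entries = []
    for (geodata,) in conn.execute("SELECT geodata FROM Locations"):
        js = json.loads(geodata)
        loc = js["results"][0]["geometry"]["location"]
        where = js["results"][0]["formatted_address"].replace("'", "")
        entries.append("[%f, %f, '%s']" % (loc["lat"], loc["lng"], where))
    with open(js_path, "w") as fh:
        fh.write("myData = [\n" + ",\n".join(entries) + "\n];\n")
    return len(entries)

# Build a tiny one-row cache, then dump it to where.js
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Locations (address TEXT UNIQUE, geodata TEXT)")
sample = {"results": [{"formatted_address": "Ann Arbor, MI, USA",
                       "geometry": {"location": {"lat": 42.28,
                                                 "lng": -83.74}}}]}
conn.execute("INSERT INTO Locations VALUES (?, ?)",
             ("Ann Arbor, MI", json.dumps(sample)))
conn.commit()
print(dump(conn, "where.js"))  # -> 1
```

Because the dump is a separate step, you can re-run it and refresh the map page at any point while the gathering step is still only partway through its data.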
So this is the summary of the first of our three examples of how we're going to do this personal data mining. We'll see you in the next lesson, where we'll talk about the PageRank algorithm.
