15.7 - Reconstructing Data with JOIN

دوره: Using Databases with Python / فصل: Data Models and Relational SQL / درس 4

15.7 - Reconstructing Data with JOIN

توضیح مختصر

We just created these many database tables, we've spread things out, we've linked them all back together with these numbers. We finally have touched the relational power and it may seem painful, but this is important, especially when data scales. Except now we can handle millions or billions of rows, because we have carefully constructed a data model that makes sense.

  • زمان مطالعه 12 دقیقه
  • سطح متوسط

دانلود اپلیکیشن «زوم»

این درس را می‌توانید به بهترین شکل و با امکانات عالی در اپلیکیشن «زوم» بخوانید

دانلود اپلیکیشن «زوم»

فایل ویدیویی

متن انگلیسی درس

So, we’re about to come full circle. We just created these many database tables, we’ve spread things out, we’ve linked them all back together with these numbers. And it seems like we just created a mess of everything. But now we’re about to bring it all back together using the JOIN. We finally have touched the relational power and it may seem painful, but this is important, especially when data scales. And if what you’re doing matters, then data’s going to scale. You can’t say, oh my data is so small I don’t need to do this. Well, if you’re doing it, and it’s big, and you’re really doing data analysis, and it’s like something you’re doing professionally, you tend not to do small things professionally. You tend to do large things. So this notion of primary keys, and foreign keys, and the little numbers that point to stuff, that’s what it’s all about. So we need to reconstruct, for our user interface, the data and reconnect all these connections. And the SQL syntax that allows this to happen is the JOIN operation. And the JOIN basically says, we are selecting data from more than one table. And part of the JOIN operation is the ON clause that says when we want to connect one row and one table with a corresponding row in a different table. And so we have this example here of our data, where we just want to sort of pull the artist title and the artist name out. Except that the artist title is in the album field, and the artist’s name, because we did this relationally, is in the artist field, so we don’t have it in one field so we have to pull across two tables. So that’s where we say SELECT, and then we want the things. Now this is a syntax here where you have table name, field name, tablename.fieldname. The title column from the Album table, the name column from the Artist table, FROM, which is the same FROM we’ve done before. Here’s a table joined with another table. And what we’re doing is we’re saying, let’s make a super table that really is the rows of this table, as well as the rows from that table, and that’s what the JOIN does. It’s like, we’re not going to one table, we’re going to two tables. Then, in addition to the JOIN operation saying we’re going to do all these tables, the ON clause decides when we connect a row in this table to a row in this table. So row to row, row to row, and the ON clause that we’re going to write is when the album’s artist id that is from the Albums table, the foreign key, the start of our little arrow, is equal to the id field from Artist, Artist.id, so it is the destination. So, this is the starting point of the arrow and this is the ending point of the arrow. So, this is the start. This is the end. So, that’s matching, reconnecting all the arrows for all the corresponding rows. Select these two fields, one field from each table, and then we got to glom the two tables together, but only connect the rows when this is true. For me, I would call this when. Some people prefer to write this as a longer WHERE clause, some databases prefer that. I tend to like using the ON clause, so I don’t kind of confuse my WHERE clauses. But if you learn the WHERE clause trick, it’s the same as the ON clause. So let’s run this one. That’s already in there. Oops. So I’ll run it. So here I go. I’ve got the title. Oops. I’ve got the title from one table and the name from another table. And we joined them all together. And we had an ON clause that tells when they’re supposed to be connected. Okay? So we can look a little more deeply into this relationship that we have, right? We can, you know, clear this. Look a little more deeply into the relationship that we have here by actually looking at all the data that’s actually participating in the connection between these two rows. So here we once again have our SELECT, and our FROM clause goes between Album and Artist. So we’re kind of, this is our super row. From the Album table and the Artist table. That’s Artist, that’s Album. And we have the foreign key relationship still. And we’re going to add to this SELECT these two fields. Everything is the same as before, except we’re going to show you how this connection is made, okay? So we’re going to select the title, the artist id, the artist’s id, and name. So you can see how we’re seeing the whole table at this table. These things have been joined together like there’s glue in the middle between them. They’ve been glued together, and they’re glued together in the situation where these two things match. So if you type that, that’s what you see. And so that’s a way these don’t really need to be shown, but they are the source of the connection. Okay? So that’s the source of the connection, and by just selecting them, you can see what it looks like. Okay, so let’s construct another one. Let’s say we want the Tracks title and the Genre’s name. And if we take a look at the Track, right, we’ve got genre id. Now we have some replication here. And then we do need to do this lookup, right? And we want to see all the tracks. Right? All the tracks along with their genres. Right? Rock, Rock, Metal, Metal. So now we’ve reconstituted the duplicates. Right? This is what the user wants to see. But this is what we’re storing in our database. All right? So again, we say SELECT Track.title. That’s this thing. Genre.name. That’s that thing from a different table. FROM Track JOIN TO Genre. So that makes, like, this big blob thing that’s a super row, that smears across two tables. And then we have an ON clause. And you’d think these are hard to construct. But, oh this is the Track table. There is a foreign key name named genre id. Oh, and that’s going to be equal to tablename.id. Right? I almost cut and paste these things so fast as I write them. And so they get easier and easier and easier, when you’re going from the Track table to the Genre table, and the syntax that I use for the connection is very canonical. I do it over and over and over again. Okay? So let me run this guy and there we go. We’ve reconstructed the replication. But this also a good time to show you a little bit more about what’s going on underneath this join, right? And now it all seems pretty and simple and this ON clause. But what happens if we just take out the ON clause? Select these two things from these guys glued together. Somehow we’ve glued these two tables together. So if you don’t have an ON clause what happens is it basically says all combinations. All combinations. So there’s four rows here and two rows here. So all combination of each row here. So this turns into 2, this turns into 2, this turns into 2, so that turns into 8. Now if this had 100, and this had 100, that’d be 10,000. And so if you don’t have an ON clause it ends up with all combinations. And we can show this. So this is exactly the same SELECT statement. Track.title, genre id, Genre.id, we’re adding the middle joiner guys that would have been an ON clause. And we’re saying FROM Track, JOIN with Genre, and no ON clause. Okay, that’s the thing that’s changed. We added these two columns and we took away the ON clause. So let’s take a look at what happens when we run this one. And lines don’t matter, so the fact that I put this on multiple lines you will see it doesn’t care. And so you can add spacing and lines to make your SQL look prettier. I do that all the time, I indent them. Indenting has no significance, but it is still very pretty. So now we see that we’re getting all the combinations. Right? 1, 1; 1, 2; 1, 1; so Black Dog has got both genres. So it got genre 1 and genre 2. Because we don’t have an ON clause, so it’s not interested in matching. And so you can think of a JOIN, you can think of a JOIN as building across both tables all possible combinations between the tables. The ON clause is throwing away the ones that don’t match. Or a better way to say it is it’s picking the ones that do match, right? We like this combination. We like this combination. We like this combination. And we like this combination. That’s what the ON clause is doing, is it’s picking the ones where there’s a match. If you don’t have the ON clause, then boom. All combinations, 1 to 2. This is one row and it’s being combined with that. This the the next row and we want both combinations. This is the third row, all combinations; fourth row, all combinations, and then the ON clause wipes out the non-matching ones. Whoops, I did it wrong again. I keep doing it wrong, but you get the idea. So that’s the ON clause. That’s really how it works, and frankly this is one of the things when I first looked at, it must be doing that but it doesn’t do that. When it sees the ON clause it’s really clever about pulling things in. Super clever. Again, that’s not our problem. So now it can get complex. Because now what we want is the track title, the artist name, the album title, and the genre name. And this looks like a lot of stuff, but if you follow it, it’s a pattern. So we have a SELECT, these are the things we want. That is the list of the output we want. We’re not interested in all the id’s. And we have to have this long FROM clause, that says, Track join with Genre join with Album join with Artist. That’s all the rows, all the tables concatenated together, and then the ON clause, which seems kind of tricky, but now we start. We start from Track and work out. We have the foreign key in Track of genre id is equal to the genre’s primary key. And the Track’s album id is equal to the album’s primary key, and there’s one more foreign key and that is the Album’s artist id equals the Artist.id. So go back to that little picture that we had this that pointed to this, that pointed to that, that pointed to that. We just look at this, the picture, and this is why the pictures are so important, and we reconstruct all this. And so this for me, it may look a little complex. But after you do it a few times, you realize that the naming convention really, really saves you, okay? Let me show you that, select this whole monstrous mess. Type her in, and then run it. And we’ve reconstructed it, right? Now, we’ve got replication here, but that’s the output. That’s the beauty of databases, you reconstruct any replication, but you don’t actually store the replication. And that’s why you have to learn to write JOIN clauses. So if we think about this, we’ve gone full circle. it’s taken us, just like any start-up, it just took us like three days to build our product. We started with a user interface that we designed that had replication. Then we came up with a data, a logical data model, then a physical data model, then we inserted all the data, we connected things with numbers instead of strings. And now we use a JOIN to reconstruct it. And so, we start here. We go through a step, a step, a step, a step, and then we come back to here, and we’re able to reconstruct, in effect, the output that our end users are going to want to see. Except now we can handle millions or billions of rows, because we have carefully constructed a data model that makes sense.

مشارکت کنندگان در این صفحه

تا کنون فردی در بازسازی این صفحه مشارکت نداشته است.

🖊 شما نیز می‌توانید برای مشارکت در ترجمه‌ی این صفحه یا اصلاح متن انگلیسی، به این لینک مراجعه بفرمایید.