© Distribution of this video is restricted by its owner
00:01 | So, coming back to my comments, a little reminder about trying to understand a bit about GPU performance. Some of these slides are for the generation called Pascal, which is a few generations old as GPUs go, but the basic GPU organization is the same, so it is still relevant for our purpose; some of the numbers for other GPUs may differ a little. Just remember this kind of hierarchical clustering of things: at the top are the processing clusters, the big chunks, and there were six of them on that chip (it is on the slides anyway). Each of them has ten of these streaming multiprocessors, and each of those has a bunch of CUDA cores in it. |
01:12 | I am using NVIDIA's vocabulary here, warps and thread blocks, instead of talking about vectors. The native width that was around for a long time was just 32. Then they got more silicon, so to speak, or finer feature sizes, so at some point (Pascal, I think) they got to 64 CUDA cores per streaming multiprocessor. But they still keep this notion of a 32-wide grouping of threads: SIMD threads grouped into something like a vector, and that is still there. Then they group threads into what they call thread blocks, and the blocks are managed independently of each other. |
01:59 | And here is a little bit of what that looks like. Again, I encourage you, and you may have a chance in the assignment, to play around with it and see how this structure gets reflected in the performance, so you see what is happening underneath. And back to the thing I have tried to stress all through the course: memory hierarchies. In this case the data was for Kepler, the generation prior to Pascal, and it shows what the memory latency is depending on where things hit: whether it is in the memory local to the unit, or in the memory shared between the different units in a streaming multiprocessor, or whether you have to go outside to the device memory. You can see the difference: also for GPUs, going to the device memory is quite expensive compared to staying on the silicon close to the streaming multiprocessor. |
03:20 | For that reason they have what is known as memory coalescing. The way they have apparently designed their memory controller, as it seems to me, is that they can have something like two outstanding requests to the memory at the same time; beyond that, things get serialized. So if every one of the SIMD threads that gets woken up to work makes its own request to memory for some data, then 32 requests get issued, because it is SIMD and they all happen at the same time, synchronously. Thirty-two requests hit the memory at once, it can only handle a couple of them, and so it takes something like 16 rounds for the memory to serve those 32 requests. That is no good. So that is why they do what they call coalescing: trying to merge these requests into as few memory transactions as possible. |
04:35 | In the example here the requests are eight bytes each, so the eight requests shown turn into one request for 64 bytes, and I think that is a good thing. The way I interpret the slide: yes, I can tell this is a good thing based on what they already wrote down. If you go back and look at the architecture they use, there are 384 bits, or wires, to their memory. It is GDDR memory, and the first D is for Double, so per clock period of the memory you get two transfers of 384 bits. Sixty-four bytes is 512 bits, so fine: with the two transfers in one memory cycle they can serve the 512 bits. That, I think, is why they used this particular transaction size. |
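A small worked version of that arithmetic, using the numbers quoted above (eight 8-byte requests coalesced into one transaction, and a 384-bit double-data-rate memory interface):

```latex
8 \times 8\ \mathrm{B} = 64\ \mathrm{B} = 512\ \text{bits}
\;\le\;
2 \times 384\ \text{bits} = 768\ \text{bits delivered per memory clock}
```

So one coalesced 64-byte transaction is covered within a single memory cycle, whereas 32 separate requests served roughly two at a time would take about 16 rounds, which is exactly what coalescing avoids.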
05:40 | But again it comes back to what I have said before, and will repeat when I talk a little more about GPUs: understand the hardware architecture and the data path in the system, in order to figure out how many rounds of transfers it takes to get things in and out. When you try to interpret why things behave the way they do, go back and look at how wide the data paths are, how much data you were trying to transfer, and how it gets transferred; then you can get an idea of what it takes. |
06:09 | That is what I wanted to say about OpenACC, and again it is a very small portion of what you can do, just an introduction to give you a flavor of what is there. The slides have URLs for you to find more information, in terms of tutorials and the current specification and other things. So if you are interested beyond doing the assignment, go and look at the URLs. Okay, I think that is it for that part. |
07:06 | Now a little bit of, I would not quite say algorithms, but something algorithm-related. It comes back to the memory hierarchy again: figuring out the impact of your code, and how to analyze a little what to expect in some very simple cases. So today it is basically how simple codes behave in terms of caches. The bar chart is from one of the compiler researchers, now at UT Austin, who did this work quite some time ago; the data is old, but qualitatively it is still highly relevant. In this case, do not worry about all the different versions they tried in order to make things better. The only take-away message for this class is that the versions at the very bottom hover just above zero, and if you do something really clever you can get something that performs probably about 2000 times better. |
08:30 | And it is not about being clever in terms of the algorithm. You have probably heard about Strassen's algorithm, which does matrix multiplication in roughly n to the 2.8 operations instead of n cubed. That is not the game here; this is just playing around with your three nested loops, and that alone can have a dramatic impact on performance. So we are not going to do all the different versions. Basically, we start with a matrix-vector multiplication and figure out how to understand what is happening with the caches, depending upon how you write the loops for your matrix-vector multiplication. |
09:14 | So here is a typical way to write your matrix-vector multiplication: you do a collection of inner products, one per row of the matrix. Now look at the number of memory references; we are trying to understand the memory behavior again, not changing the algorithm itself. In this case it says four n squared memory references. That means: you need to read x, you need to read a, and you need to read y, which is three, and you need to write y, so that is four. If nothing good happens, you do that for every statement instance that gets executed, and there are n squared of them. Four references per statement times n squared statement instances gives four n squared memory references. Hopefully, if you have caches, you ought to be able to do better, and that is what we are going to figure out: how much better it can be. |
10:37 | The naive count here is just to make the point; it is the total. Any decent compiler will not actually read and write y from memory on every iteration, and if the compiler is not that great you can fix it by introducing a temporary yourself, so it does not get confused. But we will talk about that a bit later. |
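As a concrete reference for the counting above, here is a minimal sketch of the inner-product form of y = A x being described (row-major storage and the function name are my own choices for illustration):

```c
#include <stddef.h>

/* y = y + A*x, inner-product (row-wise) form.
 * The innermost statement reads a[i][j], x[j] and y[i] and writes y[i]:
 * 4 memory references, executed n*n times, giving the 4*n^2 figure
 * if nothing is kept in registers or cache. */
void matvec_dot(int n, const double *a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] = y[i] + a[(size_t)i * n + j] * x[j];
}
```

A reasonable compiler will keep y[i] in a register across the inner loop, which is the point made above; accumulating into an explicit scalar temporary makes that independent of the compiler.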
11:01 | I guess maybe I should slow down a little. How many of you have had a course talking about caches and how they work? One, two, three... okay, all right. So: I showed you pictures back in lecture four or so. Level one, level two, level three, maybe even a level four: a hierarchy of things, and the cache closest to the functional units tends to operate at the same clock rate as the functional units, so it can feed them. |
11:49 | Typically, since this kind of design is driven by scientific and engineering workloads (lots of matrix-vector, matrix-matrix and similar stuff), you have, like in the dot-product example, two or three inputs: a and x, which is two loads, and then y, which eventually needs to be written. So the typical provision is two concurrent loads and one store: the level-one cache can ship two 64-bit operands per cycle to the functional units and take one result back from the register file. They tend to be built for two loads and one store; different vendors do it a little differently, but that is the typical arrangement. Once you move to level two and level three, they typically do not have the same capability as level one: they may run at half the clock or a quarter of the clock, so even though the path may be about as wide, the data transfer rate is lower. |
12:59 | Now, there is more. I think I also mentioned in those slides that most caches today are set-associative caches. The other two variants are direct-mapped and fully associative. Direct-mapped means that every memory location in main memory has its own designated place in the cache; it can only go to that one place, and if the place is already taken, somebody needs to decide what to do. Fully associative caches are the other extreme: a memory location can go anywhere in the cache; anything that is free, just grab it. The practical compromise, so you do not have to search everywhere for a free slot, is to look at only a few places: you choose between, say, 4, 8 or 16 places, which is considerably less complex than looking at all of them. So most caches are four-way, eight-way or sixteen-way set associative. |
14:15 | That is one part: where data can go. The other question is what you do when those places are already occupied. Do you just sit around and wait until somebody decides something is not needed anymore, or do you decide, okay, fine, I am going to kick something out? If the line has never been updated, you do not necessarily need to write anything back; if the cache line has been updated, you need to write it back somewhere. Then there are different replacement policies for which line gets kicked out, and the most common one is LRU, least recently used: the line that has been sitting untouched in the cache for the longest time gets evicted. That is the process when you get a cache miss: possibly write something back to memory, and then load what you want into the cache. The LRU policy is what is assumed in the analysis here, and it is something to pay attention to; most of the slides say which processors use which policy. |
15:33 | So, the cache vocabulary, depending on what happens when you try to load, or try to read, something. One kind is what is known as compulsory misses: the data has never been in the cache, so it has to migrate from main memory to the functional units; there is no option, you have to bring it in. That is the notion of compulsory misses. The other simple case comes from the cache being a lot smaller than main memory: at some point it fills up, and once it is full you get capacity misses. And then there is the kind that has to do with associativity, when the particular places something maps to are occupied: those are conflict misses. Today we will do the first two cases, which are relatively simple and clean to analyze; it gets more complicated when you look at conflict misses. |
16:41 | The obvious thing is that a large cache is good. For the analysis I will use the term large cache when the problem size is small enough that the whole problem, so to speak, fits in the cache: you have a small matrix and it all fits, so the cache is large relative to the problem you have. The other one is the small cache: the cache is relatively small compared to whatever arrays of data you use. Now, in cache design, the larger the cache, the fewer capacity and conflict misses, but the longer the hit time, because you need to cover a larger address range and figure out where things are. Higher associativity reduces conflict misses but increases hit time, as I mentioned before. |
17:50 | Larger block sizes, cache line sizes in this case, mean you load more at once. In some sense that can be better: if you need the rest of the line, that is a good thing; if you did not need it, you loaded a lot more data than necessary. So there is a trade-off between the number of cache misses and the access time spent on stuff you do not want. That is why you see pretty much all the architectures out there using 64-byte cache lines, at least close to the functional units. They may have 128 bytes in some cases closer to main memory; in that case you pull 128 bytes out of memory and then partition it into 64-byte lines in the level-one cache. |
18:41 | Now, this is the one thing one really needs to remember, because it is important for understanding cache behavior: the notion of stack distance, or reuse distance. It is basically how many cache lines you touch between two events you are interested in; in our case, between two reads of the same data. It tells you roughly how many things got stacked into the cache before you tried to get back to the item; if it is a lot, then most likely the thing you want has been kicked out and is no longer there. That is the notion we are going to use. It lets you figure out when you can expect the system to behave like the giant-cache case, where you do not have to worry about caches, and when the cache gets in the way of good performance and you do need to worry about cache misses. |
19:54 | That is what this kind of analysis tries to tell us, applied here to the matrix-vector multiply. And I think I already said it: the large-cache case is quite simple. You get the compulsory misses, because you have to read the data once, but there are no capacity misses, because the cache holds the whole problem. With a small cache you need to worry about how the cache operates. So here are basically five scenarios; I will see if I can get through them for the matrix-vector case today. The first two do the analysis with a cache line of just one element, and then you try to figure out the impact of a larger cache line and whether it helps; it depends, as we will see. |
20:52 | So back to the matrix-vector code, two nested loops, and as I mentioned, this is the typical inner-product formulation. Let us count the misses, first with the large-cache model. We obviously need to read A, which is n squared data elements; we need to read x; and we need to read and write y. Once y has been loaded it stays there in the large-cache model, so we do not miss on it again, we just write it. So we basically have n misses each for x and y, and n squared for A. That gives the total number of misses, and relative to the naive model that counted four n squared references, the miss ratio is 0.25 plus 0.5 divided by n: a bit higher for small n, eventually going down to about a quarter. |
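In symbols, the large-cache count just described (cache line of one element, everything staying resident once loaded):

```latex
\text{misses} = \underbrace{n^2}_{A} + \underbrace{n}_{x} + \underbrace{n}_{y},
\qquad
\frac{n^2 + 2n}{4n^2} = \frac{1}{4} + \frac{1}{2n} \;\approx\; 0.25
```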
22:04 | The next scenario is the small-cache model. What happens now? At the top of the slide is the string of memory references the loops generate; everything shaded one way denotes going through the first row from start to finish, and the first differently shaded item after that is moving on to the next row. Still no surprise for A: it is never reused anyway, so we need n squared misses, one per matrix element, regardless. x, however, is reused: every row uses the same vector x, so there is potential reuse of x. And y is reused completely, or not, depending on how you write the loops. |
23:02 | With a small cache the idea is: as you go through the shaded references, an element of A gets loaded and sits in cache, an element of x gets loaded and sits in cache, you have y1 sitting there being accumulated, so y is not really a problem. But you keep loading a's and x's, and at some point the cache is full; it is a small cache. That means that by the time you are at the end of the row and want to start the next row, x is no longer in the cache: it got evicted when you were some fraction of the way through the row, depending on the row length. So x is gone. You load it once for the first row, n elements, and then you have to reload it for every other row as well: n misses for each of the remaining n minus one rows. That gives essentially n squared misses for x too. |
23:56 | So now we have n squared for A, n squared for x, and y is not a problem, because it just sits there for a whole row until we move on to the next one. That is why we get two n squared plus n as the number of cache misses, and the ratio relative to the naive count is 0.5 plus 0.25 over n. |
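And the corresponding small-cache count, where x has been evicted by the time the next row starts:

```latex
\text{misses} = \underbrace{n^2}_{A} + \underbrace{n^2}_{x\ \text{reloaded per row}} + \underbrace{n}_{y},
\qquad
\frac{2n^2 + n}{4n^2} = \frac{1}{2} + \frac{1}{4n} \;\approx\; 0.5
```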
24:45 | Then you try to figure out: for what problem sizes can I expect to stay in the large-cache regime, the one that performs best, rather than in the small-cache one? You plug things in and work out how large the cache needs to be. If the replacement policy is very clever and knows that A is never going to be touched again, then you do not need to worry about A at all; the only things you need to keep are x, one slot for the current element of A, and a slot for y. So you need space for x plus a few other variables, which is n plus a small constant. |
25:46 | On the other hand, with an LRU replacement policy the A and x accesses are interleaved, and for x to still be in the cache at the end of the row you need space for one whole row of A plus the complete x plus the y element, and maybe one more line for safety so nothing messes things up. So you need about two n plus a couple, not n plus two, because you need both one row of A and one copy of x, as a consequence of the replacement policy. That means the behavior changes roughly when n reaches about half the cache capacity in elements. |
26:50 | So this is what you get: in the large-cache regime the miss ratio starts a bit higher for small n and comes down toward the asymptote; and then when n grows past that point, so that all of a sudden the cache is small relative to the problem size, the cache miss ratio goes up again, and that is where you get the cache cliff. Okay, we will see how far we get. |
27:23 | This is the other loop order. Which one is favored depends on the layout: one is favorable for row-major storage and the other for column-major. This is the SAXPY-style form, because the i loop, the one that actually goes down a column of A, is the innermost loop: within the innermost loop you go down a column, and the outermost loop scans across the columns. For certain layouts that is an ideal access pattern, but for this example we are assuming row-major order. So that means... well, I am getting ahead of myself, that comes later. If you look at this version with a cache line size of one element, it in fact ends up with the same miss ratio as before, so forget the aside; that is just what I had in my head when I started talking. |
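For reference, a minimal sketch of the two loop orders being contrasted, under the row-major layout assumed in the example (the function names are mine):

```c
#include <stddef.h>

/* Dot-product (ij) order: the inner loop walks along a row of A (unit stride). */
void matvec_ij(int n, const double *a, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = y[i];
        for (int j = 0; j < n; j++)
            sum += a[(size_t)i * n + j] * x[j];
        y[i] = sum;
    }
}

/* SAXPY-style (ji) order: the inner loop walks down a column of A,
 * which is a stride-n access when A is stored row-major. */
void matvec_ji(int n, const double *a, const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            y[i] += a[(size_t)i * n + j] * x[j];
}
```

With a one-element cache line the two incur the same miss count, as noted above; the difference shows up once a cache line holds several elements.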
28:29 | Now we bring in the cache line size. We will do the same analysis for both the dot-product form and the SAXPY form and figure out what happens when the cache line holds more than one value. The storage is row-major, so in the dot-product form the innermost loop walks along a row, and with row-major storage you get B elements of A for one miss. The number of cache misses for A therefore gets divided by the line size, giving n squared over B misses. The same good news holds for x and y: they are just contiguous vectors, you get B elements at a time, so everything gets divided by B, and I think that makes sense. |
29:27 | Then you look at the two models, the large-cache model as before and the small-cache model. The number of misses changes, but the calculation is the same; everything is just like case one, the inner-product form with line size one, except scaled by B, including where the transition happens. I think you get it. The effect of LRU is a bit different in detail because you now work in units of B, but what you need to keep is still a whole row of A and the whole of x, modulo things being divisible by B, so nothing fundamental changes: the transition is at essentially the same point, but the miss ratios, if you look at the numbers, go down; they are basically divided by B. So now the next one is the SAXPY version. |
30:47 | Now things are a bit different, because the matrix is still allocated in row-major order. You always get B elements of A per cache line, but you actually only want one of them, because the next element of A that you want is in the same column but a different row, and that is not in the same cache line. What shares a cache line with an element of A are its neighbors along the row, not the elements in the same column. That is why the cache misses for A stay at n squared regardless of the line size. |
31:46 | For y, since you go down the vector, you do get B useful elements per line, so the number of misses for y is not n squared but divided by B; except that you have to go back and repeat, because by the time you are through with one column, y may be gone again. What changes is that you now need to keep much more around to preserve reuse, so the transition to the small-cache behavior happens at a much smaller n, and it depends on the cache line size. That is why you get the picture shown, where the curves move into the small-cache regime at much smaller n; for this loop order, a cache line size of one is actually much better than 64 or 128 bytes. |
33:03 | So what people do in real life, instead of just the naive looping, is so-called blocked algorithms, which I think is what you were asked to do in one of the matrix-multiply assignments. Instead of going element by element, you work on a block of the matrix at a time. It has only a marginal effect for matrix-vector, but it has a significant impact for matrix-matrix multiplication. The idea is that you take a small, capital-B by capital-B block of the matrix and process one such block at a time, and then you only need the corresponding B elements of x and y in the cache at the same time while you work through one block of the matrix. In this case you get some benefit; two more slides, and I think I can just about get through this. |
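A minimal sketch of the blocked matrix-vector idea just described, assuming square blocks of size bs and, for brevity, that n is divisible by bs (names are mine):

```c
#include <stddef.h>

/* Blocked y = y + A*x: process A one bs-by-bs block at a time so that the
 * pieces of x and y touched by a block stay in cache while they are reused. */
void matvec_blocked(int n, int bs, const double *a, const double *x, double *y)
{
    for (int ii = 0; ii < n; ii += bs)          /* block row    */
        for (int jj = 0; jj < n; jj += bs)      /* block column */
            for (int i = ii; i < ii + bs; i++)
                for (int j = jj; j < jj + bs; j++)
                    y[i] += a[(size_t)i * n + j] * x[j];
}
```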
34:14 | And this is a thing that is important; it is actually used in a number of settings that go way beyond dealing with vectors and matrices. Once you have the blocks, you still have to decide in which order you want to traverse them. You can do row-major order, going through a block row, block by block; or you can go down a block column, block by block; or you can try some other orders. What is in the upper right corner, though they have twisted it around a little, is what is known as a Morton ordering. You can see the Z shapes; some fellow by the name of Morton got his name attached to this order. Has anyone heard about Morton ordering? No? Has anyone heard about space-filling curves? No? Okay, I will come back to that when I talk later about partitioning data. |
35:40 | One technique that is quite widely used is these space-filling curves. You try to walk across your entire data structure; the idea is, like a traveling salesman, you visit each place exactly once and visit all of them. The space-filling curve turns a multidimensional problem into a one-dimensional one, and then you have a traversal order in that one dimension that is supposed to be good in some sense. It is used for load balancing in clusters, and it is used for trying to make the memory hierarchy work well. The Morton ordering, as you can see from the line, is a recursive ordering: you can apply it recursively from the coarse level down to whatever level you want. And some of the compilers in fact make use of this when they do blocking for you: they do blocking, and then another level of blocking, and maybe more. |
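As an illustration of the Z-shaped (Morton) ordering mentioned here, a small helper that interleaves the bits of a block's row and column indices to get its position along the curve; this is a hypothetical utility of my own, not something from the lecture slides:

```c
#include <stdint.h>

/* Morton (Z-order) index: interleave the bits of row i and column j, so that
 * blocks that are close together in 2-D stay close in the 1-D traversal order. */
uint64_t morton_index(uint32_t i, uint32_t j)
{
    uint64_t z = 0;
    for (int b = 0; b < 32; b++) {
        z |= (uint64_t)((j >> b) & 1u) << (2 * b);       /* even bits: column */
        z |= (uint64_t)((i >> b) & 1u) << (2 * b + 1);   /* odd bits:  row    */
    }
    return z;
}
```

Visiting blocks in increasing order of this index produces the recursive Z-shaped traversal shown on the slide.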
36:56 | The other ordering idea is recursion, because in theory, and I will talk about this a little more, recursive algorithms are intrinsically good for memory hierarchies. In real life it turns out not to be quite true. The compiler people got very excited about this about 15 or 20 years ago and built compilers that do it automatically, and then when we tried them out, well, it does not always work. I will say a bit more later about why it does not work in all cases, or why, as always, one has to be careful how to use it. But this space-filling-curve concept is something that, for those of you in computational science, may be new and may be useful in some contexts. And this slide just shows a little of what you can get if you do the blocking for the matrix-matrix multiply, as a function of cache and block parameters. So I think that is pretty much what I had, except for one last thing: strip mining. |
38:12 | Strip mining is what the compiler people call it when they partition a loop into sections that fit some block size or some vector length. Say the loop runs over a bunch of elements and you want to use vectors of length 32, but you have, you know, a hundred thousand iterations; you still want to use the vector units, so you convert it into an outer loop over strips and an inner loop that processes 32 elements at a time. That is what they call strip mining, and tiling is what I talked about in the context of the GPUs. |
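A minimal sketch of the strip-mining transformation just described, using a strip length of 32 as in the example (for brevity, n is assumed to be a multiple of the strip length):

```c
#define STRIP 32   /* e.g. the vector/SIMD length */

/* Original loop:   for (i = 0; i < n; i++) c[i] = a[i] + b[i];
 * Strip-mined: an outer loop over strips plus an inner loop of fixed length
 * STRIP that the compiler can map directly onto vector instructions. */
void add_stripmined(int n, const double *a, const double *b, double *c)
{
    for (int is = 0; is < n; is += STRIP)
        for (int i = is; i < is + STRIP; i++)
            c[i] = a[i] + b[i];
}
```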
38:54 | So that was it for today, and time is up. Next time I will talk about matrix-matrix multiplication, and then I will give a little introduction to how clusters are put together, moving on to aggregates of nodes as the basis for talking about message passing, MPI for short. That will be the assignment after the GPU assignment we are talking about. But I thought understanding a bit of cache behavior, and how to structure algorithms for memory (in our case node-based so far, not cluster-based partitioned or distributed memory, just local memory making use of the cache hierarchy) was worth doing now. I will finish that up next time and then talk about clusters and how they are put together, very quickly, so you have a mental picture, if you do not already, of why message passing makes sense. |
39:56 | Okay, all right. So last time we talked about matrix-vector multiplication and showed a little of the impact of loop order on what one can expect in terms of performance. I also showed a chart that gives you the wide range of performance on some older machines depending on what you do; in that case it was matrix-matrix rather than matrix-vector, because with matrix-matrix you have a few more degrees of freedom and opportunities. The whole point, again, is the memory system. I want to make sure you understand, or remember, so that it gets really ingrained, how the memory system works and then how your code actually uses it. To me it is not just the qualitative aspect (we know it behaves differently), it is the quantitative aspect that you need to worry about, and understanding the metrics is really the thread through what I am talking about. |
41:22 | This is just a reminder of the huge difference between the collective capability of the cores in a CPU and the memory channels associated with that CPU: typically more than an order of magnitude. That means that if you do not have locality of reference in the application to start with, there is basically no hope of getting anywhere near peak performance out of the arithmetic and logic units, because they will be waiting for memory. If you do have data reuse, then depending on how the caches get used when your code runs on the cores, you may or may not fully benefit from them: if the code mostly sees the cache, it runs at cache speed, more or less functional-unit speed; if it sees main memory, it is an order of magnitude or worse in terms of performance. |
42:34 | So today is basically about figuring out what you can do when you have a matrix multiply. In part, the assignment you did with OpenMP and matrices, and the frustration you may have experienced, comes from exactly this. It was not really the intent to get you frustrated, but the MKL used as the reference case is a well-optimized library, so it gives you a reference for what can be done. There is no particularly clever magic in it; it is built on very basic insight into how the memory system gets used, and I will tell you exactly what kinds of techniques are used in MKL and why they get good performance. The other thing that I hope you noticed, whoever had a good experience with it, is that yes, you can port sequential code to parallel code using, say, OpenMP, and it is not too hard; it may compile and run, but it may not run very well. It is pretty much all about the memory system and the data accesses; that is what gives you the difference in performance. So, sorry, but sometimes these lessons are hard to learn. |
44:05 | So, a reminder about the two simple miss types we talked about last time, compulsory and capacity misses. We are ignoring the kind that is a bit more complex, the conflict misses that come from associativity and small caches; today we will just do the simple things and look at compulsory and capacity misses, and how they are affected by what is called the reuse, or stack, distance. This is the one thing I think everybody should take away, since one of the critical issues in understanding the behavior of your code is figuring out its reference pattern and how that relates to your caches. |
44:56 | your cash is, um, so the large cash model is pretty |
|
|
45:05 | Everything fits in cash, or you expect too many problems. That's kind |
|
|
45:08 | easy. Um, and the small is what I'm kind of tired will |
|
|
45:14 | on today and try to understand what when the cash is relatively small compared |
|
|
45:20 | your problem at hand. For those you who does would say final |
|
|
45:29 | finite element codes is what I can about and for from upwards to try |
|
|
45:39 | represent it when she don't. But a metrics, it would be a |
|
|
45:44 | matrix. But the metrics then has structure. So even if you don't |
|
|
45:49 | the things, then you still have of a large amounts off matrices that |
|
|
45:56 | quite small incense because it depends on degrees of freedom in each node and |
|
|
46:01 | like that. But they're modest in , so each one of those may |
|
|
46:06 | well fit in. Certainly level two , if not level so so understanding |
|
|
46:12 | little bit. The large cash model be beneficial if you try to figure |
|
|
46:17 | to deal with the block structure in of these parts may takes cases, |
|
|
46:22 | that's what we played around with. Matrix vector was different. Luke, |
|
|
46:25 | and tiling and stuff, and we'll to do the same now for the |
|
|
46:30 | This is pretty much what the slide says: matrix-matrix multiplication is not that different from a collection of matrix-vector multiplications. If you have A times B, and you view B as a collection of column vectors, then for each column of B you do a matrix-vector multiplication. So to first order it is very close; it is just one more loop wrapped around to take care of the collection of vectors. The canonical way people tend to write matrix-matrix multiply is three nested loops: the innermost loop does an inner product of one row of A with one column of B. What this particular code does is then, after it has finished one inner product and produced one element of C, step through the elements of C in the same row; and when it is done with a row of C, it goes on to the next row of C. |
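The canonical three-nested-loop form being described, as a minimal sketch (row-major storage; C is assumed to be initialized):

```c
#include <stddef.h>

/* C = C + A*B, ijk order: the innermost k loop is the inner product of
 * row i of A with column j of B; the j loop then fills out row i of C. */
void matmul_ijk(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[(size_t)i * n + j] += a[(size_t)i * n + k] * b[(size_t)k * n + j];
}
```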
47:45 | Now, on this slide: the data objects are basically n-squared-sized things, so three n squared data items. On the other hand, if you count operations it is basically two n cubed, one for the add and one for the multiply. Sometimes people count a fused multiply-add as one operation, but we count the individual operations, so two n cubed. So in this case, unlike the matrix-vector multiplication, where you had two n squared operations on roughly n squared data, every data item here can in principle be used on the order of n times, so you have a much better opportunity for data reuse. In fact, for matrix-vector, each element of the matrix, which is the big data object, is used only once; the reuse you can get is for x and y, since y = A x uses every element of x n times, but you cannot get any reuse of A. Matrix-vector is basically n squared plus two n data and two n squared operations, on average about two operations per data item; here it is of the order of n. So the trick is how you benefit from the fact that data reuse is possible: how you structure your code so that things stay in cache as much as possible, and then you can see it in the cache performance. |
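Written out, the reuse potential being contrasted here:

```latex
\text{matrix-vector: } \frac{2n^2\ \text{ops}}{(n^2 + 2n)\ \text{data}} \approx 2,
\qquad
\text{matrix-matrix: } \frac{2n^3\ \text{ops}}{3n^2\ \text{data}} = \frac{2n}{3}
```

So in the matrix-matrix case each data item can in principle be used on the order of n times.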
49:30 | But that is pretty much what I already said; this is what I want to spend the time on. So, back to the same slide, and I think there was a question. Let us look at what happens in this naive code in a little more detail. The innermost loop is an inner product: you go through a row of A and a column of B, produce one element of C, then the next element in the same row of C, and you go back over the same row of A while another column of B gets read. So in effect you are doing a vector-matrix multiplication rather than a matrix-vector one, because you use the same row of A against the different columns of B: a row vector times a matrix, as opposed to a matrix times a column vector. That is what this version does, and the outer loop just runs through a bunch of these row-vector-times-matrix multiplications. I think that is what this slide is trying to explain. |
50:46 | Now let us figure out how this behaves. For each row of A we read the whole matrix B, which is n squared, and we do that for all n rows, so n times n squared, which is n cubed reads of B. A you only read once: you go through a whole row and then move to the next row, so A is loaded only once, n squared loads. For C you have two references per element in this formulation: you read the initial C from memory and you write it back, but you only load and store each element of C once, so that is two n squared memory references for C. Summing it up, you get n cubed plus three n squared memory references. |
52:00 | Now look at the ratio between the number of operations and the number of memory references: it is two n cubed operations over n cubed plus three n squared references, which is about two. That is no surprise, because the two innermost loops are exactly a vector-matrix multiply, with just another loop wrapped around it; so no surprise that you capture none of the potential for data reuse, since you are simply doing a vector-matrix multiplication a bunch of times. On the other hand, as we saw on the previous slide, in principle one should be able to engineer something with much better data reuse; the potential to reuse some data is there. |
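The counting for this loop nest, in symbols:

```latex
\text{references} = \underbrace{n^3}_{B} + \underbrace{n^2}_{A} + \underbrace{2n^2}_{C},
\qquad
\frac{2n^3}{n^3 + 3n^2} \;\to\; 2 \ \text{operations per reference for large } n
```

That is the same ratio as a single vector-matrix multiply, i.e. none of the potential reuse of B is being captured.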
52:43 | This quotient between operations and memory references is what is called arithmetic intensity, and I talked about it a little back in, I guess, lecture three. There is the roofline model that people use quite frequently to see when things are memory-bandwidth limited and when they are basically limited by the arithmetic and logic units, and for that one uses exactly this ratio of operations to memory references. Now, when we did the matrix-vector case we played around with different loop orders; there were two loops in that case, and now we have three. For matrix-vector the performance was quite different depending on the loop order, in particular once the cache only holds some limited number of elements, and you should expect the same to happen with matrix-matrix. So now I am going to try to put you to work a little bit. |
53:55 | Here is the IJK loop nest. It is the typical thing people mean when they talk about matrix multiplication, the standard n-cubed-operation version. People tend to refer to the variants by the index order; I would not call them different algorithms, since algorithmically there is not much difference. In this case it is i, j, k, going from the outermost loop to the innermost. This slide groups the two loop orders that interchange the i and j loops but keep k as the innermost loop. One can then ask what the data locality of the reference pattern is when k is innermost, regardless of what the outer loops do; in a sense we focus on the innermost loop, which runs all the time, and the order of the other two loops does not matter that much. So what kind of locality do we have for A, B and C in this case? |
55:29 | How about A? When you go from one k to the next, you do not reuse the same element of A, so there is no temporal locality. The other option is spatial locality, and since the storage happens to be row-major and you walk along a row of A, yes, you get spatial locality for A. How about B? No: k marches down a column, and with row-major order there is no spatial locality for B. How about C? Yes: every iteration of the k loop reuses the same element of C, just accumulating the partial results of the inner product, so C has temporal locality. |
56:27 | So that is what you get for k innermost. Now, how about putting i in the innermost loop, then what happens? How about A? Temporal locality? Spatial locality? Nothing, right: each iteration uses a new element of A, and it is in a different row, so there is neither spatial nor temporal locality for A. How about B? Well, what happens to B when i changes? Nothing, so you get temporal locality for B. And C? No: i is innermost, so you go down a column of C, and with row-major storage there is no spatial locality there either. So that is what happens in that case. Last case: j is innermost. What happens when j increments? For A, nothing, it is the same element, so temporal locality for A. For B, you stay in the same row, so now we get spatial locality for B. And for C? Yes, spatial locality as well. |
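For reference, the other two innermost-index placements being compared against the matmul_ijk sketch shown earlier (same arithmetic, row-major storage assumed; the locality just discussed is noted in the comments):

```c
#include <stddef.h>

/* i innermost: A and C are walked down columns (no spatial locality, and no
 * temporal locality for A), while B[k][j] stays fixed (temporal locality). */
void matmul_i_inner(int n, const double *a, const double *b, double *c)
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                c[(size_t)i * n + j] += a[(size_t)i * n + k] * b[(size_t)k * n + j];
}

/* j innermost: A[i][k] stays fixed (temporal locality), and both B and C are
 * walked along rows (spatial locality for both). */
void matmul_j_inner(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[(size_t)i * n + j] += a[(size_t)i * n + k] * b[(size_t)k * n + j];
}
```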
58:29 | So that was it. This is just a simple summary, probably a little out of order on the slide, but if you do this exercise in an exhaustive fashion you can fill in the whole table. Going back to the slide, it is what we just said (I realize the animations are out of order): it is exactly the case we went through, and from it you can work out the number of misses. For B, with k in the innermost loop, there was no locality, so you miss every time for B; there was spatial locality for A, so you only miss once for every cache line's worth of elements of A; and C has temporal locality and just gets reused. |
59:32 | If you go back and do the same analysis for the other two orders I just showed, then, doing the simplistic thing and taking the cache line to be reasonably large, the k-innermost case comes out at about 0.25, the i-innermost case comes out higher, around 0.5, and the j-innermost case, provided the line size is a reasonable number, beats the others. So based on this simple analysis of how data is referenced, you can figure out that the last loop order ought to be the best one in terms of performance, the k-innermost one second, and the worst one the version with i innermost. |
60:31 | These people were nice enough to run this on a real platform, and you can see three very distinct behaviors: at the top the curves that rise up toward 0.5 in terms of miss ratio, and at the other end the purple one at the bottom. So depending on what you do, yes, it behaves as you would expect. The legend shows the loop order for each of the curves. You can also see that it is not a nice smooth curve; there is a lot of spikiness. Anyone want to guess why the spikes? Right: that has to do with the associativity and the cache conflicts, with what gets kicked out at particular sizes; that is where the spiking comes from. But the essential behavior is qualitatively no surprise: the curves are very much in line with the analysis. |
61:42 | So this is again a case where a very simple analysis of your code gives you a pretty good idea of what to expect. Of course, the hope is that the compiler does this for you, and for something simple enough, with traditional loops, it may; but this kind of analysis can tell you whether the compiler actually did the right thing, and it is better to reason about it yourself and set things up so you do not depend on the compiler to figure it out for you. Any volunteers on these questions? The outlier there? Yes, exactly, that one is a fluke of the measurement. |
62:32 | And notice the flat part at the start of the horizontal axis: when the matrices are small they basically all fit in cache and you have no problem with memory. As the matrices get bigger, the arrays no longer fit, and then you can go and do the same simple analysis we did before and figure out pretty much exactly when you start to lose the large-cache behavior as a function of the size of your matrices. |
63:15 | So that is what I had on this. The one caveat is where the optimal order lies: we have argued about this and worked out that it depends on the shape of the matrices. If the matrices are square, this is the right thing to do; if you have rectangular matrices, a different loop order may be better. So it is not a given that this is always the right loop order. And the loop orders I just showed you for... sorry, I am a little ahead of myself, I am getting on to the next thing anyway, so I will come back to that. Yes, I just put this slide in as a reminder: way back in some lecture I showed you two codes to play with, with all the loop variants, and this just shows that. |
64:32 | This is again pretty much what I said. One more thing: with the matrix-vector multiplication we also looked at the tiled, blocked version, and that is in fact a lot more important for matrix-matrix multiplication, because that is where you actually start to get real benefits from it. The structure you see in this loop order is basically the same as the IJK order; the difference is that instead of individual elements you now work with small submatrices, but otherwise everything is the same. So think of your IJK algorithm and, instead of working on individual elements, think of it as a block matrix; then it all carries over. If you do that, you can work out the cost as a function of the block size, where we take capital B rows and columns for the blocks, so each sub-block, each submatrix, is a B by B matrix. Otherwise it is essentially the same. |
65:58 | So, instead of grabbing individual elements, let us count what the blocked version does. The innermost piece is essentially the product of a block row of A with a block column of B. There are n over B blocks along a row or column, and each block holds B squared elements. Before you are done with one block row of A you have gone through all n over B block columns of B, and you have to do that for each of the n over B block rows of A. In the end, the loads for B come out to about n cubed divided by B, and the same holds for A, because in this case it is symmetric with respect to A and B. C is still the same: you compute it block by block, but you only load it once and store it once, so the block size does not matter for C. The loop-order details may matter a little in how things map onto cache lines, but in terms of the number of elements loaded it is not affected by the blocking, and it is still two n cubed operations. |
68:03 | So you can see that if you do this blocked matrix multiply, and the B-by-B blocks of A, B and C fit in the cache, then you get data reuse proportional to the number of rows or columns in the blocks you are using. By using this blocked approach, and keeping the blocks at a size where they fit in cache, you get data reuse that you did not get before; and, of course, the larger the block size B, the better it is for the reuse. |
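A minimal sketch of the blocked (tiled) version being analyzed, with square bs-by-bs blocks and, for brevity, n assumed divisible by bs:

```c
#include <stddef.h>

/* Blocked C = C + A*B: the three outer loops walk over blocks in an ijk-like
 * order, and the three inner loops do a small bs-by-bs matrix multiply whose
 * operands can stay resident in cache while they are reused. */
void matmul_blocked(int n, int bs, const double *a, const double *b, double *c)
{
    for (int ii = 0; ii < n; ii += bs)
        for (int jj = 0; jj < n; jj += bs)
            for (int kk = 0; kk < n; kk += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++)
                        for (int j = jj; j < jj + bs; j++)
                            c[(size_t)i * n + j] +=
                                a[(size_t)i * n + k] * b[(size_t)k * n + j];
}
```

Each element of an A or B block is reused of the order of bs times before it leaves the cache, which is where the n-cubed-over-B loads figure above comes from.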
68:44 | The next slide then shows that, modulo a few small terms, essentially the three blocks of A, B and C all need to fit in the cache at the same time. That means the block size depends on what cache you have, and it is proportional to the square root of the cache size in elements. Then you can figure out what you need for things to be balanced, so that memory does not get in the way: you look at the number of data loads and the time for each load, you look at the time it takes to do a floating-point operation and how many of them you do, and you compare the ratios. It turns out that, given this arithmetic intensity, you can figure out what block size and what cache size you need for this matrix-multiply computation not to be limited by the kind of memory you have. |
69:58 | It is a fairly simple exercise, and you can also figure out whether, even with a cache, it is going to be memory-limited or not. You can work it out from the data sheet: from when we talked about CPUs, you have a clock rate and an instruction width, which may be scalar or, in this case, a SIMD width that the compiler should be finding and using. If you can do, say, 16 operations, adds and multiplies, in a single instruction, and it runs at whatever clock rate with whatever number of cores, you can plug that in and get the time per floating-point operation; and you have the memory bandwidth from the number of memory channels and the speed of the memory. So you can figure out what the time per memory access is and what the time per floating-point operation is, and then figure out, at the optimal block size, whether you will be limited by the ALUs and the cache or by the memory. |
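In symbols, the two conditions just described, with M the cache capacity in elements, B the block dimension, and t_mem and t_flop the time per memory access and per floating-point operation (my notation, following the reasoning above):

```latex
3B^2 \le M \;\Rightarrow\; B \lesssim \sqrt{M/3},
\qquad
\text{compute-bound requires}\quad \frac{\text{ops}}{\text{loads}} \approx B \;\ge\; \frac{t_{\mathrm{mem}}}{t_{\mathrm{flop}}}
```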
71:05 | Which is just to say: I think it comes up on a slide here. Yes, this slide makes the comment that, for those who are theoretically inclined, one can also prove this particular scheme is the optimal way to get the maximum reuse out of each level of the memory hierarchy. So you apply it to level one, you apply it to level two, you apply it to level three, and whatever else you have; if the problem goes beyond your main memory, then to virtual memory and the disk as well. It applies to every level of the memory hierarchy, which is what the next slide says. That means everything you were taught in software engineering goes out the window: you are going to mess up your code and make it much less intuitive and harder to maintain, because you break things up into block sizes for every level of the memory hierarchy. Instead of your three nested loops you get three times the number of memory levels you have in the system. |
72:20 | And if you are really clever, the innermost level is not the level-one cache but the registers. That is what people like the MKL developers do, and they get paid to do it and enjoy it. At the register level things are pretty much fixed, because the number of registers tends to be 32 or so on most processors, so you end up with something like four-by-four blocks there. So this is pretty much what I said MKL does: in addition to choosing the loop orders, it places blocking for the various levels of the memory hierarchy, and that is how they get their performance. Now, that is obviously painful. |
73:18 | So some people want to automate the process. These days I guess you can try to be clever; these days they use machine learning. In the old days you did it the brute-force way and just tried all the different options. The brute-force way is what is done by the so-called ATLAS library, which you can download from Netlib. What it does is basically generate all kinds of blockings and loop partitionings and loop orders, produce code for all the different variants, then try them out on your platform and figure out which one is the winner. Of course, that means that if you only do one matrix multiply, you spend far more time figuring out which version is fastest than you save, so it does not really pay unless you repeat the same operation many times. But it does give insight into the range of performance you can get as a function of the partitioning and loop orders. |
74:27 | Well, that is pretty much the summary I wanted. As I mentioned, I think last time, when I talked about the tiled matrix-vector case, there is this notion of space-filling curves and other orderings. Some people say, again, this blocking is a pain: can one do something that achieves it automatically, as an alternative to the brute force? It turns out, in principle, that recursive algorithms, instead of the intuitive loops I have shown, can do that. The idea is like a block matrix at the highest level: suppose you have your n by n matrices; you partition each of them into four blocks, basically n over two on a side, and you multiply in terms of those blocks. At the next level you take the block products and recursively partition them again, doing smaller and smaller matrix multiplies, until eventually you are down at some small size. So you recursively partition the matrix multiply into block matrix multiplies. |
76:10 | That means that when you get down toward the element level the pieces definitely fit in cache, and things stay in cache as you work back up toward larger and larger block sizes; eventually you run out of room for the number of blocks that fit in cache, and then, depending on the order in which you process the blocks, you can use these space-filling curves to be clever and try to preserve more and more of the data in the cache. |
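A minimal sketch of the recursive, divide-into-quadrants scheme just described, assuming n is a power of two and plain row-major storage with leading dimension ld; this is an illustration of the idea behind such cache-oblivious recursions, not the tuned library code discussed below:

```c
#include <stddef.h>

/* Recursive C = C + A*B on n-by-n submatrices with leading dimension ld.
 * Each level splits A, B and C into four n/2-by-n/2 quadrants; the eight
 * recursive block products eventually reach sizes that fit whatever cache
 * level is there. Call as matmul_rec(n, n, A, B, C). */
void matmul_rec(int n, int ld, const double *a, const double *b, double *c)
{
    if (n <= 32) {                       /* small base case: plain loops */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[(size_t)i * ld + j] += a[(size_t)i * ld + k] * b[(size_t)k * ld + j];
        return;
    }
    int h = n / 2;
    const double *a11 = a,                     *a12 = a + h,
                 *a21 = a + (size_t)h * ld,    *a22 = a + (size_t)h * ld + h;
    const double *b11 = b,                     *b12 = b + h,
                 *b21 = b + (size_t)h * ld,    *b22 = b + (size_t)h * ld + h;
    double       *c11 = c,                     *c12 = c + h,
                 *c21 = c + (size_t)h * ld,    *c22 = c + (size_t)h * ld + h;

    matmul_rec(h, ld, a11, b11, c11);  matmul_rec(h, ld, a12, b21, c11);
    matmul_rec(h, ld, a11, b12, c12);  matmul_rec(h, ld, a12, b22, c12);
    matmul_rec(h, ld, a21, b11, c21);  matmul_rec(h, ld, a22, b21, c21);
    matmul_rec(h, ld, a21, b12, c22);  matmul_rec(h, ld, a22, b22, c22);
}
```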
76:28 | So, yes, people tried this, as I said. Unfortunately, in practice this kind of automatic, recursive procedure has not been able to beat the hand-tuned approach; that is still what is used in the libraries out there if you want something that makes maximum use of the hardware. In this case, on one platform I think they got to about two thirds of peak, and on another to about half of peak, using the recursive approach. That has to do with a lot of details of how the memory system actually works, how cache lines work, and a lot of things underneath. This is just a reminder of some of those references; I put them in there in case anyone is interested. |
77:30 | The first one is from Charles Leiserson at MIT: he and his student were among the earliest, if not the earliest, to think about doing matrix multiplication this way, and being a combination of a systems person and a theoretician, he proved a bunch of properties. He has very nice lectures, and if you are into this, his lectures on the recursive approach are great. But in certain practical settings it comes out behind. So that is what I had on matrix multiplication. |