© Distribution of this video is restricted by its owner
00:00 | and start lecturing this point. It like it is recording. Yes. |
|
|
00:11 | um them just comment to this lecture that reminded me started, I started |
|
|
00:24 | talking a little bit about oh hardware in order to understand what resources I |
|
|
00:36 | and what it takes to potentially good . And then we talked about tools |
|
|
00:43 | actually find out what the codes are . And then after that, talked |
|
|
00:51 | three different programming models that are being and HPC and pretty much anything in |
|
|
00:59 | of apparently computing that is open, official memory programming and programming. Heterogeneous |
|
|
01:07 | . And then the programming of clusters an P. So now as going |
|
|
01:18 | return to talking about techniques to actually good performance for applications since and that |
|
|
01:28 | think is important for we're projects and . Um, and so I pointed |
|
|
01:35 | and some of the feedbacks too. proposals. It's essential that in your |
|
|
01:42 | you have a good understanding and good of how efficient your cold is. |
|
|
01:50 | again kind of the basic four So I'll do that. Return to |
|
|
01:58 | how to get efficiency by doing Very simple Or two. Simple examples |
|
|
02:04 | to expect a matrix matrix multiplication and to understand how to make such simple |
|
|
02:11 | work well on current type of Yeah. So uh, yeah. |
|
|
02:23 | , so first a little bit about background in terms of to what extent |
|
|
02:31 | why bother essentially. And um, I'll give some background and I'll talk |
|
|
02:39 | the concept known as that distance. is really important. Try to get |
|
|
02:44 | idea what to expect in terms of performance and how to restructure codes in |
|
|
02:51 | for them to work with and cash systems, which is pretty much all |
|
|
02:58 | today and I'm not doing it. go through in these two examples made |
|
|
03:05 | spectrum Matrix matrix multiplication two go So you get an insight in how |
|
|
03:12 | think about your approach in order to good performance. Okay, so first |
|
|
03:20 | brother and this is kind of an . They don't need to worry too |
|
|
03:24 | about all the different labels on the person. The plot on the right |
|
|
03:31 | side of the side. But the thing to pay attention to is kind |
|
|
03:37 | the difference between the worst performance gold is at the bottom the blue dots |
|
|
03:44 | is apparently above zero and the good that are purplish squares on top. |
|
|
03:52 | as you can see it's a huge . Um I was at least two |
|
|
03:58 | of magnitude difference and this is for matrix multiplication. Very simple competition. |
|
|
04:07 | , depending upon how the gold destructed this particular example for this particular |
|
|
04:14 | There was more than two orders of difference in performance. So that's the |
|
|
04:19 | why things matters and how things are done and understanding how cash is actually |
|
|
04:26 | work and how to make good use them. So this is just reminds |
|
|
04:33 | a little bit about metrics in terms latency and bandit and spices that are |
|
|
04:40 | . And I'll come back to that . Mm So and then in terms |
|
|
04:47 | crashes and there were the three different of cache misses that do we affect |
|
|
04:56 | ? The compulsory misses or cold start ? The best of the first stress |
|
|
05:00 | to in a cache line. There's other way then there will be a |
|
|
05:05 | because data starts out in main memory eventually you want data back in main |
|
|
05:14 | . So then I typically do not cash. So that's the compulsory or |
|
|
05:20 | start. This is the capacity We talked a little bit about talked |
|
|
05:26 | memory systems simply that typically the cash any cash. It's too small to |
|
|
05:37 | the entire problem. So that means , capacity issues. Mrs regardless. |
|
|
05:44 | thanks for being done. And then were the additional things we talked |
|
|
05:49 | the cost typical caches are associated and that means that even if there potentially |
|
|
06:01 | space because of the greater map cache in main memory is mapped to the |
|
|
06:09 | . There can be conflict misses. these are the kind of three types |
|
|
06:15 | dishes And the example I'm going to about want to first and mostly focused |
|
|
06:23 | compulsory and capacity mrs. So that's of the first um at least complicated |
|
|
06:30 | understand cache behavior for codes. Um so this is special as I |
|
|
06:40 | going to use a very simple model can't understand cache behavior and uh so |
|
|
06:48 | will use symmetric vector matrix matrix multiplication try to understand What one can do |
|
|
06:54 | the consequences for performance for a very cash model. Okay, so the |
|
|
07:09 | sorry sure. The very simplest model to ignore the conflict message as I |
|
|
07:16 | . So just focus on compulsory misses capacity Mrs and to do so in |
|
|
07:23 | following several slices and I've world from fresh up in galilee at uT austin |
|
|
07:31 | that is a compiler type person is look at most enormous large cash model |
|
|
07:42 | means that there's no capacity Mrs and more real when it's a small |
|
|
07:50 | when there is both compulsory and capacity But for the first several examples not |
|
|
08:00 | about conflict messes. Mhm So this pretty much against what I already signed |
|
|
08:08 | large cast model. That means the , it's no capacity is only the |
|
|
08:15 | missed. Now the cash is what is also ice. So a number |
|
|
08:25 | these slides so follow them and try figure out when is the cash can |
|
|
08:32 | one can be considered large Respect to problem that 10 or well, is |
|
|
08:39 | more like a small cash relative to problem that happens what we're going to |
|
|
08:46 | to try to understand the car's Is this notion of that distance that |
|
|
08:54 | will talk more about common slider on show at the point of the stack |
|
|
09:02 | that it tries to account for the that typically there's a number of the |
|
|
09:13 | and I was being read at some on me want to use the same |
|
|
09:21 | values again. Which is certainly the in for instance matrix vector multiplication because |
|
|
09:28 | vector and made to effect an application used for every role in the multiplication |
|
|
09:36 | that case, elements of X depending the access order. Eventually a given |
|
|
09:44 | of export the read several times and stack distance and tried to account for |
|
|
09:51 | many other cache lines are being accessed one trying to use the same trash |
|
|
09:56 | again. Right. So it is of the notion of the stack |
|
|
10:04 | So if one looks at sort of timeline of references to memory then there's |
|
|
10:15 | particular Cache line is you know, at the same time are one and |
|
|
10:20 | there are a number of others cash or a memory address is being accessed |
|
|
10:25 | eventually that time are too than the cache line that was accessed that time |
|
|
10:32 | one is access to get. So , this stack distance then has an |
|
|
10:42 | . How much gets kind of loaded cache from memory and clearly the further |
|
|
10:50 | the Lord of the distance is the Or two is from R one the |
|
|
10:56 | data has kind of been loaded into and at some point depending on the |
|
|
11:02 | size, the car's gets full and the goat capacity nemesis. Yeah. |
|
|
11:14 | I guess I should stop before now into the examples in any questions on |
|
|
11:20 | basic concept of large and small attraction distance. Okay so here is our |
|
|
11:40 | generic might expect an application to nested using what typical in a product |
|
|
11:50 | The role of items selected for That gives an element of the result |
|
|
11:55 | work. So in this case based reform memory references and uh statement there |
|
|
12:08 | and for X. one for hey run for why on reading A and |
|
|
12:13 | there's another one for storing sorry for why and then storing why. So |
|
|
12:19 | becomes for references uh to executing the which is an executed then squared times |
|
|
12:27 | in total. Uh and this can way of doing it. It's four |
|
|
12:31 | spread memory references which will kind of in this example. Now what's pointed |
|
|
12:39 | on this slide here is also Well good compiler will probably exactly what's |
|
|
12:47 | and I'm here try to make sure why it doesn't get read and written |
|
|
12:56 | the same value for every iteration and most loop since it's why why is |
|
|
13:06 | for old J. So that means composition actually keep it and registers instead |
|
|
13:12 | writing it and reading it from my for every intervention but for simplicity in |
|
|
13:19 | student I. D. Version of scored as the reference case and then |
|
|
13:24 | to figure out how various uh new and other uh manipulations of this competition |
|
|
13:36 | affects. Yeah. And what this I used row major storage for a |
|
|
13:48 | is important as possible, pointed out think early on in his course and |
|
|
13:55 | guess can assignment I believe in playing with luke porters and major expansions |
|
|
14:04 | So It's the four scenarios that five that uh I've used to try to |
|
|
14:13 | what happens in the first two years doing very simplistic thing on just assuming |
|
|
14:21 | elements are accessed element wise basically there's line size effects on that one. |
|
|
14:28 | then we can kind of show them is the consequence If one as a |
|
|
14:35 | architecture today Cash loan sizes are not but typically there are 64 or 1 |
|
|
14:43 | bytes For them. As you can from Lecture four or 64 by |
|
|
14:49 | Lots the significant of the will the in the next few slides. So |
|
|
15:01 | say it a little bit difficult to interaction this one but over the |
|
|
15:06 | But we'll see. So we have gun's a generic matrix rectum out the |
|
|
15:13 | tool which is also known as either stopped or deduct. Um For inner |
|
|
15:21 | product. And the first the is double position and if it's single it's |
|
|
15:26 | essence that so generically it's about product executed. So now there you have |
|
|
15:38 | cached on signs of the investors that read everything. One element at the |
|
|
15:42 | whether it's a the X or Um So then we tried to figure |
|
|
15:48 | in the large cash models. Large models. Again remember that was only |
|
|
15:55 | misses. So now we'll see if can help me work out the number |
|
|
16:04 | MRS for start with a how many misses will it be in this coat |
|
|
16:16 | mm have any takers? So and is willing to speak up. Uh |
|
|
16:40 | . Again A starts out in Right? So that means every load |
|
|
16:47 | a the tinkerer Castmates. There is other way it has to be loaded |
|
|
16:55 | every element of A is not in cache. So basically one that and |
|
|
17:03 | mrs for a. Now it's down about X. And the one is |
|
|
17:15 | to come up with the number for . Many cache misses for X. |
|
|
17:26 | one answer from students and N. Okay, good. And why? |
|
|
17:46 | just another minute the statement is executed square times. Yeah. Um the |
|
|
18:01 | was that it is and I say in cache misses for X. |
|
|
18:29 | The wholeness and all. Yeah. there is only in elements of |
|
|
18:41 | That's true. But the statement is and squared times. All right. |
|
|
19:04 | unless it any comment, I don't I was going to the coast. |
|
|
19:10 | yes, it's and misses. Now reason is only enemies is it's because |
|
|
19:18 | assumed it was a large cash So that means for when excess pain |
|
|
19:29 | you know in the innermost loop going jay it was one to end and |
|
|
19:35 | the excess has been loaded. And it's a large charge it all fit |
|
|
19:39 | the cache. So when the outer for I is executed then X. |
|
|
19:49 | already in the cache. So that's between A and X. Because it |
|
|
19:57 | to have N squared element. Each is only used once song that. |
|
|
20:04 | is there is no reason reuse but a potential reuse or there is reuse |
|
|
20:10 | X. And if the cash is enough X. Entire vector X fit |
|
|
20:16 | the cache. So it only red as well as elemental. But it's |
|
|
20:22 | again N squared times for each element excess used sometimes. But there's only |
|
|
20:30 | for the first time it's being used I got something similar for why? |
|
|
20:40 | and finance it all up. This the total number of cache misses in |
|
|
20:45 | case of a large cash. So again only first time because once it |
|
|
20:52 | been read the cars can hold it there is no issue. So in |
|
|
20:58 | case, if one looks at the ratio Denmark, it's something that depending |
|
|
21:05 | the size of event then it becomes quarter of the readings of data values |
|
|
21:19 | a miss relative to the naive for squared model any questions on this or |
|
|
21:32 | makes sense. Okay. And presume makes sense. Now suppose the cash |
|
|
21:48 | small relative to the problem size and what happens? Oh let's see. |
|
|
22:00 | is no difference what A has to read. Um So there's no way |
|
|
22:06 | it for that. But what happens X and why? No, the |
|
|
22:21 | vector X. Because the cash is , will not fit in the |
|
|
22:30 | So if you look at this loop was right. That kind of illustrated |
|
|
22:35 | the boxes on top that. And the first situation of being the |
|
|
22:43 | loop and were excellent. J Index and then Next one was the J |
|
|
22:50 | two, et cetera. So that And the woman come to J. |
|
|
23:01 | two. We have four elements already there and after J equals two is |
|
|
23:08 | there is an additional two elements Uh one is still there? But now |
|
|
23:14 | have anyone to an X. two also gets loaded into cache and others |
|
|
23:20 | for every day we had two more uh into the cache. And at |
|
|
23:29 | point that means one run out of size if it's a small cash. |
|
|
23:37 | then my mom guns try to get computing wild too. That means they |
|
|
23:50 | it F and use the typical, know model of least recently used That |
|
|
23:59 | X one is no longer in the , it has been over it because |
|
|
24:05 | cash was small. Same applies to when they get to the next |
|
|
24:10 | But hey is not being reduced So that doesn't really matter in terms |
|
|
24:16 | cache misses. So here's what happens yes. Um, the number of |
|
|
24:27 | are exes and go through the first , yeah, index for jay the |
|
|
24:34 | slope. There's the real terms but then when you come to doing |
|
|
24:42 | for the next I value the the X values you need are no |
|
|
24:50 | in the cash. So that means each traditional role we're going to compute |
|
|
24:58 | times six how to reload the entire X. So now I'm basically get |
|
|
25:06 | and square mrs for X for It's the case is better because if |
|
|
25:16 | do it in this particular order, only occupies one place and it stays |
|
|
25:25 | as you keep going or women day . So now these are one the |
|
|
25:34 | misses. And these were the capacity And that is only a concern for |
|
|
25:41 | . The capacity mrs. So now have a whole number of mrs is |
|
|
25:48 | N squared instead of N squared. basically budget show this thing. |
|
|
25:55 | Um uh, as I said towards bottom of the slide and I'll talk |
|
|
26:02 | that on the next slide. The looks out there. Hell are you |
|
|
26:09 | strategy but it's important to realize that of the cash is small. So |
|
|
26:17 | the entire vector expo not fit. but only needs one element of |
|
|
26:23 | Of Y N A. So that's such a big issue. So here |
|
|
26:32 | a little bit more careful now doing are you type of idea? And |
|
|
26:40 | , if I want to figure out , as I said, you |
|
|
26:43 | for a given problem size, can cash be viewed as a small or |
|
|
26:48 | cash in this case? Um, cash become small relative to the problem |
|
|
26:57 | ? When the cash is smaller than Monroe on a tire back for |
|
|
27:10 | And um, the elements of life one in the service. So that |
|
|
27:17 | plus two. So dependent. So the size is roughly and when |
|
|
27:27 | the answer of the cash starting to as small as opposed to a large |
|
|
27:37 | . Now I can go to their is actually the two M Okay. |
|
|
27:42 | A because both one rule of a one row of access to previously plus |
|
|
27:53 | and one more element needs to be the cash because you know, want |
|
|
27:57 | do um, the first element of is move to the next situation or |
|
|
28:08 | , so that's when it becomes actually , when the matrix dimension and is |
|
|
28:18 | greener than half of the car Then the cash start to act in |
|
|
28:22 | small case, small cash case rather the large trash case and they comments |
|
|
28:34 | questions on this analysis. So that's of the behavior later. I was |
|
|
28:52 | . So what do you do about to try to oh, make uh |
|
|
28:59 | were called mm hmm. Make effective of kind of a small cash |
|
|
29:10 | So here is kind of then you see in the different scenarios that Or |
|
|
29:15 | smaller than cache size over two. it kind of behaves that's a large |
|
|
29:21 | in once and becomes greater than half the cache sites than it starts to |
|
|
29:27 | as a small cash. Um, , scenario two was still using the |
|
|
29:37 | , flies cache line size being But the difference is to interchange the |
|
|
29:45 | border and it's then known as an type computation and set up there. |
|
|
29:54 | competition is being used and understands for X pass by. And in this |
|
|
30:01 | the inner the loop border is such what happens is instead of computer in |
|
|
30:07 | computing in their products one There's the columns of a and that means it |
|
|
30:15 | in a partial contribution to the entire why. And then one keeps adding |
|
|
30:22 | this portion contributions to the result Why as scale directors scale columns away |
|
|
30:33 | should say. But um, martin through the same analysis and it turns |
|
|
30:41 | that the behavior in the is ends being the same. So no further |
|
|
30:49 | that that will continue it on your . Now the next thing is okay |
|
|
30:53 | . So make it a little bit real than to see what happens if |
|
|
30:59 | now has a cache line size the be great in the months and arbitrary |
|
|
31:07 | A company 64 Bytes or a equivalent a double precision numbers which is common |
|
|
31:15 | maybe 1 28 is used in some . So still large trash models. |
|
|
31:25 | now the difference is For every Miss gets be elements because cash long sizes |
|
|
31:34 | so the number of these, this simply just getting scaled by the cash |
|
|
31:39 | . Say so that's true for all them. Otherwise technologic is the |
|
|
31:44 | So now I want to get something or now of the cash. This |
|
|
31:50 | simply gets scaled down by the size the cache like so that's part of |
|
|
31:55 | reason why not cache lines. So 64 bit and no one can have |
|
|
32:03 | cache lines. But then what happens depending upon how you use your |
|
|
32:11 | the penalty to bring in the cash and increases with these sorts not to |
|
|
32:19 | . That is universally good. And know if you use all the elements |
|
|
32:23 | the cash on it is good but you only use a few or one |
|
|
32:29 | in a cache line, a big can uh them can in fact be |
|
|
32:35 | to performance. So there is a off and discuss that a little bit |
|
|
32:41 | when I talked about members systems. trade often paris, ISIS and cash |
|
|
32:46 | ISIS but for now this has Yes, it's effective in reducing the |
|
|
32:52 | system and um the same thing I do with the cash, small cash |
|
|
32:59 | and there's essentially all their friends accept get scaled by the cache line size |
|
|
33:08 | so this a long dive into that further. And the so this just |
|
|
33:18 | the scaling with a corresponding science. various case the graph used four elements |
|
|
33:24 | mark the eight elements that is typical . Um So this was you know |
|
|
33:33 | expo blue border and the conflict is analysis simpler. Similar. Sorry. |
|
|
33:45 | I don't think there is much more trying to do and that that the |
|
|
33:54 | and And of course one as crash upside be size speed then the B |
|
|
34:02 | up in a number of places and transition now happens. And if Ellery |
|
|
34:11 | um basically cache size divided by the line size versus if one really had |
|
|
34:21 | the degree of freedom of figuring out they had to replace in the |
|
|
34:27 | then it will be still similar that transition happens. Um Then is about |
|
|
34:34 | cash plan size. See someone gets little bit like this type of picture |
|
|
34:41 | shows depending upon whether the right uh Trash can size is one or greater |
|
|
34:54 | what when the transition happens. So questions on this. Mhm. All |
|
|
35:12 | . So and when the potential one the projects and try to understand |
|
|
35:20 | Um and it's a useful thing to what the cash on science is trying |
|
|
35:27 | interpret your results. No, I . So what can one do to |
|
|
35:39 | to improve potential the behavior of the it is to use a lot? |
|
|
35:50 | one was blocking. So the best try to well on petitions The two |
|
|
36:01 | and two um tiles or blocks. our investment do a collection of small |
|
|
36:12 | spectrum multiplies where each one of those matrix have two multiplies fits in the |
|
|
36:21 | and thus behaves as there was a cash even though it's not and then |
|
|
36:29 | other looks scan over the block size to get the whole drop down. |
|
|
36:38 | right. Somewhere from the concept then one will notice that one does get |
|
|
36:47 | performance and I'll show more of I want to talk about 98s latest |
|
|
36:54 | , where it's much more significant then is for me to come out |
|
|
37:00 | but this is just the very simple . End up. That's good to |
|
|
37:04 | from the matrix vector multiplication case. let's see next. Yes. Um |
|
|
37:20 | is a little bit I guess at comment on this blocking algorithm. So |
|
|
37:30 | order in which controversies the box is important for the performance. So depending |
|
|
37:39 | how the layout is in memory one another way. All traversing what I |
|
|
37:46 | do things by column or by role terms of the inside the blocks. |
|
|
37:54 | then I want those things from one to another. It's also one Monday |
|
|
38:00 | blocked rules and block columns. And the traversal that affects the memory performance |
|
|
38:09 | well. It does affect the performance data in the cache but it affects |
|
|
38:16 | the team around performance depending on how are laid out. Remember. So |
|
|
38:24 | particular ordering um that you see on right hand side they're kind of known |
|
|
38:33 | more than the ordering song. The that I guess first investigating it really |
|
|
38:42 | computers they tried to absolutely change traversal index place for I and J. |
|
|
38:50 | such one or the other or some . Also traversal of memory results in |
|
|
38:58 | to optimize the Iran performance. And is kind of what actually happens then |
|
|
39:09 | himself. Um administrations for a very block cash model. So law is |
|
|
39:19 | because administrations are not good. So is ideally for the proper block and |
|
|
39:27 | things to be recording to the red . Uh Well this is just the |
|
|
39:39 | done summarized. There's nothing that I already said. Um So now um |
|
|
39:51 | will I guess I should mention that mining is to set up petitioning the |
|
|
39:58 | a vocabulary that compound that people tend use quite often um invest there petition |
|
|
40:09 | loop into. I said I look still two loops where the inner was |
|
|
40:16 | is length is or integration space is determined by the character size is that |
|
|
40:29 | intended for. So the discussion I made was for a very simple model |
|
|
40:35 | there's one cash and one main memory in practice it has to be applying |
|
|
40:42 | each level of the cash. That I'm trying to optimize it for um |
|
|
40:50 | levels of cash that is common in CPUS one does Styling for each one |
|
|
40:56 | the three cash is for each loop . So that is it kind of |
|
|
41:08 | one may have learned and how to . Nice clean cold because the coat |
|
|
41:17 | and stop starting to look they all dependent on the particular underlying architecture in |
|
|
41:25 | world to get performance. Um The of loop nested loop levers. As |
|
|
41:33 | said it's a function of the number crashes and the duration range. But |
|
|
41:41 | inner loops depends on the church sizes the various levels but they can parameter |
|
|
41:46 | that doesn't need to be hard There is a dependence on the particular |
|
|
41:54 | of the processor, Vegas. next is maintenance matrix multiplication. So |
|
|
42:05 | questions don't see anything. Okay well do the thing that for which this |
|
|
42:24 | of tailor things to crash structure is important compared to make this spectrum |
|
|
42:33 | So so here's the income. The matrix matrix multiplication cole bring us the |
|
|
42:44 | um The inner loop doing a matrix matrix or the two enormous luke doing |
|
|
42:50 | vector multiplication. But The Inner Loop and in the product take a roll |
|
|
42:55 | eight times a column of B. to generate one element of you |
|
|
43:02 | the product metric. See and the difference compared to the magnetic spectrum application |
|
|
43:10 | kind of the outermost loop on I basically carry things out as a sequence |
|
|
43:20 | made to expect um applications. So we have as we know, there's |
|
|
43:31 | in cute operations to do a matrix of and by n matrices um You |
|
|
43:41 | on the 2, 1 of them in two years. The multiply and |
|
|
43:47 | is the ad size dalton and there's and uh each in our old times |
|
|
43:56 | column they're all wait times a column be is to in operations and then |
|
|
44:02 | repeated for every column of be in column or all of a sudden we |
|
|
44:10 | to and you just hope, you , no, the mattress is a |
|
|
44:19 | and C. Or each have n elements. So on total there are |
|
|
44:26 | squared three n squared words that needs be loaded from memory and principle. |
|
|
44:36 | that means and there is ideally um our mathematical intensity of basically order and |
|
|
44:48 | basically two in divided by three to operation and the three n squared data |
|
|
44:57 | . So so if one can come with algorithms such as data reuse well |
|
|
45:04 | shouldn't be able to yeah kind of the required memory bandwidth. Um correspondent |
|
|
45:14 | proportional to potentially. And we'll go more about it. But the idea |
|
|
45:21 | that since now there is data reuse is significant. It's order in I |
|
|
45:28 | like to take advantage of it for metrics vector multiplication. There isn't much |
|
|
45:38 | a S N square and X and is order N. And I can |
|
|
45:46 | X. But again There are two quite operations and there is N squared |
|
|
45:57 | . So the maximum reuse is approximately . So that's why matrix matrix multiplication |
|
|
46:08 | and much more interesting example and how figure out how to get they they |
|
|
46:15 | and to use caches. Yeah, now we're going to use the ideas |
|
|
46:22 | the matrix factor and application case. . So our mountain actually we have |
|
|
46:31 | um that a performance compared to the this particular thought history. Yeah. |
|
|
46:40 | yes, there's this small and large models and we'll talk about that. |
|
|
46:46 | first here is kind of what's We'll annotate this cold a little bit |
|
|
46:53 | terms of data references. Um and depending upon what is small or large |
|
|
47:01 | mall. So um and I will a little bit more this more in |
|
|
47:13 | of that culture to during that So it's essentially just listen this makes |
|
|
47:24 | or it's nothing um more of this than putting in the other loop on |
|
|
47:32 | to do this sequence of matrix vector , everything behaves estimated spectrum multiplying in |
|
|
47:39 | particular case. So if it looks little bit more and depending upon what |
|
|
47:46 | assumptions are, how large the cash . So on the arithmetic form calculation |
|
|
47:57 | terms of memory references on this slide . Yeah from the lewdness that we |
|
|
48:06 | that this is in the product that the since one go through a row |
|
|
48:16 | A. And then wonder and the of the heat and then when the |
|
|
48:24 | outer loop the loop then um moves the same role of A. But |
|
|
48:37 | use a different column of the it's matrix vector for each integration |
|
|
48:47 | And then you do a bunch of spectrum applications to do. Okay so |
|
|
48:58 | that case what happens is that if cash has not really learned so everything |
|
|
49:08 | in cache. What's happening is that you go through the J lube as |
|
|
49:18 | as the K loops at the two loop, then you have effectively read |
|
|
49:24 | entire matrix B. And then you to do they are look so in |
|
|
49:34 | scenario basically the matrix B gets red time. So it's best. And |
|
|
49:41 | references memory references for B. In case with the soon H. |
|
|
49:46 | A fill it in the cash. that means A. Is only read |
|
|
49:53 | you don't need to be loaded for column will be that you use. |
|
|
49:58 | that's why is Red Monsters to be . And similar for see it kind |
|
|
50:03 | stays in josh so on. They the load and store of sees mr |
|
|
50:08 | and square. So that's what you . So now if one looks at |
|
|
50:14 | american intensively for this particular lewdness with assumption um Cash being large enough to |
|
|
50:22 | essentially roll away plus a few more one guess the arithmetic intensity two or |
|
|
50:32 | for the number of arithmetic operations per of reference to the same thing as |
|
|
50:38 | matrix vector multiplication. So I was great because in principle and should be |
|
|
50:45 | to get um charismatic intensity that approaches 2/3 event. It's just um when |
|
|
50:57 | a good job in terms of figuring out. Oh, how to orchestrate |
|
|
51:02 | loops or petition the loops. So the typical way of looking at this |
|
|
51:13 | No one talks about the various orders the loops. You know, In |
|
|
51:19 | making expected case we looked at the product and the XP version. Uh |
|
|
51:27 | the print. Change the loop border the two ropes we had in that |
|
|
51:32 | . Now we have three loop for and content to label them. |
|
|
51:37 | J K. For the loop in is being used and the order in |
|
|
51:42 | they appear with the first thing the most loop and the last thing, |
|
|
51:47 | innermost loop. So one of these permutations of I'm going to order the |
|
|
51:58 | . So comment on what happens in , I J K or J K |
|
|
52:11 | . So one thing to consider it is now one needs to worry about |
|
|
52:18 | in order and how it affects cache . So in that case we have |
|
|
52:27 | locality measures, we talked about way , I'm going to talk about memory |
|
|
52:33 | is temporal and special mortality. So if you look at okay being the |
|
|
52:43 | stoop, regardless of the order of two other laws, we can notice |
|
|
52:54 | for a for instance that we start a as in the case changes, |
|
|
53:04 | proceed from one element or in a to the next element in the same |
|
|
53:12 | . So that means actually a good locality for a see, it turns |
|
|
53:23 | it's not respect. Okay, in case because in this case it also |
|
|
53:27 | for the loop jay because we have and the next time I C I |
|
|
53:35 | plus one. So basically C is access in the order and the row |
|
|
53:41 | order. That means this could also locality results of temporal locality and this |
|
|
53:47 | because see I J is used in iteration in the Now for me things |
|
|
54:01 | not so good. I didn't list on this slide that listed the good |
|
|
54:07 | because with B in the innermost loop go from traverse B, column by |
|
|
54:18 | . So it goes from one element a column to the next element in |
|
|
54:22 | same column. But they are separating in the row major order by the |
|
|
54:30 | of the role of the, So that case the special locality for B |
|
|
54:38 | fair and there's no good locality for . Similarly, we can then look |
|
|
54:50 | , so I put I in the loop and it turns out then if |
|
|
54:56 | look at uh the toronto, look this case uh be because if I'm |
|
|
55:08 | in the innermost loop, um the stays the same every iteration, there's |
|
|
55:14 | dependence on I. So it has temporal locality. Um And that's pretty |
|
|
55:22 | all one can say because if you through from eye to eye passed one |
|
|
55:28 | it's in the innermost loop. That controversies both A and C. A |
|
|
55:35 | by columns or from an element in column to the next element in the |
|
|
55:41 | column. That is something with the strike equal to the length of the |
|
|
55:46 | of A and C. So that's special opportunity as well as temporary |
|
|
55:53 | Some things are bad and the final was to look at J in the |
|
|
55:59 | slope and I can figure out what do special and temporal locality is and |
|
|
56:06 | same reason. So this is things different and it's going to result and |
|
|
56:16 | cache behaviors and performance. There's a a question in the chat. |
|
|
56:23 | Rodman is asking if transposing the matrix can help an I J K loop |
|
|
56:30 | . Yes. So if you transport then you pay the price for the |
|
|
56:37 | that then has a problem and you when and how you strike through the |
|
|
56:44 | and depending upon how you do it then it costs you extra memory to |
|
|
56:51 | you don't do it kind of in . So yeah, so this notion |
|
|
56:56 | that which happens practically that the compiler one rule for laying multi dimensionally raised |
|
|
57:08 | memory. It doesn't have different rules the different operations. So this set |
|
|
57:14 | slices assuming that everything is kind of out from probably in the same what |
|
|
57:21 | thanks the actual and what's good spatial temporal locality changes with this real war |
|
|
57:30 | major order. But we got kind the same effect is just um difference |
|
|
57:37 | which ones are good and which ones bad. You see the same behavior |
|
|
57:42 | in taller major order but it's a point just if one is transposed to |
|
|
57:47 | another than you can, it affects what's good in terms of spatial and |
|
|
57:56 | locality. Yeah. So right now that so now, so this is |
|
|
58:08 | working through this thing with scenario that just talked about what for this I |
|
|
58:16 | K case. What? Yeah, cache misses are for cash flow and |
|
|
58:24 | of B. Um there's a large scenario and then now if one does |
|
|
58:31 | little bit from a small cash scenario I can work it through and one |
|
|
58:36 | uh numbers are similar to what was case they made to expect the multiplication |
|
|
58:45 | and a quarter of the memory accesses mrs and they go back and look |
|
|
58:58 | the slides from the matrix vector multiplication because again this jK lu porter is |
|
|
59:04 | , yes, a collection of accepting and you should get a similar |
|
|
59:13 | No. Then what's being done on line is kind of work it out |
|
|
59:21 | or six rather 2 pointers. But three cases for which one is the |
|
|
59:28 | , nope. Okay. I know and one got these expressions on the |
|
|
59:34 | hand side. So fantastic that these and I'll try to so if you |
|
|
59:39 | remember it before looking at I think next slide but if you look at |
|
|
59:45 | in terms of the mystery show uh been in a most loop is one |
|
|
59:52 | the worst case .5 que in the opus somewhat better is about half the |
|
|
60:00 | of misses and J and in the look is the best content not to |
|
|
60:08 | the benefit from the cache line size the So if you know like common |
|
|
60:15 | double position These eight or single is is significantly lower number of cash in |
|
|
60:23 | system Either on the other two so kind of is the best uh in |
|
|
60:32 | of being the innermost, okay is worse but not the worst and I |
|
|
60:42 | be the worse to have in the . Now this is what No, |
|
|
60:51 | something I borrowed a slide from they way back. So it's an old |
|
|
60:56 | pension dream. But in principle the is similar for anyone born Processor Like |
|
|
61:05 | one. So what we have was j in the innermost loop. What's |
|
|
61:12 | best? I think. So that's if you look at there is kind |
|
|
61:23 | a purplish line that is very different the other colors that is towards the |
|
|
61:29 | on this slide and this is gonna shows the difference and then we had |
|
|
61:37 | intermediate ones was okay in the in blue and that's sort of the bluish |
|
|
61:46 | yellowish depending upon the order of the two loops. But there it is |
|
|
61:51 | , okay, you know what blue also okay in Mosul said it was |
|
|
61:55 | different numbers right? The factor of . So that shows again considerably more |
|
|
62:01 | system the best case and I and R. Loop was the worst |
|
|
62:08 | So I haven't in the animals to the worst case. So and they're |
|
|
62:15 | hard to see about. There's actually colors. There is kind of brownish |
|
|
62:19 | then the purpose that is not travel much and then we can see a |
|
|
62:28 | bit strange behavior at the beginning here . So anyone can. So first |
|
|
62:37 | guess the simple analysis that was done a good indication of What one should |
|
|
62:46 | terms of ordering the loops and now was for all major order and the |
|
|
62:53 | doesn't for column is your order. idea of the border will be |
|
|
62:59 | I just need to do the similar that figure out which the loophole you |
|
|
63:04 | be respectful cache misses. No anyone a comment from what the strange behavior |
|
|
63:12 | . Um close to the origin and plot. Well it's simply that for |
|
|
63:30 | small values of n the cash that have been a little large cash. |
|
|
63:36 | is not many nieces. So this for this particular case. This is |
|
|
63:40 | of a transition happens between uh the behaving has occurred small versus large chest |
|
|
63:50 | one can figure out this transition point the cash face and um The fact |
|
|
63:59 | there are three options, A, . And each one should be fit |
|
|
64:04 | the cash for the cash to behave a large trash. So one can |
|
|
64:13 | very easily figure out when this transition about to happen. And then martin |
|
|
64:19 | one's performance data and see if it sense what you observe. Um in |
|
|
64:26 | case. So a few more slides try to figure out how to make |
|
|
64:33 | really good. So this is against I said in terms of figuring out |
|
|
64:38 | transition and then this is just a of what they did way back in |
|
|
64:46 | of sensing loop corridor was assignment to Anyways now one more thing about this |
|
|
64:57 | response applies for now going back to we learned from the matrix vector multiplication |
|
|
65:04 | Was one should do entire version. Burleson did make better use of |
|
|
65:12 | So the previous thing was just um to make good use on the cash |
|
|
65:19 | that's still relevant. Um So that change but now there's the additional feature |
|
|
65:26 | trying to minimize the number of casualties partitioning looks into types. So um |
|
|
65:38 | this case the best no one do matrix matrix multiplication, czar blocks and |
|
|
65:44 | I wanna tre versus those box and order that is similar to the one |
|
|
65:54 | has talked about when they killed the size was one instead of Okay, |
|
|
66:01 | there's kind of no difference is just . Instead of thinking of element multiplication |
|
|
66:07 | in The Matrix Matrix multiplication one now of each gentlemen being replaced by a |
|
|
66:15 | subnet to exercise bi bi bi and one does that no man can go |
|
|
66:23 | and with the same analysis in terms number of um Los and needs to |
|
|
66:32 | and then actually discovers that if again blocks all size bi bi bi fits |
|
|
66:41 | the cash Then one now has and sorry it's mating intensity that is proportional |
|
|
66:53 | the so I wanted to mention on blocks I saw The larger the block |
|
|
66:58 | one gets the better and using the this will reduce the need for the |
|
|
67:07 | bandwidth proportional to be. So this what's being done in most schools today |
|
|
67:17 | um want and then figure out essentially here. You know that A B |
|
|
67:24 | C. Sort of the three operations of size B squared. A simple |
|
|
67:30 | and one chooses them. He's such three B squared is just than the |
|
|
67:35 | slots. And then we can play with the numbers and figures out what |
|
|
67:41 | . But that's when the way one effective use of caches and then when |
|
|
67:48 | I mentioned, one needs to repeat for each level of the cash. |
|
|
67:52 | basically one just blocking for everyone. blocking for level two cache and blocking |
|
|
67:57 | level three cache and then don't get things to work on. This is |
|
|
68:05 | to send them there. I can't prove anyone other theoretical interest. There |
|
|
68:10 | something called er red blue Pebble Game was done by it's the tongue and |
|
|
68:17 | of his students way back in the show that this is indeed the optimal |
|
|
68:24 | of doing business. And this concept this blocking applies to pretty much any |
|
|
68:33 | of competition that you do. It's . And so even though it's easier |
|
|
68:40 | understand probably than many other scenarios, just understanding the matrix competition because they |
|
|
68:48 | fairly simple but I know for oh one group or you want to |
|
|
68:55 | the 50s from the project and we apply this recently also too Uh 50 |
|
|
69:04 | and under's blocking or how those are done in order to get good performance |
|
|
69:11 | these and I'll try to talk about next time. Let's see what else |
|
|
69:16 | . And this is best for saying to do the same thing for all |
|
|
69:20 | different yeah, cash is I get performance. Uh this is a little |
|
|
69:30 | what happens in terms of optimization. customer diversity people sometimes like again, |
|
|
69:39 | that both, what happens when transition small class behavior, the effect of |
|
|
69:48 | and yes, they are software packages there. Uh one of the oldest |
|
|
69:58 | this case is known as atlas a for automatically to engineer. Underperform software |
|
|
70:07 | what it does is basically does an church search of luke orders and stylings |
|
|
70:17 | all cash levels in the processors. , um, it's a pretty extensive |
|
|
70:27 | that is being done, but on next slide here it shows how it's |
|
|
70:35 | uh competes with lender optimized libraries and is an outside. I haven't seen |
|
|
70:41 | new comparison but it tells you that blue is the Atlas version. This |
|
|
70:52 | search to try to find the best yeah, loop ordering and tiling and |
|
|
71:02 | it compares to um, one is , this vendor library for this particular |
|
|
71:08 | and then there's another old fortune But I guess I can compare the |
|
|
71:15 | optimization towards the then there's what can in different ways? Obviously this fairly |
|
|
71:23 | search procedure did indeed come up with optimal um tiling and reporter compared to |
|
|
71:35 | vendors and sometimes it actually even took the vendor library tiny did. Now |
|
|
71:40 | horizontal axis is different machines that were on this at this particular time. |
|
|
71:48 | I think that's pretty much what I one more comment. So this is |
|
|
71:58 | all right. Um taylor and co particular architecture even though it can be |
|
|
72:04 | wrist so it's not totally um I say I shouldn't dislike it totally. |
|
|
72:15 | that's a better way of saying But it is. And introducing a |
|
|
72:21 | amount of the tendency by the cannon bi parameter rise. So combining communities |
|
|
72:29 | they want to animate all of not like the atlas way by doing |
|
|
72:33 | that instead a writing code that is oblivious. That means that um they |
|
|
72:43 | trying to come up with these ways um writing codes for that. |
|
|
72:49 | the compiler would that do the right and then the code would end up |
|
|
72:55 | the optimal performance for close to it an extensive search through parameter space uh |
|
|
73:04 | out so that's so in doing so idea was to do made right maintenance |
|
|
73:13 | for instance as a recursive procedure. can certainly do. So you think |
|
|
73:20 | the matrix multiplication is as being it's multiplication of Medicis as Take two x |
|
|
73:30 | blocks and then each one of these x 2 blocks. Matrix multiplication is |
|
|
73:35 | reversible, Divided into smaller two x blocks and eventually at the bottom level |
|
|
73:43 | the rickerson. Then things would fit the Cache and things would be ran |
|
|
73:48 | one things would just happen without anyone about what's going on. It turned |
|
|
73:54 | in practice it didn't do so So that's why in the end most |
|
|
74:02 | libraries that are not using any cache approach to writing code, they do |
|
|
74:09 | parameter trist tiling and lou pre So this is so I'm here is |
|
|
74:19 | of uh that's what it said and I said that why it didn't |
|
|
74:26 | But there's also then so this is of what the bottom nine of this |
|
|
74:33 | the cash up get. In terms scientific oblivious. Uh It's a smart |
|
|
74:39 | codes. It didn't work so So the winner. And they always |
|
|
74:42 | up being the kind of interesting block versions. But there are some good |
|
|
74:49 | for both the cash a business and cash. Uh huh conscious coach. |
|
|
74:58 | at that point time is up I and I'll take some questions. Uh |
|
|
75:05 | Yeah. So I think all messages reporters are important and darling is important |
|
|
75:19 | to get good performance we need to about sometimes. When is lucky that |
|
|
75:25 | algorithms actually do the job automatically. yeah, most of the time unfortunately |
|
|
75:33 | though in principle it should in practice of the way the caches and instructions |
|
|
75:39 | works. Um it doesn't. So was stop sharing the screen at this |
|
|
76:00 | . We also mm stop the |
|