© Distribution of this video is restricted by its owner
00:03 | So today I will basically start leading towards programming on GPUs. First, I'll |


00:17 | talk about what's known as vectorization, which is something one needs to do for |


00:25 | most processors today. Um, but first, just a little bit of |


00:37 | a recap of where we are, er, of what we've done so far: |


00:43 | basically, the first lectures and assignments were about tools, and talks about |


00:51 | processor and memory architectures, to get the basis for the platforms and try to |


00:57 | understand their behaviors. We also discussed power and, in that case, measurement, and |


01:05 | then talked a little bit about the energy management that is now happening in |


01:11 | processors, at all levels. One demonstration was about memory, er, |


01:20 | using OpenMP as the basis for programming, and talked a little bit about |


01:28 | what the compilers attempt to do and how one can help the compilers to be |


01:33 | effective in doing the code transformations to get efficient codes. And today I'll move on, |


01:42 | more or less continuing in terms of, er, the source transformations, but |


01:49 | with the specific objective to use the vector instructions. And then maybe, if |


01:57 | time permits, I will start to talk a little bit about how to |


02:02 | use accelerators, and for this course particularly GPUs. And just |


02:13 | to remember: in order to get efficient code, one kind of needs to |


02:17 | understand all the different layers; one cannot just use the typical, um, |


02:25 | programming by abstractions and layers. All right, so today I'll first |


02:34 | explain what this SIMD, mhm, notion means, in case you are not familiar with it, |
|
|
02:43 | and then give examples of source-to-source transformations that, um, are illustrative of what |


02:51 | compilers try to do. And also, as I talked about the compilers and what |


02:56 | they try to do, how one can potentially write the code from the start to |


03:02 | reduce the need for program transformations, or for the compiler to be able to find out what |


03:09 | changes it can make to the code. And then, at some point, I'll move on |


03:17 | to talk about the programming of these units, in particular how they are |


03:23 | put together, as I said at the top of the slide. So SIMD |


03:30 | is also often called data parallelism, and I may use the terms interchangeably, and |


03:38 | they will start to make sense, if not now, then at the end of the |


03:41 | lecture. And, er, just for the record, more or less, here |


03:49 | are the notions that are being used, um, in the computer design community: the |


03:56 | venerable classifications of different types of architectures. Er, um, single instruction, |


04:03 | single data: your vanilla single-threaded, er, program model. Um, SIMD is |


04:11 | what we're talking about today, which is single instruction, multiple data; er, |


04:17 | that's where the notion of a vector comes in, that there's one instruction that applies to a |


04:22 | range of data. And then there's multiple instruction, single data, just to cover the |


04:29 | permutations; it's not really something in the world that is actually being used. |


04:36 | MIMD is what is used in, er, the OpenMP type of architectures |


04:44 | and OpenMP-based programming models. That means you have unique instructions for, maybe, |


04:52 | the different statements that you have in your code, or combinations of data, as opposed |


05:00 | to SIMD, where a single instruction operates on arrays of data. And the last one |


05:06 | is fairly new. So the first four notions were something that, er, |


05:12 | Flynn at Stanford came up with ages ago, and it has been sort of persistent over |


05:19 | decades of computer architectures. SPMD is something that started to happen when, |


05:26 | um, clusters started to emerge; it stands for simply single program, multiple |


05:32 | data. So it is the typical model for clusters, where you use the same |


05:39 | program on all the different nodes. Er, I'll talk more about that when it |


05:44 | comes to, ah, programming clusters. So here is the basic notion of |
|
|
05:53 | SIMD: uh, there is basically one instruction stream, one instruction unit in this particular picture, |


06:05 | and the same instruction is used by a range of functional units. And |


06:12 | sometimes they may have their own memory as well. So instructions in this |


06:19 | type of architecture, er, are in a sense broadcast to a number of execution units. |


06:27 | And this something first emerged in the context of image processing, where typically the |


06:34 | same operation may be applied, at least in some stages, to all the pixels |


06:40 | in an image. So it was a very efficient way of building, er, ar- |


06:45 | chitecting image processing systems. But it also has advantages compared to MIMD, because it leads |


06:54 | to simpler processing cores, if you like, and that's what's being used |


07:01 | in GPUs in particular, which basically use SIMD cores, and in vector |


07:08 | processors. It also, as a, uh, design, has a very positive |


07:18 | impact on the memory system, since you don't need to load instructions for |


07:26 | kind of each combination of data, so there is less pressure on the memory system whenever one |


07:33 | can use, uh, this mode of operation. And here is just the difference between |


07:42 | SIMD and SPMD, where again the code is replicated to the different processing cores or, |


07:50 | er, nodes in the cluster setting. So now, any questions on these general things, before |
|
|
08:01 | one starts to talk more about the transformations? [Student] Yes, yeah, I think |


08:08 | I asked this a couple of lectures ago; I think it was like three |


08:10 | or four. But just to be clear: single instruction, multiple data, and single program, |


08:15 | multiple data. Um, the "instruction" refers to, I guess, a really boiled- |


08:20 | down example would be, say, an add instruction to, uh, two |


08:26 | locations in memory, and then doing that same add instruction to different things. |


08:32 | So, uh, when you say "same instruction," are we talking about literally the, |


08:37 | uh, the same assembly instruction applied to different memory locations? Um, correct? |


08:44 | Or... OK. Ah. And for single program, I guess it would be, |


08:50 | sort of... you replicate the instruction stream, or not exactly, because you |


08:55 | could have conditionals or what have you. Um, if you're doing it to different |


08:59 | data sets, so SPMD... could that be viewed as, |


09:05 | um, sort of a multiple-instruction SIMD, if that makes sense? [Instructor] Uh |


09:10 | huh. All right. Um, I'd be a bit cautious saying yes, er, |


09:20 | because it's so different in terms of replicating the whole code, ah, as opposed |


09:32 | to using the same opcode for a range of memory addresses. So maybe with |


09:40 | the examples, at least, it will become clear how SIMD hardware works. Okay, it will be |


09:47 | clear that it is not downloading the same code into multiple processors, or, you know, |


09:57 | well, it could be the same code. And when we talk about MPI, |


10:04 | it is basically tying together complete applications, whole programs, as opposed to SIMD. |


10:17 | That is, as in the image processing example: applying the same operator to all |


10:25 | pixels in an image. [Student] Okay, yeah, I think that's a good answer. |


10:33 | [Instructor] So, so I'll, as you know, walk through a number of examples, and |
|
|
10:40 | hopefully the notions will soon be clear. So here is kind of the approach |


10:46 | that is being used to try to, well, in order to use, |


10:55 | uh, vectorize for the SIMD instruction sets. So in a way, I |


11:04 | think this has been on some slides coming up, but in a way, |


11:10 | vectorization means that you wrap a loop around each statement; that is the main idea. So that's as op- |


11:20 | posed to having a loop with a bunch of statements inside, and kind of trying to |


11:27 | figure out how to wrap the loop around each individual statement. And that will |


11:37 | hopefully become clear in what I'm talking about. So I'll illustrate some of these transformations on |


11:45 | the next several slides, uh, and how one can initially write the code in a way |


11:54 | that is compatible with what we need, or figure out how to change the code |


11:58 | in order to do that, to invoke the vector instructions. But first I'll talk a |


12:07 | little bit about things that make it tricky, that one needs to take care of. Typical codes are |


12:15 | written for, er, the scalar type of units, as opposed to vector units of vector processors. |


12:25 | And there are then different data dependences involved, illustrated on the next few |


12:31 | slides through a bunch of examples. So the concepts that I think are important |
|
|
12:37 | to be at least familiar with are, well, what these dependences mean. So, a true |


12:43 | dependence is, specifically, what it means is that one reads a variable after having written to |


12:49 | it: a read after write. And the anti-dependence is kind of the opposite: instead, you write |


12:56 | a variable after having read it, and I'll give examples of how these things |


13:02 | happen in codes. And then there's what's known as output dependence, where you make |


13:09 | several writes to the same memory location. So these are kind of the three |


13:15 | dependences one talks about, and when things are embedded in loops, then some |


13:21 | of these dependences may occur because of looping through the sequence of statements, |


13:30 | and then one has loop-carried dependences. So some of these |


13:35 | three kinds may then occur because of the loop iterations. |
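As a rough sketch of the three dependence kinds just defined (the statements and values below are illustrative, not the ones on the lecture slides), consider a short straight-line sequence:

```python
# Hypothetical 4-statement sequence showing all three dependence kinds.
def run(x, y, z, w):
    x = y + z   # S1: writes x
    w = x + 1   # S2: reads x after S1 wrote it -> read after write (true dependence) S1->S2
    y = 2 * z   # S3: writes y after S1 read it -> write after read (anti-dependence) S1->S3
    w = y - 1   # S4: writes w again after S2  -> write after write (output dependence) S2->S4
    return x, y, z, w
```

These dependences are exactly what constrains reordering: swapping S1 and S3, for instance, would change the value S1 reads for y.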
|
|
13:47 | So here are now some little examples to try to illustrate, er, these three concepts: the read after |


13:53 | write, and the write after read. So we'll see if you guys have... let's |


13:58 | see, maybe some volunteer can figure out the read-after-write dependences in this |


14:06 | very simple code. Where do we have such dependencies among these four statements in the |


14:15 | loop body? [Student] You've got one on line two, because X was written to on line one |


14:24 | and then you read it, so there would be a RAW from two on one. |


14:30 | [Instructor] Sorry, say that again. [Student] So on line two we have... right, you have a |


14:35 | read after write, because X was written to. So you would say that line two |


14:41 | has a read-after-write dependency on line one. So for the statements, what's on |


14:46 | the left side is the assigned variable? [Instructor] Yeah, right. Er, remember, |


14:52 | with these... okay, maybe I should use arrows instead of the math equals sign. |


14:59 | But these are assignments, and the sense is that the outcome of the evaluation |


15:03 | of the right-hand side is assigned to the left-hand side. So, um, |


15:14 | yeah. So in statement two, obviously, one has a, er, assignment: |


15:19 | X is written to after it's read. So, any other dependences of any kind |
|
|
15:29 | to point out? Or just, um... okay, and it seems like no one has |


15:39 | volunteered here, so I will: I guess you have a write of the variable |


15:51 | from 4 to 1, right? Since they're supposedly executed in this sequence, right? |


16:00 | So, um, but if you were to reorder them, yes, that would |


16:06 | have consequences. So there was also 1 to 3, right? We have |


16:16 | the read after write: you read things in statement three that were written to in one. |


16:22 | And then you also have the same from 1 to 4, because you read |


16:27 | Y in statement four that was written to in one, so that you |


16:34 | have a clear precedence order: you have to execute one before you can execute three |


16:40 | or four, in order to preserve the semantics of the sequence of statements. So then |


16:50 | we have... there's something written to after it was read, and there we have, |


16:56 | from 1 to 2: you read X in one, and then you modify the X |


17:01 | in two, so you cannot reorder and put two ahead of one, because then you |


17:08 | get a very different outcome. So I think we have... and then I think |


17:14 | there was another one from 3 to 4, right? In that four updates a variable |


17:21 | read in three. So you cannot put four ahead of three, or then you get different results. |


17:29 | And then we had 1 to 4, er, likewise. So this applies to, er, how |


17:35 | to reorder the statements. And so these, er, the dependences lock in certain constraints |


17:43 | on reordering of statements, and the reordering may be advantageous, again, for memory access- |


17:53 | es or register use or something. And I'll get to lots of examples, er, maybe |


17:59 | four or five slides into this lecture, where reordering can help the use of vector |
|
|
18:11 | instructions now. So because of this, one cannot easily wrap the loop around each statement, um, basically compute |


18:25 | all the iterations for statement one first and then all the iterations of two, etcetera, because that way, |


18:34 | if you want to do all statements, er, or every instance of statement one for |


18:43 | all the 100 iterations in the loop, then you will get very different results |


18:48 | compared to going through the four statements in each iteration of the loop. So it |


18:56 | would not kind of work to wrap a loop around each one of these individual statements |


19:03 | and get the same outcome. So in this case there was no loop-carried dependence, |


19:10 | because, uh, in a situation like this, um, the for loop has the |


19:17 | same index: I is used for every statement. So nothing carries over when you go |


19:22 | to the next iteration; all the dependences are, er, not from one iteration |


19:30 | to another; all the dependences are within the same iteration. So there is a |


19:38 | way, then, to try to help a little bit in terms of being able to do |


19:47 | the vectorization by using temporary variables, what's called variable renaming, er, introducing these, |


19:57 | um, temp variables. And if you use this set of temp variables, they have eliminated the |


20:02 | dependencies. Um, because now, if you were to, uh huh, say, execute the |


20:17 | second statement first, that should work without impacting the correctness. So let's see it. |
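A minimal sketch of the renaming idea just described (the statement bodies and the temp name t are made up for illustration): writing the second definition into a fresh temporary removes the anti-dependence, so the statements can be executed in either order.

```python
def original(x):
    y1 = x + 1   # S1 reads x
    x = 2 * x    # S2 writes x: anti-dependence S1 -> S2, order is locked
    y2 = x + 1   # S3 reads the new x: true dependence S2 -> S3
    return y1, x, y2

def renamed(x):
    # Rename the second definition of x to a temp; S2 no longer
    # overwrites the value S1 needs, so it can even run first.
    t = 2 * x    # was S2
    y1 = x + 1   # was S1; still sees the original x
    y2 = t + 1   # was S3; reads the renamed value
    return y1, t, y2
```

Both versions produce identical results, but only the renamed one leaves the compiler free to reorder (and thus vectorize) the statements.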
|
|
20:29 | I think I have it again, maybe on the next slide... nope. So this |


20:34 | shows a little bit how things can potentially reduce some of the dependencies. Mhm. |


20:45 | So here is now an example of, er, loop-carried dependence. And so, |


20:54 | um, in this case we'll see that the second statement writes to the variable X, |


21:06 | and then that particular instance, er, of X is being read in statement one in |


21:13 | the next iteration, when the index is incremented. So that is, this is what's |


21:22 | known as a loop-carried dependence: the value of X read in statement one |


21:35 | comes from the previous iteration. And the other, um... also the Y value is being updated |


21:44 | in each iteration, but it is only used in statement one and is not |


21:51 | used in the second statement. So now, if we wanted to, as discussed, try |


22:02 | to wrap the loop, the for loop, around each one of the statements, we would |


22:10 | not get the correct result if you first do all the instances of the first statement and |


22:18 | then all the instances of the second statement. So is there anything anyone can |
|
|
22:26 | see that one can do to, in fact, enable the use of SIMD instructions by, |


22:35 | er, wrapping loops around each one of the statements? And I tried to, for clarity, |


22:42 | draw kind of, like, pictures, er, tried to illustrate the access pattern in the |


22:48 | graph to the right. [Student] In this case, the first thing that comes to mind would |


22:55 | be to make a copy of the X array. I'm still thinking about it. |


23:09 | [Instructor] Okay. So if you copy the X array, then... so that means you keep |


23:22 | the original X array, is that what you have in mind, and then have an |


23:27 | updated X array? [Student] Yes. [Instructor] The second statement, um, doesn't interact with statement |


23:36 | one, but the first statement needs the new X values. So having the old |


23:48 | X values around doesn't help. So for I equals one, right, let's see: with, er, |


24:00 | X zero as a value we already have at the start, right. So in the first iteration, |


24:08 | you compute X one in the second statement. For the next iteration, I |


24:14 | equals two, you need X one, which was just computed in the previous iteration by |


24:21 | the second statement. So there, in fact, building on what was said, |


24:34 | what you can do is, and this is correctness-preserving because there's nothing to prevent |


24:40 | you from computing all the X values first, uh huh: what was originally statement |


24:48 | one only uses the new values, and the new values of X were computed based on |


24:57 | independent arrays whose values, er, were not updated. So in this |


25:05 | case, now, there's nothing to prevent you from computing all the instances of |


25:12 | X and then computing all the instances of, er, updating the Y values using those new |


25:22 | X values. So in this case, this is kind of what can be done |


25:31 | in this case by doing statement reordering, because of the way these dependences work. |


25:37 | So now this particular code, after doing reordering of statements, can be vector- |
|
|
25:46 | ized. So now you can... this is again the notion of this wrapping |


25:53 | thing: you have, in this case, one statement and you do all the iterations. |


26:00 | Um, so in this case, the only code you need is the opcode, |


26:06 | and then addressing in this particular case is simple: you start with the Y address, |


26:12 | the Z address and the X address, and in this case you just increment |


26:17 | those three addresses by one for each iteration of the loop. But you don't load |


26:21 | in the opcode or even, er, explicitly compute new memory addresses, except |


26:33 | through incrementing by one, which is a very cheap operation compared to, er, |


26:37 | sending a totally new address to the memory. So in this particular case, |


26:46 | because it's increment by one, one saves a lot in terms of not needing to, er, fetch |


26:53 | new opcodes, and in this case not even doing more than basically adding one |


27:02 | to each one of the array addresses for fetching or writing data. |
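A small sketch of the reorder-then-distribute idea from this example, with made-up statement bodies in the spirit of the slide: the second statement rebuilds x[i] from an array the loop never modifies, and the first statement consumes the x value produced in the previous iteration. Computing all x values first preserves the result and leaves two single-statement loops, each vectorizable.

```python
def fused(x, y, z, n):
    # Original shape: S1 reads the x the previous iteration's S2 wrote.
    for i in range(1, n):
        y[i] = y[i] + x[i - 1]   # S1: loop-carried use of x
        x[i] = z[i] * 2          # S2: built only from unmodified z
    return x, y

def distributed(x, y, z, n):
    for i in range(1, n):        # all of S2 first: vectorizable
        x[i] = z[i] * 2
    for i in range(1, n):        # then all of S1: vectorizable
        y[i] = y[i] + x[i - 1]
    return x, y
```

After distribution, each loop body is one statement, so each maps directly onto a vector instruction with the simple increment-by-one addressing described above.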
|
|
27:15 | So, er, this array notation is very common as a way of illustrating the notion; I use |


27:24 | it for, then, trying to make it clear that these things can be vector- |


27:28 | ized. And, um, those of you that might be familiar with either MATLAB or some |


27:35 | other kind of array processing, er, programming languages: they tend to use this kind of |


27:42 | triplet notation with a starting address, an end address and potentially a stride. And if |


27:48 | you omit the stride, the assumption is that the stride is one. So in all my examples you |


27:55 | will just see strides of one. And whether it's MATLAB or whatever, I use |


28:00 | this triplet-type notation to try to point out what the vector instructions kind of do. |


28:11 | So, any questions on this slide? Mhm. |
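To make the triplet notation concrete, here is a sketch (function and parameter names are illustrative) of what a vector statement like `a(first:last:stride) = b(...) + c(...)` stands for: a single loop over the selected elements.

```python
def triplet_assign(a, b, c, first, last, stride=1):
    # a(first:last:stride) = b(first:last:stride) + c(first:last:stride)
    # Inclusive upper bound, as in array languages; stride defaults to 1
    # when omitted, matching the convention described in the lecture.
    for i in range(first, last + 1, stride):
        a[i] = b[i] + c[i]
    return a
```

The whole triplet statement corresponds to one vector operation; the stride of one is what makes the address generation a simple increment.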
|
|
28:20 | So again, this vectorization: compilers these days, if the code is sufficiently simple, then they tend |


28:28 | to be, er, reasonably successful in doing this form of reordering and then |


28:36 | using vector instructions. But it is, for me, good programming practice to |


28:43 | kind of be aware of all these things and not leave it to the program analysis to figure it |


28:49 | out for you. All right, so here is another example. And now we have |
|
|
29:02 | again a loop-carried dependence, right, from 1 to 3, except now it's kind of |


29:12 | two iterations away, er, in terms of the loop-carried dependence. And then we |


29:18 | also have read after write from 1 to 3. Um, I guess... any other |


29:32 | dependences anyone observes, so to speak, or notices otherwise about this code, in terms |


29:52 | of wrapping the for loop around each one of the statements? [Student] You know, there |


30:03 | is a true dependence here, right, er, from line one: the previous value... |


30:11 | there is a, uh... from 1 to 3 there is a reuse of |


30:17 | Z of, er, Z(I). But Y of, er, Y(I) has been... the history of, |


30:24 | er, Y(I) is used... right? [Instructor] Y is never modified; it's just read, |


30:32 | right? Um, what? I didn't hear you. That's okay. So when it |


30:42 | comes to Y, it's never modified in this code; it is just being accessed or read; it |


30:51 | is never updated. [Student] You can reorder two and one? [Instructor] Correct. So that... |


31:02 | we have, er, the read after write that you're, er, seeing is between one and three, right? |


31:09 | [Student] Yeah. So there are dependencies between one and three, but two is the |


31:17 | simple one. [Instructor] So that's correct. And this is, well, I think what we talked |


31:26 | about. So in this case, I guess what I did was put the, |
|
|
31:32 | er, the second statement... it is opposite to what you suggested first, but it |


31:38 | doesn't matter. But the next thing one can do, then, because the second statement is |


31:45 | independent of the other two statements (again, er, U is not used in |


31:54 | one or three, and Y is never updated): what one can do is |


32:01 | basically pull the second statement out, and apart, that particular statement can be vectorized in |


32:09 | its own right. Then one still has to figure out what to do with the other |


32:15 | two statements, er, now, er, the one and three statements, that have this chain of |


32:20 | dependencies, in terms of three depending on the outcome of one for Z, and one |


32:28 | depending on the outcome of X two, er, two iterations earlier, in terms of, er, |


32:37 | this thing. So this part can be vectorized, and that can be a |


32:44 | gain; it's not the full gain one would hope for, and then one has to deal with the rest. |


32:49 | And it turns out that this particular loop, er, the first loop here |


32:58 | with the one and three statements: the expressions on the right-hand sides happen to be |


33:06 | related in these two statements. It can, in fact, be reorganized, but it's |


33:13 | a, er, quite complex reordering that is required, and I'm not going to |


33:23 | go into that in this kind of simple way. But there are ways, er, |


33:28 | for this particular loop, to deal with the common case. Uh, so it's not just the |


33:36 | simple way of dealing with the loop to be able to do some level of vectorization, |


33:43 | but one can basically try to restructure it, uh, as a tree, in a |


33:52 | way, and then, er, still get the correct result; er, modulo that the arithmetic operations need to |


34:01 | be associative, then it can indeed be done. But as I mentioned, it's complex, |


34:08 | er, and beyond the scope here. Um, so, all right. |
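A sketch of the "pull the independent statement out" transformation just described, with made-up statement bodies: statement S2 touches neither the recurrence variables nor anything the other statements write, so it gets its own vectorizable loop, while the S1/S3 recurrence stays serial.

```python
def fused(x, u, y, z, n):
    for i in range(2, n):
        z[i] = x[i - 1] + y[i]   # S1 (illustrative shape)
        u[i] = y[i] * 3          # S2: independent of S1/S3
        x[i] = x[i - 2] + z[i]   # S3: recurrence, two iterations back
    return x, u, z

def distributed(x, u, y, z, n):
    for i in range(2, n):        # S2 alone: vectorizable
        u[i] = y[i] * 3
    for i in range(2, n):        # S1/S3 recurrence: left serial
        z[i] = x[i - 1] + y[i]
        x[i] = x[i - 2] + z[i]
    return x, u, z
```

This captures the partial gain mentioned above: one of the three statements vectorizes cleanly, while the recurrence would need the (more complex, associativity-dependent) tree restructuring to go further.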
|
|
34:19 | So this is a different example; er, a different way of trying to figure out |


34:27 | how to be able to again wrap the loop around each, or at least some, of |


34:33 | the statements. So, anyone see what the problem is in terms of trying to vectorize |


34:42 | this particular construct or set of statements? [Student] Uh, well, X is read after |


34:54 | being written, in both lines two and three. [Instructor] Yes, that's one. Er, |


35:04 | but we also use it... er, you are writing it on line three. All |


35:13 | right. Um, yes. Yeah, read after write from 2 to |


35:20 | 3. Um, and then you have the read after write, er, |


35:31 | from 1 to 2, with respect to X. Um, there is |


35:40 | no loop-carried dependence in this case, right. Um, but |


35:51 | there is another problem. So, er, in fact, can this be vector- |


36:06 | ized? Well, that requires something. Um... we don't use the value of |


36:19 | X in the first statement; we just write to it. So if we, |


36:23 | um... no, hold on, er... that wouldn't preserve the correctness of |


36:27 | the program, I mean. So, in fact, if one were to compute all the |


36:50 | instances of the first statement and then keep those available, then the correctness of the program |


37:10 | will be preserved. So that means... so I'll try to... here's what I |
|
|
37:18 | talked about. So the thing is, the thing that prevents these things from being, |


37:24 | er, vectorizable is that X is just a single memory location, a scalar, |


37:30 | as it's known. So if one were to promote it to an array, |


37:41 | then everything could be vectorized. So this is what's known as, um, promoting |


37:49 | a scalar to a vector. So this is how one would do it; it's like this thing: |


37:59 | um, we call it, er, TX in terms of the writes now. So you |


38:03 | keep all the instances of the outcome of the first statement; then, um, if you |


38:16 | then use the proper instance in statements two and three, things will be perfectly fine. |


38:26 | So this is now what can be vectorized, at the expense of promoting a scalar, |


38:35 | er, for doing the writes. So this is, um, one thing again that |


38:46 | compilers tend to be pretty good at; er, they, er, figure out, er, |


38:51 | what the dependences are, and see that if you make an array out |


38:56 | of X, instead of keeping it as a scalar, then things can indeed be vectorized. |


39:06 | But as a programmer, one can pay attention and then decide to do this, |


39:11 | er, or one can do it from the start. All right. |
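A minimal sketch of scalar expansion (the statement bodies and the name tx are illustrative): the scalar x is written and read in every iteration, so as one memory location it blocks statement-by-statement vectorization; promoting it to an array gives each iteration its own instance.

```python
def scalar_version(a, b, c, n):
    for i in range(n):
        x = a[i] + b[i]     # S1: scalar defined each iteration
        c[i] = x * x        # S2: scalar used; x serializes the loop body
    return c

def expanded_version(a, b, c, n):
    tx = [0] * n            # x promoted to an array ("tx"), one slot per iteration
    for i in range(n):      # all of S1: vectorizable
        tx[i] = a[i] + b[i]
    for i in range(n):      # all of S2: vectorizable
        c[i] = tx[i] * tx[i]
    return c
```

The cost is the extra array storage for the writes, which is exactly the trade-off mentioned above.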
|
|
39:20 | So here is another type of, er, dependence now, an anti-dependence, so it's a write after read |


39:32 | that happens due to the loop. Um, so, that's right. So the |


39:48 | problem is this: X(I plus one) is read in statement two, um, and then in |


40:02 | the next iteration, X(I plus one) is written to in statement one. So, |


40:14 | um, so in, er, statement one you basically use one old X value, |


40:31 | the X(I plus one), and one new value, the X, um, (I). So, any |


40:46 | suggestion for how you fix this one? So, okay, so again, kind of... |


41:05 | I gave it away, right? Statement two uses the old value, and, |


41:11 | er, in order to preserve the correct results, you need to have the old |


41:17 | one around, in order to make sure to get the correct results. So in this case, |


41:23 | by introducing a temp variable, or, er, an array, that keeps the old values, |


41:32 | then you can preserve the correctness, and then you can vectorize the whole |


41:39 | thing: you can first do a copy of X into the temporary, and then |


41:44 | you adjust, er, the second statement, to read the correct old values. |


41:55 | So this is the temporary. |
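A sketch of the old-value temporary just described (statement bodies are illustrative): statement two reads the old x[i+1], which statement one overwrites in the next iteration; copying x first lets each statement get its own vectorizable loop.

```python
def original(x, y, n):
    for i in range(n - 1):
        x[i] = y[i] + 1          # S1 writes x[i]
        y[i] = x[i + 1] * 2      # S2 reads the OLD x[i+1], which S1
                                 # overwrites in the next iteration
    return x, y

def with_old_copy(x, y, n):
    old_x = list(x)              # keep the old values around
    for i in range(n - 1):       # all of S1: vectorizable
        x[i] = y[i] + 1
    for i in range(n - 1):       # all of S2, reading the saved copy
        y[i] = old_x[i + 1] * 2
    return x, y
```

The copy itself is also a vector operation, so every loop in the rewritten version maps onto SIMD instructions.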
|
|
42:02 | Er, so here is a kind of different type of, er, dependence that happens. So in this case the Z variable is only |


42:19 | read, but X is updated in the second statement and sort of read in |


42:32 | the first statement. But it is, er, quite a bit more complex of a loop- |


42:41 | carried dependence than in the previous cases. So... anyone care to comment on this one? |


42:55 | [Student] Yeah. Split the array from, er... to 100, because you're not gonna... |


43:01 | you're not gonna cross X until you get to the middle, right? [Instructor] Very good. |


43:05 | Yes, exactly. So you split the loop into halves, basically. So |


43:11 | in this case, you can split the index, and then you get the two |


43:16 | halves of the loop, and then one can vectorize each part. So |


43:22 | this is what happens. So that's the kind of dependence that happens, and how |


43:29 | one deals with these types of dependences. |
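A hypothetical loop in the spirit of this slide (the exact bounds on the slide are not reproduced here): each a[i] reads a[i + h], h elements ahead, so no dependence is crossed until the midpoint. Splitting the index there yields two halves whose iterations are each independent.

```python
def unsplit(a, b):
    n = len(b)
    h = n // 2
    for i in range(n):           # anti-dependence of distance h across the midpoint
        a[i] = a[i + h] + b[i]   # requires len(a) >= n + h
    return a

def split(a, b):
    n = len(b)
    h = n // 2
    for i in range(0, h):        # first half: writes a[0:h], reads a[h:2h] -> independent
        a[i] = a[i + h] + b[i]
    for i in range(h, n):        # second half: writes a[h:n], reads beyond -> independent
        a[i] = a[i + h] + b[i]
    return a
```

Keeping the first half before the second preserves the write-after-read ordering; within each half there is no dependence left, so each half vectorizes.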
|
|
43:39 | All right, so this is what they tend to call loop splitting. So let's see what happens in this case. |


43:47 | So this is, um, just laying out the outcome. So this is |


43:53 | kind of half analogous... I used to talk about compiler optimizations trying |


43:59 | to take things out of the loop that do not need to be |


44:07 | in the loop. So in this case, the parts of the right-hand sides |


44:17 | in these two statements can in themselves be vectorized; then, yeah, one must |


44:25 | figure out how to deal with the rest, which has, er, the dependences, |


44:30 | er, the loop-carried ones, I think for both X and Y in this case. |


44:37 | But at least in this case, the idea is: instead of vectorizing |


44:43 | the whole block of the statements, er, in the for loop that one had, |


44:48 | then, all right, you take out some of the operations and use |


44:55 | vector instructions for those, and then try to figure out how to deal with the remaining |


44:58 | part. Um, this is just, er... looping has overhead associated with |


45:10 | it, because you need to test loop bounds, etcetera. So just reducing it to |


45:17 | a single loop can be beneficial for performance. So, er, this is yet another thing, |
|
|
45:29 | known as loop peeling, and it may be worth spending a little bit of time |


45:36 | figuring out how this construct works and how you can help, um, getting the |


45:45 | code a little bit, er, potentially vectorizable. So, er, peeling... one can look |


46:03 | at this. So what it does is to, er, in some ways take advantage of |


46:14 | what one can infer from the conditions; in this code, the two if statements. So |


46:29 | it's two nested loops, right? And, with some luck, at least one can |


46:35 | potentially do something about vectorizing one level of the two loops. So in this case, |


46:49 | er, the best prospect is the inner loop, that is, on the index I; |


46:56 | but in this case, the conditions depend on the other loop index. So, |


47:04 | basically, er, the first iteration and the last iteration of the other loop are |


47:09 | kind of special, because, er, the J equals one case is special and the J equals last case is |


47:16 | special. But all the other J values are treated in the same way. So |


47:24 | what kind of loop peeling does is, er, recognize that fact and restructure the |


47:30 | code, er, into three parts: the J equals one case, the J equals last |


47:37 | case, and then the rest of the J range. So now one can see, at |


47:46 | least for the J equals one case, er, it just has one loop level, the |


47:54 | I loop, and it has some dependence on X. Uh, but, yeah, there's no |


48:11 | loop-carried dependence, so one should be able to do something with that, as well |


48:16 | as the other two loops. So then, on the next slide, I'm trying to |


48:22 | figure out how to now do the vectorization, given the dependencies. And doing all of |


48:29 | the J equals one case, doing all the iterations of the first statement, is kind of what |


48:37 | one wants, and vectorize. Er, likewise for the J equals last case. |


48:44 | And there's some vectorization also possible for all of them, I guess, er, the |


48:58 | remainder of the range, except the case that has some tricky dependences as |


49:03 | part of the statements. But this is just trying to show, again, that, er, |


49:10 | sometimes the original code here, er, on the left, is perhaps a natural way of writing |


49:19 | down the functions you want implemented. But with some awareness of, er, vectorizability, you |


49:30 | may want, you might want, to consider writing something more like the right-hand side, |


49:36 | for compilers to have an easier time trying to figure out how to vectorize things. |
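A sketch of loop peeling (conditions and loop bodies here are illustrative, not the ones on the slides): the ifs only distinguish the first and last J iterations, so peeling those off leaves a uniform, condition-free middle nest.

```python
def original(a, m, n):
    for j in range(m):
        for i in range(n):
            if j == 0:                       # first-iteration special case
                a[j][i] = i
            elif j == m - 1:                 # last-iteration special case
                a[j][i] = -i
            else:                            # the common case
                a[j][i] = a[j - 1][i] + 1
    return a

def peeled(a, m, n):
    for i in range(n):                       # peeled j == 0: one loop level, vectorizable
        a[0][i] = i
    for j in range(1, m - 1):                # uniform middle: inner loop has no ifs
        for i in range(n):
            a[j][i] = a[j - 1][i] + 1
    for i in range(n):                       # peeled j == m - 1
        a[m - 1][i] = -i
    return a
```

The peeled form exposes three if-free loops, which is exactly the shape a vectorizing compiler has an easier time with.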
|
|
49:43 | So I think this is the last element of the vectorization that I wanted to bring up. |


49:48 | So vectorizing codes with conditionals, like the ones used in the last example, is, |


49:57 | er, tricky business. And again, recall the kind of early picture in this lecture |


50:03 | about the SIMD, what I showed: there was one instruction and multiple execution units |


50:09 | that supposedly used the same instruction for all data. But when you have conditionals, |


50:20 | that means you get data dependences in terms of executing that instruction. So the |


50:28 | way this has been dealt with, er, or is dealt with, traditionally, is to |


50:33 | use what's known as masked mode, er, execution. Another version is compressed mode, and |


50:40 | then I'll give examples of what these modes mean. As it says, masked mode is |


50:47 | just trying to pretend there is no conditional, but then rectifying what actually happened. And |
|
|
50:56 | here is a kind of ridiculous, cartoonish example, er, of the details of doing this. The |


51:06 | statement at the top of the slide: you divide one array by another array, element |


51:14 | by element. And, of course, one doesn't like to divide by zero. |


51:20 | So, um, we only want to do this statement as long as, er, |


51:27 | the array element of B is not zero. And then, in order to illustrate |


51:32 | the two, masked and compressed mode, I assume some, er, very cartoonish type of architecture |


51:40 | in terms of the overheads, er, latencies associated with the instructions. And so the idea is, |


51:48 | again, in the vector instructions, that you have a kind of setup and then do the operations. |


51:56 | And I talked, I think, earlier, last lecture, about instruction latencies for adds |


52:03 | and multiplies and so on in terms of functional units. And the idea is, er, |


52:09 | that when you have the same operation, then you pay the overhead, er, the latency, |
|
|
52:15 | just once. And then there's only one cycle per additional add or multiply you |


52:23 | need to do. So this is the notion, and then you have a vector |


52:29 | length in terms of vectorizing code; er, that means you load the vector into a |


52:40 | register file. And normally an array is much longer than what you can fit |


52:46 | in the register file. So then the array gets carved up into chunks that |


52:53 | you can load into the register file at a time. And, er, I'll show you |


53:02 | a little outcome of evaluating these modes in a couple of slides. But, er, |


53:10 | the notion here is that you pay the overhead, er, the latency associated, |


53:17 | so to say, with the functional unit for the type of instruction you do, and then the cost is one |


53:24 | for each element in the vector that fits into the register file. And that is the |


53:31 | L that I have in the cycle counts here. And then, so here, when you |


53:40 | use the masked mode, then you first, er, need to load the two vectors. And in |
|
|
53:49 | this case, I assumed that was load off. Each one of the |
|
|
53:54 | Victor's off the size that fits into register, finally, is an overhead |
|
|
53:59 | 12 and then whatever that factor length such it as the number of cycles |
|
|
54:05 | takes to do these two notes. then you have to prepare to |
|
|
54:12 | whether with value off the zero or , and that is this selling of |
|
|
54:20 | variable. And then there is what's as an extra mask register, where |
|
|
54:24 | want test all the elements off the in for B in this case against |
|
|
54:37 | value you happen to have load in case zero. And so basically, |
|
|
54:42 | have to test it for each one the elements on that case, assume |
|
|
54:46 | parallels in the testing. So you through it element by element, so |
|
|
54:50 | taste to me really account for the This mask register and then you go |
|
|
55:04 | dso you said that to zero or , depending upon whether you should do |
|
|
55:08 | instruction or not. But then, fact, you kind of blast ahead |
|
|
55:13 | you still do it for every And then you try to make sure |
|
|
55:18 | if you divide by zero, that parts of the software does not through |
|
|
55:23 | exception and blow things up in your . But given that you know that |
|
|
55:30 | Bible zero should be ignored, things still work under this mask model off |
|
|
55:36 | with conditional and then you only store results for which you should keep the |
|
|
55:46 | . But you kind of go ahead store. So basically, you don't |
|
|
55:51 | the results for which you should not done the execution. Um, so |
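A scalar emulation of the masked-mode semantics may make this concrete. This is pure Python standing in for vector hardware; the arrays a and b follow the example, and everything else (the function name, the lane loop) is illustrative only:

```python
def masked_divide(a, b):
    """Emulate masked vector division: a[i] / b[i] wherever b[i] != 0.

    The mask register is modeled as a list of booleans. Every lane is
    visited, but lanes whose mask bit is clear keep their old value; the
    divide is simply not performed for them here, emulating the
    hardware's suppression of the divide-by-zero exception.
    """
    mask = [bi != 0 for bi in b]          # test every element of b against zero
    result = []
    for ai, bi, m in zip(a, b, mask):
        # blast ahead over every lane, but only keep masked-in results
        result.append(ai / bi if m else ai)
    return result

print(masked_divide([6.0, 4.0, 9.0], [2.0, 0.0, 3.0]))  # [3.0, 4.0, 3.0]
```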
|
|
55:59 | That is kind of the model for how masked mode works. Now we'll talk
|
|
56:06 | about the compressed mode and then show the outcome, and then I'll be happy to
|
|
56:10 | take questions on the two versions. The compressed mode is trying to be
|
|
56:21 | a little bit more efficient, potentially, by essentially only carrying out operations where
|
|
56:32 | the operations are desired.
|
|
56:43 | In essence, what you do in this case is that, based on the values of B, in this
|
|
56:54 | case, you know which elements of B you should load, and you could
|
|
57:05 | then do the same for A. Now, on this slide I see
|
|
57:15 | there is a mistake. The first instruction should state load B, not load
|
|
57:18 | A; the left-hand side is correct: load B into V1, not load A into V1.
|
|
57:24 | So first you load B, and you find out which elements
|
|
57:30 | of B are nonzero, and thus only a fraction f of the elements
|
|
57:38 | gets loaded into the registers. Then what do you do? You do a kind
|
|
57:47 | of count of how many elements of B are nonzero, and then, you know,
|
|
57:56 | from the positions within the vector, you know which elements are nonzero or
|
|
58:06 | not. And then you construct a new vector of addresses pointing only to the elements of
|
|
58:13 | B that are nonzero. So then you use what's known as an
|
|
58:20 | indexed load, based on this address vector of the nonzero values of B, and
|
|
58:30 | then you load the corresponding values of A. So now, eventually,
|
|
58:43 | to make very explicit what you're doing here: you go and load only
|
|
58:47 | the elements of A on which you will operate; that's the fraction f of the
|
|
58:54 | initial vector length; and the corresponding elements of B. And
|
|
59:00 | now you only divide those elements of A and B that passed the test,
|
|
59:07 | operating on the elements in this compressed collection of data. Then you only
|
|
59:15 | store back those elements of A that were updated, based on A divided by B.
|
|
59:25 | So you do have to load the entire vector B to figure out which
|
|
59:30 | ones are zero. But then you do not load as many elements of A,
|
|
59:37 | and you don't store as many elements of the updated vector, and you don't do
|
|
59:43 | any divides by zero, with no need for other guards to make sure things don't go wrong.
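The compressed-mode recipe (load B in full, build an index vector of the nonzero positions, gather, divide, scatter) can be emulated the same way. Again, a pure-Python sketch of the semantics, not real vector code; the names are illustrative:

```python
def compressed_divide(a, b):
    """Emulate compressed-mode conditional division.

    Load b in full, build an index vector of the positions where b is
    nonzero, gather only those elements of a, divide, and scatter the
    quotients back. Elements of a at zero positions are never touched.
    """
    idx = [i for i, bi in enumerate(b) if bi != 0]   # positions to operate on
    gathered_a = [a[i] for i in idx]                 # gather (indexed load)
    gathered_b = [b[i] for i in idx]
    quotients = [x / y for x, y in zip(gathered_a, gathered_b)]
    result = list(a)
    for i, q in zip(idx, quotients):                 # scatter (indexed store)
        result[i] = q
    return result
```

On the same inputs as before, this produces the same answer as the masked version, but touches only the fraction f of the elements of A.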
|
|
59:49 | So if I do this little exercise, with the particular assumptions I made in
|
|
59:53 | terms of overheads and cycle counts, etcetera, I get the new cost expressions, and
|
|
60:01 | on the next slide I just plug in particular values and try to figure out
|
|
60:08 | when one or the other approach gives you the better performance. So this is,
|
|
60:20 | now, using elements per cycle; that means, in this case, it
|
|
60:30 | takes a few cycles to do each one of the iterations of the
|
|
60:40 | loop, because of the conditional and the dealing with it in masked mode and
|
|
60:47 | compressed mode. So in this plot, the two colors are, um,
|
|
60:56 | the masked mode and the compressed-index mode. A higher value means something is more
|
|
61:02 | favorable for performance, because you get relatively more use out of every cycle
|
|
61:10 | that you spend. So in this one can see, you know,
|
|
61:18 | where the fraction of iterations for which the condition is true is low,
|
|
61:26 | in this case, that is to say, when
|
|
61:35 | you need to carry out the statement for relatively few elements of the
|
|
61:46 | vector length, then the compressed mode wins. Whereas the extra exercises to
|
|
61:55 | figure out which elements to load and store are sufficiently costly that, so,
|
|
62:04 | um, when the number of iterations you shouldn't carry out becomes relatively small,
|
|
62:12 | then using the masked mode is beneficial. The crossover between these things, uh,
|
|
62:18 | depends upon what the relative times or overheads are. And so I used this
|
|
62:29 | line here, in terms of the load and store overheads: the load and
|
|
62:34 | store overhead of 12 cycles is very low; that kind of means, in practice,
|
|
62:40 | that these things are in caches. If you have to go
|
|
62:45 | to memory, those numbers are much higher, pretty much an order of
|
|
62:50 | magnitude higher. So that shifts the balance quite a bit and pushes it
|
|
62:57 | much further, so that, um, the compressed mode is only effective when most of
|
|
63:08 | the iterations of the loop should not be carried out.
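A toy cost model along these lines shows the crossover. All constants here are assumptions made for illustration: the 12-cycle load/store overhead echoes the slide, but the 4-cycle per-element gather/scatter cost is an extra assumption that makes indexed accesses more expensive than unit-stride ones:

```python
def masked_cycles(n, overhead=12):
    """Masked mode: two full loads, a full-width test, and a divide and
    store for every element, masked or not."""
    return 2 * (overhead + n) + n + n + (overhead + n)

def compressed_cycles(n, f, overhead=12, gather_cost=4):
    """Compressed mode: full load of b plus index construction over all
    n elements, then gather, divide, and scatter over only k = f*n."""
    k = int(f * n)
    return (overhead + n) + n + 2 * (overhead + gather_cost * k) + k

# mostly-false condition: the compressed mode wins
assert compressed_cycles(64, 0.1) < masked_cycles(64)
# mostly-true condition: the masked mode wins
assert masked_cycles(64) < compressed_cycles(64, 0.9)
```

With these particular constants the crossover sits near f = 1/3; real hardware and memory behavior will move it around, as noted above.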
|
|
63:17 | Let me show this slide in terms of what's known as gather instructions or scatter
|
|
63:24 | instructions. I'll quickly go through this one and then I'll stop and take questions.
|
|
63:29 | Okay, so this tries to illustrate simply what this indirect addressing is doing.
|
|
63:37 | The gather is: one uses another array to point to the elements that one
|
|
63:43 | wants to retrieve. So in this case, the index array is an array of addresses pointing
|
|
63:52 | to the elements of X that you want; then you retrieve those elements and store
|
|
63:58 | them in the array Y. And in the example I just talked about, in terms
|
|
64:04 | of compressed and masked mode, that means Y happens to be, in this case,
|
|
64:08 | in the register file. And the scatter is the opposite. That is, again, you
|
|
64:14 | have it, um, the indirect addressing, but now it is used to point to where
|
|
64:24 | you want the values of X to be stored in the result array Y. So in this
|
|
64:31 | case, you have the indirect store, right: you have the vector of
|
|
64:42 | addresses, and Y is the collection target. So these are known
|
|
64:48 | as gather/scatter type instructions, and, um, those are things some computers,
|
|
65:01 | in fact, some did and some didn't, have hardware support for: this kind
|
|
65:10 | of indirect addressing, trying to make it as efficient as possible.
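The two primitives can be stated very compactly. A pure-Python sketch of the semantics (real hardware does each of these as a single indexed-load or indexed-store vector instruction):

```python
def gather(x, idx):
    """y[i] = x[idx[i]]: an indexed load through a vector of addresses."""
    return [x[i] for i in idx]

def scatter(y, idx, x):
    """y[idx[i]] = x[i]: an indexed store through a vector of addresses."""
    out = list(y)
    for i, xi in zip(idx, x):
        out[i] = xi
    return out

gather([10, 20, 30, 40], [3, 0, 2])   # -> [40, 10, 30]
scatter([0, 0, 0, 0], [3, 0], [7, 8]) # -> [8, 0, 0, 7]
```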
|
|
65:19 | So I'll stop and ask for questions about the vectorization, or dealing with conditionals,
|
|
65:27 | or these gather/scatter type instructions. The gather/scatter is kind of expensive,
|
|
65:41 | and anyone familiar with sparse matrix computations, where things are not nicely organized as sequential
|
|
65:50 | arrays, normally makes extensive use of gather/scatter. And it also means, compared to, for instance,
|
|
66:11 | the earlier assignments, like the matrix multiplication assignment, that
|
|
66:17 | when you need to sort of gather and scatter, um, the memory system performance
|
|
66:24 | matters even more, and it becomes visible. No? Okay.
|
|
66:39 | So, with a few minutes left, let me talk about some high-level things about this notion of
|
|
66:46 | heterogeneous nodes, and then next time I'll talk more about explicitly dealing with programming these SIMD-
|
|
66:57 | type architectures. So maybe this slide is where I have to wrap up again. So
|
|
67:08 | the notion is SIMD: the one so-called instruction stream, and multiple, um, address
|
|
67:17 | streams being used for data. So there is but one instruction stream, and it is,
|
|
67:25 | quote unquote, streaming, um, multiple data sets through that instruction, if you like.
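In scalar emulation, the SIMD notion is just one operation swept across whole data streams; a minimal illustrative sketch:

```python
import operator

def simd_op(op, xs, ys):
    """One instruction stream (op), applied across multiple data streams."""
    return [op(x, y) for x, y in zip(xs, ys)]

# the same 'add' instruction sweeps across all the data elements at once
simd_op(operator.add, [1, 2, 3], [10, 20, 30])  # -> [11, 22, 33]
```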
|
|
67:37 | All right, so most of what we'll look at in the class is a particular kind
|
|
67:42 | of heterogeneous node architecture, in this case one that has some form of accelerator
|
|
67:53 | attached: a special-purpose, to some degree, device that is integrated into the node,
|
|
68:09 | and typically this is the way things are being integrated now. So GPUs,
|
|
68:19 | the way you're going to use them in this class, and probably the way anyone
|
|
68:23 | who has used one already has: it is in the form of an accelerator
|
|
68:29 | or processor that is connected to the CPU on an I/O bus, known as
|
|
68:38 | the PCI Express bus, typically, these days, which is separate
|
|
68:46 | from the memory bus to which the memory is attached, or accessible to the CPU.
|
|
68:56 | Ah, so there are many things to be aware of. So the first thing to be
|
|
69:09 | aware of in this type of scenario: there are things I'll mention briefly, hopefully,
|
|
69:14 | before the class is over, about GPUs in your laptop, where the
|
|
69:23 | CPU has an integrated GPU. It doesn't normally have an attached one;
|
|
69:28 | it might, but typically many processors for desktops have the GPU on
|
|
69:34 | the same piece of silicon as the CPU. But we will focus, in
|
|
69:41 | this class and for your assignments, on programming the attached devices. So the first
|
|
69:49 | thing to become aware of is that these attached devices, like accelerators and GPUs
|
|
69:57 | and FPGAs, uh, tend to have their own memory. And so
|
|
70:06 | there are two, or multiple, memory spaces; sometimes you have several GPUs attached to
|
|
70:13 | a given CPU. On this PCI Express bus you can host a few,
|
|
70:21 | uh, GPUs or other devices, typically, but depending on what the design
|
|
70:30 | or target market is for the CPU, it may also have more than one
|
|
70:35 | PCI Express bus to which devices can also get attached.
|
|
70:42 | So the point is, programming these types of nodes, you have two memory spaces
|
|
70:48 | to worry about; it is also the case that these attached processors have their own instruction
|
|
70:55 | set. So now you have two memory spaces to worry about, and you have
|
|
71:00 | two instruction sets to worry about in terms of programming these devices. And I wanted to
|
|
71:08 | also mention that this PCI Express version of the I/O bus is a
|
|
71:13 | standardized bus that is commonly used for attaching devices, and here are, just, today,
|
|
71:22 | the data rates. Stampede2 provides the PCI Express version 4, and I
|
|
71:29 | think that's also indicated for Bridges-2, which we will use for the next GPU programming exercise;
|
|
71:36 | Stampede2 doesn't have GPUs, so we use Bridges-2 for the GPU assignment.
|
|
71:43 | So these are the data rates. And just to, you know, remind you a little bit:
|
|
71:49 | um, the CPU memory bandwidth is typically, you know, 100-plus, maybe
|
|
71:55 | up to 200, gigabytes per second per socket, whereas the GPU, on the
|
|
72:03 | other hand, is even higher to its own local memory. So compared to these two
|
|
72:11 | data paths, between the CPU and its memory and the GPU and its memory, the PCIe
|
|
72:18 | link is very weak. And here is a little bit also of, I guess, things
|
|
72:26 | to keep in mind when it comes to the use, and the advantage or not, of
|
|
72:30 | CPUs and GPUs. And for this case, I just took two of the
|
|
72:34 | examples of the processors I have talked about in the processor lecture, as well as
|
|
72:40 | the two GPUs I mentioned in the processor architecture lecture as well.
|
|
72:49 | So the things to keep in mind are the different degrees of parallelism,
|
|
72:56 | uh, in terms of CPUs and GPUs, and, um, also then the
|
|
73:10 | ops per cycle, as well as the, yes, the resulting performance.
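Back-of-the-envelope transfer times follow directly from these bandwidths. The numbers below are illustrative stand-ins only (roughly a PCIe gen3 x16 link, one CPU socket's memory system, and a GPU's local high-bandwidth memory), not the exact figures on the slide:

```python
def transfer_seconds(gigabytes, gb_per_s):
    """Time to move a payload at a given sustained bandwidth."""
    return gigabytes / gb_per_s

# illustrative bandwidths in GB/s: I/O link, CPU socket memory, GPU memory
pcie_link, cpu_memory, gpu_memory = 16.0, 200.0, 900.0

# moving a gigabyte across the I/O bus dwarfs either local access
slowest = transfer_seconds(1.0, pcie_link)
```

This is why the PCIe link is the weak path: staging data onto the GPU can easily cost more time than the computation it feeds.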
|
|
73:20 | So the point of this slide, which I will leave as my last slide and
|
|
73:24 | continue from next time, is a few things. So first, if one looks at the
|
|
73:32 | cores or threads, it looks like the CPUs are considerably weaker, um, or less
|
|
73:40 | powerful, than the GPUs. That is kind of a bit of a fallacy, because
|
|
73:50 | the CPUs today have these SIMD units associated with them. So in a
|
|
74:03 | single instruction, for each thread if you like, they can do a lot
|
|
74:11 | more operations than what the GPUs can do. The GPU SIMD engines have
|
|
74:20 | very simple threads, so to speak; each thread can only
|
|
74:25 | do two ops per cycle, whereas the CPUs can do eight,
|
|
74:36 | 16, or sometimes even 32, ops per cycle per thread. So that actually changes the
|
|
74:45 | comparison to not be quite as dramatic as the core or thread count may seem to
|
|
74:51 | indicate. So, in the end, the GPUs that are listed here are the
|
|
74:57 | ones that are designed for scientific and engineering applications, which means they have quite
|
|
75:08 | good capabilities for double-precision floating point, which is not typical for GPUs in general.
|
|
75:16 | Um, for CPUs there is a factor of two between single and double precision
|
|
75:24 | performance, and that's true for the GPUs that are listed as well. But
|
|
75:28 | for most GPUs, the double-precision performance may be a factor of 10 or more lower
|
|
75:37 | than the single-precision performance. So I guess the take-home message, in terms
|
|
75:45 | of this slide, the bottom line in terms of application, is: yes, if
|
|
75:52 | one ends up, um, using GPUs, the peak performance of the
|
|
76:00 | GPU in itself is maybe a factor of a few or so higher, whether you're in
|
|
76:04 | single or double precision, depending on which type of GPU you have. But in order
|
|
76:09 | to get that, you need to have a very much higher degree of parallelism than you
|
|
76:21 | need for the CPUs, so it's by no means an easy thing to
|
|
76:28 | realize this potential difference, because the CPUs, I
|
|
76:35 | mean, have the MIMD capability. So there are things you can do there to get
|
|
76:41 | good performance where the MIMD model is more appropriate than SIMD, but for the
|
|
76:49 | GPUs, SIMD is pretty much the only way to close the gap and get close to
|
|
76:55 | their peak performance. And with that, I will stop, since I am already over time.
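The peak-rate arithmetic behind this comparison is simple. The core counts, ops per cycle, and clocks below are illustrative stand-ins, not the exact processors on the slide:

```python
def peak_gflops(cores, ops_per_cycle, ghz):
    """Peak rate = cores x ops per cycle x clock (GHz -> Gflop/s)."""
    return cores * ops_per_cycle * ghz

# Illustrative counts only: a CPU thread with wide SIMD units may do
# tens of ops per cycle, a single GPU thread only about two, so raw
# thread counts alone overstate the gap between the two.
cpu_peak = peak_gflops(cores=28, ops_per_cycle=32, ghz=2.0)   # 1792.0
gpu_peak = peak_gflops(cores=3500, ops_per_cycle=2, ghz=1.5)  # 10500.0
```

The GPU's headline number is higher, but only reachable with thousands of concurrent threads, which is the point made above.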
|
|
77:01 | We will continue to talk about the programming next time. And I guess, finally, as
|
|
77:07 | mentioned last week, uh, we will send out the midterm. That's the one-
|
|
77:13 | week take-home exam, which is basically a review for you. Um, I was going to
|
|
77:23 | say, the strategy was: you will find most of the answers to the problems on the
|
|
77:29 | midterm either in the slides or in the links posted on the site. So it's a way
|
|
77:37 | for me to make sure that you go through the slides. There are no programming
|
|
77:41 | tasks in the midterm. So with that, I can stop sharing the screen. Any questions?
|
|
77:55 | A comment on the block matrix-matrix multiplication problem, the second one:
|
|
78:07 | I've noticed that we haven't been using, I mean, the assumption is that we
|
|
78:12 | don't provide the AVX-512 flag to the compiler, right? Um,
|
|
78:18 | so as far as I know, we haven't been using the vector units, because it
|
|
78:22 | needs that flag in the compiler. Um, hey, I, um, so I don't
|
|
78:32 | remember the exact text in the assignment, uh, so did we rule out the
|
|
78:41 | use of the vector instructions in the assignment? Do you remember?
|
|
78:48 | We don't explicitly say, don't use it. Uh huh. The main motive of
|
|
78:54 | the problem is that we use OpenMP constructs to get whatever performance you can.
|
|
79:02 | It's not that we discourage using AVX-512, but again, the motive is for
|
|
79:09 | you to understand OpenMP better. No, it was certainly fine to
|
|
79:18 | use it. I, um, but as I said, the notion was essentially that,
|
|
79:24 | you see, you use OpenMP and maybe find out the impact of the
|
|
79:30 | constructs. Okay. Yeah, I just saw that the MKL wasn't using it,
|
|
79:34 | so I figured we shouldn't use it either. But, you know, anyone is
|
|
79:38 | certainly welcome to experiment with it and see if the compiler manages to do the
|
|
79:44 | vectorization. It may not. Okay, so that's the vector instructions, right?
|
|
79:50 | So if we did decide to experiment with it?
|
|
79:56 | Yes. Okay. And it's not that you're looking for how much performance you can
|
|
80:04 | gain. It's like, what are all the strategies you have tried, and
|
|
80:10 | what, what are your insights from applying those strategies for trying to better
|
|
80:14 | those codes? Okay, that's a good point, and we should perhaps be more
|
|
80:20 | clear on whether they can use the flag. Again, so, from my side, it
|
|
80:25 | was mostly a question about, again, being clear on using shared and private variables and
|
|
80:34 | their consequences, um, on memory behavior and performance. So it was mostly focused on
|
|
80:43 | managing memory, as opposed to using the architecture features. That was where my
|
|
80:53 | focus was in writing this thing. I think the first part was very much
|
|
80:59 | about whether parallelizing the inner or the outer loop was better, um, or at least my results
|
|
81:06 | pointed to that. Right. Well, yes. That was a good point.
|
|
81:12 | Okay, well, good job. Okay. Any other questions from
|
|
81:16 | anyone else? Just a comment: one of the students reported yesterday that he was
|
|
81:26 | having some issues when using zero optimization with the block matrix multiplication. When he was trying
|
|
81:35 | to parallelize at the outermost loop level, for some reason -O0 does some things that mess up the
|
|
81:44 | memory allocations and give him a segmentation fault. From what I can read from his messages,
|
|
81:51 | the -O1 optimization seems to have done the trick. So in case anybody is getting
|
|
81:57 | something similar, try using a different optimization level than -O0. Yeah.
|
|
82:03 | I guess I should also add: a couple of you asked for one more
|
|
82:09 | day on the assignment, and I had said yes. So we should say
|
|
82:14 | yes to all of you. So you have until tomorrow, and we'll send an
|
|
82:18 | announcement as well, to make sure everyone knows that you have one more
|
|
82:24 | day. The instructions said we were supposed to be using the -O1 optimization level for the
|
|
82:30 | baseline, right? If it said -O1, then that's fine. I don't remember
|
|
82:35 | exactly the text. Well, yeah, that's in case you're using -O0.
|
|
82:39 | That's why I said okay; I was just wondering why he was using -O0,
|
|
82:44 | but okay. Great. Another question that, um, I'm not sure we'll get
|
|
82:52 | a straight answer to, but I've had similar results, um, whether
|
|
82:58 | I run a script during an interactive session versus just handing it off as a
|
|
83:03 | batch job. Um, what do you expect? Close to identical results, whether
|
|
83:09 | doing it in an interactive session versus a batch job? Because the development and the
|
|
83:16 | SKX nodes are identical from a hardware perspective, right? Yeah, these
|
|
83:22 | are two different ways of submitting jobs. As long as you're requesting the same
|
|
83:29 | resources and keeping all the runtime parameters identical, you may not get exactly the same
|
|
83:35 | numbers, but the values should not deviate much between the two things. So the
|
|
83:40 | overhead in having, so, if we're in the VNC session, as
|
|
83:46 | opposed to an idev session in the plain terminal, um, the overhead of
|
|
83:52 | providing, ah, a graphical interface with, um, with CentOS, that shouldn't have an
|
|
83:59 | appreciable effect on our results? It should be very minimal. I don't
|
|
84:06 | think, with VNC, it should be a significant issue. Okay. Yeah,
|
|
84:13 | with the batch jobs, let me just say, sometimes I find myself waiting longer.
|
|
84:21 | And that's a question of the load on the cluster; the admins do not let people
|
|
84:26 | run too many jobs at a time. Yes, for sure. There's just a lot
|
|
84:33 | of permutations on this assignment. Yes, and the same for the next one.
|
|
84:42 | Yes. I wish I had scripted the earlier assignments better, but, no, I
|
|
84:52 | see. Hopefully the work gives some insights in different
|
|
84:57 | ways. Mhm. That is the purpose of the assignments, I guess. Okay,
|
|
85:20 | if no more questions, then I guess we will end the session.
|