© Distribution of this video is restricted by its owner
00:00 | Okay, so today is more OpenACC, the directive-based |
|
|
00:11 | way of programming attached processors, which for this class means GPUs. But |
|
|
00:21 | in principle it is not restricted just to GPUs, even though I don't know if |
|
|
00:25 | it has been used for anything else at this time. So here is |
|
|
00:34 | kind of an outline of what I'm going to talk about today. The first few slides |
|
|
00:38 | are basically a recap of the last lecture. So that's the |
|
|
00:47 | difference in the level of parallelism, the difference in memory sizes, which are |
|
|
00:53 | important, and the bandwidth from memory, and the fact that there is, um, a |
|
|
01:00 | thin pipe, relative to other data pathways, between the CPU and the GPU, the |
|
|
01:07 | PCI Express bus, and that there are two address spaces and two instruction sets that need |
|
|
01:14 | to be dealt with in order to get working code. And here was the point |
|
|
01:19 | that things start and end on the host. And that means that both data |
|
|
01:28 | and code need to move from the host to the attached device, which then hopefully executes |
|
|
01:35 | relatively independently, and then, when the results are available, they get copied back. |
|
|
01:42 | Now, given that the memory on the device tends to be significantly smaller than |
|
|
01:47 | host memory, sometimes that means a lot of interaction between the host and the |
|
|
01:57 | device in order to accomplish the entire computation. It's also the case that |
|
|
02:02 | they can work sort of asynchronously, so things can be handed over to the |
|
|
02:10 | device for certain segments while the CPU works on other parts |
|
|
02:15 | of the code. I'll talk about this. Now, here is an |
|
|
02:21 | example, and I will use this example to illustrate some of the |
|
|
02:29 | features of OpenACC. And as... Yeah? Sorry. |
|
|
02:37 | So you say host and device to be consistent in principle, but in practice we're talking about |
|
|
02:42 | CPU and GPU, respectively, right, in this class? Yes. And as far |
|
|
02:50 | as I know, OpenACC has not been used for anything else. |
|
|
02:56 | I may be a little bit out of date, but not much. OpenMP, |
|
|
03:03 | which now also supports accelerators, has been used for other devices as well, |
|
|
03:11 | GPUs and FPGAs. Okay, thank you. And I'm trying to use the |
|
|
03:20 | notion of device, which seems to now be used to some degree both in the |
|
|
03:28 | OpenMP community and the OpenACC community. So, uh, I |
|
|
03:38 | also mentioned these compiler flags that are, I believe, unique to the PGI |
|
|
03:47 | OpenACC compiler, and presumably it's the only OpenACC compiler. |
|
|
03:53 | Well, that's not true. Cray has an OpenACC compiler |
|
|
03:59 | as well, but that may differ somewhat from the PGI compiler. The standard |
|
|
04:07 | was kind of a joint development, as I said in the last lecture, in a |
|
|
04:12 | consortium between Cray, NVIDIA and PGI. So in this lecture I'll talk |
|
|
04:21 | a little bit about using these different flags for the target device or accelerator, |
|
|
04:30 | mostly the GPU in terms of examples. And I think I |
|
|
04:37 | showed this one last time as well: the sequential code for doing this Jacobi |
|
|
04:43 | iteration. Those of you who have had an introductory numerical analysis course |
|
|
04:52 | know about Jacobi as one of the simplest iterative algorithms. And |
|
|
04:59 | it kind of works in a synchronized fashion, if you like, in that, as |
|
|
05:06 | I mentioned last time, it uses two versions of the array: the A version, |
|
|
05:16 | which is the blue dots, and it generates the new array, and Anew |
|
|
05:22 | is the red dots, where each is the average of its four neighbors. |
|
|
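The sequential scheme described here can be sketched in plain C roughly as follows. This is a minimal sketch, not the code from the slide: the grid size and the names `jacobi_step` and `jacobi_solve` are assumptions made for illustration.

```c
enum { N = 32, M = 32 };              /* hypothetical grid size */
static double A[N][M], Anew[N][M];    /* old (blue) and new (red) values */

/* One Jacobi step: compute every new value, then swap.
   Returns the largest update magnitude, used as the error measure. */
static double jacobi_step(void)
{
    double error = 0.0;

    /* First loop nest: each new value is the average of its four neighbors. */
    for (int j = 1; j < N - 1; j++) {
        for (int i = 1; i < M - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                 A[j - 1][i] + A[j + 1][i]);
            double diff = Anew[j][i] - A[j][i];
            if (diff < 0.0) diff = -diff;
            if (diff > error) error = diff;
        }
    }

    /* Second loop nest: the new values become the old values. */
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < M - 1; i++)
            A[j][i] = Anew[j][i];

    return error;
}

/* Repeat until the largest update drops below tol or iter_max is reached. */
static int jacobi_solve(double tol, int iter_max)
{
    double error = tol + 1.0;
    int iter = 0;
    while (error > tol && iter < iter_max) {
        error = jacobi_step();
        iter++;
    }
    return iter;
}
```

The boundary rows and columns of `A` are never written, so they act as fixed boundary conditions.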
05:28 | You need to compute all the red dots, all the new values of the |
|
|
05:33 | points, first, and then swap. And so the first two loops go |
|
|
05:41 | over the 2D grid and update, or compute, all the red dots, |
|
|
05:48 | and then, uh, the next loops swap and make the red |
|
|
05:56 | values the blue for the next iteration in the while loop, and then one can |
|
|
06:04 | output the maximum update magnitude, which is the error in this |
|
|
06:13 | case, and one uses basically the maximum-magnitude update value as a convergence |
|
|
06:25 | measure. It's just a very common, very simple technique. It's not very |
|
|
06:35 | efficient, so people don't use it very much for doing iterative solves. |
|
|
06:42 | But that's a different, um, issue. So I think this |
|
|
06:50 | is where, uh, and I may have shown this thing, one |
|
|
06:58 | used the option for the compiler to take care of everything. So it |
|
|
07:04 | specifies that the target was a GPU, tesla as it says. For those of you who don't |
|
|
07:11 | know, Tesla here has nothing to do with the famous Tesla known for electromagnetics; it's |
|
|
07:18 | the name of the product line from NVIDIA. And then it also has the managed option |
|
|
07:27 | that tells the compiler that the programmer wanted the compiler to pretty much take care |
|
|
07:35 | of everything. And what was used in creating the code for the |
|
|
07:47 | accelerator was the acc parallel pragma, to identify the region of |
|
|
07:58 | the code targeted for the accelerator through the pragma acc parallel part. Then one |
|
|
08:08 | wanted, um, the programmer wanted, this loop nest to be parallelized, and I'll |
|
|
08:15 | talk a little bit more about that in slides to come. So there are two parallel |
|
|
08:22 | regions and two pragmas. And what's on the left-hand side is from |
|
|
08:30 | using the info flag for the compiler, telling what it did with respect to the |
|
|
08:39 | code. And I don't remember exactly how I talked about it, but let's |
|
|
08:44 | go through it here as well. It basically shows again what it did: it |
|
|
08:52 | parallelized the outer loop using the notion of gangs, and I'll talk a little |
|
|
09:00 | bit more about that in slides to come. So there are |
|
|
09:06 | these three notions of parallelism for GPUs, which are kind of unique to |
|
|
09:13 | GPUs, not just NVIDIA's; it's the same for, I think, all of the vendors |
|
|
09:20 | of these GPUs. There is the gang level of parallelism, there is what's known as |
|
|
09:31 | the worker level of parallelism, I'll talk about that, and then the vector |
|
|
09:36 | and the SIMD feature. So in this case it used gang-level parallelism for |
|
|
09:42 | the outer loop in this construct, and it used vectorization for the inner |
|
|
09:49 | loop. And that's kind of a common strategy for compiler writers: to vectorize inner |
|
|
09:57 | loops and try to use gang, or higher-level, parallelism for outer |
|
|
10:05 | loops. And it also generated the reduction that was necessary for the |
|
|
10:12 | first loop nest in order to guarantee correctness, similar to what we talked about |
|
|
10:18 | in terms of OpenMP, making sure that everything gets executed correctly. |
|
|
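A sketch of what that strategy looks like when written out by hand is given below. It is not the code from the slide: the grid size and the function name are assumptions, and the `gang`, `vector` and `reduction` clauses are spelled out here even though, in the lecture's example, the compiler chose them on its own. A compiler without OpenACC support ignores the pragmas and runs the loops sequentially with the same result.

```c
enum { N = 32, M = 32 };              /* hypothetical grid size */
static double A[N][M], Anew[N][M];    /* file-scope arrays, so their shape is known */

/* One Jacobi step written as two parallel regions.
   Returns the largest update magnitude. */
static double jacobi_step_parallel(void)
{
    double error = 0.0;

    /* Region 1: gangs across the outer loop, vector lanes across the inner
       loop. The max reduction keeps the shared error value correct. */
    #pragma acc parallel loop gang reduction(max:error)
    for (int j = 1; j < N - 1; j++) {
        #pragma acc loop vector reduction(max:error)
        for (int i = 1; i < M - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                 A[j - 1][i] + A[j + 1][i]);
            double diff = Anew[j][i] - A[j][i];
            if (diff < 0.0) diff = -diff;
            if (diff > error) error = diff;
        }
    }

    /* Region 2: same strategy for the swap, no reduction needed. */
    #pragma acc parallel loop gang
    for (int j = 1; j < N - 1; j++) {
        #pragma acc loop vector
        for (int i = 1; i < M - 1; i++)
            A[j][i] = Anew[j][i];
    }

    return error;
}
```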
10:25 | Similarly, for the second loop nest it used the same strategy: gangs for the outer and |
|
|
10:32 | vector for the inner. And then it also took care of the data movement. |
|
|
10:40 | Remember, things start and end on the host. So in this case it used a copyin for |
|
|
10:46 | A. It needs to come from somewhere, so it comes from the |
|
|
10:51 | host. So there is a copyin of A. Then, also, |
|
|
10:59 | Anew is something that is actually created on the device. So what |
|
|
11:13 | was used is this copyout, which I think I talked about last time. So it |
|
|
11:19 | allocates memory for the array, and it makes sure that the host gets |
|
|
11:25 | the outcome of the array, that is, the evaluated or updated |
|
|
11:33 | values. And the error variable is something that the host sets. So it |
|
|
11:40 | also makes sure that it is copied from the host to the device |
|
|
11:47 | and then copied back from the device to the host. So they kind of stay in |
|
|
11:53 | sync in terms of what the value of the error variable is. Analogously, |
|
|
11:59 | the same thing happens for the other part: it has Anew as |
|
|
12:08 | input, on the right-hand side of the assignment, so it is copied from the host to the device. |
|
|
12:17 | And then the result is the array A, which is returned to the |
|
|
12:29 | host. There are a number of other things that one can notice. Maybe... |
|
|
12:42 | Let me see if I want to do that now. All right, so |
|
|
12:47 | does anyone have any questions or comments on what goes on in this |
|
|
12:55 | version of the compiled code? I'll talk about other versions in coming slides. |
|
|
13:04 | Could you say a little bit about how the second, uh, region, |
|
|
13:09 | or I guess the second pragma, doesn't copy in A? Good |
|
|
13:16 | question. Yes. Um, so it doesn't copy in |
|
|
13:26 | A. So the compiler was clever enough to figure out about A, but probably not |
|
|
13:35 | about Anew, that it is used in both the first and the second parallel |
|
|
13:42 | region. Um, one can try to hypothesize as to why the compiler chose to |
|
|
13:52 | be conservative about Anew, but I assume it knows what A is for. It |
|
|
14:01 | didn't need it, actually, because A gets assigned in the second loop. So as long |
|
|
14:08 | as it has allocated memory for it on the device, it doesn't need to copy it in, if |
|
|
14:18 | that helps. But it didn't understand that Anew is used in the |
|
|
14:28 | second parallel region. So it did the same as in OpenMP, when we talked |
|
|
14:33 | about how things end at the end of the parallel region. When it came to |
|
|
14:38 | OpenMP and just the shared-memory case, if things were not a |
|
|
14:45 | shared variable and were allocated inside the parallel region, they get deallocated. And if |
|
|
14:50 | you wanted them to be preserved to another parallel region, or just a sequential part, |
|
|
14:59 | in terms of OpenMP you needed to copy it out. So now it's a |
|
|
15:04 | separate piece of memory, GPU memory in this case, or device memory, so |
|
|
15:10 | it needed to write it to the host so that it does not get lost. |
|
|
15:22 | Right. And copyout, yes, you should remember that copyout |
|
|
15:29 | also includes an allocation, even though it doesn't initialize the array. So it does |
|
|
15:35 | allocate memory for A, but it doesn't need to be initialized because it's |
|
|
15:42 | only written to. That's why it's copyout: it's sufficient to allocate and |
|
|
15:49 | return values to the host. And it was the same in terms of how |
|
|
15:58 | Anew was handled in the first parallel region, right? It was not |
|
|
16:05 | initialized; it was allocated, and then it was assigned values in the region. It |
|
|
16:10 | didn't need to be initialized. And so on. And here is what happened. |
|
|
16:22 | It did make good use of, in this case, the GPU. So they got |
|
|
16:28 | a very good speedup. And here is a little bit more about |
|
|
16:38 | this notion of managed: what the compiler uses is a notion of unified memory. |
|
|
16:45 | It means there are supporting mechanisms to kind of pretend that it is just one |
|
|
16:55 | address space, or one memory. So there is a common address space, even though the |
|
|
17:04 | access properties of the different parts of the address space are quite different. |
|
|
17:10 | The compiler needs to keep track of what the properties are of those different parts. |
|
|
17:17 | But it by itself looks at, or has access to, both the host memory and the device |
|
|
17:27 | memory and tries to figure out what the best thing to do is in terms of |
|
|
17:33 | allocating memory and transferring data between the host memory and the device memory. And it |
|
|
17:44 | did reasonably well. I'll comment more on why I just say reasonably well; |
|
|
17:52 | obviously, they got a very good speedup, so maybe I shouldn't, |
|
|
17:57 | but I will make clear why I say reasonably as we go. |
|
|
18:05 | So, um, sometimes it's beneficial to try to manage, |
|
|
18:17 | um, the address spaces oneself, and that's what I'm going to talk about |
|
|
18:24 | next, because the compiler is not always capable of doing a good job of managing |
|
|
18:35 | two address spaces with different properties. So the first thing now is to try to see |
|
|
18:45 | what happens if we omit the managed attribute for this, um, |
|
|
18:58 | flag. Now the compiler is not given the task of managing the two types |
|
|
19:07 | of memory, so it, um, doesn't optimize it. I should say it generates |
|
|
19:17 | code that is correct, so it's perhaps not correct to say that |
|
|
19:27 | it doesn't manage the two memory spaces, but it treats them as truly separate |
|
|
19:33 | spaces. So here's what happened in this case. On the |
|
|
19:45 | right-hand side, um, you see what happened. Um, it's |
|
|
19:53 | not correct. It should not say data clauses for this part, because that's not what was used. |
|
|
19:59 | So this is an error in the slide. I'm sorry. I copied the |
|
|
20:03 | slide without fixing it; you know, I've used the slide a few |
|
|
20:07 | times. So with managed memory, they did a very good job. But if you |
|
|
20:12 | do not have managed memory, with the compiler explicitly told not to manage |
|
|
20:20 | memory, then things can go kind of badly. So next I'm going to talk |
|
|
20:29 | a little bit about what actually happens in this case and what the differences are, |
|
|
20:34 | and then proceed to using the data construct to get back to having good performance. |
|
|
20:45 | Here is now kind of a look at the code, now compiled without |
|
|
20:56 | that managed attribute, as they call it, and on the left-hand side I |
|
|
21:03 | have what the compiler reported, what it actually did in this |
|
|
21:10 | case. So what did it do? Well, two things. It certainly made sure |
|
|
21:15 | that the code was correct by generating the reduction clause. And the other thing |
|
|
21:21 | it did: it took these loops and did the same thing as in |
|
|
21:27 | the managed case, in that it used gang parallelism for the outer loop |
|
|
21:33 | and vectorized things in the inner loop, and the same thing with the next parallel |
|
|
21:43 | region and its two loops. It did exactly the same thing. So the |
|
|
21:49 | parallelization ended up being exactly the same whether it was managed |
|
|
21:55 | or not. So that's not the reason for the huge drop in performance. |
|
|
22:06 | It's perhaps hard to recognize, but data management is the reason for things being |
|
|
22:15 | so different. So here is what we have in terms of what the compiler is |
|
|
22:23 | doing now. So it did a copyout, which means it allocated memory for |
|
|
22:32 | Anew and then copied it back to the host, which was also the case in |
|
|
22:43 | the managed case. And this is similar to what was done in the managed case. |
|
|
22:59 | What is not clear from the compiler output in this case is |
|
|
23:13 | how they're handled differently, so it looks like it's more or |
|
|
23:17 | less the same. But this is what's happening, and the reason why there's a |
|
|
23:23 | lot of traffic. So, as was the case in the |
|
|
23:28 | managed case... but I kind of doubt that the actual output |
|
|
23:36 | is right. Sorry. So you would have to |
|
|
23:44 | rerun the code, making sure that the output is actually consistent with the |
|
|
23:54 | flag settings. But in this case, what happens is that there is |
|
|
24:02 | a lot of copying. So first, as it says, there is the problem that, |
|
|
24:07 | from one of these parallel regions to the other, Anew moves back |
|
|
24:17 | and forth: it is copied from device to host and from host back to the device, |
|
|
24:25 | which is not really necessary. It is necessary because of the semantics of how |
|
|
24:37 | the parallel regions are handled, but it's not really what we would wish would happen, |
|
|
24:43 | and, um, otherwise, with A it does the proper job. So I |
|
|
24:56 | put a little question mark here. Does anyone have an idea of any other |
|
|
25:17 | problem? That's okay. I'll come to the next problem. So here is |
|
|
25:24 | a little bit about how one can try to help the compiler by using, |
|
|
25:31 | um, clauses to tell it what it should be doing in terms of copying things |
|
|
25:39 | to and from the device and host, and there is also the shape directive, which |
|
|
25:49 | is sometimes a good thing to use to help the compiler figure out how to allocate |
|
|
25:56 | memory. It will be used in examples to come. So here is now |
|
|
26:07 | explicitly managing the data traffic, so to speak, by now putting in clauses |
|
|
26:15 | and trying to help the compiler. It says: I want a copyin for the array A. |
|
|
26:20 | Um, and in this case, I don't know exactly why the person who |
|
|
26:26 | did the example used copy for Anew. It could have been perfectly fine |
|
|
26:30 | just to use the create clause instead of copy for Anew, |
|
|
26:36 | because it doesn't need to have input values from the host. |
|
|
26:44 | So, uh, now in this case they had this, and |
|
|
26:50 | it turns out, as you can see here, that |
|
|
26:55 | it may be a little bit better, but pretty much everything is the same as the |
|
|
27:00 | one without the explicit copyin and copyout clauses. So this didn't |
|
|
27:09 | help. So the other problem, which I was trying to allude to, is, |
|
|
27:22 | um, what it says on this slide. So, in addition to the |
|
|
27:32 | traffic between the device and the host between the two parallel regions, it also |
|
|
27:43 | happens for each iteration in the while loop. So there is, both inside the |
|
|
27:54 | loop, excess traffic on this PCI bus, which is the weakest part, as well |
|
|
28:02 | as between iterations. And this was really noticed by the designers |
|
|
28:10 | of OpenACC, as well as OpenMP: that one needs tools to |
|
|
28:16 | manage, uh, data on the device in situations like |
|
|
28:29 | this. So there is a data construct that can be used to specify the lifetime |
|
|
28:44 | of variables or arrays that are on the device. And that |
|
|
28:57 | I will now show on the next slide. So here is how it can be |
|
|
29:01 | used, with the clauses for what one wants to happen to the various arrays. |
|
|
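A minimal sketch of a data region wrapped around the whole iteration is shown below, assuming the same Jacobi arrays as in the lecture's example. The grid size and the function name `jacobi_solve_data` are illustrative assumptions, and this is not the slide's code. `copy(A)` transfers `A` to the device on entry and back on exit, while `create(Anew)` only allocates device memory. Without an OpenACC compiler the pragmas are ignored and the code runs sequentially.

```c
enum { N = 32, M = 32 };              /* hypothetical grid size */
static double A[N][M], Anew[N][M];

/* Jacobi iteration with both arrays kept on the device for the whole loop.
   Returns the number of iterations performed. */
static int jacobi_solve_data(double tol, int iter_max)
{
    double error = tol + 1.0;
    int iter = 0;

    /* A: host -> device once on entry, device -> host once on exit.
       Anew: device-only scratch array, never transferred. */
    #pragma acc data copy(A) create(Anew)
    {
        while (error > tol && iter < iter_max) {
            error = 0.0;

            #pragma acc parallel loop reduction(max:error)
            for (int j = 1; j < N - 1; j++) {
                #pragma acc loop reduction(max:error)
                for (int i = 1; i < M - 1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                         A[j - 1][i] + A[j + 1][i]);
                    double diff = Anew[j][i] - A[j][i];
                    if (diff < 0.0) diff = -diff;
                    if (diff > error) error = diff;
                }
            }

            #pragma acc parallel loop
            for (int j = 1; j < N - 1; j++) {
                #pragma acc loop
                for (int i = 1; i < M - 1; i++)
                    A[j][i] = Anew[j][i];
            }

            iter++;
        }
    }
    return iter;
}
```

Only the scalar `error` still moves between host and device in every iteration, which is cheap compared with the arrays.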
29:13 | In this case, one doesn't want A to be copied to the host for every |
|
|
29:22 | iteration, and the same thing with Anew, and you also want it to |
|
|
29:27 | be preserved between the two parallel regions. So in this case, we need to |
|
|
29:37 | both allocate memory for A as well as initialize it from the host, |
|
|
29:44 | so that's why it's a copy clause, because it also returns values to the host |
|
|
29:53 | when things are said and done. And in this case the create clause was used |
|
|
29:59 | for Anew, because the host doesn't need to know about the Anew |
|
|
30:06 | array. It just needs to know what the final outcome is when it's finished. |
|
|
30:11 | So Anew is entirely local to the device in this case, and A |
|
|
30:18 | is only copied once from the host at the start of the iteration and back |
|
|
30:27 | at the end of the iteration. And now, I guess, as shown on this slide, |
|
|
30:36 | pay attention to the compiler output. It is in this case the same thing: it set |
|
|
30:43 | up the reduction clause and figured out how to parallelize things in the very |
|
|
30:48 | same way as before. But now the copying is done only once, outside the |
|
|
31:01 | kernels, so to speak. It's not done with the kernels but is done once |
|
|
31:08 | outside. That now results in things behaving well. So in this case, in |
|
|
31:17 | fact, the manually managed data copying gets a little bit better performance than the |
|
|
31:32 | compiler-managed version, and the small difference also tells me that, for what I showed on the |
|
|
31:39 | uh, compiler-managed version slide, I doubt that, um, |
|
|
31:48 | the arrays are copied as much as it kind of looks like in the compiler |
|
|
31:55 | output. The compiler, it seems, must also have basically done it only |
|
|
32:00 | once. Otherwise, there would be a big difference. So, yeah. I think |
|
|
32:07 | this is a good stopping point for questions. So while you're thinking about |
|
|
32:18 | potential questions: again, the point is that there are these two memory spaces with |
|
|
32:27 | seriously non-uniform memory access, and that accelerators are typically connected to the host |
|
|
32:38 | via an I/O bus that has a lot lower performance than |
|
|
32:46 | either of the two memory buses. That's the problem. And managing and reducing the |
|
|
32:53 | amount of copying is one of the things one needs to worry about in getting |
|
|
33:01 | reasonably useful and efficient GPU code. Otherwise, you may not only be |
|
|
33:10 | disappointed; things may actually slow down. All right, so this is |
|
|
33:24 | more or less what I just said, and it was a tiny difference |
|
|
33:28 | in this case. So I think this is kind of a benchmark slide for a |
|
|
33:36 | few, um, actual application codes, you know, a fairly wide range: |
|
|
33:46 | fluid dynamics codes and some physics codes. But, you |
|
|
33:51 | know, it just shows that the compiler-managed unified memory version |
|
|
34:01 | most of the time is close to the manually managed one. But sometimes it doesn't |
|
|
34:10 | do such a great job. So I guess, many times, |
|
|
34:17 | in terms of the approach to programming, it may be good to first |
|
|
34:24 | use the managed compiler option to get the code running, and then one can |
|
|
34:31 | start to try to figure out if one can make it do better by starting to |
|
|
34:36 | explicitly manage copying between the device and the host. Um, OpenACC, |
|
|
34:52 | um, if I remember correctly, didn't have a parallel construct; it actually had |
|
|
35:04 | this construct, that is the kernels construct, which in many ways is similar to the |
|
|
35:12 | parallel construct. But it is a construct that gives more freedom to the compiler to |
|
|
35:21 | figure out what to do. I think I had on one slide in the last |
|
|
35:27 | lecture the argument between the OpenMP community and the OpenACC |
|
|
35:35 | community, which is that the OpenACC guys say |
|
|
35:41 | compilers are very good at this point in time, so give it to the compiler to |
|
|
35:48 | figure out how to parallelize the code, whereas the OpenMP gang is |
|
|
35:54 | a little bit more conservative, in the sense of being prescriptive, as opposed to |
|
|
36:01 | OpenACC, which claims to be more descriptive. The parallel construct is more in the |
|
|
36:08 | OpenMP flavor, and the kernels construct is more in the OpenACC, |
|
|
36:14 | let-the-compiler-do-things flavor. So in this case, what's shown on this slide |
|
|
36:21 | is that the kernels construct is used for the entire set of loop nests, both |
|
|
36:30 | of them. There is no explicit directive for each one of the two loop nests saying that |
|
|
36:35 | one wants it to be parallelized. It says: here is a piece of |
|
|
36:40 | code, figure it out. And now here is the outcome. |
|
|
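For comparison with the parallel construct, a sketch of the kernels version might look like the following. The grid size and function name are assumptions for illustration. A single `kernels` region encloses both loop nests and carries no per-loop directives, so how (and whether) each nest is parallelized, including recognizing the max reduction, is left to the compiler.

```c
enum { N = 32, M = 32 };              /* hypothetical grid size */
static double A[N][M], Anew[N][M];

/* One Jacobi step with a single kernels region around both loop nests.
   Returns the largest update magnitude. */
static double jacobi_step_kernels(void)
{
    double error = 0.0;

    /* "Here is a piece of code, figure it out": no loop directives inside. */
    #pragma acc kernels
    {
        for (int j = 1; j < N - 1; j++) {
            for (int i = 1; i < M - 1; i++) {
                Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                     A[j - 1][i] + A[j + 1][i]);
                double diff = Anew[j][i] - A[j][i];
                if (diff < 0.0) diff = -diff;
                if (diff > error) error = diff;
            }
        }

        for (int j = 1; j < N - 1; j++)
            for (int i = 1; i < M - 1; i++)
                A[j][i] = Anew[j][i];
    }

    return error;
}
```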
36:52 | Yeah, so this is coming from a little bit different tutorial. In |
|
|
37:01 | this case, it's not speedup that is plotted but time, so |
|
|
37:06 | little bars are good, as opposed to bad, and it shows kind of the |
|
|
37:11 | difference: the CPU and the number of cores being used, and the CPU code |
|
|
37:17 | did not parallelize all that well. Yeah, it sped up to four-ish |
|
|
37:24 | cores, but then six and more cores didn't really do much. And again, whether |
|
|
37:30 | they tried to maximize the performance for CPU cores, who knows? Because this was |
|
|
37:39 | done by NVIDIA, and their focus is on GPUs, not on CPUs. |
|
|
37:46 | But we can see in this case that the kernels construct did a little bit better |
|
|
37:51 | than the parallel construct. And neither of them did well, which was the thing |
|
|
37:58 | that, um, we just talked about, when we didn't use the data construct to |
|
|
38:06 | make sure arrays were kept on the device. So in this case, neither the |
|
|
38:11 | kernels nor the parallel construct figured out how to keep things on the device between iterations, |
|
|
38:18 | so there was a lot of copying. But take a look: nevertheless, the |
|
|
38:25 | kernels construct did a little bit better than the parallel construct. So here is |
|
|
38:31 | the compiler output telling what it did in the two codes. |
|
|
38:38 | I'll just try to highlight the differences between the pieces. So one of the differences |
|
|
38:47 | is in the inner loop: for the parallel construct the compiler was more conservative and did |
|
|
38:57 | not try to do the higher-level parallelism in terms of gangs. It only |
|
|
39:03 | vectorized the inner loop, whereas for the kernels construct the compiler figured out |
|
|
39:10 | that it can also use the higher-level parallelism in combination with vectorization for the inner |
|
|
39:18 | loop. And that was true for both of the two loop nests. |
|
|
39:28 | So, yes, this just shows that there are little blue bars at |
|
|
39:35 | the bottom of this, um, graph, and it's hard to tell the difference |
|
|
39:39 | in performance on the compute part. And the bar there, next to the blue, |
|
|
39:46 | is for the data copying that is done, and the gray is whatever else happens |
|
|
39:52 | in the code. So this is about the difference in what happens with the |
|
|
39:58 | data copying between the kernels and the parallel construct. So in this case, for |
|
|
40:07 | the kernels construct, there was a bit less data copying going on, as |
|
|
40:16 | you can see in this case, in that it, uh, avoided the copy |
|
|
40:28 | in for Anew that was in the parallel construct. So it kind |
|
|
40:36 | of knew that it was already on the device. It was conservative and still let |
|
|
40:44 | the host know what the outcome was. But otherwise, um, it is the |
|
|
40:52 | same. So, looking at the bars, the kernels version did a little bit less |
|
|
40:56 | copying than the other, but it's significant. And it's even without the |
|
|
41:05 | types of overheads, is still made GPU or code in this case, |
|
|
41:12 | performance to the CPU code. So that was just |
|
|
41:26 | to show. I don't have an explicit slide for using the data construct to preserve data |
|
|
41:39 | between iterations to see the difference between kernels and parallel. But both constructs do |
|
|
41:48 | a very good job, and again I hope you will have the time to try it |
|
|
41:54 | out and see the potential difference. And this is a lot |
|
|
42:08 | of text, but mostly what it says is what the difference is: |
|
|
42:14 | that again, with kernels the compiler has more freedom, um, than with the parallel |
|
|
42:22 | construct. It also means that sometimes it may actually fail, and it's not |
|
|
42:30 | an advantage if it's a fairly complex code. Compilers, as I have said, have to |
|
|
42:38 | guarantee correctness of the code, so they are conservative in terms of optimization, and in case |
|
|
42:46 | of doubt, nothing is done. In my mind, the parallel construct is |
|
|
42:54 | more prescriptive. So in a sense, with parallel the programmer takes responsibility and |
|
|
43:01 | says: go ahead and parallelize this, with the risk that if the programmer is |
|
|
43:07 | wrong, the code may not be correct. And this is more or less what |
|
|
43:14 | I just said in terms of these constructs. So again, just taken from the |
|
|
43:19 | textbook, I put this text on the slide; this gives a little bit more control to |
|
|
43:26 | manage things. We don't use it in the assignment that you will get; that is |
|
|
43:32 | just the very simplest thing, to get you to use the directives programming approach for |
|
|
43:40 | accelerators. I will stop in a few more slides, um, to leave time for |
|
|
43:51 | the demo of OpenACC using the GPU nodes on the |
|
|
44:00 | cluster. Yeah, but there are, um, sometimes when you do explicit management |
|
|
44:08 | of data, ways to make sure that you can again have |
|
|
44:19 | execution threads both on the host and the device at the same time, and |
|
|
44:28 | let the host do certain things that it is better at than the GPUs. In particular, |
|
|
44:33 | if there's really not much use of SIMD constructs or instructions, then it may |
|
|
44:40 | be as well to have the host do that work, while you then hand |
|
|
44:46 | other pieces of the code on to streaming-type processors. But then you may |
|
|
44:54 | also need to make sure, for certain variables or arrays, that they are in sync. |
|
|
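One way to request such synchronization explicitly in OpenACC is the `update` directive. The sketch below is an illustration under assumptions, not the lecture's example: the function name, the grid size and the idea of sampling `A[1][1]` on the host every few iterations are all made up for this purpose. `update self(A)` refreshes the host copy from the device; the opposite direction would use `update device(A)`. The caller is assumed to pass `check_every > 0`.

```c
enum { N = 32, M = 32 };              /* hypothetical grid size */
static double A[N][M], Anew[N][M];

/* Run a fixed number of Jacobi steps on the device and, every check_every
   steps, refresh the host copy of A so the host can inspect it.
   Returns the last value of A[1][1] seen by the host (0.0 if none). */
static double run_with_host_checks(int iters, int check_every)
{
    double last_seen = 0.0;

    #pragma acc data copy(A) create(Anew)
    {
        for (int it = 1; it <= iters; it++) {
            #pragma acc parallel loop
            for (int j = 1; j < N - 1; j++) {
                #pragma acc loop
                for (int i = 1; i < M - 1; i++)
                    Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                         A[j - 1][i] + A[j + 1][i]);
            }

            #pragma acc parallel loop
            for (int j = 1; j < N - 1; j++) {
                #pragma acc loop
                for (int i = 1; i < M - 1; i++)
                    A[j][i] = Anew[j][i];
            }

            if (it % check_every == 0) {
                /* Device -> host: make the host copy of A current. */
                #pragma acc update self(A)
                last_seen = A[1][1];   /* host-side work on fresh data */
            }
        }
    }
    return last_seen;
}
```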
45:00 | And then there are explicit ways of requesting that they stay in sync, by either copying things |
|
|
45:08 | from the host to the device or copying things from the device to the |
|
|
45:16 | host. And this is just an example that I'll let you look at yourselves, |
|
|
45:25 | because I want to leave time for the demo. Um, this |
|
|
45:32 | is in order to add more flexibility than is given by just the data construct for a |
|
|
45:45 | set of parallel regions. You can do it for kind of relatively arbitrary pieces |
|
|
45:52 | of code by using enter data and exit data statements, and the only requirement is that |
|
|
46:00 | these are matching pairs, so there always needs to be a matching exit data for |
|
|
46:07 | an enter data statement. But there is a degree of freedom in where to place |
|
|
46:17 | the two, um, directives. Here is kind of a couple of examples of where |
|
|
46:26 | you can place them, and here is a little bit more: that they |
|
|
46:30 | can even be in different functions, as long as the code execution path means |
|
|
46:40 | that the execution path encounters equal numbers of enter data and exit data directives. |
|
|
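A sketch of that pairing, with the two directives placed in different functions, could look like this. The function names and grid size are hypothetical. The compute function uses a `present` clause to state that the arrays are already on the device, which only holds if `device_data_begin` was called first and `device_data_end` has not yet been called.

```c
enum { N = 32, M = 32 };              /* hypothetical grid size */
static double A[N][M], Anew[N][M];

/* Start of the device lifetime: copy A in, allocate Anew. */
static void device_data_begin(void)
{
    #pragma acc enter data copyin(A) create(Anew)
}

/* Matching end of the device lifetime: copy A back, free both arrays. */
static void device_data_end(void)
{
    #pragma acc exit data copyout(A) delete(Anew)
}

/* One Jacobi step that relies on the arrays already being on the device. */
static void jacobi_step_present(void)
{
    #pragma acc parallel loop present(A, Anew)
    for (int j = 1; j < N - 1; j++) {
        #pragma acc loop
        for (int i = 1; i < M - 1; i++)
            Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                 A[j - 1][i] + A[j + 1][i]);
    }

    #pragma acc parallel loop present(A, Anew)
    for (int j = 1; j < N - 1; j++) {
        #pragma acc loop
        for (int i = 1; i < M - 1; i++)
            A[j][i] = Anew[j][i];
    }
}
```

A caller would invoke `device_data_begin()`, then any number of `jacobi_step_present()` calls, then `device_data_end()`, so that each enter data is matched by one exit data on the execution path.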
46:50 | Let me see. I will talk briefly about this, very quickly, and then hand |
|
|
46:56 | over. So, yeah, there are a couple of other things that can be |
|
|
47:03 | used to kind of get better performance. One is, |
|
|
47:13 | when it comes to loops, collapsing loops; the other one is what's known as |
|
|
47:18 | tiling loops. And it has to do with trying to understand the architecture of the |
|
|
47:26 | memory system, and that goes back to what I talked about way back on memory systems and |
|
|
47:33 | how the main memory, in terms of DRAM, is, um, organized, |
|
|
47:41 | or that there is structure and the memory is by no means random access. |
|
|
47:46 | It is highly structured. So here is simply what collapse does. There's nothing |
|
|
47:56 | magic. It just tells the compiler, and it basically will collapse, in this case, |
|
|
48:05 | two for loops into a single loop that has loop bounds that are the |
|
|
48:11 | product of those of the two original loops. |
|
|
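As a sketch of how the collapse clause might be applied to the first Jacobi loop nest (grid size and function name are assumptions, and this is not the slide's code): `collapse(2)` asks the compiler to treat the two tightly nested loops as one loop over all (N-2)*(M-2) interior points.

```c
enum { N = 32, M = 32 };              /* hypothetical grid size */
static double A[N][M], Anew[N][M];

/* One Jacobi step with both loop nests collapsed into single loops.
   Returns the largest update magnitude. */
static double jacobi_step_collapse(void)
{
    double error = 0.0;

    /* The j and i loops are merged into one iteration space. */
    #pragma acc parallel loop collapse(2) reduction(max:error)
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < M - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                 A[j - 1][i] + A[j + 1][i]);
            double diff = Anew[j][i] - A[j][i];
            if (diff < 0.0) diff = -diff;
            if (diff > error) error = diff;
        }

    #pragma acc parallel loop collapse(2)
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < M - 1; i++)
            A[j][i] = Anew[j][i];

    return error;
}
```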
48:19 | Here is how one can use it in the Jacobi example that we have talked about |
|
|
48:25 | so much already, by collapsing the two loops, um, in each of the |
|
|
48:31 | parallel regions. And, uh, in this case, collapsing the loops didn't buy |
|
|
48:38 | very much, but it bought a bit. And the other thing you can |
|
|
48:44 | do, to make sure you know about it, is this tile construct, where basically one |
|
|
48:50 | considers iterations in the two loops together and treats them as tiles: you basically do a few |
|
|
49:01 | iterations in each of the two loops. And then, um, you |
|
|
49:10 | basically partition each iteration space into tiles. So you get basically a loop nest that is, in |
|
|
49:18 | that case, four loops on this slide. You get two steps in each |
|
|
49:23 | of the two loops, and then you step through both of them |
|
|
49:27 | in the inner loops. In fact, it generates a loop nest with four loops. |
|
|
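The four-loop picture can be sketched as follows, assuming 4-by-4 tiles and a grid whose size is chosen only for illustration. The first function uses the `tile` clause; the second writes out by hand a loop nest of the kind the lecture describes (two tile loops outside, two element loops inside). The hand-written form is only a mental model of the transformation, not the code any particular compiler actually generates.

```c
enum { N = 34, M = 34, TILE = 4 };    /* hypothetical sizes; interior is 32 x 32 */
static double A[N][M], Anew[N][M], Aref[N][M];

/* Directive version: request TILE x TILE tiles over the two loops. */
static void sweep_tile_directive(void)
{
    #pragma acc parallel loop tile(TILE, TILE)
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < M - 1; i++)
            Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                 A[j - 1][i] + A[j + 1][i]);
}

/* Hand-written picture of the same idea: four loops instead of two. */
static void sweep_tile_by_hand(void)
{
    for (int jj = 1; jj < N - 1; jj += TILE)          /* tile loop over rows    */
        for (int ii = 1; ii < M - 1; ii += TILE)      /* tile loop over columns */
            for (int j = jj; j < jj + TILE && j < N - 1; j++)      /* within tile */
                for (int i = ii; i < ii + TILE && i < M - 1; i++)  /* within tile */
                    Aref[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                         A[j - 1][i] + A[j + 1][i]);
}
```

Both functions visit every interior point exactly once, so they produce the same values; only the order of the visits differs.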
49:33 | And here is just another case, and here is again how one can use |
|
|
49:38 | it in terms of this Jacobi example, in that case using tiles with 32 as one dimension, |
|
|
49:44 | as, again, tiling can be beneficial when writing code and trying to optimize |
|
|
49:53 | with respect to the memory system. And here are the results for some different tilings, |
|
|
50:01 | from four-by-fours up to 32-by-32s. And sometimes there was an |
|
|
50:07 | improvement, and sometimes there was a loss. Uh, I think, |
|
|
50:14 | so I guess in this case, with proper tiling, the best case was |
|
|
50:21 | up to about 10% above collapsing loops, by using the tile primitive. And I think that's a |
|
|
50:29 | stopping point, I think. Just very quickly: you can also, |
|
|
50:37 | um, decide to try to manage how many, um, gangs are being |
|
|
50:49 | used, and how many workers in each of the gangs, and the vector length |
|
|
50:56 | in terms of the vectorization. So the gangs, you remember, get mapped |
|
|
51:03 | to the streaming multiprocessors. Each one of these has, um, |
|
|
51:15 | uh, cores in it, and then there is |
|
|
51:24 | what, um, allows each worker to use several of the cores. So |
|
|
51:36 | again it reflects the kind of architecture of the hardware, in terms of the |
|
|
51:42 | cores being grouped into SMs, and SMs grouped into the graphics processing clusters, I |
|
|
51:52 | think, using the NVIDIA nomenclature. And this is just explicitly |
|
|
51:58 | telling what type of parallelization one wants for the loops, and this is |
|
|
52:04 | then how many of them one wants of each one, and this is how it |
|
|
52:09 | could be used in the Jacobi example. And here, in this case, trying to do |
|
|
52:15 | explicit management didn't actually help; the compiler did a better job. |
|
|
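A sketch of such explicit control is given below. The numbers of gangs and the vector length are arbitrary illustrative values, not the ones from the slide, and the worker level is left out for brevity. `gang` and `vector` say which kind of parallelism each loop gets, while `num_gangs` and `vector_length` say how much of each to use.

```c
enum { N = 32, M = 32 };                  /* hypothetical grid size */
enum { NUM_GANGS = 30, VEC_LEN = 32 };    /* hypothetical tuning values */
static double A[N][M], Anew[N][M];

/* One Jacobi step with the gang count and vector length set by hand.
   Returns the largest update magnitude. */
static double jacobi_step_tuned(void)
{
    double error = 0.0;

    #pragma acc parallel loop gang num_gangs(NUM_GANGS) vector_length(VEC_LEN) reduction(max:error)
    for (int j = 1; j < N - 1; j++) {
        #pragma acc loop vector reduction(max:error)
        for (int i = 1; i < M - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1] +
                                 A[j - 1][i] + A[j + 1][i]);
            double diff = Anew[j][i] - A[j][i];
            if (diff < 0.0) diff = -diff;
            if (diff > error) error = diff;
        }
    }

    #pragma acc parallel loop gang num_gangs(NUM_GANGS) vector_length(VEC_LEN)
    for (int j = 1; j < N - 1; j++) {
        #pragma acc loop vector
        for (int i = 1; i < M - 1; i++)
            A[j][i] = Anew[j][i];
    }

    return error;
}
```

As the lecture notes, hand-picked values are not guaranteed to beat the compiler's own choices, so it is worth timing both.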
52:20 | And with that I will pass it over for the demo. There are a few more slides |
|
|
52:28 | in the slide decks that talk a bit more explicitly about the energy for |
|
|
52:35 | . And, if there's time left, I'll touch on it; otherwise I'll leave it to |
|
|
52:39 | you to look at it. And with that, I'll hand it over. |
|
|
52:47 | Let me start sharing my screen. So, is my screen visible? |
|
|
53:03 | Okay, awesome. Uh, so, as you may have, uh, seen, |
|
|
53:11 | and as I mentioned earlier, on Stampede2 the nodes that we have access to |
|
|
53:16 | do not have a GPU. So in this case we'll be using the GPU |
|
|
53:20 | nodes on the Bridges cluster, and this time we'll be using the same |
|
|
53:26 | example that we basically saw in the slides just now. And interestingly enough, there are |
|
|
53:34 | a few examples that may not concur with the results that we saw in the slides, |
|
|
53:38 | so just keep an eye out. Well, let's get started. |
|
|
53:45 | So, since we've been using Stampede until now, here's a quick reminder of how |
|
|
53:50 | you can connect to the Bridges cluster. You can use the Bridges login host |
|
|
53:56 | through XSEDE. And on the Bridges cluster, if you want to |
|
|
54:03 | get access to the GPU nodes in the interactive mode, this is |
|
|
54:08 | the command that you will use. Um, N is the number of nodes |
|
|
54:13 | you want to get access to. Since we want access to GPUs, |
|
|
54:18 | um, you have to provide this flag as well, and with this flag |
|
|
54:23 | you provide, uh, the type of the GPU nodes that you want |
|
|
54:28 | access to. So in the case of Bridges, we have two types of |
|
|
54:32 | GPU nodes: one that contains the P100s, which are relatively the newer ones, |
|
|
54:38 | and the other nodes contain K80s, if I remember correctly, which are the slightly |
|
|
54:45 | older GPUs from NVIDIA. And then the second parameter here is the |
|
|
54:50 | number of GPUs that you want to access; in this case we'll be just accessing, |
|
|
54:55 | uh, one GPU. And then this is the time that you want it to be |
|
|
55:00 | allocated for. So, as you can see, I've already run that command and I'm |
|
|
55:05 | on the GPU node. You can confirm that by looking at the |
|
|
55:09 | prompt; it should change from the login node. So, the first thing before you start working |
|
|
55:16 | with OpenACC codes, you need to make sure you have a couple |
|
|
55:21 | of modules loaded. Yeah, module list. In this case, it |
|
|
55:31 | will be one of the PGI compilers, and we also need the CUDA runtime, |
|
|
55:37 | since we're using the NVIDIA GPUs. Make sure you have those two modules. |
|
|
55:44 | This is the module for the PGI C compiler that |
|
|
55:48 | is used to compile the OpenACC code. Now, this PGI module, it also |
|
|
55:54 | has some, uh, some useful tools, some of which you can see. |
|
|
55:59 | There's pgcc, which is the PGI compiler for C, and there's pgc++, |
|
|
56:04 | which is the compiler for C++ code. Uh, |
|
|
56:10 | what we're interested in right now is this one, that's pgaccelinfo. |
|
|
56:18 | If you run that, you can get information about the whole, uh, |
|
|
56:23 | GPU that's available on this particular node. You can see |
|
|
56:28 | all sorts of specifications. This is the version of the compute capability, as NVIDIA |
|
|
56:36 | likes to call it, 6.0. You can also see the |
|
|
56:41 | . You can also see if this device supports managed memory. |
|
|
56:48 | There are other things that I will not go into. The one that's most important is this one, the |
|
|
56:53 | PGI default target. So we will need, uh, |
|
|
56:57 | this flag to tell the PGI compiler that we want to use, |
|
|
57:03 | uh, this tesla:cc60, uh, on the GPU when compiling |
|
|
57:09 | codes. In this case, the cc60 stands for the compute capability. |
|
|
57:14 | this point which was just here on ground? Original Hendrick Benign Strip. |
|
|
57:23 | So, the first example that I will start with is, again, the same |
|
|
57:29 | Jacobi code, and we will start with using the pragma acc kernels. |
|
|
57:36 | So when we compile and run this program, for compiling and running both, we |
|
|
57:43 | will see one problem each with just the acc kernels, and we'll, step |
|
|
57:50 | by step, try going through the code as we go. Now, here |
|
|
57:56 | I have just, uh, put the pragma acc kernels around our main compute |
|
|
58:03 | block. When you want to compile the code, you can simply use the |
|
|
58:08 | PGI compiler, like pgcc. You also need to provide the flag |
|
|
58:12 | -acc, which tells it that you're using OpenACC. There's another |
|
|
58:19 | flag, which is -fast, which is similar to the optimization levels that we've seen |
|
|
58:25 | with the Intel compilers, like O1, O2, O3. In this case, |
|
|
58:29 | very likely, uh, in most cases it's the O3 optimization level. Look, |
|
|
58:38 | we also need to tell it that we're targeting that particular, uh, compute |
|
|
58:46 | capability. So that's cc60. Then there's another flag, -Minfo, |
|
|
58:54 | and with that we provide accel as the parameter so that we can get details about |
|
|
59:01 | what the compiler did in terms of the code for our accelerator. So |
|
|
59:07 | accel stands for the accelerator here. Then you can provide the name of the source |
|
|
59:15 | file. So in this case, that's the Jacobi source file. Mm, and just |
|
|
59:25 | okay, thank you. And then the output, our executable. Now |
|
|
59:36 | when we compile that, as I said, we'll see one issue when we compile |
|
|
59:42 | the code. And as you can see in this section here, the compiler, for |
|
|
59:49 | some reason, thought that our compute loop contains a dependency. And that's why it did not |
|
|
59:58 | perform any kind of parallelization for that loop. However, we |
|
|
60:04 | , obviously, know that there is no dependency, since we're reading A and not |
|
|
60:10 | writing it, even though these indices, i minus one, j minus one, make |
|
|
60:15 | the compiler think that there's some sort of dependency across the loop. And so in this case |
|
|
60:23 | it did not, uh, perform any kind of parallelization on that |
|
|
60:28 | loop. So that's the first problem we notice. However, it did |
|
|
60:34 | generate an executable, so let's go ahead and run it. When we do |
|
|
60:38 | that, we'll see that there is another problem here: the CUDA runtime |
|
|
60:49 | had some trouble trying to, uh, access the array buffers |
|
|
60:57 | that we, uh, defined in our code, and gave an illegal address error |
|
|
61:05 | during the CUDA memory copy from device to host that was being performed, and in |
|
|
61:09 | this case, as well as when it tried to free the memory on the |
|
|
61:16 | host. So let's first make sure we remove our first issue, that |
|
|
61:24 | is, uh, making sure that this loop does not have any trouble. But let's |
|
|
61:31 | start with removing this memory issue so that we can execute the kernel code. So the |
|
|
61:37 | easiest way you can do that is by providing the managed parameter |
|
|
61:48 | to the -ta flag. Right. Uh, when we do that, we still have this |
|
|
62:00 | , uh, dependence message from the compiler. But now we do not have, uh, |
|
|
62:08 | the memory access issue. So we solved one problem. Now we need |
|
|
62:13 | to solve the second problem: we need to make sure that our loop gets parallelized, |
|
|
62:18 | since we already know that there is no dependency in it. So what we can |
|
|
62:22 | do is, rather than using this acc kernels directive, we can |
|
|
62:29 | use the pragma acc parallel loop directive. And as we just saw in the |
|
|
62:33 | slides, the difference between the kernels and parallel directives is that the kernels directive pretty much |
|
|
62:39 | leaves it to the compiler to decide what to parallelize and what not. But in this |
|
|
62:43 | case we know that there is no dependency. So we can explicitly tell the compiler |
|
|
62:48 | that we need to parallelize these particular loops. Yeah, so we can do |
|
|
62:57 | that. We can compile this code, and let's see what happens if we just remove |
|
|
63:13 | , uh, managed again. Mhm. Okay. |
|
|
63:50 | Oh, right. So as we just saw, in the previous case, |
|
|
63:57 | at runtime we got the error of memory access. This time, even the compiler |
|
|
64:05 | tells us that this is not going to work; you need to do something |
|
|
64:10 | about the data management strategy. And when we provide the managed memory |
|
|
64:19 | parameter, obviously the code was compiled, and that particular loop was parallelized by |
|
|
64:28 | the compiler. And when we run it, yeah, we can already see that there is |
|
|
64:43 | a speedup between the kernels directive and the parallel loop directive. So |
|
|
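The step just described, switching from `kernels` to an explicit `parallel loop` while relying on managed memory, can be sketched like this. It is a hedged illustration, not the lecture's source file: the function name `jacobi_step`, the flat array layout, and the file names in the compile command are assumptions. The compiler flags themselves are the ones named in the walkthrough.

```c
#include <assert.h>

/* Assumed build line (hypothetical file names):
 *   pgcc -acc -fast -ta=tesla:cc60,managed -Minfo=accel jacobi.c -o jacobi
 * No data clauses appear below, so on a GPU this version relies on the
 * "managed" option to move A and Anew between host and device. */
double jacobi_step(double *restrict A, double *restrict Anew, int n, int m)
{
    assert(n >= 3 && m >= 3);
    double err = 0.0;

    /* We assert to the compiler that the iterations are independent. */
    #pragma acc parallel loop reduction(max:err)
    for (int j = 1; j < n - 1; j++) {
        #pragma acc loop reduction(max:err)
        for (int i = 1; i < m - 1; i++) {
            double v = 0.25 * (A[j*m + i + 1] + A[j*m + i - 1]
                             + A[(j - 1)*m + i] + A[(j + 1)*m + i]);
            double d = v - A[j*m + i];
            if (d < 0.0) d = -d;
            if (d > err) err = d;
            Anew[j*m + i] = v;
        }
    }

    /* Second loop: copy the new interior values back into A. */
    #pragma acc parallel loop
    for (int j = 1; j < n - 1; j++) {
        #pragma acc loop
        for (int i = 1; i < m - 1; i++) {
            A[j*m + i] = Anew[j*m + i];
        }
    }
    return err;
}
```

With `parallel loop` the programmer, not the compiler, takes responsibility for the loops being free of dependencies.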
64:52 | is there any question up to now? We solved two problems, mainly: |
|
|
64:57 | we made sure that our loops get parallelized using the parallel loop, uh, construct, |
|
|
65:04 | and we made sure there's no issue with memory accesses by, at least for now, |
|
|
65:10 | using the managed memory. Uh, any questions up to that? Okay, |
|
|
65:26 | . Then, we can already see that we're getting a good, uh, speed |
|
|
65:31 | up. But still, we're using this managed memory parameter. So |
|
|
65:40 | we need to do something about that and have a good data management strategy. |
|
|
65:45 | So in the next one, for example, we can try to use this |
|
|
65:51 | acc data construct and specify what we need to do with our data. |
|
|
65:57 | So, as you can see, for A we're using the copy clause, |
|
|
66:03 | which, uh, corresponds to allocating the memory, |
|
|
66:09 | copying, uh, the buffers from the host memory to the device |
|
|
66:17 | memory, and then copying, uh, the output from device memory back to the |
|
|
66:24 | host. And then copyin is one step that's comparable, comparable to |
|
|
66:31 | the firstprivate clause of OpenMP: it allocates the memory on the device and |
|
|
66:35 | copies, uh, that data from the host. That was a |
|
|
66:43 | comment, but for that, you should have been able to use create. |
|
|
66:49 | That's right. Right, right. So we can... Okay, maybe |
|
|
66:52 | it's coming. Uh, well, it's coming, but that's right. We, |
|
|
66:57 | we can use create. I just... Yeah, that was in your |
|
|
67:02 | slides, but yes, we can also use create for that, since we do not |
|
|
67:05 | need the initial values. Uh, it reduces the amount of traffic. Correct. |
|
|
67:13 | Uh, but yes. Uh, . So now for at least for |
|
|
67:21 | this example, we'll keep... Yes, we can do that. Uh huh. Yes. |
|
|
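The data construct discussed above, including the suggestion to use `create` for the scratch array, can be sketched as below. This is a minimal sketch under stated assumptions: the solver name `jacobi_solve`, its parameters, and the flat array layout are hypothetical and not taken from the lecture's file.

```c
#include <assert.h>

/* Iterate Jacobi sweeps until the largest change drops to tol or below,
 * or max_iter sweeps have run. Returns the number of sweeps performed.
 * copy(A):      allocate on device, copy in at entry, copy out at exit.
 * create(Anew): allocate on device only; its initial values are never
 *               read, so no host-to-device traffic is needed for it. */
int jacobi_solve(double *A, double *Anew, int n, int m,
                 double tol, int max_iter)
{
    assert(n >= 3 && m >= 3);
    int iter = 0;
    double err = tol + 1.0;

    #pragma acc data copy(A[0:n*m]) create(Anew[0:n*m])
    while (err > tol && iter < max_iter) {
        err = 0.0;
        #pragma acc parallel loop collapse(2) reduction(max:err) \
            present(A[0:n*m], Anew[0:n*m])
        for (int j = 1; j < n - 1; j++) {
            for (int i = 1; i < m - 1; i++) {
                double v = 0.25 * (A[j*m + i + 1] + A[j*m + i - 1]
                                 + A[(j - 1)*m + i] + A[(j + 1)*m + i]);
                double d = v - A[j*m + i];
                if (d < 0.0) d = -d;
                if (d > err) err = d;
                Anew[j*m + i] = v;
            }
        }
        #pragma acc parallel loop collapse(2) present(A[0:n*m], Anew[0:n*m])
        for (int j = 1; j < n - 1; j++) {
            for (int i = 1; i < m - 1; i++) {
                A[j*m + i] = Anew[j*m + i];
            }
        }
        iter++;
    }
    return iter;
}
```

Placing the data region around the whole iteration loop means the arrays cross the PCI Express bus once in each direction, rather than once per sweep.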
67:31 | Also notice that now we will not provide the managed memory option. And the reason |
|
|
67:39 | is the OpenACC standard. It says if you, uh, define your |
|
|
67:45 | own data management strategy using these pragma acc data clauses, then you should not provide the |
|
|
67:52 | managed memory option. If you do, what the compiler is going to do |
|
|
67:56 | is disregard all the data management that you may have applied in |
|
|
68:00 | your code and just use the managed memory for, uh, managing the data |
|
|
68:09 | movement. So make sure, whenever you apply your own data management, don't add the managed |
|
|
68:15 | flag. Thank you. The way you set up this example, the copy that we're |
|
|
68:26 | doing is redundant because we didn't need to copy any memory, right? We only |
|
|
68:33 | need to allocate. Yes, we just... Yeah, so we can, we can |
|
|
68:37 | replace this copyin by create. Okay. Um, but then we're, like, |
|
|
68:44 | making it better one step at a time, right? Well, at least |
|
|
68:49 | in this case, it's not going to get better. Spoiler alert. But let's |
|
|
68:56 | see. Let me, let me just compile it. Yes, as you can see, |
|
|
69:09 | it generated a copy for A and a copyin for Anew, and it also parallelized |
|
|
69:14 | all the loops that we wanted it to. And when we run it, it |
|
|
69:25 | does not perform as well as the previous one with managed memory. However, if you |
|
|
69:34 | compare these two compilation outputs, uh, and if you count |
|
|
69:40 | the number of data movements, like copyin and copyout, that's, um, exactly |
|
|
69:47 | the same for these two. So there's one copyin here and one copyout, |
|
|
69:51 | and one copyin here. So two copyins and one copyout. |
|
|
69:56 | And here it's the same: one copyin, one copyin here, and |
|
|
70:00 | one copyout. So that's again, uh, two copyins and one copyout. |
|
|
70:04 | And these are redundant, since these are only executed if the data is not |
|
|
70:09 | present on the device, whereas it is present here. Uh, the main |
|
|
70:16 | point is, uh, in many cases it could happen that the managed |
|
|
70:23 | memory, since it's managed by the CUDA runtime, the CUDA runtime |
|
|
70:26 | may actually do a better job at moving data between the device and the host. |
|
|
70:31 | And the reason is, when you use managed memory, the CUDA runtime decides when |
|
|
70:38 | a block of data is needed on the device, and it moves the data at |
|
|
70:43 | that particular instant when it's actually needed, or not at all. It can also apply some |
|
|
70:49 | optimizations in doing so. And so it's not necessarily, uh, true that |
|
|
70:56 | you will always get better results with your own data constructs. That's the only reason |
|
|
71:02 | I tried to keep that particular example. Is there any question? Okay, |
|
|
71:14 | if not, let's move to the next one. So here is what we just saw in |
|
|
71:21 | one of the last slides: that we can also use these pragma acc |
|
|
71:28 | enter data and exit data clauses. What it mainly does is, if you have a |
|
|
71:35 | lot of code blocks, or your code is modularized, you don't want to put |
|
|
71:41 | data management clauses across all the blocks or across all the modules. What |
|
|
71:46 | you can simply do is define a section where your data enters the |
|
|
71:54 | device, generally speaking. And then for that particular enter data clause, you also |
|
|
72:01 | need to have an exit data clause, and between that enter and exit section |
|
|
72:08 | you can keep telling your compiler, uh, that, hey, these buffers |
|
|
72:15 | are already present in the device memory, using the present clause. So that way |
|
|
72:21 | the compiler, or the runtime, does not have to worry about data |
|
|
72:25 | being present in the device memory at this point in the execution. So this |
|
|
72:31 | is a very simple example, where we have pretty much everything in just one, |
|
|
72:38 | uh, function. A good example would be something like this. So if you |
|
|
72:42 | have a function that performs initialization, another that does the actual computations, another |
|
|
72:51 | function that does that second loop, which does the swapping of values of A and Anew, and |
|
|
72:57 | at the end, another function that deallocates the memory. So in this |
|
|
73:01 | case you can put an enter data clause in the initialization function, and the driver, |
|
|
73:08 | like the main function, would call these functions, and in those |
|
|
73:14 | functions you can use these present clauses, telling the runtime and the compiler: |
|
|
73:19 | don't worry about the data being present or not; it's going to be already |
|
|
73:23 | present, since we used this, uh, enter data construct. But it is |
|
|
73:29 | important to know that for every enter data clause, you need to have an exit |
|
|
73:35 | data clause; otherwise you get an error from the compiler or at runtime. Um, |
|
|
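The modular pattern described above, with `enter data` in an initialization function, `present` in the compute functions, and `exit data` in the cleanup function, can be sketched as follows. All function names here (`field_create`, `field_add`, `field_sync_host`, `field_destroy`) are hypothetical, and the computation is deliberately trivial; this is an illustration of the directive placement, not the lecture's Jacobi code.

```c
#include <assert.h>
#include <stdlib.h>

/* Initialization: allocate on the host, then start the device lifetime. */
double *field_create(int n)
{
    assert(n > 0);
    double *x = malloc((size_t)n * sizeof *x);
    assert(x != NULL);
    for (int i = 0; i < n; i++) x[i] = 0.0;
    #pragma acc enter data copyin(x[0:n])
    return x;
}

/* Compute: the data is guaranteed present, so no transfer happens here. */
void field_add(double *x, int n, double v)
{
    #pragma acc parallel loop present(x[0:n])
    for (int i = 0; i < n; i++) x[i] += v;
}

/* Refresh the host copy when the driver needs to inspect results. */
void field_sync_host(double *x, int n)
{
    (void)x; (void)n;   /* only referenced by the pragma in a serial build */
    #pragma acc update self(x[0:n])
}

/* Cleanup: end the device lifetime that field_create started. */
void field_destroy(double *x, int n)
{
    (void)n;
    #pragma acc exit data delete(x[0:n])
    free(x);
}
```

A driver would call `field_create`, any number of `field_add` calls, `field_sync_host` before reading values on the host, and finally `field_destroy`, so each `enter data` is paired with exactly one `exit data`.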
73:47 | the compilation would be exactly what we had. Let's just go ahead and run this, |
|
|
73:52 | and you can see it's still, uh, slightly better than what we had |
|
|
74:00 | before. Quite a little bit. Are there any questions on |
|
|
74:12 | that? Uh, if not, then the last example. I have a question about the present |
|
|
74:20 | clause. Sorry, say that again? You said the present clause was there |
|
|
74:27 | to tell... Um, well, I'm not sure if I would call it a hint |
|
|
74:31 | to the compiler. Um, but it's letting it know that we don't need |
|
|
74:37 | . Like, it says it's already there. Yes. You're pretty much telling |
|
|
74:42 | it: don't worry about data being present or not. We are guaranteeing that, |
|
|
74:49 | uh, these buffers will be present in device memory at this particular point in |
|
|
74:53 | the execution. Does that make sense? Yeah. Um, thanks. So |
|
|
75:08 | here is a question for all of you. So this is a simple matrix |
|
|
75:15 | multiplication program, and these are the three loops that you may have become familiar |
|
|
75:19 | with by now, after your assignments. So the question here is, there are two questions. |
|
|
75:28 | First, do you think that using these loop constructs will give us speedup? |
|
|
75:35 | And the second question is, will you get a correct result or not? |
|
|
75:57 | Anyone? Okay. Let's just run this and see what happens. So this |
|
|
76:17 | will compile without an issue. But what's going to happen is... this was |
|
|
76:22 | the execution time. However, it failed: one of the elements, or there could |
|
|
76:28 | be many elements, that were expected to have a certain value, but in the result |
|
|
76:35 | that was computed they had some other value, because there was something wrong with |
|
|
76:39 | our code. The one thing wrong with our code is that C[i][j] was |
|
|
76:43 | being accessed by multiple threads, and we tried to parallelize the innermost |
|
|
76:52 | loop, which caused a race condition for this particular element across the |
|
|
76:59 | threads, and which may have resulted in incorrect assignment of the values to that |
|
|
77:08 | element. And as we have seen, threads do not synchronize unless explicitly being |
|
|
77:16 | told to. So the simplest solution to this particular problem is by using a |
|
|
77:23 | reduction, about which we also saw in the case of OpenMP, and using the reduction |
|
|
77:29 | clause. So we ask for plus as the operator to be applied on it, and, |
|
|
77:35 | if you remember, the variable you pass to the reduction clause goes in as |
|
|
77:41 | private. And so if you compile and run this code, we will see |
|
|
77:48 | that not only, uh, your code finished in less time, but it |
|
|
77:53 | also passed. So the main motive for this example was that, even though you |
|
|
77:59 | may have applied all the parallelization strategies so far and it is running |
|
|
78:04 | fast, you should also make sure that you're getting a correct result out of |
|
|
78:09 | your code. So don't just get complacent when you get your code running at a lower |
|
|
78:15 | execution time; you should also make sure you're getting correct results. That will |
|
|
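The reduction fix for the matrix multiplication can be sketched like this. It is an illustration under stated assumptions: the function name `matmul`, the flat row-major arrays, and the variable name `sum` are hypothetical (the lecture's code indexes `C[i][j]` directly), but the idea is the one described: accumulate into a private scalar with `reduction(+:sum)` instead of letting many threads update the same output element.

```c
#include <assert.h>

/* c = a * b for n x n row-major matrices.
 * Each (i, j) pair owns its own sum; the k loop is reduced with "+",
 * so no two threads ever write the same element of c concurrently. */
void matmul(const float *a, const float *b, float *c, int n)
{
    assert(n > 0);
    #pragma acc parallel loop collapse(2) \
        copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;   /* private to this (i, j) iteration */
            #pragma acc loop reduction(+:sum)
            for (int k = 0; k < n; k++) {
                sum += a[i*n + k] * b[k*n + j];
            }
            c[i*n + j] = sum;   /* single write per output element */
        }
    }
}
```

Checking the output against a known product, even a tiny one, is a cheap way to catch the kind of race the unreduced version had.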
78:24 | be pretty much it. Uh, do you mind if I ask you a |
|
|
78:30 | question? So I understand why it produced incorrect results, right? Because there was |
|
|
78:40 | a race condition in the innermost loop on C at i, j. Um, |
|
|
78:45 | what I don't understand is why there was a speedup on the last one. |
|
|
78:51 | Well, the main speedup is because, I believe, and someone may be able to |
|
|
78:57 | explain it in a much better way, but there is a term called cache |
|
|
79:02 | thrashing, or just thrashing, I would say, when multiple threads are trying to access a |
|
|
79:10 | single element. And when that happens, the data element may be accessed by |
|
|
79:20 | one thread. However, before that particular thread tried to update it, some other thread |
|
|
79:26 | updated that particular element. And then this thread has to update its own |
|
|
79:31 | copy. And so, in this process, there's a bunch of memory traffic going |
|
|
79:38 | between the caches and the main memory. So I don't know if |
|
|
79:42 | I'm doing a good job explaining it; Dr. Johnson may have |
|
|
79:45 | a better explanation. No, that totally makes sense. It didn't occur to me |
|
|
79:51 | that with the race condition, there's a possibility of thrashing. But |
|
|
79:58 | if only one can access that at a time, then it would make much more |
|
|
80:01 | sense that there was no thrashing, right, which is what we do by introducing |
|
|
80:07 | the private variable. Very good. Okay, that makes sense. Thank |
|
|
80:13 | you. If there are no more questions... Okay? Yeah. Yeah. So |
|
|
80:32 | I guess that at the high level many things are not that different between OpenACC and |
|
|
80:40 | OpenMP in terms of how to write code, and a lot of it |
|
|
80:44 | has to deal with managing data, or memory. I've been trying to |
|
|
80:54 | emphasize that throughout this course so far. That's why it's important to, ah, be |
|
|
81:02 | quite knowledgeable about the architectures in order to get decent performance |
|
|
81:09 | and try to figure out how to help compilers and other tools manage, or express more |
|
|
81:18 | clearly, the intent. I'll end there. Okay. Yeah. This, |
|
|
81:36 | I'll stop the recording at this point. In fact, some questions for |
|