00:21 Wait one more minute. Yeah. Is this, is this visible to

01:09 you? Should I make the font larger? Yeah. Is that okay?

01:17 A little louder, I think. Yeah. Let me make

01:22 it a little bit larger. This should be okay, I think.

01:36 Okay. Also, sometimes when you have the projector, the light,

01:49 catching the light, I guess, was just throwing them off.

01:53 Okay. I think that's okay. I'll start with the demo. So,

02:06 right now I'm showing this demo on the Data Science Institute's

02:15 cluster on the UH campus, the Carya cluster. I'm using a GPU node on that

02:22 Carya cluster for now, and you can easily get access to that node

02:32 with a simple interact command, which is an srun command from Slurm, passing

02:41 some specific parameters here. I'm sure everyone is now aware of how to use

02:48 Slurm at this point. So that's okay. Now, in order to

03:03 do any OpenACC programming on these nodes: the PGI compilers, they come

03:09 as a part of the NVIDIA HPC SDK.

03:18 So here I have already loaded the NVHPC module using the module load command.

03:25 And so, once you do, just type pg and if you hit

03:30 tab a couple of times it will show you pretty much all the PGI

03:34 commands that are available. So there's this one command called pgaccelinfo:

03:41 it gives you information about the accelerators that are available on the node;

03:47 pgc++ is the C++ compiler for OpenACC;

03:53 pgcc is the C compiler. And in case you use Fortran,

03:57 there are Fortran compilers as well that you can use. First I'll

04:03 show this output of pgaccelinfo, and this command basically tells you

04:11 the information about the accelerator, as I said. So we have the Tesla

04:16 V100, which is the Volta generation of NVIDIA GPUs, on this

04:21 node. And there's only one GPU on this node; so that's why

04:25 there's a device number zero and nothing else. In case you have multiple GPUs

04:29 on a node, you'll see as many such outputs in the output of

04:33 pgaccelinfo, and you'll see device numbers from zero to whatever number of

04:40 GPUs you have. On Bridges-2 you have eight GPUs on each node,

04:43 so you'd see eight such outputs when you run pgaccelinfo; oh sorry,

04:49 device numbers 0 to 7, and 8 outputs. And then here you can see all the

04:55 information about the GPU: you can see the clock rate, you can

05:01 see the memory clock rate, the memory bus width. So, as you can see,

05:04 memory buses are wider on GPUs compared to the CPUs; you usually have 64-bit

05:11 buses on CPUs, whereas for the GPU here you have the wider, wider

05:18 buses. Um, then you have all these sort of NVIDIA conventions of the

05:28 compute hierarchy and what maps to what, where you have threads as the

05:34 smallest entity; going up, one layer above is the thread blocks.

05:42 And then the topmost level in the compute hierarchy is the grids. So,

05:48 um, the one important flag to make a note of is this one, which

05:55 you will use while compiling the code. So this cc

06:00 70 means the compute capability seven dot zero; that's just the compute-capability

06:07 version of this particular GPU. For older GPUs you might see cc35 or so

06:12 on. Yeah, on Bridges, not Bridges-2, you'd see cc35 for the K80

06:16 GPUs that they have on there. So that's just the basic information you can get

06:23 about the GPU here, yep. [A student asks whether, from this output,

06:33 one can conclude that the GPU has the high-bandwidth memory.]

06:39 Yeah, I think if you go to the

06:46 NVIDIA website, it says that these use the HBM2 memory, if

06:51 I remember correctly. Right? Uh, yeah, the first example: I think you

07:01 remember it from the, uh, last lecture slides. So first, uh, this is

07:12 again simply the Jacobi example. The construct that we saw first was the

07:19 parallel loop construct here, which is used to explicitly parallelize these for

07:25 loops. So, to do that, we simply have the pragma acc parallel loop, which

07:30 is similar to the OpenMP constructs, and then have a reduction, a

07:36 max reduction, for the error variable. We also add a pragma acc loop

07:44 to parallelize the inner loop, and the same thing we can do for

07:49 the second set of nested loops.
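(For reference, a minimal sketch in C of the loop structure being described; the names, the flattened row-major indexing, and the function signature are illustrative, not the exact demo source. A file like this would be compiled with the flags discussed next, e.g. pgcc -acc -fast -ta=multicore -Minfo=all.)

```c
#include <math.h>

/* One Jacobi sweep over an n x m grid; returns the max change.
   A, Anew, n, m are assumed names, not necessarily those in the demo. */
double jacobi_sweep(int n, int m, double *restrict A, double *restrict Anew)
{
    double error = 0.0;

    /* First pair of nested loops, with a max reduction on error. */
    #pragma acc parallel loop reduction(max:error)
    for (int j = 1; j < n - 1; j++) {
        #pragma acc loop
        for (int i = 1; i < m - 1; i++) {
            Anew[j*m + i] = 0.25 * (A[j*m + i+1] + A[j*m + i-1]
                                  + A[(j-1)*m + i] + A[(j+1)*m + i]);
            error = fmax(error, fabs(Anew[j*m + i] - A[j*m + i]));
        }
    }

    /* Second pair of nested loops: the swap, a pure memory copy. */
    #pragma acc parallel loop
    for (int j = 1; j < n - 1; j++) {
        #pragma acc loop
        for (int i = 1; i < m - 1; i++) {
            A[j*m + i] = Anew[j*m + i];
        }
    }
    return error;
}
```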

08:01 Now, to compile these codes: these are C programs, so we'll use the

08:06 pgcc C compiler, and you need to provide the flag -acc, which stands for enabling

08:12 OpenACC compilation. You also provide the -fast flag, which is

08:21 an optimization flag, quite similar to the -O3 optimization that you've been using

08:26 in gcc and icc compilers. First we'll compile it for the CPU,

08:34 and to do that you use the -ta flag, which is the

08:39 target accelerator, and its value should be multicore. And when you make

08:45 it multicore, that means you're compiling for the CPU. The other flag that

08:51 is also helpful is called -Minfo (the M is capital here), and

08:58 either you can do -Minfo=all, and it will give you information about what the compiler

09:02 did for the CPU as well as the accelerator, or, if you're only interested in

09:06 getting accelerator information, what the compiler did for the accelerator part of

09:11 the code, you can give it -Minfo=accel as well. Here I'll just

09:15 do all, and then you can see... I think it's covering up the screen

09:27 here. So: your source code name (and I'll try to be consistent with the

09:37 file names that I've been giving), and then your output file name.

09:44 And when you do this, the compiler tells you exactly what it

09:49 did for each of the sections. So, uh, these, lines 37, 40 and 43,

09:56 are the first set of two nested for loops. So you can see

10:02 it generated multicore code; there is no mention of Tesla or GPU code

10:08 or anything. It also generated... similar messages are given

10:18 for the second set of loops, for the outer for loop.

10:23 And since, in the second set of for loops, the inner loop is

10:28 really just the swap, where A and Anew are just memory copies, it

10:32 also generated a memory copy operation for the inner

10:39 loop. Now, on multicore there's one extra thing that you can

10:45 do: you can also choose how many cores you want to run your

10:50 program on, and to do that... yes, there is an environment

10:59 variable that you can use. It's ACC_NUM_CORES, and you can simply set

11:06 its value to the number of cores that you would want to use.

11:13 And so, let's say, on these nodes, I think we have 24 cores on

11:19 the CPU, on one socket, and we have two sockets with 24 cores each,

11:29 so that's 48 cores. So you can simply do export ACC_NUM_CORES equal

11:37 to, let's say, 48. And now when you execute your program, it

11:46 will run on all the, all the 48 cores. So, I'm only running a hundred

11:52 iterations for now; so that's why it's quite fast here. But, uh,

11:57 you can, you can choose the number of cores with which you execute your program.

12:01 Um, a comment from yesterday: on Carya, it looks like they do not have hyper-threading

12:09 enabled; I believe they have only one thread per core,

12:15 there's no hyper-threading, right? Um, so yes, that was

12:23 the multicore compilation. Now we need to compile this code for the

12:36 GPU. So, for that, the only thing you need to change is the target-

12:41 accelerator flag's value. Remember we saw cc70? So you need to

12:46 provide that (-ta=tesla:cc70), and that's pretty much all you need to change. Compile it for the GPU,

12:55 and now you can see it's generated the Tesla code: it generated all the

13:01 loop gangs, the vector lengths that it chose for, uh, the

13:06 second set of for loops, and all the data movement operations as well.

13:12 Uh, now, note that this is just simply relying

13:20 on the OpenACC runtime to manage data transfers. If you want to

13:24 use the unified memory, then you need to add one extra option

13:36 to the target accelerator, excuse me, to the target-accelerator

13:42 flag, separated by a comma, and then say managed: -ta=tesla:cc70,managed.

13:50 Now, once you compile it, the compiler output is very likely going to be the

13:56 same. It's just that now, when you run this program, the unified

14:01 memory system will take care of the data movement; the OpenACC runtime will

14:05 not do it for you. So, as you can see, the

14:09 compiler output is basically the same, no difference. Just, at run time,

14:14 a different system will take care of the data movements. And from some of

14:18 my experiments, I've seen that this unified memory system has been quite good

14:25 at handling data management, and the run times have been smaller for the managed-memory

14:30 programs that I've used on this particular GPU here. But it could be different with

14:35 older versions of OpenACC, or with older versions of the CUDA drivers that you might have.

14:39 So it could be different for older versions. Okay. Any questions from the

14:48 students? Yeah, I don't know if there's anything in the chat. No questions

15:00 that I really see in the classroom either. Okay. Alright. Uh, the next

15:10 example: again, we saw this in the slides, using the kernels directive.

15:16 If you'll remember, the kernels directive, uh, basically allows you to say to the

15:23 compiler: this is a section of the code that I want to parallelize and run

15:26 on a certain accelerator; do something with it. So here, you can see, I

15:33 wrap the loops in the pragma acc kernels construct.

15:40 So, you see, I don't have the pragma acc parallel loop construct anymore;

15:46 this kernels directive ends here. I just wrapped the whole thing in

15:52 the pragma acc kernels region. And so, compiling the, uh,

16:04 program is again the same. I could compile it for the multicore, but now

16:11 let's jump to the GPU directly, because there's an interesting thing that's gonna happen

16:20 here. Yeah. Okay. All right. So, does anyone see any problem

16:36 with this compilation output here? Yes. But was there any loop dependency

16:52 in the code? Let me open the code again. Yeah. Yeah. Do

17:00 we see any loop dependencies in this section here, from line 42 to line...?

17:08 Mm hmm. All right. I don't think so. Well, the problem

17:20 is, and it's a bit subtle, that with OpenACC kernels,

17:27 really, you need to define your dynamically allocated memory pointers with a keyword, restrict,

17:37 because this particular keyword, it tells the compiler that there's only this one

17:45 pointer that points to that specific memory location, and there's no pointer aliasing going

17:50 on. Pointer aliasing means that several pointers may be pointing to the same memory location,

17:55 so there could be corruption when, uh, multiple pointers try to change the value

18:00 of the same memory location, and the restrict keyword basically tells the compiler that

18:04 nothing of such sort is going on. So you need to do that when

18:10 allocating dynamic memory in your OpenACC code. When you add this restrict keyword and do

18:20 the same compilation again, now everything goes smoothly and it generates the

18:26 parallel code for the accelerator. [In response to a question:] yes, even though the memory is dynamically

18:39 allocated, it's still being pointed to by

18:44 just one pointer. So the runtime can handle the memory accesses as long as

18:51 there is only a single pointer pointing to that memory location. Go ahead.
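(A minimal sketch of the aliasing point, with illustrative names; without restrict on the malloc'd pointers, the compiler has to assume the arrays might overlap and may refuse to parallelize the kernels region.)

```c
#include <stdlib.h>

/* Hypothetical example, not the demo file: restrict promises the compiler
   that in and out are the only pointers to their allocations, so the
   kernels region can be parallelized safely. */
void scale(int n, double factor)
{
    double *restrict in  = malloc(n * sizeof(double));
    double *restrict out = malloc(n * sizeof(double));

    for (int i = 0; i < n; i++)
        in[i] = (double)i;

    #pragma acc kernels
    for (int i = 0; i < n; i++)
        out[i] = factor * in[i];

    free(in);
    free(out);
}
```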

19:03 Yeah. So, yeah, that's a very subtle thing. But you need

19:06 to remember that, because you might end up, just because everyone's just familiar with simply calling

19:11 malloc and moving on, not realizing that you need that keyword to prevent

19:17 such an issue. Yeah. Okay, what was the next example I had?

19:31 Oh yes, uh, yes. One thing I forgot to say: so, even with

19:36 the kernels directive you can provide the managed, uh, parameter option.

19:47 So, again, even with the kernels directive you can use the unified memory.

19:54 Again, the compilation output will be the same even if you use managed, and

19:58 at the same time the unified-memory runtime will take care of your data

20:03 movements. All right. What else did I have? Okay. Mhm.

20:16 Right. Yeah. So up till now we've been pretty much leaving the data movement

20:23 to either the compiler or the runtime to take care of. But

20:29 you can, as we saw in the slides, you can also use these data clauses

20:35 to specify exactly what you want to do with your, with your data movement. So

20:41 here, what I've simply done is: before the while loop I have this pragma acc data

20:47 clause, or directive I should say. Here I tell the runtime, or rather the

20:55 compiler, that I need the A matrix to be copied, to be part of

21:04 a copy operation. And copy means: when you enter the parallel region, it will

21:09 copy the data onto the device, and when you exit the region it will copy

21:13 back the data from device memory to the host memory. While for Anew I just use create here,

21:22 because we don't need the data back in the host memory; it is only

21:27 being used as temporary storage for the intermediate results. So we only allocate the

21:33 space for that particular array on the, on the GPU; we don't really need to

21:37 transfer any data for it. Even though you don't need to transfer it, you

21:41 can simply just allocate the memory on the GPU. And when you come out of

21:44 the data region, that memory gets deallocated, and you just get your A matrix

21:52 that you actually wanted as the result. And this data region, it ends

22:01 with the, uh, with the structured block that is defined after it.

22:06 So it ends after the while loop; you'll get your updated values after

22:11 the while loop is done. So the data lifetime is

22:17 only for that particular structured block for which you define the data

22:23 directive here. Mhm. All right. So the compilation is again the same.
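(A sketch of the structured data region just described, reusing the illustrative names from the earlier Jacobi sketch.)

```c
/* A is copied to the device on entry and back to the host on exit;
   Anew is only created on the device and never transferred. */
#pragma acc data copy(A[0:n*m]) create(Anew[0:n*m])
{
    while (error > tol && iter < iter_max) {
        error = jacobi_sweep(n, m, A, Anew);  /* parallel loops inside */
        iter++;
    }
}   /* region ends here: A is copied back, Anew is simply deallocated */
```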

22:33 The important part here, with explicit data movement, is: if you specify the managed flag while compiling

22:43 a code that has explicit data directives, then the unified runtime will take over.

22:49 No matter what, the compiler will basically ignore all your explicit

22:53 data movements, and the unified memory system will handle all the data movements. So

22:59 remember that: even if you have explicit data management, if you use managed, then

23:05 everything that you have specified will be ignored. So keep that in mind.

23:14 Okay. Yeah, there is one more way to define the lifetime of your data than

23:21 using data regions: when you have, let's say, some parallel sections that are separated by some

23:30 sequential sections in, in the middle, but you still want your data to stay

23:34 on the accelerator memory and not be copied back to the host memory while you're doing

23:40 some sequential operations in the middle, there is a way to do that,

23:46 and as you see here, you can do that by using the pragma acc enter

23:55 data directive. Well, this is pretty much

24:01 exactly the same example that I showed before; there I had the data directive

24:06 just before the while loop. But now, for example, see, we have this pragma

24:11 acc enter data. You copy in your data onto the device memory

24:15 now, and see this: rather than a copy, I have a copyin

24:22 here. So that means at this point I'm only putting the data on the device

24:26 memory; I'm not copying it back onto the host memory yet.

24:31 And so I do my parallel loops and everything else. Now, imagine, rather than

24:43 these printfs and my timer, timer calls here, I may have had some sequential

24:50 code running, and then again another parallel section running here. So for

24:57 that second parallel section, I would have had that data already residing in

25:03 the device's memory; I wouldn't need to perform any more data transfers

25:07 from host to device; the data would have already been there. And, let's say, as soon

25:11 as I came out of that parallel region again, say on this line

25:18 here, then I can call this acc exit data and say delete

25:24 Anew. And also I should have had a copyout for A here; that's

25:29 a mistake. So I can also add a copyout for A and get all

25:33 of my data back on the host. So, therefore, you can extend the lifetime of

25:38 the data on the device, the accelerator's memory, here. Does that make sense?
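(The unstructured counterpart, again with illustrative names, including the copyout for A that the speaker notes was missing.)

```c
#pragma acc enter data copyin(A[0:n*m]) create(Anew[0:n*m])

while (error > tol && iter < iter_max) {
    error = jacobi_sweep(n, m, A, Anew);
    iter++;
}
/* Sequential host code could run here while the data stays resident
   on the device; no extra host-device transfers are needed. */

#pragma acc exit data copyout(A[0:n*m]) delete(Anew)
```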

25:50 Mhm. Okay. So, you... [A question about how the data travels.] So the data is coming out over

26:04 the PCI Express bus. The U.P.I., uh, bus

26:09 is the interconnect between two sockets of the CPUs. Yes, to connect to the GPU

26:16 you go through the PCI Express. Yeah, that's an Intel proprietary one;

26:26 UPI is, uh, Intel's proprietary interconnect fabric. For

26:35 AMD, they have something like that; they call it the Infinity Fabric,

26:40 I think. Yes. So that's what they use between cores, generally, and in

26:53 general it's not for sort of attached things like GPUs or network interface cards.

27:05 Those all use the PCI Express, a standard bus that is not proprietary;

27:10 between their own products, they tend to use their own fabric, yep.

27:20 Yeah. So, I think I mentioned it, that NVIDIA as well has its

27:30 own interconnect fabric between GPUs; it is called NVLink, and some

27:42 system vendors have circuit boards that support NVLink. So in that case the

27:50 GPUs are not connected through PCI Express but through NVLink to the CPUs; but that's a

28:01 sort of special case. Yes, to connect two physical GPUs to each other,

28:12 they have this protocol and fabric called NVLink. Oh, okay, thanks.

28:20 All right. Yeah. And I think I'll need a little bit of your help,

28:25 Dr. Johnson, to explain the last ones, the collapse and tiles. Okay.

28:34 Yes, I think it would be better if you explain it. Okay. So,

28:39 then, this is the collapse one. Okay, so, uh, and I think we saw

28:45 it also in the case of OpenMP; in that case it was pretty much

28:51 an example where you want to write to a multi-dimensional array and the

28:59 order in which you traverse the index space doesn't really matter for correctness. So, instead

29:07 of having a loop nest, you can basically make a single loop traverse the index

29:18 space. And then the compiler can figure out what is the right way to

29:26 traverse the index space, so it can either do vectorization and get good vector

29:34 factors, or it can get good memory accesses. So, instead of... we talked about this

29:42 before, that the RAM of the main memory is by no means random access,

29:49 and whether you, for instance, access by rows or columns depends on how the compiler maps

29:57 the multi-dimensional index space into one dimension, which is the way things are eventually laid

30:04 out in physical memory. So, again, the compiler can then decide which way to

30:12 traverse the memory space in order to get the best performance out of memory.

30:17 So that's the whole idea about this collapse clause: that you can leave the fine details to

30:24 the compiler. In this example, as you see, collapse(2): for instance, in this case there are two

30:29 loops. Now, it happens only to be two loops, but, uh, if there were more

30:33 loops, it doesn't mean that you have to collapse all of them. So in this case

30:37 two was specified, and in this case both of the

30:42 loops are actually being collapsed. So what the collapse causes is: let's

30:51 try to improve performance and have the compiler figure out exactly what the order of

30:57 the traversals should be. Mhm. And that's useful because, again,

31:08 loops have overhead associated with them, both in setting them up and then figuring out

31:14 the bounds, etcetera, etcetera. So if you just have one loop, it reduces both overhead

31:21 and, as I said, allows the compiler to decide how to traverse the index

31:27 space, with respect to vectorization and memory accesses. Mhm. Thank you.
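(A minimal sketch of the clause on the illustrative Jacobi loops: collapse(2) fuses the two loops into one iteration space and leaves the traversal order to the compiler.)

```c
#pragma acc parallel loop collapse(2)
for (int j = 1; j < n - 1; j++)
    for (int i = 1; i < m - 1; i++)
        Anew[j*m + i] = 0.25 * (A[j*m + i+1] + A[j*m + i-1]
                              + A[(j-1)*m + i] + A[(j+1)*m + i]);
```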

31:45 Yeah, the other one was the tile directive, right? Yeah, so, so the tile:

31:56 it is a similar idea, kind of, to

32:08 help the compiler, in this case, figure out how to traverse the memory space to

32:17 get the maximum performance. Um, so in this case it kind of does blocking.

32:23 I have some slides to show you in a bit that maybe give a sort of graphical

32:30 picture of what it means. But you partition, in this case with the two

32:38 loops, the index range into sub-ranges, and then you can basically interleave the

32:46 loops. Tiles are all basically small two-dimensional blocks, and then you can

32:53 traverse the blocks in either row- or column-major order, and that is sometimes very good

33:01 for the memory system, and we'll talk about that, yeah, coming up, and

33:08 a little bit about matrix multiply in a different way than you've done before, doing

33:13 blocked versions in order to benefit from cache hierarchies. And so the tile clause

33:22 is often used to try to help the compiler. The second part, in this case:

33:29 you pass specific, give specific, you know, index ranges for the two loops

33:36 in the tile command. Sorry, here you don't let the compiler figure out the tile

33:43 size, but it's intended, again, to help partition the loop and build a loop nest.

33:50 And I think, Yasha, you have in your notes something, you did some timing, so you

33:54 can talk about it. But look at the benchmarking: the effect of the tiling in

34:01 this case, for the Jacobi, was not successful. But in terms of blocking in general,

34:12 it's usually very beneficial, and that in a general sense, not for this particular

34:17 case. So Yasha can comment on that; it is a good exercise for a different example.

34:31 Yeah, but, uh, tile sizes do have a significant effect on the performance. There's a... yes. So it's a trade-

34:41 off between how much, how much data you load from, uh, the main memory to

34:47 the on-chip memory and then how much reuse you can, you can get out of

34:52 that particular tile, whether it's really more efficient to do tiling. And I think that's part

34:58 of what today's lecture will be: loading smaller tiles of matrices from the main memory

35:06 to the on-chip memory, uh, rather than going through row vectors and column

35:11 vectors of the matrix, uh, to try to get better performance for matrix

35:16 multiplication or any codes that involve matrix algebra. [A student asks what

35:25 exactly a tile is: what do you have in mind when

35:31 you say a tile?] Uh, yeah, it's basically a 2-D block

35:42 that you grab from the main memory. Um,

35:48 and that's generally a rather better way to do matrix multiplication than going through the

35:53 entire column, and if you... I won't go into the details there. But if you

35:59 try to compute the data reuse that you get from either of the two strategies,

36:04 doing tiling or going through the column and row vectors, you get better data reuse

36:09 when you do the tiling there. Yes.
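(A sketch of the tile clause; the 32 x 32 size is just an example value, not a recommendation. The compiler implicitly builds the extra loop nest that walks the tiles and the points within each tile.)

```c
#pragma acc parallel loop tile(32, 32)
for (int j = 1; j < n - 1; j++)
    for (int i = 1; i < m - 1; i++)
        A[j*m + i] = Anew[j*m + i];
```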

36:18 And the tiles can have as many dimensions as the loop nest has. In

36:24 this case there are two loops, so the tile is also two-

36:27 dimensional. But if you had three nested loops, the tiles would also be three-

36:31 dimensional; that's kind of a cube, a parallelepiped. So, sort of, the

36:41 point is, and we'll talk about that when we talk a little bit about matrix multiply in

36:47 detail in a future lecture: so, as we all remember, the standard matrix

36:53 multiply is O(n cubed), and it was, I guess, in some of the early assignments

37:00 here, whereas the data set is n squared, right? So, on average, there

37:03 are kind of n operations per data element. And that means that if you can get

37:11 good data reuse by scheduling the operations in a good order, then you get to

37:18 work more out of cache, now, instead of the main memory. And then, when

37:23 you have the memory-cache hierarchy that modern systems have, then you do tiling, in fact,

37:28 for each level of the memory hierarchy. But I will talk more about that;

37:33 this is just to give you some intuition here. But the point is that what

37:38 we are after is to get data reuse for each level of the memory

37:43 hierarchy that is faster than the lower levels, further away from the functional units

37:49 as you are. So the code doesn't look particularly pretty in that case, because what you had

37:59 as, you know, a two- or three-dimensional loop, or a nest of three loops,

38:04 becomes... there are so many loops over the block sizes for each level of cache.

38:16 But a good compiler can actually do a decent job and figure it out;

38:20 still, it's effectively quite a significant rewrite of the code of the program. Again, if that

38:34 confused more than helped: any questions on it? I think it will make more sense when

38:43 we talk through some of the details, when we talk about matrix

38:48 multiplication. Yeah. Okay. Thank you for that. So let's see what

38:57 else I have in mind. Yeah. Okay. Yes, there is another useful

39:07 environment variable that you can use. It's called PGI_ACC_NOTIFY.

39:11 Uh, so there are a couple of things here: if you set its value

39:17 to one and then run your OpenACC programs, basically the runtime will report all

39:22 the kernel launches on the, on the GPU that go on. Um, if

39:28 you set it to two, then it reports all the memory transactions. Note,

39:33 also, if you don't know: that's going to be a very long list of just memory-

39:36 transaction information, how wide each transaction is and how many bytes it transferred, and so

39:43 on. So you can use it for some quick debugging, or just to make

39:48 sure what's going on in the code if you're curious. So here I'll just

39:53 run one of the examples, uh, just the managed one. Okay, let's see. I

40:01 think it's gonna be a long one. Yeah. So if you just

40:05 take one of the lines here. Let's see: so, 100 iterations of the

40:13 Jacobi kernel here; likely it's 100 lines for CUDA kernel launches, or it should be 200,

40:18 because you have two kernel launches, one for each of the two

40:25 nested for loops. Um, so it

40:32 tells you the number of gangs (again, NVIDIA vocabulary) that were launched,

40:39 the number of workers, the vector length, the grid size and the block size

40:44 that was used for each of these kernel launches, and the amount of shared memory,

40:48 I believe in bytes, I'm not entirely sure, but I'll check, that

40:55 was used for each of the kernel launches. And similarly you can set it to two,

41:08 and that gives you the, uh, memory transfers that occurred for your kernel launches

41:15 here. This particular line shows one of the data transfers, for the

41:21 error variable, which is of double type. So that means that this particular transaction

41:29 transferred eight bytes from the device to the host memory. Mhm. So it's a...

41:36 So it's from host to...? From device to host here, yes. All right. So that's one

41:44 more thing you can use. Now, you obviously need to profile your codes, and let

41:50 me just set it to zero, so we don't get any unnecessary messages here.

41:55 So, for profiling your codes, you can use nvprof,

41:59 which is again a part of the NVHPC module; you don't need to

42:03 load any other module for that. So first let's see how you can get

42:09 the power-consumption values. So, for that, you can do nvprof --system-profiling

42:18 on, and just simply provide the executable name here, and that runs the

42:28 program with the profiler. Let me scroll up a little bit. Okay. So,

42:36 yeah, it runs the program and it gives you a breakdown of the amount

42:42 of time it took for different sections of your program.

42:49 So here, mostly, the GPU activities, which are your kernels: your reduction kernel, which

42:55 was the first nested for loop; a CUDA memcpy, a memory copy

43:00 from device to host; and one of the memsets that we used for the

43:04 Anew matrix here. And then the API calls are more specific to the

43:10 NVIDIA library, because, as we know, OpenACC code pretty much gets translated to CUDA

43:17 code by these, uh, these OpenACC compilers, and anything that's going on is

43:22 CUDA calls in one way or another. So you get the CUDA functions listed,

43:31 and you also get some information about the OpenACC functions here as

43:41 well. And yes, so in the end you get the temperature; you've

43:51 got the streaming-multiprocessor clock and the memory clocks as well. And you get the

43:56 power consumption. And here, this average power consumption is your average consumption for the entire

44:04 execution period of your entire program. So it's not specific to just the OpenACC

44:10 sections here; this is just the power consumption during the whole execution period of the program.

44:16 And this is the one that you should probably try to report in,

44:22 in your assignments. But yes, it also reports the maximum power consumption and the

44:27 minimum power consumption during the entire period of execution of your programs. Um, there

44:37 are a few more metrics that you can collect with the profiler; um, for that you

44:45 simply need, uh, need to provide... well, I should first show you how to list

44:50 the metrics. Yeah, so you can do nvprof --query-metrics, and

44:57 that tells you all the metrics, similar to what you guys were collecting before,

45:02 but this is for the GPU, that it provides. So it tells

45:07 you all the, all the metrics you can collect about your programs.

45:18 You've got the flop count in double precision; there's a single-precision flop count as

45:23 well, if you use single precision. There's DRAM read throughput, write throughput,

45:28 and, uh, there are several other metrics. The one we are interested in here

45:39 is flop_count_dp. And I think we're going over time here, but I'll be

45:53 quick, don't worry. Okay. So, you can set the, uh,

46:02 metric to be flop_count_dp, double precision. That tells you the double-precision operations performed

46:08 by this kernel launch here. And notice that

46:20 there were 100 invocations of this kernel (I had 100 iterations), and the numbers shown

46:29 are... but yes, they are per-invocation numbers. So, for 100 invocations, you need

46:40 to multiply; uh, well, in this case min, max and average are exactly the same,

46:45 so just multiply one of these numbers by 100 to get the total number of

46:49 floating-point operations. And if you're wondering where this particular number comes from:

46:56 so, in our Jacobi code we had five floating-point operations per point. How was

47:03 that? Because there is one floating-point multiply, and the additions make two, three, four,

47:13 and there's one more here, that's five: the subtraction of the old and new

47:19 values to get the error value. So that's five, um, operations in each iteration.

47:27 The memory swap is not counted as a floating-point operation; even though you're,

47:32 you're putting a new value into a floating-point variable, it's not a plus,

47:37 minus, divide or multiply, so it's not counted as a floating-point operation; it's a

47:41 memory copy operation. So you have five operations, and we have a 4K-by-

47:50 4K grid size. So just multiply that, 4096 by 4096 by five, and you

47:56 get this particular value for each invocation here. That's one way

48:03 you can verify these numbers here.

48:03 you can verify uh these uh numbers . Uh The other one that will

48:09 useful for you guys for the assignment the dirham read transactions and you can

48:17 the other metrics as well. But one is also interesting to see and

48:24 can see that the number of uh and read transactions that are performed.

48:32 Yes. So here, if I correctly you need to multiply the number

48:38 transactions first with the number of invocations have the total transaction and also multiply

48:44 with the data size. I think in the in the read me the

48:49 formula to use. But yes you to multiply it with the data

48:52 So we're using double positions. That's fights. You can get the total

48:57 of bytes that were derived from the memory. Yes. So we got

49:05 consumption floating point counts and the memory and I just have one final example

49:22 Mhm. Okay. Okay. Yes. Okay. So, a question for you

49:45 guys: do you think this matrix-multiply kernel, parallelized using the pragma acc parallel

49:57 loop construct on the innermost loop... uh, first, will it give you speed-up? And the

50:02 second question is: even if you get speed-up, do you think your output will be

50:08 correct here or not? Mhm. It's running on the GPU; you compile and

50:36 run it for the GPU in this case. Yes? No one? Okay. [From the chat:] there

50:49 won't be a speed-up. And what about the correctness of the output?

51:06 Also no. Why wouldn't there be a speed-up, then?

51:16 Oh yeah. Uh, speed-up compared to running on the CPU, let's say; I should have

51:29 been clearer here. Sorry? Yeah. Okay, let's leave the speed-up part. Do

51:42 you think you'll get the correct output? No. Why? So this

51:51 is an answer from the chat: the loop variables may

52:00 not be correct due to a dependence, really. Yeah, "variables would not be

52:11 correct due to dependence". Uh, no, but you're close. Uh, okay, let

52:24 me simply run this code. Let's see, it would be here. And

52:29 I've compiled it for the GPU, so don't worry about the compilation part;

52:34 here is the output. Yes, you

52:45 guys were quite close. Yes. So, first, then,

52:52 the speed-up part: yes, I have the timings for the CPU as

52:56 well, and this time taken for OpenACC is obviously quite fast compared to what

53:01 would happen on the CPU. So yes, you will get speed-up. Uh, but

53:08 there is an error in the output: you're getting some value that was not expected.

53:14 And, uh, the part that's wrong with this code is that you have a race condition

53:20 on C[i][j], because you parallelized this innermost loop here. So, because of the race

53:27 condition, obviously, you might get incorrect output here. And so the, the right way of

53:37 doing this matrix multiplication is, yes, you need to add a

53:43 temporary variable (it could be sum, tmp, whatever) over which you do the reduction.
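(A sketch of the corrected kernel with illustrative flattened arrays: the two outer loops are parallelized, and the inner product accumulates into a scalar that is private to each (i, j) iteration, with an explicit reduction, so no two threads ever update the same element of C.)

```c
#pragma acc parallel loop collapse(2)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        double sum = 0.0;                 /* private per (i, j) iteration */
        #pragma acc loop reduction(+:sum)
        for (int k = 0; k < n; k++)
            sum += A[i*n + k] * B[k*n + j];
        C[i*n + j] = sum;                 /* single writer per element */
    }
}
```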

53:50 So just the point of this example was that even if you put all the parallel pragmas,

53:55 loops, whatever directives, in your code, you'll see speed-up in your timings,

54:00 but it's also important to make sure that you're getting the correct output there, because

54:05 it's the same, right, as with OpenMP: the compiler, the runtime, no one

54:10 will warn you of such a race condition. So just keep that in mind and

54:15 have a verification section in your code. Even for your assignments or your projects,

54:20 whenever you work on it, make sure you verify your outputs. Don't simply be

54:25 happy if you're getting speed-up with your code. So, running the corrected version now:

54:31 it works. Yes, you need the reduction, right? Yes. So now

54:40 you see the timings for both versions are almost the same, 0.44 and 0.41,

54:48 but here the verification section has passed, because the data matched what was expected in

54:55 the C matrix. Yes, so the two students who said it in the

55:01 chat: yeah, you guys were quite right there. And that's pretty much all I

55:06 had. Okay, thank you. Let's see if there are additional comments or

55:13 questions from the students. Mhm. Yeah, I don't see any for now. Okay,

55:27 alright. So I guess I... [A question comes in about the data

55:37 movement.] What do you mean, higher sizes? Yeah. Oh, okay.

55:50 Hmm. I think that was for the matrix multiplication. No, no, no: I ran

55:57 it with managed. Yeah, I don't have any data clauses in it.

56:04 So, uh, yeah... let's see. With the

56:21 unified memory it's not that, no. Um, so that does not depend on the matrix

56:32 size you have; mostly the bottleneck is the PCI Express lane. So, as we

56:39 saw in the initial part and the slides, all the data starts from the host

56:45 memory; you have to transfer it to the accelerator memory, do your computation

56:49 there, and then get the results back. So whether you use managed

56:55 memory or explicit data-movement directives, those are just two ways of

57:02 moving the data between the host memory, the host, and the device.

57:09 So yeah, the bottleneck actually is the PCI Express lane. [A follow-up question

57:16 about whether the CPU cache plays a role in these transfers.] No, there

57:28 is not much of a role played by the CPU cache in this case, because the communication

57:34 is pretty much happening from the DRAM memory on the host side to the

57:38 DRAM memory on the device side. The cache only comes into play when you're

57:43 trying to get data to the CPU's functional units; in that case you need to hide

57:48 the amount of time spent getting data from the DRAM memory. So that's

57:52 why you get some data into the cache, work on it, store it back, because caches are

57:58 obviously the faster memory, you know. But in the case of host-to-device

58:02 transfers, the CPU caches do not matter much. Yeah. Mhm.

58:12 So, most systems today, they make use of what's known as direct memory access, or

58:19 remote direct memory access. So that means the CPU is very likely involved in

58:28 starting and setting up the transactions, but the transfer then goes ahead without moving things through

58:36 the CPU and writing them back. Does that answer your question? Does that

58:48 answer your question? Yes. I think someone else had a question; you can go

58:54 ahead, and if you want to, type, or you can unmute yourself. Yes, I am

59:04 aware, and we both are aware, of a few questions

59:10 about registering for accounts. Um, so, those of

59:20 you who already have accounts on either Sabine or Carya should be able

59:29 to use them for the class. If not, just let me know;

59:36 I'll add you to the allocation we have reserved for the group, or mine.

59:45 So that could be done quickly if everybody lets us know. I don't,

59:50 I don't recall all the specific procedure, so you can tell me what, tell

59:55 us what you need to do, since they set it up and I don't remember

60:00 it. I only got a question from the Data Science Institute folks to validate that

60:09 it's okay to give you an account. Uh, like, yeah, if you, if you

60:16 have an account either on Carya or... no, it's... [Dr. Johnson:] it would be best

60:21 if they use Carya; on Opuntia the profiling is not available. Yeah. Okay,

60:25 I know, but you can use both; you unfortunately cannot do everything on Opuntia, but you

60:32 can debug codes and do some runs; you just can't do the measurements. But yes,

60:39 you can, you can, if you don't have an account, you can go to the

60:43 Data Science Institute's website, and I'll send you the link where you can sign up.

60:48 And also add Dr. Johnson as the principal investigator. So add Dr. Johnson's

60:54 name there, Dr. Johnson's email, and they'll send him an email to confirm and

61:01 add you to their allocation. If you already have an account, just send us

61:06 your user name, so you can be added. And that works both on Carya

61:13 and on Opuntia here. I'll send out the links for signing up on both

61:18 of the portal sites. So let's remember the name of Dr. Johnson as

61:24 the principal investigator, and his email address,

61:28 the one that has been used to communicate with them.

61:33 Copy it and send it to both of us. Yes, for backup, so

61:40 that none of us misses it. Yes, and yes, I will also send instructions;

61:45 so mostly the compilation, profiling, all the commands will be the same; the

61:52 only difference will come from the command to get interactive access to the GPU node,

61:58 and I'll send you their tutorial pages as well, along with the sign-up

62:05 links. So yes, I'll do that. In terms of compiling you won't see any difference;

62:11 yeah, it's getting to the GPU node that will be a little bit different from what

62:15 I showed you. Oh, okay. I think you can take over.

62:26 Okay. Yeah, so, all right, let me share my screen. Just see what

62:34 happens here. Uh, that's not it. Okay, one moment, then you get it.

62:45 Uh huh. Not yet. Yeah, now we got it. Okay. Um,

62:55 so, um, I will have some slides, part on what he has already talked about,

63:03 and a couple of things, probably, before the end of today's lecture. But, so, this is the

63:12 last lecture planned for OpenACC, or accelerators and GPUs, so again, feel free to ask

63:17 questions as we go along. So, oh, so the first thing is a bigger

63:27 picture for what you have already seen in terms of this update directive. It

63:34 is a way of explicitly making sure that the CPU and the GPU are in

63:44 sync at points where it's important that they are in sync, or that they both have the same data;

63:50 that is a bit too strong a way to state it, but since they both can compute independently

63:56 of each other, that means, if you have a race that exists on both the CPU

64:03 and the GPU, the correctness of the program sometimes does not depend on them operating

64:13 on the exact same data, and it may still work okay. And sometimes, at

64:18 some point, you may want to make sure that, before some other phase of the

64:24 code, they do have the most recent data, both of them. So you can do that,

64:30 and make sure, through this update directive. And the sort of attribute to it, that is the direction

64:39 itself. So "device" means that you update the device, the GPU, with whatever the

64:47 current values are of, in this case, a; and "self" means that the host

64:55 retrieves the current state, in this case from the device. So it's just directional,

65:02 on which way the data is copied, to the device or to yourself (the host), but it's a way of

65:07 doing it explicitly, and not waiting for, for instance, the exits from the data regions or some other parallel

65:15 constructs. And so, from this point of view: maybe you want

65:19 to checkpoint the data; that is via the I/O, which you usually have

65:23 done by the CPUs, and in our case you need to retrieve the data from the

65:28 GPU and then write it out to disk, or however you do the checkpointing on

65:34 the CPU.
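(A sketch of the directive; checkpoint_to_disk is a hypothetical host-side routine standing in for whatever I/O you do.)

```c
#pragma acc update self(a[0:n])     /* pull current device values to the host */
checkpoint_to_disk(a, n);           /* hypothetical checkpoint call on the CPU */
#pragma acc update device(a[0:n])   /* push host-side changes back, if needed */
```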

65:42 Uh, the other thing that was also talked about: the enter and exit data directives, a more flexible way of

65:49 defining the lifetime of the various data structures that you have. And they need

65:56 to be paired. So if you have an enter

66:00 data, it needs to be, at some point, followed by an exit data.

66:06 So they need to be matched, but it's flexible how they can be matched, in

66:12 the sense of where they can exist in the code. And then there are clauses, and so on.

66:18 Uh, so yeah, I should also mention: for the enter, that means you

66:24 can use the create, or copy in the data from the environment that

66:29 you have when you get to that point; and on the exit you can do a

66:35 copy out, or delete, to clean up the data space. And that was

66:41 also demoed by Yasha. Here are just some examples of how they work.

66:48 This is... the unstructured example is not really all that interesting, because it's

66:55 not that different from the so-called structured one, when you're not making use of

67:03 the enter and exit data statements. But the other one is a little bit more

67:09 about how they can be flexible in terms of where they're located in the code.

67:13 But you just need to make sure that somehow the logic is correct, in that data that is copied in or created at some point is also

67:26 deleted, or copied out, by the exit data statement. Okay, and then we had this collapse and tile that we talked

67:36 about; there was, um, one more example on slides, so you

67:42 can remember it perhaps better than just from the presentation in the video. But it

67:49 just shows the same thing: the collapse, and the loops, and the explicit Jacobi statements. About

67:57 the collapse, in this case collapse(2): Yasha, as I mentioned,

68:02 ran the benchmark and redid it for this class, and there was no

68:07 benefit in this case; actually there's a slight penalty, in terms of using

68:11 the collapse, a slowdown, maybe 2%, but not much. So no improvement

68:19 in this particular case from collapsing the loops. Um, and here is kind of the picture

68:27 I promised in terms of the tiling. Here, in this case, if you have a

68:31 four-by-four index space you want to tile, you could potentially do it in

68:37 tiles of 2 x 2, and in that case you get, sort of implicitly, you end

68:43 up constructing a loop nest. The four loops: the two that traverse within each tile, just two

68:51 loops, and then, you know, two that traverse, let's say, the X and Y dimensions of

68:57 the tiles. So that's another two loops. That is done implicitly for the tile

69:03 attribute that you put on the loops. All right. And then, for something more realistic,

69:10 you probably have bigger tiles, and here was an example, again going back to Jacobi,

69:15 in which case one may actually choose a somewhat better tile size for larger arrays.

69:25 In terms of finding out the best-performing tile size, it is a little bit of trial and error,

69:29 or sometimes you can have insights, if you know the sizes of the caches in the cache

69:36 hierarchy, into how to choose the tile sizes for each level in the hierarchy.

69:42 When I talk about how to do matrix multiply in a future lecture,

69:47 I'll go through how one actually, at least, gets a good estimate of what the tile

69:53 sizes ought to be for each level of the memory hierarchy. And I guess this

70:01 was also done; you know, we were playing around, mhm, with introducing the tile

70:10 clause, and in this case, yeah, it had a tiny, tiny bit of positive effect

70:18 for the CPU, but not much, and for the GPU it kind of almost brought

70:27 it back to no tiling, for the 32 x 32 tile size. I don't know

70:33 if you want to comment on anything, on the suggestions; you did run it? I'm not

70:39 surprised, particularly, because really small tile sizes hurt the performance; you need to

70:46 make compromises there. When I did these benchmarks previously on one

70:53 of the older GPUs, there you could see a significant performance difference.

71:00 For last year, I remember, it had gone somewhere from 10 to almost 14,

71:09 was it 14%, improvement in performance by choosing an optimal tile size. So here,

71:15 in this example, you see a drop in performance, but choosing an optimized tile size can

71:22 still gain good performance for some of the GPUs. Uh huh. So the

71:27 compiler did better this time around. Yeah. So, and here it's just showing the

71:35 benchmark numbers, and I guess, good. Um, no, and okay, so I have

71:48 one more slide. Alright, so some comments, again, about these three levels of the

72:04 hierarchical structure that was just shown, also in terms of the benchmarking, and in terms of

72:11 the vocabulary, a part of the vocabulary, to not be confused: OpenACC

72:16 uses the notions of gang, worker and vector, and indeed the vendors, NVIDIA

72:25 down the line, have different notions. Um, so in their case, I have it on a

72:36 different slide, but basically "thread" has a more or less similar meaning: the thread is something

72:42 that executes on a core. And then you have collections of threads, and I think I

72:47 have it on slides coming up, for the SIMD units; so in this case

72:55 there are different vocabularies, CUDA versus OpenACC. So, in

73:04 just the next few slides, I'll switch to the NVIDIA language. But otherwise

73:12 I try to, or I do prefer to, use open standards, definitions that are

73:21 not proprietary. So that's why you see OpenACC, which is an open

73:27 standard, and it is also a little bit less worried about the details of how things are

73:34 actually then executed. The whole model, the presumption, is that an OpenACC compiler is supposed

73:41 to figure out some of the details of mapping it onto the actual specific hardware platform.

73:46 So, and then, so this was a summary of the OpenACC terms, using again the notions of

73:54 gang, worker and vector, versus NVIDIA's ideas: they don't use gang, they use

74:04 thread blocks, um, and "worker", I believe, would be translated to warps, but more on

74:14 that in a second. But you can, you know, try to direct things by giving

74:22 these attributes on the loop constructs; there were parallel constructs as well as kernels

74:29 constructs, so there's freedom to mix and match. Mhm. And then you can also specify

74:36 the numbers, you know how many, in this case, gangs, which translates to thread blocks in terms

74:45 of NVIDIA.
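(A sketch of these mapping hints; the counts are illustrative tuning knobs. On NVIDIA hardware a gang corresponds to a thread block and the vector lanes to threads within a block.)

```c
#pragma acc parallel loop gang num_gangs(256) vector_length(128)
for (int j = 0; j < n; j++) {
    #pragma acc loop vector
    for (int i = 0; i < m; i++)
        y[j*m + i] = 2.0 * x[j*m + i];
}
```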

74:55 And I think that's it for making you aware of the OpenACC vocabulary. In this class we

75:02 just use the NVIDIA GPUs, but in recent years AMD has been quite successful too;

75:10 they're also selling GPUs for data centers and scientific and engineering applications, they just have different

75:18 names, but they all understand the OpenACC vocabulary. So here is another example, um, the

75:30 matrix-copy example, and here's a benchmark result. So, let's see:

75:38 as was shown in particular for the tiling example, the compilers have gotten

75:49 better over the years, and at this point it may not be... you may actually lose

75:55 rather than gain by trying to tune by hand. But it depends, again, on how good the compiler is for the

76:02 particular platform you're using. Yeah, and this slide too is there, again,

76:10 to make you aware of the vocabulary used by the different vendors. Um, so NVIDIA

76:22 has been, you know, dominating the market for attached processors, but not for GPUs

76:29 totally: Intel is the biggest producer of GPUs, but theirs are integrated on

76:37 the same chip. So they are in, you know, PCs and laptops, but not

76:43 attached processors. But now, in recent years, AMD has actually been dominating in terms

76:51 of gaming GPUs, attached processors. But for the scientific market, NVIDIA has been the

76:59 dominating one. So typically, if you talk to the engineering and science community, they're very

77:05 used to CUDA and that stuff. But some of the terms are common:

77:11 "kernel" means the same across the three vendors, but there are other things that are

77:18 confusing, I would say. For instance, "shared memory" has different meanings:

77:23 there's shared memory on CUDA, and local memory on AMD GPUs; Intel

77:30 decided to combine the two into "shared local memory". Um, and "local memory" elsewhere

77:38 may be private memory, or something else. So it may be useful, in the long run,

77:44 to understand that, and Intel is about to put out attached GPUs, to start to compete with the

77:49 other two vendors in terms of attached processors. Um, and a little bit more

77:59 specifically about the NVIDIA GPUs, and what was mentioned about thread blocks and warps, and what

78:09 all these things mean. Some of you may already be familiar with it, but to

78:14 remember: so, the structure, which is pretty much certainly the same for AMD GPUs,

78:24 is, very roughly, how the GPUs are organized in a hierarchical structure, and the name

78:32 for NVIDIA is streaming multiprocessors. I'll start there, but there's a corresponding thing for

78:40 AMD. Mm. And so, in terms of how things are mapped onto the GPU,

78:52 there's sort of one level that is at the streaming multiprocessors, as I will show in the

78:57 next few slides, and then, things that run within the streaming multiprocessor have a

79:05 certain amount of sharing that can be done, but typically there is no sharing, in terms

79:13 of synchronization, between different streaming multiprocessors. In that case, things executing on different streaming

79:22 multiprocessors can share, as it shows, the level-two and level-three memory, but between the

79:28 threads on the different streaming multiprocessors there are no synchronization points. Uh, and

79:40 both AMD and NVIDIA have this notion, in their underlying SIMD units, as has

79:50 been shown a few times, of a single instruction operating on multiple data items,

80:00 and that's where this notion of warps comes in. So that's a scheduling unit:

80:08 the scheduler schedules each warp onto one of the SIMD units. And typically that means up

80:15 to 32 threads. And if you don't have 32 threads in the warp, then

80:19 you basically lose performance, as we talked about in some previous lecture. So now,

80:29 then, thread blocks is kind of the next higher level in the vocabulary for NVIDIA.

80:35 And the thread blocks: as was shown in the demo, you saw

80:44 the number 1024 on one of the slides; that is the maximum number of

80:47 threads in a thread block. And that means, to do the operations on this thread

80:56 block, potentially, threads in a thread block get carved up into sets of 32

81:02 threads, some of them scheduled together as a warp. And then, a collection of multiple thread

81:09 blocks NVIDIA calls grids. And then, for the thread blocks, note

81:16 that a thread block cannot be mapped across the multiprocessors; a thread block is

81:25 assigned only to one. So only one multiprocessor works on any given thread

81:32 block. But there can be multiple thread blocks assigned to a given multiprocessor.

81:41 I think that's... also, this one tells a little bit about sync; we already covered that.

81:41 I think that's also this one tells little bit about sink and already

81:49 And the next slide is this notion coalescing memory accesses. So that means

81:58 the different threats uh if they need to level two or global memory that

82:06 in the previous slide, the system to make sure that it fill up

82:11 the kind of wires two, the to make full use of the memory

82:20 . So that's coalition as it's And again the compiler try them to

82:26 out the schedule things. So this actually happen. And to make produce

82:31 the memory bandwidth. Yeah. And is a 384 within that's typically when

82:39 use um the graphics or type of memory, that's not as G D

82:48 R. And that also comes from the demand bugs into the circuit

82:56 But for some of the GPU still that because it's cheaper for my

83:03 Whereas the Korea Gpus used to high memory that are integrated in the same

83:11 . And then you have a wider bus and well I guess my time

83:20 up but I want to have a of slices off to show a little

83:25 and this has an media linked memory the CPU that was being built some

83:31 purpose I would say circuit boards to for large systems to the Department of

83:38 . So in this case that has of the IBM Cpus On the circuit

83:45 and each one of them has six three to each one of the same

83:52 . Yeah. And it also supports notion of this unifying memory. So

83:59 think I'll stop there because my time up. There's a couple of more

84:02 basically comparing open MPI to open But we're not doing open empty for

84:11 using the glass of you can be similarity. We'll stop there. Mm

84:25 . Yeah. Yeah. Any questions , Yeah. Uh huh. And

84:37 will be back in the classroom on traveling today. Mhm Oh.

84:47 No more questions here. Okay, stop recording that. Yeah.
