© Distribution of this video is restricted by its owner
00:21 | Wait one more minute. Yeah — is this visible to you? Should I make the font larger? A little bit more, I think. Yeah, let me make it a little bit larger. This should be okay, I think.

01:36 | Okay. Also, sometimes when you're projecting, the light [gets in the way] — I guess the light was just throwing it off. Okay, I think that's okay.
01:53 | I'll start with the demo.

02:06 | Right now I'm showing this demo on the data-center institute's campus cluster; I'm using a GPU node on that cluster for now. You can easily get access to that node with a simple `interact` command, which is an `srun`-style command from Slurm, passing some specific parameters. I'm assuming everyone is aware of how to use Slurm at this point, so that's okay.
03:03 | Now, in order to do any OpenACC programming on these nodes: the PGI compilers ship as part of the NVIDIA HPC SDK (NVHPC). Here I have already loaded the NVHPC module using the `module load` command. Once you do that, just type `pg` and hit tab a couple of times, and it will show you pretty much all the PGI commands that are available. There's one command called `pgaccelinfo`.
03:41 | It gives you information about the accelerators that are available on the node. `pgc++` is the C++ compiler for OpenACC, `pgcc` is the C compiler, and in case you use Fortran, there are Fortran compilers as well that you can use. First I'll show the output of `pgaccelinfo`.
04:03 | This command basically tells you all the information about the accelerator, as I said. So we have the Tesla V100, which is the Volta generation of NVIDIA GPUs, on this node. There's only one GPU on this node, so that's why you see a device number zero and nothing else. In case you have multiple GPUs on a node, you'll see one such output per device when you run `pgaccelinfo`, with device numbers from zero up to however many GPUs you have. On Bridges-2 you have eight GPUs on each node, so you'd see eight such outputs, with device numbers 0 to 7, when you run `pgaccelinfo`.

04:49 | Then here you can see all the information about the GPU: the clock rate, the memory clock rate, the memory bus width. As you can see, memory buses are wider on GPUs compared to CPUs — you usually have 64-bit buses on CPUs, while here you have much wider ones.
05:18 | Then you have all these NVIDIA conventions for the compute hierarchy, where you have threads as the smallest entity; going up one layer you have the thread blocks, and the topmost level in the compute hierarchy is the grid. The one important flag to make a note of is this one, which you will use while compiling the code: cc70 means compute capability 7.0 — that's just the compute capability version of this particular GPU. For older GPUs you might see cc35 or similar; on Bridges, you'd see cc35 for the K80 GPUs that they have there. So that's the basic information you can get about the GPU here.

06:23 | [Student] Because of the bandwidth, can you conclude that it has high-bandwidth memory? It would not have that otherwise. — Yeah, I think if you go to the NVIDIA website, it says that these GPUs use HBM2 memory, if I remember correctly. Right.
07:01 | For the first example — I think you remember it from the last lecture slides — this is again simply the Jacobi example. The construct we saw first was the parallel loop construct, which is used to explicitly parallelize these for loops. To do that, we simply have the `#pragma acc parallel loop`, which is similar to the OpenMP constructs, and then a reduction — a max reduction — for the error variable. We also add a `#pragma acc loop` to parallelize the inner loop, and we can do the same thing for the second set of nested loops.
08:01 | Now, to compile these codes: since these are C programs, we'll use the `pgcc` C compiler. You need to provide the flag `-acc`, which stands for enabling the OpenACC compilation. You also provide the `-fast` flag, which is an optimization flag quite similar to the -O3 optimizations that you've been using in the gcc and icc compilers. Firstly, we'll compile it for the CPU, and to do that you use the `-ta` flag, which is the target accelerator, and its value should be `multicore`. When you say multicore, that means you're compiling for the CPU.

08:51 | The other flag that is also helpful is `-Minfo` — capital M here. Either you can do `-Minfo=all`, and it will give you information about what the compiler did for the CPU as well as the accelerator; or, if you're only interested in the accelerator information — what the compiler did for the accelerator part of the code — you can give it `-Minfo=accel`. Here I'll just do `all`, and then — I think it's covering up the window here — your source code name goes here (and I'll try to be consistent with the file names that I've been giving), and then your output file name.
09:44 | When you do this, the compiler tells you exactly what it did for each of the sections. Here, lines 37, 40 and 43 are the first set of nested for loops, and you can see it generated multicore code — there is no mention of Tesla or GPU or anything. It also generated similar messages for the second set — for the outer for loop. And since in the second set of for loops the inner loop is just a swap of A and Anew — really just memory copies — it also generated a memory copy operation for the inner loop.
10:39 | Now, on multicore there's one extra thing that you can do: you can also choose how many cores you want to run your program on. To do that, there is an environment variable that you can use, `ACC_NUM_CORES`, and you can simply set its value to the number of cores that you would want to use. So let's say on these nodes — I think we have 24 cores on the CPU on one socket, and we have two sockets with 24 cores each, so that's 48 cores — you can simply do `export ACC_NUM_CORES=48`. And now when you execute your program, it will run on all 48 cores. I'm only running 100 iterations for now, so that's why it's quite fast here, but you can choose the number of cores on which you execute your program.

12:01 | [Student comment] It looks like these CPUs have only one thread per core — there's no hyperthreading. — Right.
12:15 | So yes, that was the multicore compilation. Now we need to compile this code for the GPU. For that, the only thing you need to change is the target accelerator flag's value. Remember we saw cc70 — so you need to provide that, and that's pretty much all you need to change to compile for the GPU.

12:55 | And now you can see it's generated the Tesla code: the loop gangs, the vector lengths that it chose for the second set of for loops, and all the data movement operations as well.
13:12 | Now, note that this is simply relying on the OpenACC runtime to manage data transfers. If you want to use the CUDA unified memory instead, then you need to add one extra option to the target accelerator flag — separated by a comma — and say `managed`. Once you compile it, the compiler output is very likely going to be the same. It's just that now, when you run this program, the unified memory system will take care of the data movement; the OpenACC runtime will not do it.
14:05 | So as you can see, the compiler output is basically the same — no difference. Just that at run time, a different system will take care of the data movements. And from some of my experiments, I've seen that this unified memory system has been quite good at handling data management; the run times have been smaller for the managed versions of the programs that I've used on this particular machine. But it could be different in older versions of OpenACC, or with the CUDA drivers that you might have — so it could be different for other versions. Okay, any questions from the remote students? Yeah, I don't know if there's anything in the chat; no questions that I see in the classroom either.
15:10 | Alright. The next example — again, we saw this in the slides — uses the kernels directive. If you'll remember, the kernels directive basically allows you to say to the compiler: this is a section of the code that I want to parallelize and run on a certain accelerator — do something with it. So here, you can simply wrap your code in `#pragma acc kernels`. You see I don't have the `#pragma acc parallel loop` construct; this kernels directive ends here — I've just wrapped the whole thing in the acc kernels region.

16:04 | Compiling the program is again the same. I could compile it for the multicore, but let me jump to the GPU now, because there's an interesting thing that's going to happen here.
16:20 | Okay. So, does anyone see any issue with this compilation output here? — Yes. But was there any loop dependence in the code? Let me open the code again. Do we see any loop dependencies in this section here, from line 42 to line [...]? — Mm-hmm. All right, I don't think so.

17:20 | Well, the issue is — and it's a bit subtle — that with OpenACC kernels, you really need to define your dynamic memory allocation pointers with the keyword `restrict`, because this particular keyword tells the compiler that only this pointer points to that specific memory location, and there's no pointer aliasing going on. Pointer aliasing means that several pointers may be pointing to the same memory location, so there could be corruption when multiple pointers try to change the value of the same memory location. `restrict` basically tells the compiler that nothing of that sort is going on. So you need to do that when allocating dynamic memory in your OpenACC code. When you add this restrict keyword and do the same compilation again, now the compilation goes through smoothly, and it generates the parallel code for the accelerator.

18:26 | [Student question] — Even if the memory is shared, it's still being pointed to by just one pointer. So the runtime can handle the memory accesses as long as there is only a single pointer pointing to that memory location. Go ahead.
19:03 | Yeah. So that's a very subtle thing, but you need to remember it, because everyone is just so familiar with malloc that you might simply end up writing the malloc and moving on, and not realize that you need that keyword to prevent such an issue. Okay — what was the next example I had?

19:31 | Oh yes — one thing I forgot to say. Even with the kernels directive you can provide the `managed` parameter option. So again, even with the kernels directive you can use the unified memory. The compilation output will be the same even if you use `managed`; at run time, the unified memory runtime will take care of your data movements. All right, what else did I have?
20:16 | All right. So up till now we've been pretty much leaving the data movement to either the compiler or the runtime to take care of. But, as we saw in the slides, you can also use these data clauses to specify exactly what you want to do with your data movement. Here, what I've simply done is: before the while loop I have this `#pragma acc data` clause — or directive, I should say. Here I tell the runtime, or the compiler, that I need the A matrix to be a copy operation. Copy means that when you enter the parallel region, it will copy the data onto the device, and when you exit the region, it will copy the data back from the device memory to the host memory. Whereas I use `create` for Anew, because we don't need that data back in the host memory — it's only being used as temporary storage for intermediate results. So we only allocate the space for that particular array on the GPU. We don't really need to transfer any data: you can simply just allocate the memory on the GPU, and when you come out of the data region, that memory gets deallocated, and you just get your A matrix, which is what you actually wanted as the result.

22:01 | And this data region ends with the structured block that it is defined on. So it ends after the while loop — you'll get your updated values once the while loop is done. The data lifetime is only for that particular structured block on which you define the data directive.
22:23 | All right. So the compilation is again the same. The important part here with explicit data movement is: if you specify the `managed` flag while compiling a code that has explicit data directives, then the unified memory runtime will take over, no matter what. The compiler will basically ignore all your explicit data movements, and the unified memory system and runtime will handle all the data movement. So remember that: even if you have explicit data management, if you use `managed`, everything you have written will be ignored. Keep that in mind.
23:14 | Yeah. There is one more way to define the lifetime of your data than using data regions: when you have, let's say, some parallel sections that are separated by some sequential sections in the middle, but you still want your data to stay in the accelerator memory and not be copied back to the host memory while you're doing those sequential operations in between — there is a way to do that. As you see here, you can do it by using the `#pragma acc enter data` directive. Well, this is pretty much exactly the same example that I showed before; there I had the data region just before the while loop. But now, for example, see: we have this `#pragma acc enter data` — you copy in your data to the device memory. And notice, rather than copy, I have a `copyin` here. That means at this point I'm only putting the data onto the device memory; I'm not copying it back to the host memory yet.

24:31 | So I do my parallel loops and everything else. Now, imagine that rather than these printfs and my timer calls here, I may have had some sequential CPU code running, and then again another parallel section running here. For that second parallel section, I would have had the data already residing in the device's memory — I wouldn't have to perform any more data transfers from host to device; the data would already have been there. And let's say, as soon as I came out of that second parallel region — say on this line here — then I can call `#pragma acc exit data` and say `delete(Anew)`. And also, I should have had a `copyout` for A here — that's a mistake. So I can also add a `copyout` for A and get all of my data back on the host. Therefore you can extend the lifetime of the data on the device — the accelerator's memory — here. Does that make sense?
25:50 | [Student question about where the data travels] — Okay, so the data is coming over the PCI Express bus. The UPI bus is the interconnect between the two sockets of the CPUs. Yes — to connect to the GPUs, you go through PCI Express. Yeah, UPI is an Intel proprietary thing; UPI is Intel's proprietary interconnect fabric. For AMD, they have their own version of that — they call it Infinity Fabric, I think. Yes. So that's what they use between cores and sockets, generally; but it's not for peripheral sorts of things like GPUs or network interface cards. Those all use the PCI Express bus — a standard bus that is not proprietary. Between their own products, though, vendors tend to use their own fabric. Yep.

27:20 | So I think I mentioned that NVIDIA as well has its own interconnect fabric between GPUs — it's called NVLink — and some system vendors have circuit boards that support NVLink. In that case the GPUs are not connected over PCI Express; NVLink to the CPUs is a kind of special case. Yes — to connect two physical GPUs to each other, they have this protocol and fabric called NVLink. Oh, okay.
28:20 | All right. Yeah — and I think I'll need a little bit of your help, Dr. Johnson, to explain the last two, the collapse and tiles. — Okay. — I think it would be better if you explain it. — Okay.
28:39 | So, this is the collapse one. And I think we saw it also in the case of OpenMP. In that case it was pretty much an example where you want to [apply an operation] to a multi-dimensional array, and the order in which you traverse the index space doesn't really matter for correctness. So instead of having a loop nest, you can just make a single loop traverse the index space. And then the compiler can figure out what the right way is to traverse the index space, so it can either do vectorization and get good vector [lengths], or get good memory accesses.

29:34 | So instead of — we talked about this before — the RAM of the main memory is by no means random access, and whether you, for instance, access by rows or columns depends on how the compiler maps the multi-dimensional index space into one dimension, which is the way things are eventually laid out in physical memory. So again, the compiler can decide which way to traverse the memory space in order to get the best performance out of memory. That's the whole idea of this collapse clause: you can then, as in this example, say `collapse(2)` — in this case there are two loops. Now it happens only to be two loops here, but if there were more loops, it doesn't mean that you must collapse all of them. In this case, with 2 specified, both of the loops are actually being collapsed. So what the collapse says is: let's try to improve performance and have the compiler figure out exactly what the order of the traversals should be.

31:08 | That's useful, again, because loops have overhead associated with them, both in setting them up and then figuring out bounds, etcetera, etcetera. So if you just have one loop, it reduces both that overhead and, as I said, it allows the compiler the freedom to decide how to traverse the index space with respect to vectorization and memory accesses. — Thank you.
31:45 | The other one was the tile directive, right? — Yeah. So the tile is a similar kind of idea: to help the compiler, in this case, figure out how to traverse the memory space to get the maximum performance. In this case it kind of does blocking — I have some slides to show you in a bit that maybe give a sort of picture of what it means — but it partitions, in this case with the two loops, the index range into sub-ranges, and then you can basically interleave the [traversal]. The tiles are basically small two-dimensional blocks, and then you traverse the blocks in either row- or column-major order, and that is sometimes very good for the memory system. We'll talk about that, coming up: a little bit about matrix multiply done in a different way than you've done before, doing blocked versions in order to benefit from cache hierarchies.

33:13 | So the tile directive is often used to try to help [performance]. The second part, in this case: you pass specific — you give specific, you know, index ranges for the two loops in the tile command. So you don't let the compiler figure out the tile size; but it's intended, again, to partition the loop and build a loop nest. And I think you have, in my recollection — you did some timing, so Josh can talk about it — but looking at benchmarking the effect of the tiling, in this case for the Jacobi, it wasn't all that successful. But in general it's usually very beneficial — in a general sense, not for this particular case. So Josh can comment on that; it's a good exercise for a different [code].
34:31 | Yeah, but tile sizes can have a significant effect on the [performance]. — Yes. So it's a trade-off between how much data you load from the main memory into the on-chip memory, and then how much reuse you can get — whether that particular tiling is really more efficient. And I think that's part of what today's lecture will be: [moving] smaller tiles of matrices from the main memory to the on-chip memory, rather than going through row vectors and column vectors of the matrix, to try to get better performance for matrix multiplication, or any codes that involve matrix algebra.

35:25 | [Student asks a question] — What do you have in mind when you say that it's like [that]? — Uh, yeah, it's basically a 2D [block] that you grab from the main memory. And that's generally a rather better way to do matrix multiplication than going through an entire column. I won't go into the details here, but if you try to compute the data reuse that you get from either of the two strategies — doing tiling, or going through the column and row vectors — you get better data reuse when you do the tiling. Yes.
36:18 | And the tiles can have as many dimensions as the loop nest has. In this case there are two loops, so the tile is also two-dimensional; but if you had three nested loops, the tiles would also be three-dimensional — that's kind of a cube, a parallelepiped. So the point is — and we'll talk about this when we talk a little bit about matrix multiply and tiling in a future lecture — as we all remember, the standard matrix multiply is 2n³ [operations], and, as you saw I guess in some of the early assignments, the data set is n². Right? So on average, that is kind of n operations per data element. And that means if you can get good data reuse by scheduling the operations in a good order, then you can get to work more out of cache, instead of the main memory. And when you have the memory cache hierarchy that modern systems have, then you do tiling basically for each level of the memory hierarchy — but I will talk more about that; this is just to give you some intuition.

37:38 | The point is that what you're after is to get data reuse for each level of the memory hierarchy that is faster than the lower levels, the further away from the functional units you are. So the code doesn't look particularly pretty in that case, because what you know as a two- or three-dimensional loop nest becomes — well, there are so many loops over the block sizes for each level of cache. A good compiler can actually do a decent job and figure it out, but it's effectively quite a significant rewrite of the code of the program. Again, if I confused more than helped — any questions on this? I think it will make more sense when we talk through some of the [material] on matrix multiplication. — Yeah. Okay, thank you for that.
38:43 | you talk through some of the when you talk about metal sandwiches with |
|
|
38:48 | multiplication. Yeah. Okay. Thank for that. So let's see what |
|
|
38:57 | have in mind else. Yeah. . Yes. Uh there is another |
|
|
39:07 | environment variable that you can use. called P G I S E |
|
|
39:11 | Notify. Uh So there's a couple things here, if you said its |
|
|
39:17 | to one and then run your open programs, basically the runtime will report |
|
|
39:22 | the Colonel launches on the on the . Uh that goes on. Um |
|
|
39:28 | you set it to two, then reports all the memory transactions. That |
|
|
39:33 | also if you don't know that's going be a very long list of just |
|
|
39:36 | transaction information, how how wide it and how many bikes that transfer and |
|
|
39:43 | on. So if you you can it for some quick debugging or just |
|
|
39:48 | sure what's going on in the code you're curious. So here I'll just |
|
|
39:53 | run one of the examples uh Just managed. Okay let's see. I |
|
|
40:01 | I knew it was going to be a long one. Yeah. So if you just take one of the lines here — let's see. For 100 iterations of the Jacobi kernel, there are likely 200 lines here for CUDA kernel launches, because you have two kernel launches per iteration — one for each of the two nested for-loop sets. So it tells you the number of gangs — again, NVIDIA-style vocabulary — that were launched; the number of workers; the vector length; the grid size and the block size that were used for each of these kernel launches; and the amount of shared memory — I believe in bytes, though I'm not entirely sure, but I'll check — that was used for each of the kernel launches.

40:55 | And similarly, you can set it to 2, and that gives you the memory transfers that occurred for your kernels. Here, this particular line shows one of the data transfers of the error variable, which is of double type. So that means this particular transaction transferred eight bytes from the device to the host memory. — [Student] So it's from host to [device]? — ...Yes.
41:36 | All right. So that's one thing you can use. Now, you obviously need to profile your codes — and let me just set it to zero, so we don't get any unnecessary messages here. So for profiling your codes, you can use `nvprof`, which is again part of the NVHPC module; you don't need to load any other module for that. So first let's see how you can get the power consumption values. For that, you can do `nvprof --system-profiling on`, and just simply provide your executable name here, and that runs the program with the profiler. Let me scroll up a little bit. Okay — so yeah, it runs the program, and it gives you a breakdown of the amount of time it took for different sections of your code.
42:49 | So here, mostly, the GPU activities, which are your kernels: your reduction kernel, which was the first nested for loop; a CUDA memcpy from device to host; and one of the memsets that we use for the Anew matrix here. And the API calls are more specific to the NVIDIA library, because, as we know, OpenACC code pretty much gets translated to CUDA code by these OpenACC compilers, and anything that's going on is CUDA calls in one way or another. So you get the CUDA functions here, and you also get some information about the OpenACC functions here as well.

43:41 | And yes — in the end you get the temperature; you've got the streaming multiprocessor clock and the memory clocks as well; and you get the power consumption. And here, this average power consumption is your average consumption over the entire execution period of your whole program. It's not specific to just the OpenACC sections — this is just the power consumption during the whole execution period. And this is the one that you should probably try to report in your assignments. But yes, it also reports the maximum power consumption and the minimum power consumption during the entire period of execution of your program.
44:37 | There are a few more metrics that you can collect with the profiler. For that, you simply need to provide — well, I should first show you how to list the metrics. Yeah — so you can do `nvprof --query-metrics`, and that tells you all the metrics, similar to what you guys were collecting with [the other tool], but this is for GPUs. So it tells you all the metrics you can collect about your programs. You've got the flop count in double precision; there's a single-precision flop count as well, if you use single precision; there's read throughput and write throughput, and several other metrics. The one we are interested in here is `flop_count_dp` — and I think we're going over time here, but I'll be quick, don't worry.
45:53 | Okay. So you can set the metric to be `flop_count_dp` — flop count, double precision. That tells you the double-precision operations performed by each of these kernel launches here. And notice that there were 100 invocations of this kernel, since I had 100 iterations, and these numbers shown are per-invocation numbers. So for the 100 invocations you need to multiply — well, in this case min, max and average are exactly the same, so just multiply one of these numbers by 100 to get the total number of double-precision floating-point operations.

46:49 | And if you're wondering where this particular number comes from: in our Jacobi code we had five floating-point operations per point. How is that? Because this is one floating-point multiply, [plus three] additions — that's four — and there's one more here, that's five: the subtraction of the old and new values to get the error value. So that's five operations in each iteration. The memory swap is not counted as a floating-point operation — even though you're putting a new value into a floating-point variable, it's not a plus, minus, divide or multiply, so it's not counted as a floating-point operation; it's a memory copy operation. So you have five operations, and we have 4K-by-4K grid sizes. So just multiply 4K by 4K by five, and you get this particular value for each invocation here.
48:03 | you can verify uh these uh numbers . Uh The other one that will |
|
|
48:09 | The other metric that will be useful for you guys for the assignment is dram_read_transactions, and you can
|
|
48:17 | see the other metrics as well. But this one is also interesting to see, and you
|
|
48:24 | can see the number of DRAM read transactions that are performed.
|
|
48:32 | Yes. So here, if I remember correctly, you need to multiply the number
|
|
48:38 | of transactions first with the number of invocations to get the total transactions, and also multiply
|
|
48:44 | it with the data size. I think I gave the exact
|
|
48:49 | formula to use in the readme. But yes, you need to multiply it with the data
|
|
48:52 | size — we're using double precision, that's 8 bytes. So you can get the total
|
|
48:57 | number of bytes that were read from the DRAM memory.
|
|
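The byte-count arithmetic just described can be sketched as below; the transaction count here is a made-up illustration, and the readme's formula is authoritative:

```c
/* Estimate of total bytes read from DRAM, following the formula described:
   transactions per invocation x number of invocations x data size
   (8 bytes per element for double precision).  Inputs are illustrative. */
long long dram_bytes_read(long long transactions_per_invocation,
                          long long invocations,
                          long long element_size) {
    return transactions_per_invocation * invocations * element_size;
}
```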
49:05 | So we got power consumption, floating-point counts, and the memory reads, and I just have one final example.
|
|
49:22 | Mhm. Okay. Okay. Yes. Okay. So, a question for you
|
|
49:45 | guys: do you think this matrix multiplication kernel, parallelized using the pragmas — parallel and
|
|
49:57 | a loop construct — uh, first, will it give you speed-up? And
|
|
50:02 | the second question is: even if you get speed-up, do you think your output will
|
|
50:08 | be correct here or not? Mhm. It's running on GPU — I compiled and
|
|
50:36 | ran it for the GPU. Yes? No one? Okay.
|
|
50:49 | There won't be a speed-up. And then what about the correctness of the
|
|
51:06 | output? Also no. Why wouldn't there be a speed-up, then?
|
|
51:16 | Oh yeah. Uh, speed-up compared to running on the CPU, let's say — I should
|
|
51:29 | have been clearer here. Sorry. Yeah — let's leave the speed-up aside. Do
|
|
51:42 | you think you'll get correct output? No. Why? So,
|
|
51:51 | this is a chance for you maybe to look at it — "the loop variables would
|
|
52:00 | not be correct due to dependence," really? Like — yeah, the variables would not be
|
|
52:11 | correct due to a dependence on — no, but you're close. Uh, okay, let
|
|
52:24 | me simply run this code. Let's see what we get here. And
|
|
52:29 | I've compiled it for GPU, so don't worry about the compilation part. Compilation
|
|
52:34 | is fine. So let's run it — and here it is. Yes, you
|
|
52:45 | are quite close. Yes. So yes. So first I then |
|
|
52:52 | the speed apart, but yes. I know that Diamonds for CPU as |
|
|
52:56 | . And this this time taken for sec is obviously quite fast. And |
|
|
53:01 | would happen on CPU so yes, will get speed up. Uh but |
|
|
53:08 | is error in the output. You're some value that was not expected. |
|
|
53:14 | uh the part that's wrong with this is that you have a race condition |
|
|
53:20 | ci J because you paralyzed this innermost here. So because of the race |
|
|
53:27 | , obviously you might get incorrect out here and so the the right way |
|
|
53:37 | doing this matrix multiplication is — yes — you need to add a
|
|
53:43 | temporary variable, call it temp or whatever, and you do the reduction.
|
|
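A minimal sketch of the corrected kernel being described — the demo's actual source isn't shown here, so the names and data clauses are assumptions. The scalar tmp is private to each (i, j) iteration and the k loop is a declared reduction, so nothing races on C[i][j]. Without an OpenACC compiler the pragmas are ignored and the loops simply run sequentially on the host:

```c
/* Sketch of the fix: accumulate into a private scalar and let the
   reduction clause combine the partial sums, instead of letting every
   k iteration race on c[i][j]. */
void matmul(int n, const double *a, const double *b, double *c) {
    #pragma acc parallel loop collapse(2) \
        copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double tmp = 0.0;
            #pragma acc loop reduction(+:tmp)
            for (int k = 0; k < n; k++)
                tmp += a[i*n + k] * b[k*n + j];
            c[i*n + j] = tmp;   /* written once per (i, j): no race */
        }
    }
}
```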
53:50 | So the point of this example was that even if you put all the parallel pragmas,
|
|
53:55 | loops, whatever directives in your code, and you see speed-up in your code,
|
|
54:00 | it's also important to make sure you're getting the correct output there, because —
|
|
54:05 | with OpenMP it's the same, right? — neither the compiler nor the runtime
|
|
54:10 | will warn you of such a race condition. So just keep that in mind, and
|
|
54:15 | have a verification section in your code. Even for your assignments or your projects,
|
|
54:20 | whenever you work on it, make sure you verify your outputs. Don't simply be
|
|
54:25 | happy if you're getting speed-up with wrong code. So we're running it now.
|
|
54:31 | It works — yes, you need the reduction. Right? Yes. So now
|
|
54:40 | you see the timings for both parts are almost the same, 0.44 and 0.41 seconds.
|
|
54:48 | But here the verification section has passed, because the data matched what was expected
|
|
54:55 | in the C matrix. Yes — and the two students who said it in the
|
|
55:01 | chat — yeah, you guys were quite close there. And that's pretty much all that I
|
|
55:06 | had. Okay, thank you. Let's see if there are any additional comments or
|
|
55:13 | questions from students. Mhm. Yeah, I don't see any for now.
|
|
55:27 | Alright. So I guess I — there's a question: so, what kind of memory do we
|
|
55:37 | have? What do you mean — larger sizes? Yeah. Oh okay.
|
|
55:50 | I think that's for the matrix multiplication. No, no, no — I, uh, compiled
|
|
55:57 | it with managed memory. Uh, yeah, I don't have any data clauses in
|
|
56:04 | it. So, uh — yeah, you are using — let's see — how about... Uh huh. In
|
|
56:21 | that case, no. Um, so that does not depend on the
|
|
56:32 | grid size; you mostly have the bottleneck of the PCI Express lane. So, as
|
|
56:39 | we saw in the initial part of the lecture, all the data starts from the host
|
|
56:45 | memory: you have to transfer it to the accelerator memory, do your
|
|
56:49 | computation there, and then get back — get the results back. So whether you use
|
|
56:55 | managed memory or explicit data-movement directives, those are just two ways of
|
|
57:02 | moving the data between the host memory — the host — and the device.
|
|
57:09 | So yeah, the bottleneck actually is the PCI Express lane; it doesn't
|
|
57:16 | matter much whether you use managed memory or your own data movement. (Question about the cache.) So — no, there
|
|
57:28 | is not much of a role played by the CPU cache in this case, because the communication
|
|
57:34 | is pretty much happening from the DRAM memory on the host side to the
|
|
57:38 | DRAM memory on the device side. Caches only come into play when you're
|
|
57:43 | trying to get data to the CPU's functional units. In that case you need to
|
|
57:48 | reduce the amount of time spent getting data from the DRAM memory. So
|
|
57:52 | that's why you get some data into the cache, work on it, and store it back, because caches
|
|
57:58 | are obviously the faster memory, you know. Yeah, but in the case of host-to-device
|
|
58:02 | transfers, the CPU caches do not matter much. Yeah.
|
|
58:12 | So most systems today, they make use of what's known as direct memory access
|
|
58:19 | or remote direct memory access. So that
|
|
58:28 | means the CPU is very likely involved in starting the transactions, but it's not moving things through
|
|
58:36 | the CPU and writing them back out. Does that answer your question? Does
|
|
58:48 | that answer your question? Yes. I think you had a question — you can go
|
|
58:54 | and if you want to type or can unmute yourself. Yes I am |
|
|
59:04 | am aware of and we both are of uh a few times and Q |
|
|
59:10 | on register for you're going about um so if you mhm. Those of |
|
|
59:20 | who already have accounts on either so china or Korea um should be able |
|
|
59:29 | use them from the class. Uh I will just let me know, |
|
|
59:36 | add you to the allocation we have research for the group or mine. |
|
|
59:45 | So that could be done quickly if has let us know, I don't |
|
|
59:50 | , don't take all the specific procedure you can tell me what, tell |
|
|
59:55 | what you need to do since they set it up and I don't remember |
|
|
60:00 | I only got a question from the Data Science Institute folks to validate
|
|
60:09 | that it's okay to give you accounts. Uh, like, yeah — if you
|
|
60:16 | have an account either on Carya or — no, it would be best
|
|
60:21 | if they use Carya; on Opuntia, the profiling is not available. Yeah.
|
|
60:25 | I know, but you can use both. You unfortunately cannot do everything on Opuntia, but you
|
|
60:32 | can debug codes and do some things; you can't do the measurements. But
|
|
60:39 | yes — if you don't have an account, you can go to the
|
|
60:43 | Data Science Institute's website — I'll send you the link where you can sign up —
|
|
60:48 | and also add Dr. Johnson as the principal investigator. So add Dr.
|
|
60:54 | Johnson's name there, and they'll send him an email to confirm and
|
|
61:01 | add you to their allocation. If you already have an account, just send
|
|
61:06 | me your username, so you can be added. That works on Carya
|
|
61:13 | and Opuntia. I'll send out the links for signing up on
|
|
61:18 | both of the cluster sites. So, let's see — remember that the name of Dr. Johnson goes in
|
|
61:24 | as the principal investigator, along with his email address —
|
|
61:28 | the one that he has used to communicate with them.
|
|
61:33 | Copy it and send it to both of us — yes, for backup, so that
|
|
61:40 | none of us misses it. Yes, and yes, I will also send that.
|
|
61:45 | So, mostly the compilation and profiling commands will all be the same; the
|
|
61:52 | only difference will come from the command to get interactive access to the GPU nodes.
|
|
61:58 | And I'll send you their tutorial pages as well, along with the sign-up links.
|
|
62:05 | So yes, I'll do that. In terms of compiling you won't see any
|
|
62:11 | difference. Yeah, getting to the GPU node will be a little bit different than
|
|
62:15 | what I showed you. Oh okay. I think you can take over.
|
|
62:26 | Okay. Yeah, so, all right — let me share my screen. Just to see
|
|
62:34 | what happens here. Uh, that's not it. Okay — one moment — then do you get it?
|
|
62:45 | Uh huh. Not yet. Yeah, now we got it. Okay. Um,
|
|
62:55 | so, um, I will have some slides on part of what he has already talked about,
|
|
63:03 | and a couple of things, probably, before the end of today's lecture. So this is
|
|
63:12 | the last lecture planned for OpenACC and GPUs, so again feel free to ask
|
|
63:17 | questions as we go along. So — oh — so the first thing is the
|
|
63:27 | picture for what you have already seen in terms of this update directive. It
|
|
63:34 | is a way of explicitly making sure that the CPU and the GPU are in sync, at
|
|
63:44 | points where it's important that they are in sync, or that they both have the same data.
|
|
63:50 | "Sync" is maybe a bit too strong a way to state it, but since they both can compute independently of
|
|
63:56 | each other, that means if you have an array that exists on both the CPU
|
|
64:03 | and the GPU, the correctness of the program sometimes does not depend on them
|
|
64:13 | operating on the exact same data, and it may still work. Okay. And sometimes, at
|
|
64:18 | some point, you may want to make sure that before some other phase of the
|
|
64:24 | code they do have the most recent data, both of them. So you can go in
|
|
64:30 | and make sure of that through this update directive, and the sort of attributes to that are device
|
|
64:39 | and self. So device means to update the device — the GPU — with what the
|
|
64:47 | current values are of, in this case, the array a; and self means that the host
|
|
64:55 | retrieves the current state of, in this case, the array. So it's just directional information
|
|
65:02 | on which way the data is copied — to the device, or to yourself — but it's a way of
|
|
65:07 | doing it explicitly and not waiting, for instance, for the end of the data regions or some other parallel construct.
|
|
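A small sketch of the two directions just described (the array name and sizes here are illustrative, not from the demo; with the pragmas ignored by a plain C compiler, everything runs on the host):

```c
/* update self(a) pulls the device's current values to the host (e.g. for a
   checkpoint); update device(a) pushes the host's current values back. */
double demo_update(int n, double *a) {
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            a[i] = i * 2.0;                 /* computed on the device */

        #pragma acc update self(a[0:n])     /* host now sees i * 2.0 */

        for (int i = 0; i < n; i++)
            a[i] += 1.0;                    /* host-side modification */

        #pragma acc update device(a[0:n])   /* device now sees i*2.0 + 1.0 */
    }
    return a[1];
}
```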
65:15 | Uh, and so this is just, from this point of view — maybe you want
|
|
65:19 | to checkpoint the data. That I/O has to be
|
|
65:23 | done by the CPUs, not the GPUs, so you need to retrieve the data from the
|
|
65:28 | GPU and then write it out to disk, or however you do the checkpointing on the
|
|
65:34 | CPU. Uh, the other thing that was also talked about is enter
|
|
65:42 | and exit data — a more flexible way of defining the lifetime of the various data
|
|
65:49 | structures that you have, and they need to be paired. So you have an enter
|
|
65:56 | data; it needs to be, at some point, followed by an exit data.
|
|
66:00 | So they need to be matched, but it's flexible how they can be matched, in
|
|
66:06 | the sense of where they can exist in the code. And then there are clauses to them.
|
|
66:12 | Uh, so yeah, I should mention: for the enter, that means you
|
|
66:18 | can use the create clause for the data, or copy in the data from the environment
|
|
66:24 | you have when you get to that point, and on the exit you can
|
|
66:29 | copy out or delete — that is, clean up the data space.
|
|
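A sketch of that unstructured-lifetime idea (the function names here are made up for illustration): the enter data and the matching exit data can live in different functions, as long as the program's logic pairs them.

```c
#include <stdlib.h>

/* Unstructured data lifetimes: the device copy is created in one function
   and destroyed in another, which a structured data region cannot express. */
double *device_buffer_create(int n) {
    double *a = malloc(n * sizeof *a);
    for (int i = 0; i < n; i++) a[i] = 1.0;
    #pragma acc enter data copyin(a[0:n])   /* device lifetime begins */
    return a;
}

double device_buffer_sum_and_free(double *a, int n) {
    double s = 0.0;
    #pragma acc parallel loop reduction(+:s) present(a[0:n])
    for (int i = 0; i < n; i++) s += a[i];
    #pragma acc exit data delete(a[0:n])    /* device lifetime ends */
    free(a);
    return s;
}
```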
66:35 | And that was also demoed by Yasha. Here are some examples of how they work.
|
|
66:41 | This is pretty much the unstructured example; it's not really all that interesting, because it's
|
|
66:48 | not that different from the so-called structured one when you're not making use of
|
|
66:55 | the enter/exit data statements' flexibility. But the other one is a little bit
|
|
67:03 | more about how they can be flexible in terms of where they enter the code and
|
|
67:09 | where they're located in the code. But you just need to make sure that somehow the
|
|
67:13 | logic is correct, in that data that are copied in or created at some point are also deleted
|
|
67:26 | or copied out by the exit data statement. Okay, and then we had this that we
|
|
67:36 | talked about — the collapse and the tiling. There was one more example on the slides, so you
|
|
67:42 | can remember it perhaps better than just from the presentation and the demo in the video, but it
|
|
67:49 | just shows the same thing: collapse of the loops, and the explicit copy statements. About
|
|
67:57 | the collapse — in this case it's two. Yasha, as I mentioned,
|
|
68:02 | ran the benchmark and redid it for this class, and there was no
|
|
68:07 | benefit in this case; actually, there was a slight penalty in terms of
|
|
68:11 | the collapse, which is a slowdown of maybe 2%, but not much. So no advantage
|
|
68:19 | in this particular case from collapsing the loops. Um, and here is kind of the example
|
|
68:27 | I promised in terms of the tiling. Here, in this case, if you have a
|
|
68:31 | four-by-four array you want to tile, you could potentially do it in tiles
|
|
68:37 | of 2x2, and in that case you sort of implicitly end
|
|
68:43 | up constructing a loop nest: two loops that traverse the tiles, just two
|
|
68:51 | loops, and then loops that, you know, traverse, say, the x and y dimensions of each
|
|
68:57 | tile. So that's another two loops. That is done implicitly by the tile
|
|
69:03 | attribute that you put on the loops. And then, for something more
|
|
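As a sketch of the two clauses (sizes are illustrative): with tile(2,2) the compiler implicitly builds the four-loop nest just described — two loops over the tiles and two within each tile — from the two loops you actually write.

```c
/* collapse(2): fuse both loops into one parallel iteration space.
   tile(2,2):   implicitly strip-mine both loops into 2x2 tiles. */
void scale_collapse(int n, int m, double *a, double s) {
    #pragma acc parallel loop collapse(2) copy(a[0:n*m])
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i*m + j] *= s;
}

void scale_tile(int n, int m, double *a, double s) {
    #pragma acc parallel loop tile(2,2) copy(a[0:n*m])
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i*m + j] *= s;
}
```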
69:10 | realistic, you probably have bigger tiles, and here was an example, again going back to this,
|
|
69:15 | in which case you may actually choose a somewhat better tile size for larger arrays. And
|
|
69:25 | in terms of finding out the best-performing size, it is a little bit of trial and
|
|
69:29 | error, or sometimes you can have insights, if you know the sizes of the caches in the cache
|
|
69:36 | hierarchy, into how to choose the tile sizes for each level in the
|
|
69:42 | hierarchy. When I talk about how matrix multiply is done, in a future lecture,
|
|
69:47 | I'll go through how to actually get at least a good estimate of what the tile size
|
|
69:53 | ought to be for each level in the memory hierarchy. And I guess
|
|
70:01 | this was also done — you know, we were playing around with introducing the tile
|
|
70:10 | argument, and in this case, yeah, it had a tiny, tiny bit of a positive effect
|
|
70:18 | for the CPU, but not much, and for the GPU it came almost
|
|
70:27 | back to no tiling for the 32x32 tile size. I don't know
|
|
70:33 | if you want to comment anything on this — you did run it? Not
|
|
70:39 | particularly, except that very small tile sizes really do compromise the performance
|
|
70:46 | there. Previously, when I did these benchmarks on some
|
|
70:53 | of the older GPUs — there you could see a significant performance difference.
|
|
71:00 | From last year I remember it had gone somewhere from 10 to almost 14 —
|
|
71:09 | was it 14%? — improvement in performance by choosing an optimal tile size. So while
|
|
71:15 | in this example you see a penalty in performance, choosing an optimized tile size
|
|
71:22 | did give good performance for some of the GPUs. Uh huh. So the
|
|
71:27 | compiler did better this time around. Okay. So, and here it's just showing the
|
|
71:35 | benchmark numbers. And I guess — good. Um. No — okay, so I have
|
|
71:48 | one more slide. Alright, so some more comments again about these three levels of the
|
|
72:04 | hierarchical structure that was just shown, also in terms of the benchmarking. And in terms of
|
|
72:11 | vocabulary — part of the vocabulary, to me, is confusing, in that OpenACC
|
|
72:16 | uses the notions of gang, worker, and vector, and indeed the vendors, deep
|
|
72:25 | down, have different notions. Um, so in their case — I have it on a
|
|
72:36 | different slide — but basically "thread" has a more or less similar meaning: a thread is something
|
|
72:42 | that executes on a core. And then you have collections of threads, and I think I
|
|
72:47 | have it on slides coming up that for the SIMD units — that, in this
|
|
72:55 | case, there are different vocabularies, in terms of CUDA. So for
|
|
73:04 | just the next few slides, I'll switch to the NVIDIA language, but otherwise
|
|
73:12 | I tried to — or I do try to — use open standards, definitions that are
|
|
73:21 | not proprietary. So that's why you see OpenACC: it is an open
|
|
73:27 | standard, and it's also a little bit less worried about the details of how things are
|
|
73:34 | actually executed. The compiler is presumed — is supposed — to have the freedom
|
|
73:41 | to figure out some of the details of mapping it onto the actual specific hardware platform.
|
|
73:46 | So, and then, here is the mapping of the OpenACC — using again the
|
|
73:54 | gang, worker, and vector — versus NVIDIA's ideas: they don't use gang, they use
|
|
74:04 | thread blocks, and worker, I believe, is translated to warps, but we'll come
|
|
74:14 | back to that. But you can, you know, try to direct things by using
|
|
74:22 | these attributes for the loop constructs. There are parallel constructs as well as kernels constructs,
|
|
74:29 | so there's freedom to mix and match. Mhm. And then you can also specify
|
|
74:36 | the numbers — you know, how many. In this case gang translates to thread blocks in terms of NVIDIA.
|
|
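A sketch of steering that mapping explicitly (the counts here are illustrative; on NVIDIA hardware, gang corresponds to thread blocks and vector to threads within a block, with worker in between):

```c
/* Three-level OpenACC mapping made explicit: 256 gangs (thread blocks on
   NVIDIA) of vector length 128 (threads per block).  The clauses are hints;
   the compiler may still choose its own mapping. */
void saxpy(int n, float alpha, const float *x, float *y) {
    #pragma acc parallel loop gang vector num_gangs(256) vector_length(128) \
        copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```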
74:45 | And I think that's worth making you aware of — the OpenACC vocabulary.
|
|
74:55 | If you were to use only NVIDIA you could just use the NVIDIA terms, but in recent
|
|
75:02 | years AMD have been quite successful, and they are also selling GPUs for data centers and
|
|
75:10 | scientific and engineering applications, and they use different names — but they all understand the
|
|
75:18 | OpenACC vocabulary. So here is just another example, specifying, mhm, um, the
|
|
75:30 | data copy — an example — and here's a benchmark of it. So, I just — let's see.
|
|
75:38 | But, as was shown in particular with the tiling example, the compilers have gotten
|
|
75:49 | better over the years, and at this point you may actually be better off leaving it to the compiler
|
|
75:55 | rather than trying to tune it yourself. But it depends again on how good the compiler is for the
|
|
76:02 | particular platform you're using. Yeah, and this slide tries, best it can, to again
|
|
76:10 | make you aware of the vocabulary used by the different vendors. Um, so
|
|
76:22 | NVIDIA has been, you know, dominating in use for attached processors, but not for
|
|
76:29 | GPUs totally: Intel is the largest producer of GPUs, but theirs are integrated on
|
|
76:37 | the same chip. So they are in, you know, PCs and laptops, but not
|
|
76:43 | attached processors. But now, in recent years, AMD has actually been dominating in terms
|
|
76:51 | of gaming GPUs — attached processors — but for the scientific market NVIDIA has been the
|
|
76:59 | dominating one. So typically, if you talk to the engineering and science community, they're very
|
|
77:05 | used to CUDA and that stuff. But some of the terms are common:
|
|
77:11 | "kernel" means the same across the three vendors, but there are other things that are quite
|
|
77:18 | confusing. I would say that, as you see, "shared memory" has different meanings:
|
|
77:23 | it's called shared memory on CUDA and local memory on AMD GPUs, and Intel
|
|
77:30 | decided to combine the two into "shared local memory." Um, and local memory
|
|
77:38 | may be private or something else. So it may be useful, in the long run,
|
|
77:44 | to understand that, since Intel is about to — has started to — compete with the
|
|
77:49 | other two vendors in terms of attached processors. Um — and now a little bit more
|
|
77:59 | specifically about the NVIDIA GPUs, and what I mentioned — thread blocks and warps — and what
|
|
78:09 | all these things mean. Some of you may already be familiar with it, but
|
|
78:14 | remember the structure — which is pretty much certainly the same for AMD —
|
|
78:24 | uh, it's how the GPUs are put together in a hierarchical structure, and the vocabulary
|
|
78:32 | for NVIDIA is "streaming multiprocessors." I'll start here, but there's a corresponding thing for
|
|
78:40 | the AMD family. Mm. And so, in terms of how things are mapped onto the GPUs,
|
|
78:52 | there's sort of one level that is at the streaming multiprocessors, as we will show in the
|
|
78:57 | next few slides, and then things within them. Well, the streaming multiprocessor has a
|
|
79:05 | certain amount of sharing that can be done, but typically there is no sharing in terms
|
|
79:13 | of synchronization between different streaming multiprocessors. In that case, things executing on different streaming
|
|
79:22 | multiprocessors can share, as it shows, level-2 and level-3 memory, but between the
|
|
79:28 | threads on the different multiprocessors there are no sync points. Uh, and
|
|
79:40 | both AMD and NVIDIA have this notion of their underlying SIMD units. As has
|
|
79:50 | been shown a few times, there is a single instruction operating on multiple data, and
|
|
80:00 | that's where this notion of warps comes in. So that's a scheduling unit that the
|
|
80:08 | schedulers schedule on each one of the SIMD units. And typically that means up
|
|
80:15 | to 32 threads. And if you don't have 32 threads in the warp,
|
|
80:19 | you basically lose performance, as we talked about in some previous lecture. So now,
|
|
80:29 | then, thread blocks is kind of the next higher level in the vocabulary. Um, and
|
|
80:35 | the thread blocks were shown in the demo, and it was —
|
|
80:44 | the number 1024, on one of the slides — that is the maximum number of
|
|
80:47 | threads in a thread block. And that means, to do the operations on these,
|
|
80:56 | potentially, the threads in a thread block get carved up into sets of 32
|
|
81:02 | threads, some of them scheduled together as a warp. And then there are multiple thread
|
|
81:09 | blocks — NVIDIA calls it grids. And then the thread blocks, then —
|
|
81:16 | note that the thread blocks cannot be split across the multiprocessors; a thread block is
|
|
81:25 | assigned only to one. So only one multiprocessor works on any given thread
|
|
81:32 | block. But there can be multiple thread blocks assigned to a given multiprocessor. And
|
|
81:41 | I think that's — this one also tells a little bit about sync, as already
|
|
81:49 | mentioned. And the next slide is this notion of coalescing memory accesses. So that means
|
|
81:58 | that for the different threads, if they need to go to level-2 or the global memory that was
|
|
82:06 | in the previous slide, the system tries to make sure that it fills up
|
|
82:11 | the kind of wires to the memory, to make full use of the memory bus.
|
|
82:20 | So that's coalescing, as it's known. And again, the compiler tries to figure
|
|
82:26 | out how to schedule things so this can actually happen, to make full use of
|
|
82:31 | the memory bandwidth. Yeah. And 384 bits wide — that's typical when
|
|
82:39 | you use, um, the graphics type of memory — that's known as GDDR —
|
|
82:48 | and that also comes from fitting the memory buses onto the circuit board.
|
|
82:56 | But some of the GPUs still use that, because it's cheaper memory,
|
|
83:03 | whereas the current GPUs use high-bandwidth memory that is integrated in the same package.
|
|
83:11 | And then you have a wider memory bus. And, well, I guess my time is
|
|
83:20 | up, but I want to have a couple of slides left to show a little bit.
|
|
83:25 | This has NVIDIA-linked memory to the CPU; it was being built for some special-
|
|
83:31 | purpose, I would say, circuit boards for large systems for the Department of Energy.
|
|
83:38 | So in this case it has two of the IBM CPUs on the circuit board,
|
|
83:45 | and each one of them has six GPUs — three linked to each one of the CPUs.
|
|
83:52 | Yeah. And it also supports the notion of this unified memory. So I
|
|
83:59 | think I'll stop there, because my time is up. There are a couple of more slides
|
|
84:02 | basically comparing OpenMP to OpenACC, but we're not doing OpenMP for GPUs in this
|
|
84:11 | class — but you can see the similarity. We'll stop there. Mm
|
|
84:25 | hm. Yeah. Yeah. Any questions? Yeah. Uh huh. And I
|
|
84:37 | will be back in the classroom; I'm traveling today. Mhm. Oh.
|
|
84:47 | No more questions here. Okay, I'll stop recording then. Yeah.
|