© Distribution of this video is restricted by its owner
00:00 | Yeah. Okay, so, last time I just got started to talk |
00:09 | about heterogeneous computing nodes, so I'll pick up where I left off |
00:21 | last lecture, and let's see how far we get. Since it was a little bit |
00:29 | short of time, the first couple of points here for today are just a |
00:36 | recap of the last few comments of that lecture, and then I'll talk a little |
00:41 | more about the programming aspects of heterogeneous nodes, in particular about something called OpenACC |
00:49 | that some of you may be familiar with. It's kind of in the same spirit as |
00:54 | OpenMP, and I'll try to point out a little bit the differences and commonalities, |
01:04 | and why we use OpenACC when anything is about heterogeneous nodes. |
01:13 | And for this class, I would say it is, uh, the same |
01:19 | as what you have used so far in terms of Stampede, for |
01:25 | instance: a host, and some attached processor that may be a GPU or an FPGA or some |
01:32 | other device. Uh, but for this class, uh, it will be GPUs. |
01:40 | So here's the point I tried to make towards the end of last lecture: this |
01:45 | is kind of the node architecture at a high level, if you like, that |
01:53 | will be the basis for the next assignment. And that's typical in terms of |
02:01 | what you may use for lots of nodes these days when you use |
02:08 | GPUs, or potentially even FPGAs or some other device. So, as I pointed |
02:16 | out, the main difference, or differences, are that there tend to be two |
02:24 | memory spaces and two instruction sets. So we'll come back to that in |
02:33 | the lecture, and then the next slide shows examples very quickly, in case you |
02:41 | , um, come across them, or end up using them from some vendor or cloud. |
02:48 | But here's what a kind of GPU card may look like, and they |
02:58 | attach, or connect, to this I/O bus, the PCI Express bus, and |
03:06 | you can see in this kind of picture the gold-colored pins at the |
03:10 | edge of the card; that's what plugs into this bus. And you |
03:15 | can also kind of see, in the left-hand corner here, that GPUs |
03:20 | tend to be power hungry, and need a lot of cooling. So much |
03:26 | of what you see is cooling fans and fins. And in addition to |
03:32 | the GPUs, um, we're going to use the Bridges computer in this class, |
03:41 | but we also have them at the Scientific Computing Institute; Jovic, and I think |
03:48 | one other machine, has GPUs on it. And many of the cloud providers, Amazon, |
03:57 | Google, and Microsoft: many of them do have GPUs in some |
04:02 | of their nodes, so they are commonly accessible, in many shapes and forms. |
04:10 | Then, um, we have FPGAs, which are sort of beginning to show up |
04:17 | and become a little bit more common. The tooling for programming FPGAs has been improved |
04:24 | quite a bit over the years, and they offer some benefits as kind of a |
04:30 | compromise between a fully custom piece of silicon and a standard CPU. And you can |
04:40 | get them on PCI Express cards, or you can do what Microsoft did. They, |
04:47 | and Alibaba, the other big Chinese cloud and, uh, Internet company, |
04:55 | also use FPGAs for their search engines and some of their other functions, |
05:01 | and in the case of Microsoft, also through their cloud service. We will |
05:08 | not use FPGAs in this course, um, but we should be aware |
05:13 | that's another element in terms of heterogeneous nodes that is now rather readily available, |
05:19 | for instance through cloud services. And then the last example I have in terms of the |
05:25 | variety of accelerators: those of you who are interested in machine learning know that Google |
05:33 | designed their own TPU, a tensor processing unit, that is then used to support |
05:42 | things like TensorFlow. And these days they don't sell these units, but they do |
05:48 | give access to them through their cloud platform, and it has the benefits they |
05:54 | claimed over GPUs; that's why they did it as a custom piece |
05:59 | of silicon, um, and they're now on their third generation of TPU. |
06:04 | But it means, basically, programming heterogeneous nodes. And then I want to just |
06:13 | briefly point out that in terms of embedded computing, this has been the norm |
06:18 | for a long time. In fact, in that case you don't use connectivity over |
06:24 | an I/O bus; everything is on the same piece of silicon, and that makes |
06:28 | a big difference in terms of both performance aspects and programming aspects, |
06:35 | but we will not really cover it in this course. Here is one example from Texas |
06:40 | Instruments that, um, for a while also did chips for mobile devices, |
06:47 | though they now stay more on the embedded side with their own processors. But |
06:55 | it's usually, you know, one or more CPU cores; there's a low-power |
06:59 | core design produced by Arm, which you may have seen a lot of headlines about in |
07:04 | recent years, also in terms of high-performance computing, and which NVIDIA is in the |
07:11 | process of buying. But it shows that these chips have a number of different functional |
07:17 | units that need to be programmed. There's also one from Qualcomm, in terms of their |
07:23 | cell phone chip designs, that has a number of processing engines on the same piece |
07:28 | of silicon. And here is another example, just from Intel, an older |
07:33 | one targeting the mobile market, and, as you can see towards the |
07:37 | lower right corner in the picture, it says GPUs included, together with a CPU core set, of course. |
07:44 | And by the way, that may not be something you |
07:51 | necessarily think of, but Intel is in fact the largest producer of |
07:58 | graphics processors. But as of yet, they do not have a kind of discrete component; |
08:06 | the units are all integrated on a piece of silicon. And at about this time they |
08:13 | claim that they will release their first discrete GPU, one that will stand alone, be sold |
08:19 | separately, and be connected over an I/O bus. So they decided to seriously |
08:30 | take up the competition with NVIDIA and AMD in terms of having discrete GPUs. And here |
08:39 | is another one, by AMD, in terms of their integrated, um, GPUs on the |
08:46 | same piece of silicon. So these were just examples, to say that, |
08:50 | well, GPUs also exist integrated on the same piece of silicon, but that's |
08:59 | not something I will cover in this course, how to deal with those and how to |
09:03 | program them. Any questions on this general landscape, or whether you |
09:12 | have accelerators of some flavor, mostly GPUs? [Student:] In terms of communicating the work to be |
09:26 | done by the accelerator, what are the implications of having an attached versus an integrated one? |
09:35 | [Instructor:] It's a huge difference. Uh, I'll come back to it, but very quickly at |
09:42 | this point in the lecture: so, the ones that are integrated, AMD used to call |
09:50 | them APUs, application processing units, with the graphics processing and CPUs and |
09:59 | other accelerators on the same piece of silicon. And so the biggest difference, I |
10:09 | would say, and I will come back to that, is that when it's integrated on |
10:13 | the same piece of silicon, the different devices tend to have access, |
10:25 | equal access, even, to the same memory, which is not true when it comes |
10:31 | to the attached processors or accelerators. Um, and it also means that the data |
10:40 | paths between CPUs and accelerators are shared. So even though the |
10:51 | instruction sets are different for the different devices, it's a lot more homogeneous in terms |
10:59 | of the kind of silicon infrastructure that is being used by the different computational units, |
11:06 | as opposed to when you have attached devices. So it affects how the programming is |
11:13 | done and the tools being used, as well as the performance, and I'll try to point |
11:22 | that out as we go here in the next few slides. So this is, |
11:30 | I think, the last slide I showed last time, just to point |
11:36 | out that in terms of parallelism, or the number of threads that can be supported, |
11:50 | between CPUs and GPUs there's a huge difference. So, typically about two |
11:57 | orders of magnitude, um, difference, in favor of the GPU. And the important |
12:06 | part, I would say, is, and I'll come back to that, |
12:09 | that to get the full benefit, or high utilization, of the streaming processors that |
12:19 | GPUs use, which will be the focus for the rest of the lecture, you really |
12:24 | need to have your application capable of exploiting SIMD, uh, instructions. And, as you |
12:38 | can see, if you look at the other column here, basically there is not |
12:43 | a huge difference in terms of peak performance between CPUs and GPUs. |
12:53 | Yes, a factor of five is by no means nothing, but it's |
12:56 | not orders of magnitude. And whether you actually get this factor of five or |
13:02 | not, it's highly dependent on whether you can actually get vectorization or SIMD to work |
13:09 | for your application. And also, in this case, the GPUs that I put |
13:16 | on the slides are the ones that are designed specifically, I would say, to compete |
13:24 | with server CPUs. So there are other GPUs that may be more |
13:33 | focused on supporting machine learning, and in that case, they may still have |
13:40 | the single-precision capability shown on the slide, but typically their |
13:46 | double-precision performance is way lower. So it's something to keep in mind about the |
13:54 | nature of the devices and what it takes to get good utilization of them, in |
14:00 | terms of the nature of the application and the code being generated. |
|
14:09 | So here I'm coming, a little bit, I think, to help answer the question that |
14:13 | was asked. As I think this picture on the right of the slide |
14:19 | tried to illustrate, some of it, and the text kind of makes |
14:27 | explicit: the attached GPUs, and it's also true for the integrated GPUs, are |
14:40 | not complete processors, so they all need a host. A CPU, you see, is |
14:46 | a totally standard processor. It doesn't need anything else; it has everything. |
14:52 | It has all the instruction decoding as well as memory; it does everything needed to |
14:59 | execute code. GPUs have much more limited capability, um, in terms of flexibility in |
15:13 | dealing with code. So that's why they basically need a CPU, a host. So |
15:18 | that's the one thing that, uh, it's important to keep in mind. |
15:27 | And, as when I talked in the last couple of lectures, I talked a little |
15:32 | bit about GPUs, and I talked about cores. So far most of the course has been |
15:38 | focused on CPUs and their cores, and, as it said in the previous slides, |
15:44 | maybe you get up to a few tens of cores on the pieces of |
15:50 | silicon that make up the CPU, whereas in terms of cores, when it comes |
15:56 | to GPUs, they tend to be in the thousands. So again, the level of |
16:03 | parallelism they have on board, and that one can exploit, is, you know, |
16:10 | up to two orders of magnitude higher. Now, one of the |
16:17 | advantages of GPUs has been that they have typically, ever since they, uh, |
16:29 | first started to appear as devices you attach to an I/O |
16:33 | bus, had 5 to 10 times higher memory bandwidth than what CPUs have. |
16:44 | So one of the big advantages for GPUs has been how much memory |
16:50 | bandwidth you get. On the other |
16:55 | hand, coming back to the question that was just asked: if you have an |
17:01 | integrated GPU, it uses the same memory. That means, yeah, it doesn't have |
17:07 | the advantage of significantly higher memory bandwidth; its bandwidth is the same as for the CPU |
17:17 | . The other difference that is important to keep in mind is that GPU memory |
17:25 | tends to be a lot smaller than the memory on a CPU. So today, |
17:36 | many high-end GPUs may have up to 32 gigabytes. Um, |
17:41 | CPUs, on the other hand, you see, may have terabytes of |
17:43 | memory. It's not typical, but nothing prevents you from configuring a |
17:51 | CPU node with terabytes of memory, and some of the nodes of current supercomputers, |
17:59 | some of the richest nodes, do have terabytes of memory. |
18:08 | And then the thing that connects these is this I/O bus. Last time I talked a little |
18:12 | bit about the PCI Express bus, and showed |
18:19 | that it is kind of a very thin pipe compared to the memory buses, even |
18:26 | those of the CPU, and even more so compared to the GPU's. |
18:36 | So here is kind of the model of how heterogeneous nodes, in terms of |
18:48 | attached processors, work. So it's basically that things start, and end, on the CPU, |
18:53 | and in order to get anything done, one has to, well: the |
19:01 | application comes with some initial data; it starts out on the CPU, |
19:05 | in the CPU memory, and it needs to be moved over to the |
19:12 | device, the GPU in this class. Then you have to move |
19:19 | the code over, and then you can start the execution, and then the GPU |
19:26 | typically can then proceed asynchronously with whatever it is the |
19:31 | CPU may want to do. And at some point, the results are supposed to come back to the CPU. |
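The flow just described (initial data in CPU memory, moved to the device, kernel launched, results copied back) can be sketched with OpenACC data clauses. This is a minimal illustration of the pattern, not code from the lecture; the function name and sizes are made up, and when compiled without an accelerator the pragma is simply ignored and the loop runs on the host.

```c
#include <stddef.h>

/* Offload sketch: copyin moves a and b host -> device before the region,
 * copyout moves c device -> host after it. The loop body is the "kernel". */
void vec_add(const float *restrict a, const float *restrict b,
             float *restrict c, size_t n)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With the PGI/NVIDIA compilers this would be built with something like `pgcc -acc`; with GCC, `-fopenacc`.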
19:40 | Now, since the GPU memory is significantly smaller than the CPU memory, it is |
19:48 | often not possible to move all of the input data over to the GPU before execution |
19:57 | starts; it actually has to be done in phases, where things get moved over and |
20:03 | maybe come back to the CPU. So there may be a fair amount of interaction between |
20:10 | the CPU and GPU during the execution of the kernel, in order to be able |
20:16 | to process the entire data set. |
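The phased movement described above can be sketched as a loop over chunks, where each pass transfers one slice to the device, processes it, and brings it back, so the whole array never has to fit in GPU memory at once. This is an illustrative pattern, not the assignment's code; the chunk size would be chosen to fit the device.

```c
#include <stddef.h>

/* Process data larger than device memory in phases: the copy clause
 * moves only the current slice to the GPU and back on each iteration. */
void scale_in_chunks(float *data, size_t n, size_t chunk, float factor)
{
    for (size_t start = 0; start < n; start += chunk) {
        size_t len = (n - start < chunk) ? (n - start) : chunk;
        #pragma acc parallel loop copy(data[start:len])
        for (size_t i = start; i < start + len; i++)
            data[i] *= factor;
    }
}
```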
20:26 | And this is just re-emphasizing, uh, how it kind of works. Things start on the CPU side and get moved over to the |
20:30 | GPU, possibly in phases, and the PCI Express bus may have a |
20:38 | severe performance impact. In fact, depending on how much computation you can do in the |
20:44 | GPU per transfer of data between the CPU and the GPU, one has to watch |
20:52 | out, when you read the literature, for whether people are actually telling you the full story in |
21:00 | terms of reduction in compute time, or speedup: whether it just looks at the GPU |
21:09 | by itself, or whether they actually include the bus transfers on the I/O bus and |
21:15 | report the total-time speedup. So one needs to be careful: if |
21:22 | there is not much computation, then the GPU, capable as it is in itself, may speed things |
21:28 | up, uh, by a large factor, and that can still totally get killed, sort of, by the |
21:35 | slow transfers between the host and the device. Ah, any questions on that |
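The "full story" point is easy to make concrete with a toy model (the numbers in the test below are illustrative, not measurements): an honest speedup divides the CPU time by kernel time plus transfer time, not by kernel time alone.

```c
/* Kernel-only speedup: the number that can overstate the benefit. */
double kernel_only_speedup(double t_cpu, double t_gpu_kernel)
{
    return t_cpu / t_gpu_kernel;
}

/* End-to-end speedup, counting host<->device transfers over the I/O bus. */
double end_to_end_speedup(double t_cpu, double t_gpu_kernel, double t_xfer)
{
    return t_cpu / (t_gpu_kernel + t_xfer);
}
```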
21:48 | general picture, or on understanding how the structure and the trade-offs work? [Student:] So the |
21:56 | arrow that points to the right, "offload": that's the, ah, the instructions, |
22:03 | or, I guess the way you were phrasing it earlier, the code |
22:07 | being transferred to the GPU, right? And, of course, the other |
22:12 | one, that just transfers data, is, uh, the data on which those instructions will |
22:20 | operate, right? Um, so are they using the same PCI E |
22:25 | lanes to do that? [Instructor:] Yes, it's, um, it uses the |
22:36 | same PCI Express lanes. There may be a difference in terms of lanes being used for |
22:45 | code and lanes being used for data, but it's a good question. So |
22:54 | these PCI Express buses are maybe often 16 lanes wide. Whether you use all |
23:02 | 16 lanes, both for code and data: sometimes there may be two, |
23:09 | but commonly it is, depending on what the device is that's being attached, four, eight, |
23:17 | or 16 lanes wide. |
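As a rough rule of thumb for the lane discussion (the per-lane figure below is a round number of my own, not from the lecture): a PCIe 3.0 lane delivers close to 1 GB/s each way after encoding overhead, so the aggregate bandwidth scales linearly with the lane count.

```c
/* Aggregate PCIe bandwidth estimate in GB/s: lanes times per-lane rate.
 * For PCIe 3.0, gb_per_lane is roughly 1.0, so a x16 slot gives on the
 * order of 16 GB/s, still far below on-card GPU memory bandwidth. */
double pcie_bandwidth_gbs(int lanes, double gb_per_lane)
{
    return lanes * gb_per_lane;
}
```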
23:27 | So, a little bit more on the devices again: each device, the GPU or FPGA, and the TPU |
23:33 | as well, has its own instruction set. So that means there are, as I said early on, different |
23:41 | toolchains, and many of you know, and maybe some of you have |
23:46 | used, CUDA for programming NVIDIA GPUs. CUDA belongs to NVIDIA; it's proprietary |
23:54 | and doesn't work on competing GPUs. Um, that's why I |
24:02 | stayed away from using it in this course. OpenCL is an open standard |
24:09 | that, uh, is supported, in principle at least, by several vendors. It |
24:18 | was initially driven by Apple and AMD, and had quite a few vendors, |
24:26 | uh, buying into the thing, including Intel and NVIDIA. But in the |
24:34 | case of NVIDIA, they focus on CUDA, and OpenCL is a bit of a |
24:39 | stepchild, I will say. Intel has been a little bit more forthcoming in |
24:47 | terms of trying to support the construction of good compilers for OpenCL, |
24:54 | and AMD has also been, um, a good supporter of OpenCL, but it |
24:59 | hasn't had the financial resources of NVIDIA or Intel. So, |
25:05 | OpenCL has kind of improved over the years, but, |
25:12 | um, it's still a little bit of an issue to use it. And that's why, |
25:17 | for this class, I decided not to use it, in part because of the availability of tools |
25:22 | for it. OpenACC, which I will focus on for the rest of |
25:27 | this class, is something we'll use, and I'll give more background on why in |
25:33 | the next few slides. The other thing that one needs to pay attention to, |
25:39 | and that was the focus of the last lecture, is the ability for compilers to generate |
25:49 | vectorized code, or SIMD code. And that is in particular critical for GPUs, because |
25:54 | it's the basis for getting good performance, um, on GPUs. And OpenCL, again, was |
26:04 | designed to support generating code for GPUs in a good way. But, as |
26:10 | I said, its compilers have not had the level of sophistication of the compilers |
26:20 | for CUDA or for OpenACC. Then we talked about, you know, |
26:28 | for CPUs, we focused on OpenMP as one of the programming paradigms for them. |
26:35 | So, um, a little bit about OpenACC and OpenMP: |
26:46 | how did this come about, and what are the differences? And |
26:51 | I'll talk about OpenACC because part of the purpose is that we will be |
26:57 | using it for the next assignment. Look, so, a little bit that one can |
27:08 | say on any one of these efforts: OpenMP was started by the |
27:20 | whole parallel computing community, a big, broad community, and it was, from the start, an open |
27:30 | standard, uh, with both academics and companies supporting the idea of OpenMP as |
27:39 | a way of programming multi-core chips. Um, as I said when I started to |
27:48 | talk about OpenMP, it was to try to create a simplified way, or layer, on top of |
27:55 | POSIX threads, for instance, so as to make it a little bit more tractable to deal with |
28:03 | multi-threaded systems. So it was very much focused, again, on how to make it |
28:12 | easy to use the inherent properties, uh, of those types of multi-core |
28:21 | systems. And that meant it was kind of, as I like to say, a |
28:27 | prescriptive system: you tell, to a large extent, what you want the system, or the compiler, to |
28:36 | do. And now, over time, in addition to the many cores on |
28:46 | CPUs, GPUs and accelerators in general were becoming quite common. |
28:59 | And, um, OpenACC was started by a few vendors, with NVIDIA |
29:09 | being one of them, and Cray being another. Those were the two, I |
29:18 | guess, main companies that drove it. But they kept it proprietary, |
29:24 | because at the time Cray's computers were kind of high-end systems, and the GPUs |
29:31 | they had also started to use were NVIDIA's. And that was at a time when |
29:37 | NVIDIA didn't have much competition in terms of GPUs used for engineering and scientific |
29:45 | computation. As I said, Intel is still the number one producer of graphics processors, |
29:50 | but theirs are all integrated on a piece of silicon, in terms of laptop- or desktop- |
29:58 | type use. And AMD, which has been a significant GPU manufacturer as well, |
30:07 | focused on the gaming market, more than NVIDIA. I would say both did well |
30:12 | in that market, and they were more or less, I would say, equally |
30:17 | successful, but AMD did not use that success in terms of trying to build something for |
30:23 | the data center, or the scientific and engineering market. So, basically, OpenACC was |
30:29 | a separate, proprietary effort for a few years, and I think after five or |
30:35 | six years they, like many others, realized that proprietary is not necessarily a good |
30:41 | idea for broad acceptance. So at this point, OpenACC is |
30:49 | an open standard. But OpenACC, as I said, started with NVIDIA |
30:55 | being a strong driver. So that meant they tried to figure out how to make |
31:00 | programming, uh, heterogeneous nodes with GPUs, ah, a |
31:10 | little bit easier than using CUDA for programming the GPUs. So at the |
31:18 | very highest level, the idea, the approach, is that they're using directives, and trying to |
31:24 | build their layers on top, to make the programming of accelerators, overall, |
31:32 | somewhat easier. But in the end, the starting point had this |
31:38 | difference from where the OpenMP standard started: a different notion of threads, with the |
31:47 | capabilities, of course, being significantly different. The GPU cores, |
31:55 | uh, at that time were, I would say, compared to CPU cores, exceedingly |
32:02 | simple. So OpenACC started from massive parallelism with very simple threads |
32:11 | and capabilities of the cores, whereas OpenMP started at the other end. |
32:17 | But, as it says, starting about five years back with OpenMP |
32:26 | 4.5, OpenMP then started to also figure out how to extend the capabilities of |
32:33 | OpenMP to deal with accelerators. So the OpenMP, eh, I |
32:44 | think there's now a 5.0 standard, has many of the features, uh, that |
32:50 | OpenACC has, and, uh, OpenACC has many of |
32:55 | the features that OpenMP has. There is still a difference in the models, so |
33:06 | to speak, in that OpenMP tends to be more prescriptive. And the |
33:13 | idea, at least the argument from the, ah, OpenACC folks, is that |
33:18 | their approach is descriptive, and leaves more room for compilers to figure out how to |
33:26 | generate good code. And I will try to show some examples of what capabilities |
33:33 | OpenACC compilers have, later in the lecture. |
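The prescriptive-versus-descriptive contrast can be shown on one reduction loop. This is a generic illustration, not from the lecture slides: the OpenMP version prescribes a mapping onto CPU threads, while the OpenACC version describes the parallelism and leaves the gang/worker/vector mapping to the compiler. Without the respective compiler flags, both pragmas are ignored and the loops run serially.

```c
/* Prescriptive: run this loop across CPU threads, combining partial sums. */
double dot_omp(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Descriptive: declare the loop parallel and the data movement; the
 * compiler decides how to map it onto the accelerator. */
double dot_acc(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```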
33:41 | So here's a little bit more on the history. The efforts were at some point independent, um, and the |
33:49 | idea in the community was that both approaches |
33:58 | have their merits, so at some point, uh, these two efforts should be kind of merged, or integrated into one, |
34:07 | um, approach, if you like, with a set of compilers being capable, or having, the |
34:12 | best of both worlds. That didn't quite happen. Um, the OpenMP community, as |
34:27 | I said, was focused on multi-core for a long time, until version 4.5 |
34:35 | of OpenMP. And, as you know, with Intel being the dominating player in |
34:40 | terms of CPUs, they were kind of highly focused on |
34:48 | making OpenMP a good tool for using, um, the cores on CPUs. |
34:55 | And then they started to branch out, because today Intel is also interested in |
35:02 | accelerated systems. So they, as some of you may know, even have |
35:09 | chips that have an integrated FPGA on the same piece of silicon, or in |
35:14 | the same package, I should say, with CPUs, and they're about to |
35:20 | release stand-alone GPUs. Meanwhile, on the OpenACC side, only |
35:32 | two hardware manufacturers are in the OpenACC consortium, and there |
35:36 | was also a compiler software company, PGI, or the Portland Group, one of the |
35:47 | independent compiler companies; they built compilers for CPUs and for |
35:53 | heterogeneous systems, until, a few years back, they were acquired by NVIDIA. So |
36:01 | PGI, now owned by NVIDIA, is, I would say, really highly focused on making |
36:09 | sure that code compiled using their compilers runs really well on NVIDIA GPUs |
36:18 | . That means it also has to work reasonably on CPUs, because |
36:26 | NVIDIA sells the attached versions, and GPUs need a host, and most such |
36:35 | systems use Intel CPUs, so they generate code for CPUs as well. But |
36:43 | in the end, NVIDIA is focused on selling |
36:51 | GPUs, so that's where their focus is. So this kind of merge didn't |
36:57 | happen. So, unfortunately, I would characterize things as a bit of a |
37:09 | mess in terms of programming attached processors. So, um, AMD, |
37:25 | for us as end users, has had a pretty strong comeback from a couple of |
37:37 | serious downturns in their business, and they are competitive both in terms of CPUs and GPUs, |
37:45 | and across the entire range of CPUs and GPUs, um. And they have sort |
37:55 | of not abandoned OpenCL, but they are now pushing an open-source effort |
38:05 | known as the Radeon Open Compute, uh, initiative, and they have some buy-in |
38:15 | in terms of that, and are making progress in making things usable on both Intel and |
38:23 | AMD CPUs, as well as, um, their own GPUs. And the |
38:34 | reason is clearly that they want to make sure that code written for |
38:41 | NVIDIA GPUs is reasonably easy to move over to their GPUs, just for |
38:46 | marketing reasons. Intel has, you know, been pushing the more recent versions of |
38:55 | OpenMP, to then get the goodies of being able to generate code also for |
39:04 | , um, GPUs, uh, and they have recently started another initiative |
39:12 | that I may talk a little bit about later, but it's not fully out there yet; it's more |
39:18 | in an early version. It's something they call oneAPI, which is supposed to be |
39:24 | basically based on OpenMP, and the idea is to be able to use |
39:33 | kind of the same source code to target CPUs, GPUs, FPGAs, and |
39:40 | other accelerators. And as I said, uh, NVIDIA, they kind of take |
39:47 | OpenACC and generate CUDA. IBM, they don't really have |
39:56 | too much of a stake in these wars; they have kind of predominantly been |
40:04 | building compilers for OpenMP. But because their high-end systems |
40:13 | have been focused on using NVIDIA, uh, GPUs, they kind of |
40:21 | map the code onto CUDA code for those GPUs. And it also means that |
40:31 | sites like Pittsburgh, that don't have IBM hardware, don't |
40:39 | necessarily have good compilers, OpenMP compilers, that can support NVIDIA or other GPUs. |
40:50 | And then Cray has also had their own compilers, and their customers have |
40:59 | been using them with, uh, NVIDIA and AMD until now. Now some customers are actually starting |
41:06 | to use Intel GPUs, so we'll see how that kind of war plays out. |
41:13 | And then, finally, GCC: they have so far basically been focused on OpenMP, |
41:21 | the open standard with a little bit broader backing than |
41:25 | OpenACC, though OpenACC is also open. So it's unfortunate; in the end, |
41:34 | it depends what platform you're on, and what compilers are available for that particular platform |
41:43 | . In our case, we use Bridges, which has NVIDIA GPUs |
41:51 | and Intel CPUs. And so there, the best combination for Intel CPUs |
42:02 | and NVIDIA GPUs is the OpenACC compilers. So, any questions on |
42:14 | that? So that's the reason why we ended up here. Now, instead of |
42:20 | continuing to talk about OpenMP, I'll switch and introduce OpenACC. |
42:27 | [Student:] So, it might be a bit off topic, but would you say |
42:34 | the job market for compiler engineers is alive and well? [Instructor:] I hope that it is |
42:43 | , and I'm quite sure it is. Um, now, on your question: |
42:52 | there's a bit of hesitation around software, and in particular software in the |
43:03 | form of tools, and I count compilers as a tool. That market |
43:10 | has been a difficult one. As I mentioned, PGI was an independent compiler company; |
43:18 | they were acquired by NVIDIA three to four years ago, I don't remember exactly. One |
43:26 | of the earlier compiler companies, you know, was Kuck and Associates, and they were an |
43:36 | independent compiler company that did quite well. Uh, and when Intel moved |
43:43 | to multi-core chips, they realized that the programming of these is a lot more |
43:51 | complicated: "we need better tools." And then they started to, because they |
43:57 | had the money, build up a software suite, or software effort, and they |
44:05 | acquired this company, Kuck and Associates. So today, I don't know too many, |
44:12 | I don't know practically any, independent compiler companies. So Intel has an effort, IBM has |
44:21 | efforts, and AMD has efforts, and the complexity of |
44:29 | modern systems is increasing, and to the extent they have the money, they are |
44:35 | spending a lot on building very good tools, including better compilers. So, you know |
44:47 | , from my perspective, I hope some of you get into that business, because to |
44:54 | make, uh, the productivity of generating good code higher would benefit the entire community. |
45:09 | So, on to talk more specifically about OpenACC. It's like |
45:16 | OpenMP: the same idea of directives. And here is |
45:23 | basically the additional complication: compared to the original OpenMP, things are |
45:32 | quite different, not only in terms of the target's, um, core capabilities being |
45:44 | different, uh, but there are two instruction sets and two memories; in principle, |
45:52 | at least, two kinds of devices, the CPU and the GPU. |
45:59 | So one needs to generate code for two different things, and figure out how to manage |
46:06 | these two memory spaces. So, right: uh, both the |
46:15 | extended OpenMP, 4.5 and onward, uh, and OpenACC have their directives |
46:22 | to tell compilers, like an OpenACC compiler, what is supposed to become |
46:30 | code for an attached device versus code for the host device. So, in concert |
46:39 | with the compiler-generated code, the runtime system then manages |
46:47 | what gets executed where, and the data, and code, transfers. |
46:57 | And here is just a bit of the structure. It's taken from IBM: |
47:02 | they have the piece of code that is targeted for their POWER-type CPUs, |
47:11 | and the tools generate intermediate code and then optimize that for their |
47:17 | CPUs; and then, based on the directives that tell what's supposed to be, in |
47:22 | this case, something for an NVIDIA GPU, that part of the code goes in and is handled |
47:31 | by, basically, NVIDIA tools and compilers, to optimize it for the NVIDIA GPUs and then |
47:41 | generate code for their GPU. So it's kind of an integrated system, but it |
47:47 | kind of makes use of two different compilation processes and code generators, in order to |
47:54 | eventually come up with a piece of code that gets linked together. And then the runtime |
47:59 | system knows what's supposed to be executed where. And this is what, yes. |
48:08 | So here's a little bit of what the claim is for OpenACC. And |
48:13 | in case, just as you may know, and I just mentioned it: POWER is |
48:18 | the name that IBM uses for their high-end processors; Sunway, if I |
48:26 | remember, is one of the Chinese ones; and x86: both Intel and AMD support |
48:35 | the x86 instruction set; it's not identical instructions, but the |
48:41 | core is the same. Um, I think this slide is, as far |
48:50 | as I can tell, only partially true anymore, because once PGI, |
48:59 | that was, again, the compiler company behind OpenACC, was acquired by |
49:03 | NVIDIA, I think they stopped evolving code generation and optimization for AMD GPUs, |
49:14 | and they are now entirely focused on NVIDIA. Um, the structure is very much the |
49:22 | same, uh, that is, again, directives in the form of pragmas. And the |
49:29 | only difference at the highest level is that it says "acc" instead of "omp". So it's just, |
49:37 | in this case, that an OpenACC compiler recognizes this is something that it |
49:43 | should treat as a directive, figuring out how to generate proper code for |
49:51 | an attached device. And otherwise, let me point out something in the |
50:02 | vocabulary that some of you may be used to if you have done GPU programming. |
50:10 | And it's a bit unfortunate, again, that the terminology is different, but it |
50:17 | is what it is, and I guess you should learn it, so I will use it. |
50:22 | So, in terms of GPUs, one talks about workers, vectors, and gangs, as |
50:34 | shown at the top. I don't, uh, remember exactly |
50:42 | how much of this is also carried over to the OpenMP 5.0 version, but |
50:49 | I think it might be. The structure of these things is really a reflection of the structure |
50:55 | of GPUs. So, the same slide I showed, uh, a |
51:03 | way back in a previous lecture, when I talked about GPUs, but also now |
51:08 | in this case: so GPUs are built up as a replication, in a |
51:20 | hierarchical way, I would say. At the top level there are these, I think, graphics |
51:28 | processing clusters; I think that's what NVIDIA calls them. |
51:34 | And in this case, for the current generation, I think there are six of these |
51:41 | on a single piece of silicon. Inside one of these graphics processing clusters |
51:51 | are the streaming multiprocessors, and they have shared access, in the current version, |
51:59 | too, across all the streaming multiprocessors in each one. There are actually a number |
52:09 | of them, I should say, in each; I |
52:19 | don't, uh, remember exactly how many, ah, but there's a bunch of them in |
52:26 | each of these GPCs. And then, inside each of the, |
52:30 | uh, streaming multiprocessors, are the CUDA cores, as they are called. |
52:37 | cool decor is kind of the processing on principle. It is the corresponding |
|
|
52:48 | too, you know, x 86 and into the rmd CPU, but |
|
|
52:57 | pointed out to started out Maybe they and these cars are much, much |
|
|
53:08 | than an exodus six core. But also means that the footprint in silicon |
|
|
53:15 | a lot smaller, and that's why can get so many of them and |
|
|
53:19 | of them on a single piece of . Yeah, so in the management |
|
|
53:29 | of parallelism, one needs to be aware that this structure is actually reflected in |


53:37 | the programming. So a worker is something that is assigned to a single core — a CUDA core, |


53:50 | when it comes to NVIDIA. And this is a similar thing: when |


53:55 | it comes to NVIDIA GPUs, there is basically a single thread |


54:02 | per core. There's no multithreading in a CUDA core, of course — a single thread per core. Now, |


54:10 | a number of these cores, um, sit in each one of these streaming |


54:20 | multiprocessors. And typically there are 32 of these cores in each one of these |


54:26 | streaming multiprocessors, and this is where the vectorization and SIMD features come into |


54:36 | the picture: if they do anything at the same time, they all do the |


54:43 | same instruction on different data. So the SIMD behavior happens among these CUDA |


54:54 | cores in one of these streaming multiprocessors. So that's what the notion of the |


55:00 | vector is, then, in principle. So you can think of the worker as a |


55:12 | thread, in the OpenMP vocabulary, the way we have thought about this before, and |


55:18 | the vector is kind of a similar thing, though it's not quite the same: |


55:26 | um, the vector feature in OpenMP exists within a single thread. In |


55:36 | this case, it means you get a group of multiple threads pulled together — sort |


55:41 | of one thread per CUDA core — as a unit. And then there is the notion of the |


55:52 | gang, where, well, each member of the gang is mapped to a |


56:00 | streaming multiprocessor. So these concepts are critical to keep in mind for how the |


56:10 | parallelism is supposed to work, and the point of this slide is to realize that unless you |


56:18 | can use the vector feature, you lose a lot of the processing power of the GPU: |


56:25 | instead of using 32 cores — 32, of course — you may only be |


56:32 | able to use a single one. All right. Any questions on that? We'll |
|
|
56:42 | have examples on this coming up. So, all right. So this construct is pretty |


56:56 | much identical to the OpenMP parallel directive. So it does the same thing, |


57:04 | but now we call them gangs instead of threads. And each member of the gang |


57:12 | executes identical code, so it's redundant execution. So here is the classic example with |


57:21 | the for loop — that's what it is. And I think we have the |


57:25 | same type of example as when I talked about OpenMP. That's probably |


57:30 | the situation you want to avoid. So, like with OpenMP, |


57:35 | you want some work sharing so that you can get more parallelism out of your |


57:43 | code and basically divide up the work. Yeah, the work sharing follows, |


57:49 | and in that case, instead of "omp for", OpenACC uses the name "loop". So |


57:58 | that's the way that you get work sharing in a for loop. So at this point it's |


58:06 | kind of pretty much just the same thing as in OpenMP. All |
|
|
58:15 | right. So also, again, like in OpenMP, the loop directive immediately precedes the |


58:22 | loop it applies to. Uh, that's work sharing, so to speak. So this is kind of |


58:28 | what happens. From there you can, uh, also use it for load |


58:39 | balancing, so to speak, within loops. Uh huh. It takes a |


58:45 | particular, you know, type and structure of the code in |


58:49 | order to actually be able to support that — that's the message. The other thing |


58:57 | I'll try to point out is that, uh, you have to be a bit careful and basically use |


59:03 | the parallel loop construct separately for each loop nest that is to be parallelized. So here, |


59:12 | now, we'll talk about a few of these clauses that can be used together with the |


59:17 | parallel construct. You can manage parallelism in terms of the work-sharing clauses: you |


59:28 | can help manage the number of members of gangs, the number of workers, and |


59:37 | the vector length; you can specify the device, and so on. And also, in terms |


59:45 | of data management, you will recognize many things from OpenMP in terms of copyin |


59:51 | and copyout, and also, uh, private, firstprivate, etcetera. So I'll say |


59:58 | a little bit about this, but not all that much, um, in part |


60:05 | because they are very similar to OpenMP. Um, I don't want to go |
|
|
60:14 | too much beyond what I covered for OpenMP. In OpenACC there is also — |


60:24 | well, I think the parallel construct is a little bit of OpenACC's effort to be, |


60:31 | um, similar to OpenMP, but they started out with one |


60:37 | more, another construct, that they call the kernels construct. And I will not talk |


60:42 | about it too much today, but rather next time. But that scheme gives a little |


60:48 | more freedom to the compiler to restructure loops and try to optimize things for the |


60:56 | device — so, at least, more freedom to the compiler. This is the descriptive spirit |


61:03 | of OpenACC versus the prescriptive spirit of OpenMP. So here is a |
|
|
61:10 | little bit of the policies. Um, the difference is, um, in terms of the |


61:22 | parallel construct: it is similar to OpenMP in that things remain fixed. |


61:29 | Once things are set up for, say, an instance of a parallel region, with a given |


61:35 | number of gangs and workers and vectors for the region, it doesn't change dynamically. |


61:43 | Um, whereas in the kernels case, things are a lot more flexible and changes |


61:51 | can happen. So, um, right, now let's talk about one clause, and |
|
|
62:02 | then I'll talk about, um, an example. The only clause I |


62:09 | mentioned earlier was the reduction clause, because I used it in the example I showed |


62:14 | before, so — that's also supported in OpenACC. And that means that the |


62:18 | compiler, like in OpenMP, generates the code to make sure that things happen correctly |


62:25 | in terms of the reduction. And that's pretty much the take-home message from this |


62:31 | slide. Um, and these reduction operators are the common ones, supported in |


62:38 | OpenMP and most programming models. So now, a little bit of getting |
|
|
62:47 | toward the example. And I will do an example showing, a little bit, |


62:54 | um, a couple of the compiler flags, and then what the consequences are of using them, |


62:59 | maybe, for a simple example. So PGI's |


63:10 | compiler has this -fast flag that basically encourages the compiler to do whatever it |


63:16 | can do to try to optimize the code. So that's, I think, what |


63:22 | will be used in the examples I show. There's another flag, -Minfo, that allows you to |


63:29 | get information about what the compiler has done to the code. And, |


63:36 | um, there are different, you know, options for that flag. And, |


63:42 | I mean, it can tell you all the changes that it did, or |


63:48 | certain optimizations, or things just focused on the, um, accelerator. Mm. Then, |


64:00 | as I mentioned, code needs to be generated for both host |


64:07 | and accelerator. But you can use an |


64:14 | OpenACC compiler, like the PGI compiler, to generate code for the |


64:21 | host only. And then you use the target-architecture flag, -ta, |


64:30 | with the multicore attribute for that flag. Or you can use, |


64:38 | uh, -ta=tesla, and use that for the GPU, because Tesla is |


64:43 | one of NVIDIA's GPU product lines. And on the same flag I used, |


64:50 | as part of the attributes for the flag, yet another attribute, managed, and that tells the |


64:58 | compiler that it should kind of manage the data movement |
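Putting the flags together, the compile lines might look like the following sketch for the PGI (now NVIDIA HPC SDK) compiler. The source file name `jacobi.c` is my placeholder; the flag spellings follow the PGI conventions described above.

```shell
# Host-only build: optimize aggressively (-fast) and report what was done
pgcc -fast -Minfo=all jacobi.c -o jacobi_cpu

# Multicore CPU target via OpenACC
pgcc -fast -Minfo=accel -ta=multicore jacobi.c -o jacobi_mc

# NVIDIA GPU target; the 'managed' attribute asks the compiler/runtime
# to handle data movement via unified (managed) memory
pgcc -fast -Minfo=accel -ta=tesla:managed jacobi.c -o jacobi_gpu
```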
|
|
65:09 | . Maybe I'll do a little bit of this example and then take some questions, if |


65:15 | somebody wants to ask a question. So this is, again — matrix-matrix |


65:27 | multiplication and Jacobi's method — matrix-matrix multiplication and Jacobi's method |


65:32 | as an iterative solver are canonical examples that are used by compiler people and HPC |


65:40 | people very, very often. So this one was used before, and you'll |


65:45 | see it again, I'm sure, before the course is over. So, |


65:55 | here we just have the Jacobi solver applied to the Laplace equation, as basically a relaxation scheme where, in |


66:05 | this case, you use the blue points — basically the average of the values at |


66:10 | the blue points — to get the value at the red point in the center of the |


66:15 | stencil. And this is just on a square grid. So now, here is the sequential code, |


66:29 | um, for this Jacobi iteration type of scheme, and what it is: it's two loops, in |


66:40 | the upper half of this slide. You're going to go through all the points in a |


66:47 | kind of 2D traversal of, uh, the grid-point values, in the |


66:57 | x and y — or i and j — directions. So the statement in the loop |


67:02 | body is just doing the averaging for each one of the points. And then you need |


67:08 | to figure out what the error is, since eventually you want |


67:14 | things to converge. So they first compute the average of the points into the new |


67:23 | point — that is Anew — and then they figure out what the error is. And |


67:30 | then the first two loops also figure out what the maximum error is, |


67:35 | then, anywhere across the grid. And then, um — once you have done |


67:47 | that — because the Jacobi iteration kind of works in a very structured, ordered |


67:56 | way, you have to use all the old points, um, before you update |


68:03 | and use any new point. So — what you see doesn't quite show |


68:09 | in the equation at the top, but a Jacobi iteration basically evaluates all the red points |


68:16 | before it goes to the next iteration and makes them blue. So once you are done |


68:25 | computing all the red points — the new points updated — then you basically make them |


68:33 | blue. And then, as long as the error is not sufficiently small, or you |


68:43 | haven't gotten tired of iterating and reached the maximum iteration count, you keep looping. So |


68:48 | that's the way it loops. So this is the way the sequential code works. |
|
|
68:57 | So now, trying to use the GPU for doing this. As you remember, execution |


69:04 | kind of starts and ends on the CPU. So you have to move |


69:10 | code and data to the GPU, and the idea in this case is you do |


69:15 | all the computations on the GPU. When it's all said and done, then the |


69:20 | result will be moved back to the CPU memory. So now, using OpenACC |


69:31 | to try to get this job done. Mm hmm. In this case, |


69:37 | there's parallelization of, um, the outer loop of the two loop nests. |


69:46 | The first loop nest updates all the red points, so to |


69:51 | speak. And then the next one is making the red points blue points. And in |


69:56 | this case, to guarantee correctness of the reduction across iterations — similar to what I showed in the |


70:04 | example for OpenMP — it's much the same thing: you have a bunch |


70:10 | of independent iterations, ah, doing the workload sharing for the outer for loop. |


70:18 | And in order to make sure that the reduction happens correctly, you can just use the |


70:25 | reduction clause and have the compiler generate the proper instructions to make sure that the |


70:32 | global max error, yeah, is properly computed. So, in this code — |
|
|
70:44 | So now we're supposed to compile the code, right? So the first effort here is, |


70:50 | yes, to generate code for the CPU. In this case, it shows |


71:01 | that, with -fast and also the -Minfo flag, the compiler is free to do |


71:07 | whatever it can do to optimize the code the most for the CPU, so to |


71:11 | speak. And then it prints some information about what happened in terms of acceleration. It |


71:19 | says what can be parallelized — here things were, uh, to be |


71:26 | parallelized, to be generated for the CPU — and |


71:32 | it basically says that, you know, things are unknown in terms of the |


71:37 | GPU side. Yeah, so this is not particularly interesting, but it just shows |


71:43 | what output you can get. I'll take questions after a couple more slides. So here's what |


71:50 | happened. So, yes, um, the generated code was executed on an |


71:58 | Intel Xeon — this is the model number and all |


72:03 | that. So it got about a three-times speedup on the ten-core CPU, |


72:10 | which is not too impressive, as far as I'm concerned. But it did at |


72:16 | least manage to get some parallelism on some of the cores. And |
|
|
72:23 | so, again, there's the caveat: one always has to be a bit careful. |


72:29 | Again, um, vendors want to show their products at |


72:36 | their best — they do their best, right? And it is interesting to |


72:42 | read the literature and the papers: when things are published by Intel people comparing |


72:48 | NVIDIA GPUs versus Intel CPUs, and by NVIDIA using Intel CPUs, how the |


72:55 | comparisons turn out — how the different types of devices perform. Um, but |
|
|
73:02 | well, to talk a little bit about the next one: I decided to instead generate code for, |


73:10 | um, the GPU, and let the compiler pretty much figure everything out. So, |


73:19 | um, in this case, if you look at the left column on the slide, it tells you |


73:25 | what the compiler did. And in this case, it uses these, uh — so the gangs basically |


73:33 | generate code for independent streaming multiprocessors. |


73:41 | Remember, a member of a gang is assigned to each streaming multiprocessor, |


73:48 | so it tries to use, um, several of the streaming units. Then |


73:55 | the other thing: uh, it managed the data traffic. And since, again, things start and finish on the |


74:03 | CPU, it allocates memory on the GPU for the variables, or arrays, that |


74:11 | you need. And it also initializes them — so here the values of A are copied |


74:18 | from the CPU memory to the GPU memory; that's the copyin. And I'll talk a |


74:23 | little bit more about these data clauses in what, um, follows. |


74:30 | Then it also, uh, returns the values of A to the CPU — the copyout. |


74:41 | And then, uh, for the second loop nest — yeah. And then, I guess, it |


74:47 | doesn't say so explicitly, but it does allocate memory for Anew. But since Anew |


74:54 | is just used in the computations that are now on the GPU, there |


75:01 | is no need to transfer Anew between the CPU and the GPU. |
|
|
75:14 | So now, also: what happened? So here, of course, uh, |


75:24 | using OpenACC, and letting the compiler take care of everything, you got a 37-times |


75:32 | speedup compared to the single core, and a little bit more than 10 or |


75:38 | 12 times speedup compared to, um, the ten-core CPU. Now, |


75:54 | um, one can certainly be impressed, in terms of the speedup, by those numbers. |


76:03 | Um, but I also want to encourage you to look a little bit — a |


76:07 | step — beyond it. And that's why I put the remarks at the bottom of |


76:13 | the slide, which show that, relatively speaking, the efficiency — the fraction of peak performance — |


76:24 | that they managed to get on the CPU is actually higher than the fraction of peak |


76:29 | they got on their own device. You understand? So one should be a little |
|
|
76:34 | bit cautious, then. I think this is, um, a good point to take questions before moving on. |


76:42 | Now, I know this one is not easy to remember, but I |


76:46 | think this slide makes it natural to ask questions. Uh, I'll leave |


76:52 | this one up for a little bit while I continue. All right, then. |
|
|
77:09 | I guess I'm almost done, but I will — maybe I'll show a couple of more slides, |


77:17 | just as an intro, I guess, to the next lecture. So it's |


77:26 | clear, I would say — we've seen it many times — the compiler does a very good |


77:31 | job in terms of taking care of everything. But, as always, |


77:40 | the programmer, or the one that knows the application and the data, may be able to |


77:45 | do a better job than the compiler, which has to infer everything from the code. |


77:54 | So the next variation of this code that I'm going to show you is if |


78:00 | you, as a user, try to manage the data transfers in particular. |
|
|
78:12 | But before that, I will talk about the notion of unified memory. So the |


78:20 | idea of unified memory is, I think, shown on this slide. Um, as |


78:25 | I said, there are physically separate memories, and |


78:34 | there are, you know, three data paths between them. Um, |


78:41 | the CPU memory has the memory bus to the CPU, and the GPU memory |


78:47 | also has a memory bus to the GPU. And between the two devices there is |


78:54 | the PCI Express bus. So it's exceedingly NUMA, if you like — it |


79:02 | is non-uniform memory access — because of, um, the highly different capabilities of these |


79:10 | buses involved in moving data. The unified memory notion is that you can treat it |


79:19 | all, kind of, as one address space, just like in NUMA there is one address space for |


79:26 | the shared memory in the node, but it is by no means uniform in access time |


79:34 | to any part of it. So this is the notion of unified memory, which exists also in |


79:41 | OpenMP 4.5 or later. So, um, this is now what the compiler |


79:52 | used when you told it to manage data for you: it uses this notion of |


79:58 | what is unified memory. And I guess at |
|
|
80:04 | that point my time is up, so we'll continue with this example next time, and I'll take some questions if you |


80:10 | have them. Okay, let me stop the screen share and see if there are questions. |


80:39 | So, so far, I have, I guess, mostly talked about — there's lots of |


80:42 | similarities, but the underlying hardware structure is, uh, visible, and becomes increasingly so |


80:54 | the more you try to optimize code. Okay, there are no questions. I will |


81:13 | start next time with the first — in the region, the |
|