© Distribution of this video is restricted by its owner
00:00 | Okay, so today, as it says on this slide, I will try to tell you a little bit about what compilers try to do and what they can do, as well as how we, as users of our systems and tools, can help the compiler do what one hopes it would. |

00:30 | So first I'll talk, just a little bit and very generally, about code efficiency. Then I'll give some quotes from people you may know, on what came up when they looked at the compiler and what it can do — I'll mention that very briefly. And then we'll move on to talk about what compilers are trying to do to improve the code while respecting how it's supposed to behave on the platform. |

01:01 | That's why I started this course by focusing on understanding and learning about single-node performance, and how to get performance — going first through just single-threaded performance, then trying to figure out the multi-threaded part on a single node using OpenMP. So, for anyone who may not be familiar with the name: |

01:31 | he is a quite well-known scientist who was instrumental — one of the leaders — in developing pretty much every generation of the MPI standard, for instance, for programming clusters. He is now at the University of Illinois, but he worked at Argonne National Laboratory for many years, and he is a recipient of many awards and honors. |

02:03 | And, um, I'll try to emphasize, as I have done many times in the past, that to get efficient code one needs to understand the platform and to take good algorithms, and the job is then to try to make sure the two fit together so you also get a good implementation. So it's one of these |

02:30 | skills which is kind of vertical: one needs to understand the layers that one cuts across, kind of contrary to the common practice of using abstractions layer by layer, where you don't necessarily need to worry about the layers below or the one above — you just work in your own layer. But that tends not to work too well when it comes to getting good efficiency out of the system. |

03:05 | And just a little bit of an aside — it is often perhaps not said enough; it's not the focus of today's lecture, but we'll talk about it later in the course — a lot of the progress in terms of getting useful work out of computer systems also comes from progress in algorithms, and I thought it was nice to just include that. |

03:24 | This shows what has happened, and it comes from the scientific computing domain, in terms of, yes, code issues. As it says, you know, it's quite common that you can get a factor of 10 — and in fact it's not unusual to even get a lot more than that. And that's part of the reason why, in this course, I start by giving you the tools to build an understanding of the actual performance in terms of resource utilization, and then get you to understand what the resources actually are. |

04:08 | Um, so I think that's enough comments on this slide. Here is a quote from another very well-known person in the computer science field, Charles Leiserson, who is a professor at MIT and has also won many awards. He — or his group, I should say — is the creator of the Cilk programming language, if you like some kind of run-time system; that effort was later acquired by Intel, and it's now part of the Intel tool set. But anyway, it shows the |

04:58 | difference, in this case going from straightforward code to highly optimized C code using vector instructions. So it's orders of magnitude. Um — the code you begin with might be good in terms of programmer productivity, but in terms of resources — non-human resources, by which I mean the hardware — it may not be all that good. So it depends on what your goal is for the code: whether it's meant to be a production code, or something you're just writing as a sort of quick tool to understand something; |

05:45 | a production code is likely to need to be very efficient. So then I'm including here some examples of what the compiler flags are, in terms of what types of optimization lie behind them and what is done at the various levels of optimization. I guess the zero level is just "do nothing and take the code as it is," whereas level one tries to reduce execution time by allowing a number of optimizations that are listed here — I'm not going to cover them all, but one of the main constraints or restrictions on the optimizations being done at level one is that a lot of attention is paid to the extent to which optimizations may increase the code size. |

07:00 | We'll talk about some of these optimizations in some detail later on in this lecture. So sometimes there is a trade-off between speed and code size. Then, with the level two optimization, the compiler is still, you know, trying to speed things up, but it is now also allowed to include vectorization-type features — and I will talk about vectorization a little bit later in this class today and also next time. Um, in this case the compiler is not completely strict; it just tries to make a guesstimate as to what the potential growth in code size is, and whether that eventually leads to reduced running time. But that may or may not |

07:54 | actually be the case. And then at the third level there are typically further optimizations, which are listed here by name. We'll talk about some of these things, just to give you a sense of what a compiler — at least a state-of-the-art compiler — is most of the time reasonably successful in doing. Whether it is successful or not depends on the context of the code, and on whether the code analysis can show that the result of these transformations of the code is safe. If it is not able to conclude that a transformation is safe, then the compiler cannot do that optimization. You as the programmer, on the other hand, may know that it is safe, and then, by structuring your code, you can make the compiler successful. So here is a list of |

09:00 | optimizations by name, and we will talk about some of these — not all of them. Again, I focus on things that are easy to do in terms of how you write your code. |

09:20 | And here is an example of that in terms of the STREAM benchmark — I hope you can see the numbers. This is just compiling the same code with different optimization flags, using the GCC compiler, as it says. So look at each piece of the illustration here, and in particular at the data rates in the second column: the first column gives the names of the STREAM functions, and the second column the data rates measured |

10:00 | for those functions. If you look at the run that used no optimization and just pick one of the functions — say the Triad — it shows about seven, per second, in terms of the data rate. With the level one optimization that jumped to 13.4, so it almost doubled in terms of the data rate, and the code size even went down to about two thirds. So that level one paid off very handsomely, both in speed and in code size. The level two optimization — again looking at the Triad — went from 13.4 to a slightly higher value, so it gave a little bit better data rate for that particular STREAM function, but the code size now grew a little |

10:59 | bit — not much. But then, if you went to the most aggressive optimization, the code size pretty much became double that of the level one optimization, and the performance in fact actually dropped. So this is again the trade-off: everything was intended to be good, but in large part, sometimes, the optimizations may not have been successful, and the result may also have been affected by the growth in code size, which can hurt performance: code has to be loaded through the same data paths as the data — except very close to the functional units, everything from the L2 caches outward is shared in terms of data paths — so code competes with data in some of the shared levels. So, does anyone want to ask a question or comment on this kind of trade-off and how it worked out in this particular case? |

12:10 | If not — so, this is pretty much what I already said: the compiler writers have to make sure that the correctness of the code is preserved. The compiler cannot do anything that is at all risky, and it has to be seemingly conservative: it is not always possible, through analysis, to figure out which source code transformations are possible without things potentially going wrong in some extreme cases. |

12:48 | Now — I'm sure this is a familiar picture to most computer science students, but maybe for some of you it is not — compilers consist of what is known as the front end and the back end. The front end is the part that takes a program in C, C++ or Fortran and translates it into what's known as an intermediate language, or intermediate format, which in many compilers is common to several programming languages: whether the source is C, C++ or Fortran, the intermediate format is the same, and that means one can use the same kind of analysis tools working on this intermediate format. Then one has a code generator, which is decoupled to some degree from the programming language, but which is dependent on the hardware platform: it has the instruction set for the particular platform and knows about the costs on that platform. In that part there are also optimizations that happen, but they are much more machine dependent, as opposed to being driven by the structure of the source code. So I'll talk mostly about the |

14:17 | machine-independent transformations today. Um, the other thing is that doing what's known as whole-program analysis is very expensive, as hopefully anyone can imagine: if you have hundreds of thousands of lines of code, then whole-program analysis is an incredibly complex data-dependence task. So many, uh, compilers all look at what's known as basic blocks, which I mentioned last time — and, yeah, I just repeat here the picture that I showed last time of what basic blocks are. Within those, transformations are reasonably safe to do. But once one moves beyond the basic blocks to do the analysis, it becomes a lot more complex and time consuming to do. |

15:36 | So one of my colleagues, when I was at Yale University, was one of the first to try to exploit this idea of instruction-level parallelism, and for this the code has to be compiled into the right format. So he designed a language and, as you know, started a company to build a machine for it — it was known as Multiflow, for this VLIW design. But, as people said, you know, you need a computer to actually completely compile code for your computer: compiling the code could take hours when you tried to do the analysis. So that's the catch, and the extent to which compilers do interprocedural analysis is kind of limited. |

16:40 | Um, so now I will talk about what some of these names mean, and talk through an example of what happens on one particular processor — one particular Intel architecture — and show what they mean. Inlining of functions is essentially this: instead of having the function call, which in some ways makes for nice and clean code, you basically remove the call and substitute the explicit text of the body of the function into the line of the call. Here is just an example: before inlining, the code calls this function three times, say, whereas after inlining, you replace each one of the calls with the statements that were used to define the function. So it's nothing magic, but what it does is avoid the overhead of doing function calls. And that can |

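As a minimal sketch of what this looks like in C — the function and variable names here are made up for illustration, not taken from the slide:

```c
#include <assert.h>

/* Before inlining: a small helper called from a loop. */
static int square(int x) { return x * x; }

int sum_squares_call(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += square(a[i]);          /* call/return overhead on every iteration */
    return s;
}

/* After inlining: the body of square() is substituted at the call site,
 * so there is no call, return, or stack-frame setup inside the loop. */
int sum_squares_inlined(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * a[i];           /* same computation, no function call */
    return s;
}
```

At `-O2`, a compiler will usually inline a small `static` helper like this on its own; writing the inlined form by hand is the fallback for when it does not.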
18:00 | sometimes be a big gain. So that's known as inlining, and compilers are reasonably successful at it, at least as long as the functions are not too complex — if it's deemed by the compiler to be an optimization that is likely to win. But clearly the inlining increases the code size. This is part of what compilers do; but if you're concerned about the overhead, or not certain that the compiler will be successful at it, you can do the inlining on your own, knowing that it is okay. It makes the code a little less, you know, neat and pretty, but the result |

18:51 | is potentially significantly more efficient. [Student:] Yeah — is this primarily about eliminating the overhead of a stack frame? Because I can't imagine anything else where it would be very helpful. [Instructor:] Yes, it is basically setting up the stack frame and unwinding it again, and that's what inlining removes. So it saves some cycles, |

19:18 | that's the point. And this is another one that is, again to me, just good programming practice — I guess the official term is code movement, or code hoisting: essentially, don't do things inside loops that are invariant with respect to the loop; do that once instead of doing it every time in the loop. In this case it is basically an operation on the number of elements — you know, that never changes — so that's something you could just as well compute outside the loop and then just use inside the loop. So, in actual fact, yes, it costs you a temp variable, and this example is very trivial, but it reduces the number of operations inside the loop. |

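A sketch of the idea in C — the loop-invariant quantity here (`2 * n`) is invented for illustration; the slide's example may use a different expression:

```c
#include <assert.h>

/* Before: 2*n is recomputed in every iteration even though it never changes. */
void add_offset(long *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + 2L * n;      /* loop-invariant work inside the loop */
}

/* After code motion: the invariant value is computed once, outside. */
void add_offset_hoisted(long *a, int n) {
    long off = 2L * n;             /* does not depend on i, so hoist it */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + off;
}
```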
20:22 | And here is a little bit more detailed example, borrowed from the course at CMU. It again shows being careful about how you write things. I would say there's nothing wrong with these — all of them are just source-to-source transformations that are aimed at reducing the amount of work that has to be done, sometimes at the expense of some additional temporary locations. So in this case the example is just that: there is, again, something invariant inside the loop — in this case it's part of an array reference — and instead of doing the n*i multiplication every iteration in the assignment statement, you just compute it once outside, thereby avoiding a multiplication in every assignment statement. |

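In the spirit of that example (this reconstruction is an assumption — the original slide may differ in details), hoisting the `n*i` term looks like:

```c
#include <assert.h>

/* Copy row i of an n-by-n matrix stored row-major in a flat array.
 * Before: the index computation n*i + j redoes the multiply every iteration. */
void set_row(long *a, const long *b, long i, long n) {
    for (long j = 0; j < n; j++)
        a[n * i + j] = b[j];
}

/* After: n*i is hoisted out of the loop; only an add remains per iteration. */
void set_row_hoisted(long *a, const long *b, long i, long n) {
    long ni = n * i;               /* computed once */
    for (long j = 0; j < n; j++)
        a[ni + j] = b[j];
}
```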
21:23 | And this was, yes, an example of what the compiler managed to do in this case. I'm not going to try to explain all of the assembly code — even though I recommend that anyone who wants to do good code optimization look at the assembly that the compiler generates and try to understand it, because then one sees what actually happens. The two yellow boxes are before and after compilation with, in this case, the level one optimization, and it shows that in this case the compiler did the right thing: it assigned a new variable to the product of n and i, and that's in fact what is used in the loop. It also did a little bit more, as you can see: it also introduced an index pointer and then used a simple increment of the pointer instead of having to do a full |

22:44 | array reference. Yeah, um, this next example — yes, this has to do with, just as it says at the top of the slide, trying to keep things local in registers. So if one deals with arrays, sometimes the compiler has difficulty figuring out how to keep array variables local in registers, and it may in fact keep things in memory for every iteration of a particular loop. In this case, we sum all the elements of each row of the array A into a vector B — each entry of B is one element, and for each row of A we just add up the elements in this little example. So what may seem simple becomes many array references that are hard for the compiler to optimize out by creating a local variable. So I think, in this |

23:50 | case, let's see what happens in these situations. The code basically says that these arrays A are initialized, and then we call this function to compute the sums and store them in B. The compiled code goes to memory for the value of B in every iteration, which means there is memory traffic to the B array in the inner loop as well. So the thing to do to avoid that is to use a local variable. |

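A sketch of the row-sum example with and without the local accumulator — the array layout and names are my assumptions, not the slide's:

```c
#include <assert.h>

/* Sum each row of a into b. Writing through b[i] inside the inner loop can
 * force a load and store of b[i] on every iteration, because the compiler
 * often cannot prove that b does not alias a. */
void row_sums(double *b, const double *a, int n) {
    for (int i = 0; i < n; i++) {
        b[i] = 0.0;
        for (int j = 0; j < n; j++)
            b[i] += a[i * n + j];  /* memory traffic to b every iteration */
    }
}

/* Accumulating in a local scalar lets the compiler keep the sum in a
 * register; memory for b is touched only once per row. */
void row_sums_local(double *b, const double *a, int n) {
    for (int i = 0; i < n; i++) {
        double val = 0.0;          /* local, register-resident accumulator */
        for (int j = 0; j < n; j++)
            val += a[i * n + j];
        b[i] = val;                /* single store per row */
    }
}
```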
24:43 | We talked about this when we talked about OpenMP and the sharing of variables. In that case, there was a concern with respect to a potential, um, cache line moving between caches; this is similar, but now the focus is on the memory traffic. So in this case, with a local variable, the compiler doesn't have to worry about any aliasing issues, or that some other processor does things — that somebody, you know, changed something in the B array that it's not aware of. So it just works on its local variable, and it keeps it in a register, so the number of memory references is significantly reduced. The type of programming concept, if you like, is the same as the one we talked about in terms of OpenMP: avoiding the cache-line bouncing between caches. |

26:06 | Now, in the examples I'm going to use, cycles per element is a common measure when you're trying to get a good idea of how efficient things are, and that's what is used in the examples I'm going to talk about in the next many slides. But maybe I should, uh, pause for a minute and see if there are any questions so far on the fairly simple transformations — basically coding styles — that are encouraged to make sure you get good, efficient code. |

26:49 | [Student:] Um, on the example of updating an array position — I think it was "position i plus-equals some value." So you said that to optimize that away, we should be introducing a temporary variable to sum up, and then access the array only once, after the loop? [Instructor:] Okay, so — because if it's a single variable, even though it's the same memory location every time, the compilers are usually very successful in keeping that variable in a register instead of allocating memory traffic for it. |

27:31 | Okay. So this is just a simple model of how the running time behaves: the total time is an overhead term plus the cycles per element — the instruction time per element, if you like — times the number of elements. So clearly one wants to get as few cycles per element as possible; that is what I want to try to talk about. And I guess you see the curves for the two versions on this slide that I just went through: curve two is the one with the local variable, and curve one on this slide is when you had the explicit array reference in the innermost loop. So you can see how the two cases worked out as a function of the number of elements, and there is a small but still real savings for very little expense in this case. The slopes of the lines are then the cycles per element. |

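The model just described can be written down directly; the function names below are mine, not the slide's:

```c
#include <assert.h>

/* Simple running-time model from the lecture: total cycles are a fixed
 * overhead plus cycles-per-element (CPE) times the number of elements. */
double total_cycles(double overhead, double cpe, long n) {
    return overhead + cpe * (double)n;
}

/* Given measurements at two sizes, the slope of the line recovers the CPE
 * (the overhead term cancels out). */
double cpe_from_two_runs(double t1, long n1, double t2, long n2) {
    return (t2 - t1) / (double)(n2 - n1);
}
```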
28:50 | So now there's an example that shows a little bit, again, how one might write code. I would say, as I told you, there's nothing wrong with any one of the code versions in terms of correctness or its, uh, you know, logic, or from the use of abstraction — if you're focused on what you want to get done, you might well write it this particular way. But one of the issues in this case is that in the loop statement, before the body, you do basically a function evaluation querying the vector length in every iteration of the loop; you also then have another function call to get the vector element to use in the body of the loop; and then you do something simple with the element retrieved from the array. |

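A sketch of what such a vector abstraction and the "natural" loop might look like — this API (`vec_length`, `get_vec_element`) is my guess at the style of the example, not its actual code:

```c
#include <assert.h>

/* A tiny vector type with access functions, for illustration only. */
typedef struct { long len; double *data; } vec;

long vec_length(const vec *v) { return v->len; }

int get_vec_element(const vec *v, long i, double *dest) {
    if (i < 0 || i >= v->len) return 0;   /* bounds check on every access */
    *dest = v->data[i];
    return 1;
}

/* Natural version: calls vec_length() in the loop test every iteration and
 * get_vec_element() in the body every iteration. Correct, but slow. */
double sum_naive(const vec *v) {
    double val, sum = 0.0;
    for (long i = 0; i < vec_length(v); i++) {
        get_vec_element(v, i, &val);
        sum += val;
    }
    return sum;
}
```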
30:10 | So in this case, here are, you know, the timing results, done in terms of cycles per element. This shows what happens when you don't do any optimization at all on this particular tiny piece of code, but it also shows that the level one type optimization in this case had a good effect — and we'll talk more about what that was in the next few slides — and we'll also get to see how even the level-one-optimized code can be improved significantly. In fact, by the time we're done with this example, the code will have improved by |

31:06 | more than an order of magnitude. So there is a sort of first thing one can do: because the vector length doesn't change in the loop — this is just kind of code cleaning — there is no point in evaluating, or calling, the vector-length function in every iteration of the loop. Since it's an invariant, just do it once. And so that is the first level of optimization in this case; it would be done by the compiler at the second optimization level. Um — and, I'm sorry about the outside noise; somebody decided to burn something, hopefully it's not too annoying — anyway, it is sometimes hard for the compiler to tell that it is safe to move these things out, so it may or may not succeed in doing this at the level two optimization. But it's not too cumbersome for you to actually do it yourself: think of what is loop invariant and put what is invariant outside the loop. So that |

32:30 | was kind of a fairly simple, um, optimization. The other one is perhaps a little bit less intuitive — or maybe not. It is also the case that there are simpler operations than calling this function that gets the starting position and computes the actual location you want for every iteration of the loop. You can do better: get a pointer to the starting location — the memory location of the array — and then just increment it. In this case things are fairly simple: we are just going to walk through the array from start to finish, or to the point we wanted, so it is a very simple stride-one walk through the array. So there's no point in calling this get-vector-start function in, ah, yeah, every iteration. So, um, I think that was the second one. Then there is one more, and then there will be some more, um, timing results, I think, on the next slide: we also then use a temporary variable for the accumulation instead of writing it to the destination every time. So now we are again not using an array reference; the accumulation goes into a local variable. |

34:36 | So, to sum it all up: first there was the default time, and the gain from optimization level one; then the three code restructurings that I talked about. The first was to move the call for the vector length outside the loop — it is a loop invariant. The second was to just get the pointer to the array and then walk through it by incrementing by one. And the third was to do the accumulation in a local variable in the loop. So this code restructuring now improved the performance quite a bit with respect to even the level one optimization, which itself was a factor better than the no-optimization case by the compiler. So these few simple changes to this conceptually simple loop gave, what, a factor of three or better, uh, for the double precision multiply, and five or eight, I think, more or less, for the integer add. We'll talk a little bit more about these differences in improvement in the coming slides, but they are all, you know, due to removing function calls and making things very simple: instead of arithmetic to compute indices or addresses, just incrementing |

36:21 | by one. So, any questions so far? So these are, to me, simple things — not necessarily what one is taught, which is keeping code clean and easy to understand; this makes the code a little bit more difficult to read, but it can have huge performance gains. |

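Putting the three restructurings together gives something like the following — again a sketch; the vector type here is invented so the example is self-contained:

```c
#include <assert.h>

/* A made-up vector type, for illustration only. */
typedef struct { long len; double *data; } vec;

/* All three manual restructurings from the lecture at once:
 * 1. the length is read once, outside the loop (it is loop invariant),
 * 2. a pointer to the data is fetched once and walked with stride one,
 * 3. the accumulation happens in a local variable, not through memory. */
double sum_restructured(const vec *v) {
    long n = v->len;               /* (1) hoisted invariant */
    const double *p = v->data;     /* (2) direct pointer, incremented below */
    double sum = 0.0;              /* (3) register-resident accumulator */
    for (long i = 0; i < n; i++)
        sum += *p++;
    return sum;
}
```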
37:02 | So now I'm going to get to the topic of loop unrolling, and I'm going to take quite a few slides before I'm done with this example. Here's the most trivial example first, and then we'll do a more elaborate example second. The point of loop unrolling is, um, basically to reduce the number of loop iterations, and instead to do a few iterations' worth of the loop in the loop body. That means two things may happen. One is reduced looping overhead, because you don't have to do the test — whether you are at the end of the loop or not — as many times. And it also often helps the compiler: when it has more statements in the loop body, it can keep things in registers. So there are, yeah, two gains, and there are other gains that I will show you in the next several slides. But the first order of business is: it reduces looping overhead |

38:26 | and helps the compiler keep things in registers. [Student:] Dr. Johnson — so this loop unrolling is more aimed at parallelizing at the core level, and not necessarily parallelism across cores? So it will help you even on a single core? [Instructor:] Okay, I think I would use "parallelizing" in the sense that we're getting the maximum effectiveness out of each cycle. [Student:] Because, in terms of, um, data dependencies — is that the kind of thing we're looking at with loop unrolling? [Instructor:] So, um, the data dependencies are the same. What happens in this case — take the little example here. Um, so this is, I guess, a messed-up slide, I'm sorry: where they have put "sum" it should instead be another variable, y or something, defined outside — not as a loop variable. I didn't spot that; I apologize. What it means, um, is that the compiler may generate loads and stores: in the single-statement loop case it loads the value, then does whatever this function, or the assignment statement, does to it, and stores it back, so there's memory traffic generated potentially for each iteration in the loop. That is then reduced, in this case by a factor of five, by the unrolling. So it saves that. But the dependency is the same; it's just that instead of a one-cycle access to a register, it may be up to hundreds of cycles to memory for the data. Okay, so I'm not sure I would call that, um, parallelism — that's why I say it helps even on |

40:50 | a single thread: you most likely avoid round trips to memory when you do loop unrolling. And for every iteration of the loop, when you're at the end of the loop and get to the test, you have to do the test and see whether to go around again or not. That takes — not much, but it takes a few cycles to do that test. So it eliminates some, um, test arithmetic — the conditional operation, the comparison, the jump — in the loop structure, and it potentially eliminates round trips |

41:37 | to memory. So here is just an example — matrix multiply, which you will see more times than you care for in this course, because it's a simple loop structure — but it shows the effect. This is just one example of doing some unrolling that was used on a particular machine, and it shows, depending on how the loops were unrolled — up and down for the different loops — what the compute rate was. It was, you know, a very old machine, and the numbers are not very interesting in themselves. But the difference is interesting: you got a factor of three improvement in the performance for this particular loop by doing the loop unrolling. And again, it is round trips to memory avoided — most of the gain is there — plus some reduction in looping overhead; but most of it is due to reduced memory traffic. |

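A minimal unrolling sketch in C — unrolling by four here; the factor and names are my choice, not the slide's:

```c
#include <assert.h>

/* Straightforward sum: one loop test per element. */
long sum1(const long *a, long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same loop unrolled by four: the end-of-loop test runs a quarter as
 * often, and the compiler gets a bigger loop body to keep in registers.
 * A cleanup loop handles the n % 4 leftover elements. */
long sum1_unrolled(const long *a, long n) {
    long s = 0;
    long i = 0;
    for (; i + 3 < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)             /* leftovers when n is not a multiple of 4 */
        s += a[i];
    return s;
}
```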
42:58 | So now I'm going to come to talk about a more detailed example. It was done on an Intel Haswell CPU, which is the machine we will be using for the next assignments — the Pittsburgh Supercomputing Center Bridges system uses the Haswell. So here's a little bit about the number of functional units it has for the integer and floating point operations, and, um, how some of the functional units are operating. When I talked about processors, I don't think I said too much about functional-unit-type designs in terms of latency, issue ports, and cycles per issue. I talked about the latency to the various levels of caches and main memory when I talked about memory, but I don't think I mentioned these pipeline depths, as it is sometimes called — or latencies — of the functional units, whether, uh, multipliers or adders. Even instruction decoding is done in the form of pipelined |

44:25 | units in modern processors. So what that means — it's like, um, a typical cafeteria counter, right: you go and pick up a tray, and then you go by the various pick-up stations and eventually get to the cashier at the end. And as you pass through the different stations, other people can, you know, be at your previous station — as opposed to a version where one person would have to pick up all the different pieces and pay before the next person can pick up a tray. So that is what hardware designers do. |

45:24 | Instead of taking, you know, a floating point add or multiply as one monolithic operation — you have some notion of floating point numbers, right: they have a mantissa, and they have an exponent, so they create sort of a dynamic range by modifying the exponent. So if you want to add two floating point numbers, one step is to align them so that they basically have the same exponent; otherwise you can't deal with what's known as the mantissa, or the fraction. So there are several different operations to be performed in order to do an add or multiply on such numbers, and that means a fair amount of logic. And then what hardware designers do is make sort of stages: you may do in one stage the exponent alignment — that's a certain amount of logic — and then there's kind of a register, a buffer, where they keep the result of that. And then it moves down to the next thing you want to do, and meanwhile another pair of numbers can be aligned. So what it means, you see, um, is that the clock rate is dependent on the longest — kind of slowest — path within each one of the stages, as opposed to the whole operation. |

46:53 | So it's again like the cafeteria: the rate is dependent, more or less, on how people move through the cafeteria line — on the slowest station along the cafeteria line. So what this idea of pipelining does is that it allows designers to increase the clock rate, compared to doing a complete operation as a single entity. Um, so it increases clock rates. And, by that — if you follow the cafeteria analogy — it increases |

47:34 | the truth. So that's what you on this thing. Not to say |
|
|
47:38 | , you know, low store and . Multiply and add and even the |
|
|
47:43 | point multiply. Add. Um, just take. You can add or |
|
|
47:50 | new operations into the pipeline every so that's what the cycles per issue |
|
|
47:54 | . One, but it takes more one cycle to do. The store |
|
|
48:00 | in this case takes four. In of the floating point multiply, it |
|
|
48:04 | five cyclists to get the result of multiply, but you can start a |
|
|
48:09 | one again every cycle. So this the way it works now. The |
|
|
48:13 | The thing to pay attention to, which unfortunately applies in most cases, is |
|
|
48:20 | that, yeah, the division is a fairly complicated operation, and many times it is not |
|
|
48:31 | a pipelined function. So that's why, I think in some lecture in the past, |
|
|
48:36 | I mentioned that division is very expensive, by no means comparable to multiplication in modern |
|
|
48:46 | CPUs. And the way division is done is typically through some iterative algorithm, |
|
|
48:50 | and the one that is typically used is a variation on Newton's iteration being done to compute |
|
|
48:58 | the division. All right? That was a lot about this. |
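Because divide is slow and typically not pipelined, a common hand optimization, shown here as my own sketch rather than code from the slides, is to hoist a repeated division out of a loop by multiplying with a precomputed reciprocal:

```c
/* Division is iterative and usually not pipelined, so dividing by the
 * same value in every iteration is expensive. */
double sum_scaled_div(const double *x, int n, double d) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] / d;          /* one divide per iteration */
    return s;
}

/* Sketch of the workaround: one divide up front, then multiplies,
 * which are fully pipelined. */
double sum_scaled_recip(const double *x, int n, double d) {
    double r = 1.0 / d;         /* single divide, hoisted out of the loop */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * r;          /* pipelined multiply instead of divide */
    return s;
}
```

Note that a/d and a*(1/d) can differ in the last bit because they round differently, which is one reason compilers will not always make this substitution on their own.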
|
|
49:07 | Any questions on this before I continue with the next part? Okay, so here is just, you |
|
|
49:18 | know, a cartoonish example of what it means for the pipeline with respect to |
|
|
49:22 | a particular small piece of code in the yellow box: first compute the product of A |
|
|
49:29 | and B, and another product of A and C, and then compute the product of the |
|
|
49:33 | two products just created. So in this case, when you have this, |
|
|
49:39 | it takes, in this case, a number of cycles to do these three multiplications. |
|
|
49:45 | As A times B moves through the pipeline, you know, uh, um, |
|
|
49:51 | one stage at a time, then A times B is underway and the next product is ready to be |
|
|
49:57 | started, so in this case one gets to use the multiplier to start the |
|
|
50:04 | second product, A times C, and it just follows A times B by one |
|
|
50:10 | cycle. And so, in a later cycle, then, the A-times-C product is |
|
|
50:14 | available, and now one can use it to compute the product of the two partial |
|
|
50:21 | results. And so that is started in that cycle, cycle five. So in this |
|
|
50:27 | case, this is what it takes to do these three statements. So this is the notion behind |
|
|
50:32 | loop unrolling, as shown on the next slides: loops are unrolled in order to |
|
|
50:37 | reduce the start-up cost. Okay, this is what I said. |
|
|
50:46 | So, um, now let's see what happened here. This was, I think, |
|
|
50:55 | what was on the previous slide: "combined4" was the three optimizations that |
|
|
51:00 | were explicitly made by the programmer to this very simple loop, and the outcome is the |
|
|
51:06 | combined4 line in the table in the upper left-hand corner. And then on the right |
|
|
51:13 | part of the table, it shows what the latency-bound times would be. |
|
|
51:20 | And those come from that table on the lower right-hand side, |
|
|
51:27 | which is again the aspects of the pipeline depth and the latency for each one of the functional |
|
|
51:32 | units. So this means the result from this combined4 optimization is pretty close |
|
|
51:43 | to what one can expect if one is bound by latency. But ideally, |
|
|
51:51 | one would like, in this case, to be bound by the throughput, the cycles |
|
|
51:57 | per issue. Ideally, you would like to get close to one result per |
|
|
52:02 | cycle at least, or even better if you have multiple functional units. So here's |
|
|
52:11 | the reason why things are not throughput bound but latency bound. And so this |
|
|
52:18 | is to illustrate it with this kind of, uh, multiplication; if it |
|
|
52:25 | , you know, the same thing would be true for add, but there is sort |
|
|
52:33 | of this: one of these multiplications is used in the next iteration of the loop. |
|
|
52:40 | So if you look at this loop that we have in the yellow boxes: |
|
|
52:45 | t, uh, in the new iteration is depending on the old t combined with |
|
|
52:51 | an element of x, using multiply as the operation. So that means one |
|
|
53:00 | has to wait until the result of that multiply is ready before you can apply |
|
|
53:08 | it to t. So that's why you get this dependency graph that, now, it |
|
|
53:17 | is basically limited by the latency of the functional units. So, |
|
|
53:25 | obviously, trying to work around that somehow, one way would be to try to |
|
|
53:31 | do something like this, um, that is, do loop unrolling, as we can see |
|
|
53:38 | here. Um, where, you know, you do two iterations of the loop in |
|
|
53:45 | this one computational statement. So now let's see what happens. Anyone want to |
|
|
53:58 | actually volunteer: did this help, or not? It should help in terms of the loop |
|
|
54:11 | overhead, because, in terms of the latency, it can't; it didn't actually take |
|
|
54:20 | away the latency dependency, as we can see. So let's see what happened. Um, |
|
|
54:29 | here is now what was done. So, the unrolling that was just done on |
|
|
54:34 | the previous slide, and nothing really changed much, except a little bit for the |
|
|
54:42 | loop overhead, mhm. And I won't go into that too much, because the |
|
|
54:48 | point is, we wanted to get beyond being latency bound. Mhm. |
|
|
54:56 | And the reason things didn't improve is because we still have a statement, well, that |
|
|
55:03 | has the same latency dependence as we had before. So in this case, |
|
|
55:10 | the loop unrolling didn't help. But that doesn't mean that loop unrolling is not a |
|
|
55:15 | useful thing to do to figure out how to use pipelines better. So now, |
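In code, the situation just described might look like the following sketch (my own reconstruction of the slide's idea, with assumed names): the two-way unrolled loop does fewer loop tests, but every multiply still waits on the previous one, so the latency bound remains.

```c
/* Baseline: one long serial dependency chain through t. */
double prod_baseline(const double *x, int n) {
    double t = 1.0;
    for (int i = 0; i < n; i++)
        t = t * x[i];
    return t;
}

/* Two-way unrolling with the same association: less loop overhead,
 * but each multiply still depends on the previous result, so the
 * latency chain is unchanged. */
double prod_unroll2(const double *x, int n) {
    double t = 1.0;
    int i;
    for (i = 0; i + 1 < n; i += 2)
        t = (t * x[i]) * x[i + 1];   /* serial chain unchanged */
    for (; i < n; i++)               /* cleanup for odd n */
        t = t * x[i];
    return t;
}
```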
|
|
55:20 | we're doing a different version. Um, now, um, now we do the association |
|
|
55:28 | differently: in this unrolled loop we take the product of the two array |
|
|
55:35 | elements first, and then we add it into the accumulation variable. So now, if |
|
|
55:43 | we do that, um, we're going to see that, and this is the "2 x |
|
|
55:50 | 1a" version, things actually improve quite a bit. So |
|
|
56:02 | one might kind of wonder: how, and what, is so different between the association done |
|
|
56:09 | the two different ways? And that again depends on the underlying architecture, |
|
|
56:18 | um, that has two load/store units and, in this case, also two |
|
|
56:27 | units for floating-point operations. So here is kind of what actually now happens in |
|
|
56:36 | the code: the pair of variables is loaded in a single cycle, because there |
|
|
56:47 | are the two load/store units and they are pipelined. So they behave like |
|
|
56:52 | pipelined ones, so basically every cycle you can load new things. And then there's the |
|
|
57:01 | fact that there is a latency before the result is available; that latency applies |
|
|
57:08 | to each of the multiplications, in this case, being done, so they're basically offset from |
|
|
57:15 | each other just by one cycle. There is a latency of three for the first |
|
|
57:21 | one, but then results come every cycle. So the accumulation is done every cycle |
|
|
57:28 | , so this structure then allows using both load/store units concurrently, and the |
|
|
57:37 | accumulation does not become latency bound. So now with this we get a |
|
|
57:45 | factor-of-two improvement, as was shown on, I think, the previous slide. |
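The reassociated version just described can be sketched like this (again a reconstruction with assumed names, not the slide's exact code): the product of the two array elements does not depend on the accumulator, so it can move through the multiply pipeline while the previous accumulation is still in flight.

```c
/* Reassociated two-way unrolling (sketch): x[i] * x[i+1] is independent
 * of t, so it can start before the previous update of t has finished. */
double prod_unroll2_reassoc(const double *x, int n) {
    double t = 1.0;
    int i;
    for (i = 0; i + 1 < n; i += 2)
        t = t * (x[i] * x[i + 1]);   /* inner product is off the chain */
    for (; i < n; i++)               /* cleanup for odd n */
        t = t * x[i];
    return t;
}
```

Floating-point reassociation can change rounding, which is exactly why a compiler will not make this transformation on its own without being told it is allowed to.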
|
|
57:55 | Any questions on that? I'll give a couple of more tricks. So this is again where |
|
|
58:03 | it's important to understand the architecture, and most of the architectures today, they have |
|
|
58:12 | at least two, and many times they have more, replications of load/store units, |
|
|
58:21 | um, to keep things somewhat balanced with the replication of functional units in these VLIW |
|
|
58:28 | -type architectures. So here's another version: try to use two separate accumulators. So |
|
|
58:41 | now, still loop unrolling, but instead of using a single variable, |
|
|
58:46 | now we add things up into one accumulator and then the next one. So |
|
|
58:52 | now we can certainly still use, then, two of those load/store units, |
|
|
59:00 | one for the variable x[i] and one for x[i+1], in the same cycle. So do |
|
|
59:08 | we, um, think we'll get the same result as with the other, |
|
|
59:17 | single-variable, best-association version using two load/store |
|
|
59:40 | units? Well, we did, because we |
|
|
59:52 | are using the two load/store units; we didn't get beyond that. So at |
|
|
59:58 | least this slide also shows that, while we have moved beyond the latency-bound |
|
|
60:07 | things, we still have some distance to go to something that is limited by the throughput. |
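As a sketch of the two-accumulator idea (assumed names, not the slide's code): each accumulator carries its own dependency chain, so the two chains can proceed in parallel through the pipelined units, and the partial results are combined once at the end.

```c
/* Two-way unrolling with two separate accumulators (sketch):
 * t0 and t1 form independent dependency chains, so their multiplies
 * can overlap in the pipelined functional units. */
double prod_unroll2_2acc(const double *x, int n) {
    double t0 = 1.0, t1 = 1.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        t0 = t0 * x[i];          /* chain 0 */
        t1 = t1 * x[i + 1];      /* chain 1, independent of chain 0 */
    }
    for (; i < n; i++)           /* cleanup for odd n */
        t0 = t0 * x[i];
    return t0 * t1;              /* combine the partial products */
}
```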
|
|
60:17 | So we still have the, um, latency for each one of these two chains, |
|
|
60:23 | right. So we had enough functional units to do things in parallel. But we |
|
|
60:34 | still have the latency dependency in each one of these variables. So here's a small |
|
|
60:40 | summary, right, of what we've been illustrating: these kinds of pipelines, where we have multiple functional |
|
|
60:48 | units that can do the same thing in the same cycle. But then, also |
|
|
60:54 | , we have a latency in each of the units. So now I'm going |
|
|
61:03 | to try to combine these two to get to the things that are bound by throughput |
|
|
61:09 | alone. And that is to combine the loop unrolling and using multiple, separate |
|
|
61:22 | accumulators, and do the association in a proper way. So now I'm not, like, |
|
|
61:30 | showing a code example, but simply what happens with the use of unrolling and multiple |
|
|
61:39 | accumulators, in this case for the multiplication. With a single accumulator, in this case, |
|
|
61:51 | then, regardless of the loop-unrolling depth, you're still latency bound. But |
|
|
61:58 | you can see in this case, if one does it right, the best case is |
|
|
62:02 | getting to the throughput-bound limitation, and that is 0.5, because again, you have |
|
|
62:11 | multiple functional units. So even though the throughput, or there the cycles-per- |
|
|
62:22 | issue, was one, you can feed, you know, one pair into the pipeline |
|
|
62:29 | every cycle, but you have two of them. So that's why the throughput bound |
|
|
62:33 | is 0.5. And you get very close to it. It also shows a little bit |
|
|
62:41 | that if you start overdoing the loop unrolling, to a very high degree with multiple accumulators |
|
|
62:47 | , you may actually become less efficient. So there's usually a trade-off |
|
|
62:52 | in how far you want to unroll things, and it's all depending on, |
|
|
63:01 | you know, the resources: the number of registers you have accessible. Because if you have |
|
|
63:07 | several accumulators, you have several variables to keep track of, and at some point you get diminishing |
|
|
63:13 | returns. And then this is, I guess, for the integer part; it shows |
|
|
63:19 | a somewhat similar trend. You can see things are just a little different in |
|
|
63:25 | terms of latencies and other numbers, but the principle is the same. So, |
|
|
63:34 | yes. So this was, um, working through how to use loop unrolling, which |
|
|
63:44 | also gave benefit in the case where the units are pipelined; in this case it was both |
|
|
63:50 | load/store and the arithmetic units being pipelined. Any questions on that? So, um, |
|
|
64:05 | loop unrolling is something compilers definitely are there to do, and they're usually really good |
|
|
64:14 | at it. But, um, it's also something one can do by hand. But |
|
|
64:21 | the depth of the unrolling that is the best is platform dependent. |
|
|
64:29 | Because, as I mentioned, it depends on the size, or the number, of registers |
|
|
64:33 | you have, or, as it is sometimes called, you know, the size of the register |
|
|
64:37 | file. Um, so it depends on, you know, how much can be kept close |
|
|
64:43 | to the functional units without having to go to caches or, uh, main |
|
|
64:48 | memory. If you're too ambitious and go too far, things may not fit in |
|
|
64:54 | registers anymore, and then things, as I said, get spilled to memory. So, |
|
|
65:05 | okay. Now, a few more things that the compilers do. They try to |
|
|
65:10 | remove code that defines things that are not used. It reduces the code size, obviously, if |
|
|
65:14 | it's never called. Um, like in this case, this variable, i, say, has |
|
|
65:24 | no use. So just get rid of this piece of code: don't execute it, |
|
|
65:30 | don't use it. Then there's this other form, common sub- |
|
|
65:37 | expression elimination, which is to just evaluate an expression once and then reuse it. In particular, if |
|
|
65:45 | there are complex functions; logs I also mentioned: |
|
|
65:50 | log is an expensive operation, not a one-cycle operation. Log and the intrinsic functions, |
|
|
65:58 | sine, cosine and other trig functions, exponentiations, all those functions that are a built-in |
|
|
66:07 | thing in most programming languages, they are quite expensive compared to an add or a multiplication. |
|
|
66:16 | So evaluating them once, instead of repeatedly for the same argument, is good practice. |
|
|
66:21 | This is kind of another kind of strength reduction, that maybe should have been shown earlier, but it is used as common- |
|
|
66:27 | subexpression elimination, in this case for the array arguments. The explicit version |
|
|
66:33 | is maybe nicer, to read and understand the code. And then |
|
|
66:40 | the right-hand side is economizing on the number of arithmetic operations that it does, |
|
|
66:45 | by using a couple of temp variables. So this is another thing compilers try to |
|
|
66:54 | do: if you happen to know at compile time a certain variable's |
|
|
67:01 | value, then the compiler may actually just totally eliminate operations. Like, adding zero doesn't do |
|
|
67:07 | anything, multiplying by zero gives a known result, multiplying by one doesn't do anything, so it |
|
|
67:13 | may eliminate those operations from your code. |
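For instance, an illustrative sketch of constant propagation and folding, not taken from the slides: once the compiler knows that scale is the constant 0.0, the entire multiply-add collapses to just y.

```c
/* If `scale` is known to be 0.0 at compile time, a compiler can fold
 * a * x * scale + y down to just y (illustrative sketch; the name
 * `scale` is made up for this example). */
static const double scale = 0.0;

double axpy_elem(double a, double x, double y) {
    return a * x * scale + y;   /* folds away when scale == 0.0 */
}
```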
|
|
67:20 | And this is, I think, what I already said: what compilers try to do, and what I showed with this example of the |
|
|
67:26 | source-to-source transformations. One wants to help the compiler, as a programmer, or do it in the |
|
|
67:35 | first place, so you don't have to, uh, rewrite. But being aware of |
|
|
67:40 | what compilers try to do helps in doing the source transformation, in order to make code |
|
|
67:46 | that, uh, can execute efficiently. Now, I had a couple of more |
|
|
67:55 | things, um, in terms of instruction-level parallelism. So this was |
|
|
68:03 | covered, not too much of it, a little bit, in terms of using potentially multiple functional units |
|
|
68:08 | at the same time. But this is more of a, still a very simplistic picture |
|
|
68:15 | of a modern processor. As we know, Stampede2, you know, has |
|
|
68:26 | 24 cores, if I remember right, and that makes up the Skylake version. No, |
|
|
68:32 | 28, I guess; one or another Skylake, a different version. I don't |
|
|
68:37 | remember off the top of my head right now which one is in Stampede2. But that's |
|
|
68:41 | beside the point. And then each of the cores has also replicated |
|
|
68:49 | units of the same kind. So in this case, this little graph shows how |
|
|
68:56 | the arithmetic units have been replicated, and there's a separate load and separate store unit |
|
|
69:03 | . So the previous example was actually the load/store in this, uh, that had |
|
|
69:11 | units that can do both loads and stores, anyway. So these so-called wide |
|
|
69:17 | instructions, that is, wide words that, you know, may contain a single opcode each |
|
|
69:25 | , uh, allow you to potentially schedule operations in the same cycle for several of the |
|
|
69:32 | units. So this is, I guess, what I already said. And sometimes it's called VLI |
|
|
69:39 | W; sometimes superscalar. It has been around for quite some time. And here is a |
|
|
69:48 | repeat for the Haswell. In terms of arithmetic units, there are, there, |
|
|
69:56 | um, two floating-point multipliers, one add, one divide and four integer units. And |
|
|
70:02 | then there are the load/store units. Um, so now, for the Skylake, |
|
|
70:11 | it has what's known as the Advanced Vector Extensions, version two. So |
|
|
70:19 | it's the AVX2, for short, when it comes to Intel's vocabulary, and also 512- |
|
|
70:26 | bit-wide instructions. And it is adjusted a little bit when it comes to the data |
|
|
70:33 | part, or data width, which may contain, in fact, 512 bits. Um, so, |
|
|
70:44 | um, they can have, you know, four double-precision. This doesn't come out |
|
|
70:56 | right in my mind. So this version of the graph doesn't quite fit: |
|
|
71:01 | it should be an eight there; eight times 64 is 512. Uh, it's not |
|
|
71:07 | right. So this, you see, every time you look at something, you find |
|
|
71:12 | errors like that. But the idea is right. So this basically, um, |
|
|
71:18 | it's, um, yeah, me correcting myself; that's what happens when you talk |
|
|
71:30 | at the board, kind of thing. Um, but the point is, it is 512 bits, |
|
|
71:36 | and that allows you to have eight double-precision numbers or 16 single-precision numbers. |
|
|
71:49 | But the principle I'll talk about is, um, the same. Now, SIMD is |
|
|
71:59 | not exactly the same as the wide instructions I talked about before. |
|
|
72:05 | SIMD is the same operation, so the same opcode, or instruction, for each element of |
|
|
72:14 | an array. So that's why, for instance, in the Stream benchmark, take the add |
|
|
72:24 | test: we add corresponding elements of two arrays, so for each pair of elements |
|
|
72:29 | we use the add instruction. So that's very amenable to SIMD operations. On the |
|
|
72:37 | other hand, the wide instruction allows you to combine different operations in it. So the SIMD |
|
|
72:44 | operation is, you know, something that is usable when there are multiple adders |
|
|
72:50 | or multipliers, for instance; then you can have this SIMD instruction. And this slide |
|
|
72:56 | shows that, depending upon the width of the operands, you get fewer of them per instruction. |
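The Stream-style add just mentioned is the canonical SIMD-friendly loop; a minimal sketch (assuming arrays of doubles): every iteration is independent, so with 512-bit instructions a vectorizing compiler can pack eight doubles, or sixteen floats, into each add.

```c
/* Stream-style "add" kernel: independent element-wise adds that a
 * vectorizing compiler can map onto wide SIMD instructions. */
void stream_add(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* no cross-iteration dependency */
}
```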
|
|
73:02 | So now, I think this, I believe, is the final slide |
|
|
73:13 | on this example that I borrowed from colleagues at CMU: that now, if you also |
|
|
73:23 | use the fact that you have this SIMD capability in each, um, of the functional units |
|
|
73:33 | , then in this case, you get a factor of four in terms of |
|
|
73:41 | throughput, even, um, in terms of, sorry, and, yes, more or less; |
|
|
73:48 | it depends on the particular operation. But it is possibly a factor of four for the floating |
|
|
73:55 | point, and it's a little bit different for the integer add, because there are |
|
|
74:03 | more integer units than there were floating-point units in this particular processor architecture. So in |
|
|
74:12 | this case, um, what one can see is: using the tricks of unrolling, etcetera, |
|
|
74:19 | and then using the SIMD feature on top of that to, uh, exploit all the parallel |
|
|
74:25 | functional units, the outcome is very close to the lower bound, at least for |
|
|
74:35 | the floating point. The integer multiply was, I guess, quite successful in |
|
|
74:41 | this case as well. So, I think, um, any questions on that? So |
|
|
74:56 | I think the punch line all along this example is that, I would say, in |
|
|
75:05 | fact, it normally pays to be aware of this, and to not count on the compiler |
|
|
75:15 | being successful in restructuring a fairly nice and clean code, as opposed to something that is |
|
|
75:24 | prepared for what the compiler will do, by rewriting your code. Because, again, you |
|
|
75:31 | know what is safe to do, and the compiler has to infer it from something that is |
|
|
75:38 | a different code, potentially. Okay. And I had a couple more slides in |
|
|
75:46 | terms of branching. And branching is a little bit tricky. But in case one is |
|
|
75:56 | not familiar with it, I will just quickly show what modern processors |
|
|
76:04 | do. Most of them have what's known as branch prediction. So when there are |
|
|
76:14 | conditionals in the code, they try to guess which way the execution is going to |
|
|
76:22 | go. And based on the guess of where the execution is going to go, they |
|
|
76:32 | load instructions and data, assuming that's going to happen. And if it guessed wrong, |
|
|
76:46 | they, in a way, need to undo that, and basically one needs to, |
|
|
76:58 | on the error, either backtrack or restart the execution, um, past the |
|
|
77:04 | branching point, um, and figure out which way to go. So if |
|
|
77:12 | it has executed ahead, then there's a need to roll back, because then some of |
|
|
77:16 | what has been computed is wrong. Right. Um, and if so, |
|
|
77:24 | then the preparation for the wrong branch is already wasted, and it basically then needs |
|
|
77:32 | to start to load things according to the, uh, branch that the execution |
|
|
77:38 | actually wants to take, as opposed to the way it guessed. So this is kind of |
|
|
77:44 | a little bit what these several slides tell: what goes on at the branch, and kind of how it |
|
|
77:54 | works. And nowadays, processors have quite sophisticated record keeping: on the fly, they gather |
|
|
78:06 | statistics for branches, and see, for the particular code and the data set it is being run |
|
|
78:17 | on at run time, you know, how many times branching was going this way |
|
|
78:22 | and how many times it went that way. And they continuously update the history of |
|
|
78:29 | branches taken, in order to try, in the end, uh, to minimize execution time. |
|
|
78:36 | So, since my time is up, I will just show you what slides are |
|
|
78:43 | in there, and you can take a look at it. But it's just explicitly |
|
|
78:47 | showing what I just said, how this behaves in terms of branch prediction, in case |
|
|
78:55 | one is lost on that. So with that, I think, we'll stop, then, today. |
|
|
79:06 | Because, again, just summarizing: be aware of the machine architecture, er, and try not |
|
|
79:17 | to challenge the compiler more than necessary to get your code right, and then use good tools |
|
|
79:26 | to understand what actually happens. And there are some on this slide, or maybe on the next |
|
|
79:32 | slide. And some of these tools can help; you can ask them to output the assembly and see |
|
|
79:39 | what happens. Yes. Okay, so we'll stop there, and take, if there are |
|
|
79:52 | any questions, and stop sharing the screen for the moment. What? Any questions? If |
|
|
80:17 | not, no? I had a question on the last part, that is, the |
|
|
80:25 | branching. So the effect of loops on performance, can you show the last |
|
|
80:31 | slide just now? That, yeah? Yeah. Uh, mhm, slide 74 or |
|
|
80:52 | so. Okay. Okay. All right. Hmm. So this, this branching on this |
|
|
81:04 | slide. Okay. And now, mm, so, the prediction: I did |
|
|
81:12 | not get the last part. How is the prediction done? Like, how are |
|
|
81:17 | the performance and the prediction linked with respect to the loops? Okay. So branching |
|
|
81:27 | does not necessarily need to be inside loops; they're independent concepts. Right, |
|
|
81:35 | so you may have conditionals without loops: it is just, uh, straight-line |
|
|
81:43 | code, no loops, and depending, certainly, on data values, you want to |
|
|
81:48 | do one thing or another. So the branching has to do with what code is |
|
|
81:57 | going to be executed, depending upon whether a condition is true or false, for instance. So |
|
|
82:08 | that's what the branching is about. And if one has no prior knowledge, then, |
|
|
82:19 | you know, the best guess is basically equal probability for each branch. If, as in |
|
|
82:28 | this case, the conditional is inside a loop at some level, then there is |
|
|
82:42 | a history, saying that in, say, the first 30 times this conditional was, uh, |
|
|
82:53 | executed, if we call the two outcomes just the left and the right branch, you know, |
|
|
83:00 | in 25 of these 30 cases the left branch was taken, and then only in five |
|
|
83:06 | was the right branch taken. So can I then set myself up for assuming that |
|
|
83:13 | the next time that this condition is evaluated, it will probably, most likely, go the |
|
|
83:22 | same way as the 25 other ones? Right. So then I can preload all the |
|
|
83:28 | instructions and the data for that branch of the condition. So, on the other hand, |
|
|
83:39 | when it actually evaluates the condition, if it turns out it was going in the other |
|
|
83:44 | direction, where only five instances have been seen, then, obviously, I kind of have |
|
|
83:53 | to restart things from where the condition was, in the state that is proper at that |
|
|
84:00 | stage in the code, not, not as some, um, later state |
|
|
84:07 | that assumed that it was going the other way. So you have to, in |
|
|
84:10 | that case, in a formal way, backtrack the computations, or keep only the things you can |
|
|
84:16 | do without using the variables coming in from the condition. Okay? |
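One way to see the cost being discussed, as my own sketch rather than anything from the lecture slides: a data-dependent branch that goes either way about half the time mispredicts often, and a common workaround is to turn the selection into arithmetic so there is no branch to predict.

```c
/* A conditional inside a loop: for random data this branch is taken
 * about half the time, which is the worst case for the predictor. */
long count_negative_branch(const int *x, int n) {
    long count = 0;
    for (int i = 0; i < n; i++)
        if (x[i] < 0)          /* hard to predict for random data */
            count++;
    return count;
}

/* Branch-free version: the comparison yields 0 or 1 directly, so
 * there is no branch for the predictor to get wrong. */
long count_negative_branchless(const int *x, int n) {
    long count = 0;
    for (int i = 0; i < n; i++)
        count += (x[i] < 0);
    return count;
}
```

Compilers often make this transformation themselves (a conditional move), but whether they do depends on the target and the surrounding code.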
|
|
84:28 | Thank you. So that's what the branching does. And it shows, in |
|
|
84:38 | the next slides, in some detail, how it works in a particular example. Yeah. Okay. Anything |
|
|
84:53 | else? So I guess I'll stop the recording, then. Thank you. |
|