© Distribution of this video is restricted by its owner
00:05 | OK, so today we'll talk about tools for understanding the performance of codes; that's really the focus of today. We'll look at one particular tool that is listed at the bottom of this slide, properly known as PAPI, for Performance Application Programming Interface. |
00:38 | First I will give some higher-level concepts that hopefully are not too abstract, and then we'll talk about PAPI, and there will be a demo of some of the PAPI commands that will be used for the next assignment. |
01:16 | The focus of the class today is really single-thread performance. The tools we'll talk about apply to single-thread, multi-threaded, node, and cluster performance; it's not that the tools are focused on just one thing. But the focus of this class is, I would say, maybe not the simplest in actual terms, but in complexity it is the simplest: dealing with a single thread, before graduating to more complex situations with multiple threads and multiple cores. |
02:03 | One point I will come back to a few times: debugging for correctness is one thing, but when you are trying to understand performance, one needs to be very careful in selecting data sets. You don't necessarily want very large data sets, because they generate a lot of data that is perhaps very hard to penetrate and figure out, so one needs to be fairly deliberate in the choice of the data set used for trying to assess performance. |
02:40 | Then, of course, depending on your objectives, you select the proper tools; we'll try to point you to some in this class. In one way or another the code needs to be instrumented so you actually get the information you hoped for, in order to get insight into the performance. Then you run things, and it is often a fair amount of work to analyze what actually goes on in the code, to figure out where the issues might be that lead to poor performance, and to try to make some changes to the code that will improve the performance. One then basically iterates a number of times until one is happy with the performance of the code. So the premise is: there are three things you need to know. |
03:45 | The first is the application and what it actually requires: its characteristics, its basic properties. I say the application rather than the code; I'll try to be more precise as we go. In this case the application is the problem you're trying to solve, together with the algorithms you have selected for solving that problem. Once you have selected the algorithms, as well as knowing what you're trying to find out, you can get basically an assessment of what the workload requirements are: how many floating-point or integer operations, how many memory references do you potentially need, or at least how much data do you have. |
04:47 | We talked a couple of lectures ago about the arithmetic intensity measure, which gives an idea of the balance between the computation required and the data access requirements. |
05:13 | It also comes from the problem and the algorithms how much parallelism you have. Maybe there are so many sequential dependencies that you can't really make use of a lot of parallelism. These are the kinds of high-level concepts that come from the application, and they bridge a little bit to the hardware. On the platform you're using, I would say the single weakest link, typically, for most applications, is the memory system, so try to understand what the memory requirements are. Perhaps you have more data than fits in a node, so based on your data sets you can decide that you need whatever number of nodes is required just to fit the data. |
06:16 | Another one: if it turns out that your application is mostly memory-bandwidth limited, maybe you want to choose the number of nodes not based on how much data you have, but on the memory bandwidth you need so that, ideally, you get the execution times you hope for. And then there is the question of what kind of memory requirements you have, which again depends a fair bit on your algorithms. So this all comes back to starting with the problem and the algorithms to get an assessment of your requirements. When it comes to parallel computation there is another aspect in terms of how you distribute the data; later on, of course, we'll talk about the parallel aspects. For now, for assignment one, we are simply dealing with the single-thread case to start with. Yes, a question? |
07:23 | [Student] Would you be able to give an example where fewer processes with more threads would be preferable? |
07:31 | OK, so let me try to tease this question apart a little bit. At kind of a high level, there is this thing that was mentioned, I guess in the last lecture, in terms of hyper-threading or simultaneous multithreading. That means you have two or a few execution contexts sharing the same piece of hardware. If things are constrained by your memory system, it might be useful to use multiple threads to share a single core, because while threads are waiting for things from memory, other threads can use the functional units that would otherwise be idle. If you have an application that tends to be fairly heavy on compute and logic operations, then using multiple threads that compete for the same functional units may, in fact, degrade performance. |
09:01 | So in that case, at the detailed core level, it depends on the nature of the application whether you want to have multiple threads sharing a core. That is, for instance, the difference I see between the kinds of science applications at the two centers we're using, Pittsburgh versus TACC: one has in their case enabled multiple hardware threads per physical core, whereas the other, at least until recently, chose not to do that. |
09:53 | Now, what level of parallelism to choose, meaning how many cores you want: a core today, if we talk floating-point numbers, can do a few tens of gigaflops per second. If you have something that needs more and you don't want to wait, you choose the number of cores based on the rate you'd like to get, given the functional units you have. For most things today, and pretty much everything from neural-network type workloads to science and engineering computation, the problems are large enough that in order to get reasonable execution times you choose, I would say, hundreds or even thousands, or in perhaps more extreme cases millions, of threads. |
11:03 | Now, if you have something that is memory-bandwidth limited: a memory channel today, with a memory DIMM on it, can probably do somewhere between 20 and 30 gigabytes per second in terms of its data rate capability. If your data set is terabytes, it won't even fit in the memory of most nodes; there are some, you know, what they call fat nodes, that may have one terabyte of memory. If you try to access that data through a single memory channel, and assume, for simplicity, 25 gigabytes per second, then it takes 40 seconds just to read the data once. Maybe that's acceptable, maybe not. In that case you may choose the number of execution threads and the parallelism based on how many memory channels you want, to get reasonable access rates to the memory. |
12:17 | Again, many of the computational science applications, from fluid mechanics to structural mechanics, and also machine learning, use very large data sets to get their work done, sometimes for various image or speech recognition tasks. In terms of machine learning, the training part may use billions of images, and that is, you know, petabytes of data. So in that case, again, you need high memory bandwidth in order to get reasonable execution times. Did that help? I didn't give you a particular formula; the idea is to assess your computation and data set requirements and then see how they map to the hardware. |
13:17 | OK. So then, at the very other end, is the system, the platform you're using, which I kind of briefly touched on in answering the question. You want to understand how many processors there are on a node, or sockets; I use processors and sockets interchangeably here. As I said in a previous lecture, the most common configuration is two-socket nodes, or two processors per node, but four is not uncommon. So that gives you one thing, and that's part of what the exercise in assignment one is: to learn how to find out what your hardware environment is. It also tells you how many cores there are and how many threads you can use per core. Those are all questions that you should ask yourself. |
14:26 | You are trying to build a model of what to expect, in terms of the problem you have, given the platform you're going to use. So part of it is the processing units, and the other part is the memory system. And then, if you have, as is typical, nodes with accelerators, so you have something with GPUs, it really depends on the platform. On a laptop processor the GPU is most of the time integrated into the same piece of silicon, but when you look at clusters or things built for more computationally intensive use, the GPUs are typically something that sits on the I/O bus. So then you need to figure out what the ability is to move data between your accelerator or GPU and the processor. This is the kind of system-architecture-level thing that will help you set expectations for what the execution time ideally could be if things are very well used. So that's again the motivation for some of the tasks you're asked to do in assignment one: learn how to find this information so you can build your model for what to expect. This is just a summary; I won't go into it, it's pretty much the same thing. |
16:13 | I had a similar slide in the processor lecture last time. Then the third component is the code, because the code is the thing that basically is the mapping between the application and the hardware. You write your problem in C or C++ or some other conventional programming language, and those don't really say anything about the actual characteristics of the platform you're using. They are supposed to be a symbolic representation of what you want to get done. That means compilers and other parts of the software stack are the bridge between your text, in the form of the source code, and the actual hardware, and that mapping these days is quite complex. So translating your description of what you want done into what actually happens is complex, and the performance tools we'll talk about are there to help you assess how good the code eventually ended up being. |
17:46 | The way to approach this is often to first try to figure out what the performance-critical parts of the code are, and for that one uses something like profilers, which we'll talk about in the next lecture. Then, once you have found the parts that may be the ones that take the most time, you use performance counters to figure out in more detail what goes on in those parts, and today we'll talk about the tools to get started on that part. |
18:19 | So, one, two, three things to keep in mind: understand the application, understand the hardware, and then try to understand the code. We'll talk later on, in some coming lecture, about the compilation process and the optimization that goes on, and how one can help compilers do the things one was hoping they would do automatically. And this comes back to my earlier comment that you need to be conscientious when you get to performance optimization or debugging: careful about the data set you choose, not too much and not too little. |
19:13 | As I said, what I was trying to emphasize with these three parts, application, system, and code, is to have a model in your head of what to expect, and based on that to then attack the problem of optimizing the performance. And the thing is that none of these tools will solve the problem for you. You should definitely use the tools to gain the insights, but it's unlikely that these automatic tools will actually solve your problems. So an understanding of the three parts that I mentioned, plus the proper tools, is what you should expect to need to get good-performing code. |
20:12 | The tools are typically not of the one-size-fits-all type; they are focused on different parts of the system that you have. So, as I said on the first slide, PAPI is the focus today, and it's kind of focused on the processor part, even though it does have features that also address network behavior. It started out as a processor-focused tool but has since been expanded to cover additional things. Then there are other tools, for example, addressing more of the memory hierarchy, or tools for parallel codes. Later on, of course, we'll talk about MPI, the message-passing interface that is used as a programming paradigm for clusters, and there are performance tools addressing, among other things, MPI. And then there is one more tool that we'll talk about in the next class, which is known as TAU. But today it will be PAPI. |
21:46 | Excuse me. So the very first thing one tends to do is just to collect timing information. Make sure the timer is appropriate; we talked about that early on, I think in the second lecture. It's important to use the proper tool to collect execution times. The general advice that I have is to use something that counts cycles, because anything that otherwise uses wall time may not give you enough insight, since it depends upon other things happening in the system, whereas counting cycles is just sort of more accurate. And you can do this at various levels: for segments of your code or for the whole program. |
22:50 | To understand the workload, you want to look at things that are related to the arithmetic or logic operations that the code is supposed to do. So that's kind of targeting the functional units and trying to understand how well those are being used by the code. To get insight into what actually happens, one may need to look more at what is going on in terms of the memory hierarchy, in terms of cache hits and misses, and we'll talk more about cache behaviors and insights into those in a coming lecture. These are kind of high-level pointers. |
23:39 | The other thing, when you try to use these tools, is to be aware that tools may in fact both add overhead and change the behavior of the code. That's what I kind of warned about in the debugging lecture: if you add printf statements, it potentially not only changes execution time, but it can also change code behavior. So the tools differ: some of them are not intrusive at all, and some of them are, in fact, quite intrusive. But depending on what you do, it's not necessarily the case that using a tool that ends up potentially changing the behavior of the code is a bad thing; maybe that's what's necessary to get the information. In some cases instrumenting means inserting statements in the code and then recompiling; in another case you may link in some library that basically ties into the executable and then collects information on the binary. |
25:03 | In this course we will mostly use open-source tools, because they are, well, easy: you don't need to sign a license or buy them from some vendor, because not all of the vendor tools are free. It also means that if you use the open-source tools, they cover, not all, but most vendor platforms, so in that sense your way of working is kind of portable once you get familiar with the tools. On the other hand, the vendor tools may have specific features for their specific platforms that the open-source tools don't have. So there are, as always, benefits and drawbacks to each. Another aspect of the tools to be aware of: part of the instrumentation is static. |
26:00 | That means you do it prior to execution; you may, you know, insert things in the code and recompile, so that's clearly static instrumentation. There are other approaches that allow you to dynamically instrument code depending upon what you are after, and how this is done can be either manual or automatic. This should give you an idea: it's a fairly rich set of tools out there. We will only deal with a couple of them, doing mostly processor-focused instrumentation with PAPI, and then, for the profiling part, TAU. |
26:49 | This is just to make you aware that there are lots of tools out there; this is kind of a cartoonish picture. I'm just trying to point out another thing: many of the tools are basically based on sampling, because too much detailed information can sometimes be overwhelming. It generates a lot of data, and if you collect lots of data, that also means the tool itself generates overhead or changes the program behavior, since those things tend to need to be written out to disk if it is gigabytes of data about what goes on. But if you use the statistical, sampling-type approach, it also means it is statistics, so you may not have all the details you want. |
27:48 | The alternative is direct measurement. If you do direct measurement, it hits exactly the target that you want, and it can give very detailed information that the statistical tools may not give you. But, as I said, it often changes the executable and can cause overheads. So this is again a matter of learning the trade-offs of the tools. When you try to interpret the end outcome of using the tools, you need to be aware of how they affected your code. |
28:43 | These are event-triggered tools that can measure, as the slide points out, either exclusively or inclusively, depending upon how things are being done, and then there are kind of atomic events. But this is again largely about the characteristics of the tools; for the class, we'll talk exactly about what PAPI does in this regard. And again, as I mentioned already a few times, one needs to be aware of how the tools potentially affect the execution, the overhead they introduce, and of them potentially corrupting the measurements. As we pointed out in the debugging lecture, the resolution of the timers and counters is important, and we'll come back to this concept of granularity when we talk about PAPI. |
30:01 | As for the granularity of the measurement: typically the process is to start with some profiling, trying to find out what the time-consuming parts of the code are, and to focus on those. Then you sort of try to drill down, unless you have a good hunch about where the problem might be, in which case you can dive directly into a more detailed inspection of where things happen, down to statements in the code. So this is kind of just summarizing what I said: one unfortunately tends to end up using a few tools, or a collection of tools, as one learns more about the behavior of the code. |
30:53 | So now, any questions on this kind of general introduction to performance optimization? Otherwise I will sort of dive in to talk about the specific tools that we will be using. "I don't see any questions in the chat." OK, thank you. |
31:19 | So I will talk about PAPI. First, some general background. For many user communities, performance is a key issue because, of course, many want their code to run faster and try to figure out what goes on. And it is the case that most processors today have what are known as hardware performance counters that record data from program execution. |
32:20 | Early on there was a performance game, I would say: the vendors, Intel, AMD, IBM, and the others who build the processors, did not want to give access to this information, because they viewed it as very sensitive, for competitive reasons, to have people gain detailed insight into how their processors work. Eventually, they came around and realized that having the user community tell them what the potential issues are with their processor was actually helpful for designing the next-generation processor, and that they couldn't really do all that sort of analysis themselves and gain all the insights. So they allowed the academic community to eventually build tools to access these counters that were built into the processors and collect useful information, so that even end users, the programmers of applications, could learn how to better structure their codes to get performance. |
33:45 | The group that kind of pushed the hardest and got processor vendors to agree to open up was the one from UT, the University of Tennessee; they were the original designers of and contributors to PAPI, which is now basically supported by lots of people. So I will talk about the various parts of PAPI that are relevant and that will be exposed in using PAPI, as a kind of quick review of some of the interface. And then Joshua will demo some of the features that are particularly useful in understanding behaviors with respect to the processors that you use. |
34:49 | So, by now PAPI basically supports most processors out there. In terms of processor vendors, AMD and Intel pretty much cover the x86 market. Then there is IBM, which does high-end server processors with their server series, and more recently, processors designed based on Arm have also become more common and are gaining adoption, so they are also supported. And PAPI also supports GPUs from both NVIDIA and AMD. So certainly what you will be using in the course is supported by these PAPI tools. |
35:44 | What PAPI allows you to do is to collect data, not just about timings; it also gives you kind of summary information, like instructions per cycle, for code segments or entire codes or per thread, however you choose, and I'll give some examples as we go. And then you can also understand cache behaviors, branching behaviors, and memory and resource stalls. So it provides a lot of the information needed for understanding what potentially is limiting the performance of your code. |
36:38 | In the past few years they have also added energy, since energy and power consumption is now one of the critical aspects of interest. And as I mentioned, you can get this information down to sort of the basic-block level in your code, or per process or per thread. Just in case the concept of a basic block is not familiar to someone: it's the little unit that, you know, the compiler people talk a lot about, and it's basically a code segment which has a single entry point and a single exit. In this simple illustration there is a piece of code on the left, a bunch of statements, and then on the right-hand side you see kind of an illustration of the basic blocks. The main point is just that a basic block is something you cannot jump into the middle of; you enter at the top, and you exit at the end, after which control may go somewhere else. So that is the concept. I'll come back to it when I talk about compilation, but this was just to familiarize you with what a basic block is. |
38:08 | Now, PAPI works with these performance counters that exist in the processors, and things are kind of classified by events. Events are, in turn, classified as either native events, which are directly what the hardware performance counters tell you, or preset events, which may be groups of these native events that are perhaps more useful in understanding the code behavior. I have some slides on the preset events and the differences, and I think Joshua will also demo some of them. Events are then often collected into event sets. |
38:59 | One thing to be aware of, and it unfortunately may make tools like PAPI a bit cumbersome to use, and it's not a problem with PAPI itself, is that there is often a limited number of hardware counters that can be used to collect this information from the processor. The processor can count all kinds of things, but some of the counters the vendors keep for getting their own information, so the counter registers left for PAPI tend to be limited. That means sometimes one has to make several runs to collect all the events one wants, because in any given run you can only collect, let's say, five or ten pieces of information. |
40:06 | That is a little bit of a complexity, because with a statistical tool every run may not be an identical run. So that's something one has to be aware of: because of the statistical nature, and having to do multiple runs, the runs may not be exactly the same. |
40:43 | So there are, again, the preset events, which are collections of useful native events that the designers probably thought a lot of people would like to collect, a collection of native events, or sometimes just one native event by itself, so you don't necessarily have to create kind of an event set yourself. Now, not all preset events are supported on every platform. In the case of Stampede2, there are 59, according to the most recent info we have, generated yesterday: out of the, in fact, 108 possible preset events that exist at the moment, 59 are available or supported. |
41:54 | The preset events are kind of events that are composed of a few native events. The native events are unique to every platform: it's whatever AMD decides to support, or Intel decides to support, or IBM decides to support. There is no common agreement among the processor vendors that they all have to support something. That said, most of the native events are similar, but not all of them. The tools also tell you, on the actual machine, what's available: papi_avail tells you what all the preset events are, and papi_native_avail tells you what the native events are, and again, we'll see and talk more about that in the demo. |
42:53 | Then one has to decide what one eventually wants to measure for the code, what is of primary interest: whether it's some efficiency measure, which is more of a global thing, or more detailed information about cache behaviors, or, for example, something about the kind of memory mapping that comes in via the translation lookaside buffers, TLBs, which we will not talk much about in this class, but I am happy to answer questions about it. Those of you who have taken computer architecture classes should be familiar with it: in modern architectures, at any given point, not all the memory is actually mapped directly, so you go to one of these lookaside buffer tables when you need to remap or access some other part of memory that you need access to. And there are all kinds of other details. |
44:14 | So now I'll flip through a few slides, because I think this is what Joshua will cover, in fact. This is just the list of the 59 events; there are three or four slides here. If you look at it, they are in no particular order, and I don't know what the print order was. In this list there are things that deal with the level-one, level-two, and level-three caches, and it counts misses, and it can separate both data and instruction caches, and both hits and misses. Under the "Avail" column you can see which ones are available on Stampede2 in this case and which ones are not. In most cases you can also see whether an event is, in fact, a native event or derived. If you look at the very first one, level-one data cache misses, it's not a derived event but actually a native event. On the other hand, if you drop down a few lines to level-two data cache misses, that's apparently something that is composed of a few native events. And you can see, if you look down past the middle, cycles for which floating-point units are idle, so you get information about when functional units are basically idle and waiting for something. |
45:49 | Then there are other things that are important to you, and that was, you know, something I tried to point out in terms of streaming: in most processors, when prefetching is not successful, things end up stalling, and PAPI lets you collect counts of stalls, in this case waiting for memory accesses, and in more detail whether it stalled because of reads or writes. It also gives information on branching, whether branches are taken or not. And there is more additional information, so I encourage you to take a look at these slides about the preset events. Again, you just need to run the papi_avail command when you do the assignment, and that will show what is on these slides. |
46:55 | this and then I will hand it to solution. Then I'll come back |
|
|
46:58 | some time waste after his demo. , but there's both to levels, |
|
|
47:07 | level interface and a low-level interface, the high level being stripped down fairly considerably.
|
|
47:12 | Um, on the upside, that makes it easy to use, whereas the
|
|
47:18 | low-level interface gives you much more than that: the ability to specify your own sets of
|
|
47:25 | events that you may want. And I think, um,
|
|
47:30 | um, okay, maybe I'll do a couple more slides and then I'll let him
|
|
47:38 | take over. So here is just some of the high-level, um, interface that is available:
|
|
47:49 | basically, in this case, the, you know, high-level, um,
|
|
47:57 | calls that you can use for measurement. And then there are three in this case
|
|
48:04 | that give you instructions per cycle, or cycles, or floating point operations, or floating
|
|
48:13 | point instructions, or floating point operations per second. Um, and those are not the same,
|
|
48:20 | instructions per cycle and operations per cycle, because, remember, when I talked about the
|
|
48:27 | processors, many of them have ALUs that can do many operations in a single
|
|
48:33 | instruction, so separating out instructions from operations is an important aspect. And I think
|
|
48:42 | this is just a detail that I will skip in the interest of time, and if
|
|
48:48 | there are questions, we can come back to it. And I think there are a couple of
|
|
48:51 | examples here that show you can use this high-level interface: basically you start
|
|
48:56 | the counters, and then you have your code, and then you stop the counters, and then
|
|
49:01 | you get, uh, the number of transactions, in this case in terms of what the
|
|
49:08 | high-level interface gives you; in this case, for the events as defined, the total
|
|
49:15 | number of instructions and the total number of CPU cycles that occurred between the start and
|
|
49:21 | stop of the counters for this piece of code. The low-level interface has a
|
|
49:29 | lot more detail. Here is just an example of what you can do then:
|
|
49:37 | fairly detailed information. And I think I will stop and maybe come back.
|
|
49:42 | And if so, Yasha will start the demo by commenting on the lower-level
|
|
49:47 | example. I want you to have enough time, so that's why I kind of
|
|
49:51 | skipped it; it will be covered in the demo. Okay, so then,
|
|
49:57 | take over. So then there are a few slides that basically are more or
|
|
50:03 | less screenshots of the demo that Yasha will do. And if there's time
|
|
50:11 | left, I will come back and talk about the concrete example. But then
|
|
50:15 | I'll hand it over. Yes. Okay, so we'll just start; let me start
|
|
50:21 | to share my screen. Oh, I got it. Okay.
|
|
50:33 | Great. So is my screen showing up now? Yes. Okay.
|
|
50:41 | And just if we increase the font size now... I am wondering, yeah,
|
|
50:48 | is it visible enough? It's okay with me. I hope the students can see.
|
|
50:56 | Okay. Great. Right. So, as Dr. Johnson mentioned, PAPI is
|
|
51:02 | a performance measurement tool. Now, this tool provides you, uh, two interfaces.
|
|
51:10 | One is command-line based. The second: it provides you an interface,
|
|
51:16 | a, as a library itself. So, like any other library you have in a programming
|
|
51:20 | language, you can use PAPI through its function calls. Uh, now, to
|
|
51:27 | understand what PAPI does: as the professor talked through in the slides, each of
|
|
51:35 | these processors nowadays that we have has a certain set of hardware counters that
|
|
51:42 | can be configured to measure some performance events, based on which event we choose and what
|
|
51:49 | we want to collect. Now, to use PAPI, the first thing that you would want
|
|
51:57 | to do is to make sure you have loaded the module, at least on
|
|
52:02 | ours or any other cluster. Basically, make sure first you are on a compute
|
|
52:07 | node. So I'm on a compute node here, and we already have the PAPI
|
|
52:13 | module loaded. I will still take a moment to show the command you can use to
|
|
52:19 | load PAPI, the latest, latest version of the, uh, PAPI modules.
|
|
52:27 | To make sure we have it, so we use "module list", and you can make
|
|
52:31 | sure it is there. Now, as I said, there are two kinds of events that the
|
|
52:39 | CPUs and PAPI are able to access, and those are the native events and the preset
|
|
52:45 | events. So on the left you can see that kind of description of them.
|
|
52:50 | So native events are pretty much all the events that are available on a certain CPU;
|
|
52:57 | that includes some general events and all the events that are architecture specific. So
|
|
53:03 | when you want to check all the native events for a particular CPU, the
|
|
53:10 | command that you can use would be papi_native_avail, and this is going to be
|
|
53:15 | a very long list, so I would suggest you use it with a pipe to
|
|
53:21 | the "more" command. When you do that, you will see that you can get all
|
|
53:28 | the information about the CPU that you are using. And then it also tells you
|
|
53:36 | that this, this particular CPU has 11 hardware counters. So that means there are 11
|
|
53:43 | physical counters that are present on the CPU, and you can only configure as many events
|
|
53:51 | to collect during a single run of your program as can fit, so
|
|
53:57 | to say, in this number, this number of hardware counters. Uh, so,
|
|
54:04 | so if you just press enter and keep going down, you can see all the
|
|
54:08 | kinds of native events that are available. That's going to be a very long list;
|
|
54:13 | I'll just show you a few of these. It's now coming to the end.
|
|
54:18 | Next would be the preset events. Now, preset events are either
|
|
54:24 | derived from the native events or are just mappings of native events to,
|
|
54:33 | uh, to some general naming conventions for these metrics. Now, what PAPI did
|
|
54:41 | was, they took the most common events across most CPUs, and then they just created a
|
|
54:47 | mapping to some standard names for those events. So when you want to see
|
|
54:56 | all the preset events that are available, uh, you can just use papi_avail.
|
|
55:02 | And so this will be a much shorter list, so I just go to
|
|
55:07 | the top. So this will be the sort of output that you will get from
|
|
55:13 | papi_avail: again, all the information about your, your CPU and all the preset events
|
|
55:20 | that are available on the CPU. You can also see
|
|
55:26 | which events are available, which ones are derived, and get more description about a
|
|
55:33 | particular event. Now, papi_avail also has a few flags that you can run
|
|
55:41 | it along with. So if you run papi_avail, uh, with -e,
|
|
55:48 | and then, let's say, you provide one of the event names, let's say the level
|
|
55:53 | one total cache misses, then you can get more details about that particular,
|
|
56:01 | uh, preset event. So, beyond the name, uh, you can
|
|
56:07 | also see that it's a derived event, and it has two different native events that,
|
|
56:13 | that it's derived from. So you can see its native events, L1D replacements
|
|
56:19 | and L2 requests. These are added together to derive, uh, to derive
|
|
56:27 | this particular preset event. Uh, right. So those are the two kinds
|
|
56:36 | of events that are available on most of the processor architectures. Now, when you
|
|
56:43 | want to access these, you have two options: first is the PAPI high-level API,
|
|
56:49 | and second is the PAPI low-level API. The PAPI low-level API
|
|
56:53 | can access both the preset and the native events, so it gives you
|
|
56:58 | more control over what you want to collect and what you want to do in
|
|
57:03 | your code. It's a, it's a more detailed API. And if you just want
|
|
57:07 | to collect a few preset events, you can simply choose the high-level API,
|
|
57:13 | which can only access preset events. It cannot access native events, but it
|
|
57:18 | is much easier to use if you just want to, uh, do some minor
|
|
57:23 | performance measurements. So now, before we move to, uh, the PAPI
|
|
57:30 | uh, function calls, there are a few more commands that you would want to,
|
|
57:35 | uh, know about. So, first, the papi_event_chooser. So as we
|
|
57:42 | saw, there are, uh, a limited number of hardware counters; that also means
|
|
57:49 | that during a certain run, you can only measure, uh, some, only a
|
|
57:55 | small set of these events. And then there is another thing: that, uh,
|
|
58:02 | quite a, quite a lot of events are not compatible with each other. So you
|
|
58:07 | would want to make sure that you choose the, uh, the events that are compatible
|
|
58:14 | with each other during a single execution of your code. If you use, uh,
|
|
58:19 | events that are not compatible with each other, you will most likely end up with
|
|
58:26 | nonsensical values for your, for your performance counters. So you should always make sure you're
|
|
58:32 | using events that are compatible with each other, and papi_event_chooser is the tool
|
|
58:37 | that would allow you to check the compatibility of events. So the way you
|
|
58:44 | use it is, you would use papi_event_chooser, and then the next thing you would provide
|
|
58:50 | is what kind of events you want to check that are compatible with, let's
|
|
58:58 | say, this particular event, which is the number of single precision operations. When
|
|
59:06 | you do that, it will, it can give you a basic summary of
|
|
59:12 | the CPU and also tell you what preset events are compatible with this particular
|
|
59:21 | event. So let's say you happen to be collecting single precision operations for your code.
|
|
59:28 | You can also collect total number of instructions, total number of cycles, and so
|
|
59:33 | on, alongside that, uh, event, so you don't have to run your
|
|
59:40 | code unnecessarily multiple times. There is another command that you can choose to
|
|
59:50 | use: papi_command_line. And what that does is, it allows you to check
|
|
59:56 | if a certain event is available, uh, on this particular CPU. So,
|
|
60:04 | to give an example: let's say, uh, there was, there was an
|
|
60:08 | event in the native events list that you wanted to check, but you don't want
|
|
60:13 | to go through the whole list; as we saw, there's quite a, quite a long
|
|
60:16 | list. So you can choose to use papi_command_line to check if, uh,
|
|
60:21 | if it is, uh, if it's available on the CPU or not. So
|
|
60:25 | for now, we'll just again use PAPI L2 TCM. Uh, this
|
|
60:29 | is again a preset event. But when you run it, it will give you
|
|
60:36 | a message that, yes, L2 TCM, uh, PAPI was able
|
|
60:42 | to add that event, and it ran a micro benchmark, just a simple code
|
|
60:47 | built in, inside PAPI. And it was able to collect some performance measurements for it.
|
|
60:52 | So this is just to make sure that, uh, this event is available
|
|
60:56 | or not, or can you use it or not. Just to give you a
|
|
61:01 | sense of what happens if you use, uh, an event that's not available: so
|
|
61:06 | the PAPI floating point operations event is not, uh, available on this particular CPU.
|
|
61:12 | So when you try to add that, it will give you an error that,
|
|
61:16 | uh, this event does not exist and there was an error in adding it. So you
|
|
61:19 | can, uh, know beforehand which events you can use and which events you cannot.
|
|
61:26 | So, are there any questions so far? Okay, so those are the commands, then.
|
|
61:39 | Since some of the things I might want to, let me try to again understand. So,
|
|
61:45 | you know, the arithmetic workload: so the floating point operations count is not something
|
|
61:50 | that one can get anymore. It used to be, but not in the current generations
|
|
61:55 | of these processors. But I remember that you can still get the number of floating
|
|
62:02 | point instructions. Yes, you can get that. As well, there is,
|
|
62:07 | there's another event that they've made available. That's the single precision ops.
|
|
62:12 | So, that, yeah, I was, I'm actually wondering why, what's going on with those
|
|
62:18 | two events. Because floating point operations is supposed to give single precision ops as well,
|
|
62:22 | and this is pretty much the same thing, right? So, you know,
|
|
62:25 | we should probably check on that as well, just so everyone should know.
|
|
62:31 | Yes, to make clear that it's not that one cannot get any handle
|
|
62:36 | on arithmetic or logic workload, but maybe not as precise as you want. So one
|
|
62:43 | can get something at the instruction level, but not necessarily at the operation level.
|
|
62:48 | At the operation level, right. Right. The common part of that is, it's
|
|
62:59 | related to the architecture, because when you have these very long instruction words, it
|
|
63:07 | doesn't mean that the full width is always filled with operations; the compiler might fill
|
|
63:14 | some operations, but it may leave, say, a third field there unused.
|
|
63:19 | So instructions may be, perhaps, more reliable in some ways, and the operation count,
|
|
63:26 | the operation count would require additional insights beyond just counting the instructions. And I
|
|
63:33 | think that's part of the reason why they kind of didn't want to tell us the
|
|
63:38 | exact operation count. That was an aside. Okay, so, yes.
|
|
63:51 | So those were the main commands. Well, there's one more command that you guys might
|
|
63:57 | want to know: papi_mem_info. If you use that, you can get,
|
|
64:04 | uh, the memory information about the processor, specifically the caches
|
|
64:09 | and the TLB. So you can see, for this processor, L1 is a
|
|
64:15 | split cache. So it has a data cache and an instruction cache, separate. L2
|
|
64:19 | is a unified cache, and L3 again is a unified cache. So you can
|
|
64:23 | remember that, and an L4, if present, can also be shown. Uh, right. So those were the
|
|
64:30 | commands. Now moving on to the function calls, or you can say the actual
|
|
64:37 | library of PAPI that you can use in C programs. Uh, so this is
|
|
64:43 | somewhat, uh, how the PAPI code would look. So, is there any
|
|
64:48 | question? Okay. Uh, so, so when you want to use PAPI in
|
|
65:00 | your code, the first thing you should, you need to do is, you need to
|
|
65:05 | include the papi.h header file. Uh, you can ignore this line and these
|
|
65:15 | lines for now. But first, if you, let's say, want to check how
|
|
65:19 | many counters are available on the system, now, so if you just start
|
|
65:26 | off just like that, yeah, this is the function that you can use, and
|
|
65:33 | it was in the slides as well: the PAPI_num_counters. So this function just
|
|
65:39 | tells you how many counters are available on your, on the particular CPU.
|
|
65:47 | Then you can, uh, print it out. One thing you should always
|
|
65:52 | remember: whenever you call any PAPI function, you should always compare it with the
|
|
65:59 | constant PAPI_OK. So this if condition should be like this. So
|
|
66:04 | if the return value of any, um, PAPI function is not equal to
|
|
66:11 | this constant, that means that function messed up somewhere, and it's not working correctly.
|
|
66:15 | So you should go ahead and make sure you have loaded the PAPI module and
|
|
66:20 | have papi.h included. If it still doesn't work, then there's something wrong
|
|
66:25 | with PAPI itself. But anyway, always make sure you check this. Now, when
|
|
66:32 | you want to compile this code, uh, the command to do that is,
|
|
66:42 | by using gcc: then you need to include the include path of PAPI, your
|
|
66:51 | source code name, uh, the path to the lib directory of PAPI; then there
|
|
66:58 | are these couple of, well, this one flag that you want to add; and then
|
|
67:03 | just the name of the executable that you want. So when you do that
|
|
67:09 | all and compile it, it will produce an executable like this. And then
|
|
67:16 | when you run it, you'll get the output, simply the number of counters
|
|
67:20 | that are available. So, any questions about that? Okay.
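The num-counters example just walked through could be sketched like this; a minimal illustration only, and the compile paths in the comment are assumptions that depend on how PAPI is installed on your cluster:

```c
/* Minimal sketch of the first PAPI example described above: query the
   number of hardware counters and check the result against PAPI_OK.
   Compile paths are assumptions; adjust to your PAPI module, e.g.:
     gcc -I$PAPI_DIR/include num.c -L$PAPI_DIR/lib -lpapi -o num */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    /* PAPI_num_counters returns the counter count (>= 0) on success,
       or a negative PAPI error code (< PAPI_OK) on failure. */
    int num = PAPI_num_counters();
    if (num < PAPI_OK) {
        fprintf(stderr, "PAPI error: is the module loaded and papi.h included?\n");
        return 1;
    }
    printf("Hardware counters available: %d\n", num);
    return 0;
}
```

On the Stampede2 node in the demo this reported 11 counters; the number is machine-specific.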
|
|
67:33 | Moving on, uh, to the example for the PAPI high-level API. Remember that PAPI's high-
|
|
67:37 | level API can only access preset events. Uh, so when you
|
|
67:44 | want to use just preset events for measuring performance of your code, you can
|
|
67:48 | just use the PAPI high-level API. Again, make sure you have papi.h
|
|
67:54 | included. And what this code basically does is, it just multiplies, uh, two matrices
|
|
68:00 | with each other, which were initialized at the beginning. Now, when you
|
|
68:04 | want to measure some preset events, you first need to create an,
|
|
68:11 | uh, array that contains the names, more or less, so to say, for the
|
|
68:19 | preset events that you want to measure, and you also want an array, that, that
|
|
68:27 | one, that will hold the values read from the counters. And in this case, you
|
|
68:32 | should use the long long, uh, data type for this, for this particular array,
|
|
68:38 | because in many cases these could be very large values. Once you have got
|
|
68:44 | both of these things set up, the first function to call is the PAPI
|
|
68:49 | start counters. You don't have to do any other setup for PAPI. To
|
|
68:54 | start, to start the counters, you tell it which events it should be measuring
|
|
69:00 | and the number of events that are to be measured. So in this case it's
|
|
69:04 | just one. And again, make sure to check if anything is going wrong with
|
|
69:09 | the PAPI calls. As soon as you've done that, you can just start doing
|
|
69:15 | the work that you're supposed to do in your code. And when you're done with the
|
|
69:20 | work, you can just call PAPI read counters, with, with the parameter passed as the
|
|
69:28 | output, the output variable that should contain the values. When you do that, two
|
|
69:35 | things are going to happen. First, the value of the counter will be read into
|
|
69:40 | this particular variable that you provide, and then the counters will be reset. So if
|
|
69:48 | you read the counters again, it will most likely give you some weird values;
|
|
69:55 | it will not give you values that continue from here. So remember, reading,
|
|
69:59 | reading the counters once resets the values inside them. So as soon as you've
|
|
70:08 | read it, you can just print it out. And always remember, when you're
|
|
70:13 | done with your work, make sure you relinquish control of all the resources
|
|
70:17 | that you've got; to make sure, call PAPI stop counters, which stops the PAPI runtime,
|
|
70:24 | or, to be specific, it stops those counters. Again, compiling of this code
|
|
70:32 | is similar to what we did for the previous example. So when you run this
|
|
70:37 | code again, for the high-level example, it will tell you the total instructions that were
|
|
70:45 | executed for this particular code block. So, any question about that? Okay.
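The high-level flow just described (start counters, do the work, read, stop) could be sketched as below. This is an illustrative sketch of the classic PAPI 5.x high-level calls, not the exact demo code; the matrix size and the event choice are assumptions:

```c
/* Sketch of the PAPI 5.x high-level API flow described above.
   Illustrative only: matrix size and event choice are assumptions. */
#include <stdio.h>
#include <papi.h>

#define N 256
static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    int events[1] = { PAPI_TOT_INS };   /* preset event(s) to measure */
    long long values[1];                /* long long: counts can get very large */

    /* Start the counters: the event list plus how many events. */
    if (PAPI_start_counters(events, 1) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters failed\n");
        return 1;
    }

    /* The work being measured. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

    /* Read the counters into values[]; note this also resets them. */
    if (PAPI_read_counters(values, 1) != PAPI_OK) {
        fprintf(stderr, "PAPI_read_counters failed\n");
        return 1;
    }
    printf("Total instructions: %lld\n", values[0]);

    /* Release the counters when done. */
    PAPI_stop_counters(values, 1);
    return 0;
}
```

Note that these high-level calls are the PAPI 5.x interface used in this class; newer PAPI releases reworked the high-level API.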
|
|
71:01 | So, moving on to the PAPI low-level API. Now remember, PAPI low level can access
|
|
71:08 | both the native events as well as the preset events. Uh, and it also
|
|
71:15 | gives you much more control over what you want to do with your, with your code
|
|
71:19 | as well. So there's not a, a lot of change when you're trying to
|
|
71:26 | simply read some events for your code. The first thing that is different from the high
|
|
71:35 | level is, you first need to make sure you initialize the PAPI low-level library.
|
|
71:40 | Uh, now here you want to call this function, PAPI create eventset. What it
|
|
71:46 | does is, it creates an empty event set to which you can add events,
|
|
71:51 | uh, later on. Once you've done that, so here, see that first I
|
|
71:58 | am adding a PAPI preset event, which is the PAPI total instructions, and then in
|
|
72:06 | the same event set I'm also adding a native event, which just measures the level
|
|
72:12 | one cache misses. As soon as you've done that, again, you can call PAPI
|
|
72:18 | start here. As compared to high level, where you would call PAPI start counters,
|
|
72:24 | it's PAPI start for low level. Then you would perform your computations. And
|
|
72:30 | again, just in the end, you would call PAPI read to read the, read the
|
|
72:35 | counters. You can then print it, and just call PAPI stop when you're done. And
|
|
72:42 | again, compiling is again the same. And so you can see the values
|
|
72:51 | that we got for those two events: so the total number of instructions was close to
|
|
72:55 | this, and the number of cache misses was 67. Now, so that was
|
|
73:03 | pretty much all about how you can see a simple usage of PAPI.
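The low-level sequence just described (init, create an event set, add a preset and a native event, start, read, stop) could be sketched as below. Illustrative only: the native event name is machine-specific and is an assumption here; use papi_native_avail on your own node to find a valid name:

```c
/* Sketch of the PAPI 5.x low-level API flow described above.
   The native event name "L1D.REPLACEMENT" is an assumption (it is what
   papi_native_avail style listings show on some Intel CPUs) and may
   differ on your machine. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int event_set = PAPI_NULL;
    int native = 0, nev = 1;
    long long values[2];

    /* The low-level API requires explicit library initialization. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        return 1;
    }
    /* Create an empty event set, then add events to it. */
    if (PAPI_create_eventset(&event_set) != PAPI_OK) return 1;

    /* One preset event ... */
    if (PAPI_add_event(event_set, PAPI_TOT_INS) != PAPI_OK) return 1;
    /* ... and one native event, looked up by name. */
    if (PAPI_event_name_to_code("L1D.REPLACEMENT", &native) == PAPI_OK &&
        PAPI_add_event(event_set, native) == PAPI_OK)
        nev = 2;
    else
        fprintf(stderr, "native event not added on this CPU\n");

    if (PAPI_start(event_set) != PAPI_OK) return 1;

    /* ... the computation being measured goes here ... */

    if (PAPI_read(event_set, values) != PAPI_OK) return 1;  /* read counts */
    if (PAPI_stop(event_set, values) != PAPI_OK) return 1;  /* stop counting */

    printf("total instructions: %lld\n", values[0]);
    if (nev == 2)
        printf("L1D cache misses:   %lld\n", values[1]);
    return 0;
}
```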
|
|
73:09 | Now imagine that, let's say, you have, uh, 1000 or 2000 lines of code.
|
|
73:16 | You are also using multi-threading and all sorts of funny business. Uh, adding
|
|
73:23 | these PAPI calls to your source code creates a large executable, obviously, and
|
|
73:30 | also it may add more overhead, and it's not that easy to use.
|
|
73:35 | Uh, it's, it's a little bit easier in newer PAPI versions, but at least in
|
|
73:39 | the version that we have here, the 5.7, it's not that user friendly
|
|
73:44 | if you try to do it. So, as Dr. Johnson said, in the
|
|
73:48 | next class we will see, uh, a tool called TAU, Tuning and Analysis
|
|
73:53 | Utilities. It allows you to keep your source code the way it is.
|
|
73:58 | So that means just this computation, except without all the PAPI calls. And it does all
|
|
74:07 | the instrumentation operations automatically. So you don't have to worry about adding these functions,
|
|
74:16 | uh, in your source code. So, we'll see it in the next
|
|
74:22 | class, and that's pretty much it. So, any questions? Okay, so no questions.
|
|
74:36 | Right about time. I, I'll stop sharing. Okay, I will just point out
|
|
74:45 | there are only a couple, two slides, presumably, left. So, what I was going to
|
|
75:09 | say: yeah, so, to point out quickly, there is, in the slide set for today, the
|
|
75:24 | thing I wanted to show there is this: so there is a simple example of
|
|
75:31 | how to use PAPI again, some commands, and a simple example that is just
|
|
75:40 | matrix multiply, commonly used for pretty much anything when you teach compilers
|
|
75:45 | or anything else. So it's a very simple, simple matrix multiply algorithm that hopefully everyone is
|
|
75:52 | familiar with: three nested loops doing the multiplication of the two matrices. And on the
|
|
76:00 | left side here is what's known as the standard matmul, and on the right one is
|
|
76:04 | one where the only thing that was changed there was the loop order; the loops were interchanged.
|
|
76:12 | That's the only thing. And the point is to show that that can have
|
|
76:16 | an impact on performance, and how you can get, um, then, PAPI to give you
|
|
76:22 | insights. And then what I wanted to show with that: so the leftmost one is
|
|
76:28 | the version known as an inner product formulation: basically rows of matrix A times columns of
|
|
76:36 | B. Whereas the other one is a kind of scaling of columns, and you
|
|
76:44 | scale columns to compute C column-wise; that's, more or less, mathematically what's happening. And I
|
|
76:52 | just kept this slide and pointed out what happens and how PAPI can be
|
|
76:59 | used. So there's, uh, on the left-hand side, you see the particular events
|
|
77:05 | that you get information about from PAPI, using the particular PAPI command. And
|
|
77:11 | then there is the outcome for the matrix multiply using inner products, and for the
|
|
77:16 | other one, the reordered, or interchanged, version. And if you look at
|
|
77:22 | it, you can see that the time went from about 13 seconds or something to
|
|
77:28 | three seconds, or basically a factor of four improvement in performance, and the instruction count is
|
|
77:33 | the same. Nothing changed in terms of, again, the instructions being executed, but the time
|
|
77:40 | dropped by a factor of four. And then if you look at what's shown below in terms of
|
|
77:45 | instructions per cycle, or if you look at the very bottom in terms of the
|
|
77:52 | cache behavior, you can see what explains, by and large, the real time. Instructions
|
|
77:58 | per cycle went up from about, you know, 0.35 or something, to 1.7,
|
|
78:03 | and the total number of cycles went down, and that should correspond to the reduction
|
|
78:09 | in time. And I guess on the next slide you can get the insights
|
|
78:13 | in terms of the cache behavior. And in this case you see here that the L1
|
|
78:19 | cache request rate, in fact, went up, but the miss rate went down substantially, from
|
|
78:27 | about 0.3 per instruction to 0.007, and this gives a little bit of insight into what happened:
|
|
78:33 | that's the cache behavior; it got a lot better. Um, the number of
|
|
78:38 | instructions is the same because the number of, um, the instructions, reduced to
|
|
78:46 | the multiply-add, as embedded in the triply nested loop, and the loop bounds,
|
|
78:51 | are the same; the same number of instructions are being executed. But the memory behavior was
|
|
78:56 | considerably better. So this is kind of an example of the insight you can get
|
|
79:02 | from using PAPI.
|
|
79:06 | And I know my time is up, so I will stop with that. But I encourage you to look at some of
|
|
79:09 | the usage scenarios for PAPI that are in the slides, and follow up with
|
|
79:22 | questions if you want. I had one quick question. Yeah. Um, what exactly
|
|
79:30 | is the feature size of, like, the CPU? The feature sizes?
|
|
79:38 | Is that what you said? Just to make sure. Yes. So, what do you think
|
|
79:45 | of as features? Because to me, it's the cache, so the data,
|
|
79:53 | um, cache lines. And the other question... I believe he is asking about the
|
|
79:59 | feature size, the numbers, you know, in terms of silicon. Yeah.
|
|
80:05 | Right. Okay, so I think the answer, if I remember correctly, is
|
|
80:09 | 14 nanometers, right? So is it just measuring, like, the smallest or average
|
|
80:16 | piece of silicon that is on the chip, or... Right. So the feature size is
|
|
80:23 | kind of the minimum feature size from a processing point of view. So it is sometimes
|
|
80:32 | referred to that way because, as the footprint of a transistor in
|
|
80:37 | terms of nanometers, it also tells a little bit how wide the silicon wires are
|
|
80:45 | on the chip. Um, so in that sense, it tends to be a
|
|
80:52 | minimum width. There are other characteristics, stuff between layers, and that's a lot different.
|
|
80:58 | But this is basically the extent in the horizontal dimension, directions. So with
|
|
81:05 | these features you can do sort of the smallest possible quantum, if you wanted to.
|
|
81:11 | Okay. Yes. So, and it's related to, um, so you may know
|
|
81:22 | it or not, but there is chip lithography, which is basically photographic technology to
|
|
81:29 | expose silicon, and there's all kinds of trickery in how you actually make the pattern
|
|
81:34 | for the imprint on the piece of silicon. And depending upon what technology is used,
|
|
81:41 | it limits the smallest feature size you can get in terms of the, um,
|
|
81:47 | horizontal kind of extent. And it, so, it's related to the wavelength of the light
|
|
81:52 | you use for the exposure. And that is why, um, the state of
|
|
81:59 | the art today is, one needs to use extreme ultraviolet, because the feature size is at the point
|
|
82:09 | where one wavelength of the light is used to shine the pattern onto the silicon.
|
|
82:13 | That's where it comes from. Cool. Thank you, Dr. Johnson.
|
|
82:19 | Okay. You're welcome; it's a good question. So I had, I had, I
|
|
82:26 | had one more, if that's okay. Yeah. Question. Um, I
|
|
82:30 | believe it's the very last one on the homework. Um, it says that for
|
|
82:35 | each of the benchmark functions, seek to develop a model of performance as a function of
|
|
82:39 | the data set size. Would you mind elaborating a little bit on that question?
|
|
82:46 | Okay. Um, so the, I mean, the first thing I thought of
|
|
82:57 | was, for example, some of the benchmarks might have thresholds at a different size.
|
|
83:02 | Like for the stream one, I was noticing that it makes like an upside-
|
|
83:05 | down parabola after you reach a certain size that coincides with the cache. So
|
|
83:11 | is that the type of analysis that the question is sort of asking for, or...
|
|
83:18 | No, it's not unrelated. What I had in mind was simply, um, going
|
|
83:24 | back to my first, I guess, two or three pieces. Whereas, I know the
|
|
83:29 | application, or your systems, or your... so I think it comes from matrix multiply.
|
|
83:36 | The thing is that the number of arithmetic operations of the workload in the first place is
|
|
83:42 | proportional to n cubed, since we use square matrices. And the other part is that
|
|
83:51 | you're supposed to use a single thread. Uh, um, so then it's: what is the
|
|
83:59 | max capability of the floating point performance of a single core? And,
|
|
84:07 | then you can kind of have two different models as well: if it is,
|
|
84:15 | um, basically single functional units, the kind where the core can do scalar operations,
|
|
84:23 | or it can basically do one multiply-add per cycle. So then your model would
|
|
84:31 | be: you know, this is how much work there is, and here's the capability of
|
|
84:36 | the hardware. So my expectation, if it's a truly 100% efficient code, is that
|
|
84:41 | this should take this amount of time. At the other kind of extreme, and
|
|
84:47 | if you use Stampede2 for that, the cores of the Skylake can actually do
|
|
84:54 | multiple double precision floating point operations in a single cycle on a single core. So
|
|
85:01 | then you get a totally different, uh, number of operations per second, and
|
|
85:07 | you can have a model for predicting the time if it's 100% efficient. So that
|
|
85:13 | was the kind of model I had in mind: that you can then set the expectation,
|
|
85:18 | or what time should it have taken, and compare it to the time it actually
|
|
85:23 | took. And then you can get an idea: was the code really efficient,
|
|
85:29 | or was it, yes, using only a small fraction of the actual capability of the
|
|
85:36 | hardware? So, as a norm, and that's why I'm thinking, I think I had
|
|
85:44 | one slide that showed that when it comes to, like, matrix multiply, well-
|
|
85:51 | designed code can get like 98, 99% of the peak performance. On the other hand,
|
|
85:57 | a typical application code that is even good may only give you 3%, and often
|
|
86:03 | the code doesn't even make 1/10 of that in terms of good use of the hardware. So
|
|
86:13 | that was behind that question, and maybe it wasn't really elaborated, but that was
|
|
86:19 | the intent: to get you to think about how well the platform is being used.
|
|
86:24 | Okay, so it's more along the lines of relating it to the theoretical
|
|
86:29 | maximum? Yes. Okay, great. I was trying to do curve fitting to what you
|
|
86:36 | measured. Yeah. Yeah, for sure. All right. Thank you, Dr.
|
|
86:39 | Johnson. You're welcome. That's a good question, too. Yes. So a lot of the
|
|
86:52 | assignments are basically stated to, again, try to foster you to have a model of
|
|
87:04 | what good performance would, uh, turn out to be, in terms of what
|
|
87:09 | you observe when you are just measuring. So you have an expectation of what would be
|
|
87:14 | good, and then compare to what you actually observed. Any more questions? I'll stop recording then.
|
|
87:38 | |
|