© Distribution of this video is restricted by its owner
00:04 | So this is the last week — basically today's and Wednesday's lectures. So I was trying to think of the things that might be the most useful for the future, and to some degree for the projects.
00:29 | We'll talk mostly about design aspects of algorithms, and I'll pick very simple examples — quite simple ones — but focus on what is common to many algorithms, basically using linear algebra to illustrate those points.
01:00 | So this is just a summary of pretty much what I have covered in the course. It has all been focused on trying to understand how to get good performance out of current types of systems, both at the individual node level — homogeneous and heterogeneous or accelerated nodes — and then clusters.
01:21 | To get good performance, one really needs to understand the platform, because it is a bit too optimistic — or unrealistic — to expect that compilers will figure out how to translate code that is structured in a nice way for the programmer into something that performs well on the platform.
01:58 | So much of it has been basically trying to understand how to make good use of the memory hierarchy, and also a little bit about how to use the SIMD-type features of CPUs and GPUs.
02:20 | Then we talked about, when the problem has to be partitioned, how the data structures are allocated to the different nodes in a cluster — how to partition things and how to place things onto the different nodes.
02:46 | Now, a little bit more about the design aspect of parallel algorithms. The focus in this case is to be careful about the data copying, as well as how to achieve good performance — hopefully this will become clearer as I go.
03:11 | So this is what I will use as examples: again linear algebra, and the simplest kind of linear algebra — matrix-vector and matrix-matrix multiplication — to make some points about data management, as well as about organizing the computations in order to limit data replication, and to try to minimize temporary variables as well, in order to preserve scalability.
03:51 | And the way I like to do this is to start with what is known as systolic algorithms. I'll talk about them — they were kind of the thing to do quite some time ago, in the early days of the current generations of parallel computers, but they have become popular again as companies design accelerators, in particular for machine learning. So it is a good design principle, always useful in terms of understanding data interaction in parallel algorithms.
04:43 | Okay, I think I have pretty much said this already, as a preamble to what I want to talk about today. Today I will not cover sparse matrices, but my plan is to say something about them later on. And matrix multiplication is important not only for lots of scientific and engineering applications; it has turned out to be an important set of operations for machine learning as well.
05:16 | One thing that tends to happen quite often, when somebody starts designing distributed algorithms, is to use the client-server model as a model for how to get parallelism. Unfortunately, I would not recommend that for cluster-style computing, because it tends to limit scalability: either the server becomes the bottleneck, or, many times, what is being done involves excessive replication.
05:48 | Another aspect is that the way the client-server model is implemented in the algorithms is often such that you end up not solving a larger problem than you could have solved on a single node anyway. So it is not a particularly good mindset for truly distributed computing.
06:22 | The other thing, which I guess will become clear through tonight's lecture, is that when I spoke about data partitioning before, most of it was basically in the context of a single data structure that got carved up, with operations on that. It was not so much about computations involving several different objects or data structures, where the relationship between how they are partitioned becomes important. That relationship is exceedingly simple in matrix-vector operations, but at least I hope to get the principle across, so one can think about it in general.
07:17 | Any more questions on general notions before I dive into specifics of algorithms?
07:35 | So, systolic computation is basically the notion of things coming in in a rhythmic way: you do compute, communicate, compute, communicate — it is kind of pumping data through. In the early days, from this notion, it became a popular design principle for parallel hardware. It was used a lot in signal processing or embedded processing, where devices operate on streams. That will be apparent in the way I talk about things in the next several slides, but just to show that this is, as I said, kind of an embedded thing:
08:21 | So this is an example of a Cisco router that uses this systolic idea for packet processing. Of course, there is an inherent pipelined type of stream processing here: you have different stages for checking the header, figuring out what to do, and so on. They designed this router with basically eight different pipelines that can operate in parallel, and within each pipeline there are different stages for the different steps of the packet processing required to figure out what to do with a packet.
09:00 | That gives very well structured computations: many times only a limited amount of branching and conditionals is involved, and it also avoids round trips to memory, by basically passing things from stage to stage — data goes in on one side and comes out the other.
09:35 | So that is one example, and it is still fairly common in how high-end routers and switches are designed. And the other one is what you now know from machine learning. Many of you probably know of, or have even used, Google's tensor processing unit, the TPU, and it is designed according to this systolic design principle, streaming the data processing through the chip.
10:20 | So again, as mentioned: things built using this systolic principle show up in accelerators for machine learning. So now I am going to talk specifically about matrix-vector multiplication, and a couple of ways to illustrate this systolic idea and to point out the benefits of doing it this way. It turns out I also like it because, as I mentioned earlier, it makes the data interactions very apparent, and it kind of forces one to think about them.
10:58 | That is beneficial also for what one might call in-memory algorithms — no longer streaming algorithms; the data resides in the nodes of a cluster — but it helps to avoid excessive memory requirements, by being careful about how the data interaction happens.
11:17 | excessive memory requirements, being careful and how the data interaction happens. So |
|
|
11:27 | a very simple, cartoonish thing about made to expect the multiplication on for |
|
|
11:35 | particular case. The organization is such there is a process for every process |
|
|
11:45 | for every component of the factory uh, clearly sort of very little |
|
|
11:52 | , except maybe a little bit But it's fairly simple to take this |
|
|
12:00 | of extreme part of this model and . Inning it. Go basil mapping |
|
|
12:07 | calling the processes that are shown making them kind of virtual processors that |
|
|
12:12 | emulated by the physical processes. that mapping principle is very straightforward, |
|
|
12:18 | I won't cover that today. um, focus on necessary the data |
|
|
12:27 | . So as we do know, major expect a multiplication, each elements |
|
|
12:33 | component of the vector X multiplies a of The Matrix A. Someone look |
|
|
12:40 | this image, then we follow. that? You know the processor has |
|
|
12:49 | in it. That's basically column zero the nature say so. It's nothing |
|
|
12:56 | about that. So basically, the is that, in this case, |
|
|
13:00 | stream, some metrics A into this dimensional array of processing elements. |
|
|
13:11 | Well, what is a little bit less obvious about this picture is that we also know that every component of y is the inner product of a row of the matrix A with the vector x. So each processor in this case computes a partial contribution to an inner product, and a single inner product gets accumulated by passing partial results from left to right in the array.
13:49 | So as things chug along, feeding the matrix A into the processing array, the components of the result y get output on the right-hand side.
14:11 | Now, the other thing to pay attention to is that there is some kind of skewing going on between the successive columns of A, going from left to right, from one column to the subsequent columns. That is there to avoid excessive temporary storage: we delay feeding the elements of a column of A into a processing unit until the corresponding partial inner product is also arriving at that processor — one tries to get them to arrive at the same time, so things are synchronized.
14:58 | I think I have a few more pictures here of how these things work. This kind of skewing, or delay, of successive columns is something that will come back in many shapes and forms in how algorithms are structured — not only for matrix-vector but also for matrix-matrix multiplication, and whether it is this kind of array processing, where things are streamed through a processing array, or the in-memory type of algorithms.
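As a concrete companion to the description above, here is a minimal Python sketch — my own simulation, not the lecture's code; the function name and step-by-step clock model are assumptions — of this first systolic design: PE j holds x[j], column j of A arrives skewed by j time steps, and partial inner products flow left to right, so y[i] exits the rightmost PE.

```python
import numpy as np

def systolic_matvec_space(A, x):
    """Simulate the first systolic design for y = A @ x on a linear array of
    q PEs: PE j holds x[j]; column j of A arrives skewed by j steps; partial
    inner products flow left-to-right and y[i] exits on the right."""
    p, q = A.shape
    pipe = [0.0] * (q + 1)            # pipe[j]: partial sum entering PE j this step
    y = [0.0] * p
    for t in range(p + q - 1):        # critical path ~ p + q steps
        new_pipe = [0.0] * (q + 1)
        for j in range(q):            # conceptually, all PEs fire in parallel
            i = t - j                 # skew: a[i, j] reaches PE j at step i + j
            if 0 <= i < p:
                new_pipe[j + 1] = pipe[j] + A[i, j] * x[j]
        i_out = t - (q - 1)           # y[i] leaves the rightmost PE at step i + q - 1
        if 0 <= i_out < p:
            y[i_out] = new_pipe[q]
        pipe = new_pipe
    return np.array(y)
```

Running the loop for the full p + q - 1 steps makes the critical-path length discussed later in the lecture visible directly in the code.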
15:39 | Now, I was planning to cover both matrix-vector and matrix-matrix multiplication today, so I will go fairly fast and just make the points to give you the concepts — which are quite simple, I think — and then you can look at the slides; there are, in fact, many more slides than I am going to talk about, and they iteratively illustrate these points.
16:06 | Now, given that time is going somewhat fast today, I will just show the results here in terms of time, speedup and efficiency, and try to explain how those results came about, instead of just showing the formulas or asking you to come up with a formula.
16:29 | come up with a formula. So the time, the way to think |
|
|
16:34 | it in good way, is to what is kind of a critical path |
|
|
16:43 | terms of getting the computation done. in this case, the critical Paris's |
|
|
16:56 | feeling in the first column. All the Matrix Hey into the left, |
|
|
17:06 | processor in this case that has X , and that takes, obviously, |
|
|
17:12 | steps on your feet in one at time on the elements in the column |
|
|
17:20 | When you're done with a column, you're also doing the last competition in |
|
|
17:24 | case for the last in a product in a product for the last |
|
|
17:30 | or a to be precise. But only the first element of that in |
|
|
17:36 | product, and then in the product to be accumulating but passing things to |
|
|
17:42 | right, and in this case, . Columns of the Matrix? |
|
|
17:47 | So that's why you get this essentially path of length, P plus |
|
|
17:55 | That is the critical path. Then I did it in a little more detail in this case, and assumed that it takes a time t_a to do an arithmetic operation and a time t_c to communicate an element between a pair of adjacent processors in this processing array.
18:20 | Then, also assuming that there is no overlapping of computation and communication, you get an additive formula: you figure out how many arithmetic operations in sequence end up on the critical path, and how many communication actions there are along it, and you end up with the formula for the time. There is nothing but arithmetic here — you can write it out. But the principle is: follow the critical path. In this case I assumed no overlap of communication and arithmetic operations; if there is overlap, it may very well change the constants, but it does not change the qualitative aspects of the outcome.
19:16 | So, plugging things in: then we have the standard formulas for the speedup and the efficiency. For the speedup we need the serial time, and if there is just a single processing unit, matrix-vector multiplication does two arithmetic operations — a multiply and an add — for each of the matrix entries. So in terms of the computation, done on a single processor, it will be two times the number of matrix elements, times the time to do each one of these operations.
20:08 | Then we have the formula for the time in the parallel case, and that gives the speedup; the efficiency is then the speedup divided by the number of processing elements in use. So we get this formula, and I am going to spend some time playing around with it, to show the behavior a little bit as a function of the matrix shape.
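The exact formula on the slide is not recoverable from the audio, but a plausible reconstruction — serial time 2·p·q·t_a, parallel time (p + q)·(2·t_a + t_c) along the critical path, no compute/communicate overlap, q processing elements — reproduces the behavior discussed next. A hypothetical sketch, with made-up default constants:

```python
def systolic_efficiency(p, q, t_a=1.0, t_c=0.1):
    """Efficiency of the column-streamed systolic matvec design under a
    hypothetical additive cost model (assumption, not the slide's formula):
    critical path of p + q steps, each costing a multiply-add plus one hop."""
    t_serial = 2 * p * q * t_a            # one multiply + one add per matrix entry
    t_par = (p + q) * (2 * t_a + t_c)     # critical-path length times per-step cost
    speedup = t_serial / t_par
    return speedup / q                    # q PEs, one per component of x
```

With these constants, `systolic_efficiency(10000, 10)` comes out around 0.95 while `systolic_efficiency(10, 10000)` is below 0.001, matching the tall-versus-wide discussion that follows.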
20:31 | So here is what came from the last slide. Now one can look at the kind of extreme cases. Either p is much larger than q — that means there are many more rows in the matrix than there are columns — and if you just look at these expressions, it is fairly simple: the expression simplifies if you ignore q when p is so much larger. Then you get something that is basically order one — the coefficient on the correction term is very small — so you get something that is close to 100% efficient. On the other hand, if q happens to be large, the efficiency is pretty poor.
21:20 | So maybe I should stop for a second and try to see if anyone has a guess for why this is — what the basic problem with this organization of the computations is when q is much larger than p.
21:47 | Well, the problem is the time. If p is very large, the time is dominated by the time to effectively feed in the matrix; the number of processing elements, q, is fairly small, so the accumulation of the partial inner products towards the right of the linear array does not add much to the total computation time. When, on the other hand, p is small relative to q, this pipeline time dominates, and the pipeline is not very full: the processing elements end up not being used very efficiently, because there are few inner products being passed through a very long pipeline.
23:08 | So if it so happens that you have many columns but not so many rows of the matrix, then this would not be a good design. Does anyone have an idea, along these lines, of what you would want to do if that is the case? Yes — right, the idea is to restructure the computations. One way of flipping things around, effectively, is to say: okay, instead of accumulating the inner products in space, we accumulate the products in time, locally, in a single processing element.
24:08 | now we know that each element of results is the inner product off |
|
|
24:16 | A row off a and the column So now and this organization, |
|
|
24:24 | that means that rose off a needs be entered into the corresponding processor that |
|
|
24:37 | that in their products over time and , in order to gasp in a |
|
|
24:45 | computed correctly, that means, as of a row off the Matrix A |
|
|
24:51 | entering the corresponding component of X needs enter their processing elements. And |
|
|
25:01 | by not doing any kind of broadcast storage part again pipe lining in this |
|
|
25:08 | , the vector X through this ray they call them. Yeah, the |
|
|
25:17 | X is used for every roll off ended in the matrix vector multiplication. |
|
|
25:23 | that means, um, every element X needs to enter every processing |
|
|
25:33 | Thio then compute successive inner products and different processing elements. So now, |
|
|
25:42 | again X not being broadcast are replicated being passed through. Then there's also |
|
|
25:50 | . So the delay off different roles into the different processing elements along this |
|
|
25:57 | array of processing elements. So one again. There's kind of a skewing |
|
|
26:06 | how data is entered into this array order to avoid temporary storage on kind |
|
|
26:12 | minimize, um, memory requirements along way and how we can do the |
|
|
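A matching sketch — again my own simulation, not the lecture's code — of this second design: p PEs each accumulate one y[i] "in time" while the vector x is pumped through the array, with row i of A delayed by i steps so that a[i, j] and x[j] meet at PE i at the same step.

```python
import numpy as np

def systolic_matvec_time(A, x):
    """Second systolic design for y = A @ x: PE i accumulates y[i] locally;
    x streams through the array; row i of A is delayed (skewed) by i steps."""
    p, q = A.shape
    y = np.zeros(p)
    xpipe = [0.0] * p                   # xpipe[i]: the x value sitting at PE i
    for t in range(p + q - 1):
        # shift x one PE to the right; the next element enters on the left
        entering = x[t] if t < q else 0.0
        xpipe = [entering] + xpipe[:-1]
        for i in range(p):              # conceptually, all PEs fire in parallel
            j = t - i                   # skew: a[i, j] meets x[j] at PE i at step i + j
            if 0 <= j < q:
                y[i] += A[i, j] * xpipe[i]
    return y
```

Note that the roles of the two matrix dimensions are flipped relative to the first design: here the serial feeding dimension is q, the row length.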
26:20 | We can do the same thing for this design — figure out the critical path, speedup and efficiency; it is the same principle as I talked about for the other algorithm — and we can go through and figure out what happens in this case. Now, if q is much larger than p, then we get good efficiency, as we hoped for by rearranging things. On the other hand, this organization is not very good when p is significantly larger than q, as I guess you can see.
26:58 | A side remark: textbooks and most treatments tend to use all square matrices, which tends to be reasonable — if you have a single node, square matrix shapes are usually enough, and that is fine. But it turns out that matrix shapes are very important when it comes to designing good algorithms for distributed settings.
27:29 | So what we learned was another way of looking at it. In the first design, when the serial dimension — p, the number of rows, whose elements were entered sequentially — was large, we got good efficiency. And in the second design, when the number of columns was large — the elements of a row being entered sequentially — you also get good efficiency in that case.
28:03 | So yeah — and this tends to be true for many designs — it is not too hard to get very high efficiency if the parallelism is limited. That also connects back to a notion we had before, in terms of strong scaling: strong scaling means you try to increase the number of processors for a fixed problem, so the serial component stays, and that makes it hard to get good efficiency.
28:44 | And the other thing relates to what I just mentioned: for distributed algorithms, it is really imperative that you also think about shapes when it comes to parallel operations. It is very simple to talk about shapes when it is just multi-dimensional arrays — there the shape is fairly straightforward; for more irregular data structures it is not so easy to talk about shapes. But it is important to realize that there is a relationship between how you organize your computation and how you allocate the data, as a function of the shapes of the data structures you have got — it is simple.
29:40 | Student: Dr. Johnson — will we be talking about whether it is possible to do the transpose operation on the matrix? We do it sometimes in the graphics class, to preserve the normals.
30:02 | Instructor: I will bring it up next lecture. It is quite tricky to figure out how to do it — it is not entirely trivial. I do sometimes talk about it in the class, but I don't have the slides for it now; I can add some next time.
30:26 | Student: Okay, cool. Thank you.
30:31 | Instructor: There is a fair amount of work in doing matrix transpose in place in an efficient way — not just making an extra copy. There is also — I guess I don't know, in particular, whether Scott needs to do it — but in terms of the fast Fourier transform: the FFT algorithms that tend to be used the most introduce something known as bit reversal in how they operate, and the bit-reversal traversal is kind of — well, it has similarities with the matrix transposition case, even though they are different. It turns out that either matrix transposition or bit-reversal traversals represent the worst-case data movement for many interconnection networks. But I will try to talk a little bit more about it next time. That is a good question — thank you.
32:03 | Okay. So what I wanted to do next was, with these two simple systolic algorithms on the 1-D array, try to use them as a little bit of a model, or conceptual tool, for how to do things if you think of your processing elements as being the nodes in the array — but instead of data being streamed through the processing array, data is allocated to it, the way it typically would be in a cluster.
32:49 | Now, what is a typical thing you see in how people do data allocation on clusters? In this case, I show one of the partitionings: the idea is to allocate a collection of rows to each one of the processing elements. So you get these kind of row-like panels, as they are sometimes called — blocks of rows of the matrix A allocated to each of the processing elements, or nodes, in the cluster.
33:23 | Now, there is another thing about this: I did the same thing to the vectors y and x.
33:43 | So the objects in this case are partitioned in a similar fashion: they are all evenly distributed across the nodes of the cluster. Now, it depends: if you manage it yourself — which is kind of the case if one uses MPI — you decide what goes onto each node, and you may choose to do it differently.
34:12 | But there have also been attempts — in fact, there are languages known as data-parallel languages — in which it is the task of the compiler to figure out how to allocate the data to the nodes, and the most obvious way for a compiler to do it is to split all the arrays equally in some fashion. It would not necessarily understand the semantics of the program, but it would just do things based on the dimensionality of the vectors and matrices, and try to partition things evenly.
35:02 | So it would, naively, just partition one dimension. I will talk about other partitionings for matrices, but the most simplistic way is: okay, partition one dimension — and sometimes that works just fine. So that is kind of the premise for this first illustration of how data may be allocated: the underlying reason is that it is simplistic, and it could easily be automated if the compiler is given a global address space across the nodes.
35:44 | All right, so here is what I just said about this setup. Let's see if anyone can identify a little bit of an issue with this way of getting things organized — anything you think is problematic?
36:14 | Maybe it is not so easy without trying to work with it, but here is the point I was trying to make, which you may have been thinking about. Matrix-vector multiplication works such that a component of x multiplies the corresponding column of A. So if you have a block of elements of the vector x in a node, then you cannot do the full matrix-vector multiplication in that node: you can only do the partial matrix-vector multiplication corresponding to the elements of x that particular processor does have. In this case, that means only the diagonal blocks of the matrix-vector product can actually be executed as the data is initially allocated.
37:08 | So that means we need to do some communication, of some flavor, and what kind depends on the algorithm you use. What I did on the next slide is what many people tend to do: fine, I will just broadcast x to every process that needs it. In this case it would mean an all-broadcast of x — every processor needs all the elements of x, so that is why an all-broadcast makes sense. And, as you may remember from the MPI lecture, it is even a standard collective routine in MPI.
38:09 | So this certainly works. But it now requires more memory, because you need more storage for x than was initially allocated: the entire vector x needs to be stored in every node. It is not quite as big a deal as it could be, because x is just a vector and A is a two-dimensional object — but it is still an increase in memory. Later I will show how you can avoid doing this, but this is kind of the first approach people tend to take, and it is not very scalable, in particular if you want to generalize it to matrix-matrix multiplication, where, instead of a single vector x, you get many columns.
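A small NumPy sketch of this row-partitioned scheme — my own illustration; the n "nodes" are emulated as list entries, and the concatenation stands in for an MPI_Allgather:

```python
import numpy as np

def matvec_rowblocks_allgather(A, x, n):
    """1-D row partitioning: node k owns a block of p/n rows of A and the
    matching block of x. Every node first assembles ALL of x (the all-gather,
    which costs extra memory per node), then does a purely local block matvec."""
    p, q = A.shape
    assert p % n == 0 and q % n == 0, "blocks assumed evenly divisible"
    A_blk = np.split(A, n, axis=0)       # row panel owned by node k
    x_blk = np.split(x, n)               # x block owned by node k
    x_full = np.concatenate(x_blk)       # emulates MPI_Allgather on every node
    y_blk = [A_blk[k] @ x_full for k in range(n)]   # no further communication
    return np.concatenate(y_blk)         # y ends up distributed like x
```

The extra-memory cost the lecture mentions is visible here: every node ends up holding `x_full`, not just its own block.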
39:22 | So anyway — now one can do this matrix-vector multiplication; it is all said and done, and we can then do the usual analysis. As we talked about, one can figure out the efficiency and play around with the numbers, and in this case I even put in some typical numbers for computation and communication speed.
39:49 | And again you see: it depends on the number of rows with respect to the number of processing elements — sorry, for this slide p is still the number of rows of the matrix, and N is the number of processing elements, so p does not stand for processors but for the number of rows. And remember, the matrix was partitioned so that you get p divided by N rows, for an even distribution, in each one of the nodes.
40:26 | So it is like this: depending on the serial component — if the serial component, which is now p over N, is large, then things are good. Now, when p is about the number of processing elements — basically one row per processing element, p equals N — then it depends on the ratio between the arithmetic time and the communication time, and the formula does not look so bad on the slide. But if you plug in typical numbers — and remember, communication between cluster nodes, compared to the time for an arithmetic operation, is very costly — then for p close to N the efficiency tends to be very low, because t_a is much, much smaller than t_c.
41:44 | So then one could try the opposite: partition things by columns. And the good thing now is that if you do that — with basically column partitioning applied to both A and, effectively, the vector x — then every processor has the matching elements of x it needs for the columns of A that it does have. So there is no communication needed whatsoever to carry out the matrix-vector multiplication.
42:17 | But there is still some bad news — anyone spot the bad news? Well, the problem is: when you do this, if you look at it, every processor computes a partial contribution to every element of the result y; no processor computes an entire element of the result vector. So the corresponding partial inner products now need to be accumulated across processors: you basically need to do a kind of reduction across them, and you have kind of more temporary storage, in this case for the result vector. And you can figure it out: here the opposite is true — this way of doing business is good for the opposite situation with respect to the relationship between p and the number of nodes.
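The column-partitioned variant, sketched the same way — my own illustration; the `np.sum` across the list stands in for the cross-node reduction (e.g. an MPI reduce or reduce-scatter):

```python
import numpy as np

def matvec_colblocks_reduce(A, x, n):
    """1-D column partitioning: node k owns a column panel of A and the
    matching block of x, so the local matvec needs no input communication —
    but every node produces a full-length partial y (the temporary storage
    mentioned in the lecture) that must then be summed across nodes."""
    p, q = A.shape
    assert q % n == 0, "column blocks assumed evenly divisible"
    A_blk = np.split(A, n, axis=1)       # column panel owned by node k
    x_blk = np.split(x, n)               # matching x block, already local
    partials = [A_blk[k] @ x_blk[k] for k in range(n)]  # length-p temp per node
    return np.sum(partials, axis=0)      # emulates the reduction across nodes
```

Note the trade against the row-block version: input communication disappears, but output communication (the reduction) and a length-p temporary per node appear instead.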
43:49 | So this is pretty much the same thing as I said before: we need to pay attention to how we structure things with respect to the data elements. So now, how would you actually do this? What I show next is a way of implementing it step by step: a very simple algorithm, interleaving the broadcast with the matrix-vector product to avoid excessive temporary storage.
44:28 | So this is basically what one can do if you think of doing it in a cyclic fashion. This is just — what do you call it — the "lazy Susan" model: if you ever go, or had the opportunity in the past to go, to Chinese restaurants, they put all the dishes on a turntable in the center, and you spin it around, and eventually you get all the things. You don't need to keep the values as you pass them on — you can replace them with whatever each one needs to have — so it does not require any increase in local storage to do the broadcasting this way.
45:01 | So this is what is done here — kind of what to do in this sort of cyclic algorithm. The result is what I show: this happens step by step, and eventually you are done, and you didn't need any particular temporary storage.
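The rotation idea can be sketched like this — my own reconstruction of the "lazy Susan" scheme, with the ring emulated by an index array: blocks of x circulate around the nodes, so no node ever stores more than one block of x at a time.

```python
import numpy as np

def matvec_rowblocks_rotate(A, x, n):
    """Ring ('lazy Susan') variant of the row-partitioned matvec: instead of
    all-gathering x, each node holds one block of x at a time; after each
    partial product the blocks rotate one node around the ring, so the extra
    storage per node stays constant."""
    p, q = A.shape
    assert p % n == 0 and q % n == 0, "blocks assumed evenly divisible"
    A_blk = np.split(A, n, axis=0)       # row panel owned by node k
    x_blk = np.split(x, n)
    y_blk = [np.zeros(p // n) for _ in range(n)]
    holding = list(range(n))             # holding[k]: index of the x block on node k
    for _ in range(n):                   # after n rotations every node saw every block
        for k in range(n):               # conceptually, all nodes work in parallel
            j = holding[k]
            cols = slice(j * (q // n), (j + 1) * (q // n))
            y_blk[k] += A_blk[k][:, cols] @ x_blk[j]   # panel matching block j
        holding = [holding[(k + 1) % n] for k in range(n)]  # rotate one step
    return np.concatenate(y_blk)
```

After n steps each node has multiplied against every block of x exactly once, which is why the result matches the full matvec without any node ever holding all of x.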
45:30 | So, as we start to structure the in-memory algorithms — any questions on that before I continue? — why not think of your collection of nodes as being a two-dimensional array? That more closely mimics the matrix, rather than a one-dimensional arrangement that mimics a vector. Any final questions before I continue?
46:00 | So now, here is kind of one way of organizing it. In this case I did a kind of column-major-order assignment of the blocks of the matrix to the processing nodes — the numbers basically give you the processing-element ID, or node ID. And I still did the same thing with the one-dimensional arrays: they are evenly spread, for x and y.
46:33 | Now, if one looks at the situation, here is what it looks like in terms of the relationship between A and x, as to what can be done in terms of matrix-vector multiplication. So yes, one can do some block matrix-vector multiplication, but, for example, process zero does not have all the elements of x it needs for its respective columns, and the same is true for all the other processing elements. So basically, what is needed?
47:14 | Yes — what is sometimes called a segmented all-to-all broadcast. To my knowledge it is not clear that plain MPI has it, but at least one MPI implementation does, in fact, allow this segmented all-broadcast. In this case, the broadcast does not need to involve all the processing elements: only the processing elements representing the same column of the matrix A. So it is kind of an all-broadcast within processing-element columns, and that can certainly make things more efficient in terms of data communication than an all-broadcast across all the nodes. So after this function, you can do the matrix-vector product.
48:27 | And then, what happens when the result is computed is this: now, for the result vector y, the contributions to each inner product are distributed across a processing row. So an accumulation for y is necessary within the processing rows, and in this case you can do what is known as a segmented all-reduce — a reduction within rows — to get the elements such that y ends up evenly distributed across all the nodes.
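Both collectives can be emulated on a small r-by-c grid — my own sketch; real code would use per-row and per-column MPI sub-communicators (e.g. created with MPI_Comm_split) rather than Python lists:

```python
import numpy as np

def matvec_2d_grid(A, x, r, c):
    """2-D block partitioning on an r-by-c grid of nodes: x blocks are shared
    only within grid columns (the segmented broadcast), each node does a local
    block matvec, and partial y blocks are summed only within grid rows (the
    segmented all-reduce)."""
    p, q = A.shape
    assert p % r == 0 and q % c == 0, "blocks assumed evenly divisible"
    # A_grid[i][j]: block of A owned by the node at grid position (i, j)
    A_grid = [np.split(panel, c, axis=1) for panel in np.split(A, r, axis=0)]
    x_blk = np.split(x, c)               # block j is shared down grid column j
    y_blk = []
    for i in range(r):                   # grid rows, conceptually in parallel
        partials = [A_grid[i][j] @ x_blk[j] for j in range(c)]  # local matvecs
        y_blk.append(np.sum(partials, axis=0))   # reduce within grid row i
    return np.concatenate(y_blk)
```

Because each collective spans only one grid row or column, the communicating groups have size r or c instead of r·c, which is where the efficiency gain over the 1-D schemes comes from.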
49:15 | So, does anyone spot an issue once this segmented all-reduce is actually done? Well, the issue is that, yes, everything is fully computed, but the components — the blocks of the vector y — happen to end up in the wrong locations, except for a few: a block of y that is now in one processing element should have been in another, and vice versa, in order to satisfy the starting point, so to speak — the even distribution that the compiler may have applied to the vector y. So there is still communication required to get things into the right locations.
50:21 | So this is, I guess, the consequence in terms of how the data needs to be allocated, and that is kind of a limitation that I may have on this slide. This is, in fact, the shuffle permutation that I talked about in the lectures on communication networks, and at that point I also mentioned that this is, in fact, what happens when you reorganize things from row-major to column-major order. So it is a shuffle permutation, and, as far as I know, in terms of parallel computing it is a worthwhile standard permutation — but very few implementations of communication libraries, in fact, support it.
51:21 | fact, that stuff Uh okay, this is just summarizing what was done |
|
|
51:32 | the column-major order. And then I also have slides that
|
|
51:37 | I'll just flip through, because I want to talk a little bit about matrix-matrix
|
|
51:43 | multiplication today. So you can look at the slides; they just show the corresponding
|
|
51:49 | steps for row-major order, and it's similar except for the difference in terms of where
|
|
51:56 | you need, uh, the communication. The slides show
|
|
52:04 | what the steps are, and then I kind of summarize things a little bit
|
|
52:08 | in terms of the performance. So one aspect, obviously, that matters
|
|
52:13 | if you do this two-dimensional partitioning is that, if you have a collection of
|
|
52:19 | nodes, how many do you assign to rows and how many to columns? You need to factor them in terms of
|
|
52:25 | how you organize your nodes, and it turns out that this can play out
|
|
52:30 | differently. Again, assuming this simple model where, um, things are kind of nice,
|
|
52:37 | it's easiest to look at how things behave if the matrix dimensions are such that they are
|
|
52:43 | evenly divided by the number of processing elements you assign to rows and columns, for
|
|
52:50 | then the arithmetic work is fully load balanced, regardless of how you do
|
|
52:55 | it. So the only difference is in the communication part, so we only need to
|
|
53:00 | worry about the communication. And it turns out that, to minimize the communication,
|
|
53:08 | you should choose the shape of the processing array to match the shape of the
|
|
53:20 | matrix. So basically what you get then is that what's assigned to each processing element is
|
|
53:28 | square blocks, and that minimizes the amount of data communicated.
|
|
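As a toy version of that rule (the cost model and names here are mine, not from the slides): among all factorizations pr x pc of P, the grid whose local blocks of an m x n matrix are closest to square minimizes the block "perimeter", a stand-in for the words each node must communicate.

```python
# Pick the processor grid (pr, pc) with pr * pc == P that minimizes the
# local block "perimeter" m/pr + n/pc of an m x n matrix, a proxy for
# the data each node must send and receive.

def best_grid(m, n, P):
    grids = [(pr, P // pr) for pr in range(1, P + 1) if P % pr == 0]
    return min(grids, key=lambda g: m / g[0] + n / g[1])

square = best_grid(1024, 1024, 16)   # square matrix -> square 4 x 4 grid
tall = best_grid(1024, 64, 16)       # tall matrix -> tall 16 x 1 grid
```

A square matrix favors a square grid, while the highly rectangular matrices mentioned below favor an equally stretched grid.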
53:38 | I think I'll just flip quickly through this. So, um,
|
|
53:46 | this just shows the difference. But this is what I wanted to get to
|
|
53:51 | as a general rule of thumb when it comes to matrix operations
|
|
53:59 | and, um, block sizes: if one does this two-dimensional partitioning, it is
|
|
54:07 | typically the case that for matrix-vector multiplication square blocks win, and this in fact carries over to matrix-matrix multiplication
|
|
54:14 | as well, although it's a little bit more complicated. And I think I
|
|
54:18 | made this a little bit too short, um, but I wanted to point out that,
|
|
54:23 | if you look at the communication costs, yeah, it shows that for the
|
|
54:31 | two-dimensional partitioning (and this was just picking a square matrix for simplicity) you can
|
|
54:39 | see that the two-dimensional partitioning has, um, a lower leading term.
|
|
54:50 | Excuse me. Okay. I'm sorry; I can't go through it all perfectly,
|
|
55:01 | given the time. But you can see the leading term has the square
|
|
55:05 | root of the number of processing elements, whereas in the one-dimensional partitioning it does
|
|
55:11 | not improve with the number of processors. So the two-dimensional partitioning tends to minimize the communication time, and that's the preferable thing to do.
|
|
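That difference in leading terms can be made concrete with a toy cost model (my own simplification; the constants are not from the lecture): words moved per node for y = Ax with an n-vector on P nodes.

```python
import math

# 1-D partitioning: each node must gather essentially all of x.
def comm_1d(n, p):
    return n * (p - 1) // p

# 2-D partitioning: a broadcast within a column plus a reduction within a
# row, each touching only about n / sqrt(P) elements.
def comm_2d(n, p):
    return 2 * n // math.isqrt(p)

n, p = 4096, 64
# comm_1d(n, p) -> 4032 words; comm_2d(n, p) -> 1024 words
```

The 1-D cost is essentially flat in P, while the 2-D cost shrinks with the square root of P, which is the leading-term difference described above.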
55:17 | Yet many people, when they do one-dimensional,
|
|
55:23 | when they do matrix-vector multiplication, just choose one-dimensional partitioning, I guess for simplicity.
|
|
55:28 | But it's really not the best thing to do. Mhm. So that was that. Here are
|
|
55:38 | the slides, and there is just a slightly more general case of, um,
|
|
55:46 | the row- versus column-major order; the only difference is that there are more cases
|
|
55:53 | to look at. But we'll pause for questions on that before I try to
|
|
55:58 | cover a little bit about matrix-matrix multiplication. So the take-home message is
|
|
56:09 | that the matrix shape is important. Uh-huh. And you may think,
|
|
56:16 | well, that's perhaps just an interesting theoretical aspect; who cares? But it
|
|
56:22 | turns out that, as some of you may know from, uh, computational chemistry, in some
|
|
56:30 | cases they have exceedingly rectangular matrices they need to deal with. They may
|
|
56:37 | have things that are, you know, 10, 12 or something similarly small
|
|
56:43 | for either rows or columns, and have thousands of elements in the other
|
|
56:49 | dimension. So they're highly rectangular. Another thing that I don't cover in this class is that,
|
|
56:57 | in reality, for large-scale problems your matrices may not fit in
|
|
57:07 | the memory of a single node, nor in all the memory of all the cluster
|
|
57:11 | nodes. So, um, matrices may be stored on disk. And the way they
|
|
57:16 | are processed on the cluster is by something kind of similar to what I showed
|
|
57:25 | in the matrix-vector case, but basically only a subsection of the
|
|
57:31 | matrix fits in the memory of all the cluster nodes. So in that
|
|
57:38 | case, what's being dealt with in terms of the algorithms executed by the cluster is in
|
|
57:45 | fact highly rectangular objects. And in
|
|
57:55 | that case, again, shaping your processing array is important to get really good performance.
|
|
58:04 | All right. Next, on to matrix-matrix multiplication. Then again, I'll go
|
|
58:09 | back to the systolic concept first and then generalize to 2-D, or in-memory, algorithms. So here's what we had:
|
|
58:19 | sort of one of the matrix-vector designs, where the x, um, was kind of
|
|
58:26 | stationary, if you like, and the y was accumulated in space. So anyone can
|
|
58:32 | use this to design matrix-matrix multiplication arrays, which is nothing but: instead
|
|
58:39 | of x being a single vector, x now would represent, you know, the matrix
|
|
58:44 | B. So, with that many columns, what you can do is basically stack
|
|
58:49 | the linear arrays on top of each other to then create something where your matrix
|
|
58:56 | A is in this case fixed in space and you stream your matrix B
|
|
59:03 | by. And now, since in this design the inner products are accumulated in space,
|
|
59:16 | emerging as output one by one as they are, mhm, fully computed, you
|
|
59:24 | get that basically each row of this design computes one column of the result matrix
|
|
59:37 | C, corresponding to that particular column of the matrix B. So one gets the
|
|
59:44 | columns of the matrix C out, one for each of the successive rows of this 2-D array
|
|
59:51 | of computing elements. And note again this synchronization, the skewing or delaying
|
|
60:01 | of entries and outputs, this, uh, skewing of the elements.
|
|
60:09 | That is important, um, in particular when we talk about the in-memory
|
|
60:19 | computation of matrix-matrix multiplication in a more typical cluster setting than
|
|
60:25 | this type of perhaps more embedded-like processing. I will go fairly quickly,
|
|
60:35 | so hopefully the concepts are clear, and you can look at the details
|
|
60:42 | on the slides if it's relevant to what you'd like to do, or remember at some
|
|
60:48 | point that you may need to know how to do it. So again,
|
|
60:55 | the reminder: this is what the Google TPU guys did; this is what they
|
|
61:00 | used in the design of their tensor processing unit. Now, understanding when this type of
|
|
61:09 | design is efficient: it's the same story as before. But, it turns out,
|
|
61:14 | now we have three different dimensions, right? We have the rows of the matrix
|
|
61:21 | A; the columns of A, which are shared with the matrix B; and then we
|
|
61:27 | have the columns of the matrix B. So there are three different dimensions to worry
|
|
61:33 | about. And this design is good if P happens to be large. So it
|
|
61:37 | basically inherits the properties we had from the systolic matrix-vector multiplication case.
|
|
61:46 | So, now one can also flip things around, and then there is in principle no
|
|
61:52 | difference between A and B conceptually. One can then do a similar type of
|
|
61:58 | computing array by making B stationary and instead feeding in the A's and getting the C's
|
|
62:05 | out of this processing array. Now, there is a little bit of a
|
|
62:10 | difference in this case: it comes out that this one is good if the number
|
|
62:16 | of columns of the matrix B is the dominating dimension, and otherwise it's not such
|
|
62:22 | a great design. So now we have two options, depending on whether
|
|
62:29 | P or R is the dominating dimension. But we need another one if Q happens
|
|
62:36 | to be the dominating dimension. And for
|
|
62:42 | that, we go back and, um, construct something based on accumulating the inner products
|
|
62:48 | for elements of the C matrix in place, so to speak, when one goes to matrix-matrix multiplication. So here is
|
|
62:57 | kind of what one does in terms of getting an element, and one can again
|
|
63:04 | construct this one based on the systolic matrix-vector multiplication; it's, you know,
|
|
63:10 | a little bit more messy to figure out how things are done. So now one
|
|
63:16 | can do this, with the same moral as in the previous cases.
|
|
63:26 | So how about in-memory algorithms? And I'm saying now it's first
|
|
63:30 | about 2-D, and hopefully there is time to say something about 3-D as
|
|
63:34 | well. Um, we'll see; I only have about 10 minutes left.
|
|
63:40 | So now we can take these algorithms, and I'll go through, um, some of them
|
|
63:47 | first, because I just want to highlight a little bit how the thinking
|
|
63:56 | from the systolic algorithms carries over. Things need to be structured, in
|
|
64:03 | terms of in-memory algorithms, so that one doesn't end up with excessive temporary storage.
|
|
64:13 | Yes, um, so here is what we have. Now, the idea is again the
|
|
64:18 | same allocation model. So again, um, the compiler is given a global memory across
|
|
64:24 | all the nodes, and then, assuming that it treats all the arrays the same way,
|
|
64:30 | in this case I assume that it decides on basically a block allocation for each
|
|
64:37 | one of the, ah, matrices A, B and C. So if all arrays are
|
|
64:45 | square, that means square blocks. Otherwise, there would be, um,
|
|
64:51 | non-square blocks for the processing arrays. But this is just simple, for illustration. So
|
|
64:58 | now the problem is, if you do matrix multiplication, um, by
|
|
65:06 | the standard definition, right, a row of A times a column of
|
|
65:10 | B, then what that means is that the inner index, as it's called, the second
|
|
65:17 | index of A and the first index of B, needs to be the same. Otherwise,
|
|
65:23 | the math doesn't work. So the only places where the inner index of the blocks of
|
|
65:29 | A and B are the same are on the diagonal. So, yes, one
|
|
65:34 | can do some computation doing it this way, but most of the nodes don't
|
|
65:41 | do anything. Basically, across the array of nodes, only
|
|
65:46 | the diagonal contributes to the computation. So this is not a very good use of
|
|
65:51 | the resources. So now, if one goes back and again looks at the systolic
|
|
65:56 | case: the way things worked in the systolic case was, in fact,
|
|
66:00 | that, if I look at how things were organized, the elements of matrix A were
|
|
66:08 | laid out in kind of the normal way, rows horizontally, columns vertically. But that wasn't the
|
|
66:14 | case for B; B was actually effectively transposed. So then, of
|
|
66:23 | course, uh, this is kind of what this slide shows. And so now, in this
|
|
66:28 | case, choose to transpose A instead. So now, if one looks at the inner
|
|
66:34 | indices of A and B, one discovers that, yes, all the, um, nodes
|
|
66:40 | can do a block product of what they have of A and B, and that
|
|
66:47 | does make sense. One can look at the first column of processing elements, for instance:
|
|
66:54 | going down the first column of A, the second entry is
|
|
67:03 | A01, and that matches B10 in terms of inner index; going
|
|
67:08 | further down, there are A02 and B20. So what it means is
|
|
67:14 | you can do matrix block products. But it also means that all those products of
|
|
67:23 | elements are contributions to C00. So it means the corresponding block of the product
|
|
67:32 | matrix C is partially computed within a column of the processing array. So no processor has
|
|
67:44 | the full, uh, product of a block row of A times a block
|
|
67:53 | column of B. So that needs to be assembled by doing a reduction within columns
|
|
68:01 | of the processing array. So, yes, all the processors can do
|
|
68:08 | block matrix products, but then it takes a reduction within columns to get the
|
|
68:16 | final results.
|
|
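One such step, block products down a grid column followed by a column reduction, can be shown with scalars standing in for blocks (my own minimal sketch; the later cyclic shifts that produce the off-diagonal blocks are omitted):

```python
# With A stored transposed, PE(k, j) holds A[j][k] and B[k][j]; its local
# product contributes to C[j][j], and a reduction down grid column j
# assembles that diagonal block of C = A @ B.

def column_step(A, B):
    n = len(A)
    return [sum(A[j][k] * B[k][j] for k in range(n)) for j in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
diag = column_step(A, B)   # diagonal of A @ B: [19, 50]
```

After each such step, shifting the operands cyclically lets the same column reduction produce the next set of blocks of C.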
68:23 | I'll show a couple of other cases first. Um, we'll see; the next several slides
|
|
68:30 | I'll really just flip through, but they
|
|
68:36 | just show the steps, then,
|
|
68:40 | kind of corresponding to the systolic algorithm. So what I just talked
|
|
68:46 | about was one computation step, and then one does kind of, ah, a rotation
|
|
68:53 | or shift of the matrices. So A ends up being, in this case,
|
|
69:00 | the one rotated, because B was fixed; A was the one that
|
|
69:05 | was passed through the systolic array, so in this case it gets moved around,
|
|
69:12 | and the same thing will need to happen with C. So,
|
|
69:18 | this is then what happens: since B is fixed, one does the steps by cyclically shifting the arrays A and C, with the reduction within columns.
|
|
69:27 | And eventually you get the whole operation done. So now one can do
|
|
69:36 | the corresponding things; okay, that was the B-stationary case. Then one
|
|
69:42 | can do, uh, the other ones; let me not mix up which one is stationary.
|
|
69:50 | Um, okay. So what I did instead was to play around:
|
|
69:55 | one in fact can, by doing the rotation of the arrays, not just doing the
|
|
70:04 | reduction, one can get things so that, in fact, um, you can avoid
|
|
70:13 | the reduction of C. So you get different contributions to C computed in
|
|
70:24 | different locations, and then in the subsequent steps of the algorithm you will
|
|
70:29 | pass elements of C around instead of doing this reduction operation. So at this
|
|
70:35 | point, you get something that kind of mimics the systolic algorithm by not having
|
|
70:43 | the potentially somewhat inefficient, uh, all-reduction within columns being done.
|
|
70:50 | Instead, yes, there is a reduction by cyclic rotation of, um, the elements of
|
|
70:58 | C. So this is kind of what the algorithm does. So then there are
|
|
71:06 | the various steps; anyone who is interested can follow this simulation of the algorithm.
|
|
71:13 | I won't have time to do that here. So one can then also, I guess, repeat
|
|
71:27 | this exercise that I did in terms of when this algorithm is good,
|
|
71:31 | figuring out what the shape should be. B was the one that was stationary,
|
|
71:35 | and A and C moved, and it turns out that one wants square blocks
|
|
71:42 | of B. And that's something to note when you think about the different versions of
|
|
71:48 | the algorithms in terms of what was stationary in the systolic algorithms, A, B or
|
|
71:55 | C. So this was the B case. Okay, so now you can do
|
|
72:03 | the same thing for A. I will just flip through them and comment.
|
|
72:09 | There are, hopefully not too hard to follow, slides that show you the steps of the
|
|
72:15 | systolic algorithm and convert it to an in-memory algorithm that does pretty much the same
|
|
72:22 | compute-and-communicate as the streaming systolic version. And if you do
|
|
72:30 | that, you get something like this, and then you can look at the
|
|
72:35 | issue of how to configure the array: there should be square blocks of
|
|
72:40 | A, and that sort of mimics how these things are done. And I will try
|
|
72:47 | to cover this last one, and then I will stop and take questions, and maybe
|
|
72:52 | do some supplements next time for the three-dimensional case. But now you can take the case
|
|
73:00 | where C is stationary, and then you move A and B instead. The
|
|
73:05 | slides that did this are a little bit busy. So this is actually done more as
|
|
73:13 | a, um, not fully, as a block algorithm with different shapes of the
|
|
73:24 | matrices. So it's a little bit more elaborate. The point for you to
|
|
73:30 | remember about this one: it ended up as the C-stationary version, and this matrix
|
|
73:38 | multiplication is a very well-known algorithm, known as Cannon's algorithm. Anyone who looks at
|
|
73:47 | parallel algorithms for linear algebra will most likely come across this notion of Cannon's
|
|
73:55 | algorithm. It was an algorithm designed by a PhD student at, uh, Montana State University quite some
|
|
74:06 | time ago. And it's basically based on C being stationary and A and B being moved around
|
|
74:13 | C. And the thing that made it so famous, if nothing else,
|
|
74:21 | and actually how many people have used it in implementing matrix multiplication, is the alignment
|
|
74:29 | between A, B and C; that is, the skewing I talked about in the systolic
|
|
74:36 | algorithms. So that's why I tried to point out that this initialization, or skewing,
|
|
74:43 | is important for conserving memory. That's kind of, to me, one of the
|
|
74:48 | claims to fame of this algorithm: avoiding broadcast and temporary storage for the matrices.
|
|
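A minimal serial simulation of Cannon's algorithm, with one scalar per node standing in for a block (the function name and layout are mine): C stays in place, A is skewed and rotated left, B is skewed and rotated up.

```python
# Cannon's algorithm on an n x n grid of "nodes", one scalar per node.

def cannon(A, B):
    n = len(A)
    # initial alignment (the skewing): shift row i of A left by i,
    # column j of B up by j
    A = [row[i:] + row[:i] for i, row in enumerate(A)]
    B = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += A[i][j] * B[i][j]   # local multiply-accumulate
        A = [row[1:] + row[:1] for row in A]   # rotate A left by one
        B = [[B[(i + 1) % n][j] for j in range(n)]
             for i in range(n)]                # rotate B up by one
    return C

C = cannon([[1, 2], [3, 4]], [[5, 6], [7, 8]])   # [[19, 22], [43, 50]]
```

After the initial alignment, every node multiplies matching operands at every step, so no broadcast and no extra copies of A or B are needed: the memory-conserving property mentioned above.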
74:59 | So let me, um, show you a little more.
|
|
75:08 | In this case, I did things to show you that, when things are not square,
|
|
75:15 | it becomes a little bit more complicated how you do this synchronization between the streams. But
|
|
75:21 | my time is kind of up, so I will flip through the slides quickly and
|
|
75:26 | show you a little bit of how things can be done in this case for
|
|
75:31 | the alignment, or, as I call it, the skewing, when things are done with
|
|
75:37 | differently shaped A's and B's. In this case, um, it turns out that the
|
|
75:45 | skewing of A is kind of with respect to the number of processing rows,
|
|
75:57 | whereas the skewing of B is with respect to the number of processing columns. Here
|
|
76:04 | it's basically a four-by-eight processing array; so that's why this skewing of A
|
|
76:11 | works the way it does with the array's columns. But it will be different again, and I have
|
|
76:18 | slides that you can look at, when the matrix dimensions are not just a multiple
|
|
76:25 | of the array dimensions; then it gets even more complicated. But you can work this through,
|
|
76:29 | and then one last thing: stepping through step by step, one can see that in
|
|
76:34 | every step only a part of the matrix B is used, so to speak, and one
|
|
76:41 | continues on with the blocks of A and B in the matrix multiplication steps.
|
|
76:50 | So I'm going to skip ahead; the end result is that, for C stationary
|
|
76:57 | now, there should be square blocks of C. So it ends up being,
|
|
77:03 | like in the matrix-vector multiplication case, that the optimum with respect to, um, performance and
|
|
77:11 | communication is that the stationary matrix should have square blocks, and that is kind of
|
|
77:21 | the rule of thumb. And there's some data on how it plays
|
|
77:24 | out in particular cases that have been worked through. So, um, now I
|
|
77:33 | think time is really up. So this is what I said, and I'll just leave
|
|
77:41 | you the slides here in case you're interested. Um,
|
|
77:54 | I will just mention it, and the slides are in there, though they're not entirely complete.
|
|
77:59 | So, basically, what one does in the three-dimensional case, and if somebody is interested
|
|
78:05 | we will cover it next time: what you do is to kind of
|
|
78:11 | think of your processing array as, um, a number of planes. The reason
|
|
78:19 | for that is, when you do matrix-matrix multiplication, as we do it,
|
|
78:23 | you have three nested loops. So, in fact, matrix-matrix multiplication has,
|
|
78:29 | in a way, three dimensions: one dimension for the number of rows
|
|
78:35 | of A and C, another dimension for the number of columns of A
|
|
78:41 | and rows of B, and a third for the number of columns of C
|
|
78:47 | and B. So it's actually a three-dimensional index space, and it pays off,
|
|
78:53 | it turns out, to in that case configure your processing array as, um, a
|
|
79:01 | three-dimensional array. And again, the shape of the array should be such that,
|
|
79:06 | basically, the blocks, if you choose to compute C in place, should be cubes.
|
|
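The 3-D view of the i, j, k index space can be seen in a serial sketch: blocking all three loops carves the space into bricks, and a 3-D algorithm gives each node one such brick (block sizes bi, bj, bk below are illustrative, not from the lecture):

```python
# Blocked matrix multiplication over the 3-D (i, j, k) index space.

def blocked_matmul(A, B, bi=2, bj=2, bk=2):
    n, p, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, bi):             # rows of A and C
        for j0 in range(0, m, bj):         # columns of B and C
            for k0 in range(0, p, bk):     # columns of A / rows of B
                # one brick of the index space = one node's share of work
                for i in range(i0, min(i0 + bi, n)):
                    for j in range(j0, min(j0 + bj, m)):
                        for k in range(k0, min(k0 + bk, p)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

In the parallel version, the three outer loops become the three axes of the processor grid, which is why the array is shaped as stacked planes.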
79:13 | Now, this 3-D allocation was, in fact, done
|
|
79:24 | ages ago on the Connection Machine systems that I told you about; they had
|
|
79:29 | a library for it, and I worked for that company. Um, and later on the
|
|
79:35 | notion was kind of rediscovered by the Berkeley group, the group around Jim Demmel.
|
|
79:43 | So they did what they call the 2.5-D algorithm for doing matrix multiplication
|
|
79:49 | and some other linear algebra operations, and on the Blackboard website there are some of
|
|
79:54 | their papers, um, on doing more-than-two-dimensional matrix algorithms. And we're out of
|
|
80:02 | time, taking a few minutes over, but let me stop and take questions.
|
|
80:18 | So the take-home message is, again, that, yes, when you have more than
|
|
80:24 | one object, you need to pay attention to how the different objects are used relative to
|
|
80:29 | each other in terms of data references. And the configuration of your collection
|
|
80:38 | of processors for minimizing communication then becomes a function of the object shapes. That's
|
|
80:46 | the big number-one take-home message. And I guess the last one is that communication
|
|
80:54 | is minimized by minimizing the surface area relative to the volume of the data that
|
|
81:05 | is held locally without being moved. So, fairly simple rules in the end.
|
|
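That surface-to-volume point in numbers (my own back-of-the-envelope, with illustrative constants): for a b x b x b brick of the multiplication's index space, arithmetic grows as b cubed while data touched grows as b squared, so the communication-to-compute ratio falls as 1/b.

```python
# Communication-to-compute ratio for a b^3 brick of the matmul index space.

def ratio(b):
    flops = 2 * b ** 3     # multiply-adds inside the brick (volume)
    words = 3 * b ** 2     # faces of the A, B, C blocks touched (surface)
    return words / flops   # = 1.5 / b

# doubling the block edge halves the ratio
```

This is the quantitative reason that larger, well-shaped local blocks reduce communication relative to useful work.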
81:16 | They are not entirely easy, though: figuring out how to take the simple bottom line and turn
|
|
81:23 | it into algorithms. Okay. Next time, the last lecture, I will then say
|
|
81:38 | something about transpose operations. And then the intent is to talk a little bit about
|
|
81:46 | sparse matrix stuff. That is important, maybe more important than dense. Um,
|
|
81:53 | it's also a little bit more complex, or not so little, not only for science
|
|
81:59 | and engineering applications, but also for machine learning and data analytics. So that will
|
|
82:06 | be the topic for next time. Well, I'll stop sharing the screen, or still keep it
|
|
82:18 | on, for taking questions, if there are some. ... Thank you.
|
|
82:38 | Okay, so thank you, and I'll try to clean up my
|
|
82:45 | notes. ... Great. Thank you.
|
|
82:59 | Thank you. Thank you.
|