00:04 So this is the last week, or basically today and Wednesday's lecture.

00:10 So I was trying to think of things that might be the most useful,

00:17 um, for the future and to some degree for some of the projects.

00:29 We'll talk mostly about aspects of parallel algorithms, and I'll pick very simple ones,

00:35 not because the conceptual content is very hard to grasp; it is

00:44 quite simple, but to focus on what's common to many algorithms, and

00:52 I'll basically use linear algebra to illustrate those points. So this is just a summary of

01:00 pretty much what I have covered in the course. And it's all been focused on

01:07 trying to understand how to get good performance out of current types of systems, both at

01:15 the individual node level, homogeneous and heterogeneous or accelerated nodes, and

01:21 then clusters. And to get good performance one really needs to understand the architecture,

01:27 um, because it's, uh, a bit too optimistic, or unrealistic, to expect

01:35 that the compiler folks will figure out how to translate code that is

01:42 structured in a nice way, or that platforms will do the translation from the source code to

01:50 something that performs well on the platform. So much of it has

01:58 been basically trying to understand how to make good use of the memory hierarchy and,

02:05 um, also a little bit in terms of how to use the, well, the

02:11 SIMD-type features of CPUs and GPUs. And then, more recently,

02:20 we talked about, when the problem has to be partitioned, how the data structures

02:29 are allocated to different nodes in a cluster, how to partition things and how to

02:38 place things onto the different nodes. Now, a little bit more about the

02:46 algorithmic aspect of parallel algorithms, and, um, the focus in this case is to try

02:55 to be careful about the data copying, as well as how to achieve good

03:03 load balance. And hopefully this will become more clear as I go. Um, so this is

03:11 kind of what I will use as examples: again, linear algebra, and the simplest

03:17 kind of linear algebra: matrix-vector and matrix-matrix multiplication, to make

03:23 some of the points about data management as well as organizing the computations in

03:31 order to limit data replication and, um, try to minimize the use of

03:44 temporary variables as well. That is in order to preserve the capability of solving large problems.

03:51 And the way I like to do this is to start out by doing things using what is

03:59 known as systolic algorithms. I'll talk about them. They were kind of the popular

04:04 thing to do, um, quite some time ago, in the early

04:10 days of the current generations of parallel computers. They have become popular again as

04:19 many companies design accelerators, in particular for machine learning. So it's a good principle

04:29 and, for example, a good way of understanding, um, data interaction in

04:36 parallel algorithms. Okay, so, um, I think I've pretty much said this

04:43 already as a preamble to what I want to talk about, and today I

04:52 will not cover sparse matrices, but my intent is to say something about them on

05:02 Wednesday. And it's not only for lots of scientific and engineering applications, but it

05:08 has turned out to be an important set of operations for machine learning. And one

05:16 thing that tends to happen quite often when somebody gets into writing distributed algorithms is to

05:26 use the client-server model as a model for how to get parallelism.

05:33 Unfortunately, I would not recommend that model for cluster-style computing, because it tends to

05:42 limit scalability. Either the server becomes the bottleneck, or many times what's being done

05:48 is excessive. Replication is one aspect of it; another is that the way the client-server model

05:58 is implemented in the algorithms is such that it precludes solving a larger problem than

06:05 you could have solved on a single node anyway. So, um, it's not

06:10 a particularly good mindset for truly distributed computing. Um, and the other thing, which,

06:22 I guess, will become clear partway through tonight's lecture, is that when I

06:30 spoke about data partitioning, most of it was basically in the context of there

06:36 being a single data structure that got carved up, with a number of operations on

06:42 that. And it wasn't so much germane to several different objects or data structures

06:51 being used in a computation, where the relationship between how they are partitioned and used

06:58 becomes important. It is exceedingly simple in matrix-vector operations, but at least I hope

07:05 to be able to get the principles across and make you think in general. Any questions on these

07:35 general notions before I dive into specific types of algorithms? So, um, systolic

07:45 computation is basically the notion of things coming in in a rhythmic way:

07:51 you do compute, communicate, compute, communicate. It's kind of pumping data

07:59 through. So, in the early days, from this notion it became a,

08:03 um, popular design principle for parallel hardware. It was used a lot in terms of signal

08:09 processing or embedded processing, where they operate on streams. That will be apparent in

08:17 the way I'm talking about things in the next several slides, but just to

08:21 show that, as I said, it is kind of an embedded thing as

08:26 well: this is an example of a Cisco router that uses this systolic principle

08:32 for packet processing, and, of course, there is this kind of inherent pipeline type of

08:39 stream processing, where you have different stages for checking the header and figuring out

08:49 what to do, and so on. They designed this router so it has basically eight different pipelines that

08:55 can operate in parallel, and in each pipeline there are different stages for different parts

09:00 of the packet processing that is required to try to figure out what to do,

09:10 and so that gives, uh, well, very well structured computations where

09:17 many times a limited amount of branching and conditionals is involved, and it

09:27 then also avoids round trips to memory by basically passing things from stage to stage as data

09:35 goes through it, so data goes from memory in and then out. So that's

09:42 the one example. And it's still the case; it's fairly common in how high-end

09:47 routers or switches are designed. And here is the other one, that you

09:55 know from machine learning. Now, most of you probably know of, or may

10:01 even have used, Google's tensor processing unit, the TPU, I'm

10:12 sure, and they adhered to this systolic design principle, streaming their processing through the TPU. So people are doing things using this systolic

10:29 going to talk specifically about this, expect the multiplication of a couple of

10:35 ways to try to illustrate this historic on point out the benefits of doing

10:44 , and it turns out I also it because it takes eyes very

10:50 Asai mentioned earlier, too. Make data interactions very apparent. Andi kind

10:58 forces want to think about them. is the beneficial also, for what

11:04 call the in memory algorithms or is longer streaming algorithms. But the data

11:10 , and the note off, a cluster. But it helps them

11:17 excessive memory requirements, being careful and how the data interaction happens. So

11:27 a very simple, cartoonish thing about made to expect the multiplication on for

11:35 particular case. The organization is such there is a process for every process

11:45 for every component of the factory uh, clearly sort of very little

11:52 , except maybe a little bit But it's fairly simple to take this

12:00 of extreme part of this model and . Inning it. Go basil mapping

12:07 calling the processes that are shown making them kind of virtual processors that

12:12 emulated by the physical processes. that mapping principle is very straightforward,

12:18 I won't cover that today. um, focus on necessary the data

12:27 . So as we do know, major expect a multiplication, each elements

12:33 component of the vector X multiplies a of The Matrix A. Someone look

12:40 this image, then we follow. that? You know the processor has

12:49 in it. That's basically column zero the nature say so. It's nothing

12:56 about that. So basically, the is that, in this case,

13:00 stream, some metrics A into this dimensional array of processing elements.

13:11 well, it's a little bit mhm about this picture is that we also

13:18 that any for every component of why the inner product off the roll off

13:25 Matrix? A. With a column . So basically, each processor in

13:32 case computes mhm. Partial contribution to inner product on a single in a

13:43 gets accumulated by passing partial in their from left to right in this

13:49 So as things kind of Chuck's along feeding the Matrix A into the processing

14:02 than components off, there is something by gets output on the right hand

14:11 . Now, the other things to attention to is that Arizona some kind

14:18 skewing going on between the successive columns a going from left to right from

14:25 column to subsequent columns. Now that's to avoids excessive or temporary storage by

14:38 , feeling a or elements of columns a into processing unit until the corresponding

14:47 in a product is also arriving at processor, someone trying to get them

14:51 arrive at the same time. So are synchronized so and so on.
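Since the schedule can be hard to visualize, here is a minimal NumPy sketch (my own illustration, not code from the slides) that simulates this first design, accumulating inner products in space: PE j holds x[j], sees row i of its column of A at step t = i + j, and passes partial sums to the right.

```python
import numpy as np

def systolic_matvec_space(A, x):
    """Toy simulation of the 'accumulate in space' systolic design.

    PE j holds x[j]. Column j of A is fed in, skewed (delayed) by j steps,
    so PE j works on row i = t - j at step t. Partial inner products move
    one PE to the right per step and exit the array fully formed.
    """
    P, Q = A.shape
    partial = np.zeros(Q)            # sums produced by each PE on the previous step
    y = np.zeros(P)
    for t in range(P + Q - 1):       # the critical path: P + Q - 1 steps
        nxt = np.zeros(Q)
        for j in range(Q):
            i = t - j                # skewed feeding of column j
            if 0 <= i < P:
                incoming = partial[j - 1] if j > 0 else 0.0
                nxt[j] = incoming + A[i, j] * x[j]
                if j == Q - 1:       # finished inner product exits on the right
                    y[i] = nxt[j]
        partial = nxt
    return y

A = np.arange(12.0).reshape(4, 3)    # P = 4 rows, Q = 3 processing elements
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(systolic_matvec_space(A, x), A @ x)
```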

14:58 I think I have a little bit here of pictures of how these things work.

15:05 And this kind of skewing or delaying of successive columns of A is something that we

15:12 will come back to in many shapes and forms in how to structure algorithms, not just

15:16 for matrix-vector but for matrix-matrix multiplication. And that's whether they are kind of

15:23 array processing, where things are streamed through a processing array, or it is for

15:29 the in-memory type algorithms. Now, I was planning to cover both matrix-

15:39 vector and matrix-matrix multiplication today, so I'll go fairly fast and just make the points

15:47 to give you the concepts, which are fairly simple, I think, and then you

15:53 can look at the slides, and there are, in fact, many more slides

15:56 than I'm going to talk about, and the slides effectively illustrate these points. Now,

16:06 um, given that time is going somewhat fast, um, today I will just,

16:14 uh, present the things here in terms of time, speedup and efficiency, and try to

16:22 explain how those things came about, instead of just showing the formulas or asking you to

16:29 come up with a formula. So for the time, the way to think about

16:34 it in a good way is to identify what is kind of a critical path in

16:43 terms of getting the computation done. So in this case, the critical path is

16:56 feeding the first column of the matrix A into the left-most

17:06 processor, in this case the one that has x0, and that takes, obviously, P

17:12 steps if you feed in one element at a time of the elements in the column.

17:20 When you're done with that column, you're also doing the last computation in this

17:24 case for the last inner product, the inner product for the last row

17:30 of A to be precise. But that's only the first term of that inner

17:36 product, and then the inner product needs to be accumulated by passing things to the

17:42 right, and in this case there are Q columns of the matrix.

17:47 So that's why you get essentially a critical path of length P plus Q.

17:55 That is the critical path. Then I did it in a little bit more detail in

18:01 this case and assumed that it takes a time t_a to do an arithmetic

18:07 operation and a time t_c to communicate an element between a pair of adjacent

18:20 processors in this processing array. So, then, also assuming that there is no

18:33 overlapping of computation and communication, you get this additive formula, where you figure out

18:39 how many arithmetic operations in sequence there end up being, and how many communications in

18:44 sequence there are along this critical path. And then you end up with this formula for

18:50 the time, so there's nothing but arithmetic here, no calculus; you just figure

18:58 it out. But the principle is: follow the critical path, and in this

19:03 case I assumed no overlap of arithmetic and communication operations. Overlap may very well exist, but

19:10 it doesn't change the qualitative aspects of the outcome. So that was the first piece,

19:16 the parallel time. And then we have the standard definition

19:23 of the speedup, you see. And, um, then we

19:36 have the serial time: if there is just a single processing unit, the matrix-

19:42 vector multiplication is these two operations, an add and a multiply,

19:51 for each of the matrix entries, and so in terms of the computation, if you

19:59 just did it on a single processor, it would be two times the number

20:04 of matrix elements times the time to do each one of these operations. And then

20:08 we have the formula for the time in this parallel case, so that gives the

20:14 speedup, and then the efficiency is the speedup divided by the number of processing elements

20:21 that you use. So we get this formula, and I'm going to spend some time

20:26 playing around with that formula to show, um, the behavior a little bit

20:31 as a function of the matrix shape. So here's what came from the last

20:41 slide. So now one can look at kind of extreme cases: either the case

20:47 that P is much larger than Q, which means there are many more rows

20:53 in the matrix than there are columns. When you just look at this expression, it's

20:58 fairly simple to simplify: you ignore Q when P is so

21:04 much larger than Q. So then you get something that is basically order one.

21:09 The coefficient is even very small, so you've got something that is close to

21:14 100% efficient. On the other hand, if Q happens to be large, the efficiency is

21:20 pretty poor.
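For reference, here is a hedged reconstruction in LaTeX of the quantities being discussed, assuming each pipeline step costs one multiply-add (2 t_a) plus one nearest-neighbor transfer (t_c), with no overlap; the exact constants on the actual slides may differ:

```latex
T_{\mathrm{par}} \approx (P+Q)\,(2t_a + t_c), \qquad
T_1 = 2PQ\,t_a, \qquad
S = \frac{T_1}{T_{\mathrm{par}}}, \qquad
E = \frac{S}{Q} \approx \frac{P}{P+Q}\cdot\frac{2t_a}{2t_a + t_c}.
```

For P much larger than Q the first factor approaches one, matching the near-100% efficiency claim; for Q much larger than P the efficiency degrades like P/Q, which is the poor case discussed next.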

21:30 So maybe I should stop for a second and try to see if anyone has an idea for how come, why this is, in

21:36 terms of the basic problem with this organization of the computations, if Q is much larger

21:47 than P. Well, then, the problem basically is in the time: if P

22:07 is very large, then the time is kind of dominated by the time to effectively pipe in

22:15 the matrix, and the, uh, Q, the number of processing elements, is fairly small.

22:22 So the accumulation of the inner products, the partial inner products, towards the right of

22:29 the linear array doesn't add much to the total computation time. When, on

22:39 the other hand, P is small relative to Q, this pipeline time dominates,

22:45 and the pipeline is not very full; the number of processing elements ends

22:51 up not being very efficiently used, because there are few inner products being

22:57 passed through a very long pipeline. Okay, so if it so happens that,

23:08 you know, you have many columns but not so many rows of the matrix, then,

23:16 um, this wouldn't be a good design. So does anyone have an idea, along

23:23 these lines, of what you would want to do if it's the other

23:29 way around? Yeah, that is basically the suggestion: that one would

23:36 restructure the computations. Well, so here is sort of one way of flipping things

23:47 around, effectively by saying, okay, instead of accumulating inner products in space, I've

23:58 decided to accumulate the inner products in time, locally in a single processing element. So

24:08 now, we know that each element of the result is the inner product of,

24:16 again, a row of A and the column vector X. So now, in this organization,

24:24 that means that rows of A need to be entered into the corresponding processor that accumulates

24:37 that inner product over time. And, um, in order to get the inner products

24:45 computed correctly, that means, as elements of a row of the matrix A are

24:51 entering, the corresponding component of X needs to enter that processing element. And now,

25:01 by not doing any kind of broadcast or storage, but again pipelining, in this case,

25:08 the vector X goes through this array, as they call it. Yeah, the vector

25:17 X is used for every row of A in the matrix-vector multiplication. So

25:23 that means, um, every element of X needs to enter every processing element

25:33 to then compute successive inner products in the different processing elements. So now,

25:42 again, X is not being broadcast or replicated, it's being passed through. Then there's also a

25:50 skew, so a delay of the different rows entering the different processing elements along this linear

25:57 array of processing elements. So once again, there's kind of a skewing of

26:06 how data is entered into this array in order to avoid temporary storage and kind of

26:12 minimize, um, memory requirements along the way.
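The flipped design can be sketched the same way (again my own toy schedule, with x[j] reaching PE i at step i + j, matching the skew of the rows of A described above):

```python
import numpy as np

def systolic_matvec_time(A, x):
    """Toy simulation of the flipped design: PE i accumulates y[i] 'in time'
    while the vector x is pipelined through the array and row i of A is fed
    in with a delay of i steps so the operands meet at the right moment."""
    P, Q = A.shape
    y = np.zeros(P)
    for t in range(P + Q - 1):
        for i in range(P):
            j = t - i                # x[j] reaches PE i at step i + j
            if 0 <= j < Q:
                y[i] += A[i, j] * x[j]
    return y

A = np.arange(12.0).reshape(4, 3)
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(systolic_matvec_time(A, x), A @ x)
```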

26:20 And we can do the same thing for the critical path, work out speedup and efficiency, using

26:27 the same principle as I talked about for the other algorithm, and now we can go

26:32 through and figure out what happens in each case. Now, if Q is much

26:39 larger than P, then we get good efficiency, as we'd hoped for by flipping

26:45 things around. On the other hand, this organization is not very good if,

26:50 uh, P is the thing that is significantly larger than Q. And I guess, as

26:58 a side remark, textbooks and other things tend to use all square matrices,

27:05 which tends to be reasonably fine if you have single nodes, since the

27:12 matrix shape usually matters little enough that that's okay. But it turns out matrix shapes are

27:18 very important when it comes to designing algorithms for distributed settings. So, so what

27:29 we learned was, another way of looking at it: when the serial component

27:36 was large, in the first case P, the number of rows that was entered sequentially, was

27:42 large, then I got good efficiency. And in the second design, when Q,

27:52 the number of columns, was large, and the elements in a column were entered sequentially,

27:57 then you also got good efficiency in that case. So, yeah,

28:03 it tends to be true in many designs that it's not too hard to

28:13 get very high efficiency if the parallelism is limited. So that connects also

28:23 back to a notion we had before in terms of strong scaling. Strong scaling means trying

28:29 to increase the number of processors for a fixed problem size, and then the serial component on this

28:37 critical path makes it hard to get good efficiency. And the other thing relates

28:44 to what I just mentioned: that for distributed algorithms it is really imperative that one

28:54 also thinks about shape when it comes to, uh, linear algebra operations. It's very simple in

29:01 terms of talking about shapes when it's just, you know, multi-dimensional arrays;

29:08 the shape is fairly straightforward. For more complex data structures it's not so easy

29:15 to talk about shapes. But it is important to realize that there is a relationship between

29:22 how you organize your computation and how you allocate the data as a function of

29:29 the shapes of the data structures you've got. Excuse me, Dr. Johnson?

29:40 Yes? Um, will we be talking, uh, about whether it's possible to do

29:46 the transpose operation on the matrix? Ah, we do it sometimes in the

29:54 graphics class to preserve the normals. Um, I will bring it up in the,

30:02 um, next lecture. It is quite interesting to figure out how to do it, and

30:10 it's not entirely trivial. But, I mean, sometimes I do talk about it in this

30:17 class; I cut down the slides for today. But I can have some next time.

30:26 Okay, cool. Thank you. So there is a fair amount of trickiness

30:31 to doing matrix transpose in place in an efficient way, without just making,

30:39 you know, an extra copy. So there is also more to it. Um, I

30:49 guess, uh, in particular, I don't know if Scott needs to do it, but in

30:58 terms of the fast Fourier transform, the algorithms that tend to be used the most,

31:06 those algorithms introduce something that is known as bit reversal in how they operate, and the

31:16 bit reversal is kind of, well, it has similarities with the matrix transpose

31:22 case, um, even though they're different. And it turns

31:33 out that either matrix transposition or bit reversals represent the worst-case data movement

31:45 for many interconnection networks. But I'll try to talk a little bit more about

31:51 it next time. That's a good question. Thank you. Okay, so

32:03 what I wanted to do next was, with these simple, um, two

32:11 systolic algorithms on the one-D array, to try to use them as

32:17 a little bit of a model or conceptual device for how to do things if you

32:25 think of your processing elements as being a one-dimensional array, but instead

32:30 of data being streamed through the processing array, data is allocated to it the way

32:38 it would typically be in a cluster. So, um, here is, you know,

32:49 now, what's a typical thing you would see in how people do data allocation on

32:56 clusters? In this case, I show kind of one-D partitionings, where

33:03 the idea is to allocate a collection of rows to each one of the processing elements.

33:10 And so you get these kind of row-like panels, as it's sometimes called, where

33:16 the elements of the matrix A are assigned to each one of the, um,

33:23 processing elements or nodes in the cluster. Now, there's another thing about this picture:

33:32 I did the same thing to the vectors Y and X. So,

33:43 the objects in this case are partitioned in a similar fashion, in that they are

33:50 all evenly distributed across the nodes of the cluster. Now, depending on whether you

34:01 manage it yourself, which is kind of the case if one were to use

34:07 MPI, you decide what goes onto each node; you may choose to do things

34:12 differently. But there are also attempts, uh, in fact, there are languages

34:21 known as kind of data-parallel languages, in which it's the task of the compiler

34:30 to figure out how to allocate the data to the nodes, and the most common way for

34:41 a compiler to do it is to treat all the arrays equally in some sense.

34:46 It would not necessarily understand the semantics of the program, but just do things based

34:53 on the dimensionality of the vectors and try to partition things evenly. So it would

35:02 mean, like, naively, for every array it would just partition one dimension, and

35:06 I will talk about other partitionings for matrices. But the most simplistic way is

35:12 just, okay, partition one dimension, and sometimes that works just fine. So this

35:18 is kind of the premise for this particular illustration of how data may be allocated:

35:25 the underlying scheme is so simplistic that it could easily be automated if the

35:44 God said about this thing. So see if anyone can identify a little

35:55 of an issue with this way of things organized. There is anything that

36:01 think is, uh, problematic. up so easy to try to get

36:14 question. But here is kind of point I was trying to make you

36:18 of been thinking about. So So matrix make matrix vector multiplication works

36:25 These component of X just multiply, , the corresponding column away. So

36:31 you have a block off elements of vector X in the know that nicely

36:38 you can only do and not the Panin made to expect a multiplication in

36:47 note, you can only do the matrix vector multiplication correspondent to the elements

36:55 except that particular processor does have. in this case, it means kind

37:01 just diagonal blocks off the natives factor can actually be executed as the data

37:08 initially allocated. So that means that need to, um, do some

37:18 with some flavor, and that depends what kind of algorithm you use.

37:23 let's see, what I did on next slide is kind of what many

37:27 tends to do so fine. I just broadcast X to every process,

37:36 that needs it. So in this case, it would mean and,

37:42 all, broadcast off X. uh, every processor needs all the

37:54 of X. So that's why and all broadcast does make sense.

38:02 as you may remember, from the I lecture, it's even a standard

38:09 routine in NPR. So this perfect certainly works. But it requires now

38:22 memory because now you need a lot memory for X than initially allocated,

38:31 now entire metrics or vector X needs be. It's the world in every

38:39 , so it's not quite as big a because excess, just the vector

38:48 a is. I got the to object so but it's still an increase

38:55 memory in there alone. Later show you can avoid doing this, but

38:59 is kind of the first approach people tend to do, and it's

39:05 very scalable in particular, if you to generalize it too. Matrix

39:10 multiplication Where, then, instead of single vector X, you get the

39:17 . So anyway, so now one do this. Major, expect the

39:22 , and it's all said and and we can then do the usual
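As a concrete picture of this scheme, here is a toy NumPy emulation; the node layout and names are illustrative, and in real MPI code the concatenation step would be an MPI_Allgather:

```python
import numpy as np

# N "nodes" are emulated as array slices. Each node owns a row panel of A
# and a block of x; the all-to-all broadcast is modeled by concatenation.
N, P, Q = 4, 8, 8
A = np.random.rand(P, Q)
x = np.random.rand(Q)

A_panels = np.split(A, N, axis=0)          # one row panel per node
x_blocks = np.split(x, N)                  # x evenly distributed as well

x_full = np.concatenate(x_blocks)          # "allgather": every node now holds all of x
y_panels = [A_p @ x_full for A_p in A_panels]   # purely local matvecs afterwards

assert np.allclose(np.concatenate(y_panels), A @ x)
```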

39:28 So, as I talked about, one can figure out the efficiency, and one can

39:34 play around with these numbers. In this case, I even put in some typical

39:39 numbers for the speed of communication and arithmetic. And again you see, depending on the

39:49 number of rows with respect to the number of processing elements, or, sorry, for this slide

39:58 P is still the number of rows in the matrix and N is the number

40:03 of processing elements. So P does not stand for processors, but for the number

40:09 of rows. And remember, the idea here was that the matrix is partitioned so

40:18 you get kind of P divided by N rows, for an even distribution, in each

40:26 one of the nodes. So it depends upon the, uh, serial component: if

40:34 the serial component, which is now P divided by N, in particular is large, then things

40:42 are good. Now, when P is about the number of processing elements, basically one

40:51 row per processing element, then it depends on the ratio between the

41:01 arithmetic time and the communication time, and that formula doesn't look so bad on the

41:09 slide. But if you plug in real numbers, and remember again that communication between

41:17 cluster nodes compared to the time for an arithmetic operation is very slow, then for P about

41:28 equal to N the efficiency tends to be very low, because t_a is much, much smaller

41:36 than t_c. So then one can try the opposite, then: to partition

41:44 things by columns. And the good news now is that if you do that,

41:50 then, by basically column partitioning applied to both A and, effectively,

42:01 the vector X, every processor has the matching elements of X it needs for the columns

42:10 of A that it does have. So there is no communication needed whatsoever to carry

42:17 out the matrix-vector multiplication. But there is still some bad news, if

42:28 anyone spots that kind of bad news. Well, the problem is,

42:36 uh, when you do this, um, okay, then, if you look at

42:42 it, every processor computes a partial contribution to every element of the result vector

42:58 Y; no processor computes an entire element of the result. So the corresponding inner products,

43:09 or partial inner products, now need to be accumulated across processing elements.

43:16 So now you basically get these partial results, and then you need to do kind of a

43:20 reduction across processors. So then you have kind of more temp storage, in this case for

43:28 the result vector. So this, again, you can figure out.
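The column-partitioned variant looks like this in the same toy emulation; note the extra full-length temporary per node and the final reduction (an MPI_Reduce or Allreduce in real code):

```python
import numpy as np

N, P, Q = 4, 8, 8
A = np.random.rand(P, Q)
x = np.random.rand(Q)

A_cols = np.split(A, N, axis=1)            # one column panel per node
x_blocks = np.split(x, N)                  # matching block of x: no broadcast needed

# every node computes a full-length partial y: the extra temp storage
partials = [A_c @ x_b for A_c, x_b in zip(A_cols, x_blocks)]
y = np.sum(partials, axis=0)               # the cross-node reduction

assert np.allclose(y, A @ x)
```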

43:34 So now the opposite is true: this way of doing business is good again for the

43:41 opposite situation with respect to the relationship between P and Q. So this is

43:49 pretty much the same thing as I said before: we need to pay attention to how

43:53 we structure things with respect to the shapes. Um, so now, how would

44:03 you actually do this? Um, and what I show next is

44:09 a way of implementing it step by step, a very simple algorithm, and then inter-

44:15 leaving the broadcast with the matrix-vector product to avoid excessive temp storage. So this

44:28 is basically what one can do if you think of it as doing a cyclic rotation.

44:31 So this is just, you know, the so-called lazy Susan

44:36 idea, for whenever you go to, or had the opportunity in the past to go to,

44:40 Chinese restaurants, where they put all the food in the center and you spin it around,

44:44 and then you eventually get all the things, and you don't need to store

44:50 values; as you keep consuming them, you basically replace what you

44:54 used to have, so it doesn't require an increase in local storage to do the broadcasting this

45:01 way. So this is done here, kind of what to do in this type

45:08 of step-by-step algorithm. So the rotation is what I show. So this is

45:15 just a quick sketch, and it rotates and so forth, um, and then eventually you're

45:20 done, and you didn't need any particular temporary storage.
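A sketch of this rotating ("lazy Susan") version, in the same toy style: each node keeps its row panel of A and only ever holds one block of x at a time, so no node needs storage for the whole vector.

```python
import numpy as np

N, P, Q = 4, 8, 8
A = np.random.rand(P, Q)
x = np.random.rand(Q)

A_panels = np.split(A, N, axis=0)
x_blocks = np.split(x, N)                  # node n starts with block n
w = Q // N                                 # width of one x block
y_panels = [np.zeros(P // N) for _ in range(N)]

held = list(range(N))                      # which x block each node currently holds
for step in range(N):
    for n in range(N):
        b = held[n]                        # multiply against the block currently held
        y_panels[n] += A_panels[n][:, b * w:(b + 1) * w] @ x_blocks[b]
    held = held[1:] + held[:1]             # rotate blocks one node around the ring

assert np.allclose(np.concatenate(y_panels), A @ x)
```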

45:30 So now I will, um, start to structure things differently, for the in-memory algorithms. Any

45:37 questions on that before I talk about this idea: why not think of your collection of

45:45 nodes as being a two-D array? That kind of mimics the matrix more

45:50 than it mimics a vector. Mhm. Any final question before I go on? Okay.

46:00 So, so now here is kind of one way of organizing it, and in

46:07 this case I did this kind of column-major-order assignment of blocks of

46:13 the matrix to the processing nodes. So the numbers basically give you the processing

46:23 element ID, or processing node. I did still the same thing, that the one-D

46:28 arrays are evenly spread, for X and Y. So now, if one looks

46:33 at the situation, here is kind of what it looks like

46:40 in terms of the relationship between A and X as to what, um, can

46:49 be done in terms of matrix-vector products. So, yes, one can do,

46:54 um, some block matrix-vector multiplication, but, like, processor zero doesn't have

47:01 all the elements of X it needs for its respective columns, and the same thing is true for

47:06 all the other processing elements. So, basically, what's needed is,

47:14 yes, what's sometimes called a segmented all-to-all broadcast. To my knowledge, it's not

47:21 in standard MPI, but some implementations of

47:27 MPI, in fact, allow this segmented all-to-all broadcast. So in

47:36 this case, the broadcast doesn't need to involve all the processing elements, only the

47:44 processing elements that represent the same block column of the matrix A. So it's kind of an

47:54 all-to-all broadcast within processing element columns. And that can certainly make things more efficient

48:05 in terms of data communication than, um, well, a broadcast across

48:11 all nodes. So after this function, you do the matrix-vector product,

48:18 and then what happens when one forms the result is this: so now one

48:27 has the case that, for the result vector Y, again, as one has computed it,

48:35 the contribution to each inner product is distributed across processing rows in this

48:45 case. So in this case, an accumulation of Y is necessary within processing rows. And

48:53 in this case, you can do, again, what's known as a segmented all-to-all reduction within

49:00 rows to get the elements such that Y is evenly distributed across all the nodes.

49:15 So, does anyone spot any issues once this all-to-all reduction is actually done? Well, the

49:30 issue is that, yes, things are fully computed, but the components, or

49:39 the blocks, of the vector Y happen to be in the wrong location, except for

49:45 a few. So block one of the vector Y is now in processing element four,

49:52 but it should have been in element one, and what's in processing element one is a

49:57 block that should be in processor four, in order to satisfy the

50:05 starting point, so to speak, the even distribution that the compiler may have given

50:13 to the vector Y. So there is still some communication required to get things

50:21 in the right location.
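Logically, the two-D scheme up to this point looks like the following toy emulation on an R-by-C grid; the segmented broadcast and row-wise reduction are modeled directly as indexing and summation, and the final shuffle that moves the Y blocks to their home nodes is the part left out:

```python
import numpy as np

R, C, P, Q = 2, 4, 8, 8                    # R-by-C grid of "nodes"
A = np.random.rand(P, Q)
x = np.random.rand(Q)

br, bc = P // R, Q // C
A_blk = [[A[r*br:(r+1)*br, c*bc:(c+1)*bc] for c in range(C)] for r in range(R)]
x_blk = [x[c*bc:(c+1)*bc] for c in range(C)]   # block c broadcast down grid column c only

# local block matvec on every node, then the reduction within each grid row
y_panels = [sum(A_blk[r][c] @ x_blk[c] for c in range(C)) for r in range(R)]

assert np.allclose(np.concatenate(y_panels), A @ x)
```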

50:33 So, um, this is, I guess, what the end result is in terms of how the data needs to be allocated. And that's kind of

50:37 a permutation, which I may have on this slide. And so this is,

50:45 in fact, the shuffle permutation that I talked about in terms of the communication network lectures,

50:51 and I also at that point mentioned that this is, in fact, what happens when

50:58 you reorganize things from row to column major order. So, so it's a standard

51:05 permutation, and, as far as I know, in fact, in terms of parallel

51:11 computing, it's a well-known standard permutation that very few implementations of communication libraries,

51:21 in fact, support. Uh, okay, so this is just summarizing what was done for

51:32 the column-major order. And then I also have slides that I'll,

51:37 boom, just flip through, because I want to talk a little bit about matrix-matrix

51:43 multiplication today. So you can look at the slides; it's just the corresponding

51:49 row-major case. Things, um, are alike, the difference being in terms of where you

51:56 need, uh, the communication steps. But you get something that is, in the

52:04 end, much the same. And then I kind of summarize things a little bit here

52:08 in terms of the performance. So now, one aspect, obviously, that is important:

52:13 if you do this two-dimensional partitioning, then, if you have a collection of nodes,

52:19 how many do you assign to rows and how many to columns? You need to factor them in terms of

52:25 how you organize your nodes, and it turns out that this can play out in different ways.

52:30 Again, assuming this simple model, um, it is kind of natural

52:37 to look at how things behave if the matrix dimensions are such that they're

52:43 evenly divided by the number of processing elements you assign to rows and columns, for

52:50 then the arithmetic work is fully load balanced regardless of how you do it.

52:55 So the only difference is in the communication part, so we only need to worry

53:00 about the communication. And it turns out that, to minimize the communication,

53:08 you should choose the shape of the processing array to match the shape of the matrix.

53:20 So, basically, you get that what's assigned to each processing element is square

53:28 blocks, which minimizes the amount of data communicated. So this is, um, I

53:38 think, what I'm just flipping through quickly here. So, um, this

53:46 just shows the difference. But this is what I wanted to get to

53:51 as a general rule of thumb: when it comes to matrix operations, the,

53:59 um, block shapes, if one does this two-D partitioning, are

54:07 typically, for the matrix-vector case, square blocks, and this in fact carries over to matrix-matrix

54:14 multiplication as well, although it's a little bit more complicated. And I think I cut

54:18 this a little bit too short here, but I wanted to point out that

54:23 if you look at the communication terms, yeah, it shows that for the two-

54:31 dimensional partitioning, and this was just picking a square matrix for simplicity, you

54:39 see that the two-dimensional partitioning has, um, a lower leading term.

54:50 Excuse me. Okay, I'm sorry, I'm not explaining this perfectly,

55:01 given the time. But you can see the leading term has the square root

55:05 of the number of processing elements in the denominator, while in the one-D partitioning it doesn't,

55:11 so the two-D partitioning tends to minimize the communication time. So that's the preferable thing to

55:17 do here. And what many people tend to do, if they have one-dimensional

55:23 matrix-vector code, is they just choose one-dimensional partitioning, I guess for simplicity.

55:28 But it's really not the best thing to do. Mhm. So that was that, and here

55:38 there is just a slide with the general case of, um, the

55:46 row and the column major order, which are just the reverse of each other, and there are

55:53 slides to look at. But we'll pause for questions on that before I try to

55:58 cover a little bit about matrix-matrix multiplication. So the take-home message is

56:09 that the matrix shape is important. Uh-huh. And you may think about this

56:16 as one that's kind of, perhaps, just an interesting theoretical aspect, who cares? But it

56:22 turns out that, as some of you may know, uh, in computational chemistry, in many

56:30 cases they have exceedingly rectangular matrices to deal with. They may have

56:37 things that are, you know, 10, 12 or something small in

56:43 one of either rows or columns, and have thousands of elements in the other dimension.

56:49 So they're highly rectangular. And one of the things that I don't cover in this class is

56:57 that, in reality, for large-scale problems, your matrices may not fit in

57:07 the memory of a single node, nor in all the memory of all the cluster

57:11 nodes. So, um, the matrices may be stored on disk, and the way they

57:16 are processed in the cluster is, kind of, similar to what is done

57:25 in the matrix-vector case, but basically only a subsection of the whole

57:31 matrix fits in the memory of all the cluster nodes. So in that case,

57:38 what's being dealt with in terms of the algorithms executed by the cluster is in effect

57:45 dealing with highly rectangular objects. And in that case, again, shaping your processing array

57:55 is important to get really good performance. All right. Now, on to the matrix-

58:04 matrix multiplication, and then, again, it is back to the systolic concept first and

58:09 then generalizing to two-D, or in-memory, algorithms. So here's what we

58:19 had: sort of, one of the matrix-vector designs, where the X, um, was kind of

58:26 stationary, if you like, and the design was accumulating in space. So can anyone

58:32 use this to design a matrix-matrix multiplication algorithm? Which is nothing but, instead of

58:39 X being a single vector, X would represent, you know, many

58:44 columns. So, with many columns, what you can do is basically stack

58:49 one-D arrays on top of each other to then create something where your matrix B

58:56 is in this case fixed in space, and you stream your matrix A

59:03 by. And now, since in this design the inner products were accumulated across the

59:16 array and output one by one as they are, mhm, fully computed, then you get

59:24 basically that each row in this design computes one column of the result matrix C,

59:37 corresponding to that particular column of the matrix B. So you get

59:44 the columns of the matrix C out from the successive rows of this two-D array of

59:51 computing elements. And now you also see this synchronization, or the skewing or delaying of

60:01 entries and outputs, uh, again, this skewing of elements. And

60:09 that is important, um, to remember, in particular when we talk about in-memory

60:19 computation of matrix-matrix multiplication in a more typical cluster setting than

60:25 this type of perhaps more embedded-like stream processing. I will go fairly quickly,

60:35 so hopefully the concept is clear and you can look at the details

60:42 in the slides if it's relevant to what you'd like to do, or remember at some

60:48 point that maybe you will need to know how to do it. So again, just

60:55 a reminder: this is what the Google TPU guys did and

61:00 used in the design of their processing unit. Now, understanding when this type of

61:09 design is efficient follows the same sort of analysis. But this time, it turns out,

61:14 we have three different characteristic numbers, right? We have the rows of the matrix

61:21 A, the columns of A, A being a P-by-Q matrix, and then we

61:27 have the matrix B, which is Q by R. So there are three different characteristic numbers to think

61:33 about. And this design is good if P happens to be large. So it

61:37 basically inherits the property we had from the systolic matrix-vector multiplication case.

61:46 So, now one can also flip things around, and then there is in principle not much

61:52 difference between A and B conceptually. One can then do a similar type of

61:58 computing array by making A stationary and kind of feeding B in and getting C

62:05 out of this processing array. Now there is a little

62:10 difference in this case when it comes out: this one is good if the number

62:16 of columns of the matrix B is the dominating dimension, and otherwise it's not such a

62:22 great design. So now we have two options, depending on whether

62:29 P or R is the dominating dimension. But we need another one if Q happens

62:36 to be the dominating dimension. And for that, we go back and, basically,

62:42 construct something based on accumulating the inner products for elements of the C matrix, when one

62:48 goes to matrix-matrix multiplication, in place, so to speak. So here is kind

62:57 of what one does in terms of accumulating an element, and one can again

63:04 construct this one based on the systolic matrix-vector multiplication. It's, you know, just

63:10 a little bit more messy to figure out how things are done. So now one

63:16 can do this, with the same moral as in the previous cases.

63:26 So how about in-memory algorithms? And I'm sticking now to

63:30 two-D, and hopefully I'll get to say something about three-D as well.

63:34 Um, uh, well, I only have about 10 minutes left.

63:40 So now we can take these algorithms, and I'll go through, um, the

63:47 first one, because I just try to highlight a little bit how the thinking

63:56 from the systolic algorithm carries over to how things need to be structured in terms

64:03 of in-memory algorithms in order to not end up with excessive temp storage. So,

64:13 um, so here is what we have. Now, the idea is again the

64:18 same allocation model. So again, say, the compiler is given a global memory across

64:24 all the nodes, and then, assuming it treats all the arrays the same way,

64:30 in this case I assume that it decides on basically a block allocation for each

64:37 one of the, ah, operands A, B and C. So if all arrays are,

64:45 say, square, that means square blocks. Otherwise, there would be, um,

64:51 non-square processing arrays. But this is kept simple, for illustration. So

64:58 now the problem is, if, as you know, you do matrix multiplication, um, by

65:06 the standard definition, right, it's a row of A times a column of

65:10 B. And what that means is that the inner index, as it's called, the second

65:17 index of A and the first index of B, needs to be the same; otherwise the

65:23 math doesn't work. So the only places where the inner index of the blocks of

65:29 A and B is the same are on the diagonal. So, yes, one

65:34 can do some computation doing it this way, but most of the nodes can't

65:41 do anything. Basically, you have the whole array of nodes and only the

65:46 diagonal ones do computation. So this is not very good for

65:51 business. So now, if one goes back and again looks at these

65:56 pictures, the way things worked in the systolic case was, in fact,

66:00 that if I look at how things were organized, I can input matrix A in

66:08 kind of the normal way, rows horizontally, columns vertically, but that wasn't the case

66:14 for B; B was actually effectively transposed. So, so then, of course,

66:23 uh, this is kind of what the slide shows. And so now, in this

66:28 case, I chose to transpose A instead. So now, if one looks at the inner

66:34 indices of A and B on the grid, yes, all the, um, nodes

66:40 can do a block product of what each has of A and B, and that

66:47 does make sense. One can look at the first column of processing elements, for example:

66:54 going down the first column, the second entry is

67:03 a01, and that matches the b10 in terms of the inner index; if you go further

67:08 down, there is a02 and b20. So what it means is that

67:14 you can do block matrix products. But it also means that all those products in

67:23 that column are contributions to C00. So that means the corresponding block of the product matrix

67:32 C is partially computed within a column of the processing array. So no processor has

67:44 the full, uh, product of a block row of A times a block column

67:53 of B. So those need to be assembled by doing a reduction within columns

68:01 of the, uh, processing array. So, yes, all the processors can do

68:08 block matrix products, but then it takes a reduction within columns to get the final

68:16 results. One can play around with this a little bit. Well, I think I'll show

68:23 a couple of other cases first, which we'll see in the next several slides,

68:30 and really just flip through. But it just shows the steps, then, um,

68:36 kind of corresponding to the systolic algorithm. So what I just talked

68:40 about was one computation step in this scheme, and then one does kind of, ah, a

68:46 rotation or shift of the matrices. So A ends up being, in this case,

68:53 rotated, because B was fixed in this case and A was the one that

69:00 was passed through the systolic array, and in this case it means it gets moved around,

69:05 and the same thing will need to happen with C. So, so, um,

69:12 this is now what happens: since B is fixed, one does these steps of

69:18 shifting of, uh, A and C, cyclic shifts, and the reduction within columns,

69:27 and eventually you've got the whole operation done.
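Here is a compact NumPy rendering of the logic of this B-stationary scheme; the per-step rotation of A (and C) is folded into the index arithmetic i = (j + s) mod n, so this shows what is computed where, not the actual message passing:

```python
import numpy as np

n, b = 3, 2                                # n-by-n grid of b-by-b blocks
A = np.random.rand(n*b, n*b)
B = np.random.rand(n*b, n*b)
Ab = [[A[i*b:(i+1)*b, k*b:(k+1)*b] for k in range(n)] for i in range(n)]
Bb = [[B[k*b:(k+1)*b, j*b:(j+1)*b] for j in range(n)] for k in range(n)]
C = np.zeros((n*b, n*b))

for s in range(n):                         # one rotation of A (and C) per step
    for j in range(n):
        i = (j + s) % n                    # C block produced by grid column j this step
        # local block products on nodes (k, j), then the reduction down the column
        C[i*b:(i+1)*b, j*b:(j+1)*b] = sum(Ab[i][k] @ Bb[k][j] for k in range(n))

assert np.allclose(C, A @ B)
```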

69:36 So now one can do the corresponding thing; okay, this was the B-stationary case, then one

69:42 can do, uh, the A-stationary one, if I don't mess it up.

69:50 Okay. So what I did here instead was to play around:

69:55 one in fact can, by organizing the movement differently, instead of just doing the

70:04 reduction, get things arranged so that, in fact, um, you can avoid

70:13 the reduction of C. So you get different contributions to C computed in

70:24 different locations, and then, in the final steps of the algorithm, you will just

70:29 pass elements of C around instead of doing this reduction operation. So at this point,

70:35 you get something that kind of mimics the systolic algorithm by not having this

70:43 potentially somewhat inefficient, uh, all-to-all reduction within columns being done.

70:50 You replace, yes, the reduction by cyclic rotation, where, um, elements of C

70:58 move. So this is kind of what this algorithm does. So then there are the

71:06 various steps; anyone who is interested can walk through this simulation of the algorithm.

71:13 I won't have time to do that. So one can then also, I guess, do

71:27 this exercise that I did in terms of when these algorithms are good, in

71:31 figuring out what the shape should be. And B was the one that was stationary,

71:35 and A and C moved. And it turns out that one wants square blocks

71:42 of B, and that's something, um, to think about across the different versions of

71:48 the algorithms in terms of what was stationary in the systolic algorithms, for A, B and

71:55 C. This was the B-stationary case. Okay, so, so now you can do

72:09 There are now hopefully too hard to slides that shows you the step of

72:15 systolic algorithm and converting it to mean algorithm that pretty much just the same

72:22 Aziz compute communicate as the streaming systolic , which is. And if you

72:30 that, you get something like and then you can look at the

72:35 off, how to configure their A other should be square blocks off

72:40 And that's the sort of mimics and these things are. And I will

72:47 to cover this last one, and will stop and take questions and maybe

72:52 some supplements next time for the three . But now you can take this

73:00 . Was he stationary? And then move a B and the set.

73:05 slides that did was a little bit . So this is actually down mawr

73:13 a, um, not fully while a block algorithm with different shapes off

73:24 matrix. So and it's a little more elaborate. The point for you

73:30 remember about this one. It ended the sea stationary version on this

73:38 Multiplication is a very well known algorithm as cannons algorithms, Anyone that looks

73:47 parallel algorithms for linear algebra. They likely come across this notion of cannons

73:55 that waas an algorithm designed by PhD , uh, Montana State University quite

74:06 time ago. And that's basically based C being stationary and being moved A

74:13 C. And the thing that I made it so famous, if nothing

74:21 and actually many where people have used in implementing making smart implication is the

74:29 between a B and C That is killing. I talked about in the

74:36 algorithms. So that's why I tried point out that this include ization is

74:43 , is important for conserving memory, that's kind of to me one of

74:48 fames or reason for fame off this algorithm is avoiding broadcast and tempt storage
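For completeness, here is a short NumPy sketch of Cannon's algorithm with the initial alignment (the skewing) made explicit: block row i of A shifts left by i, block column j of B shifts up by j, and then each step is a local multiply-add followed by cyclic shifts.

```python
import numpy as np

def cannon_matmul(A, B, n):
    """Cannon's algorithm on an n-by-n grid of blocks; C stays in place."""
    b = A.shape[0] // n
    blk = lambda M, i, j: M[i*b:(i+1)*b, j*b:(j+1)*b]
    Ab = [[blk(A, i, (i + j) % n) for j in range(n)] for i in range(n)]  # skew A left by i
    Bb = [[blk(B, (i + j) % n, j) for j in range(n)] for i in range(n)]  # skew B up by j
    Cb = [[np.zeros((b, b)) for _ in range(n)] for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                Cb[i][j] = Cb[i][j] + Ab[i][j] @ Bb[i][j]   # local multiply-add
        Ab = [row[1:] + row[:1] for row in Ab]              # cyclic shift A blocks left
        Bb = [[Bb[(i + 1) % n][j] for j in range(n)] for i in range(n)]  # shift B blocks up
    return np.block(Cb)

n, b = 3, 2
A, B = np.random.rand(n*b, n*b), np.random.rand(n*b, n*b)
assert np.allclose(cannon_matmul(A, B, n), A @ B)
```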

74:59 So let me, um, show you a little bit more.

75:08 In this case, I did things to show you that when things are not square,

75:15 it becomes a little bit more complicated how you do this synchronization between matrices.

75:21 My time is kind of up, so I will flip through the slides quickly to

75:26 show you a little bit of how things can be done. In this case,

75:31 the alignment, as I call this skewing, is done with

75:37 different A's and B's in this case. So, it turns out that the

75:45 skewing of A is kind of with respect to the number of processing rows,

75:57 whereas the skewing of B is with respect to the number of processing columns. Here

76:04 it's basically a four-by-eight processing array, so that's why this skewing of A here

76:11 works the way it does across the rows. And it will be different again, and I have

76:18 slides that you can look at, when the dimensions are not multiples of

76:25 each other; it gets even more complicated. But you can work this through, and

76:29 then one can step through, step by step, and one can see that in

76:34 every step only a part of the matrix B is used, and one

76:41 continues on through the blocks of A and B in the matrix multiplication

76:50 steps. So I'm going to skip through this, and the end result is that, for this

76:57 one, there should now be square blocks of C. So it ends up being that,

77:03 like in the matrix-vector multiplication case, the optimum with respect to, um, performance and minimizing

77:11 communication is that the stationary matrix should have square blocks, and that is kind of

77:21 the rule of thumb. And there's also some data on how it works

77:24 out in particular cases being worked out. So, um, now, I

77:33 think time is really up. So this is what I said, and I'll just show

77:41 you that the slides are here in case you're interested. Um, I

77:54 will just leave them in there, because they're not entirely trivial.

77:59 But, so, basically, what one does in the three-D case, and if somebody's interested

78:05 I will cover it next time: what you do is to kind of

78:11 think of your processing array as, um, a number of planes. The reason

78:19 for that is, when you do matrix-matrix multiplication, as we know,

78:23 you have three nested loops. So, in fact, matrix-matrix multiplication, in

78:29 a way, has three dimensions: one dimension for the number of rows

78:35 of A and C, another dimension for the number of columns of A

78:41 and rows of B, and the third one for the number of columns of C and

78:47 B. So the iteration space is actually three-D, and it pays off,

78:53 it turns out, to then configure your processing array as, um,

79:01 a three-D array. And again, the shape of the array should be such that,

79:06 basically, the blocks, if you choose to compute C in place,

79:13 should be cubic. Now, this three-D allocation was, in fact, done

79:24 ages ago on the Connection Machine systems that I told you about, for which they had

79:29 a library, and I worked for that company. Um, and later on this

79:35 notion was kind of rediscovered by the Berkeley group around Jim

79:43 Demmel. So they did what they called a 2.5-D algorithm for doing matrix multiplication

79:49 and some other linear algebra operations. And on the Blackboard website there are some of

79:54 their papers, um, for doing more-than-two-D matrix algorithms.
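As a rough illustration of the three-D idea (my own minimal sketch, not the 2.5-D algorithm itself): the (i, j, k) iteration space is split over planes of a grid, each plane k computes an independent block product, and C is formed by a reduction across the planes.

```python
import numpy as np

n, b = 2, 3                                # n planes, blocks of size b
A = np.random.rand(n*b, n*b)
B = np.random.rand(n*b, n*b)
C = np.zeros((n*b, n*b))

for i in range(n):
    for j in range(n):
        # one independent block product per plane k: no inter-plane data needed
        partials = [A[i*b:(i+1)*b, k*b:(k+1)*b] @ B[k*b:(k+1)*b, j*b:(j+1)*b]
                    for k in range(n)]
        C[i*b:(i+1)*b, j*b:(j+1)*b] = np.sum(partials, axis=0)  # reduce across planes

assert np.allclose(C, A @ B)
```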

80:02 I'm taking a few minutes over time, but let me stop and take questions.

80:18 So the take-home message is, yes, when you have more than

80:24 one object, you will need to pay attention to how the different objects are used relative to

80:29 each other in terms of data references. And, all right, the configuration of your collection of

80:38 processors for minimizing communication then becomes a function of the object shapes. That's one

80:46 big take-home message, and I guess the last one is that communication tends

80:54 to be minimized by minimizing the surface area for the volume of data that is

81:05 held locally without moving. So, fairly simple principles in the end, though not,

81:16 um, entirely easy to figure out how to take the simple bottom line and turn

81:23 it into algorithms. Okay. Next time, the last lecture, I will then say

81:38 something about transpose operations. And then my intent was to talk a little bit about

81:46 sparse matrix stuff, which is important, maybe more important than dense. Um, but

81:53 it's also a little bit more complex, or not so little, and not only for science

81:59 and engineering applications but also for, um, machine learning. So that will

82:06 be the topic for next time. Now I'll stop sharing the screen; I'm still

82:18 on for taking questions, if there are some. What? Thank you.

82:38 If not, so, thank you for today, and I'll try to pick up my

82:45 notes next time. Thank you. Great. Thank you.

82:59 Thank you. Thank you.
