© Distribution of this video is restricted by its owner
00:00 | All right. So today will be a little bit of, um, three topics. It's more of a survey; it will probably give you some ideas about important aspects, in terms of HPC, of certain types of computations that I just put under the label of dense and sparse linear algebra. So let's get on with it. |

00:37 | So these are the three pieces I want to get to. The reason for the first one is that it's something I know some of you are working on in other contexts, not necessarily decompositions, but dealing with matrices. And the point about this one is that, unlike matrix multiplication - for which the work, so to speak, is evenly distributed across the index space; that means the two axes, if you like, rows and columns, for matrix-vector type operations, or, as I mentioned for matrix multiplication, really three index coordinates to think about, and that's what I mean by the index space - when it comes to LU, the computation is not uniform. For that reason there is a question: how do you do data allocation across clusters? So that's the point I will try to bring across for that one. |

01:51 | Then I was asked if I could say a little bit about matrix transposition, so I want to say something about it. And the last topic is something that also behaves very differently compared to dense matrix operations, and there are many uses of it that are important not only in science and engineering but also, as it turns out, in machine learning, so it is something of interest to many students, it seems. So that's the plan: I'll try to make some points that hopefully you can use, and maybe as a guide for whatever you want to do, including outside this particular course. |
02:31 | All right: Gaussian elimination. Hopefully you remember it, but in case you don't: I think it was probably in high school that we learned it as the method for solving systems of equations, by subtracting a multiple of one of the equations - in this case pictured as a matrix - to eliminate variables from the other set of equations, picking the proper multiplication factor so that when you do the subtraction the variable disappears from the other equation, making the equations smaller. So, in this case, using the first equation as what's known as the pivot row, multiplying it with a suitable factor and then subtracting the result from the second equation, the coefficient in front of the variable x1 disappears; you do similarly for the rest, and then you keep doing it step by step. In that case, after this first step, you can focus on the little three by three subsystem towards the lower right hand corner; this is the kind of illustration of the subsequent steps. So this is hopefully what you remember about Gaussian elimination and how it works. |
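To make the recap concrete, here is a minimal sketch - my own illustration, not code from the course materials - of the unblocked elimination in C. The multipliers overwrite the entries they eliminate, so L and U end up stored in place; pivoting is omitted for brevity.

```c
#include <stddef.h>

/* Unblocked, in-place LU (Gaussian elimination) without pivoting.
   A is n x n, row major.  Afterwards the strict lower triangle
   holds the multipliers (L) and the upper triangle holds U. */
void lu_unblocked(double *A, size_t n)
{
    for (size_t k = 0; k + 1 < n; k++) {            /* pivot row k */
        for (size_t i = k + 1; i < n; i++) {
            double m = A[i*n + k] / A[k*n + k];     /* multiplier */
            A[i*n + k] = m;                         /* store L(i,k) */
            for (size_t j = k + 1; j < n; j++)      /* row update */
                A[i*n + j] -= m * A[k*n + j];
        }
    }
}
```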
04:05 | So this is also known as a factorization, and sometimes it carries the name of LU decomposition. The matrix of equation coefficients would be what's to the left of the equals signs, apart from the x variables themselves - that is the coefficient matrix - and after the steps of Gaussian elimination, as we can see in step three, the matrix is basically upper triangular, and that's the U in the LU decomposition. It turns out the L is made of the multiplication factors you used to force it into an upper triangular matrix. So that's that, and for sure I will use LU and Gaussian elimination interchangeably, without thinking too much of the distinction. There are two issues that I'll try to point out and talk a little bit about relative to this slide. |

05:16 | So here is a slide that illustrates Gaussian elimination part way through. There is this square matrix, and after a number of steps of the Gaussian elimination one has the L's in the white area to the left of the diagonal, as well as the similar white area towards the top of the square for the finished part of U. Then, at each step, one basically has to figure out what the multipliers should be. In some sense you can just use the equation corresponding to the purplish-reddish row in the middle of the drawing as the pivot row, and then find, for each one of the equations below, what the multiplier should be in order to eliminate entries in the dark blue area, in the interior, in successive steps. After that step you have a smaller matrix to work on - that would be the green matrix, or submatrix. |

06:57 | Now, the problem with that - and it tends to lead to poor performance - is that finding the multiplication factors, computing those multipliers, and then using each multiplication factor to scale the pivot row and subtract it from one of the other rows, is something that produces what are basically BLAS-1 style vector operations. That in itself is not necessarily bad, but it generally makes poor use of the memory hierarchy. So the fact that these are rank-1 or scalar operations, and that the green part of the matrix is updated that way, is something one wants to avoid. |
08:00 | So the scheme for avoiding that is to cast things in a way where the LU decomposition in fact mostly uses matrix-matrix multiplication as a sub-function. And the way to do that is to recognize that you don't need, for each row of the reddish area, to do what ends up being an outer-product update of the green area. You don't have to update the green area until you need it to find new pivot rows and the new set of multipliers. So one can delay the update of the green area until it is needed. This is the basic strategy for getting to something that can use the memory hierarchy much better. In this case, basically, within the blue and reddish areas one may still treat things one row or column at a time, but you don't do anything to the green area until you have finished figuring out what all the multipliers are for each one of the rows in the reddish area. And then it turns out that when you update the green area, it is the light blue rectangle times the reddish rectangle on top - and now it is basically matrix-matrix multiplication that is used to update the green area. So this is the trick that is being used in all the library routines, or by anyone who wants to write an efficient LU factorization or Gaussian elimination code. |
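As a sketch of the delayed-update idea - again my own illustration under the conventions above, not the library code - the loop below factors a panel of nb columns with the cheap row-at-a-time kernel and then applies the postponed work to the trailing (green) submatrix as one matrix-matrix style update, which in a real code would be a call to an optimized GEMM.

```c
#include <stddef.h>

/* Right-looking blocked LU without pivoting (illustrative sketch).
   A is n x n row major; nb is the panel width. */
void lu_blocked(double *A, size_t n, size_t nb)
{
    for (size_t k = 0; k < n; k += nb) {
        size_t b = (k + nb < n) ? nb : n - k;      /* panel width */

        /* 1. Panel factorization (the blue/reddish part):
              unblocked elimination restricted to columns k..k+b-1. */
        for (size_t p = k; p < k + b; p++)
            for (size_t i = p + 1; i < n; i++) {
                double m = A[i*n + p] / A[p*n + p];
                A[i*n + p] = m;
                for (size_t j = p + 1; j < k + b; j++)
                    A[i*n + j] -= m * A[p*n + j];
            }

        /* 2. Row panel: U12 = L11^{-1} * A12 (forward substitution). */
        for (size_t p = k; p < k + b; p++)
            for (size_t i = p + 1; i < k + b; i++)
                for (size_t j = k + b; j < n; j++)
                    A[i*n + j] -= A[i*n + p] * A[p*n + j];

        /* 3. Delayed update of the trailing (green) submatrix:
              A22 -= L21 * U12 -- one matrix-matrix multiply,
              in practice a single dgemm call. */
        for (size_t i = k + b; i < n; i++)
            for (size_t p = k; p < k + b; p++) {
                double l = A[i*n + p];
                for (size_t j = k + b; j < n; j++)
                    A[i*n + j] -= l * A[p*n + j];
            }
    }
}
```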
10:23 | So I think I have a couple of illustrations of that, and then I'll stop for a second and see if you have questions. This is kind of an old slide, and the point is essentially to see what happens if one does what was proposed on the previous slide. The blue line is what you get using this kind of trick, and it compares it to the performance of a well designed matrix multiplication that uses the memory hierarchy well. And you can see that for most of the machines - that is what's on the horizontal axis - the LU pretty much behaves, or performs, as well as the matrix-matrix multiplication. So that's the take-home message of this slide. |

11:12 | Then one can play around with figuring out how much you should delay the update of the green areas: should you do four, or, in this case, 32, or whatever number of rows and columns in those light blue and reddish panels before you update the green one? And this slide shows, for one of the machines we had some data from, that going from rank-1 updates - one row or column at a time - to doing 32 rows and columns at a time, we got a speedup of a factor of 3 to 4 on this particular computer. |

12:07 | And this is another one, from the HPC Challenge website, color coded with the matrix-multiply and the Gaussian elimination columns side by side. Again you can see that the efficiency of the Gaussian elimination is comparable to, or not too much less than, the matrix multiplication. So that's what I wanted to say about one way to cast Gaussian elimination to use the memory hierarchy well, namely to delay updates of the green area. For those of you that are interested, there are many different versions of this strategy; if you look in the literature you will find what are called left-looking and right-looking algorithms, depending upon the extent to which the updates of the green area are delayed. |
13:16 | So, okay. Now I'm switching to why it's kind of hard to do things on a distributed system, and how to get efficient use of the memory there. The problem gives a reason to talk about something we didn't talk about before. When it comes to problems that use regular arrays - it doesn't need to be just two-dimensional arrays, but essentially things that have a very nice, simple structure - here is an illustration of the standard allocation schemes that people use, for instance with MPI, in dealing with arrays that may be allocated across the nodes. |

14:16 | The first of the cases here deals with things being blocked, thinking of the collection of nodes as just a one-dimensional array, regardless of how they're physically interconnected. The upper left hand version shows the block layout, with each blue rectangle owning what's assigned to one processor. I think you talked about this in terms of matrix-vector, and maybe matrix-matrix, multiplication in a previous lecture. Now, if you think about the cartoonish example of Gaussian elimination in the upper right hand corner, the problem when it comes to Gaussian elimination is that after a number of steps processor zero - the blue one in the layout - will have nothing to do. So basically you lose processing power, or you cannot make full use of it, because the computations are not uniform across the index space in this case. |

15:45 | So then another thing one can do, to make sure processor zero gets to be part of the business for a longer time, is what is known as the cyclic layout. Basically, instead of taking a block of columns and assigning them, you go - in this picture - round robin across the columns with respect to the number of processors you have. After you have used up all four processors on four columns, you go back and the next column is assigned to processor zero. In this case processor zero has things to do until close to the very end, when there's not enough left to do. |

16:24 | Then there's the problem with that - I think I tried to say something about it already. The problem with this second, cyclic, layout is this: yes, it keeps processor zero busy for the majority of the time, unlike the first one, but you cannot use the trick of delaying operations and updating multiple columns at the same time. And that's quite significant. So I clipped in something I showed a long time ago, in some early lecture, on the difference between scalar operations, doing things one row or column at a time, and doing things as matrix multiplication - it's a huge difference in performance, and that's not what one wants. |

17:27 | So the next step, to enable block operations, is to do the block cyclic layout. Now one gets the benefit of having processor zero being part of the business for a long time while still using the block-based operations. Then one can try to do things a little bit better load balanced as well and use a 2-D layout - not the one shown here, which is just a blocked 2-D layout. If you remember, when we talked about matrix-vector and matrix-matrix multiplication, we showed that the 2-D layout resulted in less communication, which is the weakest part of clusters. So then it turns out that the winner is the combination: the 2-D block cyclic layout. This is something that you will see a lot in many of the dense linear algebra type parallel implementations. |
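A minimal sketch of the index arithmetic behind these layouts (my own notation, not from the slides): with block size nb and P processors in a dimension, the owner and local position of a global index follow from two small formulas, and the 2-D block cyclic layout simply applies them independently to rows and columns.

```c
/* Block cyclic mapping in one dimension: global index g, block
   size nb, P processors.  nb = ceil(n/P) degenerates to the pure
   block layout; nb = 1 gives the pure cyclic layout. */
int owner(int g, int nb, int P) { return (g / nb) % P; }

int local_index(int g, int nb, int P)
{
    int blk = g / nb;                  /* global block number      */
    return (blk / P) * nb + g % nb;    /* local block, plus offset */
}

/* 2-D block cyclic: element (i,j) lives on process
   (owner(i, nb, Pr), owner(j, nb, Pc)) of a Pr x Pc process grid. */
```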
18:43 | type. Um, parallel implementations. , I think this the next few |
|
|
18:53 | just kind of illustrates a little Um, this thing that works out |
|
|
18:59 | terms of the value decomposition and again to cover a few topics. Today's |
|
|
19:06 | going pretty fast, and I will , though if somebody has questions about |
|
|
19:10 | and one or two. Most Well, actually several most lines. |
|
|
19:16 | I'll still stay with this aloof. here is just another just showing the |
|
|
19:20 | slips and steps, and you can of it. This little squares in |
|
|
19:26 | set off slides as being the template showing, um, where, |
|
|
19:38 | processes are assigned to the matrix So each of the square represents an |
|
|
19:44 | across all the processes off data. this is a little bit again summary |
|
|
19:53 | this block Second, they are Now. One thing that is kind |
|
|
19:58 | missed often. And this using this is what I'm going to talk about |
|
|
20:08 | the next change slides and then they'll and take questions if there are |
|
|
20:17 | So this now. So here is of another illustration in terms of this |
|
|
20:25 | where I think of it as being 16 by 16 matrix assigned not block |
|
|
20:32 | by simply in a block order to four by four processing erate. So |
|
|
20:44 | block cyclic allocation. So in this , than the color coded things for |
|
|
20:49 | right in the yellow means I'm thinking the processing as being using to buy |
|
|
20:54 | blocks s 02 sets of two columns rows to being able to do some |
|
|
21:02 | Now, in order not to lose load balance, what you can in fact do is the following: instead of doing the next set of rows and columns in order, you move on to the next set of rows and columns in the next processor row and column. And this is perfectly legal depending, in particular, on whether you have what's known as a diagonally dominant matrix, where you don't need to worry about pivoting and this order is still numerically stable. There are also variants you can build if you need to do partial pivoting, as it's known, to preserve accuracy in Gaussian elimination. So then you just keep doing this - you just keep going, wrapping around, until you come back to where you started the cycle. |

22:06 | In the end you get something that is equally load balanced. And you don't pay a price for other parts of the computation - because usually you don't do one thing like an LU factorization in isolation; it's part of some bigger code, so to speak, that does many different operations, and for other parts of the operations maybe the block cyclic allocation is not what you want; maybe the standard block allocation would be preferred. So this just shows that you have an alternative way to get load balance for situations when you deal with triangular matrices. It does have certain consequences, though, because the elimination order was chosen differently from just going through things in row and column order: it causes a permutation with respect to how the data is allocated among the nodes, so one needs to be conscientious of the permutations that happen. |

23:17 | And I won't go into it much, but in effect, doing this block cyclic elimination order on block-allocated matrices is the way we built this library on the Connection Machine at the company I worked for before coming here. And at first we were not paying attention, so we were a little confused about what the results looked like, because we didn't pay attention to the fact that things basically changed from row to column major order in this process. One can easily track where things are, and if one wants to rectify the permutation one can work out what the permutation is - that is covered in slides I eliminated, but you can look at them; I don't want to get too much into it because I wanted to cover a few more topics. So this is just one of the things to think about: alternative ways of doing business to take care of the load balance across nodes. |

24:40 | So the bottom line of the slide is that one can use a block cyclic elimination order on data allocated in just the standard block order; or one can use the block cyclic allocation, and then do a sequential elimination order with respect to the processors. |
25:02 | And this is an illustration of how it actually worked on the machine that we built. What I tried to point out is the scale: if one looks at the little text in the left hand corner, it says, you know, megaflops per node - this was many years ago, so it wasn't gigaflops at the time. But you can see, going from one node to a thousand nodes on this machine, the execution rate per node did not drop that much, so the scalability was pretty much close to perfect across the range from 1 to a thousand nodes. And, with a little bit of personal pride, I have to admit that this particular system was the number one system on the very first TOP500 list that was ever published - so it was the most powerful machine at the time, as per the TOP500 list. |

26:16 | I think that's what I had about the LU decomposition: how to make efficient use of the memory hierarchy for LU by using delayed updates, and then how to get load balance by either doing the so-called block cyclic allocation, or instead choosing the elimination order, if you are free to choose it - otherwise doing the processing of rows and columns in a block cyclic order with respect to the processing nodes, and then doing a corresponding assignment of the pivot rows and columns. So, any questions on that before I quickly do matrix transposition, just to give you some things to think about if you work in these areas? Okay. |
27:18 | So: matrix transposition. In fact, a lot of work has gone into matrix transposes. I won't go into depth about it either, but I'll just give you an idea of how things can be done. This is clearly the very naive version of what the matrix transpose, in fact, is, right? It's just the elements that need to be exchanged, and yes, it works, but it usually doesn't perform very well. As the slide points out, you get a lot of cache misses, and you may even make very poor use of cache lines. |

28:16 | Suppose things are in row major order. Then, yes, when you read a cache line from memory you get a few elements in the first row. But look at the first short arrow between the two elements there: when one goes to the second row of the matrix, that may be very far away in memory, and so getting the second element typically causes a DRAM page fault. And in principle, in this case, you are only using - at least for the second row - one element in that cache line, because when you move to the right, if it's row major ordering, the first cache line you read may have that element; but when you get to the third row in this example, you're not using anything from the cache line that contained the first element of the second row. So it's not likely to perform well on most architectures. |
29:36 | So the first type of fix is to try to do things in block order. In this case you can load this little square block, and then make use of the data elements that you get in a single cache line - as long as things fit in the cache - and that's illustrated a little bit on this slide. Then you do that with respect to the whole cache hierarchy: beyond, say, a level one cache, you also do it at the register level, using the level one cache to read things into the register file so that you can basically transpose and write back to the level one cache without touching each level one cache line more than once. And the illustration is a little bit deceiving, because it's drawn for the upper left hand corner, which is a diagonal block. But if you move down the diagonal, then the two white blocks here - the one to the right of the corner diagonal block and the one below it - are going to be swapped with each other, so they both need to fit in cache. So in reality one has to work with pairs of these blocks and make sure the pair fits in the cache, and do the same thing, possibly, with respect to several levels of cache, and also with respect to DRAM pages: it may be sensitive to exactly how you choose the white boxes, or the pairs of white boxes, in order to minimize either the DRAM page faults or the cache misses. |
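A sketch of that first fix in C (an illustration under the usual row-major assumption, not the slide's code): transposing tile by tile so that a source tile and its destination tile can both sit in cache at once.

```c
#include <stddef.h>

#define BS 32  /* tile size; tune so two BS x BS tiles fit in L1 */

/* Cache-blocked out-of-place transpose, B = A^T, both n x n row
   major.  Every cache line fetched on the strided side is fully
   consumed while its tile is resident, instead of one element. */
void transpose_blocked(const double *A, double *B, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t i = ii; i < ii + BS && i < n; i++)
                for (size_t j = jj; j < jj + BS && j < n; j++)
                    B[j*n + i] = A[i*n + j];
}
```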
31:50 | Iran page false or cash, Mrs . So and then the other kind |
|
|
31:58 | things that it's commonly used also, is kind of easy conceptually, is |
|
|
32:05 | recursive type of make X transposition. I said, You know, back |
|
|
32:13 | I talked about major response apply early . It was the idea that these |
|
|
32:20 | algorithm what kind of perfectly match or the perfect job for cash hierarchy by |
|
|
32:29 | the recursive partitioning into smaller, smaller That then corresponds thio the different levels |
|
|
32:36 | cash. Unfortunately, when I came metrics multiplying practice for, um kind |
|
|
32:44 | more detailed engineering, it's just of things were implemented. It didn't kind |
|
|
32:51 | pan out. So for respect, matrix multiplies the block. Detective algorithms |
|
|
32:57 | always tended to beat or has so always beat the recursive algorithms. And |
|
|
33:04 | why most algorithms or library implementations for efficient metrics multiply. Do not use |
|
|
33:12 | recursive style off matrix multiplying. But there was about my six transpose or |
|
|
33:20 | student of mine. All right. ago, you played around with this |
|
|
33:26 | hey implemented them both the naive version recursive version. And in this |
|
|
33:34 | three different blocked versions with different block for different magic sizes or on the |
|
|
33:42 | axis is, um, the log the matrix sizes and on the vertical |
|
|
33:53 | is the number off cycles per So that means lo is good on |
|
|
34:02 | is not good, because that means cycles per element. So as we |
|
|
34:07 | see, the naive version didn't do at all. And unfortunately, the |
|
|
34:14 | one. As people discover whether the month supply Yeah, it, |
|
|
34:21 | wasn't all bad. Eso I did than a naive one. But |
|
|
34:28 | the block versions on these two processors we looked at did better. |
|
|
34:39 | um, let's see, I think had a couple of other slides about |
|
|
34:43 | metrics. Transpose So, yes. I think any questions, though, |
|
|
34:54 | this, that's this single processor matrix algorithm. Yeah, I have one |
|
|
35:05 | . Actually, we don't talk about one next. And then I talked |
|
|
35:10 | So this one goes back quite a few years. This fellow, Eklundh, was doing image processing, and image processing people use a lot of FFTs. At the time, computers were basically not all that powerful and didn't have many levels of memory hierarchy: mostly registers, maybe some on-board memory, and main memory - but the main memories were not large, so you had to deal with disks. So it was a fairly simple memory hierarchy. And, like today with the programming languages C and Fortran, data was stored in either row or column major order. So he was concerned with how to minimize the accesses to external storage, which, as we know, is expensive from a performance point of view, and also expensive from an energy point of view. He was trying to come up with a scheme to minimize the round trips to external storage, given whatever storage he had that was reasonably fast and energy efficient. This ended up being known as Eklundh's algorithm, and it has, as one would say, stood the test of time; that's why I included it as a good idea to consider. |

37:06 | What the algorithm does, as illustrated for this little four by four matrix, is to work on pairs of rows at a time. What's required for this basic version of the algorithm is reading a pair of rows that can be held in memory, doing some swap operations on that pair of rows, and then writing it back out to the external storage - and then going through all the rows of the matrix pairwise. In this case there will be just two steps, or two iterations, if you like. In step one - the first column of the figure - the first read does the operations on the first two rows, with the swapping done within those first two rows; then comes the second pair of rows in this first step, eventually stepping through all the rows of the matrix, and the figure shows what is done when the second pair of rows is in the fast memory. So then, clearly, the job is not done. Then comes what is called the second step, which for this four by four matrix ends up being the final thing that needs to be done. In the second step you still work on pairs of rows, but now, instead of pairing rows one and two, the pairing is one and three, and two and four. So what's in the right column is the pairing of one and three, and then the pairing of two and four, showing what the swaps are. And if you look at it now, in fact, the transpose is done. |

39:25 | In the paper - it's a short paper, only about a page, in which he works everything out - he did it in the context of the FFT, where they needed to access external storage, so it's natural to assume that everything has a size that is a power of two, which is natural for the Cooley-Tukey FFT. He generalized it as well: instead of doing just one pair of rows, if you have a little bit more memory you can do blocks of two-to-the-power-of-j rows, and it still works out, again because the matrix sizes themselves were powers of two in terms of the number of rows and columns. He then worked out the number of passes over the external memory that is necessary to accomplish the transpose. So this is a very classic algorithm that is still popular. |
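Here is an in-memory rendering of the pairwise row-exchange scheme for n a power of two (my own sketch; the original operates on rows living on external storage, reading and writing a pair per operation). Stage t pairs rows at distance 2^t, and after log2(n) stages the matrix is transposed.

```c
#include <stddef.h>

/* Eklundh-style transpose of an n x n matrix, n a power of two.
   At stage d = 1, 2, 4, ... rows i and i+d are paired (bit of i
   clear), and A[i][j] is swapped with A[i+d][j-d] for every column
   j with that bit set.  Each stage exchanges one index bit between
   row and column coordinates, so all stages together move (i,j)
   to (j,i).  Out of core, each stage touches two rows at a time. */
void transpose_eklundh(double *A, size_t n)
{
    for (size_t d = 1; d < n; d <<= 1)
        for (size_t i = 0; i < n; i++) {
            if (i & d) continue;              /* lower row of the pair */
            for (size_t j = 0; j < n; j++) {
                if (!(j & d)) continue;       /* column in upper half  */
                double t = A[i*n + j];
                A[i*n + j] = A[(i+d)*n + (j-d)];
                A[(i+d)*n + (j-d)] = t;
            }
        }
}
```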
40:38 | And I think, in particular, given how matrices are stored in memory - row or column major order - it is still not a bad algorithm to think about. In particular if you also worry about DRAM page faults: blocks of rows, depending on how big the matrix is, may be sitting in DRAM pages, so you may want to work on things that map to DRAM pages and jump between them in ways that constrain the DRAM page faults. Now, I was trying to find - and I'm sure some people have tried to work it out with respect to cache sizes and cache hierarchies, as well as the DRAM pages - but in the time I had to put together something for this lecture I didn't find good references to include. So this is a pointer, at least for those of you who are dealing with codes where you may need to move data around, to go and look at these things, because, as illustrated with this slide, you can have quite significant performance differences depending on how you do them. That's what I had for, more or less, sequential, or single-node, matrix transpose, and then I have a couple of slides for distributed transpose. Any questions on the single-node transposition part? |
42:30 | Okay. So these are the two slides for distributed versions. This one is for how you do it when the collection of processors is configured as a two-dimensional array. In this case the processors are configured as an array with four rows and six columns, and it was done on a square matrix - in this case a 12 by 12 matrix. It is applicable to much larger matrices if you think of the 12 by 12 as blocks, as opposed to single elements. Now, if the processor array had been square, things would have been even simpler than on this particular slide; I chose one that is not square to illustrate a little bit of the messiness, or complexity, of what happens when things have different shapes. The paper quoted at the bottom as a reference also has illustrations of how the algorithm works out when there is a common divisor other than one between the numbers of processors assigned to rows and columns - in this case there is a greatest common divisor of two between the two processor counts - but otherwise it gets even a little more involved. |
44:51 | So, what's happening in this case: what's shaded is the part of the matrix elements, or blocks, that is assigned to processor zero. And when you transpose the matrix, what happens to them - where are they supposed to go? If you look at the numbers to the left of the squares and at the top of the squares, they reflect the block cyclic allocation of the matrix elements. In this case, if you go in the row direction, because of the block cyclic allocation the first element, or block, in a row gets assigned to processor zero, the next one to the next processor, etcetera; and when you have gone through all the processors in the processor row, you come back to the first processor for its next block. That's why you see this kind of cyclic ordering in the numbering: it's 1, 2, 3, and then 16, so there is basically a step in the index by the number of processors in a row. Going down a column it is likewise cyclic with respect to the number of processors in the column, so the index is incremented by four every time you get to the next block on a given processor. |

46:38 | After the transposition, you are still supposed to have the block cyclic allocation of - now - the transposed matrix, and I try to illustrate where things went by means of color coding. There seems to be one red block missing; I think it's block number 00, which stays where it is - so that's why one is missing. Basically I color coded the elements of the first column of the matrix to illustrate where those blocks went. Block 00 doesn't go anywhere; it stays on the diagonal. But if you go down the first column of the matrix, block four in column zero is not going to be in the first column after the transposition; with the block cyclic allocation that element is not going where you might naively expect. That's why you see one of the colored blocks moved. Then if you look at the next one, block number eight: it basically gets allocated to processor eight modulo the processor count in that dimension, and that's where it ends up. So this is the way that things get scattered around, and the communication pattern is not all that trivial if you have this block cyclic allocation. |
48:36 | So what to do is illustrated here, kind of as the code - the Fortran code that we wrote at the time - but there's a lot more in the paper for anyone interested. And I also put in one that works on the binary cube networks, because, even though high dimensional cubes are not used all that much anymore as interconnection networks, it also works very nicely on related networks like the butterfly networks. |

49:18 | One point that is not exploited in the other algorithm I just talked about is that, for these kinds of high dimensional networks - and, well, today's networks are kind of more than one-dimensional too - you have multiple paths between sources and destinations, so you can try to figure out how to use the most precious resource, which is the communication bandwidth, and not just use a single one of the links for a transfer. You can divide up the blocks that are supposed to be communicated between a pair of processors and send them along different paths. So you can try to end up using the full bandwidth of the interconnection network by splitting the data between each pair into as many chunks as matches the number of parallel paths you have between the two endpoints. You then have to be a little bit careful in how you schedule them in order to avoid contention, but for anyone that wants to look at that, there are also a couple of papers about it. And that, I think, is what I had about the matrix transpose; let me stop for a second. |
50:54 | Before I do a little bit on sparse matrices - it won't be that much, but I want to make some points that I think may be useful to some of you - any questions on the matrix transpose? It's all quick highlights, and there are references; I also uploaded on the blackboard some of the references that are at the bottom of the various slides, so you can find them and dig further, and you will learn a lot more. Okay. |
51:38 | So now I want to say a little bit about sparse matrices, and I have about 20 minutes left. So I won't be able to cover a whole lot more than the sparse matrix representation aspects, which many may not be familiar with but which are important. Those of you who do unstructured science and engineering codes - finite element, finite volume methods - may be familiar with them, and some of those of you who have looked at machine learning papers may be familiar too; but I know from some of my own students, when they started to read about it, that they were not familiar with sparse matrix representations. So hopefully it will be familiar to some of you, and if there's any time left I'll talk a little bit about computing with those representations. |

52:44 | All right, here is a little bit about some pointers to where you can find information and even library software - the Berkeley group is one good place to look - and a lot of work on linear algebra, including sparse matrix stuff. Sparse matrices basically come from some type of topological representation, and I talked about that when I talked about partitioning: you just model things with a graph, and that's where we get the matrix that reflects it. Here's what you can do when you have a graph, which we did in terms of the incidence matrix before, and then you can do things like the graph Laplacian matrix that we talked about before. And then here I had some examples from computer graphics, which I'll skip - if you're interested, you can look at the slides and see what is done there. |
53:57 | half. So now getting to the structures a little bit. So first |
|
|
54:01 | is to do things sort of using regular grids, and they'll go very |
|
|
54:06 | and I'll talk more about more arbitrary unstructured first major cities. There is |
|
|
54:14 | thing and the necessary, as you finance, you know, find a |
|
|
54:19 | is thio approximate the derivatives and there equations. Make them disc critized in |
|
|
54:27 | over time. And then if you to find it regular, find a |
|
|
54:33 | in you get some of this And we used it in terms of |
|
|
54:37 | in a number of examples. In , of course, if you do |
|
|
54:42 | matrix representation, um, by Linda . So in this case, if |
|
|
54:49 | is your X Y, when you your matrix representation and when your eyes |
|
|
54:55 | , um, into one, the you get the vector for all the |
|
|
55:01 | values, variable values and all the points eso you take this to the |
|
|
55:09 | space into the one the space than get matrix representations off this flavor not |
|
|
55:22 | get sort off diagonals off elements or non zero on you Eventually. |
|
|
55:30 | in these two day case, we sort of a block. Try the |
|
|
55:34 | structure of the matrix. Um, the size of there are blocks |
|
|
55:44 | then, on the size of the that you're using. So now there's |
|
|
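As a small sketch of where that structure comes from (my own example, with hypothetical stencil values): mapping grid point (i,j) of an nx-by-ny grid to row k = i*ny + j puts the east/west neighbors next to the diagonal and the north/south neighbors ny positions away - exactly the few nonzero diagonals, and the block tridiagonal form with ny-by-ny blocks.

```c
/* Nonzeros of row k = i*ny + j of a 5-point Laplacian on an
   nx x ny grid: columns k-ny, k-1, k, k+1, k+ny where the
   neighbor exists.  Interior rows have 5 nonzeros. */
void laplacian_row(int i, int j, int nx, int ny,
                   int cols[5], double vals[5], int *nnz)
{
    int k = i * ny + j, c = 0;
    if (i > 0)      { cols[c] = k - ny; vals[c] = -1.0; c++; } /* north */
    if (j > 0)      { cols[c] = k - 1;  vals[c] = -1.0; c++; } /* west  */
                      cols[c] = k;      vals[c] =  4.0; c++;   /* self  */
    if (j < ny - 1) { cols[c] = k + 1;  vals[c] = -1.0; c++; } /* east  */
    if (i < nx - 1) { cols[c] = k + ny; vals[c] = -1.0; c++; } /* south */
    *nnz = c;
}
```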
55:53 | Now a few quick examples, again, that I will flick through - this is for those of you not familiar with it - of what you get for a bunch of grids. In this case you can see what the matrix may actually look like for the triangulation that is on top, and I'll come back to this with more illustrations. Here's another kind of typical sparse matrix. And for those of you who are interested: in the references I have towards the end of the slide deck there are sparse matrix collections that people use - they were used when we talked about partitioning as well; people test their partitioners on them. So there are a couple of libraries from which you can download sparse matrices, and here is one that looks structured all the way through. The point is that anything that is color coded is something that is not zero, and the white spaces are zero elements. |
57:00 | So now I wanted to cover the storage formats, at least, in this lecture. Here is what's known as the compressed row storage scheme. As you can imagine from the examples that show the sparsity, the number of non-zeros in each row may be quite small; it reflects the connectivity. If you have a graph problem, you may have quite large graphs, with millions or even billions of nodes, but usually the connectivity of the graph for most nodes is quite limited: the number of edges at a node may not scale at all with the number of nodes in the graph, so the degree of the nodes may be quite limited. You may have maybe tens of edges in the graph for each node, except for some nodes with a bit more. And, I guess, for those of you who model things on the globe: you will know that close to the poles the grid points get crowded together, because all the different tessellation lines have a common point at the poles - that's why, when you generate grids for the globe, you try to treat the pole areas differently from the rest of the sphere, for instance. Anyway, this is just to say that typically the number of non-zeros in a row is very limited, regardless of the size of the matrix, and that's why one doesn't want to store the zeros. But then, for the matrix arithmetic to work, you really need to know which elements they are if you squeeze out the zeros. |

59:15 | So here's the compressed row storage scheme, illustrated at the bottom for this matrix. What it does is store only the non-zero elements. The first array here - the elements are just chosen so you can easily see where they come from in the matrix. Then, since the number of non-zero elements in the different rows is not necessarily the same, and this is a one-dimensional array of non-zero elements, the thing to do is to figure out where each new row starts. So there is a row pointer array that tells where each row starts in this linear array structure; that's what you see at the top. And then, for each one of the non-zero elements, you store the corresponding column index. So now you actually have a way of figuring out exactly the coordinates of each one of the non-zero elements while only storing the non-zero elements. The cost is basically one array of non-zero values - which may be floating point, or maybe integers, depending on the data types you use - and then you have the column indices, an array of integers, for which you may need a bunch of bits depending upon the size of the matrix. But usually it's at most about two times the number of non-zero elements in terms of memory footprint, and often less, because less space may be needed for each of the column indices, even though there are as many of them as there are non-zero elements. So this is one common storage scheme. And then, of course, one can do it by columns instead: in that case there are column pointers instead of row pointers, and you store the row indices. |
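A small worked example of the three CSR arrays - the matrix and its values are invented for illustration, not taken from the slide:

```c
/* The 4 x 4 sparse matrix          its CSR representation
   [ 10   0   0  -2 ]               values  = {10,-2, 3, 9, 7, 8, 4}
   [  3   9   0   0 ]               col_idx = { 0, 3, 0, 1, 1, 2, 3}
   [  0   7   8   0 ]               row_ptr = { 0, 2, 4, 6, 7}
   [  0   0   0   4 ]
   Row i occupies positions row_ptr[i] .. row_ptr[i+1]-1 of the two
   parallel arrays.  Coordinate (COO) form would instead store an
   explicit row index per nonzero in place of row_ptr. */
double values[]  = {10.0, -2.0, 3.0, 9.0, 7.0, 8.0, 4.0};
int    col_idx[] = {0, 3, 0, 1, 1, 2, 3};
int    row_ptr[] = {0, 2, 4, 6, 7};
```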
61:32 | And this is another variation on that. Some algorithms treat the diagonal a little differently than other elements, so it may be convenient to pull out the diagonal; that's what's done in this one, and there's no magic to it. Another one - again depending on how you use the sparse matrices - is that you may in fact want to store things in the so-called coordinate format. But now it's a little bit more costly, because for each non-zero element you store both the row and the column index. On the other hand, it's fairly quick to get everything you need about each one of the elements: you don't have to do a search, or look through row or column pointers; you can directly address what you want. |
62:31 | you want. Then there is was a well known format l pack, |
|
|
62:40 | that was again coming out of the and engineering community. There is elliptic |
|
|
62:48 | solver package that what it stands for on this particular story scheme is also |
|
|
62:54 | occasionally in the machine learning community. what it does is kind of still |
|
|
63:01 | kind of ah matrix like representation off forest matrix. But it, |
|
|
63:10 | compresses rose in this case, so stores only non zero elements. |
|
|
63:20 | but it done generates a rectangular and then you need the second ray |
|
|
63:27 | mimics The Matrix are non zero elements the original matrix, but instead has |
|
|
63:35 | column interest for those. So they're indexes implicit. And you have to |
|
|
63:40 | the column index for each non zero explicit. And then it turns out |
|
|
63:46 | this has been very it was Um, Thio be efficient for vector |
|
|
63:57 | . So in this case, you things by Rose and you get sort |
|
|
64:01 | a vector limpet make correspondent. So the length of a row or the |
|
|
64:06 | of the column, depending on the , things but the last for fairly |
|
|
64:12 | are very efficient memory accesses and When you have victor instructions on this |
|
|
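A sketch of the same small example in ELLPACK form (my own rendering): with at most two nonzeros per row, both arrays are 4-by-2; short rows are padded with a zero value and a repeated column index so the inner loop of the matrix-vector product has a fixed, vectorizable length.

```c
/* ELLPACK storage of the 4 x 4 example: row index implicit,
   column index explicit, rows padded to the maximum width w = 2. */
double ell_val[4][2] = {{10.0,-2.0}, {3.0,9.0}, {7.0,8.0}, {4.0,0.0}};
int    ell_col[4][2] = {{0,3}, {0,1}, {1,2}, {3,3}};

void ell_matvec(int n, const double val[][2], const int col[][2],
                const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int p = 0; p < 2; p++)      /* fixed-length inner loop */
            s += val[i][p] * x[col[i][p]];
        y[i] = s;
    }
}
```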
64:22 | On this slide is another one that I just have a pointer for. At some point I mentioned the parallel prefix operation: Guy Blelloch, who designed a programming language that uses scan as a primitive, came up with a particularly efficient storage scheme that uses parallel prefix operations for processing sparse matrices. |
64:50 | use in parallel prefix operations and processing agencies. Now, one thing that |
|
|
65:02 | , um, very well commonly used what's known as the skyline or profile |
|
|
65:12 | , in which case you store elements . I think of a role. |
|
|
65:22 | , so if you move left from to right in a row. When |
|
|
65:26 | hit the first non zero element in row, you obviously store it. |
|
|
65:34 | then you store all subsequent element in role. Um, even if its |
|
|
65:41 | . So that's what the skyline of things. And so this kind of |
|
|
65:47 | lines, you know, going from diag along upwards or left words trying |
|
|
65:54 | illustrate that in this case, there the number of elements of the sparseness |
|
|
66:00 | being stored. And in that it includes cereals. So it makes |
|
|
66:07 | more efficient because now you only need store where and it all starts and |
|
|
66:13 | kind of the length of the But then all of the say if |
|
|
66:18 | go through all of the column industries implicit, so they don't need to |
|
|
66:22 | explicitly storm. So that off the sufficiency, both in terms of storage |
|
|
66:30 | sometimes in terms of the processing, you can use vector operations even though |
|
|
66:36 | variable length factor operations Um Okay, , then I guess talking about the |
|
|
66:47 | Okay, then, talking about the matrix-vector product - and then I have one more point I want to try to make. Here's a little bit of the kind of stuff you want to do, and here is how you would do it with, for instance, the compressed row storage. The point of this is: on top are the three arrays of exactly the compressed row formulation, and then below is how you do the matrix-vector product - so basically a sparse matrix times a dense vector. Now, if you go through a row of the matrix, the elements of x that you need, given what's stored in the column index array, are not in successive locations; you need to figure out which elements of x you need for that row of the matrix. And that's where you end up using indirect addressing - marked in red here - or gather operations. So these gather-scatter operations that I talked about way back are very intrinsic and important in sparse matrix operations. And, unfortunately, it's not something that the memory system tends to like, so they are not very high performing, and computer architects have used all kinds of tricks to try to make gather operations as good as they can be. But this is to show that it becomes a little bit more complicated and, unfortunately, kind of slow compared to anything you do with dense matrices. |
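For reference, a minimal CSR matrix-vector product in C (a sketch using the array names from the example above, not the slide's code); the x[col_idx[p]] access is the gather that the slide marks in red.

```c
/* y = A*x, A in compressed row storage.  values[] and col_idx[]
   are read with unit stride, but x is read through col_idx -- an
   indirect (gather) access whose locations depend on the sparsity
   pattern, which is what the memory system dislikes. */
void csr_matvec(int n, const int *row_ptr, const int *col_idx,
                const double *values, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int p = row_ptr[i]; p < row_ptr[i+1]; p++)
            s += values[p] * x[col_idx[p]];     /* gather from x */
        y[i] = s;
    }
}
```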
68:55 | And if you turn it around and do the column oriented one: the gather, in this case for the vector x, may be fine, but you also need a scatter operation for storing the results back. So, intrinsic to these sparse matrices are these gather-scatter type operations, or indirect addressing. Here is an illustration of the impact; anyone interested can take a look at it - it may be relevant for something you're doing in other classes or projects. And this is where you can use the notions in the segmented scan: on some machines - for instance the Connection Machine - we had this scan primitive operation, and this was, again, something that Blelloch did for his thesis. |
70:02 | Now, the other point is something that is important - this is something borrowed from the group at Berkeley. The point is that for sparse matrices it's hard, because of the structure, to get a whole lot of instruction level or data level parallelism, because you have this indirect addressing and things are not nicely structured as blocks. So things tend to be limited by memory accesses, and making good use of memory accesses is a higher priority, I would say, than looking at how to use SIMD instructions or VLIW type instructions. It turns out that doing this - focusing on the memory accesses, trying to block the sparse matrix operations - in fact also benefits how well you can use the instruction set of the processors. So: just try to use block operations. The next several slides try to illustrate that with a few examples, and I'll try to illustrate it more by pictures than by talking through this particular slide, to make these points. |
71:34 | So here's an example of the structure that may occur in a common structural analysis problem. You can see, in this case, that it's definitely a sparse matrix, but also that it has a fairly simple structure. Now it turns out that each one of these little blue dots is not really a dot: if one zooms in, it is in fact a small dense block. So, in this case, instead of trying to do element-wise sparse storage, you basically treat the blocks as the elements, and you treat the blocks themselves as dense matrices. This is the trick. And it turns out that this is what you want to do even if those blocks are not fully dense inside a totally sparse matrix, right? So this shows a little bit of playing around with that - a factor of four here turned out to be the winner among the block shapes they tried for this particular case. |

72:59 | And some more examples. Here's another case, and here the blocks are not really dense. Looking more carefully at things, you still see a lot of structure that you can take advantage of, and it turns out, as on this slide, that it's cheaper to fill it out with zeros and treat these blocks as dense. It means you waste a little bit of storage in terms of the number of elements you need to store, but, on the other hand, you don't need to store either the row or the column index for each element, because now you know it's a block. So you save on index storage, and you save on addressing, because it's just an increment by one in the row or column directions. So there are lots of gains in how the hardware actually works by doing this blocking, even though it means storing a bunch of zero elements. |
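A sketch of the payoff in code (a hypothetical block-CSR layout with 2-by-2 blocks, my own names): one column index per block instead of per element, and a fully unrolled, unit-stride kernel inside.

```c
/* y = A*x with A in block CSR (BCSR), 2x2 dense blocks stored
   row major in block_val (4 doubles per block).  blk_col holds one
   block-column index per block -- a 4x saving in index storage --
   and the unrolled 2x2 kernel is pure unit-stride arithmetic. */
void bcsr_matvec_2x2(int nblkrows, const int *blk_ptr, const int *blk_col,
                     const double *block_val, const double *x, double *y)
{
    for (int I = 0; I < nblkrows; I++) {
        double y0 = 0.0, y1 = 0.0;
        for (int p = blk_ptr[I]; p < blk_ptr[I+1]; p++) {
            const double *b  = &block_val[4*p];
            const double *xb = &x[2*blk_col[p]];
            y0 += b[0]*xb[0] + b[1]*xb[1];   /* block row 0 */
            y1 += b[2]*xb[0] + b[3]*xb[1];   /* block row 1 */
        }
        y[2*I]   = y0;
        y[2*I+1] = y1;
    }
}
```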
74:05 | So that shows a little of that. Then I wanted to say one more thing, even though I'm pressed for time, and then I'll take questions. Here's what I wanted to point out as well: the structure of the sparse matrix is not inherently fixed. Well, to first degree it's inherently given by the graph, but the mapping between the graph and the matrix has a lot of degrees of freedom, and that impacts how the sparse matrix looks. So here are a bunch of orderings - some use the so-called minimum degree ordering, and there are several others - and there's not really time to talk about them, but I'll show you pictures of the consequences of playing around with this ordering of the nodes in the graph. This is a little bit about what's called the natural ordering: somehow people labeled the graph one way, on the top half of the slide, and it shows the number of elements that end up in the factorization - in this case, on the top, that's about 300,000 or so. If you use the other ordering, the minimum degree ordering, it kind of looks more complicated, but in fact, in that case, it dropped from about 300,000 to about 200,000 non-zero elements in the factored matrix, in this particular case. |
75:50 | The rest of them I'm skipping; I'll just show you one more. Here is one: it was some design at SLAC, the Stanford Linear Accelerator - I guess they designed some observation chamber for the accelerator - and this represents something that came out of that. Again, using a different ordering, this becomes a lot nicer to look at. There are more examples in this deck, and I'm over time, so I will stop on that and just try to point out that when you work with sparse matrices, I would say the first order of business is to figure out how you label the graphs, and to try to find a good scheme for labeling the graph with respect to being able, in later stages, to use some form of block structured sparse matrix - and then to represent that block structured sparse matrix with one of the schemes I have talked about. That's, I guess, my spiel about sparse matrices - things to think about for anyone that works with them in any form. And with that, I will stop the lecture and take questions on anything related to today's lecture or the class. It was a high-speed run-through, making pointers to things that will hopefully help in other parts of your work or studies. Okay, let's see - there are a couple of chats. |
77:41 | Let's see. Okay: for those of you who may have checked in late, the presentation schedule was finalized - the 15th, in the morning, from nine to 12 - and the lecture has been... um, but I will stop there. |