00:00 So, today we're going to get into clusters and the programming of clusters.

00:14 Before that, I'll talk about some simple concepts that, unfortunately, I have already used, but I'll say a little bit more about them, and then talk about cluster programming, and we'll see how far we get.

00:42 I think I have used the concept of speedup, and I'll try to talk a bit more about it; it's particularly relevant, and often used, in parallel computing. We have done parallel computing in terms of OpenMP, first as shared-memory, single-node programming, and then in terms of GPU-accelerated nodes. But we didn't necessarily discuss this concept of strong scaling, which you will see in pretty much anything dealing with parallel computing, in papers or talks; and then I'll talk about parallel efficiency and how that is dealt with in papers.

01:39 Clusters are known as distributed memory systems, and I will try to talk about that and relate it to what's known as massively parallel processors, which are also distributed-memory-type systems, and focus a little bit on what ties these distributed memory systems together, namely the communication involved, since there are multiple nodes that need to talk to each other. And if there's time left, we'll start to get into the programming paradigms for these types of systems, whether clusters or MPPs. So first,

02:32 some definitions, some terms that are going to be used in the first several slides: cycles per instruction or, conversely, instructions per cycle; and a bit about compilers and compiler optimization. It's a good thing to know the concept of cycles per instruction, or instructions per cycle, as opposed to the cycles per second that you get from the clock rate, which we have talked about more than once.

03:15 Then we have talked about execution rate: basically, whatever the work is that is supposed to be done, divided by the corresponding execution time. So hopefully there is nothing really all that much new here; this is just a reminder.

03:39 Then a little bit of a definition of speedup. I think it's just the traditional definition, which I guess I didn't specify too well before. Basically, one way of doing it is to take the ratio of the execution times before and after the change; hopefully the change reduced the execution time, so one gets the speedup by taking that ratio. Or, conversely, you can use the execution rates before and after; that should give you the same answer.
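The equivalence of the two views just described can be written down directly. A minimal sketch; the function names are mine, not the lecture's:

```python
def speedup_from_times(t_before, t_after):
    """Speedup as the ratio of execution times before and after a change."""
    return t_before / t_after

def speedup_from_rates(r_before, r_after):
    """Equivalently, the ratio of execution rates after and before the change."""
    return r_after / r_before

# For a fixed amount of work W, rate = W / time, so both views agree:
W = 400.0
print(speedup_from_times(100.0, 50.0))          # -> 2.0
print(speedup_from_rates(W / 100.0, W / 50.0))  # -> 2.0
```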

04:26 So this, in general, is something where one looks at the fraction of the code that is affected by a change, whether it's parallelizing it or some other kind of improvement one makes to try to speed things up. In general, one looks at the improvement on the fraction of the code that does improve in response to the change, and then there is the remaining fraction that behaves in the same way as before. So one gets, well, an equation: the time after the change is the time for the part that is not affected, plus the time for the part of the code that did get improved. If a fraction f of the time is improved by a factor s, then T_after = (1 - f) * T_before + (f / s) * T_before, so the speedup is S = T_before / T_after = 1 / ((1 - f) + f / s).

05:23 The point of this, when you then work out the actual speedup, is the expression at the bottom of the slide. It shows that even if I manage to totally reduce the time for the fraction of the code I worked on to pretty much negligible, the best speedup one can get is set by what's left: S <= 1 / (1 - f) in this case. This is known as Amdahl's law, and that's something you should have heard of. The easy way to remember it, when somebody talks about Amdahl's law, is that the speedup is limited by the piece of the code that was not improved. It's intuitive, but people sometimes forget it.

06:18 Historically, Amdahl was an engineer at IBM who came up with this a long, long time ago. At the time, when IBM and others at the forefront were improving codes through architectural means, it was used as a way to argue that instruction-level parallelism is very limited: there was little reason to engage in it, because the gains to be had were so marginal that the complexity they imposed was hardly justified. The reason why things these days do not follow this particular model is that pretty much all the good speedups, even in multi-core processors and in particular in clusters, are due to the concept of data parallelism, which is not covered by this simple analysis. The next thing is an

07:57 example that some of you may have seen if you have taken a computer architecture class. It's a very simplistic example, but it illustrates the point of Amdahl's law in a fun way, and it comes from the Hennessy and Patterson book on computer architecture, which is kind of the dominating textbook for computer architecture classes. For those of you who don't know them, they are Californians: Hennessy at Stanford and Patterson at Berkeley. So this little example is in the context of what's a typical thing for Californians to think about: going to Las Vegas.

08:43 In this case there is, uh, a little bit of a contrived example of getting from L.A. to Las Vegas, with a certain set of options for the different parts of the trip. The first section you pretty much have to walk, according to these two fellows, and that takes 20 hours; then, for the rest of the trip from L.A. to Las Vegas, there are 200 miles, and you have a bunch of options for how you travel those 200 miles. On the next few slides I'm going to go

09:26 through this little exercise. So here's the first option. The first part was to walk for 20 hours, and then you can also walk for the rest of the trip; at, you know, four miles per hour, that takes you 50 hours to cover the 200 miles, and we'll call the speedup for this part one, to compare it with the other options; we normalize it to one. So the total trip time is 70 hours.

09:57 Then, if you can use a bicycle, it goes a little bit faster: it takes basically 20 hours to do the 200 miles, plus the initial 20 hours, so now it's 40 in total, and that means, yes, it takes about half the time, a speedup of about 1.8. Then you can take a more modest car, and that gets a little bit more speedup; with a fast car, instead of the 50 hours the walking takes, the leg takes less than two hours. And then you can do the rocket car; if you live in California, you have the places where they try rocket cars every now and then. Yes, in that case the leg takes only a third of an hour instead of 50 hours.

10:53 But in the end, the total speedup was not more than about a factor of 3.4. So that just illustrates that the parts that don't improve, in the end, limit the speedup. Regardless of how hard you try on the part that you can affect, in this case the majority of the distance, or perhaps the number of lines of code, it doesn't add value to go to the extreme. So that was just a simple example.
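The whole trip example can be redone in a few lines. The speeds below are illustrative guesses; the lecture only fixes the walking pace, the bicycle time, and the rocket car's third of an hour for the 200-mile leg:

```python
# A fixed 20-hour walking segment, then 200 miles by different modes.
FIXED_HOURS = 20.0
MILES = 200.0
options = {"walk": 4.0, "bicycle": 10.0, "car": 120.0, "rocket car": 600.0}  # mph

baseline = FIXED_HOURS + MILES / options["walk"]  # 70 hours in total
for mode, mph in options.items():
    total = FIXED_HOURS + MILES / mph
    print(f"{mode:10s} {total:6.2f} h  overall speedup {baseline / total:4.2f}")
```

The fixed 20-hour segment is what caps the overall speedup at about 3.4, no matter how fast the second leg gets.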

11:40 Now, focusing a little bit more on what we'll cover for the rest of this class, and not just today: of course, essentially all of the improvement is due to parallelizing code. So I use a subscript p on the speedup for the section of the code that is parallelized. The law looks the same; it's just that the improvement on a section of code is now the symbol s with subscript p.

12:14 So, in this case, even if 95% of the code is parallelized and its time is reduced to almost nothing, the best speedup that we can hope for is 1 / (1 - 0.95) = 20.
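The 20x bound just quoted can be checked numerically. A minimal sketch, with f the parallelized fraction and s the speedup of that fraction:

```python
import math

def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the time is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Even with an infinite speedup of the parallel 95%, the serial 5% caps us at 20x:
print(round(amdahl_speedup(0.95, math.inf), 6))  # -> 20.0
```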

12:39 Okay, so now, if you think of the speedup, here is a question we can hang up here too, and you can form an opinion: can one get a speedup, or improvement in the running time, that is more than the number of threads engaged? Suppose 10 threads: can you get more than a 10 times speedup? (A student: ignoring caches and the like?) Yes, exactly, that's a good point.

13:22 So what can happen is that, as you increase the number of threads, that most likely means, for it to make sense, that you use a bunch of cores. And at some point, at least when it comes to the level-three caches, the per-core problem may be small enough that it fits in level-three cache. That means that most of the memory references will be to the caches, depending on what your code is, and then you will see the runtime being proportional to accessing cache rather than accessing main memory.

14:13 So this is what we said. So why? This is one reason; can anyone think of some other reasons? (A student: intuitively, we can't imagine anything other than cache behavior resulting in a speedup greater than the number of cores we have.) That's correct. But, well, what I have

14:50 up on this slide is something that, hopefully, you will become aware of as I talk more about parallel algorithms, and it is a factor when it comes to clusters. This may not be something you are used to thinking about at this point in the course, but I want you to be aware of it, because it's important when it comes to clusters: computation and communication may scale differently, and that can also cause the speedup to be higher than the number of threads, or cores, you use. And then there's a third reason, one that shouldn't really exist. But when

15:44 you read papers, it's something you have to be aware of. When we look at speedup, if we use this notion of looking at speedup based on rates, then you compute sort of the maximum rate that could be achieved using the nominal, or standard, clock frequency. But then, when the code is actually run and you do the timing, it may turn out that the processors have been running in turbo mode, so you're not comparing against the proper maximum execution rate.

16:42 So, when you read papers, you will occasionally find that people have, you know, a speedup that is higher than proportional to the number of threads, or they get efficiencies that are higher than 100%, which is obviously not possible. But many people have no problem writing it down and getting it published, even though

17:08 it is obviously not correct, so be careful. A question from the audience: are you aware of any utilities, similar to numactl, that allow you to tell the processor to stay at the base frequency? There are; I don't remember exactly what they are, but you can lock the processor frequency and prevent it from changing. When we talked about power management, there were these P-states, that is, the clock rates being used, and one can basically tell the operating system not to manipulate the frequency.

17:57 I don't have the command at hand; we can look it up, but it is something you can do. So you can prevent these things from happening to you and know exactly what you got. That's what one should do if one wants to do careful benchmarking.
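On Linux, two common ways to do this are the intel_pstate sysfs knob and the cpupower utility. A hedged sketch only: exact paths and availability depend on your kernel, frequency driver, and hardware, and the 2.4 GHz value is a placeholder for your CPU's base clock; run as root.

```shell
# Disable turbo globally when the intel_pstate driver is in use:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Or clamp the frequency range to the nominal base clock with cpupower
# (2.4GHz below is a placeholder, not a recommendation):
sudo cpupower frequency-set --min 2.4GHz --max 2.4GHz

# Verify what the cores are actually allowed to do:
cpupower frequency-info
```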

18:15 I'm not saying that it's 100% guaranteed, because it may be that the, quote unquote, firmware that the CPU vendors have will sometimes override it: if the CPU is in danger of overheating, the firmware will not honor what the operating system wants the clock to be. There is sometimes this kind of tug of war between the firmware and the operating system as to who is supposed to have the final word on what's happening, and it tends to be that, you know, everyone wants to protect the equipment, so in the end the firmware will win.

19:18 Okay, so now about this concept of strong and weak scaling, and parallel efficiency. These are also things where I personally take issue with what typically happens out there, so I'm showing my biases here. The reasons for that I will get to, but first let me tell you what I just wrote down on this slide.

19:54 So I have something that I also call fair speedup, which is usually not discussed; people talk about weak and strong scaling, or speedup. The reason for me putting in this notion of fair speedup, and we'll talk about it later when I talk about parallel algorithms, is that it turns out, though not in the cases you have seen so far, since the standard matrix-vector and matrix-matrix multiplication algorithms do parallelize just fine, that for many other problems the best sequential algorithms don't necessarily parallelize well. So if one wants to do a parallel computation, you may choose a different algorithm, and certainly a different incarnation in terms of the code, than what you would use for a single core or

21:03 thread. So the first type of speedup mentioned on this slide is there essentially to make you aware that there are different ways of looking at speedup, and sometimes the differences are just marginal, and sometimes they may be quite large: in terms of the total work being performed, the parallel algorithm may in fact carry out a lot more work than the one designed for a single core or thread. So I think, to be fair, if one wants to look at the speedup from a single thread to whatever number of threads you're using, one should look at the best case for the single-thread scenario, and then compare with what one has for the P threads. That's what the "fair" means.

22:09 The other two types of speedup are what's typically called weak and strong. I'll start by commenting on the strong scaling. There, you basically look at a fixed workload or problem, and you try to find out how it scales, that is, how the execution time is reduced with the number of threads that you're using for that problem. And that's certainly a very valuable way of looking at things. But one still

23:02 has to be careful with that, because, um, that was one of my early arguments with the creators of this Top 500 list and the use of this Gaussian elimination code. When the list was started, the benchmark looked at how fast you could solve a dense linear system of equations with 1000 equations and 1000 unknowns, and I argued with them that nobody was going to buy a supercomputer to solve that problem. In general, a larger cluster is used to solve larger problems, and that is where this notion of weak speedup, or weak scaling, comes from.

24:05 For that type of speedup, you scale the problem in proportion to the size of the cluster or parallel computer that you're using. Effectively, another way of looking at it, you use the same fraction of total memory whether it's one node or 10,000 nodes: say 50%, or whatever memory you use for the problem size. And then you study how the efficiency or time is affected as you scale things up, potentially both the problem and the machine, which is what you do not do in strong scaling.
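The two regimes can be contrasted in a few lines. The weak-scaling formula below is the scaled-speedup form usually associated with Gustafson's law; that label is mine, not the lecture's, which only describes growing the problem with the machine:

```python
def strong_scaling_speedup(f, p):
    """Amdahl-style: fixed problem size, parallel fraction f, p threads."""
    return 1.0 / ((1.0 - f) + f / p)

def weak_scaling_speedup(f, p):
    """Gustafson-style scaled speedup: the problem grows with p."""
    return (1.0 - f) + f * p

f, p = 0.95, 1000
print(strong_scaling_speedup(f, p))  # saturates near 1/(1 - f) = 20
print(weak_scaling_speedup(f, p))    # keeps growing with the machine size
```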

25:02 So this is what I think in terms of practicalities: people use larger clusters for larger problems. That doesn't mean that, sometimes, for modest-sized problems, it isn't still interesting to find out the strong scaling. So, any questions or comments? When it comes to this, I want you to be aware of all these things when anyone writes papers, or when you read papers, about what is actually being discussed; be critical about what you read. And this is kind of a follow-up

25:52 in terms of efficiency, and it's an issue I have with most papers out there, not every paper, but, as it's written in red at the bottom: most people report this parallel efficiency as simply "efficiency". Now, parallel efficiency, the top formula, is basically the ratio between the speedup you got using the P threads, and the P threads or cores that you used. So, you know, if you have something where the running time decreases perfectly, 100% proportional to the number of cores or threads you're using, then you get a parallel efficiency of one.

26:47 But it doesn't mean that your efficiency really is 100%. That's my problem when people report this parallel efficiency as efficiency, because in practice codes use a very small fraction of the hardware you're actually using. So that's why I really emphasize that one should look at the real efficiency. There is nothing wrong with looking at parallel efficiency, but don't expect it to be the efficiency of the code.
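The distinction can be made concrete. A minimal sketch of parallel efficiency as just defined, which by itself says nothing about real hardware efficiency:

```python
def parallel_efficiency(speedup, p):
    """Parallel efficiency: speedup obtained on p threads, divided by p."""
    return speedup / p

# A speedup of 80 on 100 threads is 80% *parallel* efficiency; it says
# nothing about what fraction of peak hardware performance the code reaches.
print(parallel_efficiency(80, 100))  # -> 0.8
```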

27:42 So, anyone want to guess what the typical efficiency, the real efficiency as I call it on this slide, of codes is, whether it's single-node code or, even more so, in terms of clusters? What fraction of the total resources is typically being used? Anyone would guess it's in the single digits, right? Even the good codes are often in the, you know, 1 to 10% range, and bad code may be, you know, 0.1%. It is not easy to get very high utilization of the resources.

28:47 So, as I think I had on a slide borrowed from one of the early lectures: parallelism is easy; single-core efficiency is hard. All right, so, any questions on these topics? Now we

29:10 move on to a structured approach, talking about clusters and MPPs and things of that nature. Many of you may well have used clusters and know what they are, but in case you haven't, I'm first going to talk a little bit about clusters, and what the differences are between clusters and so-called MPPs, massively parallel processors, MPP for short; and again give some examples, a rough definition, and talk about the level of parallelism that is typical for both clusters and MPPs, as opposed to what we have talked about so far.

29:56 To get to the clusters: clusters are defined basically as a collection of independent processors, tied together by a network and then used as a single computing platform. And I'll talk a little bit about a particular simple performance model, to understand how these clusters can be modeled in terms of performance.

30:33 a little bit more about that in next several sites. So clusters,

30:43 , some tea too is a cluster clusters we have had Or have that

30:49 mention the data science institute to And, um, this is the

30:58 I got the new one that it just announced, Sabine. But that's

31:03 old one. And then there was new one. See something?

31:07 anyway, but there yes, they're clusters and so on. The overriding

31:19 was a driving or principal about putting clusters is, I would say,

31:26 effectiveness more than performance. So and tends to be using mass produced

31:35 So it's standard service that are basically units that they can buy and use

31:45 by one. Or if i, know, hundreds of thousands off

31:49 and then you connect them up through communication technology and then varies a little

31:59 . Performance conscious you are, but general use used any form of standard

32:08 technology out there. And Ethernet is far the dominating standard, not necessarily

32:15 terms off clusters, but in terms and networking technology. Internet is by

32:23 the dominating at technology or protocols being . Thio connecting up things, whether

32:31 local area network, or what area , but the cause of its relatively

32:40 cost in particular for the lower data . Uh, it's also used in

32:46 clusters. So I think we'll Ponta have Ethernet and Sabina may also you

32:53 if a net stamping does not, it's more of a focus on

32:59 But it's still that was a kind a mass produced, Internet interconnection network

33:08 known as Infinite Band. It's an standards, basically, even though,

33:13 , practically it has been dominated by vendor. That was anyway being as

33:23 in high performance standard networks as Intel bean in terms of CPUs, and

33:31 company is known as melon arsenal mentioned a little bit more and subsequent

33:37 Um, and I think I'm comment little bit more on Ethernet. And

33:43 in abandoned slices to come MPP I was a. The main difference

33:53 is that MPPs tend to use proprietary network technology, sometimes even proprietary processor technology, or at least, in terms of the compute nodes, proprietary packaging: not an off-the-shelf server, but something that the vendor designs and builds itself. And it has been driven by the performance aspect, so cost has been, not an ignored aspect, but not the totally dominating one; the point was performance on very large workloads, something that parallelizes to a very high degree.

34:52 But in terms of the programming model, the difference tends not to be all that visible when you use what I've talked about for this class, the message-passing library MPI; implementations of the library are available for both. So the performance characteristics are different depending on whether you use a cluster or an MPP, but in terms of how you write the code, it is not different. So, here are some pictures. Yeah, has anyone not

35:28 seen clusters? This is not all that representative of what things look like today; I'm dating myself with some of the pictures I put here. Pretty much only the bottom left corner shows what tend to be homegrown clusters that people put together in their lab; the rest is more what it looks like when professionals put together clusters. The top row of pictures here is from one of the European supercomputing centers, in Munich, Germany, which is among the largest for research. The lower left-hand corner is simply ones that have been in use, and the lower right-hand corner is one from the big data centers.

36:30 One comment related to their use, in terms of the lower right-hand corner: I think in the very first lecture I used a slide that showed aspects of large data centers that have to deal with cooling. The upper right-hand picture and the lower left show things that the big commercial data centers do not tend to use: nice, pretty cabinets. In the lower right-hand corner there are no doors on the racks, because airflow and cooling can so dominate. The giant data centers do not bother; they're not done up for looking pretty, or for people to be around, and sometimes they run hot, so they're not designed for people to be in them. And here is

37:42 a little bit of what goes on at the, quote unquote, back side, which doesn't have a door; if you have a door, this is what it looks like. I will talk a little bit today, and then, in classes to come, I will talk more about the interconnection networks being used; this shows a little bit of that. The yellow and orange cables are basically fiber-optic cables, and the grey ones are basically copper-wire cables that people use. In the upper right-hand corner are, um, the kind of wiring bundles that are totally custom-made for the clusters.

38:30 And then here is a picture of an MPP, and it doesn't look very different from anything you saw; the point is that from the outside you can't really tell much about whether something is an MPP or a cluster. This picture happens to be from IBM's Blue Gene series of computers. So, any questions on that? I'll talk more about the differences between clusters and MPPs in the slides to come. Yeah.

39:17 One other comment I can make about all these big data centers at this point; and this particular MPP, the Blue Gene, was designed to be very power efficient, and it was also the most power efficient, not just among MPPs, but the most power-efficient machine around. What is shown here is the cooling of the data center: you can see that the tiles on the floor are perforated. Every other aisle has these perforated floor tiles for cool air to come up, and every other aisle is what's known as a hot aisle, where the hot air is, um, sucked up. So there are alternating hot and cold aisles in these data centers, and many times they are using enclosures around a pair of racks in order to separate hot and cold air from each other. And the other part

40:35 I wanted to stress at this point, when it comes to clusters and MPPs, is the scale of parallelism that needs to be managed, compared to single nodes or GPUs. As you can see, this obviously is from the Top 500 list, so these are, yes, very large clusters. But even the modest-sized clusters we have, as we'll show on subsequent slides, and even companies, have to deal with this: not only the Internet companies with their giant clusters, but many other clusters in industry also have significant numbers of cores, and threads, to manage.

41:30 So this is the average number of cores on the Top 500 list, and then this is the maximum number. It's again something to notice: the largest of the clusters have more than 10 million cores to be managed. It doesn't mean that every application they run uses the entire machine, and that's often not the case, but they may still use a significant fraction of millions of cores for any given application. And there's also the minimum number of cores on this list: even the smallest, the number 500 today, has more than, or close to, 20,000 cores. And this is kind of an

42:20 eye chart, but you can look at it, I guess, on your own display. I'm just illustrating a little bit the top 10 and the bottom 10 on the Top 500 list, and which ones use GPUs and which ones are CPU-only. It's kind of interesting to see that the machine with the largest number of cores of all, as far as I know, even though it uses a Chinese processor, I believe pretty much uses the x86 instruction set, or something similar to that. So there you have 10 million plus of those cores, not simple cores like a GPU's. This was a little bit

43:26 you know, end up with the companies that you have giant clusters,

43:31 I mentioned, but also the oil gas companies very large posters, and

43:39 that case, their coats are many highly parallel. And, um,

43:47 this so this was just put in to get your some sense for uh

43:53 on the left hand diagram the short , as they call it, and

43:58 stopped funded list system share so that um, off the number of systems

44:05 the top 500 list. Close to of those clusters are in industry.

44:16 as you can see on the right slide if you look at performances

44:21 country illustrates that the clusters used forest or academic use tend to be

44:32 so they may not be as But that's by a little bit.

44:36 50% of the total capability on this 500 list is basically and,

44:45 research types clusters mhm. So try butt out that this is extremely important

44:56 in terms of scalability, and that goes for both the algorithms as well as the software: your software has to be extremely scalable in order for these clusters, or MPPs, to be effectively used, at degrees of parallelism orders of magnitude more than what you find on CPUs and GPUs, even well-equipped GPUs. Yes, you note that a GPU may have a few thousand cores, nowadays about 2,000 cores, but in order to make effective use of a cluster you need parallelism, effectively in terms of the number of threads, of another factor beyond that. And when it comes to NVIDIA, 32 of them, in fact a warp, is kind of the equivalent of one CPU thread.

46:13 So the degree of parallelism, even on GPUs, if you kind of ignore the SIMD aspect of it, is much less than what one may need to deal with in terms of clusters and MPPs. Any comments or questions on that? I'll have a couple more slides on the differences between clusters and MPPs,

46:57 times, what makes things different is . What they talked about before is

47:03 that there is a bunch of independent in some shape and form that are

47:11 out to form this clusters and MPP one of the main differences ISS how

47:20 , um, interconnection is being So I'll try to illustrate this on

47:30 slide. So the main difference is for M. P. P

47:43 the network is much more highly integrated the nodes. Then what you find

47:56 clusters so on. I mentioned that on when I talked. What?

48:02 difference is between clusters and MPP that peas, um designed for highly parallel

48:11 , clothes and performance and use proprietary by being proprietary technology attempt to be

48:19 expensive and pricey. And what clusters so clusters? What kind of dominate

48:26 dominating up there. But there are examples of them trapeze. And if

48:34 go and look at the top 500 that go back and look at the

48:38 , I want to it now. you would find that a few of

48:42 are labeled as MP piece, but far from the 95 or 98 or

48:49 more percent is clusters. So when comes to clusters, then,

49:01 one basically uses network adapters that plug into this, ah, I/O bus, the PCI Express bus, of the nodes. On this slide I also put, yes, things that are not clusters as you would buy them: there are these volunteer-computing-type networks that some of you might have known about, you know, Folding@home, etcetera. In that case it's basically the local area network, or wide area network, technology that connects the various pieces that people volunteer, used by some in research. As a contrast, I have

49:59 one example of an MPP, which is, again, the Blue Gene type of computer; in this case, the most recent one, known as Blue Gene/Q. Unfortunately, I should say, IBM stopped producing the Blue Gene series, so that's no longer something you can buy anymore. You can still find some of the Blue Gene/Qs available out there at some centers, so they're still used, but they're no longer produced.

50:38 But the point of using this for illustration is that the integration of the network is, um, much tighter: the network is much more tightly integrated into the nodes than what you would find in clusters. Here's what they did in this case: the network ties into the node at the same level as the level-three caches. So, in a sense, the network is no further away than a level-three cache.

51:18 When I talk about networks in some future lecture, I will talk a little bit more about what the green boxes at the bottom of the slide are, the ones with "collective" and "barrier". It's interesting that, for MPPs, the networks used for synchronization tend to be dedicated networks, separate from the other networks for data communication between nodes; synchronization is not treated as a second-class citizen, unlike in other types of systems.

52:02 On the other hand, as was shown before, this is the PCI Express bus: in clusters, it is the I/O bus that is used for interconnecting the nodes, using standard technologies like Ethernet or InfiniBand, or sometimes proprietary technologies, but MPPs, as a rule, do not use PCI Express buses. All right. So when it comes

52:47 to dealing with clusters, and trying to eventually get good performance, not just a working code, one has to again be aware that there is now one more layer of communication to worry about. It's not just the memory hierarchy, that is, the main memory and the various levels of cache that I didn't put on this particular slide, but also the interconnection network. At first, things may be limited by the CPU, but then they can also be limited by how the interconnection network is put together. We'll talk about that a little bit in the next few slides and, if it's an opportune point, we'll stop in a few slides, I believe, to make a natural break. Let's see.

53:56 So I said I would make some comments on, um, at least two open

54:04 standards: the Ethernet standard and the InfiniBand standard. Um, so Ethernet is by

54:15 far the dominating standard for local area networks and even wide area networks today. And

54:23 one reason is it tends to be very cost effective. Um, because, I guess,

54:37 it's in part a less demanding, um, design, because the focus is not

54:50 so much on latency; it tends to be a high-latency technology.

54:56 And so it's a little bit, um, easier to design and build and,

55:07 in that sense, less costly. And then it also has the large

55:14 market share, and that again helps amortize the cost, the

55:23 non-recurring engineering costs. So,

55:30 as the inventor of Ethernet, Bob Metcalfe, has said: if it can be done by

55:34 Ethernet, it will be done by Ethernet. And that tends to be the truth.

55:40 Um, but sometimes, when it comes to really high bandwidth, it is not

55:46 always that much less costly. It turns out that for every jump in

55:54 bandwidth, Ethernet tends to be maybe a couple of years behind before it becomes,

56:01 um, sufficiently lower cost to compete with things like InfiniBand.

56:11 And that's why, if you go and look at the Top500 and similar

56:14 lists, which rank clusters (the criteria for the Top500 list

56:19 are focused on performance), you will find, uh, that a

56:26 very large fraction of those clusters use InfiniBand, some use proprietary networks, but also a

56:33 nontrivial portion of the clusters use Ethernet interconnection technology. And then there is

56:41 a list of different versions of the InfiniBand standard, and I think this is still

56:49 the case, um, even though there is another proposed improvement beyond this High Data

56:57 Rate, or HDR, version of the InfiniBand standard that is out there.

57:06 And underneath, I also have a little bit of a listing in terms of the latency

57:17 that is inherent, the delays, in the switches, uh, the switch gear

57:27 that is being used to put together the network. I will talk much more about

57:30 that in a future lecture, as I mentioned. Um, but the latencies,

57:43 yes, in terms of CPU cycles, they can still be on the

57:48 order of 100 or more. But for some, depending again on the switch

57:59 design, it may even come down, depending on the routing protocol, to,

58:04 um, tens of cycles. Note that these are for the switch gear.

58:11 So later on, when we talk about MPI, the latencies are

58:18 usually not in the nanosecond range but more in the microsecond range.
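To put that microsecond range in perspective, here is a small back-of-the-envelope sketch in C; the 4 GHz clock rate and the 5 µs latency are figures I am assuming for illustration, not values from the lecture slides:

```c
/* Convert a communication latency given in microseconds into processor
 * clock cycles, for a clock rate given in GHz (i.e. cycles per nanosecond). */
static double latency_in_cycles(double latency_us, double clock_ghz) {
    double latency_ns = latency_us * 1000.0; /* 1 us = 1000 ns */
    return latency_ns * clock_ghz;           /* cycles = ns * cycles/ns */
}

/* Example: an assumed 5 us MPI-level latency on an assumed 4 GHz core is
 * latency_in_cycles(5.0, 4.0) = 20000 cycles, i.e. tens of thousands of
 * cycles, compared with roughly 100 cycles inside a single switch. */
```

This is only unit arithmetic, but it makes the gap between switch-level and software-level latencies concrete.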

58:23 And remember that processors operate in the few-gigahertz, 2-to-4 GHz range, typically. So

58:32 that means, uh, the cycle time is less than a nanosecond. So a microsecond

58:42 means tens of thousands of processor cycles. I have a couple

58:49 of other things mentioned on the slide at the bottom. Omni-Path, that was an

58:55 attempt... mostly, I would say. That is a proprietary technology by,

59:04 uh, it's owned by Intel. Intel has,

59:14 at various times, dabbled in networking products. Uh, but Intel has basically

59:25 been focused on being a component vendor; you know, they've been a CPU company. When

59:30 they do something new... For instance, early on, when clusters or parallel computers started to

59:38 emerge, they also had products for, I think, about four or five years,

59:45 basically as practical demonstrations of how one can build clusters using their CPUs. But

59:51 then they didn't want to compete with the likes of Dell or HP or IBM in terms

59:56 of platforms. So then they retrenched to just doing CPUs. Same thing: they have

60:03 been in the networking business, and for a while they built switches.

60:09 But then they retrenched to just doing the components used in switch gear. And

60:15 then they did Omni-Path to try to do interconnects for data centers, since

60:23 they do want to be in the data center business. But unfortunately they didn't get

60:28 traction for their Omni-Path, so it fizzled out, and in 2019, about four

60:35 years after they released their products, they decided to discontinue it. So, on Omni-Path:

60:44 about a month ago they, I believe, in part, changed their minds,

60:52 and they spun it out. Ah, they basically created a startup company, and funded

61:02 it, outside of Intel, to try to do interconnection networks based on this Omni-Path

61:09 technology that they developed. Right. Um, Cray, the computer company,

61:23 has always been a computer company focused on the high end of clusters, not low-end

61:31 clusters. And they have also had their own interconnection technology, I think,

61:38 all along as well. They don't design their own CPUs, but they do design their

61:44 own servers and how to integrate them. And they still do. By

61:54 the way, I think it was seven or eight years ago, when the interconnection network technology

62:02 that Cray had at the time... they sold that technology to Intel, which, I

62:10 believe, used it for the Omni-Path design. But Cray continued to develop

62:15 new networking technology; that is, I believe, the Slingshot technology. But again,

62:21 Cray is now owned by HPE. So, a little bit about the principles of data communication.

62:28 Um, so, okay, I guess I should stop and ask if there are any

62:35 questions before talking about this very high-level, simple model of data communication, or any

62:42 questions on these notions of clusters and MPPs, and in particular the technologies, Ethernet,

62:52 InfiniBand, or others, being used for the cluster interconnect protocols. Um,

63:01 Is this also what is used to communicate between sockets on the same node?

63:07 Well, ah, and the reason I'm saying no is that the thing I'll talk

63:21 about, like MPI, for instance, that is used for what

63:27 is called cluster programming, is layered on top of the protocols, so

63:34 it may use something native, something that uses whatever is provided in terms of

63:44 the CPUs. Because CPU-to-CPU, um, socket-to-socket communication

63:51 on a circuit board uses whatever the CPU vendor, uh, built into the platform; vendors like,

64:03 um, Dell and others make multi-socket servers. So, in terms of,

64:15 um, I don't remember exactly; Intel may have used Omni-Path between its

64:24 CPUs, and AMD has its own interconnect technology, uh, its own protocols,

64:31 for communicating between sockets on the same node. Uh, in Open MPI,

64:41 it uses whatever is native to the platform to, uh, do the

64:49 communication between the NUMA nodes. So it maps onto the more native protocols being

64:56 used. Now, in the programming model, you can use

65:03 MPI to... basically, um, each core is treated as an

65:12 independent computer, and the communication between cores is in MPI. And then it's

65:18 up to the MPI implementer to decide what mechanisms they actually use to do the

65:24 communication, given what they know about where the data is. So the

65:33 MPI model treats, uh, any core on any socket as an equal citizen,

65:41 and then the implementer decides what it looks like underneath. The difference is, um, um,

65:47 okay, right: one plugs the native mechanisms in, whereas MPI may

65:53 have more layers on top. And then it's a question of what the implementers

65:58 have decided to use as the mechanisms, and I'll come back to that as we

66:07 go on. And the other point, the thing that starts to emerge

66:23 from what I've said so far, is, um, that clusters are a collection of

66:33 independent computers. And that's what's behind this notion of distributed memory systems that

66:39 I have not commented on all that much. Now, when we talked about OpenMP and

66:45 OpenACC and accelerated nodes and heterogeneous nodes, we had to deal with two memories:

66:53 the GPU memory and the CPU memory. When it comes to clusters, um, in

67:01 addition, there is a separate memory space for each of the different nodes, and we'll

67:11 come back to that aspect. But the other thing is that the nodes are tied

67:17 into an interconnection network, whether it's a proprietary network or a standard one, and

67:25 it still is, mhm, explicit in some ways, as opposed to CPU-internal protocols,

67:34 as we just discussed. Um, it needs, basically, different protocols to

67:41 deal with communication between different nodes. So at the first level, I'll talk about

67:49 this, then, with a high-level way of capturing this notion of, or modeling,

67:57 the communication network. I was talking about, um, performance, and this

68:07 is, unfortunately, another area, I guess, where, uh, things are ambiguous in

68:16 how people use the terms latency and delay.

68:24 Now, when we talked about the individual processors, there was a bit of ambiguity, as I

68:30 mentioned, when it comes to terms; we talked about what it means in terms of a

68:35 processor or CPU: is it the core, or is it, uh,

68:40 quote-unquote, the package that you plug into a socket? So for this

68:49 class, I have tried to be consistent and use these notions of

68:59 latency and delay. So for delay, I try to use basically the all-encompassing

69:07 time for getting the message, or data, from one place to another.

69:22 And that total delay then depends on the latency and the time that it

69:33 takes to get, quote-unquote, the goods, the data, from point A to point

69:44 B. So, for the latency, then: it has a few components to

69:51 it. There are overheads in the sender and the receiver to deal with getting the

70:03 data from point A to point B. And then there is, basically, time of flight, which

70:15 is, in many cases, sort of the speed-of-light delay. Well,

70:23 not quite, but one can think of it as the speed-of-light propagation time.

70:28 Now, suppose one has, let's say, at first just a single communication link,

70:34 so it's just a point-to-point type of transfer. You have to go through the

70:40 protocol stack on the sending end to, uh, build the packets, since most

70:47 of them are packet-oriented. But more on that when I talk more about networks in

70:52 some future class, where I'll talk about the protocols for doing it. But, you

70:57 know, each protocol, again... the Ethernet has its protocol,

71:04 so if you're taking a networking class, you probably know about it. And there's a

71:08 bunch of things being put together. And you may have worked with TCP/IP,

71:12 which has a header that needs to be put together in terms of source and destination

71:17 and all kinds of other information that goes into it. And it takes time

71:22 to build that up.
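As a toy illustration of the kind of work the sender does, here is a hypothetical, heavily simplified header in C. Real Ethernet and TCP/IP headers have many more fields, and the names below are my own invention, not from any standard:

```c
#include <stdint.h>

/* A made-up, minimal packet header: protocol stacks fill in fields like
 * these (addresses, lengths, checksums) before any payload byte reaches
 * the wire, and that assembly work is part of the sender-side overhead. */
struct toy_header {
    uint64_t src;         /* source address          */
    uint64_t dst;         /* destination address     */
    uint32_t payload_len; /* number of payload bytes */
    uint32_t checksum;    /* toy integrity check     */
};

/* Build a header for a payload; the "checksum" here is just a byte sum. */
static struct toy_header make_header(uint64_t src, uint64_t dst,
                                     const uint8_t *payload, uint32_t len) {
    struct toy_header h;
    h.src = src;
    h.dst = dst;
    h.payload_len = len;
    h.checksum = 0;
    for (uint32_t i = 0; i < len; i++)
        h.checksum += payload[i];
    return h;
}
```

The point is not the specific fields but that every message, however small, pays for this assembly and the matching interpretation on the receiving end.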

71:28 And then it has to, you know, get put on the wires, so to speak, and the first bits

71:34 get to the other end, and then the receiver has to, uh,

71:40 figure out how to interpret what comes in and, uh, figure out what

71:46 to do with it. So latency, the way I, well, I

71:53 hope, will consistently use it, is just this collection of overheads that is there

72:00 no matter how small a message is sent. That's what it is. Now, it's not entirely

72:08 true; sometimes it may depend on the size of the message. But to first

72:13 order, it's something that is relatively independent, if not totally independent, of the payload. And

72:20 then there's the other part that is definitely dependent on the payload. And then the

72:27 next slide just illustrates a little bit the impact, depending upon the balance between

72:34 these two aspects, the latency and the bandwidth that you get of the link

72:41 between the endpoints. And clearly, the higher the payload, uh, the

72:50 less the impact is of the latency and the overhead kinds of things.

72:56 And that's pretty much what the graph shows, you know, the effective bandwidth:

73:01 um, for small messages, the effective bandwidth is low;

73:09 it's kind of dominated by the latency, whereas with a larger data packet,

73:19 the payload, there's less influence of the latency. That's pretty much

73:25 the take-home message from this slide.
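That graph comes from the simple model sketched below in C: total delay is a fixed latency term plus payload size over link bandwidth, and effective bandwidth is payload size over total delay. The 2 µs latency and 10 GB/s link in the comments are assumed figures for illustration:

```c
/* Total delay for sending n bytes over one link: a fixed latency term
 * plus the payload-dependent transfer time n / bandwidth. */
static double total_delay(double n_bytes, double latency_s, double bw_bytes_s) {
    return latency_s + n_bytes / bw_bytes_s;
}

/* Effective bandwidth actually observed for an n-byte message. */
static double effective_bw(double n_bytes, double latency_s, double bw_bytes_s) {
    return n_bytes / total_delay(n_bytes, latency_s, bw_bytes_s);
}

/* With an assumed 2 us latency and a 10 GB/s link: an 8-byte message sees
 * only a few MB/s of effective bandwidth (latency-dominated), while a 1 MB
 * message sees close to the full 10 GB/s (bandwidth-dominated). */
```

The crossover point, where the two terms are equal, is sometimes called the half-bandwidth message size; below it, latency dominates.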

73:34 Now, things get more complicated than this very simplistic model illustrates, because in reality, for these

73:44 large systems, you don't interconnect every node to every other node, but you have some

73:53 form of interconnection network, with switches and so on, between the two endpoints. So both latency

74:05 and bandwidth get affected by how this network is put together and what the load is.

74:11 So the network topology plays a role, in that it may cause some of the links to

74:19 be shared by many paths between endpoints in the network. It also depends on the

74:29 routing protocol being used, and it depends, obviously, on how much load you put on the network.

74:35 And that, in a sense, depends on where data gets allocated in this cluster

74:41 or MPP, and on what particular code or application is being used and how it accesses data across the network.
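One crude way to extend the simple model for such shared links, under the simplifying assumption that k message streams split a link's bandwidth evenly (real routing and congestion behavior is far more complicated than this):

```c
/* If k streams share one link fairly, each stream sees bw / k of the
 * link bandwidth; the latency term is unchanged in this crude model. */
static double shared_effective_bw(double n_bytes, double latency_s,
                                  double bw_bytes_s, int k_sharers) {
    double per_stream_bw = bw_bytes_s / (double)k_sharers;
    return n_bytes / (latency_s + n_bytes / per_stream_bw);
}
```

Even this toy version shows why topology matters: a link shared by two heavy streams effectively halves the bandwidth each one sees.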

74:49 Then you may also have message priorities that

74:58 affect how things get done. And you can also carve up the bandwidth on links to

75:06 give, um, different priority to different streams of data between different endpoints,

75:15 uh, depending on what's happening. And, um, it happens also in these

75:22 large networks that you have, mhm... If you're in the Internet or communication network business,

75:31 one also gets this notion of software-defined networks, where you can have, uh, network resources

75:41 being managed and allocated, and not just use the raw capabilities of links and switches.

75:50 Okay. And the next thing, since my time is about up:

75:57 um, for anyone interested, this is, again, more or less

76:05 an alert of "be careful as to what you read." Um, to that end,

76:13 I have a bunch of slides here that I found kind of entertaining more than anything

76:19 else, in terms of people doing benchmarking, and what happens when vendors do comparative

76:29 benchmarking between their own products and competing products. Because, you know, everybody can tweak

76:35 stuff, or use some customer setup to test their stuff against somebody else's stuff.

76:42 And the next few slides here, I will let you peruse. I won't talk

76:48 about them today or next time, but you can see them for some common

76:54 codes that are being used in scientific and engineering computation: Fluent, a computational fluid dynamics

77:01 code that is commonly used out there. There is another one, LS-DYNA, that is a

77:08 code that, um, is commonly used in the auto industry as well as by the Department

77:14 of Defense, trying to figure out, you know... the DoD, among the original users of

77:21 this code, always had tried to figure out what happens when you try to

77:24 blow something up. Uh, carmakers want to figure out what happens in car

77:30 crashes. So, and I think there was another one; I can't remember.

77:38 I thought the third one was... Yes, that is a quantum chemistry

77:45 code. Anyone interested in these can see what the companies argued about, each claiming

77:52 they were doing the best on the same set of benchmarks. What I found amusing in

77:56 this case was that the three commonly used benchmarks are well defined, and

78:03 both companies in this case used, um, similar types of environments. And with that,

78:10 here, just flip through the slides. Then there's also a little bit on the

78:14 Slingshot network, but I'll let you peruse the slides. And then next time

78:19 I will start to talk a little bit about this cluster programming, so I will

78:24 end here today. I'll take questions, and I'll start with cluster programming and MPI

78:31 next time. I'll stop sharing and take questions. So today I mostly

78:46 wanted to give you, yes, basic background on scalability in terms of parallelism that needs to

78:52 be managed, and to add the additional component of the communication network in there, and

78:59 to introduce a very simplistic model in terms of latency and bandwidth for capturing the

79:06 notion of performance of communication networks. Dr. Johnson, one clarification:

79:24 on the bonus problem for 1.1, so for both inner product and saxpy, run the one

79:32 amongst the three cases you found to give the best performance for all cases of...

79:37 um. So when it says "amongst the three cases," we're talking about, uh,

79:42 inner-loop parallelization, outer-loop parallelization, and both-loop parallelization? Um, and, yeah, go

79:49 ahead. And for the next part, where it says "for all cases,"

79:53 is it referring to the sizes? My understanding is yes. But that's for all the thread

80:03 cases and all the matrix sizes. So, in terms of,

80:11 yes, well, right: the strategies are the parallelization of the outer loop, the inner

80:17 loop, or both, whichever you found to be the best from, uh,

80:22 the normal measurements in problem 1.1, I believe. And whichever we found

80:29 the best performing, you use that version with the -O3 optimization flag, the compiler

80:39 flag, and measure the performance with all matrix sizes as well as all the thread

80:46 cases. Okay, so, sorry, can you repeat: so when it says "amongst

80:57 the three cases," I mean, we're choosing one of three? Yes:

81:01 either inner-loop parallelization, outer-loop parallelization, or both-loop parallelization.

81:09 Uh, okay, so we'll choose one of inner, outer and both, and then

81:18 run everything, okay? Also, a side question. So, it was very obvious that,

81:25 um, for the "both" case, the first, uh...

81:34 So, one-by-24 or 24-by-one thread counts, or whatever the case may be, those

81:39 last ones are essentially the same as only parallelizing the inner and only parallelizing the

81:45 outer. Um, is there overhead in specifying the one, or does

81:52 it end up being exactly the same as if we had just parallelized the outer

81:58 and just the inner? Um, in my opinion, there should not be any

82:06 overhead, because, um, the outer loop, in that case, even if you

82:13 do not specify a pragma for it, should be handled by the master

82:20 thread. And if you define a number of threads of one for that kind

82:27 of loop, it's basically just a no-op statement. Okay? Because these are pragmas

82:35 that end up getting expanded by the compiler anyway, right? So I would

82:38 imagine... Okay. So it is going to be run by the master thread

82:44 only, ah, so long as we're using one thread for that loop. Okay,

82:57 Okay. Thanks for very good questions. Any other question? Okay. If not,

83:16 then I will end today's session, and thank you. We'll be back on...

83:22 Thank you. Thank you. Thank you.
