© Distribution of this video is restricted by its owner
00:00 Yeah, so today I'll continue talking about OpenMP. I probably

00:10 won't finish today either, but that's no problem. What I think is important is

00:18 to do the exercises and gain insight into how OpenMP works. So this is a

00:25 little bit of today's agenda: finish talking about the control-flow constructs in

00:31 OpenMP, um, and then a bit of runtime. And then

00:36 spend some time going through an example of how to use OpenMP for a simple

00:41 program, and then get into yet another concept, tasks, in OpenMP, and

00:48 we'll see how far we get on tasks.

00:57 So, the if clause is where we were; we went through that yesterday, so

01:04 that's fairly simple and straightforward. Yes, that's it. Then there is the synchronization

01:11 barrier, where all threads still have to be present at the barrier before any proceeds past

01:18 the barrier. And critical was just a section of code that only one

01:27 thread at a time can execute. In this case, it's just a single statement,

01:32 but it can be used for a block of code that needs to execute

01:36 within a critical section, and we talked about that. And then

01:44 there was atomic. Also, it is equivalent to critical as long as

01:51 it's just a single memory location that is involved. So atomic is just for

01:57 an assignment-type statement, whereas critical can be applied to some body of code.
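To make the recap concrete, here is a minimal sketch — my own code and names, not a slide from the lecture — of the two constructs: atomic protects a single memory update, while critical can guard a whole block of statements. With OpenMP enabled (e.g. `-fopenmp`) the pragmas take effect; without it they are ignored and the loops simply run on one thread.

```c
long count_atomic(long n) {
    long count = 0;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        /* atomic: protects exactly one memory update, relatively cheap */
        #pragma omp atomic
        count++;
    }
    return count;
}

long count_critical(long n) {
    long count = 0, max_seen = -1;
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        /* critical: one thread at a time executes this whole block,
           so it can guard several statements together, at higher cost */
        #pragma omp critical
        {
            count++;
            if (i > max_seen) max_seen = i;
        }
    }
    return count;
}
```

Both functions return n regardless of the number of threads; the difference is only in cost and in how much code each construct can protect.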

02:04 Yeah, so here is where I was at the end of the last

02:08 lecture. In this case a conditional if clause is used: if the

02:16 condition is true, then the parallel region gets executed in parallel. Otherwise it is serialized.

02:28 That means only the master thread, the parent thread, will execute the region.

02:35 So, any idea why one may want to do that? And see if we can get

02:40 questions, or responses to questions. ... There might be more overhead in splitting up

02:47 the work, as opposed to just letting, whatever it was, one thread do it.

02:52 That is exactly it: spawning a number of threads is not entirely free;

02:58 it involves a fair amount of overhead. So one should, as a general rule,

03:02 whether it's using OpenMP or, later, MPI or some other programming

03:09 model, be careful when one uses a parallel construct, because there is overhead, and depending

03:16 on the programming paradigm, it may be significant. So don't use parallelism if one

03:22 can do it in reasonable time on a single thread.

03:28 Then there is a nowait clause that allows threads that are done with,

03:37 in this case, a for loop, to not wait at the implied synchronization at

03:45 the end of the parallel for. So here is, I guess, an example:

03:52 for the first for loop, there is no implicit barrier, so no thread waits

04:00 at the OpenMP for with the nowait statement until all threads have done the first for

04:08 loop; those that are done can proceed beyond that, to the second for loop.

04:15 That's precisely what the nowait clause allows you to do, and depending upon

04:22 the computation, that can give benefits.
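The slide's loops are not captured in the transcript, so here is a hedged sketch of the idea with my own arrays `a` and `b`: the nowait clause removes the implicit barrier after the first loop, which is safe here only because the second loop does not read anything the first loop writes.

```c
void two_loops(int n, double *a, double *b) {
    #pragma omp parallel
    {
        #pragma omp for nowait   /* no implicit barrier after this loop */
        for (int i = 0; i < n; i++)
            a[i] = i * 0.5;

        #pragma omp for          /* normal barrier here (end of region anyway) */
        for (int i = 0; i < n; i++)
            b[i] = 2.0 * i;      /* safe only because b does not depend on a */
    }
}
```

If the second loop did read `a`, dropping the barrier would be a correctness bug, which is why nowait is an opt-in clause rather than the default.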

04:35 Then there is another: reduction. Reduction is an operation that is included in many parallel

04:46 programming languages. OpenMP is not a programming language, but they chose to

04:52 include it as a construct — a clause, actually — that one can

04:57 apply to some of the constructs. That's one use. So here's now a little example of how to use

05:06 reduction. So reduction is simply what it shows here, right? That's this

05:11 ave: there's a sum reduction. It will reduce a whole array, in this case,

05:18 to a single value, by adding all the elements of the array, and then it computes the

05:25 average after the fact by dividing by the number of elements in the array.

05:31 What one can do at first is simply use it without the reduction clause —

05:39 or, I'll come to that — basically write a for loop that adds up

05:49 the values of the array into the variable ave. So, anyone see whether this

05:59 is a good idea or not? Since I stopped, there is probably a good

06:09 guess that there are some issues; anyone volunteer the issue? Uh, threads will be writing,

06:20 maybe not necessarily in the order that you expect them to, so you could have

06:24 a thread on iteration zero while, say, a third thread is writing its iteration

06:30 into ave, so ave wouldn't be accumulating correctly, um, because, say, the

06:35 thread that is on iteration one hasn't gotten the updated value from the

06:42 thread that is about to put it in. Uh, I think — yeah,

06:47 close, but it's not quite that in this case, because it is not written as an

06:56 iterative addition in terms of ave; it's just, uh, incrementing the ave

07:02 variable. So, um, yeah: in this case, uh, if ave had

07:10 been an integer, that should be harmless; if it's floating point, the order of

07:14 addition actually matters, potentially, for the end result. But the problem is,

07:21 in this case, that more than one thread may try to update the global variable ave

07:27 at the same time, and then only one gets to do it. So in

07:31 this case, some of the elements of the array a may not in fact end up

07:37 being in the summation variable, because ave, maybe, is a shared variable. So

07:44 that's why it is kind of a race condition. And one can fix the race

07:51 condition by using the critical construct, as we talked about, um, just recently.

07:59 And here is an example that is here stated in terms of showing that:

08:06 in this case, ave is declared a private variable, and this

08:15 clause in fact gets inherited also by the nested for loop, since there are two parallel loops,

08:25 an outer one and an inner one. So in this case, because ave is private

08:32 to each thread, one gets, um, as many partial sums — in this case

08:39 ave is local to each thread — as there are threads; each one adds

08:44 whatever piece of the array it works on into its local variable, and it

08:52 remains to sum up all the per-thread sums of the elements of the array

09:01 a. And to get that done without a race condition, one can use the critical

09:08 construct to make sure that they all get properly added. Mhm. The

09:16 proof of the fact that this is true is in the image, uh, on

09:24 the right of this particular slide, so you can see it. In this case,

09:28 there were four threads being used, and the initialization is on the left-

09:34 hand side for a; and if you look at the ave local to

09:38 each one of the threads, you can see that the sum that ends up for each

09:47 one of the threads is the sum of the two values of a that the

09:52 thread is responsible for. So it in fact did work correctly, in terms of each

09:58 thread summing up its values of the array a. Um, so this is

10:06 what I just said. But critical means one thread at a time. So

10:13 the, um, inner for loop is parallel,

10:22 to the extent that you have threads, but then there is a sequential part in

10:27 adding up the local parts. And that is what one can fix with the reduction clause.

10:36 So here's how the reduction clause works: it is strictly a clause that gets

10:46 added to — the parallel for construct, in this case. And it says that it is

10:54 to be a plus reduction. One could do,

11:01 ah, a string of multiplications as well; in that case multiply would be the

11:08 operator in the reduction clause. In this case, the operator is plus and the variable is

11:14 ave. So now, in this case, uh, things should be correctly done, um,

11:21 due to the semantics of the reduction clause, which has to

11:28 make sure that, um, all the different threads' values are added correctly. One

11:36 may assume — I'm sure that there is associativity in the plus operation — so the

11:44 upshot is that everything should be added correctly into the ave variable.

11:53 And here's just a list of the types of operations that the reduction clause supports.
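Putting the pieces of the example together — the names `ave` and `a` are from the slide as described; the rest is my minimal sketch — the reduction clause gives each thread a private copy of `ave`, initialized to the operator's identity, and combines the copies with `+` at the end of the loop:

```c
double average(int n, const double *a) {
    double ave = 0.0;
    #pragma omp parallel for reduction(+:ave)
    for (int i = 0; i < n; i++)
        ave += a[i];          /* each thread adds into its own private copy */
    return ave / n;           /* copies combined safely; divide after the fact */
}
```

Without the `reduction(+:ave)` clause, `ave` would be a shared variable updated by all threads at once — exactly the race condition just discussed.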

12:09 So, a question on the reduction-type clause: how efficient is it? I'll give — and we'll

12:14 talk through examples — we'll use the reduction clause in the examples here. To what extent it is efficient,

12:22 well, that's an implementation issue; it has to be.

12:27 Um, and of course, in the end, it can be done efficiently.

12:38 And — does anyone have an idea how you would do addition

12:48 in a parallel context, better than the sequential version, for instance?

12:52 ... Could you repeat the question, Dr. Johnson? Yes, I was asking

12:57 whether anyone has an idea how to do parallel addition, if you

13:08 have an array of elements to be summed up. Mm. You could

13:12 just find the sums of — that is, partition the array and find the sum of each partition,

13:17 and then add those up together? Correct. So, basically, you can

13:23 treat it as a binary tree: you just pair things up, and then you work

13:28 up the tree from the leaves, adding the elements pairwise, and then the

13:34 next level up the tree is adding the partial sums, until you basically have

13:39 two values to sum up when you are at the root of the tree. So

13:45 that's the way you can do it in a parallel context; and in these shared-memory models everything is shared, so it is easily done.
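The tree summation just described can be sketched as follows — this is my own serial illustration, not code from the lecture: at each level, adjacent pairs are added, and the additions within one level are independent of each other, so they could run in parallel.

```c
/* Sum n values by pairwise combination up a binary tree.
   n is assumed to be a power of two; x is overwritten in place. */
double tree_sum(int n, double *x) {
    for (int stride = 1; stride < n; stride *= 2)       /* one tree level per pass */
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];                      /* independent pair-adds */
    return x[0];                                        /* root holds the total */
}
```

For n elements this takes log2(n) levels, which is why a good reduction implementation can beat the sequential accumulation of partial sums.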

13:55 Now, as I said, a little bit — the trickiest part in getting both good

14:01 performance and correctness is the sharing issue; the next couple of slides address

14:11 that. So this is kind of a picture of how things may actually be at

14:17 any given state of the computation: a variable, um, like a,

14:25 may in this particular case have copies in different places that are not necessarily known at

14:32 any given time, in all the different caches of the different processors. So this

14:40 is what is known as the relaxed-consistency model that OpenMP supports. So, depending upon

14:50 how the cache coherency is implemented, one may need to be more or

14:56 less careful about how the different threads interact in terms of reads and writes; that's

15:02 essentially what this slide says. And there is then a, uh, command for synchronization that

15:12 is known as the flush command, that helps in making sure that things become —

15:18 become consistent across the different memories in a typical multiprocessor today. And it

15:28 also says: don't try to code around it. If you kind of depend on the reads and

15:35 writes happening in a particular order to get the correct results, in that case make sure

15:39 to use, again, implicit or explicit synchronization, or

15:47 use the flush command, to make sure that all threads, when they need to interact, see the same value.

16:00 So now, any questions on those things? Something I mentioned before:

16:07 one has to be very careful about what is local to a thread and what is shared, and how it's shared. If not, then

16:21 about scheduling — that was a question, I believe, a couple of lectures

16:26 back: how this thing works. Um, and so there are a few different schedules

16:33 that are allowed in, uh, OpenMP, and they control how to manage

16:44 the, um, scheduling of work in the work-sharing constructs. So maybe I'll talk a

16:55 little bit to this slide, and then I'll show you a picture that probably makes

16:59 it more intuitive what these things mean. So one thing is static, and

17:06 that is simply that at compile time one basically creates the code that then carves

17:14 up, for instance, loops among threads. And then there's a chunk attribute that says

17:24 what the size is, per round, of the work that is being doled out

17:32 to the different threads. And, of

17:39 course, if the amount of work is not an even multiple of the chunk size, some threads get less. But exactly what

17:48 happens when the total amount of work is not an even multiple of the

17:57 chunks — not a whole integer, so it does not divide evenly by the chunk

18:05 size — is up to the implementation of OpenMP: how they choose

18:12 to create the chunk sizes. Dynamic is a little bit more complicated, in the sense that

18:25 you still use the chunk for carving up the work, but it's not statically decided how

18:35 chunks are assigned to threads; again, the next slide will probably give you a

18:41 better picture of that. And guided is even more flexible, in the sense that the

18:49 chunk just sets the initial chunk size, and then, depending upon the remaining work, the

18:56 chunk size can be modified at run time. And I think the rest of it

19:04 is further flexibility. Simple enough; let me show you that kind of picture. In

19:12 the static one, there's a fixed chunk size — assuming again that the total work is

19:21 divisible by the chunk size; otherwise the last chunks will not necessarily be

19:29 about the same size — and up until that point, every thread gets the same-

19:36 sized chunk, and they are assigned, like it shows

19:40 on this slide, in kind of a round-robin assignment order. The dynamic one also

19:50 has a fixed chunk size you can set, but now the order is more flexible:

19:57 each thread that finishes gets another chunk of that

20:01 size. And so it's not the round-robin assignment that you have in the static form of schedule. And then

20:12 there are two examples of guided on this slide. And the idea is that,

20:22 in this case — typically, as in, I guess, guided A — the first set of chunk

20:27 sizes is based on the work to be done and the chunk size given. Then, as you proceed, the chunk

20:34 size is determined by the amount of remaining work; in that case, chunk sizes will

20:40 be reduced from the initial chunk size. And the guided B is kind of even

20:47 more dynamic: as it ends up in this case, you get one chunk size, and

20:53 then you may look at what remains after the first chunk and not just make

20:58 everything equal in every round. And here again, the assignment is not predetermined; it

21:04 can be arbitrary. So there are different ways of specifying what the schedule of work should

21:15 be — how the work should be carved up among threads. And, as was just mentioned

21:24 last time when we talked about it:

21:33 yes, there may be benefits to the more flexible schedulings, but it also

21:38 means that there is more work to be done at run time. So there is

21:41 overhead, again, involved in being flexible at run time.
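In clause form, the three kinds of schedule just described look like this — a hedged sketch with my own loop bodies, not the lecture's code. The clause only changes which thread executes which iterations, not the result.

```c
void fill(int n, double *y) {
    /* static: iterations carved into chunks of 4 up front, round robin */
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < n; i++) y[i] = i;

    /* dynamic: chunks of 4 handed out as threads finish their previous chunk */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++) y[i] += 1.0;

    /* guided: chunk size starts large and shrinks with the remaining work */
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; i++) y[i] *= 2.0;
}
```

Static has the least runtime overhead; dynamic and guided tolerate uneven work per iteration at the cost of runtime bookkeeping, which is the trade-off noted above.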

21:54 So, any questions on that? Then, again, those are clauses one can add

22:04 to the constructs. We have already given examples. I guess next is the

22:13 runtime functionality, um, and this is just a few examples; there has been

22:24 use of this already, and I'll come back to it, uh, in subsequent examples.

22:32 So, you can get the number of threads or cores being used, the number of threads that is the

22:41 maximum available; and then you can set the number of threads that is, in

22:48 fact, being assigned. And then one more thing — yeah, the thread ID, which

22:51 was already used in, um, the examples I have shown. And this is the

23:00 priority order that comes with them. So, Josh mentioned in the demo last time that there

23:06 are environment variables; those have the lowest priority, and the highest priority is what you

23:18 set or assign at the parallel region.
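A small sketch of the runtime calls just mentioned — the function `count_team` and the `#ifdef` fallbacks are mine, so the example also builds on a compiler without OpenMP support:

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* serial fallbacks so the sketch compiles without -fopenmp */
static int  omp_get_num_threads(void)   { return 1; }
static int  omp_get_thread_num(void)    { return 0; }
static void omp_set_num_threads(int n)  { (void)n; }
#endif

int count_team(void) {
    int nthreads = 0;
    omp_set_num_threads(4);                     /* a request, not a guarantee */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)          /* only the master records it */
            nthreads = omp_get_num_threads();   /* how many we actually got */
    }
    return nthreads;                            /* between 1 and the requested 4 */
}
```

The return value makes the point of the next slides concrete: the request sets an upper bound, and the runtime/operating system decides what you actually get.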

23:29 I think this is just a very simple example, in this case, where one requests,

23:37 in this case, four threads for this parallel region. And it's important to note —

23:45 uh, once again, another point that came up in the previous lecture —

23:53 that these are strictly requests, and I

24:02 think sometimes there's confusion, or one doesn't, uh, understand, that these

24:13 threads are assigned by the operating system. So — and I'll come to that in more

24:21 detail in the next two slides — the operating system knows the situation at the current

24:34 time, and if it is the case that resources are there, then it chooses to maximally assign

24:46 the requested number of threads. If those numbers of threads are not available, or it is

24:55 even more cautiously deemed not feasible for some reason — like overheating and other things in

25:08 the system that it knows of — it may assign fewer threads. So, so, be aware:

25:16 just because one requests a number of threads does not mean that one gets that

25:23 number of threads. There are, in these cases, runtime functions that you can use to

25:32 find out how many threads actually were assigned to the currently executing region. And there is here

25:36 an example: if you actually run it, um, you see whether the

25:40 code actually got four threads, in both parallel regions, in line with what I

25:48 have talked about earlier. So this is again more of what I already said:

26:00 that the operating system is in charge, and the number of, again, threads you

26:10 get is decided by it. And the next slide is an eyesore, but you can find,

26:19 in the OpenMP specification, exactly what the rules are for the assigning of threads. But

26:28 the simple-minded rule of thumb to remember is that the thread count requested is kind of

26:35 the maximum, and if things are available, that's what will be given. But maybe

26:44 you get two or fewer threads, and that depends again on whether things are dynamically

26:54 or statically assigned. But the take-home message from this is: request statements are merely hints; the assigning of threads to the regions is up to the system.

27:04 Um, so this is again an example, all of a sudden using the number of

27:13 processors as the number of threads; and dynamic adjustment, in this case, is disabled. And

27:22 in this case, the request is what one then proceeds to use. And

27:28 for this particular case, there is used a single construct, to make sure that

27:34 num_threads — it is a variable — is set by just one of the threads;

27:43 uh, there's no need to have several threads updating the variable, and it shouldn't make

27:50 any difference in this case, when dynamic is zero, but it's kind of good to do

27:55 it. Yes, so that's just one of them, as I said, on this slide.
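The single construct just mentioned can be sketched like this — my own names, with serial fallbacks so it builds without OpenMP: exactly one thread of the team executes the block, and the others wait at its implicit barrier, so afterwards every thread sees the updated value.

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_threads(void) { return 1; }
#endif

int team_size_via_single(void) {
    int nthreads = 0;
    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();  /* one writer, so no race */
    }
    return nthreads;
}
```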

28:05 Yes, I think — so, as demoed last time, the thread

28:11 IDs are local to each region. So this is just a slide showing, uh,

28:19 what was already mentioned in the last demo. And a little bit, I guess, of terminology: maybe

28:32 parent and child is sometimes used — the thread for the sequential part can

28:42 be used — called — the parent thread. But logically, um, maybe it's more accurate

28:49 to think of it, inside the parallel region, as the master thread

28:55 being, in fact, the parent thread continuing. So it is still

29:01 in existence. All right, so any questions on that? Then I'll talk through

29:10 an example. All right, let me share my screen. Okay, so this is

29:25 an example that is used by many in terms of showing a simple OpenMP

29:31 program, and one to work with, um, to show different behaviors depending

29:39 upon how you kind of write this simple OpenMP code. So it

29:46 turns out we are going to compute pi, by

29:54 this particular way of doing numerical integration. Anyone of you who has taken an

30:01 introductory numerical analysis course knows you can compute an

30:09 integral as, basically, the area under a curve, and the simplest approximation algorithm is just

30:16 to approximate it by a bunch of rectangles. And, clearly, the smaller the rectangles

30:24 are, the closer the area of all the rectangles fits the curve; and the

30:31 method to use is known as the midpoint method, if I remember correctly. All

30:39 right. So basically it's to compute the area of the rectangles — from, basically,

30:47 the height, in this particular picture, of each rectangle. And one way is just,

30:55 here, to use the x value at some point in the rectangle for the function; and

31:02 in the next slide, we use the midpoint, um, of the rectangle

31:08 for the function value, and then multiply by the width of the rectangle. And so

31:12 one gets the area for each one of the rectangles. So here is kind of

31:18 the sequential code. And I just said that the step — that for each

31:23 rectangle, it takes the function value at the midpoint of the rectangle, and then we

31:30 simply add up all the rectangles. And in this case, there would be a little bit of a problem — we'll come to that — in terms of doing it in parallel.
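For reference, the midpoint-rule pi computation in sequential form looks like this — a reconstruction of the kind of code being described, with my own names (`num_steps`, `step`, `sum`), since the slide itself is not in the transcript. It integrates f(x) = 4/(1+x²) from 0 to 1, which equals pi.

```c
double pi_serial(long num_steps) {
    double step = 1.0 / (double)num_steps;   /* width of each rectangle */
    double sum = 0.0;
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;         /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);          /* height of rectangle i */
    }
    return step * sum;                       /* multiply by the width once */
}
```

Note that the width is factored out of the loop: the heights are accumulated, and the single multiply by `step` happens at the end — a detail that comes up again in the parallel versions below.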

31:37 Now, there is kind of a recurrence in the summation, so we'll talk about that.

31:44 So here is now the first attempt to do an OpenMP program of

31:53 this sequential code. So, I thought I had that on the previous

32:00 slide — that's what I said. Uh, okay, so,

32:05 you know — hopefully you can remember it; I thought I had it. So the idea

32:09 in this case, to simplify, is to set just two threads, and

32:18 to avoid the sharing of the sum variable that, you know, was the potential problem,

32:26 each thread now gets its own sum, by having as many partial sums as there are

32:35 threads, so each thread can basically update a particular location. In this case

32:41 the sum array is just an array of size two, but it could be

32:46 whatever size and number of threads one wants to use. Um, so because it is

32:54 declared outside the parallel region, it is now a shared variable; but since each thread

33:02 updates its own slot in the sum array, there is no issue in updating the sum values in

33:10 each slot, since only one thread updates a particular slot in the sum array. So

33:18 let's see what else is happening here. In the parallel region we declare, uh,

33:29 private variables now, because it's inside the parallel region. Um, the loop

33:36 index is the most important part: since i is used as a loop index inside, in that

33:41 way, it is private. But i is also used in and outside the for loop,

33:47 uh, so maybe it is cleaner to declare i also inside the parallel

33:53 region. And then there is a variable for the number of threads, and this is a

34:02 little bit tricky, the way this example has done it: "nthreads"

34:07 kind of has two spellings, so there are actually two separate variables, one private

34:13 to each thread, and then there is the global nthreads for use after

34:22 the region. So what happens in the parallel region: each thread gets its

34:30 own ID, and then they also get the number of threads, um, used in the

34:39 parallel region, and update their private copy. But then, what's being done —

34:45 I guess to avoid confusion about who updates it — is that in this case a conditional means

34:52 that only the master thread gets to update the global variable for the number of threads.

34:59 And then there is a for loop for adding up the rectangle values, where each

35:08 thread updates its location in the array of partial-sum variables. Mhm. And the way

35:22 that this summation is done is as kind of an assignment of rectangles to each

35:30 thread in a round-robin fashion: as you can see, the loop

35:35 variable is incremented by the number of threads. Um, and then, since the

35:46 sum variable — the sum array, I should say — is a shared variable, its content is

35:57 persistent, so it's still available at the end of the parallel region. So

36:04 then, after exiting the parallel region, there is the summation of — and it's not correct what

36:15 I said; I misspoke, in the sense that the sum of values is simply the sum of the function values at

36:23 the midpoints of the rectangles — again, because one doesn't need to multiply by the

36:28 width of the rectangle inside the loop, since it's the same width for all the rectangles.

36:35 So, in fact, as you can see in the statement towards the end,

36:40 uh, you do the multiplication once you have the partial sums

36:44 of heights, um, for each one of the threads, and then you

36:49 simply multiply by the rectangle width, which could again have been done even outside this

36:57 summing loop, outside the parallel region, at the very end. So this is again outside the

37:04 parallel region, so there's no problem in doing this, um, in terms of correctness,

37:10 but it is serial. So in that sense, it's not necessarily what

37:16 you want to do. And the other point that was made, uh,

37:23 is this SPMD model, uh, that it

37:29 is kind of, er, a standard trick or model for how you can parallelize loops.

37:35 So in this case they're doing it round robin; one can also do it blockwise

37:40 and otherwise, so carving up the iterations. But in this case,

37:46 that for loop is executed by each thread; each member of that

37:53 team now runs its own sequential loop over its share. And in this case, the

37:59 programmer took the pains to divide the sum among the threads.
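A reconstruction in the spirit of the SPMD version being described — names are mine, and the thread count is fixed at 2 as in the lecture: a shared array of partial sums, one slot per thread, round-robin carving of the rectangles, and the serial combine after the region.

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int  omp_get_num_threads(void)  { return 1; }
static int  omp_get_thread_num(void)   { return 0; }
static void omp_set_num_threads(int n) { (void)n; }
#endif
#define MAX_THREADS 2

double pi_spmd(long num_steps) {
    double sum[MAX_THREADS] = {0.0};        /* shared: declared outside region */
    double step = 1.0 / (double)num_steps;
    int nthreads = 1;                       /* the "global" thread count */
    omp_set_num_threads(MAX_THREADS);       /* a request, not a guarantee */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();      /* private to each thread */
        int nthrds = omp_get_num_threads(); /* private copy of team size */
        if (id == 0) nthreads = nthrds;     /* only master sets the global */
        for (long i = id; i < num_steps; i += nthrds) {  /* round robin */
            double x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x); /* adjacent slots share a cache line */
        }
    }
    double pi = 0.0;
    for (int k = 0; k < nthreads; k++)      /* serial part, after the region */
        pi += sum[k] * step;                /* multiply by the width at the end */
    return pi;
}
```

It computes the right answer; the performance trap hiding in `sum[id]` is exactly what the next slides diagnose.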

38:13 Um, so this is just — and is this what he ran this for? Uh, no —

38:21 he obviously did modify the number of threads and did not just use

38:27 two threads. Um, so what we can see, in terms of

38:38 execution time, is the right column in this table, as a function of the number of

38:43 threads. Um, for two threads it works reasonably okay. But there isn't

38:50 much of an improvement in performance for three and four threads, and there are still, in

39:02 this case, 100,000 rectangles, so there's plenty of work to share among

39:09 more than two threads. So, anyone has an idea how come we didn't get

39:20 better performance, since there is lots of work to dole out to threads? Uh,

39:28 yes, there was a sequential part at the end, in terms of, uh, summing the

39:37 function values and multiplying by the width. But with four threads there's still a

39:44 lot of work in the parallel part, and there are only four partial sums to add

39:52 up. So even though there is serial work at the end, it's still very small

39:59 compared to the parallel region. Somehow... Ah, if I had to take a

40:08 shot at it, I would say that, since the hyperthreading magic works by sharing

40:14 resources, and they're all using the same — say, for

40:19 example, floating-point registers — to do the actual math, maybe the hardware threads aren't

40:24 able to act as independently as we expect them to, right?

40:31 This, as I will in fact show, happens even if you didn't use hyperthreading. It's

40:37 a good point. I am not sure in this case. Yes, it was not

40:42 the correct answer, but it's close to it in what it is saying. So here

40:53 is, in fact, the culprit. So this is now where part of the reason comes

41:04 into play for why I spent time talking about processor architectures and caches and cache

41:09 lines. So here it is: what happens in this case is what's known

41:18 as false sharing. So false sharing means that different threads share values in the same cache

41:32 line, but not the same variable. So it's not sharing of a variable,

41:37 in which case it would be proper sharing; but from the cache-line perspective,

41:48 that cache line is shared between most of those threads. And that comes from the

41:57 fact that, as is kind of shown on this slide, this sum array is

42:07 just a one-dimensional array, and it is laid out in successive memory locations in the linear

42:15 memory. And that means that successive, uh, entries of the sum array are in

42:25 the same cache line — until you have enough array elements that it fills up a

42:32 cache line. So sum zero and one, two, three in this cartoonish example are in the

42:42 same cache line. So thread zero, then, is on, say, core zero, on

42:54 hardware thread zero. Now, as I mentioned — and I'll come back to it,

43:00 I guess, in the next lecture; uh, I don't think I'll get to it

43:03 today — the default from the operating system is to share things in this

43:09 case between, uh, say, the sockets, so each core gets some. And, yeah, the

43:17 platform was a dual-core. Um, so that means, then, you know, we

43:26 have, I guess in this case, just two cores. So, if you look at

43:32 the code statement — I should look more at the slide, you know, and talk less —

43:35 so, in this case, sum two is needed by thread

43:45 two, and that line is potentially then in the L1 cache for — in this case —

43:52 the first core. And the L1 caches, at the closest point, uh,

44:00 are private to the cores. So if the other core wants it,

44:05 it has to get it from the other core's cache. So then what happens is the

44:12 cache line gets — so if the first core is the first one to update

44:18 sum zero, then, through the cache-coherence mechanism — on Intel, say — it gets invalidated in the

44:26 L1 cache on the other core. So when that core, um, reads

44:33 it, it's not valid. So then it needs to get a copy from the other

44:39 L1 cache. And then, when it updates things, then the cache line in,

44:46 um, the first core's cache ends up getting invalidated. So that means there's ping-

44:54 ponging, as they call it, of the cache line between the two caches. Dr.

45:02 Johnson? Yeah. Um, at the end of the lecture, you mentioned something about

45:07 a relaxed-consistency shared-memory model. Do we make use of that in this case?

45:14 Well, so for these things to be correct, either the cache coherence takes

45:20 care of it, or otherwise one may need to explicitly, uh, use the flush to

45:26 make sure that the other threads see the last values. So, in the case

45:35 of the cache-coherence mechanisms for the cache lines, it takes care of it.

45:40 But since this is false sharing, it's not the same variable. So — I guess the

45:47 cache coherence was for the same variable? These are not the same variable that is referred

45:53 to, because each thread only updates, uh, one particular element of

46:00 the sum array. But because one element in the cache line is updated,

46:16 say, by thread zero — as soon as that cache line is updated in,

46:22 um, the level-one cache of the first core — that cache line,

46:28 the whole cache line, gets invalidated in the other cache, through the cache-coherence mechanism.

46:36 So it's a different mechanism; one should be very clear about that. So it's because of

46:43 the cache coherence that the cache line gets invalidated, even though the value itself may

46:50 be perfectly fine. But since the coherence mechanism doesn't understand individual variables —

47:01 it just works, um, on cache lines — that's why it ends up

47:08 being invalidated. So, good question, all right? It wasn't clear before; it does make

47:20 sense now. Yeah, thank you. Okay, thanks, Frank. Having

47:27 clarified the difference between the two: a race condition is just based on the program logic and shared

47:36 variables, and this phenomenon is based on how the caching mechanism works. So, how

47:49 does one fix this? So now, the fix is to use padding to do this thing. And,

48:15 um, this is even used in non-parallel programs sometimes, with respect to how the

48:31 main memory works — I haven't spent time talking about memory and DRAM, and

48:41 that it has pages, and I'm not going to get into it again today. But

48:47 this notion of padding is something one uses in order to get the good

48:56 or desirable memory behavior and avoid, um, poor memory performance. And here,

49:06 for today, it's OpenMP and cache-line behaviors, but the same trick can be

49:15 used elsewhere, using padding also. And in this

49:23 particular case, the pattern used is to declare a two-dimensional array. Uh, so, as it said, the cache line is 64 bytes,

49:35 and we use doubles. So that means double precision, eight bytes per data element, or, by

49:45 division, eight data elements per line. That means, um, this, uh —

49:53 that it's now the sum that is a two-by-eight array. So

49:59 then, depending on — when I see it, it's

50:03 row-major ordering — that means that for each thread, its row of sum, one

50:13 pad of it, is a full cache line. So in this case, each thread's sum

50:23 occupies a complete cache line. So now, when anything gets updated, it doesn't invalidate

50:38 the cache line, uh, in the other core's cache. So now things should work better —
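The padding fix as described can be sketched like this — my reconstruction, with the assumption, as on the slide, of a 64-byte cache line, so the per-thread slot is widened to 8 doubles and each thread's running sum sits in its own line:

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int  omp_get_num_threads(void)  { return 1; }
static int  omp_get_thread_num(void)   { return 0; }
static void omp_set_num_threads(int n) { (void)n; }
#endif
#define MAX_THREADS 2
#define PAD 8   /* 8 doubles * 8 bytes = 64 bytes: one assumed cache line */

double pi_padded(long num_steps) {
    double sum[MAX_THREADS][PAD] = {{0.0}}; /* one padded row per thread */
    double step = 1.0 / (double)num_steps;
    int nthreads = 1;
    omp_set_num_threads(MAX_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (long i = id; i < num_steps; i += nthrds) {
            double x = (i + 0.5) * step;
            sum[id][0] += 4.0 / (1.0 + x * x); /* only column 0 is ever used */
        }
    }
    double pi = 0.0;
    for (int k = 0; k < nthreads; k++)
        pi += sum[k][0] * step;
    return pi;
}
```

Same answer as before; the only change is the memory layout, which removes the false sharing at the cost of wasted space.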

50:49 and it does. So now, as you can see, with one,

50:59 two, three and four threads, there's still an improvement in performance. It's not,

51:07 you know, linear; it's not perfect. But it's not too bad, in the

51:14 sense that two threads is a little bit more than half the time of one

51:20 thread, and four threads is a little bit more than half the time of two

51:25 threads. So it's pretty good speedup as a function of the number of threads.

51:36 But, anyone have comments about this way of doing business? ... It's wasting memory —

51:47 wasting it all on the padding space. Yes, a considerable amount, right? So this

51:54 is clearly not the preferred way of doing business. So there are other ways to

51:58 get to the same point, and one wants another way. And the other part is: one

52:04 would need to know about the cache-line size for the particular platform you're going

52:11 to use. Now, it turns out, for the level-one caches, most of them

52:15 are 64 bytes — but not for all — and for higher-level caches there is more

52:22 variability, in the 64-to-128-byte range. But this was a little bit of

52:28 my justification for talking about details about caches in my processor lecture: that one sometimes

52:35 needs to be aware of it. And yes, it varies. So, so, this is

52:44 now our way of not, kind of, explicitly tailoring things to caches, but

52:55 of being conscientious about how caches work and trying to write code that doesn't have

53:01 the problem. So in this case, you again have the parallel region. And

53:11 now, by making sum a private variable in each thread in the parallel region —

53:24 mm — there is no problem for the different threads to update their sum variables, because they all

53:30 have their own private sum variable. So there are no issues of race conditions

53:38 or anything in updating sum. And this is the type of code where the programmer

53:45 has to figure out how to parallelize the for loop him- or herself, by using this round-

53:54 robin principle. But now the sum variables are private to each thread. So how does

54:06 one get them out of the parallel region? One — or, rather, one has to sum the private

54:20 variables of the different threads before one exits the parallel region. So that's why there is

54:30 a serial portion now inside the parallel region, using this critical construct, to make sure that

54:38 the summation is just, uh, done one thread at a time, to avoid a

54:44 race condition. So — and now, as you can see, in this case we didn't do the padding

54:54 pattern, didn't waste memory; but still, by being conscientious about, uh, the

55:05 cache-line invalidation, and keeping things local to threads, in fact one gets the

55:13 same level of speedup as using the padding construct or idea. Any comments or
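A reconstruction of the version just described — names are mine: each thread keeps a private scalar sum, so there is neither a shared slot nor padding, and only the final accumulation into `pi` is guarded by the critical construct.

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

double pi_critical(long num_steps) {
    double step = 1.0 / (double)num_steps;
    double pi = 0.0;                         /* shared result */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double sum = 0.0;                    /* private: no false sharing */
        for (long i = id; i < num_steps; i += nthrds) {  /* round robin */
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
        #pragma omp critical                 /* one thread at a time adds */
        pi += sum * step;
    }
    return pi;
}
```

The serialized part is just one addition per thread, inside the region, before the threads disband — which is why it speeds up as well as the padded version without the memory overhead.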

55:28 questions on this? ... So we achieved the same time without the, um, extra

55:37 memory overheads? Yes. So not only is there no memory overhead; there is no

55:43 tying it to a particular architecture, in the sense of cache-line sizes. There is

55:51 just an awareness, I would say, of how the caching mechanisms work, and trying to make sure

55:58 that there is no kind of false sharing — in this case, that things in the same

56:06 cache lines are used by multiple threads. ... Ah, the question I had is:

56:16 what is the, uh, reason for cyclically distributing the loop iterations, instead of just

56:24 cutting them off into partitions? ... Well, in this case, one could equally well

56:28 have used the parallel for construct and let the compiler figure out how to split up

56:35 the loop. Okay, so it's just a matter of choice? Uh, right. So I

56:41 think — I chose this example, um, and I'm honest about it: I

56:45 have the — we were at the — I think the author of the example

56:51 just wanted to show a way in which the programmer may, um, parallelize a

56:59 loop. And in some other cases, it may be preferable to tailor the carving up

57:06 of the loop to the content of the loop. In this case, it's a

57:09 pretty trivial loop body — a set of statements for the for statement. So there

57:15 is really not anything, I would say, very clever about it. So I think

57:24 it was just to highlight one particular way of parallelizing a loop. And

57:33 , we talked instead of since it's a single variable continues atomic. This

57:39 Now what? So now, since you mentioned reduction: the

57:46 negative part with the other one — the last version, which was good in many ways,

57:53 as it wasn't tailored to a particular architecture, I would say — is that it has

58:02 the sequential addition of the partial sums, one from each thread. So, with the hope

58:10 that there is a good reduction implementation, one can use a reduction clause, and use

58:17 in this case, um, the parallel for statement. And then one doesn't

58:26 have the issue with the accumulation of the variables, because the reduction clause

58:35 makes sure that that happens correctly.

58:41 As I mentioned earlier, in this final version the code is otherwise much the same, and

58:48 the reduction clause does the combining. That's the very last version. In this

58:52 case, the way it's being done — well,

58:56 I guess, in this case it turned out that the

59:03 reduction implementation is

59:10 quite sufficient compared with the other way of doing business in this case. If you look

59:17 at the last column in the table, using the reduction clause, the speedup

59:23 is not quite as good as with the critical one. One could have used atomic

59:30 as well, but that depends on implementation. It's nothing necessarily inherent that says that reduction would be significantly slower. A question: does the reduction clause only have

59:40 a few preset options, or does it generalize to being able to use whatever map-reduce method you

59:46 can come up with? So the reduction in this case is a function implemented

59:54 as a part of the runtime library supporting OpenMP. And it has a collection

60:04 of operators that it does support, so one can certainly write one's own function,

60:17 but then it wouldn't be a clause in the statement. Maybe I misunderstood your

60:24 question. I was just wondering if the clause means you can't just pass any sort

60:29 of functor in there — like, the specific operations they use are

60:34 tailored and implemented in a way that makes them fast.

60:38 Uh, to my knowledge, none of the OpenMP versions support, instead of using

60:43 the built-in operators in this reduction statement, passing in a function of your own. Okay, I'm

60:53 with you. Yeah, I think it only supports, um, elementary operations. You

60:57 can't define a function for it. That's a good question, I think. But

61:05 to my knowledge, they have not, um, taken it to the point where they

61:10 can take a user-defined function. So I think this is, yes, summing

61:25 up the example: there is dealing

61:33 with race conditions, there is dealing with false

61:44 sharing, um, and then, potentially, using the reduction operator to

61:55 avoid serial segments of the code when adding up the individual thread sums into a global sum. Okay, so now we're getting into a different construct, the task construct,

62:11 but I will pause for a few seconds or a minute to see if there are

62:17 questions. So this is now getting into making things even a little bit more flexible

62:24 but also more complex. So before heading into that, uh,

62:31 let me ask if there are additional questions about OpenMP in terms of constructs and clauses. If not,

62:49 I will start talking about the task construct and working with OpenMP. OpenMP,

63:05 after the initial, um, versions, so to speak — ages ago — in

63:13 terms of trying to take some of the details of threaded shared-memory programming away

63:20 from the programmer and leaving it to the compiler: it was discovered that it didn't have the full

63:30 flexibility that one sometimes wants or needs in applications, when people write recursive

63:40 or irregular codes. It also turned out that with recursive algorithms one gets in trouble

63:50 with the initial set of constructs that was available in OpenMP. So someone created

63:58 the task construct, um, that I will talk about now. Yeah, so a task,

64:16 as well, is more flexible, because there is more to it than individual threads.

64:31 In some ways, as it was said, it's like this: one can

64:37 view a parallel region as a set of tasks, where each task is assigned to

64:48 a thread. But then, um, there is still a sort of restricted set of things

64:55 the task gets to do — um, how to manage the data

65:01 in particular, as well as the execution order. So the task construct has with it

65:12 not only the code to execute; it also has more of the data environment

65:18 and what they call the internal control variables, so it's richer. But it also

65:27 implies more overhead, because it's more of a machine that needs to be set up to

65:32 manage tasks as opposed to threads. So, in the

65:43 parallel version of these things is where the color-coded boxes are now tasks, and,

65:51 yeah, the different tasks can have different pieces of code — as opposed to what we

65:59 had before, where in a parallel region each thread gets a copy of the same code.

66:05 Now tasks have different pieces of code. But, like with threads, there is no

66:13 particular execution order defined between tasks. In this case, as the pile-of-cards

66:23 cartoon shows to the right of this slide, it is arbitrary in which order they

66:31 start and how they proceed. So this is something more or less like what I

66:40 just said: one can think of the threads in the initial version, or in a parallel

66:46 region, as tasks. But those are more restrictive than the task construct.

66:56 So here's the task construct: it's placed at a structured block, and I'll give some

67:01 trivial examples in a little bit. Then there are slightly different scoping rules for

67:11 data scopes than for a parallel region, for tasks. And then there are separate scheduling

67:21 constructs for how to coordinate tasks that one may want to use. And then there is

67:35 the barrier, not unlike for parallel regions. And then there's also a kind of

67:42 wait as well — there's a taskwait.

67:51 Yeah, so here's the simplest task construct, in a parallel region. And inside this parallel region, there

68:01 is first the OpenMP single construct. Um, that is used, as is often

68:13 typical — and as one of the other slides says — so that one thread will

68:20 create, um, all the, um, task contexts that are necessary for

68:30 the tasks that follow inside the single region. So in that case, what it

68:38 means is that each task will be executed by a single thread. But,

68:50 as one of the slides says, there is no particular implied order between

68:58 these tasks. So if one needs an order, one will need other constructs to

69:04 assure that order, if there are dependencies. Okay,

69:14 so, I guess I had some examples — there's supposed to be an example — that may help

69:20 remind you how OpenMP works, as well as how this task construct works.

69:27 So this is a simple sequential program. It's no surprise — if you run it,

69:34 it should do this thing, because it's sequential code. So the print statements get

69:38 executed one after the other, and this is what happens. So now I

69:44 try to do the parallel version. And now, I guess, if you use two

69:53 threads, um, each thread now gets a copy of this code, so each

69:58 thread is going to print "race car". But there are, you know, three print

70:06 statements and, in this case, two threads. So anyone wants to volunteer what we

70:15 potentially see? So — but it's again just to

70:28 remind you how OpenMP works: there is no particular order between how our

70:37 two threads, in this case, execute these print statements. Hmm. That's right.

70:51 Okay, so two threads write, in this case. Each thread printed out "race

71:00 car". In this case, that's true, and they came

71:07 nicely ordered, but it could have been pretty much any combination of things. It would

71:13 always, for any of the two threads, um, be "race" preceding

71:20 "car" for that particular thread. But the threads may

71:25 progress at a different pace, so it could be fairly jumbled in terms of what comes

71:34 out; you can get all the permutations. And that's just

71:37 the standard, um, OpenMP without the task construct now. So

71:49 to see the impact now of the single statement: would you expect anything different

72:01 at this time? "You can be sure that it'll print 'race car' in its

72:05 entirety for each thread." Yes, sort of — but there's just a single

72:12 one. In this case, they only print it once. Yeah, because a

72:17 single thread is enforced. So even though there are threads after the OpenMP parallel,

72:22 there's the restriction that only a single thread can execute it. It's not like a

72:31 critical region, so it's not one at a time. It's just one, period.

72:39 So that's what I said. So now we're

72:49 trying to do the task-parallel version. So, similar to the construct and its children, in terms of this task

72:55 construct. So now, what do they do? So we have — if we look at

73:11 this — the code will firstly generate a bunch of threads with the parallel construct,

73:21 and we decided that for the next thing we're only going to use one thread.

73:31 Um, except — that means, one moment, I have the task construct — that means each

73:36 of those tasks gets a single thread. Now the print statement for "is" is just

73:48 subject to the single-thread requirement, because it's not preceded by a task statement, so

73:56 we should only get it printed once. But then we have the other two, "race" and

74:05 "car". Those should be each printed once. However, the order is not

74:16 guaranteed, because the tasks can be executed in any order. So that means we're going to

74:22 get, depending upon how many times you run it — you can get, I

74:27 guess, "race car", as the sequential, um, version did. But we could

74:35 also get a thread to print the "car" task ahead of the "race". So

74:45 it can go either way, because the ordering between printing "race" and printing "car" is arbitrary. "Is"

74:51 is always first, because it's outside the task constructs. There was one more, I

75:01 guess — an example, uh. So, yes. And I just added, I

75:10 think, "is fun to watch"; that is not part of a task construct.

75:20 So — anyone? Well, maybe it's so: first, "I" should always be

75:29 printed first, right, because that's in the single-thread region. "Is fun to

75:35 watch" should also be printed once only, because it's also in a single, non-parallelized

75:43 region. But we still have the task statements that can be in an arbitrary order, so

75:49 this is kind of what might happen. Um, it may happen that, before any

75:58 of the task constructs gets executed, um, the sequential region — or the single region

76:15 — may get both of these print statements done before any of the task, um, regions

76:23 get to print what they're supposed to print. All right. Thank you.

76:37 Um, I want to stop there in two minutes. Uh, unless there is a question — if

76:45 there's no question, I'll talk about this. So, so far, as I said when I talked

76:52 about task constructs: they add a level of management that the parallel region construct by itself

77:02 doesn't have, but execution order among tasks in a parallel region is arbitrary, like the relative

77:13 progress of threads in the region is kind of arbitrary. So there's this kind

77:20 of hierarchy that goes from task to, um, thread to process.

77:29 So a process is the most heavyweight, because it carries with it a lot more

77:37 context than threads, which are at the other end, with the least context carried with

77:44 them. Uh, and tasks are kind of in the middle, sort of. A task is still

77:48 a shared-memory construct, so it has the benefits from things being shared.

77:56 Um, but it has more flexibility and allows for different codes to be executed by

78:04 different tasks, whereas each thread executes a copy of the same code.

78:14 So it is going from just threads to tasks: increased flexibility, but also increased

78:24 overhead to set things up. And when you go further on to processes, it's even

78:31 more things that need to get created and set up. But it's also more,

78:39 uh, flexible. And it's no longer necessarily a shared-memory construct — a process is not limited to shared

78:48 memory — though again, the process carries a lot

78:54 with it. So with that, I guess I'll take more questions. But I

79:03 will start then with scope next time and talk about what's grayed out here in terms

79:10 of the task constructs. And then we'll talk about affinity and how you can

79:18 force threads to be placed where you want them to be, and not have the

79:23 operating system decide for you. So that will be it on Wednesday, and then, I

79:32 think — so, yes — also the demo on some of the constructs; we haven't done a

79:38 demo yet, including the task constructs. If there are any other questions, whether

79:55 about lecture material or assignments? If not, I'll stop sharing my screen

80:22 and, now, stop the recording.
