00:00 | So today I'll continue talking about OpenMP, and I will probably not finish today either, but that's no problem; what I think matters is to do the exercise and gain insight into how OpenMP works. So this is a little bit of today's agenda: finish talking about the control-flow constructs in

00:31 | OpenMP, then a bit about the runtime, then spend some time going through an example of using OpenMP for a small program, and then get into yet another construct, tasks, for OpenMP — and we'll see how far we get on tasks.
00:57 | So, where we got to: the if clause is where we ended yesterday. To recall what we covered yesterday: the parallel construct, which is fairly simple and straightforward; then the synchronization, the barrier, where all threads must be present at the barrier before any of them proceeds; and critical, which is just a section of code that only one

01:27 | thread at a time can execute — in this case it is just a single statement, but it can be used for any block of code that needs to be executed by one thread at a time for correctness; we talked about that. And then

01:44 | there was atomic, which is equivalent to critical as long as it is just a single memory location that is involved. So atomic is just for an assignment-type statement, whereas critical can be applied to some body of code.
02:04 | Yeah, so here is where I was at the end of the last lecture: the if clause. In this case a conditional statement is used, and if the condition is true, then the parallel region gets executed in parallel; otherwise it is serialized. That means only the master thread — the parent thread — will execute the region.

02:35 | So, any idea why one would want to do that? Let's see if we can get questions, or responses to questions. "There might be more overhead in splitting up the work, as opposed to just letting one thread do it." Right, exactly: spawning a team of threads is not entirely free;

02:58 | it involves a fair amount of overhead. So one should, as a general rule — whether it's using OpenMP or, later, MPI or some other programming model — be careful when one uses a parallel construct, because there is overhead, and depending on the programming paradigm it may be quite significant. So don't use parallelism if one can do the work in reasonable time on a single thread.
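A minimal sketch of the kind of if clause being described (the threshold and names here are illustrative, not taken from the slide):

    #include <omp.h>

    void scale(double *a, int n) {
        /* Only spawn a thread team when there is enough work to amortize the
           overhead; otherwise the region runs serially on the master thread. */
        #pragma omp parallel for if(n > 10000)
        for (int i = 0; i < n; i++)
            a[i] *= 2.0;
    }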
03:28 | Then there is the nowait clause, which allows threads that are done with, in this case, a for loop to not wait at the implied synchronization at the end of the parallel for. So here is, I guess, an example:

03:52 | for the first for loop there is no implicit barrier, so threads do not wait at the end of the OpenMP for until all threads have finished the first for loop; the threads that are done can proceed beyond that point to the second for loop.

04:15 | That's basically what nowait allows you to do, and depending on what the computation is, that can give real benefits.
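A minimal sketch of the nowait pattern being described (the loop bodies are placeholders, and the two loops must be independent for this to be safe):

    #include <omp.h>

    void two_loops(double *a, double *b, int n) {
        #pragma omp parallel
        {
            /* nowait removes the implicit barrier at the end of this for, so a
               thread that finishes its share can move straight to the next loop. */
            #pragma omp for nowait
            for (int i = 0; i < n; i++)
                a[i] = a[i] * a[i];

            #pragma omp for
            for (int i = 0; i < n; i++)
                b[i] = b[i] + 1.0;
        }
    }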
04:35 | Then there is now reduction. Reduction is an operation that is included in many parallel programming languages; OpenMP is not a full programming language, but they chose to include it as a clause that one can apply to some of the constructs. That's one use, and here's now a little example of how to use

05:06 | reduction. Reduction is simply what this slide shows: it reduces a whole array, in this case, to a single value by adding up all the elements of the array, and then it computes the average by dividing by the number of elements of the array after the fact.

05:31 | What one can do, first without the reduction clause, is simply write, as part of a for loop, code that adds up the values of the array into the variable ave. So, does anyone think this is or isn't a good idea? Before I go on — there is probably a

06:09 | guess that there are some issues; anyone volunteer an issue? "It'll be writing maybe not necessarily in the order that you expect it to, so you could have a thread on iteration zero while, say, a third thread is writing its iteration into ave, so it wouldn't be accumulating correctly, because, say, the thread that is on iteration one hasn't gotten the updated value from the other thread — it's hard to put into words."

06:47 | Yeah, that's close, but it's not quite it in this case, because it's not written as an iterative addition in terms of ave; it's just incrementing the ave variable. In this case, if ave had been an integer the ordering would be harmless; if it's floating point, the order of addition can actually matter for the end result. But the problem in this case is

07:21 | that more than one thread may try to update the global variable ave at the same time, and then only one of them gets to do it. So in this case, some of the array elements of a may not, in fact, end up being added into the summation variable ave, because ave is a shared variable.

07:44 | That's why it is kind of a race condition, and one can fix the race condition by using critical, as we talked about just a short while ago.
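A minimal sketch of the racy averaging loop being described (names are illustrative; the point is that ave is shared and updated by every thread without protection):

    #include <omp.h>

    double average_racy(const double *a, int n) {
        double ave = 0.0;                /* shared accumulator */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            ave += a[i];                 /* data race: unsynchronized read-modify-write */
        return ave / n;                  /* divide by the element count after the fact */
    }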
07:59 | And here is an updated version that illustrates this: in this case a local copy of ave is declared as a private variable, and this clause in fact gets inherited also by the nested for loop, since there are two loops, an outer one and an inner one. In this case, because it is

08:32 | private to each thread, one gets as many partial sums — the local ave — as there are threads; each thread adds whatever piece of the array it works on into its local variable, and then it remains to sum up all the per-thread sums of the elements of the array

09:01 | a. To do that without running into a race condition, one can use the critical construct to make sure that they all get properly added. The proof that this

09:16 | is true is in the image on the right of this particular slide, so you can see it: in this case there were four threads being used, the initialization of a is on the left-hand side, and if you look at the local ave for each one of the threads, you can see that the sum that ends up for each

09:47 | one of the threads is the sum of the two values of a that that thread is responsible for. So it did in fact work correctly in terms of each thread summing up its values of the array a. So this is

10:06 | what I just said: critical is one thread at a time. So the inner for loop is parallel to the extent that you have threads, but then there is a sequential part in adding up the local partial sums. And that is what one can fix with the reduction clause.
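A minimal sketch of the version just described — a private per-thread partial sum combined under critical (the general pattern, not the literal slide code):

    #include <omp.h>

    double average_critical(const double *a, int n) {
        double ave = 0.0;
        #pragma omp parallel
        {
            double ave_local = 0.0;      /* private partial sum per thread */

            #pragma omp for
            for (int i = 0; i < n; i++)
                ave_local += a[i];

            /* one thread at a time folds its partial sum into the shared total */
            #pragma omp critical
            ave += ave_local;
        }
        return ave / n;
    }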
10:36 | So here is how the reduction is used: it is actually a clause that is added, in this case, to the parallel for construct, and it says that there is to be a plus-reduction. One could have a string of multiplications as

11:01 | well; in that case multiply would be the operator in the reduction clause. In this case the operator is plus and the variable is ave. So now, in this case, things should be correctly done, due to the semantics of the reduction clause, which has to make sure that all the

11:28 | different thread values are added correctly. One may assume that there is associativity in the plus operation, so regardless of the order, everything should end up added into the ave variable. And here is just a list of the types of operations that the reduction clause supports.
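A minimal sketch of the averaging loop rewritten with the reduction clause (same caveat about names):

    #include <omp.h>

    double average_reduction(const double *a, int n) {
        double ave = 0.0;
        /* each thread gets a private copy of ave initialized for '+';
           the runtime combines the copies into the shared ave at the end */
        #pragma omp parallel for reduction(+:ave)
        for (int i = 0; i < n; i++)
            ave += a[i];
        return ave / n;
    }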
12:09 | "A question on the reduction clause — how efficient is it?" In the example I'll give in a bit we will use the reduction clause, so we will see to what extent it is efficient; but that is an implementation issue, it has to be said. And of course, the combining itself can be done in parallel.

12:38 | Does anyone have an idea how you would do it in a parallel context, better than a serial version — for addition, for instance? "Could you repeat the question, Johnson?" Yes, I was asking

12:52 | whether anyone has an idea how to do a parallel addition, if you have an array of elements to be summed up. "Could you just partition it, find the sum of each partition, and then add those up together?" Correct. So basically you can

13:17 | treat it as a binary tree and just sum pairwise: at the leaves of the tree you add the elements pairwise in parallel, and then the next level up the tree adds pairs of the partial sums, until you basically have two values left to sum up when you are at the root of the tree.

13:39 | That is the way you can do it in a parallel context, and in these shared memory models everything is visible, so it is reasonably easily done.
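A minimal sketch of the tree-style pairwise summation just described (assumes n is a power of two; in practice one would rely on the reduction clause or per-thread partial sums instead):

    #include <omp.h>

    /* Pairwise tree summation: at each level, element i absorbs element i+stride.
       After log2(n) levels, a[0] holds the total. */
    void tree_sum(double *a, int n) {
        for (int stride = 1; stride < n; stride *= 2) {
            #pragma omp parallel for
            for (int i = 0; i < n; i += 2 * stride)
                a[i] += a[i + stride];
        }
    }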
13:55 | Now, a little bit more: as I said, the trickiest part in getting things both performing well and correct is the sharing issue, and the next couple of slides touch on that. So this is a kind of picture of how things may actually be at

14:17 | any given state of the computation: a variable, like a in this particular case, may have copies in different places that are not necessarily consistent at any given time — in all the different caches on the different processors. That is

14:40 | what is known as the relaxed consistency model that OpenMP supports. So, depending upon how cache coherency is implemented, one may need to be more or less careful about how the different threads interact in terms of reads and writes, and that is

15:02 | essentially what this slide says. There is then a command for synchronization that is known as the flush command, that takes care of making sure that things become consistent across the different memories in a typical multiprocessor today. And it basically

15:28 | says: don't write code where you depend on the reads and writes happening in a particular order to get the correct results; in that case, make sure to use, again, implicit or explicit synchronization, or use the flush command, to make sure that all threads, when they need to interact, see the same value.
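A minimal sketch of the kind of hand-off that flush is meant for (illustrative only; in modern code one would typically use atomics or higher-level synchronization rather than a spin loop):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int data = 0, flag = 0;

        #pragma omp parallel sections shared(data, flag)
        {
            #pragma omp section
            {                                    /* producer */
                data = 42;
                #pragma omp flush(data)          /* publish data before the flag */
                flag = 1;
                #pragma omp flush(flag)
            }
            #pragma omp section
            {                                    /* consumer */
                int seen = 0;
                while (!seen) {                  /* spin until the flag becomes visible */
                    #pragma omp flush(flag)
                    seen = flag;
                }
                #pragma omp flush(data)          /* then re-read a consistent data */
                printf("data = %d\n", data);
            }
        }
        return 0;
    }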
16:00 | So now, any questions on those things? It is something I mentioned before: one has to be very careful about what is local to a thread or shared, and how it is shared. If not, then:

16:21 | about scheduling. There was a question, I believe, a lecture back about how this works, and so: there are a few different schedules that are allowed in OpenMP for how to manage the scheduling of work in the work-sharing constructs. Maybe I'll talk a little bit to this slide, and then I'll show you a picture that probably makes it more intuitive what these things mean.

17:06 | So one kind is static: that simply means that at compile time one basically creates the code that carves up, for instance, loops among threads. And then there is a chunk attribute that says how large a piece of the work is being doled out to the different threads. And, of course, if the amount of work is not an even multiple of the chunk size, some threads get less. But exactly what

17:48 | happens when the total amount of work is not an even multiple of the chunk size — so it is not a whole number of chunks — is up to the implementation of OpenMP and how it chooses to create the chunk sizes.

18:12 | Dynamic is a little bit more complicated, in the sense that you still use the chunk size for carving up the work, but it is not statically decided how chunks are assigned to threads; again, the next slide will probably give you a better picture of that.

18:41 | And guided is even more flexible, in the sense that the given chunk size is just an initial chunk size, and then, depending upon the remaining work, the chunk size can be modified at run time. And I think the rest of the options give further flexibility. So let me show you this in a kind of picture.
19:12 | For the static one, there is a fixed chunk size — assuming again that the total work is a multiple of the chunk size; otherwise the last chunks will not necessarily be about the same size — and up until that point every thread gets the same size chunk, and they are doled out, like on this slide, in kind of a

19:40 | round-robin assignment order. The dynamic schedule also has a fixed chunk size that you can choose, but now the order is more arbitrary: each thread that finishes a chunk grabs the next chunk of the given size. So it is not the round-robin assignment that you get with the static form of the schedule.

20:12 | And then there are two examples of guided on this slide. The idea, in the case of what I guess is guided (a), is that the first set of chunk sizes is based on the work to be done and the chunk size given; then, as you proceed, the chunk size is determined by the amount of remaining work, so in that case chunk sizes may

20:40 | be reduced from the initial chunk size. And the guided (b) is kind of even more dynamic: in this case you get one chunk size, and then you may look at what remains after the first chunk and not just divide everything equally in every round. And then, again, the assignment is not predetermined; it can be arbitrary.

21:04 | So these are different ways of specifying how the work should be divided up among threads. And, as mentioned last time when we talked about overheads: yes, there may be benefits to the more flexible schedulings, but it also means that there is more work to be done at run time, so there is

21:41 | overhead again involved in being more flexible at run time. So, any questions on that? Again, those are clauses one can add to the constructs for which we have already given examples.
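A minimal sketch of how the schedule clause is spelled for the three kinds just shown (the chunk size of 16 is illustrative):

    #include <omp.h>

    void loops(double *a, int n) {
        /* static: iterations are carved into chunks of 16 and handed out round-robin */
        #pragma omp parallel for schedule(static, 16)
        for (int i = 0; i < n; i++) a[i] += 1.0;

        /* dynamic: each thread grabs the next chunk of 16 as it finishes one */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; i++) a[i] *= 2.0;

        /* guided: chunks start large and shrink with the remaining work,
           never below the given minimum of 16 */
        #pragma omp parallel for schedule(guided, 16)
        for (int i = 0; i < n; i++) a[i] -= 1.0;
    }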
22:04 | I guess next is the runtime functionality, and this is just a few examples; there has been some use of this already, and I'll come back to it in subsequent examples.

22:32 | So you can get the number of threads or cores being used, and the maximum number available; then you can set the number of threads to be assigned. And then the thread ID — yeah, that was already used in the examples I have shown.

23:00 | And this is the priority order that applies: Josh mentioned in the demo last time that there are environment variables, and those have the lowest priority, while the highest priority is what you assign for the parallel region itself.

23:18 | I think this is just a very simple example where one requests, in this case, four threads for this parallel region. And it is important, once again — another point that came up in the previous lecture — that these are strictly requests, and I

23:53 | think sometimes there is confusion, or one does not quite understand, that the threads are assigned by the operating system. I'll come to that in more detail in the next two slides. The operating system knows the situation at the current

24:21 | time, and if it is the case that resources are there, then it chooses to maximally assign the requested number of threads. If those numbers of threads are not available,

24:46 | or it is even more cautiously deemed not feasible for some reason — like overheating and other things in the system that it knows of — it may assign fewer threads. So be

25:08 | aware: just because one requests a number of threads does not mean that one gets that number of threads. There are, in these cases, runtime functions that you can use to find out how many threads were actually assigned to the current parallel region.

25:32 | And here is an example: if you actually run it — in this example the code actually got four threads for the parallel region, in line with what we have talked about earlier.
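A minimal sketch of the runtime calls being listed — requesting a number of threads and then asking how many were actually assigned:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(4);                 /* a request, not a guarantee */

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();     /* this thread's ID inside the region */
            #pragma omp single
            printf("team size actually assigned: %d\n", omp_get_num_threads());
            printf("hello from thread %d (max available: %d)\n",
                   tid, omp_get_max_threads());
        }
        return 0;
    }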
25:48 | So this is again more of what I already said: the operating system is in charge of the number of threads you actually get. The next slide is a bit of an eyesore, but you can find

26:10 | in the OpenMP specification exactly what the rules are for assigning threads. The simple-minded rule of thumb to remember is that the number of threads requested is kind of the maximum, and if the resources are there, that is what will be given; but

26:35 | you may get fewer threads, and it depends again on whether things are dynamically or statically assigned. The take-away message from this is: the request statement is more or less an upper limit on the threads assigned to the region.

27:04 | So this is, I guess, an example — all of a sudden using the number of processors — where dynamic adjustment in this case is disabled, and in this case one thread per processor is requested. For

27:28 | this particular case, a single construct is used to make sure that num_threads, which is a shared variable, is updated just once, by one of the threads; there is no need to have several threads updating the variable. It shouldn't make

27:50 | any difference in this case, when dynamic is zero, but it is kind of good practice. Yes, so that is just what is shown on this slide.
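A minimal sketch of the pattern just described, assuming it matches the slide — disable dynamic adjustment, request one thread per processor, and let a single thread record the actual team size:

    #include <omp.h>

    int num_threads;                              /* shared */

    void example(void) {
        omp_set_dynamic(0);                       /* do not let the runtime shrink the team */
        omp_set_num_threads(omp_get_num_procs()); /* request one thread per processor */

        #pragma omp parallel
        {
            /* one thread is enough to record how many threads we actually got */
            #pragma omp single
            num_threads = omp_get_num_threads();
        }
    }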
28:05 | Yes — and I think Josh also demoed last time that the thread IDs are local to each region; this is just the slide showing what was already mentioned in the last demo. And a little

28:19 | bit of terminology, I guess: parent and child is sometimes used — the thread that is there for the sequential part can be said to be the parent thread. But logically, maybe it is more

28:49 | natural to think of it, inside the parallel region, as being the thread that is, in fact, the parent thread continuing; so it is still in existence. All right, so any questions on that? Otherwise I'll go on and talk through an example.
29:10 | All right, let me get to it. Okay. So this is an example that is used by many in different forms, showing a simple OpenMP program and how to work with it, to show different behaviors depending upon how you write simple OpenMP code.

29:46 | So it turns out we are going to compute pi, by this particular way of doing numerical integration. Anyone of you who has taken an introduction to numerical analysis course knows you can compute the integral as basically the area under a curve, and the simplest approximation algorithm is to approximate it by a bunch of rectangles. Clearly, the smaller the rectangles are, the closer the area of the rectangles fits the curve; and this is known as the midpoint method, if I remember correctly.

30:31 | All right, so basically the idea is to compute the area of the rectangles from the height, in this particular picture, of each rectangle. One way is just, as here, to use the x value at some point in the rectangle for the function; in the next slide we use the midpoint of the base for the function value and then multiply it by the width of the rectangle, and then one gets the area for each one of the rectangles.

31:12 | So here is the sequential code. The step is the width of each rectangle; one takes the function value at the midpoint of the rectangle and then simply adds up all the rectangles. In this case there will be a little bit of a problem — we'll come to that — in terms of doing it in parallel, because there is kind of a recurrence in the formulation. So we'll talk about that.
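A minimal sketch of the kind of sequential midpoint-rule pi code being described (the constant and names on the actual slide may differ):

    #include <stdio.h>

    static const long num_steps = 100000;

    int main(void) {
        double step = 1.0 / (double)num_steps;
        double sum = 0.0;
        for (long i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;       /* midpoint of rectangle i */
            sum += 4.0 / (1.0 + x * x);        /* height of rectangle i */
        }
        double pi = step * sum;                /* multiply by the common width once */
        printf("pi ~= %.15f\n", pi);
        return 0;
    }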
31:44 | So here is now the first attempt at doing an OpenMP program from this sequential code. I thought I had the sequential code on the previous slide — okay, so hopefully you can remember what I said about it. So the idea in this case, to simplify, is to set just two threads, and,

32:18 | to avoid the sharing of the sum variable that, as you know, was a potential problem, each thread now gets its own sum slot, by having as many partial sums as there are threads. So each thread can basically write to a particular location; in this case the sum array is just an array of size two, but it could be whatever size — the number of threads they want to use.

32:54 | Because it is declared outside the parallel region, it is now a shared variable; but since each thread has its own slot in the sum array, there is no issue in updating the sum values in each slot, since only one thread writes a particular slot of the sum array.
33:18 | Then let's see what else is happening here in the parallel region. We declare, to get private variables inside the parallel region — the loop index i is the most important part: since i is a loop index, it is private in that way, but it is also used outside the for loop, so maybe it is just safest to declare i inside the parallel

33:53 | region as well. And then there is the variable for the number of threads, and this is a little bit tricky the way the example has done it: "nthreads" sort of has two spellings, so there are actually two separate variables, one private to each thread and then a global nthreads for the end result.

34:22 | So what happens in the parallel region: each thread gets its ID, and they also get the number of threads being used in the parallel region and update the private copy. But then what is done — to avoid, I guess, confusion from concurrent updates — is that a conditional is used so that only the master thread gets to update the global variable for the number of

34:59 | threads. And then there is a for loop for adding up the rectangle values, where each thread updates its own location in the array of partial-sum variables.
35:22 | The way this summation is done is kind of an assignment of rectangles to each thread in a round-robin fashion: as you can see, the loop variable is incremented by the number of threads. And since

35:46 | the sum array — I should say — is a shared variable, its content is persistent, so it is still available at the end of the parallel region. So when exiting the parallel region, there is then the summation of — it's not correct, I

36:15 | misspoke — the values, which are simply the function values at the midpoints of the rectangles; that is because one does not need to multiply by the width of the rectangle inside the loop, since it is the same width for all the rectangles. So, in fact, as you can see in the statement towards the end,

36:40 | you do the multiplication once you have the partial sums of the heights for each one of the threads, and then you simply multiply by the rectangle width — which could even have been done outside this for loop, at the very end. So this is again outside the

37:04 | parallel region, so there is no problem doing this in terms of correctness, but it is serial. So, in that sense, it is not necessarily what you want to do.
37:16 | And the other point that was made — this is the SPMD model, that is where that came up — yes, this is kind of a standard trick or model for how you can parallelize loops. In this case they are doing it round-robin; one can also do it otherwise, carving up the iterations differently. But in this case

37:46 | that for loop is executed by each thread, and the iterations of that inner loop are now divided up among the members of the thread group — it is parallel, not the sequential loop anymore. In this case, the programmer took the pains to divide the summation work between the threads him- or herself.
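A minimal sketch of the first parallel version being described — a round-robin SPMD loop with one partial-sum slot per thread (NUM_THREADS and the names are illustrative):

    #include <stdio.h>
    #include <omp.h>

    #define NUM_THREADS 2
    static const long num_steps = 100000;

    int main(void) {
        double step = 1.0 / (double)num_steps;
        double sum[NUM_THREADS];                 /* shared array: one slot per thread */
        int nthreads;

        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            int i, id = omp_get_thread_num();
            int nthrds = omp_get_num_threads();  /* private copy of the team size */
            if (id == 0) nthreads = nthrds;      /* only the master updates the global */

            /* round-robin: thread id takes iterations id, id+nthrds, id+2*nthrds, ... */
            for (i = id, sum[id] = 0.0; i < num_steps; i += nthrds) {
                double x = (i + 0.5) * step;
                sum[id] += 4.0 / (1.0 + x * x);  /* accumulate heights only */
            }
        }

        double pi = 0.0;
        for (int i = 0; i < nthreads; i++)       /* serial combine; multiply by the width once */
            pi += step * sum[i];
        printf("pi ~= %.15f\n", pi);
        return 0;
    }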
38:13 | And this is just — did they run this for more threads? Well, obviously, they did modify the number of threads and did not use just two threads. What we can see, in terms of

38:38 | execution time — the right column in this table, as a function of the number of threads — is that for two threads it works reasonably okay, but there isn't much improvement in performance for three or four threads. And there are still, in

39:02 | this case, 100,000 rectangles, so there is plenty of work to share among more than two threads. So, anyone has an idea how come we didn't get better performance, since there is lots of work to dole out to threads?
39:28 | Well, there was a sequential part at the end, in terms of summing up the partial function values and multiplying by the width; but with four threads there is still a lot of work in the parallel part and there are only four partial sums to add

39:52 | up. So even though there is serial work at the end, it is still very small compared to the parallel part. "If I had to take a

40:08 | shot at it, I would say that since the hyper-threading magic works by sharing resources, and they are all using the same, say, floating-point registers to do the actual math, maybe the hardware threads aren't able to act as independently as we expect them to, right?"

40:31 | This will, in fact, show up even if you didn't use hyper-threading — but it is a good point; I am not sure in this case. It is not quite the answer, but it is close to what is going on.
40:53 | So here is, in fact, the culprit. This is now where part of the reason comes into play for why I spent time talking about processor architectures and caches and cache lines. What happens in this case is what is known as

41:18 | false sharing. False sharing means that different threads share values in the same cache line, but not the same variable. So it is not sharing of a variable — in which case it would be proper sharing — but from the cache line perspective, that cache line is shared among most of those threads. And that comes from the

41:57 | fact that, as is kind of illustrated on this slide, this sum array is just a one-dimensional array, and it is laid out in successive locations in main memory. And that means that successive entries of the sum array are in the same cache line, until you have enough array elements to fill up a

42:32 | cache line. So sum[0], sum[1], sum[2] and sum[3] in this cartoonish example are in the same cache line. So if thread zero, say, is on core zero, on

42:54 | hardware thread zero — now, as I mentioned, and I will come back to it, I guess, in the next lecture (I don't think I'll get to it today), the default for the operating system is to spread things out, say, socket by socket, so core zero gets thread zero — and the machine here was a dual core, so that means, I guess, in this case, that there are two cores in play. So if you look more at the slide:

43:35 | sum[2] is needed by thread two, and that cache line is potentially sitting in the L1 cache of, in this case, the first core. And the L1 caches, at the closest point, are private to the cores. So if the other core wants it, it has to get it from the other core's cache. So then what happens is that

44:12 | the cache line ping-pongs: when the first core updates sum[0], then, through the cache coherence mechanism on Intel, the line gets invalidated in the L1 cache on the other core. So when that core reads its element, its copy is not valid, so it needs to get a copy from the other L1 cache. And then, when it updates things, the cache line

44:46 | in the first core's cache gets invalidated. So that means there is, as they call it, ping-ponging of the cache line between the two caches.
45:02 | "Yeah — at the beginning of the lecture, you mentioned something about a relaxed-consistency shared memory model. Do we make use of that in this case?" Well, for these things to be correct, either the cache coherence takes care of it, or otherwise one may need to explicitly use the flush command to

45:26 | make sure that the other threads see the last values. So in the presence of the cache coherence mechanisms for the cache lines, it takes care of it. "Since this is false sharing, it is not the same variable — and I guess I thought cache coherence was for the same variable." These are not the same variable that they refer

45:53 | to, because each thread only updates one particular element of the sum array. But because one element in the cache line is updated, say, by thread zero — as soon as that cache line is updated in the level-one cache of the first core, that cache line, the whole cache line, gets invalidated in the other cache through the cache coherence mechanism.

46:36 | So it is a different mechanism — we should be clear about that. It is because of the cache coherence that the cache line gets invalidated, even though the value itself may be perfectly fine. But since the coherence mechanism doesn't understand individual variables — it just works on cache lines — that is why it ends up

47:08 | being invalidated. So, good questions, right? It wasn't clear — does it make sense now? "Yeah, thank you." Okay, thanks, Frank, for having me clarify the difference between the two: the race condition is just based on the program logic and the variables, and this phenomenon is based on how the caching mechanism works.
47:49 | So how does one fix this? So now, one fix is to do this thing: padding. And actually this is even used in non-parallel programs sometimes, with respect to how

48:31 | the DRAM memory works — I did spend time talking about memory and DRAM and the fact that it has pages, and I'm not going to get into it again today — but this notion of padding is something one uses in order to get the desirable memory behavior and avoid poor memory performance. And

49:06 | for today it is OpenMP and cache line behavior, but the same trick of padding can be used there as well. And in this particular case the padding was done by using a two-dimensional array. So, as it says, the cache line is 64 bytes and we use doubles; that means, with double precision, eight bytes per data element, or

49:45 | eight data elements per cache line. That means that the sum array that is now used is a two-by-eight array. So now — you see, with the row-major ordering, that means that each thread's sum sits in its own row, and a row is a full cache line. So in this case, each thread's sum occupies a complete cache line. So now, when anything gets updated, it does not

50:38 | invalidate the cache line in the other core's cache. So now things should work better.
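A minimal sketch of the padding fix just described — each thread's partial sum gets its own cache line (a PAD of 8 doubles assumes a 64-byte line; the names are illustrative):

    #include <omp.h>

    #define NUM_THREADS 2
    #define PAD 8                                 /* 8 doubles = 64 bytes, one cache line */
    static const long num_steps = 100000;

    double pi_padded(void) {
        double step = 1.0 / (double)num_steps;
        double sum[NUM_THREADS][PAD];             /* only column 0 of each row is used */
        int nthreads;

        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            int i, id = omp_get_thread_num();
            int nthrds = omp_get_num_threads();
            if (id == 0) nthreads = nthrds;
            for (i = id, sum[id][0] = 0.0; i < num_steps; i += nthrds) {
                double x = (i + 0.5) * step;
                sum[id][0] += 4.0 / (1.0 + x * x);
            }
        }

        double pi = 0.0;
        for (int i = 0; i < nthreads; i++)
            pi += step * sum[i][0];
        return pi;
    }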
50:49 | And it does. So now, as you can see, with one, two, three and four threads there is still an improvement in performance. It is not, you know, linear — it is not perfect — but it is not too bad, in the

51:14 | sense that two threads takes a little bit more than half the time of one thread, and four threads takes a little bit more than half the time of two threads. So it is a pretty good speedup behavior as a function of the number of threads.

51:36 | But does anyone have comments about this way of doing business? "We are wasting — laying it out like that wastes space." Yes, a considerable amount, right? So this

51:54 | is clearly not the preferred way of doing business, so there are other ways to get to where you want to be. And the other part is: you would need to know the cache line size for the particular platform you are going to use.

52:11 | Now, it turns out that for the L1 caches most of them are 64 bytes, but not all, and for the higher-level caches there is some variability between 64 and 128 bytes. But — and this was a little bit of

52:28 | my justification for talking about architecture details in my previous lectures — one sometimes needs to be aware of it, and yes, it varies.
52:44 | So this is now a way of not explicitly tailoring things to the caches, but, by being conscientious about how caches work, trying to write the code so that it does not have

53:01 | the problem. So in this case you have the parallel region, and now, by making sum a private variable for each thread in the parallel region, there is no problem for the different threads to update their sum variables, because they all have their own private sum variable. So there are no issues of race conditions

53:38 | or anything in updating sum. And this is the type of code where the programmer again figured out how to parallelize the loop him- or herself, using this round-robin principle. But now the sum variables are private to each thread, so

54:06 | before one gets out of the parallel region — or rather, one has to sum up the sum variables for the different threads before one exits the parallel region. So that is why there is

54:30 | a serial portion now inside the parallel region, using this critical construct to make sure that the summation is done one thread at a time, to avoid a

54:44 | race condition. And now, as you can see, in this case we did not do the

54:54 | padding trick, we did not waste memory; but, by being conscientious about the cache line invalidation and keeping things local to threads, one in fact gets the same level of speedup as when using the padding construct or idea. Any comments or questions on this?
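A minimal sketch of this third version — a scalar sum that is private to each thread, combined inside the region under critical (names illustrative):

    #include <omp.h>

    #define NUM_THREADS 2
    static const long num_steps = 100000;

    double pi_critical(void) {
        double step = 1.0 / (double)num_steps;
        double pi = 0.0;

        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            int i, id = omp_get_thread_num();
            int nthrds = omp_get_num_threads();
            double sum = 0.0;                    /* private: declared inside the region */

            for (i = id; i < num_steps; i += nthrds) {
                double x = (i + 0.5) * step;
                sum += 4.0 / (1.0 + x * x);
            }

            /* one thread at a time folds its private sum into the shared result */
            #pragma omp critical
            pi += step * sum;
        }
        return pi;
    }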
55:28 | "So we achieved the same time without the extra memory overhead?" Yes — so there is no memory overhead, and there is no tying it to a particular architecture in the sense of cache line sizes. There is

55:51 | just an awareness, I would say, of how the caching mechanisms work, and trying to make sure that there is no kind of false sharing — in this case, that things in the same cache lines are used by multiple threads.
56:16 | "The question I had: what is the reason for cyclically distributing the loop iterations instead of just cutting them up into partitions?" In this case, one could equally well have used the parallel for construct and let the compiler figure out how to split up the loop. Okay, so it is a matter of choice — right. So I

56:45 | think — and I did not write this example, I am honest about it — I think the author of the example just wanted to show a way in which the programmer may parallelize the loop him- or herself. And in some other cases it may be preferable to tailor the splitting of the loop to the content of the loop. In this case it is a

57:09 | pretty trivial loop — a few set statements and the for statement — so there is really not anything, I would say, very clever about it. So I think it was just to highlight one particular way of parallelizing a loop. And, as

57:33 | we talked about: instead of critical, since it is a single variable, one could also use atomic here.
57:39 | So now the next thing that has been done — since you mentioned reduction — is to fix the negative part of the last version, which was good in many ways: it

57:53 | was not tailored to a particular architecture or platform, I would say, but it has the

58:02 | sequential addition of the partial sums of each thread. So, with the hope that there is a good reduction implementation, one can use the reduction clause, and use, in

58:17 | this case, the parallel for statement. Then one does not have the issue with the accumulation of the sum variables, because the reduction clause is supposed to make

58:35 | sure that that happens correctly. And in this final version there is then only — just as I mentioned earlier, since the width of the rectangles is the same for all of them — one multiplication to do, in the very last statement. That is the way it is being done in this case.
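A minimal sketch of the reduction version just described (again illustrative, not the literal slide code):

    #include <omp.h>

    static const long num_steps = 100000;

    double pi_reduction(void) {
        double step = 1.0 / (double)num_steps;
        double sum = 0.0;

        /* each thread accumulates into a private copy of sum;
           the runtime combines the copies at the end of the loop */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }

        return step * sum;                /* one multiplication by the width at the end */
    }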
58:56 | And, I guess, in this case it turns out that the reduction implementation was not quite as efficient as the other ways of doing business. If you look

59:10 | at the last column in the table, using the reduction clause, the speedup is not quite as good as with critical — or one could have used atomic as well — but that depends on the implementation; it is nothing inherent, nothing that says that reduction would be significantly worse.

59:40 | "Does the reduction clause only have the preset options, or does it generalize — are you able to use whatever map-reduce-style combiner you can come up with?" So the reduction in this case is a function implemented as

59:54 | part of the runtime library supporting OpenMP, and it has a collection of operators that it does support. So one can certainly write one's own reduction,

60:17 | but then it would not be a clause on the statement. Maybe I misunderstood your question. "I was just wondering if that means you can't just pass any sort of functor in there — whether the specific combiners that they use are tailored and implemented in a way that makes it fast." Yes — to my knowledge, none of the

60:43 | OpenMP versions support passing in a function that you wrote instead of using plus in the reduction clause. "Okay — yeah, I think it only does elementary operations; you can't define a function for it." That's a good question, I think, but to

61:05 | my knowledge they have not gone to the point where they can take a user-defined function.
61:25 | So, summing up the example: there is dealing with race conditions, there is dealing with false sharing, and then, potentially, using the reduction operator to avoid serial segments of the code for adding up the individual thread sums into a global variable.
61:55 | Okay, so now we are getting into a different construct, the task construct, and I will pause for a few seconds or a minute to see if there are questions.

62:17 | So this is now getting into making things even a little bit more flexible, but also more complex. So we are heading into that — unless there are additional questions about OpenMP in terms of the constructs and clauses so far — and I will start talking about the task construct and working with OpenMP tasks.

63:05 | So, after the initial design, so to speak, ages ago — in terms of trying to take some of the details of threads-based shared memory programming away from the programmer and leaving them to the compiler — the discovery was that it did not have the

63:30 | flexibility that one sometimes needs in practice, when people got to writing more serious codes. And it also turned out that when someone has recursive algorithms, there is trouble doing them with the initial set of constructs that was present in OpenMP. So the task construct was created, and that is what I will talk about now.
64:16 | Yeah — so the task construct is more flexible, because there is more to it than individual threads. But

64:31 | in some ways, as is said on one of the slides that comes later, one can view a parallel region as a set of tasks, where each task is assigned to a thread. But then there is still a sort of restricted set of things a task gets to do, and of ways to manage the data environment in particular, as well as the execution order.

65:01 | So the task construct has, with it, not only the code to execute: it also has more of a data environment and what they call the internal control variables. So it is richer, but that also implies more overhead, because there is more of a machine that needs to be set up to manage tasks as opposed to threads.

65:43 | So the task-parallel version of these things is where the color-coded boxes are now tasks, and the different tasks are different pieces of code — as opposed to what we have seen so far, where in a parallel region each thread gets a copy of the same code. Now tasks can have different pieces of code; but, like with threads, there is no

66:13 | particular execution order defined between tasks. So in this case — as the pile in the cartoon to the right of this slide shows — it is arbitrary in what order they start and how they proceed. So this slide

66:40 | says more or less what I just said: one can think of the threads in the initial version, or in a parallel region, as tasks, but they are much more restricted than the task construct.
66:56 | So here is the task construct: it is placed on a structured block, and I'll give trivial examples in a little bit. Then there are slightly different scoping rules for data than for a parallel region, for tasks. And then there are separate scheduling points

67:21 | for how to coordinate tasks, which I do not want to get into just yet. And then there is the barrier, kind of like for parallel regions; and then there is also a kind of wait as well, the taskwait.
67:51 | Yeah, so here is the simplest task construct example: there is a parallel region, and inside this parallel region there is

68:01 | first an OpenMP single construct. That is used — and this is, as I say on one of the other slides, the typical pattern — so that one thread creates all the task contexts that are necessary for the tasks that follow inside the single construct. So in that case, what it

68:38 | means is that each task will be executed by a single thread. But, as one of the slides says, there is no particular implied order between these tasks. So if one needs an order — if one wants a certain order, or there are dependencies — one will need other constructs to assure that order. Okay.
69:14 | Now, I guess I had some examples — here is an example that may help remind you how OpenMP works, as well as how this task construct works. So this is a simple sequential program; it is no surprise that, if you run it,

69:34 | it does this, because it is sequential code: the print statements are executed one after the other, and that is what happens. So now I want

69:44 | to do the parallel version. Now, I guess, if you use two threads, each thread gets a copy of this code, so each thread is going to print "a race car". But there are, you know, three print statements and, in this case, two threads. So, anyone wants to volunteer what we will

70:15 | potentially see? It is a print statement — so it is again just to remind you how OpenMP works: there is no particular order in which the two threads in this case execute these print statements.
70:51 | Okay — so with two threads, that's right: in this case each thread printed out "a race car" in its entirety. In this case that's true, and they came out nicely ordered, but it could have been pretty much any combination of them. For any one of the two threads, "race"

71:20 | would always precede "car" for that particular thread; but the threads can progress at different paces, so it could be fairly jumbled in terms of what comes out — you can get various permutations if you try it a few times. And that is just standard OpenMP, before the task construct.
|
71:49 | to see the impact on and off single statement now would you expect anything |
|
|
72:01 | at this time? You can be that it'll print a race car on |
|
|
72:05 | entirety for each thread. Yes, of. But there's just a single |
|
|
72:12 | . In this case. They only it once. Yeah, because of |
|
|
72:17 | single thread is enforced. So even there is in the after open |
|
|
72:22 | But then there's the restriction that on single threat can execute. It's not |
|
|
72:31 | critical region, so it's not one the time. It's just one |
|
|
72:39 | So, that is what I said. So now we are trying to do the task-parallel version — similar to the construct as such, but in terms of this task construct. So now, what do we get? If we look at it,

73:11 | the code will first generate a team of threads with the parallel construct; but then we decided that for the next thing we are only going to use one thread — except that, the moment I use the task construct, that means each

73:36 | of those tasks gets assigned to a thread. Now, the print of "I think" is just subject to the single-thread requirement, because it is not preceded by a task directive, so

73:56 | we should only get that printed once. But then we have the other two prints, "race" and "car"; those should each be printed once. However, the order is not

74:16 | guaranteed, because the tasks can be executed in any order. So that means that, depending upon how many times you run it, you may get "race car", as the sequential version did, but we could also get a thread printing the "car" task ahead of the "race". So

74:45 | again, that is because the ordering between printing "race" and printing "car" is arbitrary; "I think" is always first because it is outside the task constructs.
75:01 | Now, one more, I guess, example. So here I just added "is fun to watch"; that is not part of a task construct.

75:20 | So, anyone? Well, maybe it is easy: "I think" should always be printed first, right, because that is in the single-thread region; "is fun to watch" should also be printed only once, because it is also in the single, non-parallelized part. But we still have the task statements, which can be executed in an arbitrary order. So this

75:49 | is kind of what might happen: it may happen that "is fun to watch" comes out before any one of the task constructs gets executed. So the sequential order — the single region — may

76:15 | get both of its print statements done before any of the task regions gets to print what it is supposed to print. All right. Thank you.
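A minimal sketch of the task example just discussed (this follows the commonly used "race car" illustration; the exact strings on the slides may differ):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            #pragma omp single
            {
                printf("I think ");          /* printed once, by the thread executing single */

                #pragma omp task
                printf("race ");             /* deferred: may run on any thread, in any order */

                #pragma omp task
                printf("car ");              /* ditto: "car" may appear before "race" */

                printf("is fun to watch\n"); /* also printed once, possibly before the tasks run */
            }
        }
        return 0;
    }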
76:37 | Okay — I want to stop there, two minutes early, unless there is a question. If there is no question, I'll talk a little about this. So far, the essence of what I

76:52 | said about the task construct is that it adds a level of management that the parallel region construct by itself does not have; but the execution order among tasks in a parallel region is arbitrary, just like the relative progress of threads in a parallel region is kind of arbitrary. So there is this

77:20 | sort of hierarchy that goes from threads to tasks to processes. A process is the most heavyweight, because it carries a lot more context with it; threads are at the other end, with the least context carried with them;

77:44 | and tasks are kind of in the middle, sort of. A task is still a shared memory construct, so it has the benefits of things being shared, but it has more flexibility and allows for different code to be executed by different tasks, whereas each thread executes a copy of the same code.

78:14 | So, going from just threads, the task adds flexibility but also increased overhead to set things up; and if you go further, on to processes, there are even more things that need to get created and set up, but it is also more flexible — and it is no longer necessarily the case that the process model is limited to shared memory — with, again, a lot of things that the process carries with it.

78:54 | So with that, I guess I'll take more questions. But we will start next time by talking about what is grayed out here in terms of the task constructs, and then we'll talk about affinity and how you can force threads to be placed where you want them to be, instead of having the operating system decide it for you. So that is what we will do on Wednesday.
79:32 | And then, I think, yes — for the demo: a demo of some of the constructs that have not been demoed yet, including the task construct.

79:55 | So, any other questions, whether about lecture material or assignments? If not, I'll stop sharing my screen

80:22 | and now stop the recording.