00:00 | So today I'll continue talking about OpenMP, and I will probably not finish today either, but that's no problem; what I think matters is to do the exercise and gain insight into how OpenMP works. So this is a little bit of today's agenda: finish talking about the control-flow constructs in

00:31 | OpenMP, then a bit about the runtime, then spend some time going through an example of using OpenMP for a small program, and then get into yet another construct, tasks, for OpenMP — and we'll see how far we get on tasks.
00:57 | So, where we got to: the if clause is where we ended yesterday. To recall what we covered yesterday: the parallel construct, which is fairly simple and straightforward; then the synchronization, the barrier, where all threads must be present at the barrier before any of them proceeds; and critical, which is just a section of code that only one

01:27 | thread at a time can execute — in this case it is just a single statement, but it can be used for any block of code that needs to be executed by one thread at a time for correctness; we talked about that. And then

01:44 | there was atomic, which is equivalent to critical as long as it is just a single memory location that is involved. So atomic is just for an assignment-type statement, whereas critical can be applied to some body of code.
02:04 | Yeah, so here is where I was at the end of the last lecture: the if clause. In this case a conditional statement is used, and if the condition is true, then the parallel region gets executed in parallel; otherwise it is serialized. That means only the master thread — the parent thread — will execute the region.

02:35 | So, any idea why one would want to do that? Let's see if we can get questions, or responses to questions. "There might be more overhead in splitting up the work, as opposed to just letting one thread do it." Right, exactly: spawning a team of threads is not entirely free;

02:58 | it involves a fair amount of overhead. So one should, as a general rule — whether it's using OpenMP or, later, MPI or some other programming model — be careful when one uses a parallel construct, because there is overhead, and depending on the programming paradigm it may be quite significant. So don't use parallelism if one can do the work in reasonable time on a single thread.
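A minimal sketch of the kind of if clause being described (the threshold and names here are illustrative, not taken from the slide):

    #include <omp.h>

    void scale(double *a, int n) {
        /* Only spawn a thread team when there is enough work to amortize the
           overhead; otherwise the region runs serially on the master thread. */
        #pragma omp parallel for if(n > 10000)
        for (int i = 0; i < n; i++)
            a[i] *= 2.0;
    }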
03:28 | Then there is the nowait clause, which allows threads that are done with, in this case, a for loop to not wait at the implied synchronization at the end of the parallel for. So here is, I guess, an example:

03:52 | for the first for loop there is no implicit barrier, so threads do not wait at the end of the OpenMP for until all threads have finished the first for loop; the threads that are done can proceed beyond that point to the second for loop.

04:15 | That's basically what nowait allows you to do, and depending on what the computation is, that can give real benefits.
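A minimal sketch of the nowait pattern being described (the loop bodies are placeholders, and the two loops must be independent for this to be safe):

    #include <omp.h>

    void two_loops(double *a, double *b, int n) {
        #pragma omp parallel
        {
            /* nowait removes the implicit barrier at the end of this for, so a
               thread that finishes its share can move straight to the next loop. */
            #pragma omp for nowait
            for (int i = 0; i < n; i++)
                a[i] = a[i] * a[i];

            #pragma omp for
            for (int i = 0; i < n; i++)
                b[i] = b[i] + 1.0;
        }
    }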
04:35 | Then there is now reduction. Reduction is an operation that is included in many parallel programming languages; OpenMP is not a full programming language, but they chose to include it as a clause that one can apply to some of the constructs. That's one use, and here's now a little example of how to use

05:06 | reduction. Reduction is simply what this slide shows: it reduces a whole array, in this case, to a single value by adding up all the elements of the array, and then it computes the average by dividing by the number of elements of the array after the fact.

05:31 | What one can do, first without the reduction clause, is simply write, as part of a for loop, code that adds up the values of the array into the variable ave. So, does anyone think this is or isn't a good idea? Before I go on — there is probably a

06:09 | guess that there are some issues; anyone volunteer an issue? "It'll be writing maybe not necessarily in the order that you expect it to, so you could have a thread on iteration zero while, say, a third thread is writing its iteration into ave, so it wouldn't be accumulating correctly, because, say, the thread that is on iteration one hasn't gotten the updated value from the other thread — it's hard to put into words."

06:47 | Yeah, that's close, but it's not quite it in this case, because it's not written as an iterative addition in terms of ave; it's just incrementing the ave variable. In this case, if ave had been an integer the ordering would be harmless; if it's floating point, the order of addition can actually matter for the end result. But the problem in this case is

07:21 | that more than one thread may try to update the global variable ave at the same time, and then only one of them gets to do it. So in this case, some of the array elements of a may not, in fact, end up being added into the summation variable ave, because ave is a shared variable.

07:44 | That's why it is kind of a race condition, and one can fix the race condition by using critical, as we talked about just a short while ago.
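A minimal sketch of the racy averaging loop being described (names are illustrative; the point is that ave is shared and updated by every thread without protection):

    #include <omp.h>

    double average_racy(const double *a, int n) {
        double ave = 0.0;                /* shared accumulator */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            ave += a[i];                 /* data race: unsynchronized read-modify-write */
        return ave / n;                  /* divide by the element count after the fact */
    }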
07:59 | And here is an updated version that illustrates this: in this case a local copy of ave is declared as a private variable, and this clause in fact gets inherited also by the nested for loop, since there are two loops, an outer one and an inner one. In this case, because it is

08:32 | private to each thread, one gets as many partial sums — the local ave — as there are threads; each thread adds whatever piece of the array it works on into its local variable, and then it remains to sum up all the per-thread sums of the elements of the array

09:01 | a. To do that without running into a race condition, one can use the critical construct to make sure that they all get properly added. The proof that this

09:16 | is true is in the image on the right of this particular slide, so you can see it: in this case there were four threads being used, the initialization of a is on the left-hand side, and if you look at the local ave for each one of the threads, you can see that the sum that ends up for each

09:47 | one of the threads is the sum of the two values of a that that thread is responsible for. So it did in fact work correctly in terms of each thread summing up its values of the array a. So this is

10:06 | what I just said: critical is one thread at a time. So the inner for loop is parallel to the extent that you have threads, but then there is a sequential part in adding up the local partial sums. And that is what one can fix with the reduction clause.
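A minimal sketch of the version just described — a private per-thread partial sum combined under critical (the general pattern, not the literal slide code):

    #include <omp.h>

    double average_critical(const double *a, int n) {
        double ave = 0.0;
        #pragma omp parallel
        {
            double ave_local = 0.0;      /* private partial sum per thread */

            #pragma omp for
            for (int i = 0; i < n; i++)
                ave_local += a[i];

            /* one thread at a time folds its partial sum into the shared total */
            #pragma omp critical
            ave += ave_local;
        }
        return ave / n;
    }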
10:36 | So here is how the reduction is used: it is actually a clause that is added, in this case, to the parallel for construct, and it says that there is to be a plus-reduction. One could have a string of multiplications as

11:01 | well; in that case multiply would be the operator in the reduction clause. In this case the operator is plus and the variable is ave. So now, in this case, things should be correctly done, due to the semantics of the reduction clause, which has to make sure that all the

11:28 | different thread values are added correctly. One may assume that there is associativity in the plus operation, so regardless of the order, everything should end up added into the ave variable. And here is just a list of the types of operations that the reduction clause supports.
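A minimal sketch of the averaging loop rewritten with the reduction clause (same caveat about names):

    #include <omp.h>

    double average_reduction(const double *a, int n) {
        double ave = 0.0;
        /* each thread gets a private copy of ave initialized for '+';
           the runtime combines the copies into the shared ave at the end */
        #pragma omp parallel for reduction(+:ave)
        for (int i = 0; i < n; i++)
            ave += a[i];
        return ave / n;
    }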
12:09 | "A question on the reduction clause — how efficient is it?" In the example I'll give in a bit we will use the reduction clause, so we will see to what extent it is efficient; but that is an implementation issue, it has to be said. And of course, the combining itself can be done in parallel.

12:38 | Does anyone have an idea how you would do it in a parallel context, better than a serial version — for addition, for instance? "Could you repeat the question, Johnson?" Yes, I was asking

12:52 | whether anyone has an idea how to do a parallel addition, if you have an array of elements to be summed up. "Could you just partition it, find the sum of each partition, and then add those up together?" Correct. So basically you can

13:17 | treat it as a binary tree and just sum pairwise: at the leaves of the tree you add the elements pairwise in parallel, and then the next level up the tree adds pairs of the partial sums, until you basically have two values left to sum up when you are at the root of the tree.

13:39 | That is the way you can do it in a parallel context, and in these shared memory models everything is visible, so it is reasonably easily done.
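A minimal sketch of the tree-style pairwise summation just described (assumes n is a power of two; in practice one would rely on the reduction clause or per-thread partial sums instead):

    #include <omp.h>

    /* Pairwise tree summation: at each level, element i absorbs element i+stride.
       After log2(n) levels, a[0] holds the total. */
    void tree_sum(double *a, int n) {
        for (int stride = 1; stride < n; stride *= 2) {
            #pragma omp parallel for
            for (int i = 0; i < n; i += 2 * stride)
                a[i] += a[i + stride];
        }
    }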
13:55 | Now, a little bit more: as I said, the trickiest part in getting things both performing well and correct is the sharing issue, and the next couple of slides touch on that. So this is a kind of picture of how things may actually be at

14:17 | any given state of the computation: a variable, like a in this particular case, may have copies in different places that are not necessarily consistent at any given time — in all the different caches on the different processors. That is

14:40 | what is known as the relaxed consistency model that OpenMP supports. So, depending upon how cache coherency is implemented, one may need to be more or less careful about how the different threads interact in terms of reads and writes, and that is

15:02 | essentially what this slide says. There is then a command for synchronization that is known as the flush command, that takes care of making sure that things become consistent across the different memories in a typical multiprocessor today. And it basically

15:28 | says: don't write code where you depend on the reads and writes happening in a particular order to get the correct results; in that case, make sure to use, again, implicit or explicit synchronization, or use the flush command, to make sure that all threads, when they need to interact, see the same value.
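A minimal sketch of the kind of hand-off that flush is meant for (illustrative only; in modern code one would typically use atomics or higher-level synchronization rather than a spin loop):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int data = 0, flag = 0;

        #pragma omp parallel sections shared(data, flag)
        {
            #pragma omp section
            {                                    /* producer */
                data = 42;
                #pragma omp flush(data)          /* publish data before the flag */
                flag = 1;
                #pragma omp flush(flag)
            }
            #pragma omp section
            {                                    /* consumer */
                int seen = 0;
                while (!seen) {                  /* spin until the flag becomes visible */
                    #pragma omp flush(flag)
                    seen = flag;
                }
                #pragma omp flush(data)          /* then re-read a consistent data */
                printf("data = %d\n", data);
            }
        }
        return 0;
    }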
16:00 | So now, any questions on those things? It is something I mentioned before: one has to be very careful about what is local to a thread or shared, and how it is shared. If not, then:

16:21 | about scheduling. There was a question, I believe, a lecture back about how this works, and so: there are a few different schedules that are allowed in OpenMP for how to manage the scheduling of work in the work-sharing constructs. Maybe I'll talk a little bit to this slide, and then I'll show you a picture that probably makes it more intuitive what these things mean.

17:06 | So one kind is static: that simply means that at compile time one basically creates the code that carves up, for instance, loops among threads. And then there is a chunk attribute that says how large a piece of the work is being doled out to the different threads. And, of course, if the amount of work is not an even multiple of the chunk size, some threads get less. But exactly what

17:48 | happens when the total amount of work is not an even multiple of the chunk size — so it is not a whole number of chunks — is up to the implementation of OpenMP and how it chooses to create the chunk sizes.

18:12 | Dynamic is a little bit more complicated, in the sense that you still use the chunk size for carving up the work, but it is not statically decided how chunks are assigned to threads; again, the next slide will probably give you a better picture of that.

18:41 | And guided is even more flexible, in the sense that the given chunk size is just an initial chunk size, and then, depending upon the remaining work, the chunk size can be modified at run time. And I think the rest of the options give further flexibility. So let me show you this in a kind of picture.
19:12 | For the static one, there is a fixed chunk size — assuming again that the total work is a multiple of the chunk size; otherwise the last chunks will not necessarily be about the same size — and up until that point every thread gets the same size chunk, and they are doled out, like on this slide, in kind of a

19:40 | round-robin assignment order. The dynamic schedule also has a fixed chunk size that you can choose, but now the order is more arbitrary: each thread that finishes a chunk grabs the next chunk of the given size. So it is not the round-robin assignment that you get with the static form of the schedule.

20:12 | And then there are two examples of guided on this slide. The idea, in the case of what I guess is guided (a), is that the first set of chunk sizes is based on the work to be done and the chunk size given; then, as you proceed, the chunk size is determined by the amount of remaining work, so in that case chunk sizes may

20:40 | be reduced from the initial chunk size. And the guided (b) is kind of even more dynamic: in this case you get one chunk size, and then you may look at what remains after the first chunk and not just divide everything equally in every round. And then, again, the assignment is not predetermined; it can be arbitrary.

21:04 | So these are different ways of specifying how the work should be divided up among threads. And, as mentioned last time when we talked about overheads: yes, there may be benefits to the more flexible schedulings, but it also means that there is more work to be done at run time, so there is

21:41 | overhead again involved in being more flexible at run time. So, any questions on that? Again, those are clauses one can add to the constructs for which we have already given examples.
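A minimal sketch of how the schedule clause is spelled for the three kinds just shown (the chunk size of 16 is illustrative):

    #include <omp.h>

    void loops(double *a, int n) {
        /* static: iterations are carved into chunks of 16 and handed out round-robin */
        #pragma omp parallel for schedule(static, 16)
        for (int i = 0; i < n; i++) a[i] += 1.0;

        /* dynamic: each thread grabs the next chunk of 16 as it finishes one */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; i++) a[i] *= 2.0;

        /* guided: chunks start large and shrink with the remaining work,
           never below the given minimum of 16 */
        #pragma omp parallel for schedule(guided, 16)
        for (int i = 0; i < n; i++) a[i] -= 1.0;
    }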
22:04 | I guess next is the runtime functionality, and this is just a few examples; there has been some use of this already, and I'll come back to it in subsequent examples.

22:32 | So you can get the number of threads or cores being used, and the maximum number available; then you can set the number of threads to be assigned. And then the thread ID — yeah, that was already used in the examples I have shown.

23:00 | And this is the priority order that applies: Josh mentioned in the demo last time that there are environment variables, and those have the lowest priority, while the highest priority is what you assign for the parallel region itself.

23:18 | I think this is just a very simple example where one requests, in this case, four threads for this parallel region. And it is important, once again — another point that came up in the previous lecture — that these are strictly requests, and I

23:53 | think sometimes there is confusion, or one does not quite understand, that the threads are assigned by the operating system. I'll come to that in more detail in the next two slides. The operating system knows the situation at the current

24:21 | time, and if it is the case that resources are there, then it chooses to maximally assign the requested number of threads. If those numbers of threads are not available,

24:46 | or it is even more cautiously deemed not feasible for some reason — like overheating and other things in the system that it knows of — it may assign fewer threads. So be

25:08 | aware: just because one requests a number of threads does not mean that one gets that number of threads. There are, in these cases, runtime functions that you can use to find out how many threads were actually assigned to the current parallel region.

25:32 | And here is an example: if you actually run it — in this example the code actually got four threads for the parallel region, in line with what we have talked about earlier.
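A minimal sketch of the runtime calls being listed — requesting a number of threads and then asking how many were actually assigned:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(4);                 /* a request, not a guarantee */

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();     /* this thread's ID inside the region */
            #pragma omp single
            printf("team size actually assigned: %d\n", omp_get_num_threads());
            printf("hello from thread %d (max available: %d)\n",
                   tid, omp_get_max_threads());
        }
        return 0;
    }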
25:48 | So this is again more of what I already said: the operating system is in charge of the number of threads you actually get. The next slide is a bit of an eyesore, but you can find

26:10 | in the OpenMP specification exactly what the rules are for assigning threads. The simple-minded rule of thumb to remember is that the number of threads requested is kind of the maximum, and if the resources are there, that is what will be given; but

26:35 | you may get fewer threads, and it depends again on whether things are dynamically or statically assigned. The take-away message from this is: the request statement is more or less an upper limit on the threads assigned to the region.

27:04 | So this is, I guess, an example — all of a sudden using the number of processors — where dynamic adjustment in this case is disabled, and in this case one thread per processor is requested. For

27:28 | this particular case, a single construct is used to make sure that num_threads, which is a shared variable, is updated just once, by one of the threads; there is no need to have several threads updating the variable. It shouldn't make

27:50 | any difference in this case, when dynamic is zero, but it is kind of good practice. Yes, so that is just what is shown on this slide.
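A minimal sketch of the pattern just described, assuming it matches the slide — disable dynamic adjustment, request one thread per processor, and let a single thread record the actual team size:

    #include <omp.h>

    int num_threads;                              /* shared */

    void example(void) {
        omp_set_dynamic(0);                       /* do not let the runtime shrink the team */
        omp_set_num_threads(omp_get_num_procs()); /* request one thread per processor */

        #pragma omp parallel
        {
            /* one thread is enough to record how many threads we actually got */
            #pragma omp single
            num_threads = omp_get_num_threads();
        }
    }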
28:05 | Yes — and I think Josh also demoed last time that the thread IDs are local to each region; this is just the slide showing what was already mentioned in the last demo. And a little

28:19 | bit of terminology, I guess: parent and child is sometimes used — the thread that is there for the sequential part can be said to be the parent thread. But logically, maybe it is more

28:49 | natural to think of it, inside the parallel region, as being the thread that is, in fact, the parent thread continuing; so it is still in existence. All right, so any questions on that? Otherwise I'll go on and talk through an example.
29:10 | All right, let me get to it. Okay. So this is an example that is used by many in different forms, showing a simple OpenMP program and how to work with it, to show different behaviors depending upon how you write simple OpenMP code.

29:46 | So it turns out we are going to compute pi, by this particular way of doing numerical integration. Anyone of you who has taken an introduction to numerical analysis course knows you can compute the integral as basically the area under a curve, and the simplest approximation algorithm is to approximate it by a bunch of rectangles. Clearly, the smaller the rectangles are, the closer the area of the rectangles fits the curve; and this is known as the midpoint method, if I remember correctly.

30:31 | All right, so basically the idea is to compute the area of the rectangles from the height, in this particular picture, of each rectangle. One way is just, as here, to use the x value at some point in the rectangle for the function; in the next slide we use the midpoint of the base for the function value and then multiply it by the width of the rectangle, and then one gets the area for each one of the rectangles.

31:12 | So here is the sequential code. The step is the width of each rectangle; one takes the function value at the midpoint of the rectangle and then simply adds up all the rectangles. In this case there will be a little bit of a problem — we'll come to that — in terms of doing it in parallel, because there is kind of a recurrence in the formulation. So we'll talk about that.
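A minimal sketch of the kind of sequential midpoint-rule pi code being described (the constant and names on the actual slide may differ):

    #include <stdio.h>

    static const long num_steps = 100000;

    int main(void) {
        double step = 1.0 / (double)num_steps;
        double sum = 0.0;
        for (long i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;       /* midpoint of rectangle i */
            sum += 4.0 / (1.0 + x * x);        /* height of rectangle i */
        }
        double pi = step * sum;                /* multiply by the common width once */
        printf("pi ~= %.15f\n", pi);
        return 0;
    }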
31:44 | So here is now the first attempt at doing an OpenMP program from this sequential code. I thought I had the sequential code on the previous slide — okay, so hopefully you can remember what I said about it. So the idea in this case, to simplify, is to set just two threads, and,

32:18 | to avoid the sharing of the sum variable that, as you know, was a potential problem, each thread now gets its own sum slot, by having as many partial sums as there are threads. So each thread can basically write to a particular location; in this case the sum array is just an array of size two, but it could be whatever size — the number of threads they want to use.

32:54 | Because it is declared outside the parallel region, it is now a shared variable; but since each thread has its own slot in the sum array, there is no issue in updating the sum values in each slot, since only one thread writes a particular slot of the sum array.
33:18 | Then let's see what else is happening here in the parallel region. We declare, to get private variables inside the parallel region — the loop index i is the most important part: since i is a loop index, it is private in that way, but it is also used outside the for loop, so maybe it is just safest to declare i inside the parallel

33:53 | region as well. And then there is the variable for the number of threads, and this is a little bit tricky the way the example has done it: "nthreads" sort of has two spellings, so there are actually two separate variables, one private to each thread and then a global nthreads for the end result.

34:22 | So what happens in the parallel region: each thread gets its ID, and they also get the number of threads being used in the parallel region and update the private copy. But then what is done — to avoid, I guess, confusion from concurrent updates — is that a conditional is used so that only the master thread gets to update the global variable for the number of

34:59 | threads. And then there is a for loop for adding up the rectangle values, where each thread updates its own location in the array of partial-sum variables.
35:22 | The way this summation is done is kind of an assignment of rectangles to each thread in a round-robin fashion: as you can see, the loop variable is incremented by the number of threads. And since

35:46 | the sum array — I should say — is a shared variable, its content is persistent, so it is still available at the end of the parallel region. So when exiting the parallel region, there is then the summation of — it's not correct, I

36:15 | misspoke — the values, which are simply the function values at the midpoints of the rectangles; that is because one does not need to multiply by the width of the rectangle inside the loop, since it is the same width for all the rectangles. So, in fact, as you can see in the statement towards the end,

36:40 | you do the multiplication once you have the partial sums of the heights for each one of the threads, and then you simply multiply by the rectangle width — which could even have been done outside this for loop, at the very end. So this is again outside the

37:04 | parallel region, so there is no problem doing this in terms of correctness, but it is serial. So, in that sense, it is not necessarily what you want to do.
37:16 | And the other point that was made — this is the SPMD model, that is where that came up — yes, this is kind of a standard trick or model for how you can parallelize loops. In this case they are doing it round-robin; one can also do it otherwise, carving up the iterations differently. But in this case

37:46 | that for loop is executed by each thread, and the iterations of that inner loop are now divided up among the members of the thread group — it is parallel, not the sequential loop anymore. In this case, the programmer took the pains to divide the summation work between the threads him- or herself.
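A minimal sketch of the first parallel version being described — a round-robin SPMD loop with one partial-sum slot per thread (NUM_THREADS and the names are illustrative):

    #include <stdio.h>
    #include <omp.h>

    #define NUM_THREADS 2
    static const long num_steps = 100000;

    int main(void) {
        double step = 1.0 / (double)num_steps;
        double sum[NUM_THREADS];                 /* shared array: one slot per thread */
        int nthreads;

        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            int i, id = omp_get_thread_num();
            int nthrds = omp_get_num_threads();  /* private copy of the team size */
            if (id == 0) nthreads = nthrds;      /* only the master updates the global */

            /* round-robin: thread id takes iterations id, id+nthrds, id+2*nthrds, ... */
            for (i = id, sum[id] = 0.0; i < num_steps; i += nthrds) {
                double x = (i + 0.5) * step;
                sum[id] += 4.0 / (1.0 + x * x);  /* accumulate heights only */
            }
        }

        double pi = 0.0;
        for (int i = 0; i < nthreads; i++)       /* serial combine; multiply by the width once */
            pi += step * sum[i];
        printf("pi ~= %.15f\n", pi);
        return 0;
    }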
38:13 | And this is just — did they run this for more threads? Well, obviously, they did modify the number of threads and did not use just two threads. What we can see, in terms of

38:38 | execution time — the right column in this table, as a function of the number of threads — is that for two threads it works reasonably okay, but there isn't much improvement in performance for three or four threads. And there are still, in

39:02 | this case, 100,000 rectangles, so there is plenty of work to share among more than two threads. So, anyone has an idea how come we didn't get better performance, since there is lots of work to dole out to threads?
39:28 | Well, there was a sequential part at the end, in terms of summing up the partial function values and multiplying by the width; but with four threads there is still a lot of work in the parallel part and there are only four partial sums to add

39:52 | up. So even though there is serial work at the end, it is still very small compared to the parallel part. "If I had to take a

40:08 | shot at it, I would say that since the hyper-threading magic works by sharing resources, and they are all using the same, say, floating-point registers to do the actual math, maybe the hardware threads aren't able to act as independently as we expect them to, right?"

40:31 | This will, in fact, show up even if you didn't use hyper-threading — but it is a good point; I am not sure in this case. It is not quite the answer, but it is close to what is going on.
40:53 | So here is, in fact, the culprit. This is now where part of the reason comes into play for why I spent time talking about processor architectures and caches and cache lines. What happens in this case is what is known as

41:18 | false sharing. False sharing means that different threads share values in the same cache line, but not the same variable. So it is not sharing of a variable — in which case it would be proper sharing — but from the cache line perspective, that cache line is shared among most of those threads. And that comes from the

41:57 | fact that, as is kind of illustrated on this slide, this sum array is just a one-dimensional array, and it is laid out in successive locations in main memory. And that means that successive entries of the sum array are in the same cache line, until you have enough array elements to fill up a

42:32 | cache line. So sum[0], sum[1], sum[2] and sum[3] in this cartoonish example are in the same cache line. So if thread zero, say, is on core zero, on

42:54 | hardware thread zero — now, as I mentioned, and I will come back to it, I guess, in the next lecture (I don't think I'll get to it today), the default for the operating system is to spread things out, say, socket by socket, so core zero gets thread zero — and the machine here was a dual core, so that means, I guess, in this case, that there are two cores in play. So if you look more at the slide:

43:35 | sum[2] is needed by thread two, and that cache line is potentially sitting in the L1 cache of, in this case, the first core. And the L1 caches, at the closest point, are private to the cores. So if the other core wants it, it has to get it from the other core's cache. So then what happens is that

44:12 | the cache line ping-pongs: when the first core updates sum[0], then, through the cache coherence mechanism on Intel, the line gets invalidated in the L1 cache on the other core. So when that core reads its element, its copy is not valid, so it needs to get a copy from the other L1 cache. And then, when it updates things, the cache line

44:46 | in the first core's cache gets invalidated. So that means there is, as they call it, ping-ponging of the cache line between the two caches.
45:02 | "Yeah — at the beginning of the lecture, you mentioned something about a relaxed-consistency shared memory model. Do we make use of that in this case?" Well, for these things to be correct, either the cache coherence takes care of it, or otherwise one may need to explicitly use the flush command to

45:26 | make sure that the other threads see the last values. So in the presence of the cache coherence mechanisms for the cache lines, it takes care of it. "Since this is false sharing, it is not the same variable — and I guess I thought cache coherence was for the same variable." These are not the same variable that they refer

45:53 | to, because each thread only updates one particular element of the sum array. But because one element in the cache line is updated, say, by thread zero — as soon as that cache line is updated in the level-one cache of the first core, that cache line, the whole cache line, gets invalidated in the other cache through the cache coherence mechanism.

46:36 | So it is a different mechanism — we should be clear about that. It is because of the cache coherence that the cache line gets invalidated, even though the value itself may be perfectly fine. But since the coherence mechanism doesn't understand individual variables — it just works on cache lines — that is why it ends up

47:08 | being invalidated. So, good questions, right? It wasn't clear — does it make sense now? "Yeah, thank you." Okay, thanks, Frank, for having me clarify the difference between the two: the race condition is just based on the program logic and the variables, and this phenomenon is based on how the caching mechanism works.
47:49 | So how does one fix this? So now, one fix is to do this thing: padding. And actually this is even used in non-parallel programs sometimes, with respect to how

48:31 | the DRAM memory works — I did spend time talking about memory and DRAM and the fact that it has pages, and I'm not going to get into it again today — but this notion of padding is something one uses in order to get the desirable memory behavior and avoid poor memory performance. And

49:06 | for today it is OpenMP and cache line behavior, but the same trick of padding can be used there as well. And in this particular case the padding was done by using a two-dimensional array. So, as it says, the cache line is 64 bytes and we use doubles; that means, with double precision, eight bytes per data element, or

49:45 | eight data elements per cache line. That means that the sum array that is now used is a two-by-eight array. So now — you see, with the row-major ordering, that means that each thread's sum sits in its own row, and a row is a full cache line. So in this case, each thread's sum occupies a complete cache line. So now, when anything gets updated, it does not

50:38 | invalidate the cache line in the other core's cache. So now things should work better.
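A minimal sketch of the padding fix just described — each thread's partial sum gets its own cache line (a PAD of 8 doubles assumes a 64-byte line; the names are illustrative):

    #include <omp.h>

    #define NUM_THREADS 2
    #define PAD 8                                 /* 8 doubles = 64 bytes, one cache line */
    static const long num_steps = 100000;

    double pi_padded(void) {
        double step = 1.0 / (double)num_steps;
        double sum[NUM_THREADS][PAD];             /* only column 0 of each row is used */
        int nthreads;

        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            int i, id = omp_get_thread_num();
            int nthrds = omp_get_num_threads();
            if (id == 0) nthreads = nthrds;
            for (i = id, sum[id][0] = 0.0; i < num_steps; i += nthrds) {
                double x = (i + 0.5) * step;
                sum[id][0] += 4.0 / (1.0 + x * x);
            }
        }

        double pi = 0.0;
        for (int i = 0; i < nthreads; i++)
            pi += step * sum[i][0];
        return pi;
    }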
50:49 | And it does. So now, as you can see, with one, two, three and four threads there is still an improvement in performance. It is not, you know, linear — it is not perfect — but it is not too bad, in the

51:14 | sense that two threads takes a little bit more than half the time of one thread, and four threads takes a little bit more than half the time of two threads. So it is a pretty good speedup behavior as a function of the number of threads.

51:36 | But does anyone have comments about this way of doing business? "We are wasting — laying it out like that wastes space." Yes, a considerable amount, right? So this

51:54 | is clearly not the preferred way of doing business, so there are other ways to get to where you want to be. And the other part is: you would need to know the cache line size for the particular platform you are going to use.

52:11 | Now, it turns out that for the L1 caches most of them are 64 bytes, but not all, and for the higher-level caches there is some variability between 64 and 128 bytes. But — and this was a little bit of

52:28 | my justification for talking about architecture details in my previous lectures — one sometimes needs to be aware of it, and yes, it varies.
52:44 | So this is now a way of not explicitly tailoring things to the caches, but, by being conscientious about how caches work, trying to write the code so that it does not have

53:01 | the problem. So in this case you have the parallel region, and now, by making sum a private variable for each thread in the parallel region, there is no problem for the different threads to update their sum variables, because they all have their own private sum variable. So there are no issues of race conditions

53:38 | or anything in updating sum. And this is the type of code where the programmer again figured out how to parallelize the loop him- or herself, using this round-robin principle. But now the sum variables are private to each thread, so

54:06 | before one gets out of the parallel region — or rather, one has to sum up the sum variables for the different threads before one exits the parallel region. So that is why there is

54:30 | a serial portion now inside the parallel region, using this critical construct to make sure that the summation is done one thread at a time, to avoid a

54:44 | race condition. And now, as you can see, in this case we did not do the

54:54 | padding trick, we did not waste memory; but, by being conscientious about the cache line invalidation and keeping things local to threads, one in fact gets the same level of speedup as when using the padding construct or idea. Any comments or questions on this?
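A minimal sketch of this third version — a scalar sum that is private to each thread, combined inside the region under critical (names illustrative):

    #include <omp.h>

    #define NUM_THREADS 2
    static const long num_steps = 100000;

    double pi_critical(void) {
        double step = 1.0 / (double)num_steps;
        double pi = 0.0;

        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            int i, id = omp_get_thread_num();
            int nthrds = omp_get_num_threads();
            double sum = 0.0;                    /* private: declared inside the region */

            for (i = id; i < num_steps; i += nthrds) {
                double x = (i + 0.5) * step;
                sum += 4.0 / (1.0 + x * x);
            }

            /* one thread at a time folds its private sum into the shared result */
            #pragma omp critical
            pi += step * sum;
        }
        return pi;
    }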
55:28 | "So we achieved the same time without the extra memory overhead?" Yes — so there is no memory overhead, and there is no tying it to a particular architecture in the sense of cache line sizes. There is

55:51 | just an awareness, I would say, of how the caching mechanisms work, and trying to make sure that there is no kind of false sharing — in this case, that things in the same cache lines are used by multiple threads.
56:16 | "The question I had: what is the reason for cyclically distributing the loop iterations instead of just cutting them up into partitions?" In this case, one could equally well have used the parallel for construct and let the compiler figure out how to split up the loop. Okay, so it is a matter of choice — right. So I

56:45 | think — and I did not write this example, I am honest about it — I think the author of the example just wanted to show a way in which the programmer may parallelize the loop him- or herself. And in some other cases it may be preferable to tailor the splitting of the loop to the content of the loop. In this case it is a

57:09 | pretty trivial loop — a few set statements and the for statement — so there is really not anything, I would say, very clever about it. So I think it was just to highlight one particular way of parallelizing a loop. And, as

57:33 | we talked about: instead of critical, since it is a single variable, one could also use atomic here.
57:39 | So now the next thing that has been done — since you mentioned reduction — is to fix the negative part of the last version, which was good in many ways: it

57:53 | was not tailored to a particular architecture or platform, I would say, but it has the

58:02 | sequential addition of the partial sums of each thread. So, with the hope that there is a good reduction implementation, one can use the reduction clause, and use, in

58:17 | this case, the parallel for statement. Then one does not have the issue with the accumulation of the sum variables, because the reduction clause is supposed to make

58:35 | sure that that happens correctly. And in this final version there is then only — just as I mentioned earlier, since the width of the rectangles is the same for all of them — one multiplication to do, in the very last statement. That is the way it is being done in this case.
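A minimal sketch of the reduction version just described (again illustrative, not the literal slide code):

    #include <omp.h>

    static const long num_steps = 100000;

    double pi_reduction(void) {
        double step = 1.0 / (double)num_steps;
        double sum = 0.0;

        /* each thread accumulates into a private copy of sum;
           the runtime combines the copies at the end of the loop */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < num_steps; i++) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }

        return step * sum;                /* one multiplication by the width at the end */
    }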
58:56 | And, I guess, in this case it turns out that the reduction implementation was not quite as efficient as the other ways of doing business. If you look

59:10 | at the last column in the table, using the reduction clause, the speedup is not quite as good as with critical — or one could have used atomic as well — but that depends on the implementation; it is nothing inherent, nothing that says that reduction would be significantly worse.

59:40 | "Does the reduction clause only have the preset options, or does it generalize — are you able to use whatever map-reduce-style combiner you can come up with?" So the reduction in this case is a function implemented as

59:54 | part of the runtime library supporting OpenMP, and it has a collection of operators that it does support. So one can certainly write one's own reduction,

60:17 | but then it would not be a clause on the statement. Maybe I misunderstood your question. "I was just wondering if that means you can't just pass any sort of functor in there — whether the specific combiners that they use are tailored and implemented in a way that makes it fast." Yes — to my knowledge, none of the

60:43 | OpenMP versions support passing in a function that you wrote instead of using plus in the reduction clause. "Okay — yeah, I think it only does elementary operations; you can't define a function for it." That's a good question, I think, but to

61:05 | my knowledge they have not gone to the point where they can take a user-defined function.
61:25 | So, summing up the example: there is dealing with race conditions, there is dealing with false sharing, and then, potentially, using the reduction operator to avoid serial segments of the code for adding up the individual thread sums into a global variable.
61:55 | Okay, so now we are getting into a different construct, the task construct, and I will pause for a few seconds or a minute to see if there are questions.

62:17 | So this is now getting into making things even a little bit more flexible, but also more complex. So we are heading into that — unless there are additional questions about OpenMP in terms of the constructs and clauses so far — and I will start talking about the task construct and working with OpenMP tasks.

63:05 | So, after the initial design, so to speak, ages ago — in terms of trying to take some of the details of threads-based shared memory programming away from the programmer and leaving them to the compiler — the discovery was that it did not have the

63:30 | flexibility that one sometimes needs in practice, when people got to writing more serious codes. And it also turned out that when someone has recursive algorithms, there is trouble doing them with the initial set of constructs that was present in OpenMP. So the task construct was created, and that is what I will talk about now.
64:16 | Yeah — so the task construct is more flexible, because there is more to it than individual threads. But

64:31 | in some ways, as is said on one of the slides that comes later, one can view a parallel region as a set of tasks, where each task is assigned to a thread. But then there is still a sort of restricted set of things a task gets to do, and of ways to manage the data environment in particular, as well as the execution order.

65:01 | So the task construct has, with it, not only the code to execute: it also has more of a data environment and what they call the internal control variables. So it is richer, but that also implies more overhead, because there is more of a machine that needs to be set up to manage tasks as opposed to threads.

65:43 | So the task-parallel version of these things is where the color-coded boxes are now tasks, and the different tasks are different pieces of code — as opposed to what we have seen so far, where in a parallel region each thread gets a copy of the same code. Now tasks can have different pieces of code; but, like with threads, there is no

66:13 | particular execution order defined between tasks. So in this case — as the pile in the cartoon to the right of this slide shows — it is arbitrary in what order they start and how they proceed. So this slide

66:40 | says more or less what I just said: one can think of the threads in the initial version, or in a parallel region, as tasks, but they are much more restricted than the task construct.
66:56 | So here is the task construct: it is placed on a structured block, and I'll give trivial examples in a little bit. Then there are slightly different scoping rules for data than for a parallel region, for tasks. And then there are separate scheduling points

67:21 | for how to coordinate tasks, which I do not want to get into just yet. And then there is the barrier, kind of like for parallel regions; and then there is also a kind of wait as well, the taskwait.
67:51 | Yeah, so here is the simplest task construct example: there is a parallel region, and inside this parallel region there is

68:01 | first an OpenMP single construct. That is used — and this is, as I say on one of the other slides, the typical pattern — so that one thread creates all the task contexts that are necessary for the tasks that follow inside the single construct. So in that case, what it

68:38 | means is that each task will be executed by a single thread. But, as one of the slides says, there is no particular implied order between these tasks. So if one needs an order — if one wants a certain order, or there are dependencies — one will need other constructs to assure that order. Okay.
69:14 | Now, I guess I had some examples — here is an example that may help remind you how OpenMP works, as well as how this task construct works. So this is a simple sequential program; it is no surprise that, if you run it,

69:34 | it does this, because it is sequential code: the print statements are executed one after the other, and that is what happens. So now I want

69:44 | to do the parallel version. Now, I guess, if you use two threads, each thread gets a copy of this code, so each thread is going to print "a race car". But there are, you know, three print statements and, in this case, two threads. So, anyone wants to volunteer what we will

70:15 | potentially see? It is a print statement — so it is again just to remind you how OpenMP works: there is no particular order in which the two threads in this case execute these print statements.
70:51 | Okay — so with two threads, that's right: in this case each thread printed out "a race car" in its entirety. In this case that's true, and they came out nicely ordered, but it could have been pretty much any combination of them. For any one of the two threads, "race"

71:20 | would always precede "car" for that particular thread; but the threads can progress at different paces, so it could be fairly jumbled in terms of what comes out — you can get various permutations if you try it a few times. And that is just standard OpenMP, before the task construct.
|
71:49 | to see the impact on and off single statement now would you expect anything |
|
|
72:01 | at this time? You can be that it'll print a race car on |
|
|
72:05 | entirety for each thread. Yes, of. But there's just a single |
|
|
72:12 | . In this case. They only it once. Yeah, because of |
|
|
72:17 | single thread is enforced. So even there is in the after open |
|
|
72:22 | But then there's the restriction that on single threat can execute. It's not |
|
|
72:31 | critical region, so it's not one the time. It's just one |
|
|
72:39 | So, that is what I said. So now we are trying to do the task-parallel version — similar to the construct as such, but in terms of this task construct. So now, what do we get? If we look at it,

73:11 | the code will first generate a team of threads with the parallel construct; but then we decided that for the next thing we are only going to use one thread — except that, the moment I use the task construct, that means each

73:36 | of those tasks gets assigned to a thread. Now, the print of "I think" is just subject to the single-thread requirement, because it is not preceded by a task directive, so

73:56 | we should only get that printed once. But then we have the other two prints, "race" and "car"; those should each be printed once. However, the order is not

74:16 | guaranteed, because the tasks can be executed in any order. So that means that, depending upon how many times you run it, you may get "race car", as the sequential version did, but we could also get a thread printing the "car" task ahead of the "race". So

74:45 | again, that is because the ordering between printing "race" and printing "car" is arbitrary; "I think" is always first because it is outside the task constructs.
75:01 | Now, one more, I guess, example. So here I just added "is fun to watch"; that is not part of a task construct.

75:20 | So, anyone? Well, maybe it is easy: "I think" should always be printed first, right, because that is in the single-thread region; "is fun to watch" should also be printed only once, because it is also in the single, non-parallelized part. But we still have the task statements, which can be executed in an arbitrary order. So this

75:49 | is kind of what might happen: it may happen that "is fun to watch" comes out before any one of the task constructs gets executed. So the sequential order — the single region — may

76:15 | get both of its print statements done before any of the task regions gets to print what it is supposed to print. All right. Thank you.
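A minimal sketch of the task example just discussed (this follows the commonly used "race car" illustration; the exact strings on the slides may differ):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel
        {
            #pragma omp single
            {
                printf("I think ");          /* printed once, by the thread executing single */

                #pragma omp task
                printf("race ");             /* deferred: may run on any thread, in any order */

                #pragma omp task
                printf("car ");              /* ditto: "car" may appear before "race" */

                printf("is fun to watch\n"); /* also printed once, possibly before the tasks run */
            }
        }
        return 0;
    }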
76:37 | Okay — I want to stop there, two minutes early, unless there is a question. If there is no question, I'll talk a little about this. So far, the essence of what I

76:52 | said about the task construct is that it adds a level of management that the parallel region construct by itself does not have; but the execution order among tasks in a parallel region is arbitrary, just like the relative progress of threads in a parallel region is kind of arbitrary. So there is this

77:20 | sort of hierarchy that goes from threads to tasks to processes. A process is the most heavyweight, because it carries a lot more context with it; threads are at the other end, with the least context carried with them;

77:44 | and tasks are kind of in the middle, sort of. A task is still a shared memory construct, so it has the benefits of things being shared, but it has more flexibility and allows for different code to be executed by different tasks, whereas each thread executes a copy of the same code.

78:14 | So, going from just threads, the task adds flexibility but also increased overhead to set things up; and if you go further, on to processes, there are even more things that need to get created and set up, but it is also more flexible — and it is no longer necessarily the case that the process model is limited to shared memory — with, again, a lot of things that the process carries with it.

78:54 | So with that, I guess I'll take more questions. But we will start next time by talking about what is grayed out here in terms of the task constructs, and then we'll talk about affinity and how you can force threads to be placed where you want them to be, instead of having the operating system decide it for you. So that is what we will do on Wednesday.
79:32 | And then, I think, yes — for the demo: a demo of some of the constructs that have not been demoed yet, including the task construct.

79:55 | So, any other questions, whether about lecture material or assignments? If not, I'll stop sharing my screen

80:22 | and now stop the recording.