COSC 6365 Introduction To High Performance Computing - Lecture08_DRAM

00:00	Yes. Today I will continue to talk about

00:10	memory. I just got to the point of talking about main memory last

00:16	time, but spent most of the time talking about caches. So today, more about

00:23	main memory behavior. As you will see through the lecture, despite the

00:34	name, dynamic RAM, or random access memory, the memory is actually

00:41	not really random access. That is why you can get very different performance depending on

00:47	how you access memory, because it is not, as the name indicates, random access.

00:53	I want you to be aware of that and understand why that is, and

00:57	the magnitude of the performance impact that it may have. Then I hope to get to

01:03	talk a little bit about power and energy, and I will stop leaving about 20

01:12	minutes or so for a demo of profiling and, in particular, the power measurement tools

01:23	we have access to. There are other tools that we,

01:30	unfortunately, do not have access to, so we are pretty much down to one. So now,

01:42	dynamic random access memory, or RAM in one form or another. So the D in

01:52	DRAM is for dynamic and the S in SRAM is for static. The static RAM, on the

01:59	slide on the right-hand side, is the kind of design that is used

02:04	for caches, whereas the DRAM design is used for your main

02:10	memory, so the objectives are different. For main memory, you want something that can

02:16	hold a fair amount of data and that is cheap. That is why one has tried

02:23	to come up with a design that involves one transistor and one capacitor.

02:29	So it is a one-transistor cell, quite small, and there is a little note in

02:36	the bottom left-hand corner giving an indication of what the size is.

02:41	It says that, for a reasonably current technology, one bit of dynamic random access memory does not take

02:51	more than 0.0 26 micro meter, so it is incredibly small. On the other

02:58	hand, the SRAM cells typically have six transistors instead of one, and the

03:04	reason is that SRAM is retaining the information, whereas the DRAM

03:12	leaks information, and I will talk about that in a bit. So SRAM cells are considerably

03:19	larger, more or less on the order of 10 times as large.

03:24	SRAM is also designed for speed. We want caches to operate

03:30	at the clock rate of the cores and the logic: ALUs, comparators, decoders,

03:40	and whatnot, so that it all operates at the same clock rate, whereas the memories

03:45	are considerably slower, and I will talk about those things. But this slide is there

03:51	just for general education, in terms of how the memory technology has evolved and where

03:58	it is today. I put it in just for

04:03	general education. Today it is in the 10 to 50 nanometer range for the technology

04:10	being used to produce DRAM. Compared to a human hair, that is, in

04:18	lateral dimension, about 2000 times smaller or more, depending upon how thick your hair is.

04:24	So that means that on just a cross section of a human hair today you can

04:28	fit hundreds of thousands, if not millions, of transistors or bits.

04:34	That is to give some idea of the DRAM technologies. Here is just another

04:41	slide showing more or less the same thing in terms of the bit density. On the

04:46	right-hand side of the scale you get close to a quarter of

04:53	a billion bits per square millimeter on today's chips. The other

04:59	bars show typical chip areas, in

05:06	square millimeters. When we were talking about processors, if you

05:11	remember, the processor chips today tend to be fairly large, in the 600 to 800

05:16	square millimeter range, so these chips are smaller, and the reason for that is

05:23	to get high yield and low cost. That was for general education.

05:29	Now, here is how random access memory is organized. It is actually

05:37	like a matrix, with rows and columns, and at the intersection of these rows and columns

05:46	is where you have a bit cell of some flavor, either one-transistor DRAM

05:53	or six-transistor static RAM, SRAM. The rows are typically known as word

06:01	lines and the columns are typically known as bit lines, and one row then basically holds

06:11	a block of data. I will say more about how these things work in the

06:15	next several slides, but it is important to remember that this is a matrix

06:21	organization. Here is just a micrograph of what the chip might look like.

06:26	You can very much see the highly regular and ordered arrangement of memory bits.

06:36	And here is just a little bit of text, since I don't use a

06:40	textbook, for anyone to read up on. This is a microphotograph of, I think, a

06:45	Samsung chip that is a few years old by now. Today, 20

06:53	nanometer feature sizes in silicon are used. The state of the

06:59	art is about half of that size, or a little bit bigger than half. And just for

07:06	comparison, so you have some measures, here are static RAM cells, and you

07:10	have the chip area for basically one bit on this thing and, as you can

07:15	see, it is more than a factor of ten larger than your DRAM cell.

07:22	So now a little bit about the operation of the DRAMs. Since it has a

07:27	matrix organization, to get what you want out of it, or to be able to write

07:33	to it, you need to have both a row and a column address. Now, the

07:40	way things are organized, remember that these have small footprints, like, you

07:47	know, 6 to 7 millimeter squares, and one can't fit too many pins for data,

07:56	clocks, power, and all the things you want. For that reason, designers

08:02	try to economize on pins. "Pins" should not be taken too literally; in

08:10	the early days it literally was pins. There are other ways of getting electrical contact

08:18	between the memory chip and the circuit board, but just think of it as a

08:25	mechanical piece that carries the signals. Since the footprint is small, the designers

08:34	decided to use the same quote-unquote pins for both row and column addresses, and

08:41	that means you can't give them both at the same time. So

08:47	typically, you give the row address first, and then you give the column

08:53	address. It turns out that, since things are organized in many columns,

08:58	you can then, for a given row, provide many different column addresses to access

09:05	the columns that you want, and I will cover that in a bit.

09:09	But there is a process necessary to read or write the information. First,

09:17	one needs to activate, as it is known, the particular row. The reason for

09:25	that is that, again, power consumption is a big concern, and I will say

09:32	more about that towards the end of this lecture. So things that are not

09:39	actively being used are in some form of low-power state, and there are several

09:45	low-power states, and again I will cover that in the future.

09:50	The point is that to actually be able to read or write, you first have

09:53	to activate the row, and that takes time. Then you do the operation you

10:00	wanted, either a read or a write, and then one has to restore

10:05	the state, and that is known as precharge. And then there is yet another operation, refresh.

10:10	So the first four are what needs to happen in relation to a read or

10:17	write operation. The refresh is a separate thing, and it does not need to be

10:24	closely associated with reads or writes. But it is something that needs to happen because,

10:30	as I mentioned earlier, the cells are leaky, so they forget the information as time goes

10:35	on. You don't want that to happen, so at some intervals, one has

10:41	to restore the information before it has disappeared. That is

10:47	refresh. Here is a little bit more, just showing again this dynamic RAM picture and

10:56	the set of operations that needs to happen. There is the precharge,

11:01	to prepare the bit line, and then you use the row address to

11:09	get the word line, and then you basically have the coordinates of a bit.

11:14	When you activate both the bit line and the

11:20	word line, the charge on the capacitor gets shared with the bit line.

11:26	At the end of the bit line there are what are known as sense

11:31	amplifiers, which are incredibly sensitive and can detect minor changes in the voltage,

11:41	effectively the electrons that come off the capacitor when

11:50	the line is activated. It is, as I said, down to, you know, tens

11:57	to hundreds, depending upon the technology, because these features are very small, as I pointed out

12:06	before. The number of electrons on the capacitor is on the

12:13	order of tens to hundreds in the finest technology. So you are

12:19	almost counting individual electrons to figure out what state the capacitor had.

12:30	So let's see what happens here, coming back now a little bit to

12:36	how the operation works. As I mentioned, the typical model is that you give the row address,

12:45	and once that row is open, through this process of activation of the

12:52	row, you can issue several column access requests for that row.

13:02	You don't have to repeat the row address as long as you stay within the same

13:07	row. However, if you want to switch to another row,

13:16	then you basically need to close the active row and activate the new

13:25	row, and then you can read it. That means that when you go from

13:30	one row to the next row, there is a time penalty, or delay, or performance

13:37	degradation compared to staying in the same row. So this slide is just

13:46	pretty much saying what I already said: if you are in an open

13:51	row, you just keep issuing column addresses as long as you access things within the same

13:57	row. On the other hand, if you need to go to another row,

14:01	there is a penalty. So here is the

14:11	timing aspect of the DRAM. DRAMs have a cycle

14:19	time, and the cycle time has to account for all the phases associated with

14:30	reading or writing information in DRAM. That is why the cycle time is

14:37	much larger than the access time. The access time is associated with

14:47	the case where you are in an open row and go from accessing data in one

14:52	column to another column in the same row. That is the thing that is called

14:57	access time, whereas the cycle time has to account for all the phases.

15:05	And it is important not to confuse the DRAM cycle time, which is a separate

15:14	concept, with the actual clock rate that is being used in

15:22	the memory itself. The cycle time is also much longer than the clock period.
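The difference between staying in an open row and switching rows can be sketched with a toy model of a single bank. This is only an illustration: the cycle costs below are made-up round numbers, and the names `T_CAS`, `T_RCD`, `T_RP`, and `access_cycles` are hypothetical, not taken from the lecture slides.

```python
# Hypothetical cycle costs, chosen only for illustration.
T_CAS = 10   # column access on an already-open row
T_RCD = 10   # activate a row before its columns can be accessed
T_RP = 10    # precharge (close) the currently open row


def access_cycles(rows):
    """Total cycles to access a sequence of row numbers in one bank."""
    open_row = None
    total = 0
    for row in rows:
        if row == open_row:
            total += T_CAS                  # row hit: column access only
        elif open_row is None:
            total += T_RCD + T_CAS          # idle bank: activate, then access
        else:
            total += T_RP + T_RCD + T_CAS   # row switch: close, activate, access
        open_row = row
    return total


same_row = [0] * 8                  # eight column accesses within one open row
new_row_each_time = list(range(8))  # every access needs a different row

print(access_cycles(same_row))           # 20 + 7 * 10 = 90
print(access_cycles(new_row_each_time))  # 20 + 7 * 30 = 230
```

With these assumed costs, the row-switching pattern takes more than twice as many cycles for the same number of accesses.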

15:32	And now something I want you to be familiar with, because if you ever have

15:37	to configure things or understand the performance in detail, there are a number of timing

15:48	aspects of the DRAM one needs to be familiar with. They have these names,

15:56	and they are typically quoted as a sequence of numbers with hyphens between them. First is

16:05	the tCAS number. As it says there, it is the number of memory

16:10	clock cycles needed to access a certain column. I will talk more about these

16:18	on the next three or four slides, but I want you to

16:24	be familiar with tCAS. tRCD is for when you switch rows

16:32	or need to open a new row. As I said, there is a delay after you

16:35	give the row address before you get access to the columns. Then there

16:42	is tRP, the precharge time. And there is also tRAS, the minimum duration for

16:51	having the row open. Because you need to go through these steps,

16:57	you need to activate it, hopefully read or write it, and then you need

17:01	to close it. So there is a minimum time before you can jump to

17:07	another row. Here is a little bit of a diagram trying to put these

17:13	in context. There is tRCD: when you start a new row,

17:22	before you can actually deal with the columns there is a certain number of cycles that

17:27	needs to happen. Once that is done,

17:34	you can access information in the columns of the row, and there are a number of

17:40	cycles associated with each column access, for all the column accesses you make.

17:47	The write is a little bit different. Also shown here is the

17:53	minimum time, tRAS, for the duration of a row access. And finally,

17:59	when you are all happy and done with the row, there is the precharge

18:05	to restore things and make them ready for another row.
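As a rough sketch of how such a hyphen-separated timing string can be read, the snippet below splits it into named cycle counts and converts cycles to nanoseconds. The string "10-10-10-28", the 800 MHz bus clock, and the helper names `parse_timings` and `cycles_to_ns` are assumptions made for this example.

```python
def parse_timings(text):
    """Split a hyphen-separated timing string into named cycle counts."""
    names = ["tCAS", "tRCD", "tRP", "tRAS"]
    values = [int(part) for part in text.split("-")]
    return dict(zip(names, values))


def cycles_to_ns(cycles, bus_clock_mhz):
    """Convert memory-bus clock cycles to nanoseconds."""
    return cycles * 1000.0 / bus_clock_mhz


timings = parse_timings("10-10-10-28")  # hypothetical timing string
bus_clock_mhz = 800.0                   # assumed memory bus clock

# Column access in an open row versus closing a row and opening another one.
row_hit_cycles = timings["tCAS"]
row_switch_cycles = timings["tRP"] + timings["tRCD"] + timings["tCAS"]

print(cycles_to_ns(row_hit_cycles, bus_clock_mhz))     # 12.5
print(cycles_to_ns(row_switch_cycles, bus_clock_mhz))  # 37.5
```

The same cycle counts mean different physical times at different bus clocks, which is why the conversion to nanoseconds is worth doing before comparing memories.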

18:13	The next slide is just trying to give a little bit on current, state-of-the-art

18:20	type memories, and an understanding of what things mean and what they are, as

18:28	well as their characteristics. I think I mentioned before that DDR stands for double

18:33	data rate. That standard has been around for many, many years.

18:38	The digit after DDR is the number four here, and it keeps changing

18:42	as the features, as well as the specifications of the standard, evolve according

18:52	to the silicon technology available. I think DDR5 is actually the standard that

18:59	has now been agreed on, but most server and PC products out there are still

19:05	DDR3 or DDR4. Then after the dash comes another number.

19:11	It says here 1600 at the top and 3200 towards the bottom. I am

19:17	not going to get into the letters that follow after these things.

19:21	But the number that follows is the data transfer speed,

19:33	I should say, and that is typically measured in transfers per second. That corresponds to

19:47	transitions of the clock: in a clock cycle, as you know, the clock goes up and then goes down.

19:51	So a clock cycle has two transitions per clock period. If you look

19:58	at the I/O bus clock for the entry at the top, it says 800 megahertz,

20:04	and the megatransfers per second is 1600. So it is an 800 megahertz memory bus clock,

20:13	and in each clock period you communicate two bits. Either you

20:24	read or you write, but you transfer one bit when the clock rises and you transfer

20:28	one bit when it falls. That is why the data rate is twice the clock

20:34	rate of the memory bus. But there is another memory clock

20:42	that is important: the internal clock of the memory chip

20:49	itself. It is not the external thing on the bus. As you can

20:55	see, the internal clock is considerably lower than the memory bus, or I/O bus,

21:01	clock. In fact, it is lower by a factor of four in this particular standard.
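The relationship between the three numbers (transfer rate, bus clock, internal clock) can be written down in a few lines. This is a minimal sketch; the function name `ddr_clocks` and the default ratio argument are choices made for this example, using the factor of two and factor of four stated above.

```python
def ddr_clocks(mega_transfers_per_s, internal_ratio=4):
    """Return (bus clock MHz, internal clock MHz) for a given transfer rate.

    Two transfers happen per bus clock period (rising and falling edge),
    and the internal clock is assumed to run internal_ratio times slower.
    """
    bus_clock_mhz = mega_transfers_per_s / 2
    internal_clock_mhz = bus_clock_mhz / internal_ratio
    return bus_clock_mhz, internal_clock_mhz


for rate in (1600, 3200):
    bus, internal = ddr_clocks(rate)
    print(f"{rate} MT/s: bus clock {bus:.0f} MHz, internal clock {internal:.0f} MHz")
```

For the 1600 and 3200 entries this gives internal clocks of 200 and 400 megahertz, respectively.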

21:09	If you compare the memory clock to the CPU clock

21:15	we talked about, which tends to be in the 2 to 4 gigahertz range,

21:21	the memory chip clock is about 10 times slower than what the CPU or

21:32	functional unit clocks are. Then there is a column with the module

21:41	name, and that is what you will find for a DIMM. That is

21:48	when you have these memory modules that you are sticking into the circuit board of your

21:52	server or your PC. That then maps into the width of the memory

21:59	bus, which is pretty much always 64 bits today. So that is where that number comes from.

22:07	Then we move over to the right, and that is where these

22:14	timings, the column latency, tRCD, and tRP, show up,

22:22	just three of the four numbers in this case. That tells how many

22:29	external memory bus, or I/O bus, cycles are associated with a

22:39	column access. That is the first number. If you look at the top, it

22:43	says 10 memory bus cycles is what it takes to get one column entry out on the

22:50	bus. Similarly, there is a delay of 10 cycles after you initiate or start a

22:57	row access until you can do the column access. So the most

23:04	important part, perhaps, in terms of performance is definitely the right-hand column.

23:10	If you look from top to bottom in the memory clock column,

23:17	it improves, depending upon the speed rating of the memory you get, by a

23:22	factor of two, from 200 to 400. But, on the other hand, when you

23:28	look at the number of cycles in the last column

23:34	on the right, you see that the faster the internal clock is, the higher the

23:39	number of clock ticks it takes to get the stuff out. So the net effect is

23:43	that in terms of physical time, regardless of the speed of your memory, the

23:51	latency is pretty much the same, and that has not changed over a very

23:57	long time, as I show on the next slide. If you look at the

24:03	bottom left diagram, you can see that the latency has pretty much been at

24:12	the same level, in nanoseconds, for many generations. The horizontal axis is

24:18	the time axis. I am sorry about the labeling, but it refers to time in calendar

24:24	time, not processing time. I will come to why this is. So

24:31	memories, in terms of actual physical latency, have not gotten any faster. In terms of the bandwidth

24:40	delivered to the memory bus, they have gotten faster, and that is through

24:45	parallelism inside your memory chips. So

24:55	the basic building blocks of the DDR memory have not gotten much faster over time.
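To see why a faster clock does not mean lower latency, one can convert the column latency from cycles to nanoseconds for several speed grades. The 10 cycles for the 1600 entry is the figure quoted in the lecture; the cycle counts for 2400 and 3200 are assumptions chosen to illustrate the trend, and `latency_ns` is a hypothetical helper.

```python
def latency_ns(mega_transfers_per_s, cas_cycles):
    """Column access latency in nanoseconds for a given speed grade."""
    bus_clock_mhz = mega_transfers_per_s / 2   # two transfers per clock period
    return cas_cycles * 1000.0 / bus_clock_mhz


speed_grades = [
    (1600, 10),  # 10 bus cycles, as quoted for the 1600 entry
    (2400, 15),  # assumed cycle count, for illustration only
    (3200, 20),  # assumed cycle count, for illustration only
]

for rate, cycles in speed_grades:
    print(f"{rate} MT/s, {cycles} cycles: {latency_ns(rate, cycles)} ns")
```

Under these assumptions every grade comes out at 12.5 ns: the cycle count grows in step with the clock, so the physical latency stays flat.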

25:04	One increases the bandwidth by having parallelism in the memory, and this is what I will

25:13	cover next. Any questions so far? Nothing

25:23	in the chat. Okay, if not, I will talk a little bit about

25:28	the parallelism that is part of the memory. This is,

25:35	I think, a very busy slide and a bit of a shock if you think of

25:38	memories as being pretty simple. There was the simple matrix I showed you in the beginning

25:42	of how things are put together in rows and columns. But in order to try

25:48	to keep up with improvements in processors, where the clock rate has also

25:57	stabilized in a range of 2 to 4 gigahertz, and there are reasons for

26:02	that which I will also come to, but where one has gotten many cores,

26:07	so the capability to process data on a processor piece of silicon has increased tremendously over

26:17	the years, memory, to try to keep up, has gotten a

26:23	lot of structure. I am using DDR3, which is one

26:31	generation old, because the new ones are more complex than this one. But

26:38	the principle is the same. When you try to buy a piece of

26:44	memory, it has a number of things in its description. On top it

26:49	says one gigabit. That means the memory die has one billion bits of storage

26:55	on it, and then it has a particular organization. I think I said

27:01	a little bit about that when I talked about DIMMs, that there are times 4,

27:05	times 8, and times 16, as the number of bits that come out in a

27:13	cycle of the memory die. So this tells you how this one is

27:23	organized: it is organized as 128 meg by eight.

27:32	Now, the DDR3 standard specifies that every chip, regardless of which

27:43	vendor, needs to have eight banks, and that again is part of the parallelism

27:48	that is necessary to keep up. Then we have the times eight that you

27:53	will find here. That is the width of the data path that comes out of the

27:58	stack of memory arrays and then gets to the pins, or external connections, of this

28:07	particular DRAM. Then, in this case,

28:15	there are, as I said, 128 columns, and you can find in

28:22	this bank diagram that it has basically 16 K rows and 128

28:29	columns, and the third dimension in this case is 64 bits in each one of

28:35	these memory arrays. You will see why it is 64 bits. It is related to

28:43	what is known as the burst length that is given here. It is also specified in the

28:53	DDR3 standard that they should have a burst mode of eight. The data comes

29:01	out of the memory array, which operates at a quarter of the

29:10	external memory bus clock. So that is a factor of four you need. But then,

29:16	on the external memory bus, you do two transfers per clock, which you

29:22	don't do internally. So that means you have got another factor of two. So when

29:27	you access the memory banks, in each one of those internal clock cycles you need to

29:35	get 64 bits in order to keep

29:41	up with the eight-bit-wide

29:46	external data path, since that does eight transfers of eight bits within the same time

29:55	as the bank is operating in one cycle. You can also see how the

30:00	row and column addresses are figured in this diagram. There are 128 columns in

30:07	the bank, so that means there are seven bits to address the column. And

30:12	there are 14 bits to address the row.
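The numbers for this 1 gigabit, times-eight DDR3 part can be cross-checked with a little arithmetic. This sketch uses the figures from the slide as described in the lecture (16 K rows, 128 columns, 64 bits per column, eight banks); the constant and variable names are chosen for this example.

```python
ROWS = 16 * 1024        # rows per bank
COLUMNS = 128           # columns per row
BITS_PER_COLUMN = 64    # bits fetched per column access inside the chip
BANKS = 8               # banks per DDR3 chip

# Total capacity: should come out to one gigabit (2**30 bits).
bits_per_bank = ROWS * COLUMNS * BITS_PER_COLUMN
total_bits = bits_per_bank * BANKS

# Address widths: bits needed to select one row, one column, one bank.
row_bits = (ROWS - 1).bit_length()
column_bits = (COLUMNS - 1).bit_length()
bank_bits = (BANKS - 1).bit_length()

# Why 64 bits per internal access: the internal clock is 4x slower than the
# bus clock, the bus moves data on both clock edges, and the data path is
# 8 bits wide, so 4 * 2 = 8 transfers of 8 bits each per internal cycle.
DATA_WIDTH = 8
transfers_per_internal_cycle = 4 * 2
bits_per_internal_cycle = transfers_per_internal_cycle * DATA_WIDTH

print(total_bits == 2**30)                # True
print(row_bits, column_bits, bank_bits)   # 14 7 3
print(bits_per_internal_cycle)            # 64
```

The burst of eight transfers on the eight-bit data path accounts exactly for the 64 bits read from the array in one internal cycle.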

30:20	That is the way things are organized. Then there is bank control logic that decides in which bank

30:28	the data actually lives. I didn't cover it, but I showed, if you

30:35	look at the slides from last lecture, and I think it is in the slide deck for

30:39	today too, even though I don't cover it, that the memory controller needs to

30:46	provide the rank, it needs to provide the bank, and it needs to provide the row and

30:51	the column addresses. DDR4 is now, I shouldn't say the dominating

31:00	one, but the more state-of-the-art one, because DDR3 is cheaper and,

31:05	many times, if that is good enough, that is what is being used.

31:10	DDR4 is organized not as one set of eight banks, but it has two

31:18	or four sets of four banks each. The thing is, if

31:24	you stay within a group of banks, it is kind of slow, but you can switch between

31:30	groups relatively fast. Here is a

31:37	little bit again on how the addressing works in this case. As I said, there are groups,

31:46	four groups of four banks each, so two bits for the group and

31:50	two bits for the bank. Otherwise, it is pretty similar to

31:55	DDR3. Now a little bit on the performance issues with that.

32:04	A bank and a row, I should say, is also referred to as

32:11	a page. If you stay within the page, you can

32:16	access things at very good speed. On the other hand, when you have

32:22	to go to another row, things are slow. What also happens is that

32:31	going between banks is fast, likewise on DDR4. So in the worst

32:38	case, if you do not do bank switching but row switching, you get much lower

32:44	bandwidth. This is what the slide shows. If you have something like

32:50	the STREAM benchmark, the compiler tries to line things up with

32:56	the knowledge of how memory accesses work and what the access penalties are, so things get laid

33:05	out so that you can stay within rows for successive addresses or accesses. So,

33:11	in this case, for this DDR3 memory, the peak is almost

33:16	66 gigabytes per second. Thank Once the DVR 10 33 memories should

33:23	put up on the slide. Whereas, if you go to the same bank

33:31	and need new row addresses for every access, then you don't get the benefit

33:41	of bank interleaving, so the bandwidth gets reduced by a factor of eight

33:47	because of the eight banks in the DDR3 memories. So that is

33:53	a big difference in performance. And when you did your GUPS, you are not

33:59	likely to stay within the same row of a bank, but switch between different rows in the

34:07	banks. That is why you get much worse performance in GUPS, as part of

34:11	it. The other part is caches.
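The factor of eight from bank interleaving can be illustrated with a small queueing-style model. This is a sketch under simplifying assumptions that are not from the lecture: one request can be issued per bus slot, each access keeps its bank busy for a fixed number of slots, and accesses to different banks overlap fully. The function name `total_slots` and the `busy_slots` parameter are hypothetical.

```python
def total_slots(bank_sequence, busy_slots=8):
    """Bus slots until all accesses finish, given the bank each one targets."""
    bank_free_at = {}   # slot at which each bank can accept a new access
    now = 0             # earliest slot at which the next request can issue
    finish = 0
    for bank in bank_sequence:
        start = max(now, bank_free_at.get(bank, 0))
        bank_free_at[bank] = start + busy_slots
        finish = max(finish, start + busy_slots)
        now = start + 1
    return finish


n = 1024
interleaved = [i % 8 for i in range(n)]   # successive accesses rotate over 8 banks
same_bank = [0] * n                       # every access needs a new row in bank 0

print(total_slots(interleaved))   # 1031
print(total_slots(same_bank))     # 8192
print(round(total_slots(same_bank) / total_slots(interleaved), 2))   # 7.95
```

In this model the ratio approaches eight for long access sequences, matching the number of banks that can work in parallel.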

34:18	Here is a little exercise of taking a modern DDR4 with 2.667 gigatransfers per

34:29	second. Here is the way it works out: take a dual-socket server

34:36	node, and it has eight memory channels per socket. So this is kind of

34:42	an AMD type of scenario. Then there is a 64-bit-wide memory

34:52	bus, and you convert the bits to bytes by dividing by eight. Then you

34:56	have the transfer rate of the memory bus. That gives the

35:02	peak memory bandwidth for this particular configuration: a dual-socket server, eight memory channels

35:09	per socket, and 2667 megatransfers per second. So here is,

35:20	I guess, what I just said, the different parts of it. Now, if

35:25	it so happens that the successive addresses you access are not on alternating sockets

35:33	but on the same socket, then you lose half of the memory bandwidth.

35:40	If it is all on one channel, you lose another factor of eight, since

35:44	it is only on one memory channel. And if it is in the same bank,

35:50	you lose another factor of eight. So the two extreme

35:57	cases differ by more than two orders of magnitude.
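The peak bandwidth and the loss factors can be worked through explicitly. The sketch below uses the configuration as described in the lecture (two sockets, eight channels per socket, a 64-bit bus, 2667 megatransfers per second); the variable names are chosen for this example.

```python
SOCKETS = 2
CHANNELS_PER_SOCKET = 8
BUS_WIDTH_BITS = 64
TRANSFERS_PER_S = 2_667_000_000   # 2667 megatransfers per second

bytes_per_transfer = BUS_WIDTH_BITS // 8   # convert bits to bytes

# Best case: all sockets, all channels, all banks kept busy.
peak_bytes_per_s = SOCKETS * CHANNELS_PER_SOCKET * bytes_per_transfer * TRANSFERS_PER_S

# Worst case from the lecture: one socket (lose 2x), one channel (lose 8x),
# one bank (lose another 8x).
loss_factor = 2 * 8 * 8
worst_bytes_per_s = peak_bytes_per_s // loss_factor

print(peak_bytes_per_s / 1e9)    # 341.376 (GB/s)
print(worst_bytes_per_s / 1e9)   # 2.667 (GB/s)
print(loss_factor)               # 128
```

A factor of 128 between the two cases is what "more than two orders of magnitude" refers to.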

36:06	So how data is laid out in memory, and how you access data relative to how things are laid out in

36:14	memory, is incredibly important. Even though programming languages do not have explicit commands for how to

36:21	lay out data, sometimes people actually do tweak it by padding

36:31	their arrays, because they know what they want out

36:37	of their arrays and they know the memory architecture. Otherwise, one can still control

36:44	the access order. The compiler is supposed to figure it out, but it may

36:49	not figure it out. So later on, when I talk about compilers and

36:55	more on the software, I will highlight things one can do to help the

37:01	compiler figure out what to do. I promised I would say a little bit about

37:12	that. I guess I should stop here for a second

37:18	to see if there are questions. No. So, since memory is

37:32	a problem, why do people not do anything about this gap? I showed some slides

37:43	with this cartoon before, and you saw it. When I talked about the memory,

37:49	the clock rate for the memory chip itself is on the order of 200 to

37:55	400 megahertz, not on the order of gigahertz. So why not design it so

38:01	you get comparable clock rates? Well, it has to do with both the

38:09	desire to keep things dense and cheap and the CMOS technology, and here is

38:18	the reason for it. The CMOS technology is essentially a charge transfer technology. You

38:30	basically move buckets of electrons around, and either you store them in a bucket or they are

38:37	used to turn transistors on and off. Hopefully you remember something; I am sure

38:47	everybody has had a basic physics course about Ohm's law,

38:53	Kirchhoff's laws, etcetera. Wires have resistance. Then there are the capacitors that are

39:03	used for storing information in the memory. The transistors, in logic,

39:10	actually work the same way. They get turned on and off

39:14	depending upon the charge on the plates of the capacitor. So you have a

39:21	simple RC circuit. The time to charge and discharge this capacitor depends

39:36	on the product of the resistance and the capacitance. If you combine these two things,

39:41	you get basically a first-order differential equation you can solve, and you get an

39:48	exponential decay, or an exponential ramp-up, of the charge on

39:55	the capacitor. Now, the resistance is proportional to the length, which should be

40:07	intuitive, and inversely proportional to the cross section of the wire.

40:14	A thick wire has less resistance; it has more capability

40:19	to move electrons compared to a thin one. So the resistance is basically the length divided

40:28	by the cross section, the width times the height, or thickness, of the

40:36	wire. The capacitance is proportional to the area of the plates of

40:43	the capacitor and inversely proportional to the distance between the two plates of the

40:51	capacitor. Now, as we scale technology, we talked about things being, you

41:01	know, at the state of the art today, seven nanometers, and it used

41:05	to be 10, 14, 20, etcetera. The thing that gets you more transistors on

41:11	a die is that the ability to make transistors and wires with increasingly small dimensions has

41:20	improved. That is captured by a scaling factor, which I put as S in the formula

41:24	here. So assume that you reduce all dimensions except the chip size,

41:32	which in this case is kept fixed for the sake of comparison. Basically, you need

41:39	to get the information out of the chip, so somehow it needs to cross the

41:44	chip. That is why I kept L fixed in this case. But otherwise the

41:50	area of the capacitor, the width and the thickness of the wires, and the plate

41:56	distance shrink with the same scaling factor; everything shrinks except the chip. If you plug that

42:02	into this formula, what you discover is that the time constant

42:09	that tells you how quickly you can switch things is actually increasing. If you

42:15	make things half size, the RC constant doubles. So that is

42:24	why just keeping the clock rate the same as you scale feature sizes

42:31	down is a momentous challenge, because the simple basic laws of physics, unless you

42:39	change the physics, which they do in terms of the materials being

42:45	used, would cause things to get slower, not faster. So this is one of the

42:50	reasons why processor clock rates have not kept growing the way they used to do in

43:01	the early days of Moore's law; they have stabilized over the last decade

43:06	plus. It is the same thing with the clock rates internal to the memory, which have not

43:14	increased. So that is the reason why things are the way they

43:22	are, and you should not expect things to get fixed any time soon.
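The scaling argument above can be written out as a short derivation. The symbols are assumptions made for this sketch, since the slide's own notation is not in the transcript apart from L and S: \rho is the wire resistivity, W and H the wire width and height, \varepsilon the permittivity, A the plate area, d the plate distance, and V_0 the supply voltage.

```latex
\begin{align*}
V(t) &= V_0\left(1 - e^{-t/\tau}\right), & \tau &= RC \\
R &= \rho\,\frac{L}{W H}, & C &= \varepsilon\,\frac{A}{d} \\
R' &= \rho\,\frac{L}{(W/S)(H/S)} = S^2 R, & C' &= \varepsilon\,\frac{A/S^2}{d/S} = \frac{C}{S} \\
\tau' &= R' C' = S\,\tau
\end{align*}
```

With L held fixed and every other dimension divided by S, the resistance grows as S squared while the capacitance only shrinks as 1/S, so the time constant grows by a factor of S. For S = 2 (half-size features) the RC constant doubles, as stated in the lecture.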

43:30	Just two more comments and then I will stop talking about memory. First

43:37	of all, I mentioned earlier, or last lecture, in terms of memory integration, that the typical thing

43:46	for servers is that you have these modules that you stick in the sockets on

43:51	the circuit board, and then you have to push signals across the circuit board through

43:57	the sockets, which is relatively power hungry and makes things slow. So now that one

44:05	can get a lot of transistors on a die, one has also started to not only

44:12	use SRAM for cache, but also put this forgetful memory, DRAM, on the

44:20	die. Recent generations of POWER and also some of Intel's processors

44:29	have the same sort of single-transistor-cell DRAM technology or design on the

44:38	die. Some people also do something else, but I am not going to

44:44	cover that in this course: they try to move processing into the memory to

44:50	keep things close and get performance up. Then there is also something new,

44:57	actually, in terms of silicon memory technology, the first new thing that has happened in

45:02	25 years. It came into products about two years ago, and that is known

45:08	as 3D XPoint. I will not go into depth about that.

45:13	I just want you to be aware of it. The next slide is kind

45:18	of a nice summary, where on the horizontal axis you have the latency

45:28	associated with different forms of memory technology and design. So for

45:37	L1s and L3s, you don't use the same design; you use different design

45:40	principles because of cost-performance tradeoffs. There is a logarithmic scale on the horizontal axis for

45:51	time, and then the vertical axis gives you sort of a relative notion of how

46:04	slow things are. Now, I guess the next few slides are simply a summary

46:14	that I will go through quickly. DIMMs are the most common thing for servers, where

46:19	you have these slots on the circuit board that I talked about several different times.

46:23	We have the ranks; a rank is the set of memory chips that matches the memory

46:30	bus width, and you can have several ranks on the same DIMM, typically up

46:33	to four ranks. Also, memory channels, for electrical

46:42	reasons, mostly tend to be limited to a maximum of eight ranks per channel, regardless of

46:49	how they are distributed among DIMMs. Then there was this other form of integration,

46:58	which is now being used for GPUs, high-end GPUs, I should

47:03	say, and has also been used in some high-end servers, where you get

47:09	stacked memory integrated in the same package. That is the HBM type memory,

47:18	high-bandwidth memory. And then I just mentioned the new memory technology. Then

47:23	we went through the performance issues a little bit. This slide also gives you, at

47:29	the bottom, the energy aspects. I didn't talk about that too much for

47:36	memory; for the processors I spent a little bit of time telling you about power in the

47:41	processor lecture. Here is a little bit in terms of the energy

47:49	for memories of various flavors. Part of the reason for HBM

47:55	memory is not only the aspect that it can be

48:01	wider and better integrated and give higher access bandwidth, but it is

48:07	also significantly lower power in terms of energy per bit. And again, the bottom

48:18	line, the take-home message at the highest level, is that DRAM is not

48:24	random access by any means, and to get good performance, one needs to be aware

48:30	of that. We will be talking a bit about what one can do at the source

48:36	code level to help compilers do, hopefully, the right thing in

48:42	terms of access order. As for the data layout, languages have fixed rules,

48:48	so that is not something compilers can change. So this conclusion slide is

48:58	This is what I said. So I was going to switch the power

49:02	energy. But I'll take questions and then I'll make a couple of remarks

49:11	then I let suggest to the Then I will continue here Next

49:16	If there's no time left after So any questions? Mhm s.

49:33	I just want to, again, recap what we have talked about. It

49:37	is just to create an awareness of the high variability in performance of

49:45	memory architectures, both with respect to caches, and how cache associativity matters, and

49:56	how the cache replacement policies matter, and how write policies matter; that can make a

50:03	huge difference. And even the physical memory, where one typically doesn't pay

50:11	attention to its structure: its structure and organization across memory channels, etcetera, has

50:19	a huge impact. So that's why there's a lot of emphasis in compilers and

50:31	on code and how you write code to try to get good usage

50:38	of memory. No questions? Then why don't you take over and do the

50:45	demo before I dive into power? But should I, I guess, or you,

50:51	talk a little bit about RAPL as an intro before you talk about what it

50:55	does? Maybe that's the easiest. Sure, I can do that. So

51:02	honest. So this time before I switch screen. So that's why far

51:08	important time. RAPL is a tool at the chip level to give you

51:14	insight into power consumption. But for now, yes, take over.

51:24	Is my screen visible? Yes. Okay, great. Thank you.

51:34	Right. So this tool that Johnson just mentioned, it's called RAPL.

51:39	It stands for Running Average Power Limit. A little bit of background on it:

51:45	it was not originally developed as a means to measure the power, but it was

51:52	designed to run on the processors to set their power consumption limits in different

52:00	domains, which I believe the professor will talk about in detail later on. But it was

52:04	a tool to limit the power consumption of processors by reading registers that hold energy

52:12	and power consumption values of the processor. But we can tweak it a little

52:18	bit, and people did, and use it for power consumption

52:25	measurement. So the way you can do it is by using this code,

52:33	which is freely available on the Internet, which is called rapl-read

52:38	.c. This code goes into the processor and the file

52:49	system mainly, and reads those specific registers that hold the value for power consumption,

52:57	and it's written in a generic way so it can work with different processor

53:02	families. So we'll utilize this rapl-read.c to measure power consumption for some sample

53:10	codes. Now, on the left, I just have our console on Stampede2,

53:15	on a compute node. So the steps you need to follow are as

53:21	follows. When you need to use rapl-read.c, you don't need to

53:26	make changes to your code. So here I just have a simple matmul.c

53:32	program that we used for our second assignment. Again, I

53:40	have just a classic matmul that will be run. So the first step that you need

53:47	to do here is just go ahead and compile your code as you would do

53:53	with either the Intel compiler or GCC. Once you've done that, the next

54:00	step that you need to do is to come to this rapl-read.c

54:06	file, and around line number 800 or so you will see a sleep

54:14	call here, something like a sleep for 15 minutes, a second, or

54:22	something similar. You need to just comment that out. I'll just

54:26	remove it for now and replace it with a system call. The code that you will

54:31	be provided will likely have all these calls already in. So

54:37	all you need to do is just provide the executable name of your program in this

54:44	system call. And that's all you need to do in this rapl-read.c

54:49	file. So you compile your code, and you provide the executable name inside the system call.

54:59	The next thing that you need to do is to compile this rapl-read program.

55:05	You can use this command here to compile it,

55:12	and you will see you have an executable called rapl-read.
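
The pattern described in these steps (read the energy counter, launch the program, read the counter again) can be sketched in Python as follows. This is not the rapl-read.c code itself. The function names are hypothetical, and the sysfs path in `read_sysfs_counter` is an assumption about where the Linux powercap driver exposes the package-0 counter; the counter source is passed in as a function so the logic can be tried without RAPL hardware.

```python
import subprocess
import sys


def energy_delta_uj(before, after, max_range_uj):
    """Energy between two counter readings, allowing for one wrap-around."""
    if after >= before:
        return after - before
    # The counter wrapped past its maximum once between the two readings.
    return (max_range_uj - before) + after


def measure(read_energy_uj, max_range_uj, command):
    """Read the counter, run `command` to completion, read the counter again."""
    before = read_energy_uj()
    subprocess.run(command, check=True)
    after = read_energy_uj()
    return energy_delta_uj(before, after, max_range_uj)


def read_sysfs_counter(path="/sys/class/powercap/intel-rapl:0/energy_uj"):
    """Assumed powercap location of a package energy counter, in microjoules."""
    with open(path) as f:
        return int(f.read())


if __name__ == "__main__":
    # Stand-in readings so the sketch runs on any machine.
    fake_readings = iter([100, 250])
    used = measure(lambda: next(fake_readings), 1000, [sys.executable, "-c", "pass"])
    print(f"energy used: {used} microjoules")
```

On a node where the counter is readable, `read_sysfs_counter` could be passed in place of the stand-in reader, with the program to be measured given as `command`.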

55:16	Now, the way this works is that the compiled rapl-read.c program,

55:23	that C program, will run your program's executable in turn. So

55:30	if you look at the top and bottom of this system call here

55:36	in the rapl-read.c program, it measures the power consumption for this section

55:41	of the code. So the section at the top starts the power

55:45	measurement and the bottom stops the power measurement. So rapl-read.c runs your

55:50	program on your behalf. Now, to visualize how things are working, and to

55:59	make sure that the program runs on the core or socket which we want it to

56:03	run on, we will use a package called htop, and you can load it

56:10	with the module load htop command on Stampede2.

56:18	Then just go ahead and run htop. Now, this command is

56:24	very helpful. So remember, you have 96 threads because there's hyper-threading on

56:35	the compute nodes. You have two sockets, each with 48 threads. So

56:41	two sockets with 96 threads. So you can see the usage of each of these

56:48	hyper-threads using this htop command. Now, since we're still using a

56:55	single-threaded program, we would want our program to run on a single thread.

57:01	For that, we can use the command taskset and give

57:10	the parameters -c 0. What that means here is we want our program

57:17	to run on core zero, core ID zero. And we'll see

57:22	how that looks when the code runs. Then go ahead and provide the executable.

57:30	Now, why are we using taskset? What taskset does is it

57:34	pins your program to the core that you mentioned here. So this

57:38	is called process pinning, and the reason why we need to do that is

57:45	that rapl-read only reports the power at the socket level; it does

57:52	not provide power consumption at the core level or any finer level. The best you

57:58	can do is the socket level with rapl-read. So that's why we need

58:03	to make sure that our program runs on the cores that we want it to, so that

58:07	we can distinguish between the outputs to get the correct power

58:14	readings. So what I'll do here is I'll just go ahead and run this

58:19	command. And as you see, since we ran our program on core number zero,

58:24	its utilization just went up. Now, while this program is running,

58:31	notice here that this is the organization rapl-read found about the cores

58:37	and the processors that are present on this particular node. Now, the notation here:

58:43	the number outside the bracket is the core number, and the number inside these brackets

58:50	is the socket number. So we have two sockets, so we have socket

58:54	zero, socket one, zero and one, and so on. So, as you can see, the

58:59	operating system has generated a mapping of physical cores to these

59:06	logical cores in an interleaved manner. Core number zero goes to socket

59:11	zero, core number one goes to socket one, and so on. Now,
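
The interleaved numbering described here can be written down as a small Python sketch. It assumes the simple round-robin pattern from the demo (core 0 on socket 0, core 1 on socket 1, and so on) with two sockets and 96 logical cores; the function names are hypothetical, and on any other node the actual mapping should be checked rather than assumed.

```python
def socket_of(core_id, n_sockets=2):
    """Socket that a logical core lands on under round-robin interleaving."""
    return core_id % n_sockets


def cores_on_socket(socket_id, n_logical=96, n_sockets=2):
    """All logical core IDs that map to the given socket."""
    return [c for c in range(n_logical) if socket_of(c, n_sockets) == socket_id]


if __name__ == "__main__":
    print("core 0 is on socket", socket_of(0))
    print("core 1 is on socket", socket_of(1))
    print("first cores on socket 1:", cores_on_socket(1)[:4])
```

Under this assumption, pinning to core 0 places the program on socket 0, which is the socket whose readings are used later in the demo.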

59:19	okay, so the execution just finished. Now, how do we read the

59:24	output that was generated by RAPL? Remember, we pinned our program to core

59:30	zero, which was part of socket zero. But this thing is a little

59:37	bit confusing, so let's see. So package-0 was the directory

59:45	in which those files are, where power readings are written for socket zero, and

59:54	package-1 is the directory where the power consumption readings were written

60:00	for socket one. However, the program, when it went inside the directory to

60:05	read those files, found package-1 first. And that's why it's saying

60:11	that one is package zero, and it named the second

60:17	directory as package one. So we know that our core zero belongs to socket

60:23	zero, so we'll take these readings, as you can see, because our program

60:29	ran on socket zero. This is the energy consumption for just the processor part

60:38	of the motherboard. So only the chip that contains all

60:44	the 24 cores. So this is the consumption of all the 24 cores,

60:48	but since we used just one, we'll take it as the energy consumption for just one

60:55	core. The reason you see the DRAM energy consumption lower than the

61:02	one for the socket that we did use is because the matrix size is

61:08	really small. Leonard, if I run it again, you'll see here in the

61:11	htop output that memory usage pretty much remained close to zero.

61:16	So most of the transactions were from caches. But if you

61:21	run a large enough problem, which will take quite a lot of time to run,

61:26	you will see a much larger difference in the DRAM consumption of the socket that we

61:32	ran on and the socket that was idle for this whole time.
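
The package naming confusion described in the demo comes from relying on the order in which directories are found. A minimal Python sketch of the alternative, keying each zone by the package name stored inside it, is shown below. The directory layout and the `name` file are assumptions about the Linux powercap sysfs interface, and all function names are hypothetical; this is not how rapl-read.c is written.

```python
import re
from pathlib import Path


def package_index(zone_name):
    """Return N for a zone named 'package-N', or None for any other zone."""
    match = re.fullmatch(r"package-(\d+)", zone_name.strip())
    return int(match.group(1)) if match else None


def zones_by_package(zone_names):
    """Map socket index -> zone directory using names, not discovery order."""
    result = {}
    for directory, name in zone_names.items():
        index = package_index(name)
        if index is not None:
            result[index] = directory
    return result


def read_zone_names(root="/sys/class/powercap"):
    """Collect {directory: name} for RAPL zones under an assumed sysfs root."""
    names = {}
    for zone in Path(root).glob("intel-rapl:[0-9]*"):
        name_file = zone / "name"
        if name_file.is_file():
            names[zone.name] = name_file.read_text().strip()
    return names


if __name__ == "__main__":
    # Stand-in data mimicking a node where the first directory holds package-1.
    example = {"intel-rapl:0": "package-1", "intel-rapl:1": "package-0"}
    print(zones_by_package(example))
```

With a mapping like this, the readings for socket zero can be picked out regardless of which directory happened to be listed first.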

61:40	So yeah, that's mostly it for RAPL. Does anyone have any questions

61:47	about it? Just a quick summary: it goes and reads these files that hold

61:58	the energy consumption, and it runs your program and reports the energy consumption towards the

62:04	end. Yeah, and in the next assignment you will use it.

62:14	So that's right part of what I . So you have to cover

62:19	So if you have questions, please ask at this point. So RAPL can, as we were

62:24	talking about, measure a couple of things. It measures the chip power, and it

62:33	can also measure the memory power, the DRAM, separately. And so that's why when you

62:42	do a RAPL measurement, first you should make sure that you again use a compute

62:46	node and you use it exclusively. And then you should make

62:54	sure that you pin to a particular core. Yeah, pinning to a core is the

63:01	important part, because we're using a single thread and RAPL only reports at the socket level,

63:06	as you can see here. So we have two sockets and it's reported for both

63:10	sockets. For now, the taskset command would do for us, because we

63:20	just need to pin it to just core zero, since we're just dealing with a single

63:24	thread. But when we move to OpenMP and multi-threaded programs, we'll use

63:29	a different command that's called numactl. That gives you much more granularity on where

63:35	you can put the threads for your program: on which physical CPU, on which logical

63:41	CPUs. And it also provides the option to choose on what socket you want to

63:47	allocate the memory. But that's for the later part. But here you

63:51	can just use taskset with core zero. Okay,
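
As a small illustration of the pinning step, the Python sketch below builds the taskset invocation described in the demo and shows a Linux-only alternative that sets the affinity of the calling process directly. The helper names and the `./rapl-read` executable path are placeholders, not part of the course material.

```python
import os


def pinned_command(core_id, program_argv):
    """Prefix a command so that taskset restricts it to one logical core."""
    return ["taskset", "-c", str(core_id)] + list(program_argv)


def pin_current_process(core_id):
    """Linux-only alternative: restrict the calling process to one logical core."""
    os.sched_setaffinity(0, {core_id})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # "./rapl-read" stands in for whatever executable is being pinned.
    print(" ".join(pinned_command(0, ["./rapl-read"])))
```

Because the readings are per socket, the point of either approach is only to know which socket did the work, not to get per-core numbers.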

64:03	if there aren't any questions, I'll move to the second part of the demo,

64:08	which involves the use of ParaProf, which is the GUI-based profiler.

64:22	In assignment two, all of you used pprof, which

64:29	is the command-line-based profiler. Now, in

64:42	assignment three, we will provide you a couple of codes that you just need

64:48	to compile, and then you would need to choose the events from PAPI,

64:54	whichever events you feel would provide you insights into what those codes

65:01	are doing. So based on your experience from assignment two, you will choose

65:05	PAPI events and profile those codes and tell us if you think it's a compute-bound

65:11	problem or if it's a memory-bound problem. So in that case, you will

65:15	generate profiles using TAU, just like for assignment two.

65:20	But you will use the GUI profiler, which is ParaProf. So do

65:29	they, like, need to go to a login node to do that?

65:39	Yeah, so you can simply run paraprof instead of pprof

65:47	if you have X11 forwarding with your ssh connection. But with my experimentation last

65:54	year as well as this year, the X11 forwarding on Stampede2

65:59	is really slow, and the rendering of ParaProf takes a really long time.

66:04	So to deal with that, you can use VNC to connect

66:10	to the Stampede2 cluster in a forwarding mode. And then you can run

66:16	ParaProf, and it's much faster compared to X11 forwarding. So there

66:21	are a few steps that you need to perform. The first step is to set a

66:26	VNC password on Stampede2. Make sure, for these steps, you need to

66:31	be on the login nodes, since compute nodes are not connected to the

66:35	external Internet. So the first step we'll do is just set your VNC

66:42	password, which, according to Stampede2's user guide, should be different than your

66:47	login password. Just go ahead and set any other password. It asks for a

66:55	view-only password; just set it the same as above.

67:01	So once you've set this, Stampede2 already has this

67:08	batch file defined in their shared directory. So what you need to do is

67:14	just go ahead and submit this batch file using the sbatch

67:19	command. You can also provide the resolution in which you want the

67:24	VNC session to run later on by using the -geometry flag.

67:32	So just use that to submit this job with sbatch,

67:42	and its output will come out in a file called vncserver

67:46	.out. So just to read that, you can use this command: touch

67:51	vncserver.out, tail, and so on. And when you read that file, you

67:56	will see that we got a login node and a VNC port such as 5901 and so

68:03	on, and all this information. So this is the message that your VNC server is now

68:09	running on the Stampede2 cluster. The next step you need to do: if you're

68:15	using a Linux- or Mac-based laptop, then you can just go

68:21	ahead and use this ssh command to this host.

68:27	Give the local port and the Stampede2 port that you just got;

68:30	it's very likely the same: 12155, 12155. So this is to create a tunnel

68:38	from your laptop to the VNC server that we just started. So
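
The shape of the tunnel command can be sketched in Python as a helper that assembles the ssh arguments. The host name and user name below are placeholders, not the real cluster address; the port 12155 is the one mentioned in the demo and will differ from session to session. Only the standard ssh options `-N` (no remote command) and `-L` (local port forwarding) are used.

```python
def tunnel_command(local_port, remote_port, login_host, user):
    """Build an ssh invocation that forwards a local port to the VNC port."""
    forward = f"{local_port}:{login_host}:{remote_port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{login_host}"]


if __name__ == "__main__":
    # Placeholder host and user; substitute the values reported for your session.
    print(" ".join(tunnel_command(12155, 12155, "login.example.org", "username")))
```

While this ssh process stays running, a VNC viewer pointed at `localhost` and the local port reaches the server on the cluster.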

68:43	that's for Linux and Mac. For Windows, if you're using PuTTY, you can just use

68:48	PuTTY with port forwarding to create a tunnel. So to do that, just open

68:54	another PuTTY session, go to SSH and Tunnels, check these

69:05	boxes in the first option here. Use the source port that you see here for the

69:11	login node VNC. So let's add it here. And for the destination, we

69:17	want the same port on your localhost as well. So 12155 as local;

69:24	go ahead and add it. So during this whole

69:32	process, you need to keep your old console open. You should not close

69:36	it. And with those settings, go ahead and log in to Stampede2

69:42	again. And as soon as you log in, our tunnel will be open.

69:51	So that's all you need to do here. Don't close this console as

69:55	well. Once you're done, you can use any VNC viewer;

70:02	for me, I downloaded this VNC viewer client to connect to

70:10	your VNC server that's running on Stampede2. So just go ahead and make a

70:14	new connection. Again, provide the same port address, so 12155. You might get a

70:23	different port address when you try to do it, and for a different

70:28	session. So don't worry if it comes out a little bit different. Just create

70:34	a new session in the VNC viewer and click on it. It should have worked. I

70:44	think on the localhost you entered something other than 12155. Why is that?

70:49	Yeah, my bad. Let me do it again quickly. So 12155;

71:09	let's try it again. Okay, the tunnel is open. Let's try

71:27	again. Create a new connection. Let's see. Yeah, so it's

71:42	an unencrypted connection, so just go ahead and continue. Now it will ask for

71:46	the password, so put in the password that you entered using the vncpasswd

71:51	command. Go ahead and do that. What that's going to do is it's going

71:56	to open a VNC session. So let me go to a directory where

72:02	there is a profile. So this is from our previous demo,

72:08	where I showed you the use of TAU. Let's just go into any one of

72:13	these directories where we have a profile that was generated using TAU. And

72:18	now, instead of doing pprof, we can just do paraprof. We need

72:25	to do module load tau, then it will work, because ParaProf comes as a part

72:32	of TAU. And as soon as you do it, it's going to take a

72:36	second, and it's still a little slow when you use it. It might take

72:42	a while to render things, but it's better than X11 forwarding. So as

72:49	you can see, this is the profile of single-precision operations for our matrix multiplication

72:56	that we generated earlier. I m c s, that

73:02	is a little bit buggy, so you can ignore that for now.

73:07	So this is just simple: if you click on node zero here, it will

73:10	open another window that will show you detailed metrics for your program. So this

73:17	is again similar to what you have seen as an output for single-precision

73:21	ops. So here you see the single-precision ops for the classic matmul function and

73:29	other functions as well. If you want to go ahead and try

73:34	more functions of ParaProf, you can do 3D visualization as well

73:40	for this metric: just from Windows, 3D Visualization. Open that and you

73:47	can visualize these profiles in 3D as well. It may not

73:52	look very good over the Internet; it may be a little buggy, but here's the

74:01	function. As you can see, classic matmul would be here,

74:06	and the single-precision ops here. So by clicking with right-click, you can

74:11	move this whole graph; with left-click you can orient it whichever way you

74:18	want it, so you can play with it. There's lots of other

74:26	things in this, and you can feel free to play with it. Later

74:33	on, when you use, let's say, OpenMP or any multi-

74:40	threaded program, then you will likely see lots and lots more bars.

74:47	For now this is just a single-threaded program, so you see node zero

74:51	only. Let's say, if you have eight threads in your program, you will

74:54	see node zero to node seven, and so on. So it's a

75:00	rich tool. It's just a little slow using it through Stampede2, but feel

75:05	free to play with it. It's more intuitive than pprof, the command-

75:10	line profiler. That's pretty much it. So, any questions? Not a question,

75:19	but a comment. I think TACC has, like, a portal where you can do

75:23	the VNC sessions online in the web browser. I don't know if it would

75:27	be any better or any worse, just... Yeah, okay. I wasn't

75:33	aware of it. I'll check it out. If that's the case, then I'll

75:36	see what the steps are. Thank you. Any other comments?

75:54	So in that case, I'll stop sharing. Okay, thank you. So I

76:00	guess I'm sharing my screen again. I'm going to cover a couple more slides

76:13	before time is almost up. So I will repeat what I'm doing

76:19	now, and it's just getting you set for the next lecture. So

76:24	related to RAPL: that's why the last few years I've included issues about power, since it

76:34	is an important issue that is on everybody's mind, whether you

76:43	design processor chips or memory chips or servers, or whatever industry you're in,

76:50	or you're actually using stuff. And, as was just said,

76:54	RAPL was initially designed to control the power consumption in data centers. So here is

77:02	kind of the reason that, at scale, things matter a lot. So

77:10	a rule of thumb is that the power consumption during the course of a

77:17	year costs about a million dollars. So that means, if you look at the Internet

77:22	companies and other hosting companies that have large data centers, they spend millions,

77:31	if not tens of millions, on their data centers. In terms of Google and

77:35	these others, they spend hundreds of millions of dollars every year on

77:39	utility bills for electricity and cooling. And this is just showing a little bit that

77:47	you don't think of data centers as basically predominantly plumbing.
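
The arithmetic behind a cost rule of thumb of this kind can be sketched in Python. The transcript drops the power level the rule refers to, so the one megawatt of sustained draw used here is an assumption, as is the electricity price; both are illustrative inputs rather than figures from the lecture.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost_usd(power_kw, usd_per_kwh):
    """Cost of drawing power_kw continuously for one year."""
    return power_kw * HOURS_PER_YEAR * usd_per_kwh


def break_even_price(power_kw, annual_budget_usd):
    """Electricity price at which a year of power_kw costs annual_budget_usd."""
    return annual_budget_usd / (power_kw * HOURS_PER_YEAR)


if __name__ == "__main__":
    assumed_power_kw = 1000        # assumption: 1 MW sustained
    assumed_price = 0.10           # assumption: USD per kWh
    print(f"annual cost: ${annual_cost_usd(assumed_power_kw, assumed_price):,.0f}")
    print(f"price for $1M/year: {break_even_price(assumed_power_kw, 1_000_000):.3f} USD/kWh")
```

Under these assumptions a megawatt-year lands in the neighborhood of a million dollars, so a facility drawing tens of megawatts reaches tens of millions per year.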

77:53	Yeah, on the left side you see the impressive pictures in terms of the

77:59	number of servers, a bazillion servers. They rarely show you the plumbing that it

78:04	actually takes to operate a data center, as in the middle column. The other reason

78:12	why these things have become important is that since pretty much a decade back, the

78:21	cost of power and cooling during the lifetime of the system exceeds the cost of the

78:30	actual system itself. So in the lifetime cost of ownership, power and cooling

78:37	dominate, and that's clearly why, you know, the Internet companies and others have put

78:43	a lot of effort into trying to reduce the power consumption, and so do, you

78:49	know, Intel, AMD, and the chip vendors from their end. And this is just saying that,

78:56	yes, power may have an increased fraction of the total cost of ownership.

79:03	Yes, they have grown over time. But it's not the full reason

79:08	why it is so. Another part: if one looks at the total energy

79:13	consumption, electric energy consumption, of anything related to information and communication technology,

79:21	ICT for short, its share is a significant part of the total energy. And it turns

79:27	out that the data center part, which is kind of servers and disks, is

79:33	an increasing fraction of that. So it's part of the reason why

79:38	it has become a very big issue in the last few years. Since this was

79:47	something from about 10 to 15 years back. One is that clearly there are environmental concerns,

79:53	among other big issues. This is just showing a little bit the

79:58	evolution of surface temperature. It's a bit old, and I should try to find a

80:03	new one, but it's pretty clear. And here's the correlation with some of the

80:08	emissions. Of course, data centers are not the only ones, but again, data

80:13	centers and computing are an increasing part of the total energy consumption.

80:22	And a little bit of comparing the energy consumption of a server to your car, so that

80:27	might be a curiosity item. So the industry has taken notice. So if you look at

80:35	the big consumers again, like Facebook and others, what they do is

80:40	they try to use clean power. They use hydroelectric power and wind power

80:45	to a large degree, and they locate data centers close to where there is,

80:51	you know, plenty of hydroelectric power or, for that matter,

80:57	more recently, where they can get cooling at a low cost by locating them

81:05	in cool climates, like this Facebook data center. And I think I will

81:13	That's the environmental concern, but part of the reason that I'm going to get into

81:18	next, is how power consumption relates to the computing effort, the work that is being

81:25	done. And a bunch of years back, the point was made that typical servers did

81:35	not have energy-proportional computing in power. Yes, power went down with the

81:39	workload, but nowhere near in proportion to the workload. And that's what I want to

81:45	talk about next lecture. So time is up, so I will stop and take

81:52	questions, if there are questions. If not, I'll now stop the recording.
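
The closing remark about energy proportionality can be illustrated with a simple linear power model in Python. The model and the wattages are assumptions chosen for illustration: power is taken to rise linearly from an idle floor to a peak as utilization goes from 0 to 1, which is a common simplification and not a measurement from the lecture.

```python
def power_watts(utilization, idle_watts, peak_watts):
    """Linear model: idle floor plus a share of the idle-to-peak range."""
    return idle_watts + (peak_watts - idle_watts) * utilization


def energy_per_unit_work(utilization, idle_watts, peak_watts):
    """Power divided by work rate, normalised so that full load equals 1.0."""
    return power_watts(utilization, idle_watts, peak_watts) / (utilization * peak_watts)


if __name__ == "__main__":
    idle, peak = 100.0, 200.0      # assumed wattages, illustrative only
    for u in (0.1, 0.5, 1.0):
        print(f"utilization {u:.0%}: {power_watts(u, idle, peak):.0f} W, "
              f"relative energy per unit of work {energy_per_unit_work(u, idle, peak):.2f}")
```

In this model a server with a large idle floor spends several times more energy per unit of work at low load than at full load, whereas an ideally proportional server (idle power of zero) would stay at 1.0 throughout.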

82:05
