© Distribution of this video is restricted by its owner
00:02 | Okay. So today, what my goal is, is to give you an overview of the aspects of the platforms you're on, now and in the future, that are important for understanding whether the code you write uses the resources well or not. |
|
00:32 | So the aim is to give you the elements of processor architecture that are important for understanding resource use. |
|
00:47 | Okay. So that's the way the lecture is organized. It's with respect to types of CPUs first — as I mentioned at the beginning of the first lecture, trying to understand what's important for conventional server CPUs — and then we look a little closer at GPUs, or accelerators; for today the focus will only be on GPUs, trying to understand them a little bit more. |
01:38 | For a long time it was the so-called x86-type architectures that dominated for the instruction set. The story goes way back: Intel had models of processors labeled something-86, and that instruction set is now run not only by Intel but also by AMD and a few others. |
|
02:13 | A number of things have started to happen in recent years: there is a greater diversity of architectures, and in particular the ARM architecture, known for being energy efficient, has now entered the general computing market, not just the mobile computing market. |
|
02:34 | So I'll give a bit of background on these things. In general, platforms and processors have become what's known as heterogeneous; I'll try to bring that up just to give you an orientation of what's out there, and to help you assess the things you will be using in the assignments. |
03:03 | High performance, as far as I'm concerned, is synonymous with high efficiency, and getting high performance, or good resource efficiency, means you need to understand what the capabilities of the platform you're using are. |
|
03:25 | You also have to understand both the application and the code. They are not synonymous: you have an application in mind, you find some algorithms for it, and then you code it up, and there are several different steps and choices along the way; given your choice of algorithms and their implementation, you produce the code. So one needs to understand both ends — the platform at the bottom and the application at the top — and how the one is mapped onto the other. |
03:58 | So the things I'm going to stress today — it's kind of a bit of a laundry list. It's important to understand the degree of parallelism that exists even in a single core or a single processor; there is more to it than just the number of cores. We'll bring that up as I talk about the various processors today. |
|
04:27 | There are things to pay attention to when one looks at the specifications of the various processors: what's shared, what's not, and what's private to the cores. Typically that has to do with parts of the memory hierarchy, and with the buses that feed the cores, which are shared. |
04:57 | Another important aspect is whether the processor has some form of SIMD instructions — single instructions that operate on multiple data — typically called vector instructions. There are also VLIW-like (very long instruction word) designs, which have a little bit more flexibility than SIMD instructions, so the two are not identical or synonymous. |
05:33 | Another thing you need to look at is the kind of description you get, often from marketing materials, where the facts are sometimes hard to come by. Because power plays such an important role in modern processors, most of them keep the clock frequency firmly under their own control. |
|
05:55 | So, as I mentioned last time, you don't have full control: the clock frequency is being managed by the processor itself. In trying to understand whether the code is making good use of the platform, one also needs to understand, to some degree, what's going on under the hood. And then there is the memory system, which, as I mentioned, is the weakest part of the system. |
06:25 | Depending on the application, different aspects of the memory system are critical. With respect to the various levels in the memory hierarchy, there are the bandwidth issues; there are also things like the cache line size that are important, and what happens when there are capacity or conflict misses — the rules for writing into the cache, and for replacing what's in the cache. |
|
07:00 | Then there are other things, such as what is known as cache associativity, which has to do with how many places in a given cache a data item can be written to. Today I will mostly mention these attributes of a processor; I'll talk more in a later lecture about the details of the memory hierarchy. And then we also need to pay attention to the data paths that are available for moving data around between caches, on the same chip. |
07:51 | I guess it will still be hard to figure out, in this virtual environment, how many of you are familiar with these concepts; for many, these may be kind of new. I don't quite know how to gauge that — maybe, if you have a suggestion, I can monitor it to some degree. The next lecture will be face to face again, so maybe we'll deal with it at that time. It's hard to say. |
08:29 | Okay, so this is what I'm going to try to point out when I talk about the various processors in the next slides today. It's a preamble to prepare you for what I'm going to focus on, but this is just a cartoon, and hopefully most of you have seen it in one form or another before. |
|
08:58 | I'll do my best to illustrate it. In each processor there tends to be a collection of cores, each with some form of processing logic: floating-point units, integer units, comparators, branching logic, instruction decoders, and all kinds of other things — basically a complete computer, in terms of being able to decode and make use of instructions and reference memory. |
09:33 | Then there are the hierarchical caches. Today, when it comes to server processors, most of them have three levels of cache included in the processor, and they're part of the same piece of silicon. The data rates between the different levels of cache and main memory vary quite a bit, and so does the latency — the distance, in terms of processor cycles, to where the data lives. I'll talk in detail about that when I discuss the various processors. |
10:18 | You can see, in this case, if you look at a single core — which is one column in this picture — that the data rate between the functional units of the processing logic and the level-one cache is more or less the same as the data rate to main memory; typically a little bit less. |
|
10:58 | But if you look at the numbers in terms of gigabytes per second, you will notice that it doesn't take many cores needing access to main memory before main memory becomes the limiting factor. The numbers on this slide are peak numbers: if you don't have any data reuse, you basically saturate the data rate to memory. |
|
11:34 | It's also about two orders of magnitude — actually more — between the latency, or the distance in terms of compute cycles, to main memory compared to the level-one cache. So these are the quantitative numbers, and one does best to try to understand how codes can work on the data without being bandwidth limited, in the way we talked about with the roofline model in the last lecture. |
12:11 | Here is a more or less similar picture, and a reminder of terminology: registers are the storage closest to the execution units, and the functional units — adders, multipliers, load/store units — all operate on data in registers. The register file, as the collection is sometimes called, is basically a single cycle away: you can grab things out of registers and put data back into registers in a single cycle. |
|
12:51 | That also applies to bandwidth: the bandwidth between the register file and the functional units is high, in order to be able to keep the functional units fed. But as you move away from the registers, the ability to move data is reduced, and, as discussed before, wiring complexity comes into it. We'll talk a little bit more about that in some slides to come, but there are tens of thousands of wires or more on a single piece of silicon, so the wires, not the logic, tend to be what defines the chip area. |
13:51 | Then there are the different levels of cache, and I'll say a little bit more about that on the next slide. One thing I wanted to add: as I mentioned last time when I talked about STREAM, processors don't strictly follow the notion that data moves from one level of cache to the next, and so on, from main memory to the execution units and then back. Cache policies are sometimes not helpful and can actually hurt performance, and for that reason many architectures have a way of bypassing caches; sometimes that's called non-temporal stores — for instance, when you want to write things back to memory and bypass the caches. |
14:58 | Perhaps not everyone is familiar with main memory technology. Main memory is built from what is called DRAM, for dynamic random access memory — though it's by no means random access, and I'll talk about that in a subsequent lecture on the design of those chips. The main memories use something called double-data-rate (DDR) designs, where DDR is followed by a number giving the generation of the design specification, or standard, for these double-data-rate memories. |
|
15:34 | So recent-year PCs and server processors use DDR4; before that it was DDR3, and there's still some of it around, for various reasons, depending on what the market for the processors is, so some use an older memory technology. There is now indeed a DDR5 standard, but it is not yet supported; I believe the new processors coming out towards the end of the year are expected to support DDR5, but they are not yet in production. Beyond main memory there is basically disk, and we're not dealing with that kind of storage in this course. |
16:30 | Here is, again, a little bit of the same picture as before, but something I have encouraged everyone to do to try to understand performance: follow the data. Where does it start and where does it end up? For pretty much most applications, data starts in main memory — not in any part of the cache hierarchy. It may start somewhere else: in the case of real-time or streaming processing, maybe some I/O input. And as for the results, obviously the registers are not the end point; the results have to come out of the system. |
17:23 | In most applications it is the case that there is more input data than there is output data. That's why the read arrows are the thick arrows in my graph here. |
|
17:45 | As a simple example of more data being read than written, think about a matrix-vector multiplication, where you have A, x and y. The vector y is the result; that's the thing that gets written back to memory, or to some other form of output, and the input can be by far the largest part. If the matrix is 1000 by 1000, it has a million elements, but y has just 1000 elements. So it's not just a small factor; there can be a huge difference in the amount of data to be loaded compared to what is stored. |
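As a back-of-the-envelope check of this read/write asymmetry, here is a small sketch; the 1000-by-1000 size is the lecture's example, and the 8-byte double-precision element size is an assumption:

```python
# Input vs. output traffic for a matrix-vector product y = A x,
# with an n-by-n matrix A of 8-byte (double-precision) elements.
n = 1000
elem_bytes = 8

bytes_read = (n * n + n) * elem_bytes   # read all of A and all of x
bytes_written = n * elem_bytes          # write only y

print(bytes_read, bytes_written, bytes_read // bytes_written)
```

The ratio of loaded to stored bytes comes out around a thousand to one, which is the asymmetry the arrows in the figure express.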
18:22 | As mentioned last time, a lot of processors are designed to do very well on matrix multiplication. Of course, not everything is matrix multiplication, but it has been a core operation for a long time and still is: if you look at machine learning and many other types of applications, it is still the case that input data tends to be more voluminous than output data. |
19:05 | The black arrows show the capabilities of the system, and the red arrows point to the needs, rather than the capabilities. The capabilities — the black arrows — are a function of design trade-offs, what's costly and what's not, and of the assumption that many applications have some opportunity for data reuse. |
|
19:32 | So if you design your algorithm — pick the algorithm well — and the realization or implementation of it in software is such that you can make effective use of the memory hierarchy, the data traffic can stay down. That's why the point is to try to get all the pieces to work together, so that the performance is ideally determined, I'd say, by the level-one cache and not by main memory. That is again the roofline, and the arithmetic intensity we talked about last time, where all of that plays out. |
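The roofline reasoning referenced here can be written down in a few lines; the peak and bandwidth figures below are illustrative round numbers, not any particular chip's:

```python
# Roofline model: attainable performance is the lesser of peak compute and
# memory bandwidth times arithmetic intensity (flops per byte moved).
PEAK_FLOPS = 2.0e12   # flop/s, illustrative
MEM_BW = 200e9        # bytes/s, illustrative per-socket bandwidth

def attainable(ai: float) -> float:
    """Attainable flop/s at arithmetic intensity `ai` (flops per byte)."""
    return min(PEAK_FLOPS, MEM_BW * ai)

for ai in (0.125, 1.0, 10.0, 100.0):
    print(f"AI={ai:7.3f} flop/B -> {attainable(ai)/1e9:8.1f} Gflop/s")
```

At low arithmetic intensity the bandwidth term dominates (the sloped part of the roofline); only at high intensity does the flat compute ceiling apply — which is the lecture's point about data reuse.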
20:25 | Basically, how things work: as I mentioned, registers are part of the architecture, so they're just a cycle away — it takes one cycle to move data from registers to the functional units. There is usually enough wiring to move all the operands needed for an instruction from the register file to the functional units and back, and all these data paths can be operated in parallel. So if the data is in registers, one can operate the functional units at full speed. |
|
21:16 | But the other point I try to make here is that this means the data rates — the ability to move data — are quite extreme on modern processor chips: on the order of tens of terabytes per second at the innermost level, if one adds up all the functional units. |
21:45 | As I said, that means tens of thousands of wires. We also know — I think I showed some pictures of processor packages — that when you buy a processor from Intel or AMD or somebody else, it's something that, if it's a big one, is maybe an inch by an inch, or a little bit more. You cannot realistically fit tens of thousands of external connections on that kind of area. So it's just not feasible to bring everything out, for mechanical reasons. |
22:32 | So, in order to cope with all these limitations of the technology, caches exist, and they have been in existence for a long, long time. What has happened over the years is that we've gotten more and more of them — not just bigger sizes, but a hierarchy, with levels serving different scopes. As I said, there are typically three levels of cache on servers and PCs today. |
|
23:01 | Level one is designed to be pretty much as close to the functional units as the registers — well, not quite: sometimes it's just a single cycle between level one and the register file, but sometimes it's a couple of cycles. As you get to levels two and three, they are further away. The reason is also that the level-one caches, by being designed to operate at full speed, tend both to consume more power and to take a bit more area per bit than levels two and three. So it's again a trade-off based on the properties of the technology used to implement caches — and why one has this hierarchy, and not just one level of cache. |
24:06 | Another thing about bandwidth, in terms of the processor specifications I will talk about later: what one does is look at the width of the data path and the rate at which it operates. In most chips today, different parts of the same piece of silicon operate at different clock frequencies. That means different levels of cache may operate at different clock frequencies, and the buses, or data paths, on the chip may operate at different clock frequencies than the caches or the functional units. |
|
24:45 | So one needs to pay attention not only to the width of the buses but also to the data rate at which they operate, in order to actually understand what's feasible and whether the resources are being used well or not — because, after all, main memory is several hundred cycles away. |
25:15 | So this is a bit of a recap — I didn't talk much about it, but it was on the slides in the last lecture — just to get some perspective on things. One can look at it from the bandwidth perspective, which is the upper graph, or from the latency perspective, which is the lower graph on this slide. |
|
25:46 | This plot was first made, not all that long ago, by John McCalpin, who is now at the University of Texas in Austin. He is generally known as "Mr. STREAM": he was the one who first came up with the idea of the STREAM memory benchmark, which has become widely used even today, many years after he came up with the idea. |
|
26:10 | The point of the slide, in terms of state-of-the-art processors, is that memory doesn't quite keep up, in some sense, with the capabilities of the processing, or functional, units on a chip — the ability to feed the chip. If one takes a recent generation of Intel processors — the 8380, released this year — you can do about 100,000 floating-point operations in the time it takes just to get one data word out of memory. |
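That ratio can be reproduced roughly with a few assumed numbers; the core count, SIMD width, frequency, and latency below are illustrative guesses, not the actual specification of any chip:

```python
# How many floating-point operations fit in one main-memory latency?
# All figures are illustrative assumptions for a modern server chip.
cores = 40                 # cores per chip
lanes = 8                  # 64-bit lanes in a 512-bit SIMD unit
fma_units = 2              # FMA pipes per core
flops_per_cycle = lanes * fma_units * 2   # an FMA counts as 2 flops
freq_hz = 2.3e9            # clock frequency
mem_latency_s = 90e-9      # main-memory latency (several hundred cycles)

peak = cores * flops_per_cycle * freq_hz   # chip-wide peak flop/s
flops_per_access = peak * mem_latency_s
print(f"peak {peak/1e12:.1f} Tflop/s, "
      f"~{flops_per_access:,.0f} flops per memory latency")
```

With these guesses the answer lands in the hundreds of thousands of flops per memory access — the same order of magnitude as the lecture's "about 100,000" figure.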
27:00 | That gives some sense of how hard it is, or how good one has to be at using memory hierarchies, and of what it takes of an application, or of the computation, to actually sustain performance close to what the functional units on a chip can do. So most applications, as I mentioned the other day, end up being memory-bandwidth limited. |
|
27:37 | If you look at the latency, it's not quite as bad. In the bottom graph, the red line is really looking at the number of floating-point operations you can do in the time it takes to fetch a single data item from memory, and that hasn't changed much since John McCalpin's first graph. The lower curve is just for reference, putting the upper blue curve on the same axes. But this again gives perspective on what it takes, both in picking algorithms, if that's possible for the application you have, and then in watching what the code does, in terms of actually realizing what's possible. |
28:22 | A little bit more, and then I'll stop for a second and ask if there are questions. So here is, again, the simple STREAM triad, or what you do in a matrix-vector multiply: the inner loop of matrix multiplication is basically a multiply-add instruction. Let's try to work out what it takes in terms of, again, moving data — the theme being, consistently, that memory bandwidth, or the memory system, is the weakest part — so here is a way of putting some numbers to that claim. |
|
29:05 | Suppose the operation takes three loads and one store: A, B and C need to be loaded, and then you store the result. You also need addresses for the operands; for simplicity, assume the addresses are also 64 bits, like the 64-bit data — a common format today. Worked out, that comes to 56 bytes per cycle to sustain this single operation. |
29:48 | So if you run something at 2.5 GHz — which is fairly typical; some processors are faster, as you will see today, and some may actually run at 4 GHz or even a little more, so this is by no means the worst case — this is what a single thread needs when doing this operation. |
|
30:12 | Now think about what a single processor chip needs: if you have 256 threads running on the same piece of silicon, you need about 36 terabytes per second in order to sustain the ability to execute this instruction every cycle. |
30:36 | Now, if you look at typical servers, they use so-called dual in-line memory modules, or DIMMs — we'll talk more about that in a later lecture, but that's the kind of memory module you plug onto the motherboard. These are now DDR4 — again, the fourth generation of the double-data-rate memories — and the 3200 gives you the clock rate of that memory; again, we'll talk more about it later. This is the top-of-the-line memory module supported by PCs and servers today, and it can do 25.6 GB per second. |
|
31:24 | So there is basically more than three orders of magnitude of difference. If you tried to feed the functional units directly from main memory at 36 terabytes per second, you would actually need 1400 memory modules of this type for a single processor chip. That's obviously not realistic — in the first place, you can't get that many memory channels onto a single processor chip, and if you multiply it out, 64 bits times 1400, that is close to 100,000 wires you would need to get out of the processor. So that doesn't really work. |
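The arithmetic in this passage can be checked in a few lines, using the lecture's own figures (56 bytes per cycle, a 2.5 GHz clock, 256 threads per chip, and 25.6 GB/s per DDR4-3200 module):

```python
# Memory bandwidth demanded by a multiply-add moving 56 bytes per cycle.
bytes_per_cycle = 56
freq_hz = 2.5e9
threads = 256
dimm_bw = 25.6e9   # bytes/s for one DDR4-3200 module

per_thread = bytes_per_cycle * freq_hz   # demand of a single thread
per_chip = per_thread * threads          # demand of the whole chip
dimms = per_chip / dimm_bw               # modules needed to keep up

print(f"{per_thread/1e9:.0f} GB/s per thread, "
      f"{per_chip/1e12:.2f} TB/s per chip, {dimms:.0f} DIMMs")
```

This reproduces the roughly 36 TB/s per chip and the 1400-module count quoted in the lecture.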
32:25 | So this also says that, in order to actually use such a chip, we need applications that have potential for data reuse, and then we need to figure out how to realize that potential. But there's also another aspect that is critical, and that's the other part of the story: the energy consumption. The energy cost for this type of memory module is about 5 picojoules per bit, which doesn't sound like much. |
|
32:56 | But if you were to operate it at these 36 terabytes per second, it means the memory would consume about 1.4 kW. So that's again not realizable in anything you would use; this is, I would say, four, five, six times as power hungry as the processor itself. In the end it's just not feasible. |
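The power figure follows directly from those two numbers; both the ~36 TB/s demand and the ~5 pJ/bit energy cost are the lecture's rough figures:

```python
# Power needed to stream 36 TB/s out of DRAM at roughly 5 pJ per bit.
bw_bytes_per_s = 36e12     # bytes/s, the chip-wide demand
energy_pj_per_bit = 5.0    # rough DRAM access energy

watts = bw_bytes_per_s * 8 * energy_pj_per_bit * 1e-12
print(f"{watts/1e3:.2f} kW")
```

That lands around 1.4 kW for memory traffic alone, several times the power budget of the processor itself.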
33:34 | So I'll stop after this slide. This just gives a little bit of comment, in terms of energy, on why it's necessary to use caches. You don't need to absorb the whole slide; the relevant rows in this little table are the ones that say "SRAM" — that is the kind of technology caches are made of. In the best case, using caches versus having to go to main memory is a factor of 100-plus difference in energy consumption. |
|
34:24 | So, from an energy perspective, it's necessary to make good use of caches, just as it is from a performance perspective. I can stop here and see if there are any questions, and then I will switch and give you some processor examples. |
34:56 | Okay, so let's move on to processor examples. Here is just a little overview of recent generations of processors; the last rows are GPUs. The thing I wanted you to pay attention to here is the memory bandwidth that these processor chips support. The top-of-the-line ones support up to eight memory channels today, and they use DDR4 memories — the 3200 version is the fastest they tend to support — and that means they get about 200 gigabytes per second per socket. |
|
35:46 | The bottom ones are the GPUs, and the thing to notice for them is that, depending on the model, they have four to six times the memory bandwidth — such is the gap relative to a CPU. |
|
36:27 | The one in the middle, the Intel Atom, I just put in there as an example of something typically designed for mobile use — processors that were, and in some cases still are, used in cell phones. They don't have as much memory bandwidth or as many memory channels, and they also run a lot cooler: they may use on the order of ten watts, where the rest of them run at maybe a couple of hundred. |
|
37:05 | The other thing I should comment on on this slide, in terms of memory technology: most of the entries say DDR and a version number, and then there's HBM. That stands for high-bandwidth memory, which I will also talk about in a subsequent lecture when I discuss memory in more detail; these memories are integrated with the processors in a different way, and that allows for the higher bandwidth that gives the name to the technology. |
37:43 | So this is a little bit of a summary of what I will talk about in terms of the number of threads, and the level of parallelism, in each processor. Today there are up to tens of thousands of threads — that is for GPUs, where it's in the several thousands and up — far more than for typical CPU processors. |
|
38:12 | But one also needs to remember the limited type of parallelism in GPUs, because they are SIMD-like. Things need to be regular: you need an application in which you can organize a single instruction to operate on many data items at the same time. That is not necessarily the case in a CPU that has VLIW-style instructions, so there is more flexibility there in what you can do in a single instruction. |
|
38:48 | So now on to the examples — and questions are welcome meanwhile. Thanks. |
39:10 | The first example is, I would say, among the most ambitious, complex and feature-rich processors out there today: the IBM POWER series of processors. They have lots of features in them, and it's a big-core design, which means they don't tend to have a very high core count on a single piece of silicon, because each core requires a fair amount of real estate. |
40:01 | One thing — again coming back to data paths, which are one important aspect of processors and of what you can do with them: the little graph in the middle shows you the widths of the data paths, in this case between the level-two and level-one caches. It doesn't show the path between level one and the core, because the way IBM does it for this chip, the level one is included in what they label "core", so it's inside that piece of silicon. |
|
40:42 | But it says the ability to move data from L2 to L1 is 256 bits wide, whereas the ability to write things from L1 back to L2 is only 64 bits. So there is a factor of four between the ability to load data and to store data between L1 and L2. Between L2 and L3 it's kind of balanced; I think it depends on the applications they have targeted and what they see as important. In this case they obviously came to the conclusion that between L2 and L3 it was fine to have pretty much the same capability for loads and stores. When it comes to L3 and main memory it's 2-to-1; 2-to-1 is common in most processors, but the asymmetries show up a little differently in this case. |
41:57 | The market IBM targets with this chip is mostly transaction processing and database processing — not so much scientific and engineering computation, nor the internet-type applications: you don't see many POWER-type architectures at Amazon or Google or Microsoft, for that matter. |
42:37 | Other things to pay attention to are the cache sizes. In this case the level one is 32 kilobytes. Most processors have separate instruction and data caches at level one, but at level two they tend to be combined — or unified, as it is sometimes called — so data and instructions share it. At level one they tend to have their own, and 32 kilobytes is a fairly common size for L1 data and instruction caches. |
|
43:17 | When you come to L2, as I mentioned, they tend to be bigger; in this case L2 is 512 kilobytes, so considerably bigger than the 32 — a factor of 16, so it's noticeable. The other thing that is somewhat unusual for this architecture is the cache line size — that is, the amount of data that is moved together between the cache levels and main memory, managed as a block that is kind of atomic in terms of moving data around. Here it is 128 bytes, which is bigger than the most common size, 64 bytes. |
44:17 | One more thing to point out, as I mentioned, in terms of the caches, is what's known as associativity. Maybe you know what that is, but if you don't: it's the number of places in the cache to which a cache line can be assigned. In a four-way set-associative cache, it means a cache line taken from memory can be placed in one of four places in the cache; it cannot go in an arbitrary place. |
|
45:01 | At one extreme of associativity are direct-mapped caches: there is a one-to-one correspondence between a location in memory and the place in the cache where it can go — it can only go in one place. So if you need to load something, whatever is in that location may be overwritten, if that's allowed; otherwise it first needs to be written back to memory before you can load the new data that goes in that place. At the other extreme is the fully associative cache, in which whatever comes from memory can be stored in any place in the cache. Then there are the different cache policies — I will talk later about how, when you have a choice, you choose where to store the data. |
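The placement rule being described can be sketched concretely; the cache size and line size below match the POWER-like figures from this slide (512 KiB, 128-byte lines), but the function itself is a generic textbook set-index computation, not IBM's actual implementation:

```python
# Which cache set can hold the line containing a given memory address?
# A minimal sketch for a hypothetical 512 KiB cache with 128-byte lines.
CACHE_BYTES = 512 * 1024
LINE_BYTES = 128

def cache_set(addr: int, ways: int) -> int:
    """Return the set index the line containing `addr` maps to.

    Each set holds `ways` lines, so higher associativity means fewer,
    larger sets; fully associative collapses everything into set 0.
    """
    total_lines = CACHE_BYTES // LINE_BYTES   # 4096 lines in total
    num_sets = total_lines // ways            # sets shrink in number as ways grow
    block = addr // LINE_BYTES                # line-aligned block number
    return block % num_sets

addr = 0x12345680
print("direct-mapped set:", cache_set(addr, 1))     # exactly one legal slot
print("4-way set:        ", cache_set(addr, 4))     # one of 4 slots in this set
print("fully assoc. set: ", cache_set(addr, 4096))  # any slot is legal
```

In the direct-mapped case, two addresses that share a set index evict each other even if the rest of the cache is empty — the conflict misses mentioned earlier; associativity is the knob that softens this.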
46:05 | Any other things? So, in this case, it's a power-hungry processor. The number given is what's known as the TDP, the thermal design power: that is the power dissipation that the processor is designed to be able to sustain. It's not necessarily, as I mentioned last time, the maximum power that the processor will ever use and that needs to be cooled. The TDP is a commonly quoted number, but it's important to realize that it does not stand for the maximum power consumption of the chip — as I mentioned last time, that has caught out computer vendors before. |
47:07 | I guess one more comment on this chip: as I said, it's been targeted at transaction processing and database applications, so the memory bandwidth is particularly critical. For that reason the chip itself has channels to memory, but they actually added a kind of buffer memory, if you like, that replicates each one of the memory channels on the processor chip four times. So in effect it has a total of 32 channels to main memory. That also means you can have a lot more main memory on a single processor than most other chips can, and when you run things on these systems, you have a better chance of keeping things cached in main memory than in most other designs. |
48:09 | So here is an Intel Skylake now. I'm probably not going to spend as much time on each of the other processors; I'll try to point at the same kinds of things and just call out some differences. Skylake is the one being used on Stampede2. It's not the most recent Intel generation — I have a slide on that — but this is what you encounter. It has 28 cores, so a little bit more than the number of cores on the IBM processor I talked about. |
48:51 | about Yeah, the other is instead forgot to mention on this uh IBM |
|
|
48:59 | But let me go back and bring up one thing I forgot to mention on the IBM slide; it is up here in the corner. There is a little bit of confusing terminology in terms of threads when it comes to processors, and that is because slightly different mechanisms are used to handle multiple threads on a single core. IBM calls it simultaneous multithreading, or SMT, and the cores on the IBM system are basically designed to be able to manage four threads concurrently within the same core.

49:54 | Well, for now that's good enough.
|
|
50:06 | On the other hand, if you look at the x86-type architectures (unlike the Power processor, which runs its own instruction set), that is, the instruction set that originated with Intel way, way back: the common thing there is that the cores are designed to be capable of managing two threads at the same time. Intel calls it hyper-threading, and so does AMD, but the mechanisms for handling the threads are different, and that is partially why the different terminology is fair, so one doesn't confuse the different mechanisms. We don't need to get into the mechanisms in this course, but the thing to be aware of is that both the chips on Bridges-2 as well as Stampede2 are designed to manage two threads per core.
|
|
51:14 | Now, system admins can configure whether they enable hyper-threading or not, and I don't quite remember the settings here; I was about to ask. Many sites turn off the use of hyper-threading, and the reason, going back to an earlier slide today, is that not everything in the core is private to the core or replicated for the threads; most things are in fact shared between the threads operating in the same core. So that means if you have more than one thread, the threads compete for the same resources and may in fact degrade performance rather than improve it.

52:18 | Hyper-threading is good when, for instance, one thread is waiting for memory to deliver things; maybe the other thread has its data, so it can proceed, and then there is not really contention for, say, functional units and what have you, and multiple threads may be a win. But for many well-designed codes that is not the case, and then multithreading loses. So I think in the past, if I remember correctly, Pittsburgh turned off multithreading for Bridges, and maybe Stampede2 had it enabled.
|
|
53:07 | Is that correct? Yes, Stampede2 has hyper-threading enabled and Bridges doesn't, right. I don't know for Bridges-2 what the status is; I think it still has hyper-threading turned off. But this is just a point to be aware of: again, when you try to understand code and code behavior with this notion of hyper-threading, one shouldn't make the mistake of believing that when you use more than one thread per core, the performance doubles. As I said, it shouldn't be expected to be great, because so many resources are shared and there may be contention for the same resources when you enable more threads per core.
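The intuition above (hyper-threading helps stalled, memory-bound threads and does little for well-tuned ones) can be sketched with a toy model. This is an illustration of the argument, not a measurement, and it ignores the shared-cache and functional-unit contention that can make real SMT slower:

```python
# Toy model of SMT: a thread issues work only (1 - stall_fraction) of the time.
# Two threads share one core's issue slots, so combined utilization caps at 1.0.
# Illustrative only; it ignores the contention effects discussed above.

def smt2_speedup(stall_fraction: float) -> float:
    """Throughput of two SMT threads relative to one thread on the same core."""
    busy = 1.0 - stall_fraction
    return min(1.0, 2.0 * busy) / busy

print(smt2_speedup(0.7))  # heavily memory-bound: the second thread fully pays off
print(smt2_speedup(0.4))  # partially stalled: a more modest gain
print(smt2_speedup(0.0))  # well-tuned compute-bound code: no gain at all
```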
|
|
53:57 | The other thing I didn't comment so much on for the IBM processor is this notion of very long instruction words, or SIMD features. Basically, similar to what is the case for GPUs, there is replication of the floating-point units that allows you to do many of the same multiply-adds in a single instruction in the same cycle. So you get very wide instructions, and if the application and the code are such that you can pack the operations into a single instruction, then you get close to the peak performance.

54:54 | And one thing to notice in this case: the level-one caches are very small on the IBM Power chip; they have a little bit more, I believe, at level two, and so on. It is also a power-intensive chip, at about 200 watts peak, or TDP, the thermal design power.
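To see how these wide multiply-add instructions connect to quoted peak numbers: peak FLOP/s is cores times clock times SIMD lanes times two (a fused multiply-add counts as two operations) times FMA units per core. The figures below are illustrative assumptions, not this chip's specs:

```python
# Peak FLOP/s = cores * clock * SIMD lanes * 2 ops per FMA * FMA units per core.
# All figures are illustrative, not taken from a vendor datasheet.

def peak_gflops(cores: int, ghz: float, simd_doubles: int, fma_units: int = 2) -> float:
    return cores * ghz * simd_doubles * 2 * fma_units

# A hypothetical 28-core chip with two 8-wide (512-bit) FMA units at 2.0 GHz:
print(peak_gflops(28, 2.0, 8))               # 1792.0 GFLOP/s
# The same chip if the code never vectorizes and uses one scalar FMA pipe:
print(peak_gflops(28, 2.0, 1, fma_units=1))  # 112.0 GFLOP/s
```

The large gap between the two numbers is exactly why packing operations into wide instructions matters so much.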
|
|
55:32 | The other thing that is important: this chip has six memory channels, and in the next generation chip after this they increased it to eight. Remember that the Power9 had eight channels, and it was about the same age as Skylake, or even a little bit older; they had already put more emphasis on memory bandwidth at the processor level. And for comparison, AMD, the main competitor to Intel in terms of running a similar instruction set, has also had more memory channels and better bandwidth to memory than many other processors. Now things are more comparable, but for a long time Intel got a lot of criticism for not having enough memory bandwidth in their processors.
|
|
56:32 | The other part I wanted to call attention to on this particular slide is the left-hand part, and that goes back to the notion I mentioned earlier on, that it is firmware that controls the clocking on the chip. What the left-hand graph intends to tell you is that when one uses these very wide instructions, the chip consumes considerably more power, and the power dissipated is related to the clock frequency. So in order not to overheat the chip, if you try to make use of all the resources on the chip, it gets clocked down, and there's nothing you can do about it; no one will let you burn up their chip. So they try to protect you from being too ambitious in trying to squeeze performance out of your processor. In this particular case, if you try to use all the cores, and use them to their full extent, that is, with the AVX-512 instructions, the cores operate at about half the clock rate they otherwise would. So if you just have a single thread in a single core, the clock rate may be more than twice that of a core that uses the AVX-512 instructions.
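Whether AVX-512 still pays off despite the clock drop comes down to clock times elements per instruction. The clock figures below are hypothetical, chosen only to mirror the roughly-half-the-clock remark above:

```python
# Relative throughput ~ clock * elements per instruction. The clocks are
# hypothetical, chosen to mirror the "about half the clock" behavior.

def rel_throughput(ghz: float, lanes: int) -> float:
    return ghz * lanes

scalar = rel_throughput(3.0, 1)  # one double per instruction at full clock
avx512 = rel_throughput(1.5, 8)  # eight doubles per instruction at half clock

print(avx512 / scalar)  # 4.0: vectorized code still wins despite the downclock
```

So fully vectorized code still comes out well ahead; the throttling hurts most when only a small fraction of the code can use the wide instructions but the whole core gets downclocked anyway.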
|
|
58:12 | So here is just a little bit more, an additional comment on this slide, to be a little bit more precise than the previous picture. There are nowadays so many cores on a single processor die that using just a single bus, with everybody talking over that one bus, doesn't work. So each chip today has a network on it, and with the number of cores that are on the die today it is most common to use a network that is fairly simple in some sense, a mesh, compared to the more sophisticated networks that have also been used. So this is just pointing out that that is the case.
|
|
59:02 | The other point to make here is in terms of the last-level cache, that is, the LLC in this case. Basically, each core has a little bit of it, even though it is shared, from an access point of view, among all of the cores. But that also means that some pieces of the third-level cache are closer to a given core than others. So the access time to the third-level cache is not uniform on the chip; it depends on the relative location between the data and the core that wants the data on the die.
|
|
59:49 | And this is a little bit just so you understand that these kinds of designs are fairly complex. Here is a little bit of the data path, in this case at the lowest level, and the steps of doing instruction decoding and breaking instructions down into what they call micro-ops. Some people in the architecture community have always had, not a war, but a tussle about whether one should have a complex instruction set or a simple instruction set, known as RISC. RISC has gained a lot of ground in recent years. RISC designs tend to have a much more limited instruction set, and that means it takes perhaps more instructions to get a particular operation done than in a complex instruction set, but it also means that the architecture can potentially be simpler. So what Intel has done over the years, since theirs has been an evolution of CISC instructions, is to generate what are now called micro-ops, that is, more RISC-like instructions: they break down the complex instructions into RISC-like instructions before they get executed.
|
|
61:16 | And I would also recommend, if you haven't already done it, a very useful thing, not for the details but to get the big picture of what is happening: listen to the Turing Award lecture given by John Hennessy and David Patterson a few years back. I think it is easily found by Googling.
|
|
61:41 | And here is a little bit of how things are put together into nodes, as you can imagine. That is kind of the way things are put together in the Stampede2 nodes. There are higher socket counts, but as I mentioned before, two-socket nodes are by far the most common. In particular when you have need for lots of memory in a single node, though, vendors tend to do four-socket or eight-socket configurations.
|
|
62:21 | Then I also put in, just a little bit for reference, the most recent server processors that were announced this year. I don't know if they are fully out yet, but they are supposed to be available before the end of the year at least. What Intel has predominantly done is to up the core count, like everybody else, on this chip compared to Skylake. Now there are more cores, not quite double the 28 cores, but significantly more. They still design them to do two threads per core if you choose to, and you still have the AVX-512 instructions. They upped the size a little bit for the level-one data cache, but the instruction cache is about the same as in previous generations, and the amount of level-two cache per core is more or less the same. Then they increased from six to eight memory channels, and they also increased the supported data rate for the memory, to DDR4-3200. But it is also more power-hungry: instead of around 200 watts TDP, it is now going closer to 300.
|
|
63:46 | So then let me talk a little bit about the AMD side, the main competitor to Intel. It is not as widely used, because they have had a few misfortunes over the years: they were serious competitors, then they made some missteps and almost went out of business, then they came back and started to become competitive, then made some mistakes again and again almost went out of business. But now, with the current generation of their architecture known as Zen, they are very competitive and gaining a lot of market. So at this point it is well worth being familiar also with the AMD processors.
|
|
64:36 | Now, if you use clouds in some other context, pretty much all the cloud providers now also allow you to choose between instances that run on Intel or on AMD. If your job runs on an AMD processor, it is likely cheaper for you than if it runs on an Intel one; there is a long story about that.
|
|
65:07 | Part of the reason for this difference in cost is that AMD, I would say, has been leading, or pioneered, designs that later have come to be adopted by Intel. I already mentioned that they have led in terms of emphasizing memory bandwidth; they have also led in terms of focusing on power and energy consumption, so they have always been lower power than Intel CPUs. There are many design differences too, but one that has helped them to be price-competitive is that they have tended to get the core count up by using what is known as chiplets, which is now becoming common in the industry. So they do not have one piece of silicon in their package that has a large number of cores on it; instead they put a bunch of pieces of silicon, each of which repeats a modest number of cores, together in the same package. So in fact what you are getting is a high-core-count processor, but because the pieces of silicon they are using are smaller, they get a higher yield, and that means the cost per core is lower. So they can in that case compete on price.
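The yield argument can be illustrated with the classic Poisson die-yield model, yield = exp(-defect density x die area). The defect density and die areas below are made-up illustrative numbers, not AMD's actual process data:

```python
import math

# Classic Poisson die-yield model: yield = exp(-defect_density * area).
# The 0.1 defects/cm^2 density and the die areas are illustrative only.

def die_yield(defects_per_cm2: float, area_mm2: float) -> float:
    return math.exp(-defects_per_cm2 * area_mm2 / 100.0)  # convert mm^2 to cm^2

print(f"one 400 mm^2 monolithic die: {die_yield(0.1, 400):.2f}")  # ~0.67
print(f"one 100 mm^2 chiplet:       {die_yield(0.1, 100):.2f}")  # ~0.90
```

Four small chiplets can deliver the same core count as the big die, but each one is far more likely to be defect-free, which is where the cost-per-core advantage comes from.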
|
|
66:49 | As you can see, in this case things are very similar. They actually have, in this case, a slightly larger level-one cache, but otherwise they are kind of in the same ballpark, as you can see. They have been trailing a little bit in terms of the width of the SIMD instructions: in this case they have 256-bit-wide instructions, so half as wide, meaning half as many things can be done in a single instruction, but they have tried in some sense to make up for it by having more cores. There are pluses and minuses, so it is a complex game figuring out how to stay competitive, but you can also see that, compared to Intel, it has lower power dissipation.
|
|
67:52 | And this is just, for your reference, the data for the AMD processors used in Bridges-2. Compared to the Intel chip, I guess one thing I should point out is that they have in fact a little more level-three cache than Intel has, so they have more memory on the die, even though they also have a higher core count. But, like the others, it is about the same in power dissipation.
|
|
68:42 | long time. Everything became kind of until because of stumbles by and they |
|
|
68:48 | above all everything was kind of there instructions that the X- 86 that is |
|
|
68:54 | run by MD processes but those processes power hungry. So the mobile community |
|
|
69:04 | the end use either of them. used arm processors that have been designed |
|
|
69:12 | be very power efficient and this is the most recent but I think fairly |
|
|
69:21 | of what you can get in today arm and arm is not building their |
|
|
69:27 | processes. They design processors and the the designs, so Samsung and apple |
|
|
69:34 | that better. Um, and many have used arm processor designs and building |
|
|
69:45 | processor chips. Right? So as can see in this case they are |
|
|
69:50 | As you can see in this case, even though they are designed to be very energy efficient, it is not that they are skimping on the size of the caches. The level-one cache is in fact even larger than it was on the very sophisticated Power8 or Power9 processors, and the same thing is true at level two; the level-three cache is by no means substandard either. It is just that the focus has been different, and there are other features that they don't have. But instead of, you know, a couple of hundred watts, this particular one is a factor of 10, actually 40 or 50, less power than the other ones.
|
|
70:42 | So that has caused a number of companies to try to figure out how to use these designs to compete with Intel and AMD, and that has happened in the last five to ten years as power became such a constraining factor for everyone. So Amazon went off and did their own processors using the ARM designs. This is just an example of taking the designs you saw on the previous slide and putting together a system using those kinds of processors and the other components that the ARM folks design. So Amazon made a piece of silicon, put it together, and actually made a complete server. So today, if you use Amazon, you can also opt to have your virtual machine run on what Amazon calls the Graviton, and the core of it is the ARM design. And it is again cheaper than using the AMD instances at Amazon.
|
|
72:02 | And here is just another company that based their chips on ARM, in this case with some 60 cores, and I'm not going to go into the details, but a number of companies started out trying and playing around with using Marvell, which is the company behind this design. And then this is yet another company, one that actually builds their own silicon starting from the ARM designs. But unlike Amazon, whose chips you can only use on Amazon (they don't sell their chips, they just use them for running their own services), this is a startup founded by people from Intel a few years ago, about a five-year-old company, that this year released their first processor. And the first release has 80 cores, so that is quite a high core count CPU. As you can recognize, the sizes of caches and what have you are similar to the numbers we had before, and again they use eight memory channels to get to memory.
|
|
73:17 | And this is a little bit of the summary; I want to say something about GPUs too in the few minutes I have left today. So here is the summary: in terms of threads, of course, most of them have the ability to run two threads per core. The ARM design I brought up before does four or eight threads per core, depending; it is a bit of a flexible design, so you can configure the cores as the atomic versions, so to speak, or you can gang two of them together and treat them as if they were a single core, but in the end what the chip can do is eight threads. And in SIMD terms, the 32 arithmetic operations per instruction is what you get with, say, the AVX-512 version, and similarly for the Power cores. So that is pretty much the summary. Now, about the GPUs.
|
|
74:24 | So this is the GPU that is in Bridges-2. I think there is a more recent release from NVIDIA past the one that is installed in Bridges-2, but this one is still quite typical and not far off from what the most recent edition from NVIDIA is. They have a few different versions depending upon whether they target AI or machine-learning workloads or more floating-point-intensive workloads. But the point of this slide, what I want to stress, is that the number of threads, that is, the number of parallel instruction streams, which as we saw before for CPUs tends to be in the tens or lower, is now in the low thousands to several thousands. But there is the restriction that the threads need to have a lot of commonality; one needs to be able to use this SIMD feature. So if you have one instruction, like a multiply-add, and you have lots of operands for it, then you can make good use of one of these GPUs.
|
|
75:57 | So what else do I want to stress? This one is something that I also mentioned early on: there is a difference, and part of the popularity or success of GPUs is that they have higher memory bandwidth than CPUs. The CPUs of today, as I said, have up to eight 64-bit-wide memory channels, so that is a 512-bit-wide data path to main memory per socket, whereas the GPUs tended to have more bits. In particular now, when they have started to use this high-bandwidth memory, they have, like the one chosen for this slide, about 4000 bits, 4096 to be precise. So it is a factor of eight wider data path to the memory. But it also means you don't bring these things out and use modules on the motherboard; it actually has to be in a single package. So you never get out on the motherboard, and then you can get this considerably higher bandwidth, about five to six times higher than what you see on the CPU. But it also means you are kind of restricted in the size of the memory. So yes, you gain memory bandwidth, but you lose in the size of the memory. Typically what you see is up to 32 gigabytes of memory on a GPU, whereas a typical server, like the Bridges-2 servers, I think uses 256 gigabytes per server for the regular memory, and for the extreme-memory nodes it may be a few terabytes. So it is a huge difference in the size of the memory.
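The factor-of-five-to-six claim follows directly from bus width times transfer rate. The DDR4 and HBM2 transfer rates below are illustrative assumptions for the 512-bit socket and the 4096-bit HBM stack mentioned above:

```python
# Bandwidth ~ bus width * effective transfer rate. The DDR4-2666 and HBM2
# rates below are illustrative assumptions, not measured figures.

def bandwidth_gb_s(bus_bits: int, gt_per_s: float) -> float:
    return bus_bits / 8 * gt_per_s  # bytes per transfer * GT/s -> GB/s

cpu = bandwidth_gb_s(512, 2.666)  # 8 x 64-bit DDR4-2666 channels per socket
gpu = bandwidth_gb_s(4096, 1.75)  # 4096-bit HBM2 stack

print(f"CPU socket: {cpu:.0f} GB/s")    # ~171 GB/s
print(f"GPU HBM2:   {gpu:.0f} GB/s")    # ~896 GB/s
print(f"ratio:      {gpu / cpu:.1f}x")  # in the 5-6x range quoted above
```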
|
|
78:14 | And another aspect of the GPUs is that they are actually designed to need a CPU to run; they are kind of an attached processor and not self-contained. That means the program usually starts and ends on the CPU, and then the CPU and GPU need to talk to each other to solve the entire problem, because the memory is too small on the GPU.
|
|
78:41 | And here I have, for comparison, the competitor to NVIDIA; the number-one competitor is AMD. They have again broadened their activity now that their CPUs are doing well and they have some money. They used to dominate, I would say, in terms of the gaming market, whereas NVIDIA early on went after the server and scientific computing or data-center market. Now AMD is starting to do the same, and this is their release from last year, which is fully competitive with the recent NVIDIA releases in terms of floating-point performance and pretty much anything else. So it is good to know about those chips too. My time is almost up.
|
|
79:41 | But I wanted to just mention one of the things that the internet companies have done, in particular Google. They have been designing their own servers for a long time, and then, several years ago, they started to design their own silicon. Instead of using GPUs for machine learning, they designed what is known as a tensor processing unit, a TPU, that has the kinds of functional units that are particularly helpful in getting good performance for machine learning. And again, matrix multiply is one of the core operations, so there are matrix-multiply units on the TPU chip. The graph up to the right tends to show the performance, and it is a roofline diagram: you see the slanted part, and then you see the peak performance once you get to being compute-bound. This is a log scale on the vertical axis, so the TPU can be an order of magnitude or more better for machine learning than the alternatives.
|
|
81:05 | So that is what I wanted to point out, and this is coming back to the point that, in order to get good energy efficiency, and that is what is behind many of the current efforts in doing custom processors, you can gain orders of magnitude in both performance and energy efficiency by tailoring your designs to your workloads.
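The roofline diagram just described can be captured in one line: attainable performance is the minimum of the compute roof and bandwidth times arithmetic intensity. The peak and bandwidth numbers below are made up for illustration:

```python
# Minimal roofline model: attainable FLOP/s = min(compute roof, bandwidth * intensity).
# The peak and bandwidth figures are made up for illustration.

def roofline(intensity_flop_per_byte: float, peak_gflops: float, bw_gb_s: float) -> float:
    return min(peak_gflops, bw_gb_s * intensity_flop_per_byte)

PEAK, BW = 90_000.0, 600.0  # hypothetical accelerator: 90 TFLOP/s, 600 GB/s

print(roofline(10, PEAK, BW))   # 6000.0: low intensity sits on the slanted, bandwidth-bound part
print(roofline(500, PEAK, BW))  # 90000.0: high intensity hits the flat compute roof
```

Dense matrix multiply has high arithmetic intensity, which is why an accelerator built around it can actually reach the compute roof.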
|
|
81:31 | And with that, I think I will mostly skip the last few slides today, because time is up, and take some questions. This one just gives the layout of one processor we worked with in my group; it shows the different widths of the data paths on a particular processor and the different clock rates for the different units being used in the system. And with that I think I will just stop. This last one is something that Josh worked on; it also shows the heterogeneous processors that have become common in the mobile market, where you have different pieces of silicon that are tailored for particular functions, like crypto engines and display engines and audio engines, and some that are more floating-point oriented.
|
|
82:17 | So you have different types of processors, and here is an example of a Snapdragon. You can see they have the GPU, a display processor, a DPU, a vector extension of the CPU, a digital signal processor, a DSP, and so on. So that is another thing to also be aware of. We won't deal with that in the class, except for the attached processor in the form of the GPU, but if you end up working on something related to mobile processing, you end up with chips that have different functional units with different instruction sets, and programming becomes much more complex than what we deal with in the class.
|
|
83:03 | Okay, I will stop there. So, any questions? So again, this was a quick run-through, not getting into all the details, but the details are important for understanding performance for the assignments you are going to do, plus giving you a little bit of what is out there and generally used in other contexts, with focus on the elements that matter for resource use. Again, unfortunately, it is not just one thing.
|
|
84:13 | All right. So, for whoever has not been added and has access issues: I see your username, Rick; I will make sure that happens right now after class, you know, getting you the added access. I am sorry if I missed your email earlier; I had not seen it. Okay. I will stop sharing the screen.
|