© Distribution of this video is restricted by its owner
00:05 So today we will, um, be talking about what comes next on the slides.

00:18 Today I will talk a lot about processors, in particular the high-end ones, including the ones on Stampede2 and Bridges, and a bit about GPU architectures.

00:47 We probably won't get to node configuration, but I'll talk more about caches, which are important to understand. So this is really the basis for the discussion about processors and their

01:03 performance, because in order to get good performance, one really needs to understand what the capabilities and limitations of the platform being used are.

01:15 It also requires that one understands what the application is, what its needs are, how they work, and, in turn, how the code then has to manage the translation between the application and its needs and the architecture that is there.

01:36 And I should say, as to the slides, I'm using a slightly updated set towards the end compared to the ones posted, so I will upload a set after class that reflects what I'm using.

01:45 Anyone who has taken computer architecture should be familiar with a lot of the concepts I'm talking about. So for those of you who have taken an architecture course, this is a refresher at a high

02:07 level. This slide is kind of a cartoon-type description of what a modern processor is like. As we know, there are several cores; I spoke a little bit about that. And then there are several levels of cache.

02:27 Typically the level-one cache, as indicated on the slide, is something that is private, or unique, to each core; it's not shared with any other core. That's most commonly the case also for the level-two cache, but in some designs the level-two cache is actually shared: not by many cores, but if it is shared, it is typically shared between two cores. Whereas the level-three cache is really shared, sometimes shared between many cores.

03:03 The other point of this slide, and I will come back to it, but it's a good thing to try to remember: the capability of moving data between the different levels of cache and the processing unit is not the same all the way. It is highest next to the processing logic, and it goes down as you move out toward the main memory of the computer, and of course it decreases even further as you get to disk.

03:39 And the processors have lots of other differences that we'll talk more about on the

03:46 next several slides. Ah, some of the terminology is hopefully familiar to most of you, so this is hopefully just a reminder. There are registers, caches, and main memory; that is the memory hierarchy that is in processors today.

04:04 And it is important to keep in mind that all functional units today work with the registers, the register file as it's sometimes called. So operands are taken from registers and results are returned back to registers; the functional units do not communicate directly with any form of cache or main memory. Everything goes through registers, and from there through the cache hierarchy, from level 1 to 3, and then to main memory.

04:43 And when it comes to dealing with codes, it's good to keep this picture in mind and to trace your loads and stores, trying to figure out where the data originates and where it eventually ends up.

04:56 The path for the data, as indicated by the fatness of the arrows on the previous slide, is such that the capability of moving data is highest close to the execution units and lowest close to the main memory, and it's orders of magnitude of difference in bandwidth, not just a small

05:27 factor. And here is pretty much what I just said: keep in mind the cost of moving data. And most codes tend to read more data from main memory than they actually write to it. That's why I made the read arrow on the slide larger relative to the write arrow: loading data usually requires a lot of bandwidth, so it is done much more than writing.

06:06 Now, if one is kind of lucky, then temporary results might stay, in the extremely good case, in registers, or in some level of the cache. But it's not necessarily the case that data can stay in the caches until you need it again, because caches are relatively small compared to main memory.

06:37 So when it comes to trying to optimize and understand the performance of codes, this is a good picture to keep in mind. Now let me be more specific on

06:51 the data that is typical today. For registers, yes, the register file, which is the collection of them: it's a single cycle to retrieve or write data from the register file by the functional units. And the register file typically has sufficiently high bandwidth that any functional unit can get any one of the arguments it needs to carry out its instruction, as well as write things back, without conflict with any other functional unit.

07:29 So at that point, at the register level, in most instruction sets there's no contention between the functional units reading or writing data.

07:46 Today's processors are typically so-called SIMD processors, so they have very wide registers and data paths, and I'll stress a little bit later what data rates are needed to support this; I'll come back to that in a couple of slides. It's quite impressive, the amount of data movement capability that exists on the chip compared to what is the case once we go off the chip: at best, tens of terabytes per second get moved around on a modern chip. Now, everything today, registers, caches,

08:39 and even main memory, is produced using the same kind of silicon technology, CMOS, complementary metal-oxide semiconductors. Now, there's no problem with moving things back and forth between registers, and they're also very close in terms of latency: a single cycle, registers are one. Caches are a little bit further away in terms of latency, ah, but not so much so.

09:10 It's not quite one cycle, the same as registers, but it's typically 2 to 3, rarely more than 4, cycles that it takes to retrieve data from level one. That is the time it takes to go from the level-one cache to registers and then to the functional units. Level two is a little bit further away; typically it's about 10-ish clock cycles, maybe 15 or 20 sometimes. And level three is even further away, and its bandwidth is also lower.

09:51 So that's why, to get maximum performance out of any one of today's processors, you need data reuse; otherwise there's not enough bandwidth in the data path to move data to

10:03 feed the functional units. Um, and this is good to remember, the figure I showed you when we talked about STREAM, which shows that for modern processors, when it comes to floating-point operations, they can do about a few hundred of them in the time it takes to receive one word from memory in terms of the bandwidth. And if you put it in terms of the latency, uh, the access time to get it from memory, it's more like a factor of 1,000.

10:54 So the data paths are a critical part in trying to understand performance of codes, and I try to stress that both in this class and when I work with my students. So here's a little example now

11:04 . Take a very common kind of instruction, a multiply-add instruction, which is the core of a matrix operation like matrix-vector or matrix-matrix multiply. If one looks at what the requirements are for doing this type of operation: you have to load the arguments A, B and C, and you have to store C. And then you have to get the addresses for the operands A, B and C.

11:31 So, for simplicity, just assume that everything is 64 bits, right? So then you have 7 64-bit entities that need to be received from or sent to memory for each such instruction. So that's 56 bytes per cycle. And if you run at a pretty modest 2.5 gigahertz, then it means that just to do this operation from memory requires on the order of 140 gigabytes

12:01 per second. And that may not sound so bad, huh? But now, modern processors can support many threads, or instruction streams, at the same time on a chip. So you multiply by the number of cores and threads that high-end processors can support, and all of a sudden it's many terabytes, close to 10 terabytes, per second, just, ah, for a single chip.

12:44 Um, and then you can compare that to what the main memory technology, which we will talk about in a couple of lectures, can do. A very good number today is on the order of 25 to 30 gigabytes per second for a memory channel. So it's almost three orders of magnitude difference in what a processor would, kind of, need if it were to do everything from memory, compared to what a single memory channel can support.

13:16 Put differently: basically, if I did the math correctly, one would need something like 1,400 memory channels on a single chip in order to sustain this operation from memory.
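The arithmetic in this passage can be sketched in a few lines. This is illustrative only: the 28-core, two-thread stream count and the 30 GB/s channel figure are assumptions in the spirit of the lecture, and with those numbers the channel count lands in the hundreds rather than the 1,400 quoted.

```python
# Rough, illustrative arithmetic for the multiply-add example above.
WORD_BYTES = 8          # 64-bit operands and addresses
ENTITIES   = 7          # load A, B, C; store C; plus 3 addresses
CLOCK_HZ   = 2.5e9      # the "pretty modest" 2.5 GHz

bytes_per_cycle = ENTITIES * WORD_BYTES          # 56 bytes per cycle
stream_bw = bytes_per_cycle * CLOCK_HZ           # per instruction stream
print(f"one stream needs {stream_bw/1e9:.0f} GB/s")   # 140 GB/s

# A high-end chip runs many such streams at once (assumed cores x threads):
CORES, THREADS = 28, 2
chip_bw = stream_bw * CORES * THREADS
print(f"whole chip would need {chip_bw/1e12:.1f} TB/s")

# Compare with a single DDR4-class memory channel (~30 GB/s):
CHANNEL_BW = 30e9
print(f"equivalent memory channels: {chip_bw/CHANNEL_BW:.0f}")
```

The point survives the exact stream count: doing every operand access from memory would take hundreds of memory channels per chip, which is why data reuse in caches is mandatory.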

13:31 So that is what I just said: the point is that you can't sustain performance anywhere near the functional units' capabilities today without having a large amount of data reuse and the benefits from caches. The other thing, that I will talk about as well at some point in this course: if one looks at high-end memory in terms of its bandwidth or performance, it turns out that to move this much data would in fact use 1,000-some watts of power, which is a lot more than the processor itself uses.

14:16 So in the end, you can't even cool this thing in a small area. So using and exploiting caches is critical, not only to get the full benefit of the compute capabilities of a processor chip, but in terms of it even being possible to cool the chip, if you tried to do it without

14:42 reusing some data. So this is a little kind of example, ah, of the sort of bandwidth capabilities of the processors out there today, and in the left column there are a number of them. The Power 9 is kind of the high-end processor from IBM.

15:07 Um, then there is one... it may not be well known that Oracle still produces some processors. The company called Sun did, yeah, processor designs for its workstations and servers. The company is no longer around; it was bought by Oracle, which continued also the hardware aspect of Sun, but it's not clear that they will continue it much longer.

15:40 And then there is the chip known as Cascade Lake, that is the most recent server processor from Intel. KNL stands for Knights Landing, which was a high-core-count chip, not a GPU, produced a few years ago. Intel decided to discontinue it last year; I kept it there because it's kind of an example of the high-core-count concept. Intel's Atom is something Intel still has; it was targeting the mobile market, it was very successful, and it's still being used.

16:27 At last, then, there is AMD's EPYC, the chip corresponding to the Skylake; AMD is Intel's prime competitor in the market for server chips. And the last lines are GPUs: one from NVIDIA and one from AMD. So here

16:53 again, I try to point out, and will come back to when I talk a little bit more about the processors themselves, what silicon technology has been used to produce these chips. Seven nanometer technology is the state of the art; the other ones are older: technology that has been out for a couple of years when it comes to 14 nanometers, and a few more years when it comes to the 20-ish nanometer technologies.

17:27 The clock frequency is listed next. Where there are two numbers, the first one is the base clock frequency, and the higher number after it is the turbo mode, the highest frequency the chip is designed for. And then the next column: performance tends to be measured in terms of floating-point operations, not everything in this class, but for the characterization of processors. It tells you how many, kind of, billions of floating-point operations per second, in double precision, the various chips are capable of.

18:09 As you can see, the GPUs are kind of out there at the top of this column; they may be at the bottom of the table, but in terms of the numbers, they are at about, close to, 7 to 8 teraflops for a single chip in double-precision capability. It's kind of interesting to compare Cascade Lake, the most recent Intel chip, ah, a typical Xeon chip, to the KNL chip: it turns out the current mainstream version is, um, matching the chip that was designed, to my understanding, for extreme performance.

19:00 The technology column is probably a good reference point. But I will talk about what these numbers mean in terms of the memory now. The DDR stands for the double data rate that I mentioned before, the 4 is the version of that, and what comes after the dash is the memory bus clock rate in megahertz. As I mentioned, memory is the

19:28 bottleneck for lots of computations, so being able to move data to and from the memory is important. One factor is the bus speed, the 2,933 or whatever the number said, but also the number of channels that each processor has. That means you can have several channels that can operate in parallel, and over time this has grown: early on there was just one memory channel, but as chips got more and more capable, eventually one increased the number of memory channels.

20:04 But that's a serious issue, because channels need connections to the printed circuit board, so pins, and that's a physical thing that needs to have a certain size to be reliable, you know. So it increases the footprint, to be able to fit all the pins that feed power and data and everything else to the chip. And, as you can see with some of the more, ah, high-end chips today:

20:38 the AMD most recent one is an eight-channel chip. Intel, in principle, has six memory channels; they have actually been a little bit behind AMD in this case, even though it says 12 on the slide. That's because Cascade Lake is, in fact, two chips, each with six channels, and Intel put two of those chips in the same package together, in part to be able to compete with AMD.

21:06 And in the rightmost column is perhaps the most useful thing: that is the actual memory bandwidth that the memory bus can, uh, maximally support. They are getting now up to about 300 gigabytes per second. Except for the GPUs: one of the advantages of GPUs is not just the compute capability, but it has been the fact that they have a much higher bandwidth to the memory than the typical server processor. So today it's about a factor of 4 to 5 compared to the high-end server processors.

21:50 So even though a server processor may not have that much, it's about the same ratio: if you look at the floating-point capability relative to the memory bandwidth, it's not all that different in terms of the balance that was mentioned in the STREAM talks, if anyone listened to them.
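The "balance" notion from the STREAM discussion can be made concrete with round numbers; the peak and bandwidth values below are illustrative stand-ins, not vendor specifications.

```python
# Machine balance: peak double-precision flops per byte of sustained
# memory bandwidth. A bigger number means the chip is hungrier for
# data reuse relative to its memory system.
def balance(peak_gflops, mem_bw_gbs):
    """Flops the chip can do per byte moved from memory."""
    return peak_gflops / mem_bw_gbs

cpu = balance(1500, 130)   # ~1.5 TF server CPU, ~130 GB/s DDR4
gpu = balance(7000, 900)   # ~7 TF GPU, ~900 GB/s HBM
print(f"CPU: {cpu:.1f} flops/byte, GPU: {gpu:.1f} flops/byte")
```

Despite the GPU's 4-5x higher absolute bandwidth, both come out around ten flops per byte, which is the lecturer's point that the flops-to-bandwidth ratio is "not all that different."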

22:21 Um, here is just a little summary of what's typical in terms of the number of cores and threads. Threads, again, are execution streams that are sort of independently managed, and most of these chips, with a couple of exceptions, actually today support two threads per core. But that's something that can be configured, and, as we talked about, Stampede2 has a setting for it. Supporting two threads per core, hyperthreading as Intel calls it, means that, um, two threads can try to use the same physical piece of hardware.

23:06 And, of course, they can't use it quite at the same time, but it can switch between the one or the other, depending on which one is waiting for work. Ah, the KNL actually supported four threads for each of its cores, and the IBM Power, uh, can support up to eight threads per core. Now, the GPUs do not support more than a single

23:38 thread per core, so they don't have this capability, but they have a lot more cores; they are more numerous. And if you recall the first slides, in terms of floating-point capability: even though in terms of cores per chip it's more than an order of magnitude difference, if you go back in the slides, in terms of capability it's a much smaller factor, right?

24:11 Now, on to some concrete examples, because, as I said, um, a few times already, it's important to understand the capabilities of the hardware in order to try to understand the behavior of code. So the next several slides go through a few of the processors that are presently widely available, such as the IBM Power and the Intel ones, now the

24:39 most commonly used processors out there. So, the IBM Power 8 or 9: they are not particularly high-core-count processors, but every core they have has lots of features to make it very efficient in a large number of scenarios. So if one looks at the level 1 and 2 caches in this particular case, you can see that, uh, the level-1 caches are 32 kilobytes per core, and again they are individual to each core.

25:29 Like most processors today, they have separate level-one caches for instructions and data, and for these processors it turns out that the instruction cache is 32 kilobytes and the data level-one is 64.

25:48 Then, as you move further away from the functional units, the level-two cache gets bigger, and for the IBM in this case it's 512 kilobytes per core. You can see it also in the picture, which doesn't even show the level-one cache because it's included in the core box. The level-two caches are private to each core, whereas the level-three cache is shared.

26:14 So in terms of, um, the level three for each chip: it's eight megabytes per core, that's the number, I believe, for the Power 8. And for the Power 9, it increased quite a bit, to about 120

26:33 megabytes. You can also see in this slide all the latencies, as compared to the cartoon in the beginning of the lecture. In this case, the latency to the level-one cache is 3 to 4 cycles, and for the level two it's 12 cycles. And then for the level three, it depends a little bit, because of the structure: how the level-three caches are organized is not uniform, so how long it takes depends on which part of the level-three cache one is trying to access.

27:21 In the middle of this slide, you can also see kind of a little figure that has green arrows. That shows, um, the width of the data paths, for instance, from the level two to the level one caches, or from level one to level 2. And you can see that they are not symmetric, as on an earlier slide, because many applications read more data than they write. That's why they tend to have more capacity to move data towards the functional units than

28:04 away from them. Now, IBM in particular: with this product line they have been very much focused on database applications and typical enterprise workloads. And for that, um, capacity, and the ability to move data to and from the main memory, is critical. So, like at other times, they tried to figure out how to address this bottleneck.

28:42 So there is a separate chip, that is known as, um, the Centaur chip. The processor in fact has sort of eight channels out to these chips, and then they have channels towards the DRAM itself. So it kind of acts as an aggregation stage between the actual physical DRAM and the channels that go to the main processing chip. So here is now

29:15 the Skylake, that is the processor in Stampede2 that you're using for the first assignment. And it's, um, a more or less expensive chip, but it's also a quite complex and sophisticated chip in terms of its features. Ah, it has up to 28 cores; don't quote me, off the top of my head I think it's 24 in our case. Uh, it comes in many different versions; this is just one, and if you go to the Intel website and look at this processor, you can see that any one of the different generations has anywhere from a few to 20-plus cores, depending on the version of the chip.

30:10 On this slide, the base clock frequency is 2.5 gigahertz and the turbo is 3.8, for this particular Skylake version that is listed towards the right there. Um, this is kind of typical of what it is today.

30:31 As the feature sizes shrink, the 14 nanometer stuff that we talked, where I talked, about before, um, that enables more and more transistors on the chip, and that can be used in many different ways: it is used to add features and increase cache sizes, but it's also used to increase the number of cores. But even within cores, things are changing, so one now has multiple functional units of the same type in each core. So there are several multiply-add units, and some number of other units, associated with each core.

31:18 And then there is what's known as AVX, for Advanced Vector Extensions. Ah, there are different names for different vendors; this is Intel's name for it. It allows you to do 16 double-precision floating-point operations for a single wide instruction; that is covered a little bit later in the slides. And, in this chip, they have two such units in each, uh, of the cores. So that leads to each core doing a few tens of operations per cycle, and multiplied by the number of cores, it comes to about 1.5 teraflops per chip. Um, now, I

32:08 said the frequency range was from 2.5 to 3.8. And when I figured out this theoretical peak, I used 1.7, and neither 2.5 nor 3.8 gigahertz. There is a good reason for that choice. It turns out, as I mentioned in the introduction, I think, just before the course started, that the higher the clock rate, the more power the chip draws, ah, and it gets harder to cool.

32:45 And it turns out, when you use these very wide instructions and put all the functional units to work, you cannot run the chip even at the base frequency without it overheating. So there's a little graph in the lower right-hand corner trying to illustrate what happens. The logic in the chip: they have circuits that control power dissipation to keep the chip from overheating. That's built in; that's something we do not have control over; it's the chip that does it. So when you use these instructions, the clock rate gets reduced to the point where the chip won't, uh, overheat. So, in fact, in this case, when one uses the wide instructions, the chip gets clocked down to 1.7 gigahertz instead of potentially clocking higher.
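The 1.5-teraflop peak quoted a moment ago follows directly from that throttled clock; a back-of-the-envelope sketch, taking the 28-core count and the two wide FMA units per core from the surrounding discussion:

```python
# Theoretical peak for the Skylake example at the AVX-512 clock.
FLOPS_PER_INSTR = 16    # 16 double-precision flops per wide FMA instruction
UNITS_PER_CORE  = 2     # two such units in each core
CORES           = 28
AVX_CLOCK_HZ    = 1.7e9  # throttled clock, not the 2.5 GHz base or 3.8 GHz turbo

peak = FLOPS_PER_INSTR * UNITS_PER_CORE * CORES * AVX_CLOCK_HZ
print(f"{peak/1e12:.2f} TFLOP/s")   # about 1.5 teraflops per chip
```

Using the 3.8 GHz turbo number in this product would overstate the chip's real capability by more than a factor of two, which is why the throttled frequency is the honest one for peak-flops estimates.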

33:46 In terms of the cache sizes, you can see this chip has the same size caches as the IBM chip; well, in the case of the level-one data cache it's actually smaller, 32 kilobytes rather than 64. But it has a larger level-two cache compared to the IBM chip, and the level-three cache is sizable. In total it should be, somewhere, 28 times something, so it's about 30-plus, um, megabytes of level-three cache on the chip. Something that I have not

34:34 mentioned yet, but will talk more about towards the end of the, um, lecture, is how the caches operate. So this notion of a cache line, that is, say, 64 bytes, is an important concept, because when you, you know, interact with main memory, when you load something from memory or write something to memory, it comes in chunks, and those chunks are known as cache lines. So even if you just want one word, whether it's, you know, 32 bits or 64 bits, you get a lot more than you want.

35:18 Now, the cache line size does differ between processors, but 64 bytes is very common and is used by many vendors. And it's also the case that, in most cases, the cache line size is the same at level one and two and three. But it doesn't have to be that way in some processors, and I think it is IBM, ah, but then pay attention, that I think has 128-byte cache lines when you move away from level one. So there's no guarantee that the cache line size is the same.
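The chunked access just described can be illustrated with a tiny sketch, assuming the common 64-byte line and 8-byte (double-precision) words:

```python
# Even a single 8-byte load pulls in a whole 64-byte line, so traversing
# an array touches far fewer lines than elements; strided access wastes
# most of each line.
LINE_BYTES = 64
WORD_BYTES = 8                     # one double

def lines_touched(n_words, stride_words=1):
    """Distinct cache lines read when accessing n_words with a stride."""
    addrs = [i * stride_words * WORD_BYTES for i in range(n_words)]
    return len({a // LINE_BYTES for a in addrs})

print(lines_touched(1024))                  # unit stride: 128 lines
print(lines_touched(1024, stride_words=8))  # every word on its own line: 1024
```

Unit-stride access gets eight useful words per 64-byte line; a stride of eight words fetches eight times the traffic for the same number of elements, which is the practical consequence of the line granularity.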

36:04 The slide also says something about 8-way, and 16-way, and this is the number of different places in the cache a particular cache line can go; I'll talk more about that later as well. So, typically, there is an association between memory locations and places in the cache, and how that is managed is through this, whatever, X-way mechanism that we will talk about later in the lecture.

36:51 Now, obviously, chips are not used in isolation. I mentioned, as some of you heard, the memory channels, but chips also tend to need to talk to each other, the way most, uh, computers are put together. As we know, as we talked about, first thing, right: each node has a number of processors associated with, or in, it. The most common is the two-socket node, so that means two chips. But the two processors in the two sockets need to communicate with each other, and there's a couple of different ways they can do it.

37:39 Um, so the processors, in addition to having the channels for memory, also have, uh, communication channels for communicating with I/O systems that are just separate from the memory channels. So these are kind of I/O channels, the PCI Express lanes, as they're called. But then sometimes there's also a separate collection of communication channels that are designed for kind of being chip-to-chip networks, in particular for communication on the same circuit board.

38:26 So when it comes to Intel: in addition to having the standardized channels using the PCI standard, they have something of their own kind, the UltraPath Interconnect, UPI, that is used to interconnect their own processors. So that's the UPI, and something one should be aware of: this slide gives generic data rates for these UPI channels, which in this case are about 40 gigabytes per second per channel. So it compares reasonably well with the memory bus, a memory channel, in terms of bandwidth, and it's higher than the PCIe

39:12 that I will talk about later. So here's another, just a picture of this Skylake chip, to talk about the layout of this general thing. So it has up to a maximum of 28 cores, and these also need to talk to each other.

39:37 So these days, most chips have some form of a network on the chip. In the beginning, with just a few cores each, one could use a bus, basically, a medium that is shared by all the different cores. But the more cores there are, of course, the more you have the problem that that thing gets saturated, so it could not carry, ah, independent operations for all the cores. So one actually needed to have a network, you know, to kind of separate the traffic needed by different cores. And when it comes to this Skylake, Intel introduced the mesh network that is shown on this slide.

40:32 Um, I also put in this slide to give you a little bit more; this is more than what you need for this course, but it's just for you to see a little beyond it. For those of you who studied computer architecture, or are interested: there has been a kind of longstanding debate about whether it is best to have a CISC or a RISC instruction set. CISC means complex, and RISC means reduced, or simple, instructions.

41:12 RISC instruction sets have, uh, become quite popular, and there's kind of a RISC design that is open source, RISC-V, that a lot of companies are starting out to use, and they design their own processors around it. However, Intel kind of has been at the other end of the camp, so to speak, with this CISC instruction set, that is also used by AMD.

41:49 But this slide is trying to point out what happens in modern processors: even though one has these complex instructions, they are, in fact, broken down into micro-operations, so the actual execution engine, or instruction processing, is kind of more RISC-like deep in the chip. To the compiler, it may look like an x86 instruction set, but then the hardware partitions things and breaks them down to smaller units that look more like RISC instructions.

42:26 Okay, so here's a little bit more about what I talked about before in terms of, uh, the need for processors to talk to each other independently of the memory bus. And this is, yes, how nodes tend to be put together. The two-socket node is the most common; that's what it is on both Bridges and Stampede2. But four-socket nodes are not that uncommon, and eight-socket ones are somewhat rare, but they certainly exist.

43:02 And they provide more memory and more capability on a single node, uh, than the other ones, but as a consequence they're relatively bigger and hotter, and there's a trade-off between which ones get used. So the two-socket ones tend to be the cheapest, but not necessarily optimal for every application.

43:20 And as you can see here, the orange arrows on this slide are the Intel UPI links for the processors to talk to each other. Ah, the thin blue lines are the memory channels, and then you have the PCIe, which is kind of the other color there. Then I want

43:50 to mention the Cascade Lake again, that is the most recent one, ah, from Intel. As I pointed out, it is simply two of these Skylake processors put together in the same, um, package. So it's a pretty giant, and it's also a pretty hot, package; it's up to about 400 watts. And given the physical size, even though it is quite giant, that means the heat density is quite high, so there's no way of air-cooling this particular chip; it actually needs liquid cooling to be able to operate at its rated speed.

44:30 And most of the data you can find is exactly the same as on the Skylake side. As you can see, in the figure on the right there is this yellow arrow shown; that is indeed the UPI that is used for the two Skylake dies to talk to each other in the same

45:01 package. Here is, I will not go through this, but you have a slide that, ah, gives the characteristics for the Haswell processor, which is what Bridges has; we will use it later on for some assignments, or if you want to use it for the project. It's an older server generation, and as you can see it's similar: 32 kilobytes of level one in this case, a smaller level two, 256 kilobytes per core, and a

45:27 larger level three per core. Um, this was the Knights Landing, that is now discontinued by Intel, but it went up to 72 cores, and something similar with the AVX instruction set. Um, the more cores, the more compute capability, you put on a single die, you actually cannot, uh, have as high a clock frequency, because of the way the CMOS technology works. And I will talk about that later; there's a known scaling law for it.

46:02 Essentially, the power dissipation increases in proportion to the square of the clock rate, or even more. So I will not go into it, but anyone interested can find more, uh, places to read about it.
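The power relationship alluded to here is usually stated through the standard first-order CMOS dynamic-power model, P = C V^2 f; a tiny sketch (my formulation for illustration, not the lecturer's):

```python
# Dynamic power P = C * V^2 * f. Supply voltage must scale roughly
# with frequency, so power grows at least quadratically with the clock
# (closer to cubic in this simple model), which is why many-core chips
# run at lower clocks.
def relative_power(f_ratio, v_tracks_f=True):
    """Power relative to baseline when the clock changes by f_ratio."""
    v_ratio = f_ratio if v_tracks_f else 1.0
    return v_ratio ** 2 * f_ratio

print(relative_power(2.0, v_tracks_f=False))  # f alone doubled: 2x power
print(relative_power(2.0))                    # V scales with f too: 8x power
```

Read the other way, halving the clock can cut power far more than half, so spending the silicon on more, slower cores is often the better use of a fixed power budget.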

46:23 So this is the corresponding chip from AMD to the Skylake, ah, chip from Intel. And it has, in fact, a larger instruction cache than the Intel processors: 64 kilobytes instead of 32. Otherwise, they are fairly similar in terms of cache sizes to what it is for the Intel processors. They are not using the same big, wide instructions; they are using 256 bits.

47:06 So it's not quite as wide. You know, whether that is a significant drawback or not depends on the application, because just the fact that you have a very wide instruction that can do a lot doesn't mean that the application is such that all the potential lanes in a very wide instruction can be used at the same time. So, in part, the argument from AMD has been that it's better to use the silicon resources on something else, as opposed to increased instruction width. That's the trade-off. The other thing that is different from the way Intel

47:46 has been doing things for years, um, is, uh, that they don't do single monolithic big chips like Intel does. Big chips, um, become costly, for one reason, not the only one, being that the silicon area is bigger, so, of course, you get fewer chips from a wafer; we'll talk about that at some point.

48:15 Um, but on the other hand, ah, the number of successful chips, ah, decreases, because the bigger the chip area, the more likely that there is some defect on the chip, and then the yield on chips goes down.
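A common first-order yield model (an addition for illustration, not from the lecture) makes that economics concrete; the defect density and die areas below are made-up round numbers:

```python
# With defect density D per cm^2, the fraction of good dies of area A
# is roughly Y = exp(-D * A) (Poisson yield model). Splitting a big
# die into small chiplets raises the yield of each piece dramatically.
import math

D = 0.2                      # defects per cm^2 (illustrative)

def yield_fraction(area_cm2):
    return math.exp(-D * area_cm2)

big, small = 7.0, 0.8        # monolithic die vs one chiplet, in cm^2
print(f"monolithic: {yield_fraction(big):.0%} good dies")
print(f"chiplet:    {yield_fraction(small):.0%} good dies")
```

With these numbers only about a quarter of the big dies come out defect-free, versus the large majority of the chiplets, which is the economics behind AMD's approach described next.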

48:33 So AMD has always tried to work with relatively small chips. And, ah, what they do instead is they put a number of these chips in the same package; they use what's known as "chiplets," and then put together the same kind of core count. But then they don't need the entire silicon area to be defect-free because, of course, they can discard the defective chiplets; I'm saying they can have a higher fraction of small chips that come out

49:07 fine and use those to assemble um, they. Another difference between

49:16 Another difference between AMD and Intel is that AMD has always had more memory channels and supported

49:25 higher bandwidth to memory than Intel. So for many years the customers always complained

49:31 to Intel; at every meeting, every user group meeting, one complaint was to

49:35 do something about their memory system and bandwidth to, uh, main memory. In

49:44 this case the interesting thing is that even the most recent Skylakes have six memory channels,

49:50 where AMD has eight. It's not a huge difference, but six channels is

49:55 25% less capacity than eight. And then AMD has also supported a slightly higher

50:01 clock rate for the main memory.
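The channel-count difference translates directly into peak bandwidth; the DDR4 transfer rate below is an illustrative assumption, not a figure from the slide:

```python
def peak_dram_bw_gbs(channels, mega_transfers_per_s, bus_bits=64):
    # Peak bandwidth = channels * bytes per transfer * transfers per second.
    return channels * (bus_bits // 8) * mega_transfers_per_s / 1000.0

six_ch   = peak_dram_bw_gbs(6, 2933)   # e.g. six channels of DDR4-2933
eight_ch = peak_dram_bw_gbs(8, 2933)   # eight channels at the same rate
print(six_ch, eight_ch)  # six channels gives 25% less than eight
```

A slightly higher memory clock on top of the extra channels widens the gap further, which is the point being made about AMD here.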

50:10 And here is the most recent version that is now, again, using the same idea: chiplets put together — ah, a

50:17 chip with many cores; in this case it supports 64 cores, and

50:25 eight memory channels in its package. It is a pretty substantial amount of transistors in

50:34 this particular package — something on the order of 40 billion transistors —

50:40 and it all still has to work; um, all the things have to work well. I

50:46 should also say that the other difference between Intel and AMD is,

50:54 AMD has tried to keep power dissipation lower, so they tend to be, often,

51:01 at a little bit lower clock rate than Intel, but again, as I said, they have traditionally

51:10 had more cores and better memory bandwidth, so they can still be competitive

51:15 in terms of performance; they just get there differently. Um, So

51:24 since power has been an issue, and an issue for a long time, there

51:32 has been an interest in trying to reduce power consumption. And now some examples of

51:39 what has been happening: because big end users like Amazon and, um, Google and

51:51 others have not been happy, ah, with their chips being very

51:59 power hungry, they have looked for other ways of designing servers. And in

52:06 fact, they do design their own servers — I'll come back to that — but they hadn't

52:10 designed their own silicon. But there has been an interest in, um, using

52:17 ARM processors instead of Intel processors to reduce the power consumption. Some of

52:26 you may have heard about ARM, maybe not. Many have; in fact,

52:32 ARM is by far the most common processor type. Ah, last time I looked, which

52:41 was a few years back, there were about 10 times more ARM processors sold every

52:49 year than Intel processors. And ARM processors are designed to be very power

53:01 and energy efficient, and they are dominating the mobile markets. So Apple

53:09 uses them, um, in many of their products; ah, Samsung uses them in many of

53:15 their phones; um, Qualcomm uses them. So there is a very large

53:21 market of mobile product companies that uses ARM processors. And as you can see

53:29 on this slide, in terms of the cache sizes, they are not,

53:40 um, second to this kind of server chip. In this case, for this particular version

53:46 of the ARM processor, it was 64 kilobytes for the level one cache,

53:52 which is actually better than Intel and on par with AMD. And the level

53:58 two caches are not too skimpy either, eh, so that's also in line with

54:03 server processors. And then for the level three, it depends — it is configurable.

54:10 However, um — the thing I failed to mention is that ARM does

54:15 not produce silicon. They license their designs to companies that

54:22 then configure their licensed products. So the licensees decide how much, and how

54:27 big, uh, the caches are in their product. But ARM cores

54:37 are typically under one watt, and depending on how you configure it, it can be a few

54:43 hundred milliwatts to maybe a couple of watts; it depends again on the company

54:50 that uses the designs. So, as I said, the big guys are not

54:58 happy with the power consumption, so they have gone off on their own. And Amazon, ah, decided

55:05 to build something they call — now in its version two, often known as

55:10 the Graviton2 — that has 64 ARM-designed cores, and they have 64-kilobyte

55:19 level one caches. They have one megabyte for,

55:23 um, the level two cache, and 32 megabytes for the level three

55:32 cache. Ah, I haven't, uh, found anything about some of the features, whether they are supported or not.

55:38 It's definitely a recent chip. It hasn't been published much because they don't

55:43 sell it, but you can actually use it. So, you know,

55:47 if one uses Amazon Web Services, you can try out the Graviton

55:53 2 for your code. It is one of the instance types

55:59 they support; they support quite a mixture of different instance types. It's a big

56:05 chip again, about 30 billion transistors, and it's produced in a 7-nanometer technology,

56:13 and it supports eight memory channels, so it is competitive with the high-end server chips,

56:21 and these chips use about a quarter or less of the power consumption. There's

56:30 another company that also uses ARM, and they do sell their chips. Of

56:35 course — this is the processor vendor known as Marvell, and they have a product line, the ThunderX;

56:41 this is one that was announced last year. And they also use the ARM

56:47 design. And what they supposedly have at this time is a 64-core version,

56:55 and they, in this case, have three rings interconnecting the cores,

57:00 and there are kind of 15 tiles on the interconnect; each tile has four

57:08 of these cores. They have, again, eight memory channels, but then they also

57:13 have their own chip-to-chip interconnect, which in this case delivers some number of gigabytes per second.

57:21 Um, so this is a bit of a summary of what I talked about in

57:27 terms of the core counts. So the thing that matters is, in particular, both

57:34 core counts and threads. Yes, that's a level of parallelism that needs to

57:38 be managed effectively, you know, even on just a single processor. So when we

57:45 talk about how to program — this is what makes things hard — it's important to understand the level

57:52 of parallelism that exists even on a single chip of today. Clearly in a cluster context

57:58 it's much more so. Scalability is important, and it becomes important even

58:05 on a single chip now. As we said, the cache sizes are

58:14 important when it comes to understanding how you can make good reuse of data,

58:23 and it goes without saying that's needed to get good performance; we'll talk

58:27 in a future lecture about how to make use of caches when there is potential in your

58:32 application to do so. I should finish by, um, talking about TPUs —

58:42 the TPU, that some of you may have heard about. It stands for Tensor

58:47 Processing Unit. It's again something you cannot buy; it is something that you

58:52 can use, and it's Google's. So Amazon did, uh, the Graviton2; Google, they designed

59:01 the TPU, and that's their silicon. They both, including Facebook, design

59:08 their own servers. So I don't — I wouldn't say they don't buy things from

59:15 Intel and AMD and IBM, they do, but they tell those companies what they want,

59:21 and these companies build it for them. And if the vendors don't want to build it,

59:25 then Google and Amazon have some other options now. So, here are two examples

59:36 of GPUs, the most common ones out there. And I'd guess most of you know

59:42 of NVIDIA. So this is the most recent one, known as the

59:49 Volta — the V stands for Volta in this product name, in the label here

59:54 on this slide. Ah, and the architecture is also similar for the previous

60:01 generation, the Pascal version — ah, what you have access to, um, on Bridges.

60:07 Um, like the server and, uh,

60:16 other chips we were just talking about here, GPUs are also kind of using replication to design their chips —

60:23 highly structured chips — and in this case it even comes in tiers.

60:30 There are six of what they call graphics processing clusters, GPCs for short, and these

60:40 six are, again, identical units on the chip; they are

60:47 replicated. So you can just replicate those six on the chip, and inside each

60:51 one of these six, um, GPCs, for this particular version of the chip,

60:59 they have 14 of what NVIDIA calls streaming multiprocessors. So it's again

61:07 replication, 14 times, of a particular design — the streaming multiprocessor. So in

61:15 total they have 84 of those on this Volta chip. Now, each one of

61:24 these streaming multiprocessors again contains copies of the same thing. So when it

61:33 comes to arithmetic, in this case, they have 64 cores that can do floating

61:41 point — also 64 cores doing integer arithmetic — in each one of these streaming

61:48 multiprocessors, and they have half as many cores that can do 64-bit floating

61:56 point. So there is quite a large number if you look at the actual units

62:03 that do the performance work, which is the integer and floating point function units. So instead

62:10 of you having potentially up to a hundred, or maybe a few hundred, units as on a

62:17 CPU, you here have a few thousands — so it's kind of an order of magnitude more.

62:22 On the other hand, uh, these execution units do not support more

62:30 than two floating point operations per execution unit per cycle — one FMA, a fused multiply-add, of

62:37 course. So they don't have the wide vector capabilities that the more expensive CPU cores have.

62:43 Now, with the AI, ah, emergence, another set of reduced-

62:55 precision units was also included, known as tensor cores, ah, in each one of the

63:01 streaming multiprocessors. Other things, um, are kind of noticeably

63:11 different from CPUs — and it's the same for AMD. It is, in

63:15 this case, so that the level one cache is not something that each core has, but

63:20 something that is shared among all cores in the streaming multiprocessor. And

63:27 the level two is also shared, at a little bit higher level, and it's

63:35 kind of targeted towards the memory system; as it shows on this slide, it is for memory

63:40 traffic rather than for, uh, the GPCs or streaming multiprocessors. Now, as

63:50 I mentioned, memory bandwidth is a problem, and has been a problem for a long

63:58 time. And so PC and server processor vendors have, uh, increased slowly —

64:08 sort of increased the number of memory channels; memory channels tend to be a standard 64

64:13 bits wide today. Um, but that's been the problem. As I said, one of

64:20 the big advantages of GPUs has been that they have had high memory bandwidth,

64:25 and that they have achieved by having effectively more channels to memory than server processors

64:33 tend to have. Now, if one has eight channels, um, that are 64

64:37 bits wide — uh, that is basically five hundred and twelve bits wide. So that

64:45 doesn't work so well any longer for the GPUs to compete with, because memory

64:50 got quite good on the server processors as well. So they have been the first

64:57 to move to a new way of integrating DRAM memory, that is known as high

65:03 bandwidth memory. So that's — now you stack dies, pieces of silicon,

65:11 memory pieces, in the same package — not on the circuit board or module;

65:17 it's in the same package as the GPU, and we'll talk about that in a

65:21 future lecture. So now, in this case — the GPU used as the example

65:26 is from NVIDIA — they have, for their high-end cards, up

65:32 to 4096-bit-wide channels to memory. And that's the way to get very high memory bandwidth today.
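Why the wide in-package interface matters can be seen by comparing it with a conventional eight-channel DRAM setup; the transfer rates below are illustrative assumptions, not numbers from the slide:

```python
def bw_gbs(bus_bits, giga_transfers_per_s):
    # Bandwidth = bus width in bytes * transfers per second.
    return (bus_bits // 8) * giga_transfers_per_s

hbm  = bw_gbs(4096, 1.75)    # 4096-bit HBM interface, assumed ~1.75 GT/s
dram = bw_gbs(8 * 64, 3.2)   # eight 64-bit DDR channels, assumed 3.2 GT/s
print(hbm, dram)             # the stacked memory path is several times faster
```

Even though each HBM transfer is slower, the sheer width of the in-package bus multiplies out to several times the bandwidth of the channel-based approach.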

65:39 And then, you know, as usual, these floating point numbers:

65:48 it's typical — half the number of 64-bit floating point units is kind

65:52 of half the performance for double precision compared to single. But it also supports other reduced precisions,

65:59 in particular for the tensor cores. In this case, it's, um, 125,

66:07 uh, teraflops for the tensor operations. For the record, it's a very big chip,

66:14 and it's, ah, a pretty hot chip, in terms of power: 300 watts. There's something about the

66:21 competing product from AMD that is very similar in terms of design.

66:30 Um, AMD, for a long time, did not, um, care too much

66:36 about double precision for computing. So their double precision performance tended to be,

66:44 um, not all that good in comparison, whereas their single

66:48 precision performance typically was significantly better than that would suggest; and the reason was that they

66:56 instead did a very good job in the gaming markets, and they were happy with

66:59 the gaming market working for them. Um, but recently AMD, too, like NVIDIA, has

67:07 shifted. So now they both want — they're actually going after the data center market,

67:12 and that means now they can be, for the first time in a couple

67:16 of years — this chip that is now out is competitive with the NVIDIA products I just

67:22 talked about. Again, the same principle they have is to try to keep

67:31 chips small. So you can see that — this is a different technology, so it's not a fair comparison,

67:36 quite — but this chip is only 300-plus square millimeters, compared to NVIDIA's

67:43 800-plus square millimeters. So that means AMD in this case has comparable performance,

67:54 and because, as it is, the chip is so much smaller, they

67:59 can try to compete favorably on price, as I mentioned to you. And this is,

68:09 um, going to these slides — I will elaborate more on this — the thing

68:16 Google decided to go off and do. For both power consumption and performance for AI

68:22 applications, the x86 architecture and instructions were not very good.

68:32 So they figured they needed something different, and this is something they decided to

68:38 do, um, for AI applications, and we will come back to talk

68:43 about that. But the graph in the upper right-hand corner shows, ah, a

68:53 roofline model — something I will talk about another time — that they used to characterize

68:57 their behavior, and it shows a little bit, in a summary way, how their design —

69:03 in this case the version two, not the three; this is an older graph

69:07 of their, ah, roofline side, or a graph of what it shows —

69:15 can compete, ah, with both CPUs and GPUs. So you may want to take

69:22 a look at that and the justification and see why they did this design.
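The roofline model referred to here boils down to a single min(): attainable performance is capped either by compute or by memory traffic. The peak numbers below are illustrative, not the TPU's actual figures:

```python
def roofline_gflops(intensity_flops_per_byte, peak_gflops, bw_gbs):
    # Attainable performance is the lower of the compute roof and the
    # memory roof (bandwidth times arithmetic intensity).
    return min(peak_gflops, bw_gbs * intensity_flops_per_byte)

# Assumed roofs: 10 TFLOP/s compute, 600 GB/s memory bandwidth.
for ai in (1, 4, 16, 64):
    print(ai, roofline_gflops(ai, 10_000, 600))
```

Low-intensity kernels sit on the sloped (bandwidth-limited) part of the roof; only kernels with enough flops per byte reach the flat compute ceiling, which is exactly the comparison such roofline plots make between CPUs, GPUs, and TPUs.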

69:28 They also used — again, memory bandwidth is critical — so they also

69:33 went, uh, to high bandwidth memory. Sorry,

69:41 I'll skip many of the slide's details. And, um, the other parts of

69:47 the graph, again, show the same gigabytes per second. And it's kind of a die

69:55 still smaller than NVIDIA's Volta chip — a little, somewhat; apparently they didn't

70:02 say how small the chip, or its size, in what they published. Um, but the power

70:09 is about the same, similar, and so is the bandwidth.

70:16 Um, so this was just kind of a note in terms of specialization giving you energy

70:24 and power benefits, and I was going to show a couple more slides,

70:29 but I guess I didn't get to talking about caches today. So that's all

70:36 right; we'll show the next couple of slides. So next time we'll then talk

70:41 about understanding how caches behave. So, uh — so, making good

70:50 use of specialization in terms of power and efficiency: that's why in the embedded

70:54 market it has always been customary to have dedicated processors for doing,

71:01 um, rendering, for doing communication protocol processing, for doing security algorithms, and, uh,

71:12 graphics. So here is just — it's kind of a concrete example, from HP,

71:19 of some of their embedded processors that I will show a picture of,

71:25 um, later. So, I guess — so here's a chip that, uh,

71:31 I happened to work with at my previous employer a few years back. So it has several

71:36 of these embedded processor cores. The cores are just copies

71:40 of the same kind of chip; they have their own level 1 and level 2

71:44 caches. Um, then there is a local memory. It was not labeled as

71:49 level three, because it is not running a caching protocol. It's obviously a useful piece

71:57 of memory, but it's something you manage on your own. Um, but

72:03 then this chip, as you can see, has a number of other processing units

72:07 for handling I/O, for handling communication protocols — in this case, there's,

72:15 um, a crypto engine in here. Ah — so I showed this slide

72:24 coming back to what I more or less started with. And that is saying that the

72:28 data paths are the critical part that one needs to worry about. And this slide misses

72:33 the message a little bit. But, um, the one point I wanted to

72:38 make about this thing is that not only does it

72:43 show the widths of the data paths — the numbers you should see by their sides;

72:50 that is, the dark lines with a little slash across

72:55 them, where you can see the width of those paths — but, um, it

73:08 is also the case that the caches, and the data paths that move data between the various caches, are not all operating at the same clock frequency.

73:17 So in this case, the level one cache operates at the same clock frequency as the

73:23 CPU and the register files, but the level two cache operates at half the

73:27 frequency. And then there is yet another,

73:32 ah, data path that operates at a third of this CPU clock.

73:37 So, just — when you look at things, it's not sufficient just to look

73:41 at the width of the data path; one also needs to pay attention

73:45 to the clock rate at which they operate to figure out the capabilities for moving data around.
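That point — width alone is not enough — is easy to quantify; the 256-bit width and 2 GHz core clock here are assumed numbers for illustration, not read off the slide:

```python
def path_bw_gbs(width_bits, clock_ghz):
    # Sustainable rate of a data path = bytes per cycle * cycles per second.
    return (width_bits // 8) * clock_ghz

core = 2.0                              # assumed core clock, GHz
l1_path = path_bw_gbs(256, core)        # L1 path at the full core clock
l2_path = path_bw_gbs(256, core / 2)    # L2 path at half the clock
io_path = path_bw_gbs(256, core / 3)    # another path at a third of it
print(l1_path, l2_path, io_path)        # same width, very different capability
```

Three paths of identical width deliver very different bandwidths once their clock domains differ, which is exactly what the slide's clock annotations are there to convey.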

73:54 And there is an example from, um, Qualcomm — sorry — um,

74:00 one of their chips that I will show: they

74:06 have, ah, a crypto engine, a signal processor, an image processor,

74:15 the USB, ah, and a processor that does the GPU work.

74:23 Ah, so there are all kinds of different types of specialized processors on the same processor die. There's an example that came from my

74:28 previous employer, and is called a vision processing unit.

74:35 It has, again, um, memory controllers; it has a bunch of vector

74:40 processors on it; it has a number of media processors on it. And this is

74:45 the way — we actually built some chips; in this case it consumes, normally, about one

74:51 watt per chip, and depending on what you do, it consumes less or more. So compared to what

74:55 I showed earlier in this lecture — up to 250 to 300 watt

75:01 chips — this is a couple of orders of magnitude less. Okay, so — as I said, I'm

75:10 about to close, since my time is up. So this slide is basically — and probably useless

75:16 as an inside scoop today, since I didn't get to talk about caches. But

75:21 when it comes to these chips: increasingly, even in server processors, and certainly

75:29 chips for PCs, and definitely for mobile devices, they are heterogeneous. And

75:35 that means each part of the chip may have its own instruction set and programming. Such

75:40 chips are a headache, to say the least, because their different memory

75:48 spaces are not necessarily shared, and they have different instruction sets, and so on. The industry has been

75:54 trying to figure out how to deal with that, uh, by then forming something they

75:59 call the Heterogeneous Systems Architecture — HSA, that's just for short.

76:06 There are quite a few companies involved in trying to come up with, uh, ways

76:11 of making programming of these systems bearable. So I believe I will stop there,

76:20 and I will talk about caches next time. So there are a few minutes left

76:27 for questions; the room is open for questions. Once they are over, we will start next

76:42 time, then, with the room open for questions. So the point of today's lecture is this:

76:48 to give you a general understanding of the properties of processors out there, not just qualitatively

76:56 but quantitatively, and that it is important, in order

77:05 to eventually understand what the opportunities are, for the particular application code

77:15 one may have — we will talk about that later in the course — to

77:21 try to take advantage of those features. Aren't there any questions?

77:52 No, I guess there's nothing; no one has come with questions so soon, so I'll stop the recording at this point.
