© Distribution of this video is restricted by its owner
00:00 | Mhm. Yes, it is time, and there are not many that have checked in yet. |
00:24 | Well, let's start anyway at this time. |
00:31 | So I'll continue with the topic of memory today. Last time I was |
00:37 | talking about caches and a little bit about how memory is integrated into a server, and |
00:49 | today I'll focus more on main memory itself, the properties that are important to understand. |
00:57 | Mm. So, that's it; you can see this as |
01:08 | part one, and hopefully there will be a part two; part one is about, basically, |
01:12 | main memory in itself, and part two, if I get to it, is more about |
01:19 | power management of systems. So the plan is to talk about the design, |
01:26 | then go through the various stages and try to summarize, as I have done |
01:32 | before. And then there is part two, as I said, if I get to it: |
01:39 | how to control power consumption. So now to the memory. This is just a |
01:46 | slide showing a little bit of, I think in some sense, the |
01:53 | difference between memory chips and processor chips. Go back and look at what was said |
02:00 | about processor chips: they tend to be in the range of 500-700 |
02:08 | square mm, and in this case here they are considerably smaller. I think this one |
02:20 | shows, a little bit, on the horizontal axis, three different |
02:26 | memory manufacturers, and then for each one of them the successive generations of chips; |
02:38 | but what this line chart shows is that they are considerably smaller. |
02:44 | As you can see from the left, they are about 40-60 sq mm. |
02:51 | So they are more than 10 times smaller than your typical processor chip. And that |
02:57 | in itself has certain consequences in terms of how you can actually work with these chips. |
03:05 | The good news is that memory tends to be very cheap. Part of the reason |
03:11 | for that is that the chips are small, so the yields in the manufacturing process |
03:16 | are usually considerably higher than they are for processor chips. So that helps, I think, |
03:24 | quite a lot. And you can also see, in terms of the number, or density, of bits on these |
03:29 | chips, figures that are pretty impressive. If you look at the most recent generation, chips |
03:37 | are at about 100 megabits per square mm. So that means you get a typical chip |
03:44 | of, you know, even up to several gigabits on a single piece of silicon that is 40 |
03:52 | to 50 square millimeters in size. Anyway, that was just to give a big |
04:00 | picture, a little bit of an overview of the chips that are used for making |
04:08 | memory in pretty much all systems. And the differences, I guess I should say, |
04:16 | are these. Memory has, since many decades back, been manufactured in the |
04:23 | same technology as the processor chips, that is, the CMOS technology, |
04:33 | but with two different types of memory cell designs, fundamentally different designs that also have |
04:44 | quite different properties. So last time I talked mostly about caches, and caches are built |
04:50 | out of what is known as SRAM, or static random access memory. Whereas main |
04:58 | memory, the thing that is packaged as DIMMs, for instance, and sold separately for |
05:04 | the circuit board, or in these high-bandwidth memories, tends to |
05:13 | be built out of what is known as dynamic random access memory, DRAM. And the difference between the |
05:21 | two: there is kind of a sketch here of what the design is like. So the |
05:27 | main memory, using the DRAM cells, is designed basically to be cheap, and for that |
05:37 | reason also very small in size, so you get lots of bits per unit piece of |
05:43 | silicon. And so they are effectively just one capacitor and a transistor that is basically an |
05:51 | on/off switch, and the capacitor is the thing that stores the charge that defines whether |
05:57 | the cell is in a zero or one state, whether the bit is a zero or a |
06:04 | one. Now, the thing that is used for caches is the SRAM, and those are |
06:11 | typically done as six-transistor cells, as you see on the right-hand side. |
06:15 | And that means they are considerably larger in terms of the silicon area required to store |
06:23 | a bit, compared to the DRAM. The difference is that they are able to hold |
06:30 | state, which the DRAM cells are kind of not very good at. So they |
06:37 | kind of leak, so it doesn't take all that long before you lose the state. |
06:45 | I'll come back to that. But it means that dynamic random access memory also |
06:51 | needs to have what's known as a refresh to restore the charges in the memory cells. |
06:58 | Otherwise you get errors, because the cells don't stay charged. Now, the way the |
07:07 | actual memories are organized is, in a somewhat simplified view, as a matrix, |
07:16 | and in each one of these little cross points between rows and columns bits are stored. And so, |
07:28 | for the operations, which I'll come to in a little bit: when you want a |
07:34 | particular bit, or a collection of bits, you have to give both a row address and a |
07:41 | column address. That's very simple, just like you get a matrix element out of a |
07:46 | matrix when you have a row and a column index. So there's nothing unique about that. But |
07:53 | it also means that if you take a micro photograph of a memory chip, you get |
07:58 | something that looks like the upper right-hand corner: they are incredibly well |
08:05 | ordered, as can also be seen on this picture. So they |
08:13 | are incredibly dense, and they're much denser, in terms of transistors per unit area, than |
08:20 | you'll find in a processor design, because they are so well structured and can be compacted |
08:30 | very highly. Here is just another slide that tells you a little bit; this dates |
08:34 | back, I guess, by now four years, so density has increased, but again it is |
08:42 | a picture of a memory chip, and this is just kind of a little bit more |
08:48 | of the same. And then if you go and look at the footprint, in terms |
08:55 | of the actual physical size of the cells, SRAM cells are considerably |
09:01 | larger than what the DRAM cells are. Again, that is why an |
09:08 | SRAM-based system is expensive relative to a DRAM-based one: it requires a larger amount of silicon. |
09:18 | If I remember, I guess it is roughly at least 10 times the size, if not more. |
09:27 | So now I was going to talk about how this actually works, and that is |
09:35 | in part a consequence of this kind of matrix-type design: how access to memory |
09:44 | works. So here are pictures of one thing: the left- |
09:50 | hand side shows, up against the array, that there is a row address and a column |
09:56 | address. But one of the properties of memory chips is that the column and row |
10:08 | addresses basically share the address lines. So you can only give one |
10:16 | address at a time: either the row address or the column address. And the |
10:21 | reason for actually sharing the wires and the pins is, again, that the memory chips are |
10:29 | small, so you don't have much real estate to actually provide all the signals you need: |
10:37 | both addressing and data and clocks and power and all of it. So it has |
10:44 | been the case for a long time that the pins for row and column addresses are |
10:50 | shared, and that has consequences. Now, is this why the DRAM is much slower than |
10:57 | the SRAM? I'm coming to that; it is not really part of it. I'll get to |
11:01 | why, and it is a good question, in a few slides. I'll try to |
11:05 | explain why it is, but it's inherent in the matrix design. That's where it comes |
11:11 | from, not so much from the fact that they share pins. So then, |
11:21 | in part because of energy as well, one has constraints in terms of how one |
11:35 | manages, in fact, the power, and I'll come to that. So there |
11:44 | is in fact, and I'll explain it in the next few slides, a process; |
11:49 | one goes through some of this, that is, in the right-hand column of |
11:53 | this slide, where it says activate, read, precharge, and refresh. And |
12:00 | on the next few slides I'll try to make sense out of why these things are |
12:06 | the way they are. But it means that it is a bit of a process to |
12:12 | either write or read the memory, because it takes several steps either to read or write |
12:18 | the memory. It's not just sending the address there, even if you have got the |
12:24 | separate row and column addresses. There is more to the story of how the |
12:28 | DRAMs are operated. That's not the case for SRAM. So it's unique to |
12:34 | the DRAM design, in order to keep the cells small and cheap. |
12:44 | So here is a little bit more detail again, where this capacitor stores a bit and |
12:51 | the transistor acts as the switch that basically allows you to either read or write the |
12:59 | bit. The problem with this DRAM cell is when you read. Suppose that |
13:05 | the capacitor being charged represents the one, which is the most common convention (it |
13:09 | can be the opposite); so there is a charge, and the transistor is what lets you reach the |
13:15 | cell. And when you want to capture the state, you in fact lose the charge |
13:22 | through the transistor. So whenever you read something, you kind of, as I |
13:31 | said, destroy the state, and that is not what you wanted to happen. |
13:36 | So that means that after you kind of read the value, you need to restore it |
13:42 | to make sure that the next time you want to access it, it still has |
13:46 | the state that you had before. So that's part of the reason why there is a |
13:51 | bit of a process involved: a read, in fact, requires a |
13:57 | write-type operation every time. Um, and if you remember this matrix picture, |
14:09 | I think I'll come back to it on the next slide, but I can't say |
14:13 | for sure. So, um, maybe I'll talk about it on this slide. So, basically, what |
14:20 | happens then: what one needs to do, for instance, in order to be able to read a row |
14:31 | is what is known as an activate. And that's again a function of the |
14:37 | fact of how you manage power on the chip: you basically only keep |
14:43 | one row enabled at a time in terms of reading rows. And that means, when you |
14:50 | go from reading one row to another row, there is a process, again, of closing, |
14:56 | or restoring, things that you may have destroyed in the read process, before you can |
15:04 | open or activate another row. Now, rows tend to be quite long; for instance, |
15:19 | rows may contain many words, so it's not just 32 or 64 bits. |
15:26 | So, in order to select some segment or number of bits in a row, you |
15:36 | also then give column addresses. So, with again the sharing of the pins, you |
15:43 | first give a row address, and that allows you then to activate the |
15:50 | row that contains the data you want to read. And then you provide column |
15:58 | addresses. But, as I mentioned, there are many data items to pick from in a row of |
16:05 | the DRAM. So you have the option of selecting basically the different columns such that |
16:16 | the collection of them contains the bits for a word that you want, |
16:22 | a single- or double-precision word, say. So once you have activated |
16:28 | a row, you can read collections of columns in sequence without going through the process of |
16:40 | activating the row again. So I'll come back to that in just a moment. |
16:47 | So the process is usually: activate the row, and then, if you have good access |
16:58 | locality in memory, hopefully you will read columns within the same row, in order |
17:07 | to avoid the penalty in time of going through the closing |
17:14 | of a given row and the activation of a new row. So this is just a simple |
17:19 | example showing that, in this case, there are three different columns that one wants to read |
17:26 | out of the same row. So that's why you activate row zero, and then you |
17:31 | read, say, column zero, and then read column one, and read column three after |
17:36 | that. But then the next read was for a different row, in this case just |
17:40 | the next row, row one; but it could have been any row, and the |
17:45 | process would be the same. So there is nothing unique about jumping just to the |
17:50 | next row. But then, before you can do that, one needs to do what is known |
17:56 | as the precharge. That is basically restoring whatever the row was that you were |
18:01 | working with. And after that's done, you can activate the new row that |
18:07 | you want to read, and then the process repeats over and over. And when a |
18:15 | row is read, it kind of goes into a row buffer, and I'll come back to |
18:21 | that in some later slides. So, in fact, what happens is that you copy the row into |
18:26 | the row buffer, and then you get the various columns out of this row buffer. |
18:36 | So this is just a listing of what I kind of just said: what happens |
18:41 | when you want to read from a row. You activate it, and then you |
18:46 | can go through the read or write process that you want to do, and then you need |
18:50 | to close it, and then move on to the next one. So I'll stop |
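As an aside, the activate / read / precharge sequence just described can be sketched in a few lines of code. This is only an illustrative model: the cycle costs are made-up round numbers, and `access_cost` is a hypothetical helper, not a real DRAM timing model.

```python
# Hypothetical per-step costs in memory-bus cycles, chosen only for
# illustration; real values come from the DRAM data sheet.
T_ACTIVATE = 14    # open a row (copy it into the row buffer)
T_READ = 10        # read one column out of the open row buffer
T_PRECHARGE = 14   # close (restore) the currently open row

def access_cost(requests):
    """Total cycles for a sequence of (row, column) reads.

    Another column from the already open row pays only the column read;
    switching rows pays precharge + activate on top of it.
    """
    open_row = None
    cycles = 0
    for row, _col in requests:
        if row != open_row:
            if open_row is not None:
                cycles += T_PRECHARGE   # close the old row first
            cycles += T_ACTIVATE        # then open the new one
            open_row = row
        cycles += T_READ                # column read from the row buffer
    return cycles

# The slide's example: three columns of row 0, then one column of row 1.
print(access_cost([(0, 0), (0, 1), (0, 3)]))          # 44
print(access_cost([(0, 0), (0, 1), (0, 3), (1, 0)]))  # 82
```

The second sequence costs almost twice as much as the first even though it reads only one extra column: the row switch pays the full precharge-plus-activate detour.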
18:58 | for questions after another couple of slides, and see if there are any. Given that there |
19:05 | is a bit of a process, a sequence of several steps, in order to read or |
19:10 | write a data item to the DRAM, one has this notion of cycle time, |
19:21 | which is not to be confused with the clock cycles of the memory. The cycle |
19:31 | time, when it comes to DRAM, is used to describe the |
19:40 | minimum time between making successive requests to the memory, because each request, or access, is |
19:55 | associated with several steps. So the cycle time for the DRAM is longer than |
20:04 | the access time, and I'll come back to that again with another graphical illustration. |
20:13 | So this was kind of an illustration of what the access time and cycle |
20:20 | time are and the relationship between them. It's kind of a silly example, |
20:27 | but it is just to try to illustrate the fact that there is a process, and that's why the time between |
20:36 | successive requests is longer than just the time to retrieve a particular item from the memory. Um, and now I'm into |
20:46 | a little bit of detail; but if any one of you at some point in |
20:52 | your life needs to buy or configure servers and PCs, these items are in |
21:00 | fact very important, and they refer to the different steps that are necessary when you |
21:09 | read or write in the DRAM. So there is this column access strobe, or |
21:23 | just the tCAS time: that is the number of cycles that it takes, after you |
21:32 | send or give the column address to the memory, to the DRAM, before you can |
21:41 | actually get the data for that column address. And part of these things comes from |
21:53 | the fact that, as it turns out, the memory bus, or channel, operates at a higher |
22:04 | clock frequency than the DRAM itself. So that's in part coming, um, halfway to |
22:14 | fully answering the question of why DRAM is slow. I will try to explain |
22:18 | it more, yeah, in the following slides: |
22:26 | why, also, the clock rate inside the memory is so much lower than the clock rate on the memory |
22:32 | bus, which in turn is actually slower than the clock rate of the processor, typically. |
22:39 | So the DRAM memory is characterized by four timing |
22:49 | parameters, or numbers. So one is, again, the number of cycles it takes, after you |
22:55 | give the column address, before you can actually get the values out. Then there is also the |
23:04 | number of cycles that it takes, after you give the row address, before you can give |
23:12 | the column address, and that is the so-called row-to-column delay, or RCD. |
23:22 | Then there is the point where you close up the row; that also takes some |
23:28 | time, and it is characterized by this time to precharge, RP. And then there is yet another |
23:39 | time, which is not related to the access time for the memory: that is the time |
23:46 | you kind of need to stay in a row after you issue a row address, before |
23:53 | you can issue the precharge, closing up the row to move to the next |
23:59 | row. So again, when you look at the specs for DRAM chips, a spec |
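A rough sketch of how these parameters combine for a single read, under the simplifying assumption that a row already open in the row buffer costs only the CAS latency, while a wrong row pays precharge plus row-to-column delay first. The cycle counts and the `read_latency_cycles` helper are illustrative, not taken from any data sheet.

```python
def read_latency_cycles(row_hit, t_cl=22, t_rcd=22, t_rp=22):
    """Bus cycles until data appears, for a simplified DRAM read.

    row_hit: True if the wanted row is already in the row buffer.
    A hit pays only the CAS latency (CL); a miss must precharge the old
    row (tRP), activate the new one (tRCD), and then read (CL).
    """
    if row_hit:
        return t_cl
    return t_rp + t_rcd + t_cl

def to_ns(cycles, bus_mhz=1600):
    """Convert bus cycles to nanoseconds at the given bus clock."""
    return cycles * 1000.0 / bus_mhz

print(read_latency_cycles(True), read_latency_cycles(False))  # 22 66
print(to_ns(read_latency_cycles(True)))                       # 13.75
```

With these example numbers, a row miss is three times slower than a row hit, which is exactly why access patterns that stay within a row matter so much.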
24:06 | tells you a little bit about what the bus rate tends to be, or is, |
24:11 | and I'll show you some examples in the next slide or two; but it is also specified |
24:17 | in terms of these various delays associated with the different steps in using the DRAM. |
24:29 | Here is kind of a little bit of a graphical illustration of how the different times are related. |
24:36 | So the RCD was the time after you give the row address before you can |
24:42 | give the column address. So when you want to do something, you first give |
24:48 | the row address, and that takes a certain time, and then you want to issue the |
24:53 | column address, and the tRCD tells you how many cycles it takes, in |
25:00 | terms of memory bus cycles, before you can go from one event to |
25:06 | the other. And once you are ready to issue a column address, that in turn |
25:12 | takes some time, the tCAS. But then you can issue several column addresses |
25:22 | without waiting for the completion, so you can kind of pipeline the column addresses |
25:31 | like that. And then we have this minimum active time for a row: that is the |
25:41 | tRAS, the row active time. |
25:47 | Then there is, on the right-hand side, the precharge. And when you are done with |
25:53 | all of this: yeah, so the minimum active time after you issue the row |
25:59 | address tells you the earliest time you can issue the precharge, and then the precharge |
26:06 | itself takes some time, and then you are kind of done with the whole process of activating a |
26:16 | row and closing it up. And now I'll talk about some numbers, I guess, for the chips here, and then |
26:22 | I'll stop and see if there are questions. Um, but first: I have used this notion |
26:30 | of DDR, and it will show up on the next few slides, and in case I didn't |
26:35 | define it in some earlier lecture, just for your reference again: DDR |
26:40 | stands for double data rate, and what it means is that one can move data both |
26:49 | when the clock signal is rising and when it is falling. So, as |
26:56 | sort of my little picture here shows, there are a bunch of clock cycles, and |
27:01 | the double arrow shows the length of a clock cycle. So in that case you |
27:05 | can get either read or write, uh, two values per clock cycle through this double |
27:13 | data rate design. Now, just in preparation for what I think is on the next |
27:20 | slide: it shows that for the external bus there is a ratio, typically four, so that |
27:28 | the internal clock for the memory is four times slower than the bus clock. |
27:39 | So here is now what I would say is a state-of-the-art table |
27:47 | for DDR. The DDR4 designs are now the most recent designs that are |
27:54 | being used in servers. At the end of this year there will be a new |
28:01 | server generation coming up that will use the next generation, DDR5, which may already be |
28:08 | used in some other scenarios; but, um, this table is still making the point about typical |
28:17 | DDR memories and the numbers in terms of data rates. Some of the |
28:27 | numbers change, and some of the numbers don't change much; I'll try to point that |
28:34 | out. We have talked earlier on about how the processor waits on memory, memory tending to be |
28:40 | the limiter because it is slow, um, and I'll |
28:47 | come to the reason why the clock rate is so much lower inside the memory in |
28:57 | a slide or two. But if you look at the first column on this slide, |
29:01 | it says DDR4, and then there is a four-digit number after: 1600 at the |
29:08 | top and then 3200 at the bottom, and that is related to, okay, the |
29:17 | memory channel bus speed. Um, if you |
29:29 | look at the second column, it gives you the actual clock rate for the |
29:40 | DDR memory chip itself, so to speak, or for the memory array of bits on |
29:50 | the memory chip. So that goes, in the lower range of performance for |
29:57 | DDR memory, from 200 MHz, and the |
30:05 | top of the line is 400 MHz. So, when I talked about |
30:13 | the processor designs, they tend to operate in the 2.5 to 4 gigahertz |
30:21 | range. So they have about 10 times higher clock rate than the rate at which the bit arrays in DDR memory operate. |
30:32 | And that is the fundamental reason, I would say, why DRAM memories are a lot slower. Now, one may wonder why the clock rate is so low |
30:37 | in the DRAM here. And I'll come to that, as I mentioned. Now, in order to |
30:43 | try to mitigate a little bit the problem of such low clock rates in |
30:52 | the DRAM memory, one plays some tricks, so that, as you can see here, the |
31:01 | I/O bus, or memory channel, can operate, if you look at this table, |
31:09 | consistently four times higher than what it is inside the memory; and I'll talk a little |
31:17 | bit about how these things can be done, in which way and how the |
31:25 | DRAM memory is designed in order to be able to also deliver things on the |
31:31 | bus at the rate the bus is operating, despite the fact that it is four times faster |
31:38 | than the memory cells themselves operate. |
31:46 | Then, if we look at this data rate, the transfer rate: that is nowadays often |
31:59 | rated as transfers per second, and MT/s just means million transfers per second, related to the MHz it operates at. But this |
32:06 | means every pin, or wire, can then deliver bits at this rate, which on the top |
32:17 | line is 1600 mega-transfers per second, and that is per wire, so to speak, in the |
32:28 | memory channel. As you can see, there is a factor of two, as the |
32:32 | data rate goes from 800 to 1600. So there are two per clock cycle, and that is |
32:37 | true throughout this column. So that is the DDR feature. And then |
32:44 | we have the module names here, which are then a factor of eight |
32:54 | higher than the transfer rate. So this |
33:03 | kind of indicates that it is for a times-eight chip, because now it is related more to |
33:10 | what is happening on the bus. And if you then use these times-eight chips and |
33:18 | put eight of them together, then you actually get the rates, in |
33:25 | terms of bytes per second, on the memory channel, which in this case is 12.8 |
33:31 | gigabytes per second for the lower end; for the higher speed grade, DDR4-3200, it is 25.6 gigabytes per second. So, |
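The arithmetic behind those channel numbers can be written out directly; the helper below just multiplies the transfer rate by the channel width in bytes (eight ×8 chips side by side give a 64-bit channel).

```python
def channel_bandwidth_gbs(mega_transfers_per_s, channel_bits=64):
    """Peak memory-channel bandwidth in GB/s.

    Eight x8 chips side by side form a 64-bit (8-byte) channel, so each
    transfer moves channel_bits / 8 bytes.
    """
    bytes_per_transfer = channel_bits // 8
    return mega_transfers_per_s * bytes_per_transfer / 1000.0

print(channel_bandwidth_gbs(1600))  # DDR4-1600 -> 12.8
print(channel_bandwidth_gbs(3200))  # DDR4-3200 -> 25.6
```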
33:44 | Moving on to the notion of all the timings: as I said, there were the |
33:51 | tRCD and the tRP, and the CAS, or column address |
33:57 | strobe, which I guess ended up being listed as CL in this column here. |
34:02 | But these are the values for the chips here that tell you how many clock cycles, |
34:07 | in terms of the memory bus, of delay it takes between giving the column |
34:16 | address and when you can get something back; then, after you give the row address, the |
34:21 | number of cycles it takes before you can issue a column address; and then the |
34:29 | precharge time. So you can see that the faster the internal clock, |
34:35 | that is, as you go down the table here, the higher the rate of the |
34:42 | clocks for the memory cells, the more delay cycles it takes before you can |
34:50 | actually do something. So even though the clock rates are higher as you go towards the |
34:58 | bottom here, towards the higher speed-grade chips, the delay between the different steps |
35:04 | in the process of reading or writing to the chip goes up. So in |
35:10 | the end, the column here to the right then tells you the actual physical time. |
35:18 | For the lower-grade chip it is 12.5 nanoseconds, and |
35:27 | if you look at the fastest one, it is not much better; um, it is actually |
35:35 | somewhat longer. So it is kind of counterintuitive, and that is something one needs to think about |
35:45 | when considering whether to spend the money to buy the fastest speed-grade memory |
35:52 | chips. Um, that has benefits, but: if, for instance, you don't need |
36:02 | to switch between different rows all the time, and you can stay in the same row, then |
36:08 | you don't need to pay the time for precharge and so on. So the |
36:12 | memory access pattern also gives you an idea of whether it pays off, because of |
36:22 | the latency involved: the latency does not necessarily go down. In fact, it |
36:27 | partly goes up, and potentially, in this case, it stays the same. So latency- |
36:34 | wise it doesn't buy you anything; bandwidth-wise it does buy you something, and that |
36:43 | is one very important aspect in how you configure your memory. Now, let me stop |
36:52 | here for a second, and then I'll try to explain a little bit why things are |
36:58 | kind of staying pretty much the same, and not only in terms of the |
37:05 | clock rate that you have inside the memory chip; but also, if you |
37:12 | go through it, and I think I have that on a later slide, then over time, |
37:17 | across the different generations of DDR, the latency hasn't changed much. So |
37:24 | I'll see if there are questions at this point, and then I'll again try to |
37:30 | explain more why these things are so slow, based on physics. Okay. Uh, to do a |
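One way to see the "latency stays flat" point numerically is to convert the CAS latency from cycles to nanoseconds across a few speed grades. The CL values below are typical-looking but hypothetical, not from a specific data sheet.

```python
def cas_latency_ns(cl_cycles, bus_mhz):
    """CAS latency converted from bus cycles to nanoseconds."""
    return cl_cycles * 1000.0 / bus_mhz

# (speed grade, bus clock in MHz, CAS latency in cycles); illustrative values.
grades = [("DDR4-1600", 800, 10), ("DDR4-2400", 1200, 15), ("DDR4-3200", 1600, 20)]
for name, mhz, cl in grades:
    print(f"{name}: CL={cl} cycles -> {cas_latency_ns(cl, mhz):.2f} ns")
# All three land on 12.50 ns: the cycle count grows with the clock,
# so the wall-clock latency barely moves between speed grades.
```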
37:46 | little bit more on the physics, then: I guess here is a little bit more of |
37:50 | what was said, with the latency measured in nanoseconds, what it is. And the graph |
37:56 | at the bottom left shows pretty much, I'd say, actually the speed rates; but the |
38:06 | different bullets here, as you can see in the table above, I guess I should say, |
38:12 | are the different generations of DDR memory, and the rightmost column here shows how |
38:18 | the latency has decreased. For the first one I don't have a year, but that is |
38:23 | long ago, in terms of the single-data-rate design; the double-data-rate |
38:30 | one is probably about 20 years old, the first one. But as you can see, |
38:36 | the numbers have not really changed, and that is |
38:46 | essentially inherent in the way the physics works for CMOS memories. So it is not |
38:53 | that memory designers are ignorant of the need to make higher-performance chips; this |
39:01 | is fundamental. So as long as one stays with the CMOS technology for the chips, the memory chips, the latency is not |
39:11 | going to change, and that is something very important to try to remember, as a kind of |
39:21 | rule of thumb: don't have too much hope that the internals of the DRAM memory will |
39:29 | improve in speed. Um, tricks have been played, if I can say so; there |
39:37 | have been ways to try to mitigate that fact by increasing the capability to deliver bits |
39:47 | to the memory bus. So one tries to make up for the slow clocks with |
39:56 | the actual architecture of the design of the chips, which I will get to, I think, |
40:03 | next. Um, yes. So I think this slide talks a little bit about how one |
40:10 | makes up for, or tries to bridge, the differences where one can; but the bridging is in |
40:17 | terms of bandwidth, not in terms of latency; one cannot beat strict physics. |
40:30 | So here is kind of a picture of how the DRAM chips are |
40:39 | put together. In one of the first few slides today I showed this kind of |
40:47 | fundamental idea of memory being organized as a matrix with rows and columns, and at each |
40:57 | cross point in between rows and columns there is a bit, sometimes more than one, |
41:04 | being stored; but fundamentally it is kind of a row and column organization. So here a |
41:11 | little, you know, square is used to denote one of these arrays |
41:21 | of memory bits. Now, this is for the case of DDR3, |
41:30 | and part of it is also true for the DDR4 that I'll talk about and give examples of |
41:35 | today; the DDR4 is a little more complex, and the principle is maybe more easily |
41:40 | explained on the DDR3. So that is why I still have these slides for |
41:45 | DDR3 memory, to illustrate how things are set up. So, in fact, inside the chip |
41:52 | there are eight arrays of memory cells, and those are known as banks, not to be |
42:05 | confused with ranks; that has to do with the sets of memory chips that together |
42:14 | fill the bus width. These banks are internal in the DRAM design, and |
42:23 | sometimes they are also known as pages, and things incur penalties when you |
42:32 | move from one bank to another. Well, as I just said, you need to give |
42:40 | row and column addresses, and that is inherent for each one of the banks. |
42:47 | But then you also need to select the bank. So the standard for the DDR |
42:55 | three memory requires that there are eight banks inside the chip; memory designers don't have |
43:03 | an option. So eight banks require three bits to select which particular bank I want |
43:10 | to talk to, and then we have the row addresses and the column addresses. So |
43:19 | now, if one looks at the spec for the chip that is exemplified in this drawing, |
43:25 | it is known as a one-gigabit times-eight chip. Yeah, so the chip |
43:34 | is eight bits wide in its delivery; so on every clock it puts out eight |
43:44 | bits. And then, for the organization behind a collection of eight bits, there is |
43:52 | at least some structure. So, yeah, so the chip: |
43:59 | as I think the naming says, these are the banks already mentioned, and so on, and this |
44:04 | is the times eight. So if one looks at the detail here, the |
44:09 | thing that comes out, that dimension out of the chip, is then its width; it pretty |
44:14 | much specifies the width of the output. Yeah. And the 128, that is, in |
44:26 | fact, the number of columns; um, there is a number that follows from that, because another way |
44:34 | of looking at it is that the 128 tells you how many columns there are. So that means there are seven |
44:39 | bits, which you find at the bottom here, to pick out which column you want to |
44:43 | read. And also, when I look at the bank description, you find a |
44:49 | number of columns in the description; or, rather, it is a number of rows times the number of columns, |
44:56 | and that is what is associated with each of the row-column intersections. So it is kind of the third |
45:02 | dimension, namely that there are 64 bits there. Now, the 64 |
45:15 | bits is kind of not totally obvious. So, I think, |
45:21 | right: this is known as, um, the burst mode of eight. |
45:31 | That means that for every request you in fact get 64 bits, and not just |
45:41 | eight. So that is kind of the row buffer, and a column of the row buffer, such that |
45:48 | in terms of the readout you actually get a 64-bit value instead of eight bits. |
45:58 | And that is how one can then kind of make up for the discrepancy between the data |
46:10 | rate of the memory channel and the internal clock rate. As I said, there was a factor |
46:16 | of four in the internal clock rate between the bus and the internals, and |
46:22 | it is double data rate, so per bus cycle you deliver two. |
46:30 | So that means there is in fact a factor of eight difference in terms of |
46:39 | what gets put out, and that is why the burst mode of eight is matching that: a factor |
46:45 | of four in the clock ratio and a factor of two for the double data rate. Okay. |
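That factor-of-eight matching can be written as a couple of lines of arithmetic; the constants are the ones just described for DDR3 (a 4:1 bus-to-array clock ratio, double data rate, and a ×8 chip).

```python
# Why the DDR3 burst length is eight: the I/O bus clock is four times the
# internal array clock, and DDR moves two transfers per bus cycle, so one
# internal array cycle must supply 4 * 2 = 8 transfers.
CLOCK_RATIO = 4      # bus clock / internal array clock
DDR_TRANSFERS = 2    # transfers per bus clock (rising + falling edge)
CHIP_WIDTH_BITS = 8  # a "x8" chip

burst_length = CLOCK_RATIO * DDR_TRANSFERS
bits_per_column_access = burst_length * CHIP_WIDTH_BITS
print(burst_length, bits_per_column_access)  # 8 64
```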
46:52 | All right. So, uh, let's see if we have something. Yeah, so these are |
46:57 | the numbers of rows and columns, simply. And so, if you work out the numbers |
47:06 | here, 16K rows times 128 columns times 64 bits per row-column |
47:13 | intersection, um, then you get 128 megabits; and then there are eight of those banks, |
47:21 | so, all considered: okay, a one-gigabit chip. So, you |
47:35 | see, memories are fairly complex entities in their own right. Uh, let's see what |
47:46 | we have. Right. Yeah. So the other thing, also, that I point |
47:54 | out in number two on this slide is that many of the processors work with |
48:04 | 64-byte cache lines, and each DIMM's width is typically 64 bits, or eight |
48:19 | bytes. So also this burst mode of eight kind of matches the ability to serve cache |
48:27 | lines. So that is the connection between burst mode and cache lines. Um, any |
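The capacity and cache-line arithmetic just described, written out explicitly:

```python
# Capacity of the 1 Gbit x8 DDR3 chip from the slide:
ROWS = 16 * 1024           # 14 row-address bits
COLS = 128                 # 7 column-address bits
BITS_PER_INTERSECTION = 64
BANKS = 8                  # 3 bank-address bits

bank_mbits = ROWS * COLS * BITS_PER_INTERSECTION // 2**20
chip_gbits = bank_mbits * BANKS // 1024
print(bank_mbits, chip_gbits)  # 128 1

# The cache-line match: a burst of 8 on a 64-bit (8-byte) wide DIMM
# delivers exactly one 64-byte cache line per request.
BURST_LENGTH = 8
DIMM_WIDTH_BYTES = 8
print(BURST_LENGTH * DIMM_WIDTH_BYTES)  # 64
```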
48:45 | questions on this? Uh, so, as I said, I showed you something about DDR |
48:52 | four, and then I'll give you a little bit more by way of examples here about DDR |
48:58 | four memory and the problems in using it. So, again, one is trying |
49:14 | to figure out, again, how to increase the ability to deliver things faster to the memory |
49:24 | bus, despite the fact that the clock rate internally is low; and that is part of |
49:30 | what one is trying to do with the DDR4 design. |
49:35 | So it tends to have groups of banks, and the access time depends on whether you |
49:44 | are working in a single group, or whether different accesses access things in |
49:52 | different groups. So the timing performance of the DDR4 is more complex than |
50:01 | it is for the DDR3, but it does improve the potential peak one |
50:09 | can get out of the memory. So the DDR4 peak, |
50:21 | that is, on the memory bus, is according to the spec twice that of |
50:28 | the DDR3 at full speed; but it does not mean that the internals are working at twice the |
50:35 | speed, just that the complexity of the architecture, as I will call it, of the |
50:41 | DRAM chip itself has got more complex. So it has, |
50:51 | a little bit, I guess, these four groups of banks; the addressing is kind |
50:56 | of similar, and still things are shared on the bus. But then, as I just |
51:06 | said, it is more complicated. That is why, also, you cannot just take a |
51:13 | server that was designed for DDR4 memory and try to plug in a, |
51:18 | uh, chip for, say, DDR5; it does not work. So then I |
51:27 | have a slide here that leads to this problem: how do you get the


51:34 | peak performance, or not, out of the DDR RAM? Because of the internal design,


51:44 | as well as the pin limitation, there are potentially serious performance issues


51:54 | with DDR. So it's by no means uniform random access memory: the notion of


52:02 | RAM, when one talks about main memory, means DRAM, but it's by no means


52:08 | uniform access. So, yes, basically, what this says is: in this particular example, there
|
|
52:22 | is a memory chip that is DDR3-1333 — 1333 mega-transfers per second — so if one works out the


52:33 | numbers, as was done on a previous slide, that corresponds to 10.66


52:39 | gigabytes per second. And that's when the access pattern, per the diagram, is


52:47 | the most favorable. If you're only working in some of the banks, it drops by a factor


52:56 | of two. And if you just work in a single bank, then it goes down


53:02 | to 1/8 of the peak performance. So, if you happen to be unlucky, and the successive


53:15 | data items that you want happen to be residing in


53:23 | the same bank, your performance out of main memory goes down by a factor


53:28 | of eight for the DDR3 memory — and potentially more for DDR4. So it's a


53:41 | big difference in terms of the ability of the memory to deliver its peak performance: the access


53:50 | pattern has a huge impact on the delivered performance. And this is just for a single DRAM chip, I think.
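
The quoted peak and degraded rates for DDR3-1333 can be reproduced as follows (a sketch; the factor-of-2 and factor-of-8 penalties are the ones stated on the slide):

```python
# DDR3-1333: 1333 million transfers per second, 8 bytes per transfer
# on the 64-bit bus.
transfers_per_second = 1333e6
bytes_per_transfer = 8

peak_gb_s = transfers_per_second * bytes_per_transfer / 1e9
print(round(peak_gb_s, 2))      # 10.66 GB/s for the most favorable pattern
print(round(peak_gb_s / 2, 2))  # 5.33 GB/s for the less favorable bank pattern
print(round(peak_gb_s / 8, 2))  # 1.33 GB/s when confined to a single bank
```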
|
|
53:58 | And then, on a multicore, it is easy to see that, when accesses to the memory come from


54:03 | many cores, you may also have additional effects, with successive requests


54:15 | collapsing onto single banks or few banks. So here is kind of


54:25 | a simple example in a server-type node, in which case, in the best


54:32 | scenario, successive accesses are simply interleaved across the memory channels, so you


54:42 | use all the memory channels. And in the worst


54:48 | case, it turns out that all the data that you want is on a single memory channel,


54:55 | and for that memory channel it happens to be in a single bank. So in


55:00 | this case the ratio between the best scenario and


55:12 | the worst scenario — the best scenario being 341 gigabytes per second — shows a very severe degradation, depending


55:19 | on how the access pattern sits relative to the


55:24 | main memory design. So in this case it can be a factor of over 20 in


55:32 | performance degradation, purely as a function of where and how you access the memory holding the data your code uses.
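
A toy model of the interleaving just described (the channel and bank counts here are made up for illustration; real address mappings are more involved):

```python
# Hypothetical mapping of consecutive cache-line addresses to
# (channel, bank): rotate across 4 channels first, then across 8 banks.
CHANNELS, BANKS = 4, 8

def placement(line_addr):
    return (line_addr % CHANNELS, (line_addr // CHANNELS) % BANKS)

# A stride-1 walk spreads across every channel (the best scenario):
print({placement(a)[0] for a in range(8)})        # {0, 1, 2, 3}
# A stride of CHANNELS * BANKS hits one channel AND one bank (the worst):
print({placement(a) for a in range(0, 128, 32)})  # {(0, 0)}
```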
|
|
55:45 | So, any questions on this, before I talk about why the clock rate is low? — So, I have a question:


55:56 | if we use a single bank, how do the channels and the bandwidth relate, with strided access?


56:08 | — Well, hopefully, the way your data is laid out with respect to the memory, and the way you


56:18 | access it, will be such that you don't end up in this worst case.
|
|
56:27 | So, typically, what is being done is that, if you have, for instance, matrices


56:37 | or other multidimensional arrays, they are first flattened into a one-dimensional array. All multidimensional arrays


56:47 | are, by default, you know, in row-major or column-major order, depending on whether


56:53 | it is C or Fortran; so the array is turned into a 1-D array. And this


57:00 | 1-D array is then typically laid out such that you get the best performance if


57:08 | it is accessed with stride one. So that means it is laid out across memory channels


57:15 | and across banks within the DRAM chips for each of the memory channels. So one


57:27 | tries to do it so that, if you have stride-one access, you get the
|
|
57:32 | peak memory performance. But, coming back to matrices: that means, if


57:41 | it is flattened row-wise, then as you go along a row, the next


57:47 | element in the row is also in a place where you get fast access.


57:53 | But when you go down a column, the next column element is


58:03 | the length of a row away from it.


58:11 | The first element in the second row is the row length away from the first


58:17 | element in the first row. So that


58:24 | means it may end up in a place where it is in the same bank


58:29 | as the first element of the first row. So it could be that all


58:36 | the column elements are in the same bank, so that, when you do


58:45 | column access instead of row access, you get the very slow single-bank performance instead of the best-case scenario across banks and channels.
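
The flattening argument can be made concrete with index arithmetic (a minimal sketch; the matrix shape is made up for illustration):

```python
# Row-major flattening: element (i, j) of an R x C matrix lives at flat
# index i*C + j, so a row walk is stride 1 while a column walk jumps a
# whole row length (C elements) per step.
R, C = 8, 1024

def flat(i, j):
    return i * C + j

row_walk = [flat(0, j) for j in range(3)]  # consecutive: stride 1
col_walk = [flat(i, 0) for i in range(3)]  # jumps of C:   stride 1024
print(row_walk, col_walk)
```

If C happens to be a multiple of the period of the bank interleave, every step of the column walk can land in the same bank, which is exactly the slow case described above.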
|
|
58:53 | So this tends to matter when people have optimized code for the Fortran


59:04 | ordering — and matrix multiply was made part of, I


59:08 | guess, the second assignment, I mentioned. Then, if one were to convert


59:18 | that code to C code — which has the opposite


59:33 | flattening — and the loop order is preserved, the innermost loop does the slow accesses, and you get miserable


59:44 | performance. And that is kind of fairly easy to demonstrate, with matrix multiply for instance.


59:51 | Regardless of the programming language used, try to


59:59 | swap the order between the two innermost loops, and you usually will see a big difference


60:03 | in performance. — So the bandwidth is limited according to the data access, but it's


60:13 | not just that, if you use the channels, your bandwidth is automatically


60:19 | the peak; depending on the access pattern to the RAM, your bandwidth


60:25 | may be limited. Right, correct? — Yes, correct. — Thank you.
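
The loop-interchange experiment suggested here can be sketched as follows (illustrative only; in Python the interpreter overhead dominates, but in C or Fortran the two orders differ exactly as described, because one of them strides a whole row per step):

```python
# Sum an N x N row-major matrix with the two innermost-loop orders.
N = 300
a = [[i * N + j for j in range(N)] for i in range(N)]

def sum_ij(m):  # row-wise: consecutive elements, stride 1
    s = 0
    for i in range(N):
        for j in range(N):
            s += m[i][j]
    return s

def sum_ji(m):  # column-wise: each step jumps a whole row
    s = 0
    for j in range(N):
        for i in range(N):
            s += m[i][j]
    return s

# Same answer either way -- only the memory access pattern differs.
assert sum_ij(a) == sum_ji(a)
```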
|
|
60:34 | So that's the point: from, you know, the programmer's perspective, to get good performance, one would


60:43 | need to be conscientious about both how your arrays are, by default, laid out in memory, and


60:56 | what the access pattern in the code is relative to those arrays.


61:04 | And, unfortunately, there is not much control over the layout — the kind of flattening


61:13 | of multidimensional arrays — since standard programming languages assume by default the notion that memory is random


61:23 | access. So the language doesn't take the structure of the memory architecture into account: it


61:32 | assumes things are random access, and that's the premise on which the design is made.


61:37 | Unfortunately, that's not true in reality, and that's why one needs, as a programmer


61:43 | trying to optimize performance, to understand also how the memory system itself is designed.
|
|
61:55 | Now, a little bit of comments on this. So, I mentioned that there is physics


62:01 | behind why it's hard to get the clock rates to go


62:10 | up to be comparable to what they are for processors. So, yes, here's the


62:16 | graph that talks about that; I think I showed this


62:23 | slide before. So, you can see


62:31 | also that, in the technology being used for building memory, one has effectively


62:42 | an RC circuit, and the feature sizes given by state-of-the-art technology


62:52 | today — typical DRAM memories are in the 10-to-50-nanometer


62:59 | range. The feature sizes used are not always the smallest; basically, it is a question of whether you use the state of the art or one generation behind,
|
|
63:05 | in terms of cost. Okay. But, yes, the point is


63:11 | how it scales. So, as we hopefully know — I mean, from some


63:24 | kind of physics or electrical engineering course at some point in time — we can all imagine that


63:32 | the thinner the wire is, the higher the resistance is. Or, you know, I think


63:39 | the usual analogy is a straw, which


63:44 | most of us have seen: pushing


63:50 | things through a really tiny straw is much harder than pushing things through a nice fat


63:58 | straw. So the resistance goes up the smaller the feature sizes are on the


64:06 | chip. And something is also happening to the


64:13 | capacitors: the area scales down — and that's a good thing — but also the


64:23 | vertical distance between the plates scales down. So that changes the amount


64:31 | of charge that they can hold in order to store bits in the


64:39 | capacitor. So, in the end, it ends up being this RC constant that defines


64:45 | how things behave, and if you work out the scaling: the cross-section of the


64:51 | wire goes down with the square of the scaling factor, which means the resistance goes up with the square of the scaling factor, while the capacitance of the wire roughly shrinks with the


65:02 | scaling factor. But, in the end, the RC product actually gets worse with the scaling factor.
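
The first-order scaling argument can be written out (a rough sketch, constants ignored): shrink the feature size by a factor s > 1 while the wire length stays fixed.

```python
# R ~ L / (w * t): width and thickness shrink by s -> R grows by s**2.
# C shrinks by roughly s                           -> C falls by s.
# So the RC delay grows by roughly s: smaller features, slower wire.
def rc_scaling(s):
    r_factor = s ** 2      # resistance increase
    c_factor = 1.0 / s     # capacitance decrease
    return r_factor * c_factor

print(rc_scaling(2.0))  # 2.0 -> halving the feature size doubles RC
```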
|
|
65:10 | Now, if the wire gets shorter, then your resistance also gets


65:23 | smaller, because the wire is shorter. But, when it comes to memory, the


65:32 | lengths of the wires tend to stay the same, because you need to get things out


65:36 | of the chip. So you have to drive signals across the chip. So the


65:44 | wire length remains the same, and,


65:51 | even though the wires get thinner and smaller and the transistors get smaller and the charges get smaller,


66:00 | in the end it gets kind of worse. So it is through a lot of tinkering with


66:08 | the physics that one has actually managed


66:16 | to retain the clock rates on the memory chips. In principle, one should


66:23 | expect them to actually have potentially gotten worse, in the sense that it would be necessary


66:31 | to use lower clock rates for state-of-the-art technology, but one has managed to maintain


66:37 | them. Now, I said that's because one has to run things across the chip to get


66:47 | the signals out — but that's not quite true. Let me quickly flip back to one of the early slides, if I can do a quick flip here — I


66:51 | just don't want to lose the right place. So here you can see there


66:56 | are basically segments: inside these arrays there are segments, so one doesn't run the full length


67:04 | without signal restoration. But one still needs to run the wires quite some


67:10 | distance. So, in practice, between signal restorations, the wires remain at the same length.
|
|
67:19 | Now, that's not as true when you look at processor designs, because processor designs


67:25 | are not as dense, and they have more ability to also handle power. So in


67:31 | that case one can actually benefit from the smaller feature sizes — except, as I mentioned,


67:38 | the leakage is the problem, so one cannot even increase the clock rates on processor chips


67:44 | anymore. So even there, things have kind of


67:49 | landed in a space where clock rates don't change much; they are not increasing much


67:56 | either. And the discrepancy in


68:05 | clock rates comes from, basically, the fundamental physics, and the desire to have very dense designs


68:12 | for memory and not-quite-as-dense designs for processors — it's more than a


68:22 | factor of 10 less for processor chips. So in that case one has been able


68:29 | to run things at the higher clock rates despite the smaller feature sizes. So this


68:41 | was trying to explain why it is that the clock rates have remained low for memory


68:52 | chips, and that there is this problem that is not easily solved by the architecture of the


69:03 | DRAMs. So I hope that somewhat answers the question of why


69:12 | the DRAMs are so much slower, and that one has tried to make up for that


69:18 | by having this burst mode inside the chips, and multiple banks, in order to be


69:29 | able to output, or deliver, more, in a certain sense. So the DRAM chip


69:39 | is inside itself parallel, by having the internal buses deliver more bits than the external bus — which is only 64 bits — in one internal fetch, for the eight transfers. But the
|
|
69:51 | other thing to be aware of is that the latency of memory chips has remained pretty


69:57 | much constant for over a decade — almost two decades — and it's not likely to


70:03 | change going forward, even though one


70:12 | may increase the amount of parallelism in the DDR memory, to deliver more on the memory bus. All right.
|
|
70:25 | Now, a little bit about another way of trying to alleviate the difference in speed


70:34 | between main memory and the processor chip — that is, the speed of the caches on-chip,


70:43 | on the processor die. So one thing is to try to bring main memory


70:50 | closer than being, say, on the circuit board, and basically try to use DRAM —


71:01 | DRAM inside the processor chip itself — and that has been done in some recent processor


71:09 | designs. And that's known as eDRAM, for embedded DRAM. So in


71:15 | that case one uses the DRAM cell design, and not the SRAM


71:24 | cell design, for this so-called embedded DRAM. It has been used in, for instance —


71:33 | IBM started to use it for their L3 caches, and sometimes they also call


71:37 | them L4 caches. And Intel has also started to use this embedded DRAM.


71:43 | It is still one-transistor cells for the bits, but it has different speed and


71:58 | data-retention properties, again because it is a totally different design. So it is


72:04 | behaving, or operating, more like the cache SRAM cells; it is operating in a


72:12 | different mode because of the design. But it is, again, CMOS technology, used in


72:19 | order to try to get a little bit better speed and energy efficiency; and, again, it is


72:27 | one transistor, so you can get more bits with eDRAM than


72:32 | you can get with SRAM in the same area. So that's kind of the trade-off


72:36 | that designers weigh when deciding to use embedded DRAM on some of the chips. So here


72:45 | is kind of an example of where it has been put — you can see it there,
|
|
72:49 | next to the processor die. And the other thing is to try to get another memory


72:58 | design, but one that is still external to the processor. And this was known —


73:03 | the 3D XPoint, that is — it became a product about two or three years ago;


73:12 | it is, I'd say, a new way of building memory cells, and in terms of speed


73:20 | and cost it fits between the DRAM and flash memory. It's just something to be


73:26 | aware of; it doesn't change the picture much. It sits in between: XPoint sits between


73:34 | the DRAM and what is prevalent for disk, in some flavor, for flash. And here


73:42 | are just summary characteristics in terms of latency for the different types of memory technologies


73:51 | that are being used; the colored lines are for the different memory technologies,


73:58 | and it kind of shows a little bit how things fit in terms of latency. So I
|
|
74:06 | will stop there, and if there are any more questions I'll try to summarize. And then,


74:16 | since probably I won't do part two, in


74:25 | just a few minutes, I guess, I will do my own summary, just reminding you


74:34 | of what was partially covered in the lecture last time, in terms of how


74:42 | these DRAM chips are then put together in modules known as DIMMs; the


74:50 | chips used are of different widths, and this enables configuration in terms of different amounts


74:59 | of memory that you can have — not only


75:07 | by using different numbers of chips, but the number of bits per DRAM chip, the width, also helps


75:16 | in configuring memory. And then things get worse, both energy-wise and time-wise, when


75:22 | you go out on a circuit board, via the socket onto the circuit board


75:31 | and DIMM slot. So embedded systems tend to use memory chips directly soldered onto the


75:43 | circuit board instead of using the DIMMs and DIMM slots. And then, again, to increase both


75:53 | bandwidth performance and performance in terms of latency, one has, in recent years — in the


76:05 | last few — started to stack memory chip dies and then integrate them in the


76:11 | same package as the processor chip, by using what's known as a silicon


76:18 | interposer, which has a lot more wires than what you can do from the socket to the board. So you get more channels to the memory, and you can
|
|
76:24 | also operate them at a good speed. And then there was just a simple summary


76:32 | of the performance differences, both in terms of speed and energy,


76:47 | and, I guess, if one looks at the bottom rows here, one can see


76:53 | that the stacked memories — the high-bandwidth memories — are about 10 times as energy-efficient


77:01 | as the DDR4, and also give considerably higher data rates. So


77:11 | there are some choices today; the stacked, high-bandwidth memories are clearly also more expensive, but


77:20 | still a candidate. So it has been used in some of the GPUs


77:23 | in particular — yeah, but not all GPUs, because of cost. So you


77:30 | can get GPUs with either HBM memory or standard graphics DDR memory. Let's see what is
|
|
77:38 | next. Yes. So the main point — which we discussed a bit, and the question also


77:45 | touched on — to be aware of is that main memory, despite its name, the RAM, is


77:54 | by no means uniform access; it can be a huge performance difference. So it's


78:06 | worth trying to understand, when performance is not good, whether the reason might


78:15 | be just an unfortunate access pattern relative to the way the compiler decided to flatten the


78:25 | arrays. And, right, this is it, in a sense. So, okay: no time this
|
|
78:33 | time for part two; maybe time for questions. I'll come back to part


78:39 | two in a future lecture — it will not be the next lecture, I'm


78:44 | afraid; then, probably, as decided, we'll be talking about OpenMP. And the part two here, about


78:51 | managing power: there are possibilities for the user to actually manage power, but


79:01 | in most cases it can help explain somewhat the performance data that you collect in doing


79:11 | benchmarking or timing experiments, because the OS controls the


79:22 | clock rates, and that happens under the hood. So it might be


79:30 | useful for most of you as a way


79:39 | of trying to understand what might have caused the difference between different runs at different


79:45 | times. And also, I think, it's quite interesting to see how the big


79:51 | companies, like Facebook and others, do it, in terms of how they actually control power in their data


79:57 | centers, including all the way down to the chip. But I'll stop here and take questions.
|