The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES LEISERSON: So today we're going to talk about assembly language and computer architecture. It's interesting that these days most software courses don't bother to talk about these things, and the reason is that, as much as possible, people have been insulated from performance considerations when writing their software. But if you want to write fast code, you have to know what is going on underneath, so you can exploit the strengths of the architecture. And the best interface we have to that is the assembly language. So that's what we're going to talk about today.

So when you take a particular piece of code like fib here, to compile it you run it through Clang, as I'm sure you're familiar with at this point. And what it produces is binary machine language that the computer hardware is programmed to interpret and execute. It looks at the bits as instructions, as opposed to as data, and it executes them. And that's what we see when we execute.

This process is not one step. There are actually four stages to compilation: preprocessing, compiling-- sorry for the redundancy; that's sort of a bad name conflict, but that's what they call it-- assembling, and linking. So I want to take us through those stages.

So the first thing that happens is you go through a preprocess stage, and you can invoke that with Clang manually. So you can say, for example, if you do clang -E, that will run the preprocessor and nothing else. And you can take a look at the output there and see how all your macros got expanded and such before the compilation actually goes through.
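For concreteness, here is a minimal sketch of driving the four stages by hand, assuming the usual recursive fib routine like the one on the slide (the file names are illustrative, and the flags are the standard Clang ones mentioned in lecture):

    /*
     *   clang -E fib.c -o fib.i    # preprocess only: expand #include and macros
     *   clang -S fib.i -o fib.s    # compile the preprocessed source to assembly
     *   clang -c fib.s -o fib.o    # assemble to an object file
     *   clang fib.o -o fib         # link into an executable (Clang drives ld/gold)
     *
     * Plain "clang fib.c -o fib" runs all four stages in one go.
     */
    #include <stdio.h>
    #include <stdint.h>

    /* Assumed shape of the fib example: the naive recursion. */
    int64_t fib(int64_t n) {
      return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    int main(void) {
      printf("fib(10) = %lld\n", (long long)fib(10));
      return 0;
    }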
Then you compile it, and that produces assembly code. So assembly is a mnemonic structure of the machine code that makes it more human readable than the machine code itself would be. And once again, you can produce the assembly yourself with clang -S.

And then finally-- penultimately, maybe-- you can assemble that assembly language code to produce an object file. And since we like to have separate compilation, you don't have to compile everything as one big monolithic hunk. Then there's typically a linking stage to produce the final executable, and for that we are using ld for the most part. We're actually using the gold linker, but ld is the command that calls it.

So let's go through each of those steps and see what's going on. First, the preprocessing is really straightforward, so I'm not going to go through that-- it's just a textual substitution. The next stage is the source code to assembly code. So when we do clang -S, we get this symbolic representation, and it looks something like this, where we have some labels on the side, we have some operations, we have some directives, and then we have a lot of gibberish, which won't seem like so much gibberish after you've played with it a little bit. But to begin with, it looks kind of like gibberish.

From there, we assemble that assembly code, and that produces the binary. And once again, you can invoke it just by running Clang. Clang will recognize that it doesn't have a C file or a C++ file-- it says, oh goodness, I've got an assembly language file-- and it will produce the binary.
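As a small taste of what that binary looks like, here is a trivial function of my own and the kind of disassembly a tool typically prints for its optimized object code (the byte encodings shown are the standard x86-64 ones for these two instructions):

    int zero(void) { return 0; }
    /* A disassembler typically shows something like:
     *     31 c0    xorl %eax, %eax    # opcode 0x31 = XOR; ModRM byte 0xc0 names
     *     c3       retq               #   the %eax,%eax operand pair; 0xc3 = ret
     * Three bytes, two instructions: each mnemonic and operand maps onto a
     * specific bit pattern in the machine code.
     */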
Now, the other thing that turns out to be the case is that, because assembly and machine code are really very similar in structure, things like the opcodes, which are the things that are here in blue or purple, whatever that color is-- like these guys-- correspond to specific bit patterns over here in the machine code. And these are the addresses and the registers that we're operating on, the operands; those correspond to other bit codes over there. And there's very much a-- it's not exactly one to one, but it's pretty close to one to one, compared to C, where if you look at the binary it's way, way different.

So one of the things it turns out you can do is, if you have the machine code-- and especially if the machine code was produced with so-called debug symbols, that is, it was compiled with -g-- you can use this program called objdump, which will produce a disassembly of the machine code. So it will tell you, OK, here's what the mnemonic, more human-readable code-- the assembly code-- is, from the binary. And that's really useful, especially if you're trying to do things--

Well, let's see, why do we bother looking at the assembly? Why would you want to look at the assembly of your program? Does anybody have some ideas? Yeah.

AUDIENCE: [INAUDIBLE] made or not.

CHARLES LEISERSON: Yeah, you can see whether certain optimizations are made or not. Other reasons? Everybody is going to say that one. OK.

Another one is-- well, let's see, so here are some reasons. The assembly reveals what the compiler did and did not do, because you can see exactly what the assembly is that is going to be executed as machine code. The second reason, which turns out to happen more often than you would think, is that, hey, guess what, the compiler is a piece of software. It has bugs. So your code isn't operating correctly-- oh goodness, what's going on? Maybe the compiler made an error. And we have certainly found that, especially when you start using some of the less frequently used features of a compiler. You may discover, oh, it's actually not that well broken in. And as it mentions here, the bug may only have an effect when compiling at -O3, but if you compile at -O0 or -O1, everything works out just fine.
So then you say, gee, somewhere in the optimizations they did an optimization wrong. One of the first principles of optimization is do it right, and then the second is make it fast. And so sometimes the compiler doesn't do that.

It's also the case that sometimes you cannot write code that produces the assembly that you want, and in that case you can actually write the assembly by hand. Now, it used to be many years ago-- many, many years ago-- that a lot of software was written in assembly. In fact, in my first job out of college, I spent about half the time programming in assembly language. And it's not as bad as you would think, but it certainly is easier to have high-level languages, that's for sure. You get a lot more done a lot quicker.

And the last reason is reverse engineering. You can figure out what a program does when you only have access to its binary-- so, for example, the matrix multiplication example that I gave on day one. We had the overall outer structure, but for the inner loop, we could not match the Intel Math Kernel Library code. So what did we do? We didn't have the source for it. We looked to see what it was doing, and we said, oh, is that what they're doing? And then we were able to do it ourselves without having to get the source from them. So we reverse engineered what they did. So all those are good reasons.

Now, in this class, we have some expectations. One thing is, assembly is complicated and you needn't memorize the manual. In fact, the manual has over 1,000 pages. But here's what we do expect of you. You should understand how a compiler implements various C linguistic constructs with x86 instructions, and that's what we'll see in the next lecture. And you should be able to read x86 assembly language with the aid of an architecture manual. On a quiz, for example, we would give you snippets, or explain what the opcodes that are being used mean, in case that's not provided.
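For a sense of what that kind of reading looks like, here is a minimal sketch of my own (the comments describe what Clang commonly does at these optimization levels; the exact instructions will vary):

    #include <stddef.h>

    /* Compile with "clang -O0 -S sum.c" and again with "clang -O2 -S sum.c",
     * then compare the two .s files. */
    long sum(const long *a, size_t n) {
      long s = 0;
      for (size_t i = 0; i < n; ++i) {
        s += a[i];   /* At -O0, s and i live on the stack and are reloaded every
                      * iteration; at -O2, Clang typically keeps them in registers
                      * and may auto-vectorize the loop with SSE/AVX instructions. */
      }
      return s;
    }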
But you should have some understanding of that, so you can see what's actually happening. And you should understand the high-level performance implications of common assembly patterns. OK, so what does it mean to do things in a particular way in terms of performance? Some of them are quite obvious: vector operations tend to be faster than doing the same thing with a bunch of scalar operations.

If you do write assembly, typically what we use is a bunch of compiler intrinsic functions-- so-called built-ins-- that allow you to use the assembly language instructions. And after we've done this, you should be able to write code from scratch if the situation demands it sometime in the future. We won't do that in this class, but we expect that you will be in a position to do it afterwards-- you should get a mastery to the level where that would not be impossible for you to do. You'd be able to do it with a reasonable amount of effort.

So for the rest of the lecture, I'm going to start by talking about the instruction set architecture of the x86-64, which is the one we are using for the cloud machines that we're using. Then I'm going to talk about floating-point and vector hardware, and then I'm going to do an overview of computer architecture.

Now, all of this I'm doing-- this is a software class, right? It's software performance engineering we're doing. So the reason we're doing this is so you can write code that better matches the hardware, and therefore get more out of it. In order to do that, I could present things at a high level. My experience is that if you really want to understand something, you want to understand it to the level that's necessary and then one level below that. It's not that you'll necessarily use that one level below, but it gives you insight as to why that layer is what it is and what's really going on. And so that's kind of what we're going to do.
We're going to do a dive that takes us one level beyond what you probably will need to know in the class, so that you have a robust foundation for understanding. Does that make sense? That's part of my learning philosophy: go one step beyond, and then you can come back.

The ISA primer. So the ISA talks about the syntax and semantics of assembly. There are four important concepts in the instruction set architecture: the notion of registers, the notion of instructions, the data types, and the memory addressing modes. And those are sort of indicated, for example, here. We're going to go through those one by one.

So let's start with the registers. The registers are where the processor stores things. And there are a bunch of x86 registers-- so many that you don't need to know most of them. The ones that are important are these. First of all, there are the general purpose registers, which typically have width 64, and there are many of those. There is a so-called flags register, called RFLAGS, which keeps track of things like whether there was an overflow, whether the last arithmetic operation resulted in a zero, whether there was a carry out of a word, or what have you.

The next one is the instruction pointer. The assembly language is organized as a sequence of instructions, and the hardware marches linearly through that sequence, one after the other, unless it encounters a conditional jump or an unconditional jump, in which case it'll branch to whatever the location is. But for the most part, it's just running straight through memory.

Then there are some registers that were added quite late in the game, namely the SSE registers and the AVX registers. These are vector registers. The XMM registers came in when they first did vectorization, and they used 128 bits. There are also, for AVX, the YMM registers.
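To connect the XMM registers to C code, here is a minimal sketch using one of the compiler intrinsics he mentioned a moment ago (the header and intrinsic are the standard Intel ones):

    #include <immintrin.h>   /* Intel SSE/AVX intrinsics */

    /* __m128 is a 128-bit value that lives in an XMM register; _mm_add_ps
     * compiles to a single addps, adding four packed floats at once instead
     * of issuing four scalar additions. */
    __m128 add4(__m128 a, __m128 b) {
      return _mm_add_ps(a, b);
    }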
And in the most recent processors, which we're not using this term, there's another level of AVX that gives you 512-bit registers. Maybe we'll use that for the final project, because it's just a little more power for the game-playing project. But for most of what you'll be doing, we'll just be keeping to the C4 instances in AWS that you guys have been using.

Now, the x86-64 didn't start out as x86-64. It started out as x86, and it was used for machines-- in particular the 8086-- that had a 16-bit word. So, really short. How many things can you index with a 16-bit word? About how many?

AUDIENCE: 65,000.

CHARLES LEISERSON: Yeah, about 65,000. 65,536 words you can address-- or bytes, rather. This is byte addressing, so that's 65K bytes that you can address. How could they possibly use that for machines? Well, the answer is, that's how much memory was on the machine. You didn't have gigabytes. So as Moore's law marched along and we got more and more memory, the words had to become wider to be able to index it. Yeah?

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yeah, but here's the thing: if memory is too expensive and you can't get memory that's big enough, then if you build a wider word-- like, if you build a word of 32 bits-- your processor just costs twice as much as the next guy's processor. So instead, what they did is they went along as long as that was the common size, and then had some growth pains and went to 32, and from there they had some more growth pains and went to 64. OK, those are two separate things.

And, in fact, they did some really weird stuff. What they did, in fact, is when they made these longer registers, they made registers that are aliased to exactly the same thing for the lower bits.
So they can address them either by a byte-- so these registers all have the same-- you can use the lower and upper half of the short word, or you can use the 32-bit word, or you can use the 64-bit word. And if you were doing this today, you wouldn't do that. You wouldn't have all these registers that alias and such. But that's what they did, because this is history, not design. And the reason is that when they were doing it, they were not designing for the long term.

Now, are we going to go to 128-bit addressing? Probably not. 64 bits addresses a spectacular amount of stuff. You know, not quite as many-- 2 to the 64th is what? It's like how many gazillions? It's a lot of gazillions. [2^64 is 18,446,744,073,709,551,616, about 1.8 x 10^19, so 2^64 bytes is 16 exbibytes.] So, yeah, we're probably not going to have to go beyond 64.

So here are the general purpose registers. As I mentioned, they have different names, but they cover the same thing. So if you change eax, for example, that also changes rax. And you see they originally all had functional purposes. Now they're all pretty much the same thing, but the names have stuck because of history. Instead of calling them register 0, register 1, or whatever, they all have these funny names. Some of them are still used for a particular purpose, like rsp, which is used as the stack pointer, and rbp, which is used to point to the base of the frame, for those who remember their 6.004 stuff. So anyway, there are a whole bunch of them, and they have different names depending upon which part of the register you're accessing.

Now, the format of an x86-64 instruction is an opcode and then an operand list. The operand list is typically 0, 1, 2, or-- rarely-- 3 operands, separated by commas. Typically, all operands are sources, and one operand might also be the destination. So, for example, if you take a look at this add instruction, the operation is an add, and the operand list is these two registers: one is edi and the other is ecx. And the destination is the second one.
When you add-- in this case, what's going on is it's taking the value in ecx, adding the value in edi into it, and the result is in ecx. Yes?

AUDIENCE: Is there a convention for where the destination [INAUDIBLE]

CHARLES LEISERSON: Funny you should ask. Yes. So what does "op A, B" mean? It turns out, naturally, that the literature is inconsistent about how it refers to operations, and there are two major conventions that are used. One is the AT&T syntax, and the other is the Intel syntax. In the AT&T syntax, the second operand-- the last operand-- is the destination. In the Intel syntax, the first operand is the destination. OK, is that confusing? Almost all the tools that we're going to use are going to use the AT&T syntax. But you will read documentation-- Intel documentation-- which will use the other syntax. Don't get confused. OK? I can't help it-- it's like, I can't help that this is the way the state of the world is. OK? Yeah?

AUDIENCE: Are there tools that help [INAUDIBLE]

CHARLES LEISERSON: Oh, yeah. In particular, you could compile it and undo it, but I'm sure there's-- I mean, this is not a hard translation thing. I'll bet if you just Google, you can in two minutes-- in two seconds-- find somebody who will translate from one to the other. This is not a complicated translation process.
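To pin the operand-order difference down, here is a small sketch of my own using GNU inline assembly, which is written in AT&T syntax (inline assembly itself isn't covered in this lecture; it's used here only to show the syntax):

    /* Adds d into c with an explicit addl. In the AT&T syntax below, the
     * destination %[c] is the LAST operand; the Intel manuals would write the
     * same instruction as "add c, d", with the destination first. */
    static inline int add_att(int c, int d) {
      __asm__("addl %[d], %[c]" : [c] "+r"(c) : [d] "r"(d));
      return c;
    }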
Now, here are some very common x86 opcodes. Let me just mention a few of these, because these are ones that you'll often see in the code. So move-- what do you think move does?

AUDIENCE: Moves something.

CHARLES LEISERSON: Yeah, it puts something in one register into another register. Of course, when it moves it, this is a computer science move, not a real move. When I move my belongings in my house to my new house, they're no longer in the old place, right? But in computer science, for some reason, when we move things, we leave a copy behind. So they may call it move, but--

AUDIENCE: Why don't they call it copy?

CHARLES LEISERSON: Yeah, why don't they call it copy? You got me.

OK, then there's conditional move. This is a move based on a condition-- and we'll see some of the ways that this is used-- like move if the flag is equal to 0, and so forth. So, basically, a conditional move: it doesn't always do the move.

Then you can extend the sign. So, for example, suppose you're moving from a 32-bit register into a 64-bit register. Then the question is, what happens to the high-order bits? There are two basic mechanisms that can be used. Either they can be filled with zeros, or-- remember that the first bit, the leftmost bit as we think of it, is the sign bit in two's-complement binary-- that bit can be extended through the high-order part of the word, so that if the whole number is negative it stays negative, and if it's positive the high bits are zeros, and so forth. Does that make sense?

Then there are things like push and pop to do stacks. There's a lot of integer arithmetic: addition, subtraction, multiplication, division, various shifts, address calculation shifts, rotations, incrementing, decrementing, negating, et cetera. There's also a lot of binary logic-- AND, OR, XOR, NOT-- those are all bitwise operations. And then there is Boolean logic, like testing to see whether some value has a given value, or comparing. There's an unconditional jump, which is jump, and there are conditional jumps, which are jumps with a condition. And then things like subroutines. And there are a bunch more, which the manual will have and which will undoubtedly show up-- like, for example, the whole set of vector operations we'll talk about a little bit later.

Now, the opcodes may be augmented with a suffix that describes the data type of the operation or a condition code.
OK, so an opcode for data movement, arithmetic, or logic uses a single-character suffix to indicate the data type, and if the suffix is missing, it can usually be inferred. So take a look at this example. This is a move with a q at the end. What do you think q stands for?

AUDIENCE: Quad words?

CHARLES LEISERSON: Quad word. OK, how many bytes in a quad word?

AUDIENCE: Eight.

CHARLES LEISERSON: Eight. That's because originally it started out with a 16-bit word, so they said a quad word was four of those 16-bit words. That's 8 bytes. You get the idea, right? But let me tell you, this is all over the x86 instruction set-- all these historical things and all these mnemonics that, if you don't understand what they really mean, can get you very confused. So in this case, we're moving a 64-bit integer, because a quad word has 8 bytes, or 64 bits.

This is one of my-- it's like, whenever I prepare this lecture, I just go into spasms of laughter as I look and say, oh my god, they really did that. Like, for example, on the last page, when I did subtract. The sub operator, if it's a two-argument operator, subtracts the-- I think it's the first from the second. But there is no way of subtracting the other way around. It puts the destination in the second one: it basically takes the second one minus the first one and puts that in the second one. But if you wanted to have it the other way around, to save yourself a cycle-- anyway, it doesn't matter. You can't do it that way. And all this stuff the compiler has to understand.

So here are the x86-64 data types. The way I've done it is to show you the difference between C and x86-64. So, for example, here are the declarations in C: there's a char, a short, an int, an unsigned int, a long, et cetera. Here's an example of a C constant that gives you those things. And here's the size in bytes that you get when you declare that.
And then the assembly suffix is one of these things. So in the assembly, it's b for a byte, w for a word, l or d for a double word, q for a quad word-- i.e., 8 bytes-- and then single precision, double precision, and extended precision.

Sign extension uses two data-type suffixes. So here's an example. The first one says we're going to move-- and now you see I can't read this without my cheat sheet. So what is this saying? This is saying we're going to move with a zero extend, and the first operand is a byte, and the second operand is a long. Is that right? If I'm wrong-- it's like, I've got to look at the chart too, and of course we don't hold you to that. But the z there says extend with zeros, and the s says preserve the sign. So those are the things.

Now, that would all be well and good, except that then what they did is: if you do a 32-bit operation where you're moving into a 64-bit register, it implicitly zero-extends into the high-order bits. If you do it for smaller values and store them in, it simply overwrites the values in that part of the register-- it doesn't touch the high-order bits. But when they did the 32 to 64-bit extension of the instruction set, they decided that they wouldn't do what had been done in the past; they decided that they would zero-extend things, unless there was something explicit to the contrary. You got me, OK.

Yeah, I have a friend who worked at Intel, and he had a joke about the Intel instruction set. You'll discover the Intel instruction set is really complicated. He says, here's the idea of the Intel instruction set: to become an Intel Fellow, you need to have an instruction in the Intel instruction set-- an instruction that you invented and that's now used in Intel. He says nobody becomes an Intel Fellow for removing instructions. So it just sort of grows and grows and grows and gets more and more complicated with each thing.
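To tie the suffixes and that implicit zero extension back to C, here is a small sketch of my own; the commented instructions are the ones compilers typically pick for these widths (shown for orientation, not taken from the slides):

    #include <stdint.h>

    void store_b(uint8_t  *p, uint8_t  v) { *p = v; }  /* movb %sil, (%rdi) -- byte        */
    void store_w(uint16_t *p, uint16_t v) { *p = v; }  /* movw %si,  (%rdi) -- word        */
    void store_l(uint32_t *p, uint32_t v) { *p = v; }  /* movl %esi, (%rdi) -- double word */
    void store_q(uint64_t *p, uint64_t v) { *p = v; }  /* movq %rsi, (%rdi) -- quad word   */

    /* Loads that widen use a two-suffix move, or rely on the implicit rule: */
    int64_t  widen_signed(const int32_t *p) { return *p; }  /* movslq (%rdi), %rax: sign-extend l to q */
    uint64_t widen_byte  (const uint8_t *p) { return *p; }  /* movzbl (%rdi), %eax: zero-extend b to l;
                                                             * writing %eax implicitly zeroes the upper
                                                             * 32 bits of %rax */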
Now, once again, for extension, you can sign-extend. And here are two examples: in one case, moving an 8-bit integer to a 32-bit integer and zero-extending it, versus preserving the sign.

Conditional jumps and conditional moves also use suffixes to indicate the condition code. So here, for example, the ne indicates that the jump should only be taken if the arguments of the previous comparison are not equal-- ne is "not equal." So you do a comparison, and that's going to set a flag in the RFLAGS register. Then the jump will look at that flag and decide whether it's going to jump or just continue the sequential execution of the code.

And there are a bunch of things that you can jump on, which are status flags, and you can see the names here. There's Carry. There's Parity-- parity is the XOR of all the bits in the word. Adjust-- I don't even know what that's for. There's the Zero flag; it tells whether the result was a zero. There's a Sign flag-- whether it's positive or negative. There's a Trap flag, and Interrupt enable, and Direction, and Overflow. So anyway, you can see there are a whole bunch of these.

So, for example, here, this is going to decrement rbx, and it sets the Zero flag if the result equals zero. And then the conditional jump jumps to the label if the ZF flag is not set, in this case. OK, does that make sense? After a fashion. It doesn't make rational sense, but it does make sense.

Here are the main ones that you're going to need. The Carry flag is whether you got a carry or a borrow out of the most significant bit. The Zero flag is whether the ALU operation was 0. The Sign flag is whether the last ALU operation had the sign bit set. And the Overflow flag says it resulted in arithmetic overflow.

The condition codes-- if you put one of these condition codes on your conditional jump or whatever, this tells you exactly which flags are being checked. So, for example, the easy ones are if it's equal. But there are some other ones there.
So, for example, why do the condition codes e and ne check the Zero flag? And the answer is that, typically, rather than having a separate comparison, what they've done is separate the branch from the comparison itself. But it also needn't be a compare instruction. It could be that the result of the last arithmetic operation was a zero, and therefore it can branch without having to do a comparison with zero.

So, for example, if you have a loop where you're decrementing a counter until it gets to 0, it's actually faster by one instruction to test whether the loop index hit 0 than it is to have the loop count up to n and then, every time through the loop, compare with n before you can branch. These days that optimization doesn't mean much, because, as we'll talk about in a little bit, these machines are so powerful that doing an extra integer operation like that probably has no bearing on the overall cost. Yeah?

AUDIENCE: So this instruction doesn't take arguments? It just looks at the flags?

CHARLES LEISERSON: Just looks at the flags, yep. Just looks at the flags. It doesn't take any arguments.
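Here is the count-down pattern in C, as a minimal sketch; the commented instructions describe the typical shape of the compiled loop back-edge rather than an exact listing (register choice varies):

    void do_work(long i);   /* some external function, so the loop isn't optimized away */

    void repeat(long n) {
      for (long i = n; i > 0; --i) {   /* back-edge is typically:                     */
        do_work(i);                    /*   decq %rbx   (or subq $1, ...) -- sets ZF  */
      }                                /*   jne  .Lloop -- no separate cmpq against n */
    }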
Now, the next aspect of this is that you can give registers, but you can also address memory. And there are three direct addressing modes and three indirect addressing modes. At most one operand may specify a memory address.

So here are the direct addressing modes. For immediate, what you do is you give it a constant-- like 172, a random constant-- to store into the register, in this case. That's called an immediate. What happens is that if you look at the instruction-- if you look at the machine language-- 172 is right in the instruction. It's right in the instruction, that number 172. Register says we'll move the value from the register, in this case %cx, and then the index of the register is put in that part of the instruction. And direct memory says use a particular memory location, and you can give a hex value.

When you do direct memory, it's going to use the value at that place in memory. And to indicate that memory location, it's going to take you, on a 64-bit machine, 64 bits-- 8 bytes-- to specify that address. Whereas, for example, with movq $172, the 172 will fit in a byte, and so I'll have spent a lot less storage in order to do it. Plus, I can do it directly from the instruction stream, and I avoid having an access to memory, which is very expensive.

So how many cycles does it take if the value that you're fetching from memory is not in cache or a register? If I'm fetching something from memory, how many cycles of the machine does it typically take these days? Yeah.

AUDIENCE: A few hundred?

CHARLES LEISERSON: Yeah, a couple of hundred or more-- a couple hundred cycles to fetch something from memory. It's so slow. No, it's that the processors are so fast. And so clearly, if you can get things into registers-- most registers you can access in a single cycle. So we want to move things close to the processor, operate on them, and shove them back. And while we're pulling things from memory, we want other things to be working on. And so the hardware is all organized to do that. Now, of course, we spend a lot of time fetching stuff from memory, and that's one reason we use caching. Caching is really important-- we're going to spend a bunch of time on how to get the best out of your cache.

There's also indirect addressing. So instead of just giving a location, you say, let's go to some other place-- for example, a register-- and get the value, and the address is going to be stored in that location. So, for example, here, register indirect says, in this case, move the contents of rax into-- sorry, the contents of rax is the address of the thing that you're going to move into rdi. So if rax held location 172, then it would take whatever is in location 172 and put it in rdi.
Register indexed says do the same thing, but while you're at it, add an offset. So once again, if rax had 172, in this case it would go to 344 to fetch the value out of that location, 344, for this particular instruction.

And then there's instruction-pointer relative, where instead of indexing off of a general purpose register, you index off the instruction pointer. That usually happens in the code-- for example, you can jump to where you are in the code plus four instructions, so you can jump down some number of instructions in the code. Usually you'll see that used only for control, because you're talking about locations in the code. But sometimes they'll put some data in the instruction stream, and then the code can index off the instruction pointer to get those values without having to soil another register.

Now, the most general form is base indexed scale displacement addressing. Wow. This is a move that has a constant plus three terms, and it's the most complicated addressing mode that is supported. The mode refers to the address at whatever the base is-- the base is a general purpose register, in this case rdi-- and then it adds the index times the scale, where the scale is 1, 2, 4, or 8, and then a displacement, which is that number on the front. And this gives you very general indexing of things off of a base pointer. You'll often see this kind of accessing when you're accessing stack memory, because you can say, here is the base of my frame on the stack, and now for anything that I want, I'm going to be going up a certain amount, and maybe scaling by a certain amount, to get the value that I want.

So once again, you will become familiar with the manual. You don't have to memorize all of these, but you do have to understand that there are a lot of these complex addressing modes.
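Here is where the full base-index-scale-displacement form shows up naturally; a minimal sketch of my own (the commented instruction is the typical encoding for this access pattern):

    /* a[i + 2] lives at address a + 8*i + 16, which maps directly onto
     * displacement(base, index, scale):
     *     movq 16(%rdi,%rsi,8), %rax
     * base = %rdi (a), index = %rsi (i), scale = 8, displacement = 16. */
    long get(const long *a, long i) {
      return a[i + 2];
    }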
The jump instructions take a label as their operand, which identifies a location in the code. The labels can be symbols-- in other words, you can say, here's a symbol that I want to jump to, and it might be the beginning of a function, or it might be a label that's generated to be at the beginning of a loop or whatever. They can be exact addresses-- go to this place in the code. Or they can be relative addresses-- jump to some place that, as I mentioned, is indexed off the instruction pointer. And then an indirect jump takes as its operand an indirect address-- oop, that's a typo on the slide; it just takes an indirect address as its operand. So basically, you can say, jump to whatever is pointed to by that register, using whatever indexing method you want.

So that's kind of the overview of the assembly language. Now let's take a look at some idioms. The XOR opcode computes the bitwise XOR of A and B. We saw that XOR was a great trick for swapping numbers, for example, the other day. So often in the code you will see something like this: xor rax, rax. What does that do?

AUDIENCE: Zeros the register.

CHARLES LEISERSON: It zeros the register. Why does that zero the register?

AUDIENCE: Is the XOR just the same?

CHARLES LEISERSON: Yeah, it's basically taking rax XORed with rax-- and when you XOR something with itself, you get zero-- and storing that back into rax. So that's actually how you zero things, and you'll see that. Whenever you see that: hey, what are they doing? They're zeroing the register. And that's actually quicker and easier than having a zero constant that they put into the instruction. It saves a byte, because this ends up being a very short instruction-- I don't remember how many bytes that instruction is.

Here's another one: the test opcode. test A, B computes the bitwise AND of A and B and discards the result, preserving the RFLAGS register. So basically, what does the test instruction in these examples do?
786 00:44:38,270 --> 00:44:41,990 So what is the first one doing? 787 00:44:41,990 --> 00:44:43,913 So it takes rcx-- yeah. 788 00:44:43,913 --> 00:44:46,328 AUDIENCE: Does it jump? 789 00:44:46,328 --> 00:44:54,060 It jumps to [INAUDIBLE] rcx [INAUDIBLE] 790 00:44:54,060 --> 00:44:58,430 So it takes the bitwise AND of A and B. 791 00:44:58,430 --> 00:45:04,040 And so then it's saying jump if equal. 792 00:45:04,040 --> 00:45:06,011 So-- 793 00:45:06,011 --> 00:45:08,396 AUDIENCE: An AND would be non-zero in any 794 00:45:08,396 --> 00:45:09,350 of the bits set. 795 00:45:09,350 --> 00:45:11,800 CHARLES LEISERSON: Right. 796 00:45:11,800 --> 00:45:14,213 AND is non-zero if any of the bits are set. 797 00:45:14,213 --> 00:45:15,139 AUDIENCE: Right. 798 00:45:15,139 --> 00:45:18,817 So if the zero flag were set, that means that rcx was zero. 799 00:45:18,817 --> 00:45:20,150 CHARLES LEISERSON: That's right. 800 00:45:20,150 --> 00:45:22,760 So if the Zero flag is set, then rcx is zero. 801 00:45:22,760 --> 00:45:25,330 So this is going to jump to that location 802 00:45:25,330 --> 00:45:31,340 if rcx holds the value 0. 803 00:45:31,340 --> 00:45:33,770 In all the other cases, it won't set the Zero flag 804 00:45:33,770 --> 00:45:36,380 because the result of the AND will be non-zero. 805 00:45:36,380 --> 00:45:38,957 So once again, that's kind of an idiom that they use. 806 00:45:38,957 --> 00:45:40,040 What about the second one? 807 00:45:42,940 --> 00:45:45,167 So this is a conditional move. 808 00:45:45,167 --> 00:45:46,750 So both of them are basically checking 809 00:45:46,750 --> 00:45:49,300 to see if the register is 0. 810 00:45:49,300 --> 00:45:53,380 And then doing something if it is or isn't. 811 00:45:53,380 --> 00:45:55,900 But those are just idioms that you sort of 812 00:45:55,900 --> 00:45:59,920 have to look at to see how it is that they accomplish 813 00:45:59,920 --> 00:46:03,070 their particular thing. 814 00:46:03,070 --> 00:46:03,970 Here's another one. 815 00:46:03,970 --> 00:46:09,310 So the ISA can include several no-op, no operation 816 00:46:09,310 --> 00:46:13,180 instructions, including nop, nop A-- that's 817 00:46:13,180 --> 00:46:17,140 an operation with an argument-- and data16, which sets aside 818 00:46:17,140 --> 00:46:20,020 2 bytes of a nop. 819 00:46:20,020 --> 00:46:22,480 So here's a line of assembly that we 820 00:46:22,480 --> 00:46:25,090 found in some of our code-- 821 00:46:25,090 --> 00:46:30,130 data16 data16 data16 nopw and then %cs and so on. 822 00:46:34,320 --> 00:46:38,790 So nopw is going to take this argument, which has got all 823 00:46:38,790 --> 00:46:41,010 this address calculation in it. 824 00:46:41,010 --> 00:46:43,990 So what do you think this is doing? 825 00:46:43,990 --> 00:46:47,110 What's the effect of this, by the way? 826 00:46:47,110 --> 00:46:48,700 They're all no-ops. 827 00:46:48,700 --> 00:46:51,320 So the effect is? 828 00:46:51,320 --> 00:46:53,026 Nothing. 829 00:46:53,026 --> 00:46:55,810 The effect is nothing. 830 00:46:55,810 --> 00:46:57,670 OK, now it does set the RFLAGS. 831 00:46:57,670 --> 00:47:03,080 But basically, mostly, it does nothing. 832 00:47:03,080 --> 00:47:06,980 Why would a compiler generate assembly with these idioms? 833 00:47:06,980 --> 00:47:08,700 Why would you get that kind of-- 834 00:47:08,700 --> 00:47:11,290 that's crazy, right? 835 00:47:11,290 --> 00:47:12,076 Yeah. 836 00:47:12,076 --> 00:47:14,667 AUDIENCE: Could it be doing some cache optimization? 
837 00:47:14,667 --> 00:47:16,250 CHARLES LEISERSON: Yeah, it's actually 838 00:47:16,250 --> 00:47:22,280 doing alignment optimization typically or code size. 839 00:47:22,280 --> 00:47:26,030 So it may want to start the next instruction on the beginning 840 00:47:26,030 --> 00:47:27,860 of a cache line. 841 00:47:27,860 --> 00:47:30,830 And, in fact, there's a directive to do that. 842 00:47:30,830 --> 00:47:32,510 If you want all your functions to start 843 00:47:32,510 --> 00:47:34,040 at the beginning of a cache line, then 844 00:47:34,040 --> 00:47:40,490 it wants to make sure that if code gets to that point, 845 00:47:40,490 --> 00:47:43,730 you'll just proceed to jump through memory, 846 00:47:43,730 --> 00:47:46,370 continue through memory. 847 00:47:46,370 --> 00:47:47,800 So mainly it's to optimize memory. 848 00:47:47,800 --> 00:47:48,950 So you'll see those things. 849 00:47:48,950 --> 00:47:50,850 I mean, you just have to realize, oh, 850 00:47:50,850 --> 00:47:54,710 that's the compiler generating some no-ops. 851 00:47:54,710 --> 00:47:58,880 So that's sort of our brief excursion 852 00:47:58,880 --> 00:48:03,770 over assembly language, x86 assembly language. 853 00:48:03,770 --> 00:48:07,040 Now, I want to dive into floating-point and vector 854 00:48:07,040 --> 00:48:09,020 hardware, which is going to be the main part. 855 00:48:09,020 --> 00:48:12,830 And then if there's any time at the end, I'll show the slides-- 856 00:48:12,830 --> 00:48:16,400 I have a bunch of other slides on how branch prediction works 857 00:48:16,400 --> 00:48:19,670 and a variety of other machine sorts of things, 858 00:48:19,670 --> 00:48:21,770 that if we don't get to, it's no problem. 859 00:48:21,770 --> 00:48:23,270 You can take a look at the slides, 860 00:48:23,270 --> 00:48:27,800 and there's also the architecture manual. 861 00:48:27,800 --> 00:48:29,650 So floating-point instruction sets, 862 00:48:29,650 --> 00:48:37,610 so mostly the scalar floating-point operations 863 00:48:37,610 --> 00:48:42,170 are accessed via a couple of different instruction sets. 864 00:48:42,170 --> 00:48:44,180 So the history of floating point is interesting, 865 00:48:44,180 --> 00:48:50,090 because originally the 8086 did not have a floating-point unit. 866 00:48:50,090 --> 00:48:51,920 Floating-point was done in software. 867 00:48:51,920 --> 00:48:53,930 And then they made a companion chip 868 00:48:53,930 --> 00:48:55,580 that would do floating-point. 869 00:48:55,580 --> 00:48:57,140 And then they started integrating 870 00:48:57,140 --> 00:49:02,180 and so forth as miniaturization took hold. 871 00:49:02,180 --> 00:49:05,150 So the SSE and AVX instructions do 872 00:49:05,150 --> 00:49:08,540 both single and double precision scalar floating-point, i.e. 873 00:49:08,540 --> 00:49:09,960 floats or doubles. 874 00:49:09,960 --> 00:49:14,960 And then the x86 instructions, the x87 instructions-- 875 00:49:14,960 --> 00:49:19,057 that's the 8087 that was attached to the 8086 876 00:49:19,057 --> 00:49:20,390 and that's where they get them-- 877 00:49:20,390 --> 00:49:22,640 support single, double, and extended precision 878 00:49:22,640 --> 00:49:24,800 scalar floating-point arithmetic, 879 00:49:24,800 --> 00:49:27,320 including float, double, and long double. 880 00:49:27,320 --> 00:49:30,650 So you can actually get a great big result of a multiply 881 00:49:30,650 --> 00:49:34,630 if you use the x87 instruction sets. 
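As a quick, hedged check of those three scalar widths, here is a tiny C program. The sizes in the comment are what a typical x86-64 Linux toolchain reports, where long double is usually the 80-bit x87 extended format padded out to 16 bytes; other platforms and ABIs can differ.

    #include <stdio.h>

    int main(void) {
        /* On a typical x86-64 Linux compiler these usually print 4, 8,
           and 16, with long double being the x87 80-bit extended format
           stored in 16 bytes; other ABIs may differ. */
        printf("float:       %zu bytes\n", sizeof(float));
        printf("double:      %zu bytes\n", sizeof(double));
        printf("long double: %zu bytes\n", sizeof(long double));
        return 0;
    }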
882 00:49:34,630 --> 00:49:36,380 And they also include vector instructions, 883 00:49:36,380 --> 00:49:39,043 so you can multiply or add there as well-- 884 00:49:39,043 --> 00:49:41,210 so all these places on the chip where you can decide 885 00:49:41,210 --> 00:49:43,670 to do one thing or another. 886 00:49:43,670 --> 00:49:46,190 Compilers generally like the SSE instructions 887 00:49:46,190 --> 00:49:49,100 over the x87 instructions because they're simpler 888 00:49:49,100 --> 00:49:51,440 to compile for and to optimize. 889 00:49:51,440 --> 00:49:58,130 And the SSE opcodes are similar to the normal x86 opcodes. 890 00:49:58,130 --> 00:50:01,160 And they use the XMM registers and floating-point types. 891 00:50:01,160 --> 00:50:03,530 And so you'll see stuff like this, where you've 892 00:50:03,530 --> 00:50:07,610 got a movsd and so forth. 893 00:50:07,610 --> 00:50:10,670 The suffix there is saying what the data type is. 894 00:50:10,670 --> 00:50:13,850 In this case, it's saying it's a double precision floating-point 895 00:50:13,850 --> 00:50:15,470 value, i.e. a double. 896 00:50:19,340 --> 00:50:20,900 Once again, they're using a suffix. 897 00:50:20,900 --> 00:50:25,070 The sd in this case is a double precision floating-point. 898 00:50:25,070 --> 00:50:29,060 The other option is the first letter 899 00:50:29,060 --> 00:50:33,080 says whether it's single, i.e. a scalar operation, or packed, 900 00:50:33,080 --> 00:50:36,650 i.e. a vector operation. 901 00:50:36,650 --> 00:50:38,870 And the second letter says whether it's 902 00:50:38,870 --> 00:50:41,240 single or double precision. 903 00:50:41,240 --> 00:50:45,140 And so when you see one of these operations, you can decode, 904 00:50:45,140 --> 00:50:50,060 oh, this is operating on a 64-bit value or a 32-bit value, 905 00:50:50,060 --> 00:50:54,920 floating-point value, or on a vector of those values. 906 00:50:54,920 --> 00:50:56,840 Now, what about these vectors? 907 00:50:56,840 --> 00:51:00,128 So when you start using the packed representation 908 00:51:00,128 --> 00:51:01,670 and you start using vectors, you have 909 00:51:01,670 --> 00:51:03,830 to understand a little bit about the vector units that 910 00:51:03,830 --> 00:51:04,747 are on these machines. 911 00:51:07,430 --> 00:51:09,950 So the way a vector unit works is 912 00:51:09,950 --> 00:51:13,910 that there is the processor issuing instructions. 913 00:51:13,910 --> 00:51:19,190 And it issues the instructions to all of the vector units. 914 00:51:19,190 --> 00:51:23,810 So for example, if you take a look at a typical thing, 915 00:51:23,810 --> 00:51:27,410 you may have a vector width of four vector units. 916 00:51:27,410 --> 00:51:30,410 Each of them is often called a lane-- 917 00:51:30,410 --> 00:51:31,910 l-a-n-e. 918 00:51:31,910 --> 00:51:33,570 And k is the vector width. 919 00:51:33,570 --> 00:51:35,420 And so when the instruction is given, 920 00:51:35,420 --> 00:51:37,820 it's given to all of the vector units. 921 00:51:37,820 --> 00:51:41,060 And they all do it on their own local copy of the register. 922 00:51:41,060 --> 00:51:43,880 So the register you can think of as a very wide thing broken 923 00:51:43,880 --> 00:51:46,100 into several words. 924 00:51:46,100 --> 00:51:48,560 And when I say add two vectors together, 925 00:51:48,560 --> 00:51:53,067 it'll add four words together and store it back 926 00:51:53,067 --> 00:51:54,275 into another vector register. 
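Here is a minimal sketch of that four-lane picture using the compiler's AVX intrinsics; it assumes an AVX-capable machine and a flag such as -mavx (for example, clang -O2 -mavx), and the intrinsics just expose the packed add described above.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* Two 4-lane vectors of doubles, living in 256-bit YMM registers. */
        __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
        __m256d b = _mm256_set_pd(40.0, 30.0, 20.0, 10.0);

        /* One vector add: all four lanes are added in lock step. */
        __m256d c = _mm256_add_pd(a, b);

        double out[4];
        _mm256_storeu_pd(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }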
927 00:51:57,290 --> 00:51:59,570 And so whatever k is-- 928 00:51:59,570 --> 00:52:03,320 in the example I just said, k was 4. 929 00:52:03,320 --> 00:52:07,520 And the lanes are the things, each of which 930 00:52:07,520 --> 00:52:11,360 contains the integer or floating-point arithmetic. 931 00:52:11,360 --> 00:52:15,930 But the important thing is that they all operate in lock step. 932 00:52:15,930 --> 00:52:17,750 It's not like one is going to do one thing 933 00:52:17,750 --> 00:52:19,458 and another is going to do another thing. 934 00:52:19,458 --> 00:52:21,370 They all have to do exactly the same thing. 935 00:52:21,370 --> 00:52:25,700 And the basic idea here is for the price of one instruction, 936 00:52:25,700 --> 00:52:30,260 I can command a bunch of operations to be done. 937 00:52:30,260 --> 00:52:32,180 Now, generally, vector instructions 938 00:52:32,180 --> 00:52:34,400 operate in an element-wise fashion, 939 00:52:34,400 --> 00:52:37,070 where you take the i-th element of one vector 940 00:52:37,070 --> 00:52:40,640 and operate on it with the i-th element of another vector. 941 00:52:40,640 --> 00:52:45,620 And all the lanes perform exactly the same operation. 942 00:52:45,620 --> 00:52:49,520 Depending upon the architecture, some architectures, 943 00:52:49,520 --> 00:52:51,980 the operands need to be aligned. 944 00:52:51,980 --> 00:52:55,730 That is, you've got to have the beginnings at exactly the 945 00:52:55,730 --> 00:52:59,510 same place in memory, aligned on a multiple of the vector length. 946 00:52:59,510 --> 00:53:01,220 There are others where the vectors 947 00:53:01,220 --> 00:53:04,040 can be shifted in memory. 948 00:53:04,040 --> 00:53:07,855 Usually, there's a performance difference between the two. 949 00:53:07,855 --> 00:53:09,230 If it does support-- some of them 950 00:53:09,230 --> 00:53:12,560 will not support unaligned vector operations. 951 00:53:12,560 --> 00:53:15,710 So if it can't figure out that they're aligned, I'm sorry, 952 00:53:15,710 --> 00:53:19,150 your code will end up being executed scalar, 953 00:53:19,150 --> 00:53:20,840 in a scalar fashion. 954 00:53:20,840 --> 00:53:27,360 If they are aligned, it's got to be able to figure that out. 955 00:53:27,360 --> 00:53:29,150 And in that case-- 956 00:53:29,150 --> 00:53:31,370 sorry, if it's not aligned, but you 957 00:53:31,370 --> 00:53:34,070 do support unaligned vector operations, 958 00:53:34,070 --> 00:53:38,670 it's usually slower than if they are aligned. 959 00:53:38,670 --> 00:53:40,560 And for some machines now, they actually 960 00:53:40,560 --> 00:53:43,680 have good performance on both. 961 00:53:43,680 --> 00:53:46,740 So it really depends upon the machine. 962 00:53:46,740 --> 00:53:48,960 And then also there are some architectures 963 00:53:48,960 --> 00:53:52,260 that will support cross-lane operations, such as inserting 964 00:53:52,260 --> 00:53:54,570 or extracting subsets of vector elements, 965 00:53:54,570 --> 00:53:59,130 permuting, shuffling, scatter, gather types of operations. 966 00:54:02,450 --> 00:54:06,235 So x86 supports several instruction sets, 967 00:54:06,235 --> 00:54:06,860 as I mentioned. 968 00:54:06,860 --> 00:54:07,610 There's SSE. 969 00:54:07,610 --> 00:54:09,170 There's AVX. 970 00:54:09,170 --> 00:54:10,400 There's AVX2. 
971 00:54:10,400 --> 00:54:12,710 And then there's now the AVX-512, 972 00:54:12,710 --> 00:54:15,803 or sometimes called AVX3, which is not 973 00:54:15,803 --> 00:54:17,720 available on the machines that we'll be using, 974 00:54:17,720 --> 00:54:21,230 the Haswell machines that we'll be doing. 975 00:54:21,230 --> 00:54:26,330 Generally, the AVX and AVX2 extend the SSE instruction 976 00:54:26,330 --> 00:54:31,820 set by using the wider registers and operating on more operands. 977 00:54:31,820 --> 00:54:34,380 The SSE instructions use the 128-bit registers and operate 978 00:54:34,380 --> 00:54:35,780 on at most two operands. 979 00:54:35,780 --> 00:54:42,290 The AVX ones can use the 256-bit registers and also have three operands, not 980 00:54:42,290 --> 00:54:43,870 just two operands. 981 00:54:43,870 --> 00:54:47,690 So say you can say add A to B and store it in C, 982 00:54:47,690 --> 00:54:51,800 as opposed to saying add A to B and store it in B. 983 00:54:51,800 --> 00:54:53,300 So it can also support three. 984 00:54:56,610 --> 00:55:01,650 Yeah, most of them are similar to traditional opcodes 985 00:55:01,650 --> 00:55:02,820 with minor differences. 986 00:55:02,820 --> 00:55:07,850 So if you look at them, if you have an SSE, 987 00:55:07,850 --> 00:55:11,730 it basically looks just like the traditional name, 988 00:55:11,730 --> 00:55:14,700 like add in this case, but you can then say, 989 00:55:14,700 --> 00:55:20,600 do a packed add or a vector add with packed data. 990 00:55:20,600 --> 00:55:23,450 So the v prefix means it's AVX. 991 00:55:23,450 --> 00:55:25,343 So if you see the v, you go to the part 992 00:55:25,343 --> 00:55:26,510 in the manual that says AVX. 993 00:55:29,390 --> 00:55:32,420 If you see the p's, that says it's packed data. 994 00:55:32,420 --> 00:55:38,760 Then you go to SSE if it doesn't have the v. 995 00:55:38,760 --> 00:55:42,830 And the p prefix distinguishing integer vector instructions, 996 00:55:42,830 --> 00:55:43,560 you got me. 997 00:55:43,560 --> 00:55:48,572 I tried to think, why is p distinguishing an integer? 998 00:55:48,572 --> 00:55:53,100 It's like p, good mnemonic for integer, right? 999 00:55:57,070 --> 00:56:00,670 Then in addition, they do this aliasing trick again, 1000 00:56:00,670 --> 00:56:06,560 where the YMM registers actually alias the XMM registers. 1001 00:56:06,560 --> 00:56:08,610 So you can use both operations, but you've 1002 00:56:08,610 --> 00:56:11,737 got to be careful what's going on, 1003 00:56:11,737 --> 00:56:13,070 because they just extended them. 1004 00:56:13,070 --> 00:56:16,820 And now, of course, with AVX-512, 1005 00:56:16,820 --> 00:56:19,550 they did another extension to 512 bits. 1006 00:56:23,060 --> 00:56:24,700 That's the vector stuff. 1007 00:56:24,700 --> 00:56:27,590 So you can use those explicitly. 1008 00:56:27,590 --> 00:56:29,330 The compiler will vectorize for you. 1009 00:56:29,330 --> 00:56:33,710 And the homework this week takes you through some vectorization 1010 00:56:33,710 --> 00:56:34,350 exercises. 1011 00:56:34,350 --> 00:56:35,475 It's actually a lot of fun. 1012 00:56:35,475 --> 00:56:37,410 We were just going over it in a staff meeting. 1013 00:56:37,410 --> 00:56:38,840 And it's really fun. 1014 00:56:38,840 --> 00:56:40,430 I think it's a really fun exercise. 1015 00:56:40,430 --> 00:56:42,740 We introduced that last year, by the way, 1016 00:56:42,740 --> 00:56:44,280 or maybe two years ago. 
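If you want to see the compiler's auto-vectorization in action before the homework, a sketch like the one below is the usual starting point. The function name, the flags, and the instructions named in the comment are illustrative; exactly what gets vectorized depends on the compiler and the target.

    /* Compile with something like: clang -O3 -march=haswell -S daxpy.c
       and look in the .s output for packed instructions such as vmulpd,
       vaddpd, or vfmadd -- which ones appear is up to the compiler. */
    void daxpy(double * restrict y, const double * restrict x, double a, long n) {
        for (long i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];   /* simple loop the vectorizer can handle */
        }
    }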
1017 00:56:44,280 --> 00:56:46,550 But, in any case, it's a fun one-- 1018 00:56:50,550 --> 00:56:54,120 for my definition of fun, which I hope 1019 00:56:54,120 --> 00:56:57,660 is your definition of fun. 1020 00:56:57,660 --> 00:57:00,540 Now, I want to talk generally about computer architecture. 1021 00:57:00,540 --> 00:57:05,430 And I'm not going to get through all of these slides, as I say. 1022 00:57:05,430 --> 00:57:07,950 But I want to get started on them and give you 1023 00:57:07,950 --> 00:57:10,850 a sense of other things going on in the processor 1024 00:57:10,850 --> 00:57:13,060 that you should be aware of. 1025 00:57:13,060 --> 00:57:18,690 So in 6.004, you probably talked about a 5-stage processor. 1026 00:57:18,690 --> 00:57:20,840 Anybody remember that? 1027 00:57:20,840 --> 00:57:22,740 OK, 5-stage processor. 1028 00:57:22,740 --> 00:57:24,480 There's an Instruction Fetch. 1029 00:57:24,480 --> 00:57:25,920 There's an Instruction Decode. 1030 00:57:25,920 --> 00:57:27,660 There's an Execute. 1031 00:57:27,660 --> 00:57:31,440 Then there's a Memory Addressing. 1032 00:57:31,440 --> 00:57:33,960 And then you Write back the values. 1033 00:57:33,960 --> 00:57:36,780 And this is done as a pipeline, so as 1034 00:57:36,780 --> 00:57:39,540 to make-- you could do all of this in one thing, 1035 00:57:39,540 --> 00:57:41,157 but then you have a long clock cycle. 1036 00:57:41,157 --> 00:57:43,240 And you'll only be able to do one thing at a time. 1037 00:57:43,240 --> 00:57:45,930 Instead, they stack them together. 1038 00:57:45,930 --> 00:57:51,610 So here's a block diagram of the 5-stage processor. 1039 00:57:51,610 --> 00:57:53,200 We read the instruction from memory 1040 00:57:53,200 --> 00:57:55,510 in the instruction fetch cycle. 1041 00:57:55,510 --> 00:57:57,550 Then we decode it. 1042 00:57:57,550 --> 00:57:59,020 Basically, it takes a look at, what 1043 00:57:59,020 --> 00:58:02,200 is the opcode, what are the addressing modes, et cetera, 1044 00:58:02,200 --> 00:58:05,040 and figures out what it actually has to do 1045 00:58:05,040 --> 00:58:07,750 and actually performs the ALU operations. 1046 00:58:07,750 --> 00:58:10,060 And then it reads and writes the data memory. 1047 00:58:10,060 --> 00:58:12,430 And then it writes back the results into registers. 1048 00:58:12,430 --> 00:58:15,730 That's typically a common way that these things 1049 00:58:15,730 --> 00:58:19,420 go for a 5-stage processor. 1050 00:58:19,420 --> 00:58:22,480 By the way, this is vastly oversimplified. 1051 00:58:22,480 --> 00:58:26,380 You can take 6.823 if you want to learn the truth. 1052 00:58:26,380 --> 00:58:30,970 I'm going to tell you nothing but white lies 1053 00:58:30,970 --> 00:58:32,440 for this lecture. 1054 00:58:32,440 --> 00:58:38,140 Now, if you look at the Intel Haswell, the machine 1055 00:58:38,140 --> 00:58:43,210 that we're using, it actually has between 14 and 19 pipeline 1056 00:58:43,210 --> 00:58:45,290 stages. 1057 00:58:45,290 --> 00:58:49,150 The 14 to 19 reflects the fact that there 1058 00:58:49,150 --> 00:58:50,680 are different paths through it that 1059 00:58:50,680 --> 00:58:53,020 take different amounts of time. 1060 00:58:53,020 --> 00:58:54,820 It also I think reflects a little bit 1061 00:58:54,820 --> 00:58:58,150 that nobody has published the Intel internal stuff. 1062 00:58:58,150 --> 00:59:02,500 So maybe we're not sure if it's 14 to 19, but somewhere 1063 00:59:02,500 --> 00:59:03,448 in that range. 
1064 00:59:03,448 --> 00:59:05,740 But I think it's actually because the different lengths 1065 00:59:05,740 --> 00:59:08,090 of time as I was explaining. 1066 00:59:08,090 --> 00:59:10,750 So what I want to do is-- 1067 00:59:10,750 --> 00:59:12,400 you've seen the 5-stage pipeline. 1068 00:59:12,400 --> 00:59:14,920 I want to talk about the difference between that 1069 00:59:14,920 --> 00:59:17,530 and a modern processor by looking at several design 1070 00:59:17,530 --> 00:59:18,220 features. 1071 00:59:18,220 --> 00:59:20,350 We already talked about vector hardware. 1072 00:59:20,350 --> 00:59:22,420 I then want to talk about super scalar 1073 00:59:22,420 --> 00:59:24,280 processing, out of order execution, 1074 00:59:24,280 --> 00:59:28,000 and branch prediction a little bit. 1075 00:59:28,000 --> 00:59:30,400 And the out of order, I'm going to skip a bunch of that 1076 00:59:30,400 --> 00:59:32,620 because it has to do with scoreboarding, which 1077 00:59:32,620 --> 00:59:37,210 is really interesting and fun, but it's also time consuming. 1078 00:59:37,210 --> 00:59:38,710 But it's really interesting and fun. 1079 00:59:38,710 --> 00:59:42,220 That's what you learn in 6.823. 1080 00:59:42,220 --> 00:59:45,610 So historically, there are two ways 1081 00:59:45,610 --> 00:59:47,830 that people make processors go faster-- 1082 00:59:47,830 --> 00:59:52,890 by exploiting parallelism and by exploiting locality. 1083 00:59:52,890 --> 00:59:56,140 And parallelism, there's instruction-- well, 1084 00:59:56,140 --> 00:59:58,330 we already did word-level parallelism 1085 00:59:58,330 --> 01:00:00,740 in the bit tricks thing. 1086 01:00:00,740 --> 01:00:03,350 But there's also instruction-level parallelism, 1087 01:00:03,350 --> 01:00:06,730 so-called ILP, vectorization, and multicore. 1088 01:00:06,730 --> 01:00:11,463 And for locality, the main thing that's used there is caching. 1089 01:00:11,463 --> 01:00:12,880 I would say also the fact that you 1090 01:00:12,880 --> 01:00:16,582 have a design with registers that also reflects locality, 1091 01:00:16,582 --> 01:00:18,790 because the way that the processor wants to do things 1092 01:00:18,790 --> 01:00:20,125 is fetch stuff from memory. 1093 01:00:20,125 --> 01:00:21,970 It doesn't want to operate on it in memory. 1094 01:00:21,970 --> 01:00:22,990 That's very expensive. 1095 01:00:22,990 --> 01:00:25,613 It wants to fetch things into registers, get enough of them 1096 01:00:25,613 --> 01:00:27,280 there that you can do some calculations, 1097 01:00:27,280 --> 01:00:28,810 do a whole bunch of calculations, 1098 01:00:28,810 --> 01:00:32,110 and then put them back out there. 1099 01:00:32,110 --> 01:00:34,780 So this lecture we're talking about ILP and vectorization. 1100 01:00:34,780 --> 01:00:39,530 So let me talk about instruction-level parallelism. 1101 01:00:39,530 --> 01:00:46,870 So when you have, let's say, a 5-stage pipeline, 1102 01:00:46,870 --> 01:00:48,700 you're interested in finding opportunities 1103 01:00:48,700 --> 01:00:52,630 to execute multiple instructions simultaneously. 1104 01:00:52,630 --> 01:00:57,490 So in instruction 1, it's going to do an instruction fetch. 1105 01:00:57,490 --> 01:00:58,570 Then it does its decode. 1106 01:00:58,570 --> 01:01:04,930 And so it takes five cycles for this instruction to complete. 
1107 01:01:04,930 --> 01:01:07,420 So ideally what you'd like is that you 1108 01:01:07,420 --> 01:01:12,610 can start instruction 2 on cycle 2, instruction 3 on cycle 3, 1109 01:01:12,610 --> 01:01:15,640 and so forth, and have 5 instructions-- once you 1110 01:01:15,640 --> 01:01:19,030 get into the steady state, have 5 instructions executing 1111 01:01:19,030 --> 01:01:20,590 all the time. 1112 01:01:20,590 --> 01:01:25,120 That would be ideal, where each one takes just one thing. 1113 01:01:25,120 --> 01:01:27,167 So that's really pretty good. 1114 01:01:27,167 --> 01:01:28,750 And that would improve the throughput. 1115 01:01:28,750 --> 01:01:30,292 Even though it might take a long time 1116 01:01:30,292 --> 01:01:34,720 to get one instruction done, I can have many instructions 1117 01:01:34,720 --> 01:01:36,280 in the pipeline at some time. 1118 01:01:39,640 --> 01:01:42,670 So each pipeline stage is executing a different instruction. 1119 01:01:42,670 --> 01:01:45,010 However, in practice this isn't what happens. 1120 01:01:45,010 --> 01:01:49,420 In practice, you discover that there are 1121 01:01:49,420 --> 01:01:51,190 what's called pipeline stalls. 1122 01:01:51,190 --> 01:01:53,950 When it comes time to execute an instruction, 1123 01:01:53,950 --> 01:01:58,330 for some correctness reason, it cannot execute the instruction. 1124 01:01:58,330 --> 01:01:59,530 It has to wait. 1125 01:01:59,530 --> 01:02:01,390 And that's a pipeline stall. 1126 01:02:01,390 --> 01:02:03,040 That's what you want to try to avoid 1127 01:02:03,040 --> 01:02:08,140 and the compiler tries to produce code that will avoid stalls. 1128 01:02:08,140 --> 01:02:11,290 So why do stalls happen? 1129 01:02:11,290 --> 01:02:13,870 They happen because of what are called hazards. 1130 01:02:13,870 --> 01:02:15,520 There's actually two notions of hazard. 1131 01:02:15,520 --> 01:02:16,730 And this is one of them. 1132 01:02:16,730 --> 01:02:18,920 The other is a race condition hazard. 1133 01:02:18,920 --> 01:02:20,590 This is a dependency hazard. 1134 01:02:20,590 --> 01:02:22,150 But people call them both hazards, 1135 01:02:22,150 --> 01:02:29,390 just like they call the second stage of compilation compiling. 1136 01:02:29,390 --> 01:02:32,260 It's like they make up these words. 1137 01:02:32,260 --> 01:02:35,140 So here's three types of hazards that can prevent 1138 01:02:35,140 --> 01:02:37,180 an instruction from executing. 1139 01:02:37,180 --> 01:02:40,660 First of all, there's what's called a structural hazard. 1140 01:02:40,660 --> 01:02:43,400 Two instructions attempt to use the same functional unit 1141 01:02:43,400 --> 01:02:45,050 at the same time. 1142 01:02:45,050 --> 01:02:52,540 If there's, for example, only one floating-point multiplier 1143 01:02:52,540 --> 01:02:56,380 and two of them try to use it at the same time, one has to wait. 1144 01:02:56,380 --> 01:02:58,910 In modern processors, there's a bunch of each of those. 1145 01:02:58,910 --> 01:03:04,510 But if you have k functional units and k plus 1 instructions 1146 01:03:04,510 --> 01:03:07,690 want to access it, you're out of luck. 1147 01:03:07,690 --> 01:03:09,370 One of them is going to have to wait. 1148 01:03:09,370 --> 01:03:11,872 The second is a data hazard. 1149 01:03:11,872 --> 01:03:13,330 This is when an instruction depends 1150 01:03:13,330 --> 01:03:17,320 on the result of a prior instruction in the pipeline. 
1151 01:03:17,320 --> 01:03:21,610 So one instruction is computing a value that 1152 01:03:21,610 --> 01:03:27,060 is going to stick in rcx, say. 1153 01:03:27,060 --> 01:03:28,360 So they stick it into rcx. 1154 01:03:28,360 --> 01:03:30,550 The other one has to read the value from rcx 1155 01:03:30,550 --> 01:03:33,340 and it comes later. 1156 01:03:33,340 --> 01:03:34,870 That other instruction has to wait 1157 01:03:34,870 --> 01:03:37,480 until that value is written there before it can read it. 1158 01:03:37,480 --> 01:03:39,430 That's a data hazard. 1159 01:03:39,430 --> 01:03:44,950 And a control hazard is where you 1160 01:03:44,950 --> 01:03:47,770 decide that you need to make a jump 1161 01:03:47,770 --> 01:03:49,930 and you can't execute the next instruction, 1162 01:03:49,930 --> 01:03:52,923 because you don't know which way the jump is going to go. 1163 01:03:52,923 --> 01:03:54,340 So if you have a conditional jump, 1164 01:03:54,340 --> 01:03:57,250 it's like, well, what's the next instruction after that jump? 1165 01:03:57,250 --> 01:03:58,230 I don't know. 1166 01:03:58,230 --> 01:03:59,890 So I have to wait to execute that. 1167 01:03:59,890 --> 01:04:02,080 I can't go ahead and do the jump and then do 1168 01:04:02,080 --> 01:04:04,420 the next instruction after it, because I don't know what 1169 01:04:04,420 --> 01:04:05,628 happened to the previous one. 1170 01:04:09,030 --> 01:04:13,970 Now of these, we're going to mostly talk about data hazards. 1171 01:04:13,970 --> 01:04:16,490 So an instruction can create a data hazard-- 1172 01:04:16,490 --> 01:04:20,060 I can create a data hazard due to a dependence between i 1173 01:04:20,060 --> 01:04:21,320 and j. 1174 01:04:21,320 --> 01:04:24,380 So the first type is called a true dependence, 1175 01:04:24,380 --> 01:04:28,820 or a read-after-write dependence. 1176 01:04:28,820 --> 01:04:31,040 And this is where, as in this example, 1177 01:04:31,040 --> 01:04:33,590 I'm adding something and storing into rax 1178 01:04:33,590 --> 01:04:35,660 and the next instruction wants to read from rax. 1179 01:04:38,500 --> 01:04:40,700 So the second instruction can't get 1180 01:04:40,700 --> 01:04:43,820 going until the previous one finishes, or it may 1181 01:04:43,820 --> 01:04:48,153 stall until the result of the previous one is known. 1182 01:04:48,153 --> 01:04:50,070 There's another one called an anti-dependence. 1183 01:04:50,070 --> 01:04:52,890 This is where I want to write into a location, 1184 01:04:52,890 --> 01:04:56,250 but I have to wait until the previous instruction has read 1185 01:04:56,250 --> 01:04:59,780 the value, because otherwise I'm going 1186 01:04:59,780 --> 01:05:02,700 to clobber that instruction and clobber 1187 01:05:02,700 --> 01:05:05,580 the value before it gets read. 1188 01:05:05,580 --> 01:05:08,670 So that's an anti-dependence. 1189 01:05:08,670 --> 01:05:12,180 And then the final one is an output dependence, 1190 01:05:12,180 --> 01:05:18,050 where they're both trying to move something to rax. 1191 01:05:18,050 --> 01:05:22,610 So why would two things want to move things 1192 01:05:22,610 --> 01:05:24,410 to the same location? 1193 01:05:24,410 --> 01:05:27,320 After all, one of them is going to be lost and just not do 1194 01:05:27,320 --> 01:05:31,000 that instruction. 1195 01:05:31,000 --> 01:05:31,618 Why wouldn't-- 1196 01:05:31,618 --> 01:05:32,660 AUDIENCE: Set some flags. 
1197 01:05:32,660 --> 01:05:34,368 CHARLES LEISERSON: Yeah, maybe because it 1198 01:05:34,368 --> 01:05:37,030 wants to set some flags. 1199 01:05:37,030 --> 01:05:41,250 So that's one reason that it might do this, 1200 01:05:41,250 --> 01:05:43,000 because you know the first instruction set 1201 01:05:43,000 --> 01:05:47,800 some flags in addition to moving the output to that location. 1202 01:05:47,800 --> 01:05:49,380 And there's one other reason. 1203 01:05:49,380 --> 01:05:50,380 What's the other reason? 1204 01:05:54,290 --> 01:05:55,040 I'm blanking. 1205 01:05:55,040 --> 01:05:56,790 There's two reasons. 1206 01:05:56,790 --> 01:05:58,310 And I didn't put them in my notes. 1207 01:06:03,590 --> 01:06:05,210 I don't remember. 1208 01:06:05,210 --> 01:06:08,710 OK, but anyway, that's a good question for quiz then. 1209 01:06:11,380 --> 01:06:13,704 OK, give me two reasons-- yeah. 1210 01:06:13,704 --> 01:06:17,008 AUDIENCE: Can there be intermediate instructions 1211 01:06:17,008 --> 01:06:20,025 like between those [INAUDIBLE] 1212 01:06:20,025 --> 01:06:21,900 CHARLES LEISERSON: There could, but of course 1213 01:06:21,900 --> 01:06:26,880 then if it's going to use that register, then-- 1214 01:06:26,880 --> 01:06:29,490 oh, I know the other reason. 1215 01:06:29,490 --> 01:06:31,355 So this is still good for a quiz. 1216 01:06:31,355 --> 01:06:33,480 The other reason is there may be aliasing going on. 1217 01:06:33,480 --> 01:06:37,680 Maybe an intervening instruction uses one 1218 01:06:37,680 --> 01:06:40,260 of the values through an alias. 1219 01:06:40,260 --> 01:06:43,530 So it uses part of the result or whatever, there still 1220 01:06:43,530 --> 01:06:47,110 could be a dependency. 1221 01:06:47,110 --> 01:06:52,890 Anyway, some arithmetic operations 1222 01:06:52,890 --> 01:06:54,450 are complex to implement in hardware 1223 01:06:54,450 --> 01:06:56,790 and have long latencies. 1224 01:06:56,790 --> 01:07:03,270 So here's some sample opcodes and how much latency they take. 1225 01:07:03,270 --> 01:07:05,290 They take a different number. 1226 01:07:05,290 --> 01:07:08,600 So, for example, integer division actually is variable, 1227 01:07:08,600 --> 01:07:10,710 but a multiply takes about three times what 1228 01:07:10,710 --> 01:07:13,350 most of the integer operations are. 1229 01:07:13,350 --> 01:07:16,050 And floating-point multiply is like 5. 1230 01:07:16,050 --> 01:07:17,535 And then fma, what's fma? 1231 01:07:20,740 --> 01:07:22,390 Fused multiply add. 1232 01:07:22,390 --> 01:07:24,790 This is where you're doing both a multiply and an add. 1233 01:07:24,790 --> 01:07:26,940 And why do we care about fused multiply-adds? 1234 01:07:30,174 --> 01:07:32,091 AUDIENCE: For memory accessing and [INAUDIBLE] 1235 01:07:32,091 --> 01:07:33,924 CHARLES LEISERSON: Not for memory accessing. 1236 01:07:33,924 --> 01:07:36,210 This is actually floating-point multiply and add. 1237 01:07:39,830 --> 01:07:43,190 It's called linear algebra. 1238 01:07:43,190 --> 01:07:44,990 So when you do matrix multiplication, 1239 01:07:44,990 --> 01:07:46,070 you're doing dot products. 1240 01:07:46,070 --> 01:07:48,290 You're doing multiplies and adds. 1241 01:07:48,290 --> 01:07:52,950 So that kind of thing, that's where you do a lot of those. 1242 01:07:52,950 --> 01:07:54,710 So how does the hardware accommodate 1243 01:07:54,710 --> 01:07:57,300 these complex operations? 
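To see where fused multiply-adds earn their keep, here is a hedged sketch of a dot product written with the C99 fma() library call (link with -lm). With flags such as -O3 -mfma, a compiler may map each call onto a single vfmadd-style instruction, though that is an assumption about the compiler, not a guarantee.

    #include <math.h>

    double dot(const double *x, const double *y, long n) {
        double acc = 0.0;
        for (long i = 0; i < n; i++) {
            /* fma computes x[i] * y[i] + acc with a single rounding,
               exactly the multiply-and-add pattern that shows up all
               over linear algebra kernels. */
            acc = fma(x[i], y[i], acc);
        }
        return acc;
    }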
1244 01:07:57,300 --> 01:08:02,210 So the strategy that much hardware tends to use 1245 01:08:02,210 --> 01:08:05,180 is to have separate functional units for complex operations, 1246 01:08:05,180 --> 01:08:07,490 such as floating-point arithmetic. 1247 01:08:07,490 --> 01:08:11,000 So there may be in fact separate registers, 1248 01:08:11,000 --> 01:08:13,040 for example, the XMM registers, that only 1249 01:08:13,040 --> 01:08:14,610 work with the floating point. 1250 01:08:14,610 --> 01:08:16,430 So you have your basic 5-stage pipeline. 1251 01:08:16,430 --> 01:08:18,979 You have another pipeline that's off on the side. 1252 01:08:18,979 --> 01:08:21,229 And it's going to take multiple cycles sometimes 1253 01:08:21,229 --> 01:08:26,220 for things and maybe pipelined to a different depth. 1254 01:08:26,220 --> 01:08:33,029 And so you basically separate these operations. 1255 01:08:33,029 --> 01:08:34,950 The functional units may be pipelined, fully, 1256 01:08:34,950 --> 01:08:38,560 partially, or not at all. 1257 01:08:38,560 --> 01:08:44,623 And so I now have a whole bunch of different functional units, 1258 01:08:44,623 --> 01:08:46,123 and there's different paths that I'm 1259 01:08:46,123 --> 01:08:52,330 going to be able to take through the data path of the processor. 1260 01:08:52,330 --> 01:08:56,790 So in Haswell, they have integer, vector, and floating-point units 1261 01:08:56,790 --> 01:08:59,910 distributed among eight different ports, which 1262 01:08:59,910 --> 01:09:04,620 are sort of the entry points. 1263 01:09:04,620 --> 01:09:07,470 So given that, things get really complicated. 1264 01:09:07,470 --> 01:09:11,609 If we go back to our simple diagram, 1265 01:09:11,609 --> 01:09:14,790 suppose we have all these additional functional units, 1266 01:09:14,790 --> 01:09:21,970 how can I now exploit more instruction-level parallelism? 1267 01:09:21,970 --> 01:09:27,060 So right now, we can start up one operation at a time. 1268 01:09:27,060 --> 01:09:31,670 What might I do to get more parallelism out of the hardware 1269 01:09:31,670 --> 01:09:33,098 that I've got? 1270 01:09:39,080 --> 01:09:40,830 What do you think computer architects did? 1271 01:09:43,260 --> 01:09:43,760 OK. 1272 01:09:43,760 --> 01:09:49,790 AUDIENCE: It's a guess but, you could glue together [INAUDIBLE] 1273 01:09:49,790 --> 01:09:52,700 CHARLES LEISERSON: Yeah, so even simpler than that, but 1274 01:09:52,700 --> 01:09:54,350 which is implied in what you're saying, 1275 01:09:54,350 --> 01:09:59,360 is you can just fetch and issue multiple instructions 1276 01:09:59,360 --> 01:10:01,290 per cycle. 1277 01:10:01,290 --> 01:10:03,030 So rather than just doing one per cycle 1278 01:10:03,030 --> 01:10:05,610 as we showed with a typical pipeline processor, 1279 01:10:05,610 --> 01:10:07,860 let me fetch several that use different parts 1280 01:10:07,860 --> 01:10:10,200 of the processor pipeline, because they're not 1281 01:10:10,200 --> 01:10:14,970 going to interfere, to keep everything busy. 1282 01:10:14,970 --> 01:10:17,550 And so that's basically what's called a super scalar 1283 01:10:17,550 --> 01:10:20,430 processor, where it's not executing one thing at a time. 1284 01:10:20,430 --> 01:10:24,340 It's executing multiple things at a time. 1285 01:10:24,340 --> 01:10:27,360 So Haswell, in fact, breaks up the instructions 1286 01:10:27,360 --> 01:10:30,330 into simpler operations, called micro-ops. 
1287 01:10:30,330 --> 01:10:33,390 And they can emit four micro-ops per cycle 1288 01:10:33,390 --> 01:10:35,220 to the rest of the pipeline. 1289 01:10:35,220 --> 01:10:38,370 And the fetch and decode stages implement optimizations 1290 01:10:38,370 --> 01:10:41,850 on micro-op processing, including special cases 1291 01:10:41,850 --> 01:10:42,750 for common patterns. 1292 01:10:42,750 --> 01:10:47,400 For example, if it sees the XOR of rax and rax, 1293 01:10:47,400 --> 01:10:50,100 it knows that rax is being set to 0. 1294 01:10:50,100 --> 01:10:53,120 It doesn't even use a functional unit for that. 1295 01:10:53,120 --> 01:10:55,530 It just does it and it's done. 1296 01:10:55,530 --> 01:10:58,820 It has just a special logic that observes 1297 01:10:58,820 --> 01:11:02,020 that because it's such a common thing to set things to zero. 1298 01:11:02,020 --> 01:11:05,030 And so that means that now your processor can execute 1299 01:11:05,030 --> 01:11:06,430 a lot of things at one time. 1300 01:11:06,430 --> 01:11:08,180 And that's the machines that you're using. 1301 01:11:08,180 --> 01:11:12,450 That's why when I said if you save one add instruction, 1302 01:11:12,450 --> 01:11:14,270 it probably doesn't make any difference 1303 01:11:14,270 --> 01:11:16,220 in today's processor, because there's probably 1304 01:11:16,220 --> 01:11:18,050 an idle adder lying around. 1305 01:11:18,050 --> 01:11:22,560 There's probably a-- did I read caught how many-- 1306 01:11:22,560 --> 01:11:24,560 where do we go here? 1307 01:11:24,560 --> 01:11:27,620 Yeah, so if you look here, you can even 1308 01:11:27,620 --> 01:11:31,220 discover that there are actually a bunch of ALUs that 1309 01:11:31,220 --> 01:11:35,190 are capable of doing an add. 1310 01:11:35,190 --> 01:11:38,730 So they're all over the map in Haswell. 1311 01:11:41,250 --> 01:11:46,020 Now, still, we are insisting that the processors execute 1312 01:11:46,020 --> 01:11:47,820 in things in order. 1313 01:11:47,820 --> 01:11:50,820 And that's kind of the next stage is, how do you 1314 01:11:50,820 --> 01:11:55,065 end up making things run-- 1315 01:11:58,800 --> 01:12:04,380 that is, how do you make it so that you can free yourself 1316 01:12:04,380 --> 01:12:08,400 from the tyranny of one instruction after the other? 1317 01:12:08,400 --> 01:12:11,520 And so the first thing is there's 1318 01:12:11,520 --> 01:12:13,770 a strategy called bypassing. 1319 01:12:13,770 --> 01:12:19,500 So suppose that you have an instruction writing into rax. 1320 01:12:19,500 --> 01:12:22,800 And then you're going to use that to read. 1321 01:12:22,800 --> 01:12:27,450 Well, why bother waiting for it to be stored into the register 1322 01:12:27,450 --> 01:12:31,560 file and then pulled back out for the second instruction? 1323 01:12:31,560 --> 01:12:36,690 Instead, let's have a bypass, a special circuit 1324 01:12:36,690 --> 01:12:39,330 that identifies that kind of situation 1325 01:12:39,330 --> 01:12:42,900 and feeds it directly to the next instruction 1326 01:12:42,900 --> 01:12:45,660 without requiring that it go into the register file 1327 01:12:45,660 --> 01:12:47,430 and back out. 1328 01:12:47,430 --> 01:12:48,935 So that's called bypassing. 1329 01:12:48,935 --> 01:12:51,060 There are lots of places where things are bypassed. 1330 01:12:51,060 --> 01:12:53,400 And we'll talk about it more. 1331 01:12:53,400 --> 01:12:55,500 So normally, you would stall waiting 1332 01:12:55,500 --> 01:12:57,450 for it to be written back. 
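As a small, hedged illustration of that read-after-write situation, consider a C function like the one below; the comment describes the kind of dependent instruction pair a compiler might emit, not the exact code it will produce.

    long chain(long a, long b, long c) {
        /* The second addition cannot begin until the first one's result
           exists: a read-after-write dependence.  Without bypassing, the
           second instruction would wait for the first result to reach
           the register file; with bypassing, the ALU output is forwarded
           straight into the next operation. */
        long t = a + b;
        return t + c;
    }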
1333 01:12:57,450 --> 01:12:59,940 And now, when you eliminate it, now I 1334 01:12:59,940 --> 01:13:02,250 can move it way forward, because I just 1335 01:13:02,250 --> 01:13:06,600 use the bypass path to execute. 1336 01:13:06,600 --> 01:13:08,100 And it allows the second instruction 1337 01:13:08,100 --> 01:13:09,030 to get going earlier. 1338 01:13:12,900 --> 01:13:13,940 What else can we do? 1339 01:13:13,940 --> 01:13:17,843 Well, let's take a large code example. 1340 01:13:17,843 --> 01:13:19,260 Given the amount of time, what I'm 1341 01:13:19,260 --> 01:13:21,180 going to do is basically say, you 1342 01:13:21,180 --> 01:13:22,710 can go through and figure out what 1343 01:13:22,710 --> 01:13:24,930 are the read after write dependencies 1344 01:13:24,930 --> 01:13:27,330 and the write after read dependencies. 1345 01:13:27,330 --> 01:13:28,540 They're all over the place. 1346 01:13:28,540 --> 01:13:33,210 And what you can do is if you look 1347 01:13:33,210 --> 01:13:36,952 at what the dependencies are that I just flashed through, 1348 01:13:36,952 --> 01:13:38,910 you can discover, oh, there's all these things. 1349 01:13:38,910 --> 01:13:44,220 Each one right now has to wait for the previous one 1350 01:13:44,220 --> 01:13:47,070 before it can get started. 1351 01:13:47,070 --> 01:13:49,700 But there are some-- for example, 1352 01:13:49,700 --> 01:13:51,450 the first one is just issue order. 1353 01:13:51,450 --> 01:13:53,070 You can't start the second-- 1354 01:13:53,070 --> 01:13:55,440 if it's in order, you can't start the second 1355 01:13:55,440 --> 01:13:58,290 till you've started the first, that it's 1356 01:13:58,290 --> 01:13:59,862 finished the first stage. 1357 01:13:59,862 --> 01:14:01,320 But the other thing here is there's 1358 01:14:01,320 --> 01:14:04,890 a data dependence between the second and third instructions. 1359 01:14:04,890 --> 01:14:08,040 So if you look at the second and third instructions, 1360 01:14:08,040 --> 01:14:10,940 they're both using XMM2. 1361 01:14:10,940 --> 01:14:13,195 And so we're prevented. 1362 01:14:13,195 --> 01:14:14,820 So one of the questions there is, well, 1363 01:14:14,820 --> 01:14:19,520 why not do a little bit better by taking a look at this 1364 01:14:19,520 --> 01:14:21,140 as a graph and figuring out what's 1365 01:14:21,140 --> 01:14:22,878 the best way through the graph? 1366 01:14:22,878 --> 01:14:24,920 And there are a bunch of tricks you can do there, 1367 01:14:24,920 --> 01:14:28,220 which I'll run through here very quickly. 1368 01:14:28,220 --> 01:14:31,740 And you can take a look at these. 1369 01:14:31,740 --> 01:14:33,740 You can discover that some of these dependencies 1370 01:14:33,740 --> 01:14:35,180 are not real dependence. 1371 01:14:35,180 --> 01:14:37,910 And as long as you're willing to execute things out of order 1372 01:14:37,910 --> 01:14:41,120 and keep track of that, it's perfectly fine. 1373 01:14:41,120 --> 01:14:43,550 If you're not actually dependent on it, 1374 01:14:43,550 --> 01:14:45,290 then just go ahead and execute it. 1375 01:14:45,290 --> 01:14:46,820 And then you can advance things. 1376 01:14:46,820 --> 01:14:48,320 And then the other trick you can use 1377 01:14:48,320 --> 01:14:50,360 is what's called register renaming. 
1378 01:14:50,360 --> 01:14:54,130 If you have a destination that's going to be read from-- 1379 01:14:54,130 --> 01:15:00,890 sorry, if I want to write to something, 1380 01:15:00,890 --> 01:15:04,300 but I have to wait for something else to read from it, 1381 01:15:04,300 --> 01:15:08,120 the write after read dependence, then what 1382 01:15:08,120 --> 01:15:11,660 I can do is just rename the register, 1383 01:15:11,660 --> 01:15:13,070 so that I have something to write 1384 01:15:13,070 --> 01:15:15,590 to that is the same thing. 1385 01:15:15,590 --> 01:15:18,080 And there's a very complex mechanism called 1386 01:15:18,080 --> 01:15:21,380 scoreboarding that does that. 1387 01:15:21,380 --> 01:15:25,982 So anyway, you can take a look at all of these tricks. 1388 01:15:25,982 --> 01:15:27,440 And then the last thing that I want 1389 01:15:27,440 --> 01:15:29,565 to-- so this is this part I was going to skip over. 1390 01:15:29,565 --> 01:15:31,500 And indeed, I don't have time to do it. 1391 01:15:31,500 --> 01:15:35,730 I just want to mention the last thing, which is worthwhile. 1392 01:15:35,730 --> 01:15:37,460 So this-- you don't have to know any 1393 01:15:37,460 --> 01:15:39,320 of the details of that part. 1394 01:15:39,320 --> 01:15:41,850 But it's in there if you're interested. 1395 01:15:41,850 --> 01:15:43,607 So it does renaming and reordering. 1396 01:15:43,607 --> 01:15:45,440 And then the last thing I do want to mention 1397 01:15:45,440 --> 01:15:47,010 is branch prediction. 1398 01:15:47,010 --> 01:15:50,750 So when you come to branch prediction, the outcome, 1399 01:15:50,750 --> 01:15:54,350 you can have a hazard because the outcome is known too late. 1400 01:15:54,350 --> 01:15:58,760 And so in that case, what they do 1401 01:15:58,760 --> 01:16:01,010 is what's called speculative execution, which 1402 01:16:01,010 --> 01:16:03,170 you've probably heard of. 1403 01:16:03,170 --> 01:16:05,510 So basically that says I'm going to guess the outcome 1404 01:16:05,510 --> 01:16:07,970 of the branch and execute. 1405 01:16:07,970 --> 01:16:12,140 If a branch is encountered, you assume it's taken 1406 01:16:12,140 --> 01:16:13,790 and you execute normally. 1407 01:16:13,790 --> 01:16:16,460 And if you're right, everything is hunky dory. 1408 01:16:16,460 --> 01:16:19,430 If you're wrong, it costs you something like a-- 1409 01:16:23,840 --> 01:16:26,480 you have to undo that speculative computation 1410 01:16:26,480 --> 01:16:29,240 and the effect is sort of like stalling. 1411 01:16:29,240 --> 01:16:31,560 So you don't want that to happen. 1412 01:16:31,560 --> 01:16:36,200 And so a mispredicted branch on Haswell 1413 01:16:36,200 --> 01:16:39,260 costs about 15 to 20 cycles. 1414 01:16:39,260 --> 01:16:41,840 Most of the machines use a branch predictor 1415 01:16:41,840 --> 01:16:43,760 to tell whether or not the branch is going to be taken. 1416 01:16:43,760 --> 01:16:45,177 There's a little bit of stuff here 1417 01:16:45,177 --> 01:16:49,690 about how you tell whether a branch is 1418 01:16:49,690 --> 01:16:52,280 going to be predicted or not. 1419 01:16:52,280 --> 01:16:55,440 And you can take a look at that on your own. 1420 01:16:55,440 --> 01:16:57,140 So sorry to rush a little bit the end, 1421 01:16:57,140 --> 01:16:59,360 but I knew I wasn't going to get through all of this. 1422 01:16:59,360 --> 01:17:03,020 But it's in the notes, in the slides when we put it up. 1423 01:17:03,020 --> 01:17:05,960 And this is really kind of interesting stuff. 
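To connect that misprediction cost to something you can measure, here is a hedged sketch contrasting a branchy loop with a branchless one. With random, unpredictable data the first version can mispredict constantly, while the second gives the compiler an easy path to a conditional move or a vector compare; which is faster depends on the data and the machine, so time both rather than assume.

    long count_branchy(const int *a, long n, int t) {
        long k = 0;
        for (long i = 0; i < n; i++) {
            if (a[i] > t) {       /* data-dependent branch: hard to
                                     predict when the data is random */
                k++;
            }
        }
        return k;
    }

    long count_branchless(const int *a, long n, int t) {
        long k = 0;
        for (long i = 0; i < n; i++) {
            k += (a[i] > t);      /* the comparison result is just 0 or 1,
                                     so no branch is needed */
        }
        return k;
    }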
1424 01:17:05,960 --> 01:17:08,810 Once again, remember that I'm dealing with this at one level 1425 01:17:08,810 --> 01:17:11,270 below what you really need to do. 1426 01:17:11,270 --> 01:17:13,580 But it is really helpful to understand that layer 1427 01:17:13,580 --> 01:17:17,000 so you have a deep understanding of why certain software 1428 01:17:17,000 --> 01:17:19,350 optimizations work and don't work. 1429 01:17:19,350 --> 01:17:20,890 Sound good? 1430 01:17:20,890 --> 01:17:24,310 OK, good luck on finishing your project 1's.