1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high-quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,140 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,140 --> 00:00:19,026 at ocw.mit.edu. 8 00:00:23,660 --> 00:00:27,290 CHARLES LEISERSON: So welcome to 6.172. 9 00:00:27,290 --> 00:00:30,140 My name is Charles Leiserson, and I am one 10 00:00:30,140 --> 00:00:32,180 of the two lecturers this term. 11 00:00:32,180 --> 00:00:35,570 The other is Professor Julian Shun. 12 00:00:35,570 --> 00:00:41,010 We're both in EECS and in CSAIL on the 7th floor of the Gates 13 00:00:41,010 --> 00:00:41,510 Building. 14 00:00:44,040 --> 00:00:49,360 If you don't know it, you are in Performance Engineering 15 00:00:49,360 --> 00:00:53,050 of Software Systems, so if you found yourself 16 00:00:53,050 --> 00:00:57,180 in the wrong place, now's the time to exit. 17 00:00:57,180 --> 00:01:02,430 I want to start today by talking a little bit about why 18 00:01:02,430 --> 00:01:07,200 we do performance engineering, and then I'll 19 00:01:07,200 --> 00:01:09,630 do a little bit of administration, 20 00:01:09,630 --> 00:01:13,540 and then sort of dive into sort of a case study that 21 00:01:13,540 --> 00:01:15,540 will give you a good sense of some of the things 22 00:01:15,540 --> 00:01:19,355 that we're going to do during the term. 23 00:01:19,355 --> 00:01:20,980 I put the administration in the middle, 24 00:01:20,980 --> 00:01:25,740 because it's like, if from me telling you about the course 25 00:01:25,740 --> 00:01:27,360 you don't want to do the course, then 26 00:01:27,360 --> 00:01:30,090 it's like, why should you listen to the administration, right? 27 00:01:30,090 --> 00:01:30,600 It's like-- 28 00:01:34,590 --> 00:01:37,010 So let's just dive right in, OK? 29 00:01:37,010 --> 00:01:39,943 So the first thing to always understand 30 00:01:39,943 --> 00:01:41,360 whenever you're doing something is 31 00:01:41,360 --> 00:01:45,780 a perspective on what matters in what you're doing. 32 00:01:45,780 --> 00:01:48,690 So the whole term, we're going to do software performance 33 00:01:48,690 --> 00:01:49,830 engineering. 34 00:01:49,830 --> 00:01:52,590 And so this is kind of interesting, 35 00:01:52,590 --> 00:01:55,380 because it turns out that performance is usually 36 00:01:55,380 --> 00:01:59,550 not at the top of what people are interested in when they're 37 00:01:59,550 --> 00:02:01,260 building software. 38 00:02:01,260 --> 00:02:01,770 OK? 39 00:02:01,770 --> 00:02:03,478 What are some of the things that are more 40 00:02:03,478 --> 00:02:07,610 important than performance? 41 00:02:07,610 --> 00:02:08,347 Yeah? 42 00:02:08,347 --> 00:02:09,180 AUDIENCE: Deadlines. 43 00:02:09,180 --> 00:02:10,388 CHARLES LEISERSON: Deadlines. 44 00:02:10,388 --> 00:02:10,900 Good. 45 00:02:10,900 --> 00:02:11,880 AUDIENCE: Cost. 46 00:02:11,880 --> 00:02:13,333 CHARLES LEISERSON: Cost. 47 00:02:13,333 --> 00:02:14,250 AUDIENCE: Correctness. 48 00:02:14,250 --> 00:02:15,580 CHARLES LEISERSON: Correctness. 49 00:02:15,580 --> 00:02:16,960 AUDIENCE: Extensibility. 50 00:02:16,960 --> 00:02:18,420 CHARLES LEISERSON: Extensibility. 51 00:02:18,420 --> 00:02:19,870 Yeah, maybe we could go on and on. 
52 00:02:19,870 --> 00:02:22,920 And I think that you folks could probably 53 00:02:22,920 --> 00:02:24,960 make a pretty long list. 54 00:02:24,960 --> 00:02:27,180 I made a short list of all the kinds 55 00:02:27,180 --> 00:02:30,900 of things that are more important than performance. 56 00:02:30,900 --> 00:02:34,080 So then, if programmers are so willing to sacrifice 57 00:02:34,080 --> 00:02:37,770 performance for these properties, 58 00:02:37,770 --> 00:02:40,630 why do we study performance? 59 00:02:40,630 --> 00:02:41,130 OK? 60 00:02:41,130 --> 00:02:46,140 So this is kind of a bit of a paradox and a bit of a puzzle. 61 00:02:46,140 --> 00:02:50,250 Why do you study something that clearly 62 00:02:50,250 --> 00:02:51,930 isn't at the top of the list of what 63 00:02:51,930 --> 00:02:56,700 most people care about when they're developing software? 64 00:02:56,700 --> 00:03:01,140 I think the answer to that is that performance 65 00:03:01,140 --> 00:03:03,820 is the currency of computing. 66 00:03:03,820 --> 00:03:04,320 OK? 67 00:03:04,320 --> 00:03:08,590 You use performance to buy these other properties. 68 00:03:08,590 --> 00:03:11,230 So you'll say something like, gee, 69 00:03:11,230 --> 00:03:13,380 I want to make it easy to program, 70 00:03:13,380 --> 00:03:16,650 and so therefore I'm willing to sacrifice some performance 71 00:03:16,650 --> 00:03:19,080 to make something easy to program. 72 00:03:19,080 --> 00:03:22,200 I'm willing to sacrifice some performance to make sure 73 00:03:22,200 --> 00:03:25,270 that my system is secure. 74 00:03:25,270 --> 00:03:25,770 OK? 75 00:03:25,770 --> 00:03:29,010 And all those things come out of your performance budget. 76 00:03:29,010 --> 00:03:32,580 And clearly if performance degrades too far, 77 00:03:32,580 --> 00:03:34,760 your stuff becomes unusable. 78 00:03:34,760 --> 00:03:35,340 OK? 79 00:03:35,340 --> 00:03:39,492 When I talk with people, with programmers, and I say-- 80 00:03:39,492 --> 00:03:41,700 you know, people are fond of saying, ah, performance. 81 00:03:41,700 --> 00:03:42,765 Oh, you do performance? 82 00:03:42,765 --> 00:03:43,890 Performance doesn't matter. 83 00:03:43,890 --> 00:03:45,390 I never think about it. 84 00:03:45,390 --> 00:03:50,610 Then I talk with people who use computers, 85 00:03:50,610 --> 00:03:54,540 and I ask, what's your main complaint about the computing 86 00:03:54,540 --> 00:03:56,020 systems you use? 87 00:03:56,020 --> 00:03:57,610 Answer? 88 00:03:57,610 --> 00:03:58,653 Too slow. 89 00:03:58,653 --> 00:03:59,690 [CHUCKLES] OK? 90 00:03:59,690 --> 00:04:01,690 So it's interesting, whether you're the producer 91 00:04:01,690 --> 00:04:02,600 or whatever. 92 00:04:02,600 --> 00:04:07,083 But the real answer is that performance is like currency. 93 00:04:07,083 --> 00:04:08,125 It's something you spend. 94 00:04:10,810 --> 00:04:12,880 If you look-- you know, would I rather 95 00:04:12,880 --> 00:04:17,810 have $100 or a gallon of water? 96 00:04:17,810 --> 00:04:20,930 Well, water is indispensable to life. 97 00:04:20,930 --> 00:04:22,460 There are circumstances, certainly, 98 00:04:22,460 --> 00:04:25,490 where I would prefer to have the water. 99 00:04:25,490 --> 00:04:26,060 OK? 100 00:04:26,060 --> 00:04:27,600 Than $100. 101 00:04:27,600 --> 00:04:28,100 OK? 102 00:04:28,100 --> 00:04:32,030 But in our modern society, I can buy 103 00:04:32,030 --> 00:04:36,060 water for much less than $100. 104 00:04:36,060 --> 00:04:36,560 OK? 
105 00:04:36,560 --> 00:04:40,760 So even though water is essential to life 106 00:04:40,760 --> 00:04:45,500 and far more important than money, money is a currency, 107 00:04:45,500 --> 00:04:48,200 and so I prefer to have the money because I can just 108 00:04:48,200 --> 00:04:49,730 buy the things I need. 109 00:04:49,730 --> 00:04:52,700 And that's the same kind of analogy of performance. 110 00:04:52,700 --> 00:04:58,500 It has no intrinsic value, but it contributes to things. 111 00:04:58,500 --> 00:05:01,070 You can use it to buy things that you care about 112 00:05:01,070 --> 00:05:05,360 like usability or testability or what have you. 113 00:05:05,360 --> 00:05:06,470 OK? 114 00:05:06,470 --> 00:05:09,560 Now, in the early days of computing, 115 00:05:09,560 --> 00:05:12,530 software performance engineering was common 116 00:05:12,530 --> 00:05:14,540 because machine resources were limited. 117 00:05:14,540 --> 00:05:20,720 If you look at these machines from 1964 to 1977-- 118 00:05:20,720 --> 00:05:25,060 I mean, look at how many bytes they have on them, right? 119 00:05:25,060 --> 00:05:31,070 In '64, there is a computer with 524 kilobytes. 120 00:05:31,070 --> 00:05:32,190 OK? 121 00:05:32,190 --> 00:05:35,380 That was a big machine back then. 122 00:05:35,380 --> 00:05:36,190 That's kilobytes. 123 00:05:36,190 --> 00:05:37,990 That's not megabytes, that's not gigabytes. 124 00:05:37,990 --> 00:05:38,900 That's kilobytes. 125 00:05:38,900 --> 00:05:39,990 OK? 126 00:05:39,990 --> 00:05:46,480 And many programs would strain the machine resources, OK? 127 00:05:46,480 --> 00:05:50,500 The clock rate for that machine was 33 kilohertz. 128 00:05:50,500 --> 00:05:53,524 What's a typical clock rate today? 129 00:05:53,524 --> 00:05:54,770 AUDIENCE: 4 gigahertz. 130 00:05:54,770 --> 00:05:56,020 CHARLES LEISERSON: About what? 131 00:05:56,020 --> 00:05:56,480 AUDIENCE: 4 gigahertz. 132 00:05:56,480 --> 00:05:59,270 CHARLES LEISERSON: 4 gigahertz, 3 gigahertz, 2 gigahertz, 133 00:05:59,270 --> 00:06:00,080 somewhere up there. 134 00:06:00,080 --> 00:06:02,660 Yeah, somewhere in that range, OK? 135 00:06:02,660 --> 00:06:05,370 And here they were operating with kilohertz. 136 00:06:05,370 --> 00:06:08,750 So many programs would not fit without intense performance 137 00:06:08,750 --> 00:06:11,040 engineering. 138 00:06:11,040 --> 00:06:13,020 And one of the things, also-- 139 00:06:13,020 --> 00:06:17,880 there's a lot of sayings that came out of that era. 140 00:06:17,880 --> 00:06:23,940 Donald Knuth, who's a Turing Award winner, absolutely 141 00:06:23,940 --> 00:06:27,330 fabulous computer scientist in all respects, 142 00:06:27,330 --> 00:06:30,408 wrote, "Premature optimization is the root of all evil." 143 00:06:30,408 --> 00:06:31,950 And I invite you, by the way, to look 144 00:06:31,950 --> 00:06:35,560 that quote up because it's actually taken out of context. 145 00:06:35,560 --> 00:06:36,060 OK? 146 00:06:36,060 --> 00:06:38,670 So trying to optimize stuff too early he was worried about. 147 00:06:38,670 --> 00:06:40,200 OK? 148 00:06:40,200 --> 00:06:45,480 Bill Wulf, who designed the BLISS language 149 00:06:45,480 --> 00:06:47,850 and worked on the PDP-11 and such, 150 00:06:47,850 --> 00:06:49,560 said, "More computing sins are committed 151 00:06:49,560 --> 00:06:52,770 in the name of efficiency without necessarily achieving 152 00:06:52,770 --> 00:06:55,710 it than for any other single reason, including 153 00:06:55,710 --> 00:06:57,890 blind stupidity." 
154 00:06:57,890 --> 00:06:58,890 OK? 155 00:06:58,890 --> 00:07:00,750 And Michael Jackson said, "The first rule 156 00:07:00,750 --> 00:07:02,370 of program optimization-- 157 00:07:02,370 --> 00:07:03,780 don't do it. 158 00:07:03,780 --> 00:07:05,780 Second rule of program optimization, 159 00:07:05,780 --> 00:07:07,560 for experts only-- 160 00:07:07,560 --> 00:07:08,870 don't do it yet." 161 00:07:08,870 --> 00:07:09,370 [CHUCKLES] 162 00:07:09,370 --> 00:07:09,870 OK? 163 00:07:09,870 --> 00:07:12,078 So everybody's warning away, because when 164 00:07:12,078 --> 00:07:13,620 you start trying to make things fast, 165 00:07:13,620 --> 00:07:15,760 your code becomes unreadable. 166 00:07:15,760 --> 00:07:16,260 OK? 167 00:07:16,260 --> 00:07:19,170 Making code that is readable and fast-- now 168 00:07:19,170 --> 00:07:21,540 that's where the art is, and hopefully we'll 169 00:07:21,540 --> 00:07:23,720 learn a little bit about doing that. 170 00:07:23,720 --> 00:07:26,370 OK? 171 00:07:26,370 --> 00:07:28,350 And indeed, there was no real point 172 00:07:28,350 --> 00:07:33,300 in working too hard on performance 173 00:07:33,300 --> 00:07:35,490 engineering for many years. 174 00:07:35,490 --> 00:07:37,390 If you look at technology scaling 175 00:07:37,390 --> 00:07:39,090 and you look at how many transistors 176 00:07:39,090 --> 00:07:45,750 are on various processor designs, up until about 2004, 177 00:07:45,750 --> 00:07:52,830 we had Moore's law in full throttle, OK? 178 00:07:52,830 --> 00:07:57,560 With chip densities doubling every two years. 179 00:07:57,560 --> 00:08:00,930 And really quite amazing. 180 00:08:00,930 --> 00:08:05,160 And along with that, as they shrunk the dimensions 181 00:08:05,160 --> 00:08:05,790 of chips-- 182 00:08:05,790 --> 00:08:09,180 because by miniaturization-- the clock speed would go up 183 00:08:09,180 --> 00:08:11,640 correspondingly, as well. 184 00:08:11,640 --> 00:08:14,990 And so, if you found something was too slow, 185 00:08:14,990 --> 00:08:16,900 wait a couple of years. 186 00:08:16,900 --> 00:08:17,780 OK? 187 00:08:17,780 --> 00:08:18,890 Wait a couple of years. 188 00:08:18,890 --> 00:08:21,100 It'll be faster. 189 00:08:21,100 --> 00:08:21,910 OK? 190 00:08:21,910 --> 00:08:24,560 And so, you know, if you were going 191 00:08:24,560 --> 00:08:27,880 to do something with software and make your software ugly, 192 00:08:27,880 --> 00:08:38,590 there wasn't a real good payoff compared 193 00:08:38,590 --> 00:08:43,270 to just simply waiting around. 194 00:08:43,270 --> 00:08:45,020 And in that era, there was something 195 00:08:45,020 --> 00:08:53,430 called Dennard scaling, which, as things shrunk, 196 00:08:53,430 --> 00:08:57,840 allowed the clock speeds to get larger, basically 197 00:08:57,840 --> 00:08:59,670 by reducing power. 198 00:08:59,670 --> 00:09:02,135 You could reduce power and still keep everything fast, 199 00:09:02,135 --> 00:09:04,440 and we'll talk about that in a minute. 200 00:09:04,440 --> 00:09:10,040 So if you look at what happened from 1977 to 2004-- 201 00:09:10,040 --> 00:09:16,350 here are Apple computers with similar price tags, 202 00:09:16,350 --> 00:09:21,750 and you can see the clock rate really just skyrocketed. 203 00:09:21,750 --> 00:09:27,060 1 megahertz, 400 megahertz, 1.8 gigahertz, OK? 204 00:09:27,060 --> 00:09:30,630 And the data paths went from 8 bits to 32 to 64. 205 00:09:30,630 --> 00:09:32,880 The memory, correspondingly, grows. 206 00:09:32,880 --> 00:09:33,750 Cost?
207 00:09:33,750 --> 00:09:35,850 Approximately the same. 208 00:09:35,850 --> 00:09:38,160 And that's the legacy from Moore's law 209 00:09:38,160 --> 00:09:43,170 and the tremendous advances in semiconductor technology. 210 00:09:43,170 --> 00:09:47,190 And so, until 2004, Moore's law and the scaling 211 00:09:47,190 --> 00:09:50,610 of clock frequency, so-called Dennard scaling, 212 00:09:50,610 --> 00:09:52,800 was essentially a printing press for the currency 213 00:09:52,800 --> 00:09:54,190 of performance. 214 00:09:54,190 --> 00:09:54,690 OK? 215 00:09:54,690 --> 00:09:55,998 You didn't have to do anything. 216 00:09:55,998 --> 00:09:57,540 You just made the hardware go faster. 217 00:09:57,540 --> 00:09:59,820 Very easy. 218 00:09:59,820 --> 00:10:03,330 And all that came to an end-- 219 00:10:03,330 --> 00:10:05,250 well, some of it came to an end-- 220 00:10:05,250 --> 00:10:09,750 in 2004 when clock speeds plateaued. 221 00:10:09,750 --> 00:10:10,250 OK? 222 00:10:10,250 --> 00:10:12,570 So if you look at this, around 2005, 223 00:10:12,570 --> 00:10:14,010 you can see all the speeds-- 224 00:10:14,010 --> 00:10:18,030 we hit, you know, 2 to 4 gigahertz, 225 00:10:18,030 --> 00:10:21,240 and we have not been able to make chips go faster 226 00:10:21,240 --> 00:10:25,060 than that in any practical way since then. 227 00:10:25,060 --> 00:10:26,880 But the densities have kept going great. 228 00:10:26,880 --> 00:10:30,960 Now, the reason that the clock speed flattened 229 00:10:30,960 --> 00:10:32,430 was because of power density. 230 00:10:32,430 --> 00:10:37,050 And this is a slide from Intel from that era, 231 00:10:37,050 --> 00:10:39,480 looking at the growth of power density. 232 00:10:39,480 --> 00:10:41,400 And what they were projecting was 233 00:10:41,400 --> 00:10:46,500 that the junction temperatures of the transistors on the chip, 234 00:10:46,500 --> 00:10:50,040 if they just keep scaling the way they had been scaling, 235 00:10:50,040 --> 00:10:54,390 would start to approach, first of all, 236 00:10:54,390 --> 00:10:57,000 the temperature of a nuclear reactor, then 237 00:10:57,000 --> 00:10:58,770 the temperature of a rocket nozzle, 238 00:10:58,770 --> 00:11:00,110 and then the sun's surface. 239 00:11:00,110 --> 00:11:00,610 OK? 240 00:11:00,610 --> 00:11:03,060 So that we're not going to build little technology 241 00:11:03,060 --> 00:11:05,010 that cools that very well. 242 00:11:05,010 --> 00:11:07,353 And even if you could solve it for a little bit, 243 00:11:07,353 --> 00:11:08,520 the writing was on the wall. 244 00:11:08,520 --> 00:11:11,400 We cannot scale clock frequencies any more. 245 00:11:11,400 --> 00:11:14,070 The reason for that is that, originally, clock frequency 246 00:11:14,070 --> 00:11:18,120 was scaled assuming that most of the power 247 00:11:18,120 --> 00:11:20,100 was dynamic power, which was going 248 00:11:20,100 --> 00:11:21,870 when you switched the circuit. 
249 00:11:21,870 --> 00:11:24,420 And what happened as we kept reducing that and reducing 250 00:11:24,420 --> 00:11:27,180 that is, something that used to be in the noise, namely 251 00:11:27,180 --> 00:11:30,150 the leakage currents, OK, started 252 00:11:30,150 --> 00:11:33,270 to become significant to the point where-- 253 00:11:33,270 --> 00:11:37,050 now, today-- the dynamic power is 254 00:11:37,050 --> 00:11:40,080 far less of a concern than the static power 255 00:11:40,080 --> 00:11:43,290 from just the circuit sitting there leaking, 256 00:11:43,290 --> 00:11:48,210 and when you miniaturize, you can't stop that effect 257 00:11:48,210 --> 00:11:49,780 from happening. 258 00:11:49,780 --> 00:11:54,090 So what did the vendors do in 2004 and 2005 259 00:11:54,090 --> 00:11:57,150 and since is, they said, oh, gosh, we've 260 00:11:57,150 --> 00:12:00,780 got all these transistors to use, 261 00:12:00,780 --> 00:12:06,000 but we can't use the transistors to make stuff run faster. 262 00:12:06,000 --> 00:12:08,610 So what they did is, they introduced parallelism 263 00:12:08,610 --> 00:12:11,070 in the form of multicore processors. 264 00:12:11,070 --> 00:12:14,400 They put more than one processing core in a chip. 265 00:12:14,400 --> 00:12:19,380 And to scale performance, they would, you know, 266 00:12:19,380 --> 00:12:23,400 have multiple cores, and each generation of Moore's law 267 00:12:23,400 --> 00:12:27,900 now was potentially doubling the number of cores. 268 00:12:27,900 --> 00:12:31,590 And so if you look at what happened for processor cores, 269 00:12:31,590 --> 00:12:37,200 you see that around 2004, 2005, we started 270 00:12:37,200 --> 00:12:41,190 to get multiple processing cores per chip, to the extent 271 00:12:41,190 --> 00:12:43,650 that today, it's basically impossible 272 00:12:43,650 --> 00:12:47,940 to find a single-core chip for a laptop or a workstation 273 00:12:47,940 --> 00:12:48,810 or whatever. 274 00:12:48,810 --> 00:12:50,700 Everything is multicore. 275 00:12:50,700 --> 00:12:52,470 You can't buy just one. 276 00:12:52,470 --> 00:12:54,150 You have to buy a parallel processor. 277 00:12:59,550 --> 00:13:02,280 And so the impact of that was that performance 278 00:13:02,280 --> 00:13:03,540 was no longer free. 279 00:13:03,540 --> 00:13:05,760 You couldn't just speed up the hardware. 280 00:13:05,760 --> 00:13:08,400 Now if you wanted to use that potential, 281 00:13:08,400 --> 00:13:09,882 you had to do parallel programming, 282 00:13:09,882 --> 00:13:12,090 and that's not something that anybody in the industry 283 00:13:12,090 --> 00:13:16,292 really had done. 284 00:13:16,292 --> 00:13:18,000 So today, there are a lot of other things 285 00:13:18,000 --> 00:13:20,100 that happened in that intervening time. 286 00:13:20,100 --> 00:13:23,910 We got vector units as common parts of our machines; 287 00:13:23,910 --> 00:13:28,920 we got GPUs; we got steeper cache hierarchies; 288 00:13:28,920 --> 00:13:33,370 we have configurable logic on some machines; and so forth. 289 00:13:33,370 --> 00:13:36,180 And now it's up to the software to adapt to it. 290 00:13:36,180 --> 00:13:38,310 And so, although we don't want to have 291 00:13:38,310 --> 00:13:41,700 to deal with performance, today you 292 00:13:41,700 --> 00:13:43,080 have to deal with performance. 
293 00:13:43,080 --> 00:13:45,150 And in your lifetimes, you will have 294 00:13:45,150 --> 00:13:48,070 to deal with performance in software 295 00:13:48,070 --> 00:13:49,930 if you're going to have effective software. 296 00:13:49,930 --> 00:13:52,590 OK? 297 00:13:52,590 --> 00:13:54,000 You can see what happened, also-- 298 00:13:54,000 --> 00:13:56,610 this is a study that we did looking 299 00:13:56,610 --> 00:14:03,360 at software bugs in a variety of open-source projects 300 00:14:03,360 --> 00:14:05,530 where they're mentioning the word "performance." 301 00:14:05,530 --> 00:14:09,725 And you can see that in 2004, the numbers start going up. 302 00:14:09,725 --> 00:14:11,100 You know, some of them-- it's not 303 00:14:11,100 --> 00:14:14,460 as convincing for some things as others, 304 00:14:14,460 --> 00:14:18,360 but generally there's a trend of, after 2004, 305 00:14:18,360 --> 00:14:20,940 people started worrying more about performance. 306 00:14:20,940 --> 00:14:25,590 If you look at software developer jobs, 307 00:14:25,590 --> 00:14:30,312 as of around early, mid-2000s-- 308 00:14:30,312 --> 00:14:34,530 the 2000 "oh oh's," I guess, OK-- 309 00:14:34,530 --> 00:14:37,860 you see once again the mention of "performance" in jobs 310 00:14:37,860 --> 00:14:39,240 is going up. 311 00:14:39,240 --> 00:14:42,270 And anecdotally, I can tell you, I 312 00:14:42,270 --> 00:14:46,500 had one student who came to me after the spring, 313 00:14:46,500 --> 00:14:50,100 after he'd taken 6.172, and he said, 314 00:14:50,100 --> 00:14:54,990 you know, I went and I applied for five jobs. 315 00:14:54,990 --> 00:14:59,112 And every job asked me, at every job interview, 316 00:14:59,112 --> 00:15:00,570 they asked me a question I couldn't 317 00:15:00,570 --> 00:15:06,000 have answered if I hadn't taken 6.172, and I got five offers. 318 00:15:06,000 --> 00:15:06,540 OK? 319 00:15:06,540 --> 00:15:08,340 And when I compared those offers, 320 00:15:08,340 --> 00:15:11,860 they tended to be 20% to 30% larger than people 321 00:15:11,860 --> 00:15:13,860 who are just web monkeys. 322 00:15:13,860 --> 00:15:15,070 OK? 323 00:15:15,070 --> 00:15:15,700 So anyway. 324 00:15:15,700 --> 00:15:16,285 [LAUGHTER] 325 00:15:16,285 --> 00:15:19,690 That's not to say that you should necessarily 326 00:15:19,690 --> 00:15:21,790 take this class, OK? 327 00:15:21,790 --> 00:15:24,910 But I just want to point out that what we're going to learn 328 00:15:24,910 --> 00:15:28,900 is going to be interesting from a practical point of view, 329 00:15:28,900 --> 00:15:30,720 i.e., your futures. 330 00:15:30,720 --> 00:15:31,360 OK? 331 00:15:31,360 --> 00:15:33,370 As well as theoretical points of view 332 00:15:33,370 --> 00:15:35,200 and technical points of view. 333 00:15:35,200 --> 00:15:36,490 OK? 334 00:15:36,490 --> 00:15:41,080 So modern processors are really complicated, 335 00:15:41,080 --> 00:15:43,420 and the big question is, how do we 336 00:15:43,420 --> 00:15:48,300 write software to use that modern hardware efficiently? 337 00:15:48,300 --> 00:15:49,800 OK? 338 00:15:49,800 --> 00:15:53,040 I want to give you an example of performance engineering 339 00:15:53,040 --> 00:15:58,170 of a very well-studied problem, namely matrix multiplication. 340 00:15:58,170 --> 00:16:01,079 Who has never seen this problem? 341 00:16:01,079 --> 00:16:04,250 [LAUGHS] Yeah. 342 00:16:04,250 --> 00:16:07,700 OK, so we got some jokers in the class, I can see. 343 00:16:07,700 --> 00:16:10,230 OK. 
344 00:16:10,230 --> 00:16:14,477 So, you know, it takes n cubed operations, 345 00:16:14,477 --> 00:16:16,310 because you're basically computing n squared 346 00:16:16,310 --> 00:16:18,100 dot products. 347 00:16:18,100 --> 00:16:18,600 OK? 348 00:16:18,600 --> 00:16:22,710 So essentially, if you add up the total number of operations, 349 00:16:22,710 --> 00:16:25,950 it's about 2n cubed because there is essentially 350 00:16:25,950 --> 00:16:29,790 a multiply and an add for every pair of terms 351 00:16:29,790 --> 00:16:32,020 that need to be accumulated. 352 00:16:32,020 --> 00:16:32,520 OK? 353 00:16:32,520 --> 00:16:34,193 So it's basically 2n cubed. 354 00:16:34,193 --> 00:16:35,610 We're going to look at it assuming 355 00:16:35,610 --> 00:16:42,210 for simplicity that our n is an exact power of 2, OK? 356 00:16:42,210 --> 00:16:45,810 Now, the machine that we're going to look at 357 00:16:45,810 --> 00:16:49,890 is going to be one of the ones that you'll 358 00:16:49,890 --> 00:16:51,720 have access to in AWS. 359 00:16:51,720 --> 00:16:52,220 OK? 360 00:16:52,220 --> 00:16:57,120 It's a compute-optimized machine, 361 00:16:57,120 --> 00:17:00,900 which has a Haswell microarchitecture running 362 00:17:00,900 --> 00:17:03,190 at 2.9 gigahertz. 363 00:17:03,190 --> 00:17:06,810 There are 2 processor chips for each of these machines 364 00:17:06,810 --> 00:17:14,490 and 9 processing cores per chip, so a total of 18 cores. 365 00:17:14,490 --> 00:17:17,160 So that's the amount of parallel processing. 366 00:17:17,160 --> 00:17:21,390 It does two-way hyperthreading, which we're actually 367 00:17:21,390 --> 00:17:24,819 going to not deal a lot with. 368 00:17:24,819 --> 00:17:27,960 Hyperthreading gives you a little bit more performance, 369 00:17:27,960 --> 00:17:31,110 but it also makes it really hard to measure things, 370 00:17:31,110 --> 00:17:33,540 so generally we will turn off hyperthreading. 371 00:17:33,540 --> 00:17:36,060 But the performance that you get tends 372 00:17:36,060 --> 00:17:40,650 to be correlated with what you get when you hyperthread. 373 00:17:40,650 --> 00:17:42,540 The floating-point unit there 374 00:17:42,540 --> 00:17:45,510 is capable of doing 8 double-precision 375 00:17:45,510 --> 00:17:46,170 operations. 376 00:17:46,170 --> 00:17:49,150 That's 64-bit floating-point operations, 377 00:17:49,150 --> 00:17:54,340 including a fused-multiply-add per core, per cycle. 378 00:17:54,340 --> 00:17:55,250 OK? 379 00:17:55,250 --> 00:17:56,540 So that's a vector unit. 380 00:17:56,540 --> 00:17:58,910 So basically, each of these 18 cores 381 00:17:58,910 --> 00:18:04,400 can do 8 double-precision operations, 382 00:18:04,400 --> 00:18:08,130 including a fused-multiply-add, which is actually 2 operations. 383 00:18:08,130 --> 00:18:08,630 OK? 384 00:18:08,630 --> 00:18:11,570 The way that they count these things, OK? 385 00:18:11,570 --> 00:18:15,220 It has a cache-line size of 64 bytes. 386 00:18:15,220 --> 00:18:19,377 The icache is 32 kilobytes, which is 8-way set associative. 387 00:18:19,377 --> 00:18:20,960 We'll talk about some of these things. 388 00:18:20,960 --> 00:18:22,940 If you don't know all the terms, it's OK. 389 00:18:22,940 --> 00:18:25,830 We're going to cover most of these terms later on. 390 00:18:25,830 --> 00:18:28,220 It's got a dcache of the same size.
391 00:18:28,220 --> 00:18:32,930 It's got an L2-cache of 256 kilobytes, 392 00:18:32,930 --> 00:18:34,710 and it's got an L3-cache or what's 393 00:18:34,710 --> 00:18:38,150 sometimes called an LLC, Last-Level Cache, 394 00:18:38,150 --> 00:18:40,590 of 25 megabytes. 395 00:18:40,590 --> 00:18:43,430 And then it's got 60 gigabytes of DRAM. 396 00:18:43,430 --> 00:18:46,070 So this is a honking big machine. 397 00:18:46,070 --> 00:18:46,570 OK? 398 00:18:46,570 --> 00:18:49,750 This is like-- you can get things to sing on this, OK? 399 00:18:49,750 --> 00:18:52,330 If you look at the peak performance, 400 00:18:52,330 --> 00:18:56,890 it's the clock speed times 2 processor 401 00:18:56,890 --> 00:19:05,140 chips times 9 processing cores per chip, each capable 402 00:19:05,140 --> 00:19:07,570 of, if you can use both the multiply and the add, 403 00:19:07,570 --> 00:19:13,960 16 floating-point operations, and that goes out 404 00:19:13,960 --> 00:19:16,840 to just short of teraflop, OK? 405 00:19:16,840 --> 00:19:20,090 836 gigaflops. 406 00:19:20,090 --> 00:19:22,933 So that's a lot of power, OK? 407 00:19:22,933 --> 00:19:23,850 That's a lot of power. 408 00:19:23,850 --> 00:19:26,240 These are fun machines, actually, OK? 409 00:19:26,240 --> 00:19:32,597 Especially when we get into things like the game-playing AI 410 00:19:32,597 --> 00:19:34,430 and stuff that we do for the fourth project. 411 00:19:34,430 --> 00:19:34,730 You'll see. 412 00:19:34,730 --> 00:19:35,660 They're really fun. 413 00:19:35,660 --> 00:19:38,910 You can have a lot of compute, OK? 414 00:19:38,910 --> 00:19:41,250 Now here's the basic code. 415 00:19:41,250 --> 00:19:44,190 This is the full code for Python for doing 416 00:19:44,190 --> 00:19:45,240 matrix multiplication. 417 00:19:45,240 --> 00:19:48,690 Now, generally, in Python, you wouldn't use this code 418 00:19:48,690 --> 00:19:50,940 because you just call a library subroutine that 419 00:19:50,940 --> 00:19:53,070 does matrix multiplication. 420 00:19:53,070 --> 00:19:54,545 But sometimes you have a problem. 421 00:19:54,545 --> 00:19:56,670 I'm going to illustrate with matrix multiplication, 422 00:19:56,670 --> 00:20:01,530 but sometimes you have a problem for which 423 00:20:01,530 --> 00:20:02,910 you have to write the code. 424 00:20:02,910 --> 00:20:06,300 And I want to give you an idea of what kind of performance 425 00:20:06,300 --> 00:20:08,400 you get out of Python, OK? 426 00:20:08,400 --> 00:20:12,503 In addition, somebody has to write-- if there is a library 427 00:20:12,503 --> 00:20:13,920 routine, somebody had to write it, 428 00:20:13,920 --> 00:20:16,200 and that person was a performance engineer, 429 00:20:16,200 --> 00:20:18,400 because they wrote it to be as fast as possible. 430 00:20:18,400 --> 00:20:20,520 And so this will give you an idea of what you can 431 00:20:20,520 --> 00:20:23,190 do to make code run fast, OK? 432 00:20:23,190 --> 00:20:24,720 So when you run this code-- 433 00:20:24,720 --> 00:20:27,230 so you can see that the start time-- 434 00:20:27,230 --> 00:20:31,050 you know, before the triply-nested loop-- 435 00:20:31,050 --> 00:20:34,140 right here, before the triply-nested loop, 436 00:20:34,140 --> 00:20:36,242 we take a time measurement, and then we 437 00:20:36,242 --> 00:20:37,950 take another time measurement at the end, 438 00:20:37,950 --> 00:20:39,810 and then we print the difference. 
439 00:20:39,810 --> 00:20:43,290 And then that's just this classic triply-nested loop 440 00:20:43,290 --> 00:20:47,923 for matrix multiplication. 441 00:20:47,923 --> 00:20:50,090 And so, when you run this, how long is this run for, 442 00:20:50,090 --> 00:20:50,590 you think? 443 00:20:55,380 --> 00:20:57,800 Any guesses? 444 00:20:57,800 --> 00:20:58,640 Let's see. 445 00:20:58,640 --> 00:21:00,560 How about-- let's do this. 446 00:21:00,560 --> 00:21:02,840 It runs for 6 microseconds. 447 00:21:02,840 --> 00:21:06,000 Who thinks 6 microseconds? 448 00:21:06,000 --> 00:21:10,100 How about 6 milliseconds? 449 00:21:10,100 --> 00:21:11,780 How about-- 6 milliseconds. 450 00:21:11,780 --> 00:21:12,620 How about 6 seconds? 451 00:21:16,780 --> 00:21:17,625 How about 6 minutes? 452 00:21:20,210 --> 00:21:21,010 OK. 453 00:21:21,010 --> 00:21:22,926 How about 6 hours? 454 00:21:22,926 --> 00:21:24,785 [LAUGHTER] 455 00:21:24,785 --> 00:21:26,150 How about 6 days? 456 00:21:26,150 --> 00:21:27,325 [LAUGHTER] 457 00:21:27,325 --> 00:21:28,897 OK. 458 00:21:28,897 --> 00:21:30,980 Of course, it's important to know what size it is. 459 00:21:30,980 --> 00:21:36,890 It's 4,096 by 4,096, as it shows in the code, OK? 460 00:21:36,890 --> 00:21:39,520 And those of you who didn't vote-- 461 00:21:39,520 --> 00:21:40,320 wake up. 462 00:21:40,320 --> 00:21:41,190 Let's get active. 463 00:21:41,190 --> 00:21:42,870 This is active learning. 464 00:21:42,870 --> 00:21:44,460 Put yourself out there, OK? 465 00:21:44,460 --> 00:21:45,850 It doesn't matter whether you're right or wrong. 466 00:21:45,850 --> 00:21:47,760 There'll be a bunch of people who got the right answer, 467 00:21:47,760 --> 00:21:49,050 but they have no idea why. 468 00:21:49,050 --> 00:21:50,650 [LAUGHTER] 469 00:21:50,650 --> 00:21:51,540 OK? 470 00:21:51,540 --> 00:21:55,390 So it turns out, it takes about 21,000 seconds, 471 00:21:55,390 --> 00:21:58,280 which is about 6 hours. 472 00:21:58,280 --> 00:21:59,470 OK? 473 00:21:59,470 --> 00:21:59,970 Amazing. 474 00:21:59,970 --> 00:22:02,437 Is this fast? 475 00:22:02,437 --> 00:22:04,313 AUDIENCE: (SARCASTICALLY) Yeah. 476 00:22:04,313 --> 00:22:04,918 [LAUGHTER] 477 00:22:04,918 --> 00:22:06,210 CHARLES LEISERSON: Yeah, right. 478 00:22:06,210 --> 00:22:07,938 Duh, right? 479 00:22:07,938 --> 00:22:08,480 Is this fast? 480 00:22:08,480 --> 00:22:08,980 No. 481 00:22:08,980 --> 00:22:15,550 You know, how do we tell whether this is fast or not? 482 00:22:15,550 --> 00:22:16,100 OK? 483 00:22:16,100 --> 00:22:18,493 You know, what should we expect from our machine? 484 00:22:18,493 --> 00:22:19,910 So let's do a back-of-the-envelope 485 00:22:19,910 --> 00:22:22,050 calculation of-- 486 00:22:22,050 --> 00:22:24,775 [LAUGHTER] 487 00:22:24,775 --> 00:22:26,590 --of how many operations there are 488 00:22:26,590 --> 00:22:28,340 and how fast we ought to be able to do it. 489 00:22:28,340 --> 00:22:29,757 We just went through and said what 490 00:22:29,757 --> 00:22:31,760 are all the parameters of the machine. 491 00:22:31,760 --> 00:22:35,130 So there are 2n cubed operations that need to be performed. 492 00:22:35,130 --> 00:22:37,130 We're not doing Strassen's algorithm or anything 493 00:22:37,130 --> 00:22:37,630 like that. 494 00:22:37,630 --> 00:22:42,470 We're just doing a straight triply-nested loop. 495 00:22:42,470 --> 00:22:47,540 So that's 2 to the 37 floating point operations, OK? 
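[For the operation count: with n = 4,096 = 2^12, the 2n^3 figure works out to 2 × (2^12)^3 = 2 × 2^36 = 2^37, which is roughly 1.4 × 10^11 floating-point operations.]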
496 00:22:47,540 --> 00:22:51,920 The running time is 21,000 seconds, 497 00:22:51,920 --> 00:22:56,840 so that says that we're getting about 6.25 megaflops out 498 00:22:56,840 --> 00:23:01,250 of our machine when we run that code, OK? 499 00:23:01,250 --> 00:23:04,910 Just by dividing it out, how many floating-point operations 500 00:23:04,910 --> 00:23:06,020 per second do we get? 501 00:23:06,020 --> 00:23:11,780 We take the number of operations divided by the time, OK? 502 00:23:11,780 --> 00:23:17,920 The peak, as you recall, was about 836 gigaflops, OK? 503 00:23:17,920 --> 00:23:23,350 And we're getting 6.25 megaflops, OK? 504 00:23:23,350 --> 00:23:34,810 So we're getting about 0.00075% of peak, OK? 505 00:23:34,810 --> 00:23:38,280 This is not fast. 506 00:23:38,280 --> 00:23:38,780 OK? 507 00:23:38,780 --> 00:23:39,605 This is not fast. 508 00:23:42,750 --> 00:23:45,330 So let's do something really simple. 509 00:23:45,330 --> 00:23:49,800 Let's code it in Java rather than Python, OK? 510 00:23:49,800 --> 00:23:51,720 So we take just that loop. 511 00:23:51,720 --> 00:23:54,303 The code is almost the same, OK? 512 00:23:54,303 --> 00:23:55,470 Just the triply-nested loop. 513 00:23:55,470 --> 00:23:58,200 We run it in Java, OK? 514 00:23:58,200 --> 00:24:00,510 And the running time now, it turns out, 515 00:24:00,510 --> 00:24:05,080 is about just under 3,000 seconds, 516 00:24:05,080 --> 00:24:08,750 which is about 46 minutes. 517 00:24:08,750 --> 00:24:11,620 The same code. 518 00:24:11,620 --> 00:24:13,820 Python, Java, OK? 519 00:24:13,820 --> 00:24:20,170 We got almost a 9 times speedup just simply 520 00:24:20,170 --> 00:24:24,220 coding it in a different language, OK? 521 00:24:24,220 --> 00:24:27,432 Well, let's try C. That's the language we're 522 00:24:27,432 --> 00:24:28,390 going to be using here. 523 00:24:28,390 --> 00:24:30,430 What happens when you code it in C? 524 00:24:30,430 --> 00:24:33,220 It's exactly the same thing, OK? 525 00:24:33,220 --> 00:24:37,360 We're going to use the Clang/LLVM 5.0 compiler. 526 00:24:37,360 --> 00:24:40,560 I believe we're using 6.0 this term, is that right? 527 00:24:40,560 --> 00:24:41,250 Yeah. 528 00:24:41,250 --> 00:24:45,980 OK, I should have rerun these numbers for 6.0, but I didn't. 529 00:24:45,980 --> 00:24:48,850 So now, it's basically 1,100 seconds, 530 00:24:48,850 --> 00:24:52,800 which is about 19 minutes, so we got, then, about-- 531 00:24:52,800 --> 00:24:56,200 it's twice as fast as Java and about 18 times faster 532 00:24:56,200 --> 00:24:58,120 than Python, OK? 533 00:24:58,120 --> 00:25:00,620 So here's where we stand so far. 534 00:25:00,620 --> 00:25:01,120 OK? 535 00:25:01,120 --> 00:25:04,990 We have the running time of these various things, OK? 536 00:25:04,990 --> 00:25:08,740 And the relative speedup is how much faster it 537 00:25:08,740 --> 00:25:11,920 is than the previous row, and the absolute speedup is 538 00:25:11,920 --> 00:25:14,170 how it is compared to the first row, 539 00:25:14,170 --> 00:25:22,340 and now we're managing to get, now, 0.014% of peak. 540 00:25:22,340 --> 00:25:27,770 So we're still slow, but before we go and try 541 00:25:27,770 --> 00:25:30,020 to optimize it further-- 542 00:25:30,020 --> 00:25:33,740 like, why is Python so slow and C so fast? 543 00:25:33,740 --> 00:25:36,880 Does anybody know? 
544 00:25:36,880 --> 00:25:39,290 AUDIENCE: Python is interpreted, so it has to-- 545 00:25:39,290 --> 00:25:41,150 so there's a C program that basically 546 00:25:41,150 --> 00:25:43,632 parses Python pycode instructions, which 547 00:25:43,632 --> 00:25:45,463 takes up most of the time. 548 00:25:45,463 --> 00:25:46,380 CHARLES LEISERSON: OK. 549 00:25:46,380 --> 00:25:50,280 That's kind of on the right track. 550 00:25:50,280 --> 00:25:52,570 Anybody else have any-- 551 00:25:52,570 --> 00:25:55,830 articulate a little bit why Python is so slow? 552 00:25:55,830 --> 00:25:56,663 Yeah? 553 00:25:56,663 --> 00:25:58,830 AUDIENCE: When you write, like, multiplying and add, 554 00:25:58,830 --> 00:26:00,980 those aren't the only instructions Python's doing. 555 00:26:00,980 --> 00:26:02,770 It's doing lots of code for, like, 556 00:26:02,770 --> 00:26:06,340 going through Python objects and integers and blah-blah-blah. 557 00:26:06,340 --> 00:26:08,320 CHARLES LEISERSON: Yeah, yeah. 558 00:26:08,320 --> 00:26:09,950 OK, good. 559 00:26:09,950 --> 00:26:14,060 So the big reason why Python is slow and C is so fast 560 00:26:14,060 --> 00:26:19,760 is that Python is interpreted and C is compiled directly 561 00:26:19,760 --> 00:26:21,782 to machine code. 562 00:26:21,782 --> 00:26:23,240 And Java is somewhere in the middle 563 00:26:23,240 --> 00:26:26,000 because Java is compiled to bytecode, which is then 564 00:26:26,000 --> 00:26:29,480 interpreted and then just-in-time compiled 565 00:26:29,480 --> 00:26:30,440 into machine codes. 566 00:26:30,440 --> 00:26:34,040 So let me talk a little bit about these things. 567 00:26:34,040 --> 00:26:39,995 So interpreters, such as in Python, are versatile but slow. 568 00:26:42,352 --> 00:26:44,060 It's one of these things where they said, 569 00:26:44,060 --> 00:26:45,852 we're going to take some of our performance 570 00:26:45,852 --> 00:26:47,720 and use it to make a more flexible, 571 00:26:47,720 --> 00:26:49,310 easier-to-program environment. 572 00:26:49,310 --> 00:26:50,630 OK? 573 00:26:50,630 --> 00:26:52,880 The interpreter basically reads, interprets, 574 00:26:52,880 --> 00:26:55,790 and performs each program statement and then 575 00:26:55,790 --> 00:26:57,770 updates the machine state. 576 00:26:57,770 --> 00:27:00,500 So it's not just-- it's actually going through and, each time, 577 00:27:00,500 --> 00:27:05,340 reading your code, figuring out what it does, 578 00:27:05,340 --> 00:27:06,800 and then implementing it. 579 00:27:06,800 --> 00:27:10,010 So there's like all this overhead compared 580 00:27:10,010 --> 00:27:12,500 to just doing its operations. 581 00:27:12,500 --> 00:27:15,290 So interpreters can easily support high-level programming 582 00:27:15,290 --> 00:27:18,200 features, and they can do things like dynamic code alteration 583 00:27:18,200 --> 00:27:21,480 and so forth at the cost of performance. 584 00:27:21,480 --> 00:27:24,590 So that, you know, typically the cycle for an interpreter 585 00:27:24,590 --> 00:27:28,070 is, you read the next statement, you interpret the statement. 586 00:27:28,070 --> 00:27:29,810 You then perform the statement, and then 587 00:27:29,810 --> 00:27:31,970 you update the state of the machine, 588 00:27:31,970 --> 00:27:34,590 and then you fetch the next instruction. 589 00:27:34,590 --> 00:27:35,090 OK? 590 00:27:35,090 --> 00:27:37,250 And you're going through that each time, 591 00:27:37,250 --> 00:27:39,220 and that's done in software. 592 00:27:39,220 --> 00:27:39,860 OK? 
593 00:27:39,860 --> 00:27:42,772 When you have things compiled to machine code, 594 00:27:42,772 --> 00:27:44,480 it goes through a similar thing, but it's 595 00:27:44,480 --> 00:27:48,890 highly optimized just for the things that machines do. 596 00:27:48,890 --> 00:27:49,790 OK? 597 00:27:49,790 --> 00:27:52,550 And so when you compile, you're able to take advantage 598 00:27:52,550 --> 00:27:56,060 of the hardware interpreter of machine instructions, 599 00:27:56,060 --> 00:28:00,560 and that's much, much lower overhead than the big software 600 00:28:00,560 --> 00:28:02,300 overhead you get with Python. 601 00:28:02,300 --> 00:28:06,042 Now, JIT is somewhere in the middle, what's used in Java. 602 00:28:06,042 --> 00:28:08,125 JIT compilers can recover some of the performance, 603 00:28:08,125 --> 00:28:11,360 and in fact it did a pretty good job in this case. 604 00:28:11,360 --> 00:28:14,660 The idea is, when the code is first executed, 605 00:28:14,660 --> 00:28:17,780 it's interpreted, and then the runtime system 606 00:28:17,780 --> 00:28:21,980 keeps track of how often the various pieces of code 607 00:28:21,980 --> 00:28:23,012 are executed. 608 00:28:23,012 --> 00:28:25,220 And whenever it finds that there's some piece of code 609 00:28:25,220 --> 00:28:27,890 that it's executing frequently, it then 610 00:28:27,890 --> 00:28:31,430 calls the compiler to compile that piece of code, 611 00:28:31,430 --> 00:28:34,500 and then subsequent to that, it runs the compiled code. 612 00:28:34,500 --> 00:28:39,050 So it tries to get the big advantage of performance 613 00:28:39,050 --> 00:28:43,435 by only compiling the things that are necessary-- 614 00:28:43,435 --> 00:28:44,810 you know, for which it's actually 615 00:28:44,810 --> 00:28:48,720 going to pay off to invoke the compiler to do. 616 00:28:48,720 --> 00:28:49,220 OK? 617 00:28:52,665 --> 00:28:54,290 So anyway, so that's the big difference 618 00:28:54,290 --> 00:28:57,140 with those kinds of things. 619 00:28:57,140 --> 00:28:59,870 One of the reasons we don't use Python in this class 620 00:28:59,870 --> 00:29:05,460 is because the performance model is hard to figure out. 621 00:29:05,460 --> 00:29:05,960 OK? 622 00:29:05,960 --> 00:29:11,150 C is much closer to the metal, much closer to the silicon, OK? 623 00:29:11,150 --> 00:29:13,430 And so it's much easier to figure out what's 624 00:29:13,430 --> 00:29:17,000 going on in that context. 625 00:29:17,000 --> 00:29:18,620 OK? 626 00:29:18,620 --> 00:29:21,020 But we will have a guest lecture where we're 627 00:29:21,020 --> 00:29:23,480 going to talk about performance in managed 628 00:29:23,480 --> 00:29:26,738 languages like Python, so it's not 629 00:29:26,738 --> 00:29:28,280 that we're going to ignore the topic. 630 00:29:28,280 --> 00:29:32,285 But we will learn how to do performance engineering 631 00:29:32,285 --> 00:29:34,660 in a place where it's easier to do it. 632 00:29:34,660 --> 00:29:36,290 OK?
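As a rough sketch of the kind of C code being discussed (an illustration, not the course's actual benchmark; it assumes n = 4,096 and statically allocated n-by-n arrays of doubles), the triply-nested loop with its timing looks something like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define n 4096
    double A[n][n], B[n][n], C[n][n];

    int main(void) {
        // Fill A and B with arbitrary values; C starts at zero.
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                A[i][j] = (double)rand() / RAND_MAX;
                B[i][j] = (double)rand() / RAND_MAX;
                C[i][j] = 0.0;
            }

        struct timeval start, end;
        gettimeofday(&start, NULL);

        // Classic i, j, k loop order.
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    C[i][j] += A[i][k] * B[k][j];

        gettimeofday(&end, NULL);
        printf("%0.6f seconds\n",
               (end.tv_sec - start.tv_sec) + 1e-6 * (end.tv_usec - start.tv_usec));
        return 0;
    }

A fuller benchmark would also report the flop rate, that is, 2n^3 divided by the elapsed time, which is the calculation done above.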
633 00:29:36,290 --> 00:29:41,830 Now, one of the things that a good compiler will do is-- 634 00:29:41,830 --> 00:29:43,970 you know, once you get to-- 635 00:29:43,970 --> 00:29:45,677 let's say we have the C version, which 636 00:29:45,677 --> 00:29:47,510 is where we're going to move from this point 637 00:29:47,510 --> 00:29:50,600 because that's the fastest we got so far-- 638 00:29:50,600 --> 00:29:53,120 is, it turns out that you can change 639 00:29:53,120 --> 00:29:55,100 the order of loops in this program 640 00:29:55,100 --> 00:29:56,850 without affecting the correctness. 641 00:29:56,850 --> 00:29:57,350 OK? 642 00:29:57,350 --> 00:29:58,570 So here we went-- 643 00:29:58,570 --> 00:30:01,830 you know, for i, for j, for k, do the update. 644 00:30:01,830 --> 00:30:02,390 OK? 645 00:30:02,390 --> 00:30:06,320 We could otherwise do-- we could do, for i, for k, for j, 646 00:30:06,320 --> 00:30:10,610 do the update, and it computes exactly the same thing. 647 00:30:10,610 --> 00:30:16,550 Or we could do, for k, for j, for i, do the updates. 648 00:30:16,550 --> 00:30:17,570 OK? 649 00:30:17,570 --> 00:30:20,780 So we can change the order without affecting 650 00:30:20,780 --> 00:30:22,670 the correctness, OK? 651 00:30:22,670 --> 00:30:29,930 And so do you think the order of loops matters for performance? 652 00:30:29,930 --> 00:30:30,665 Duh. 653 00:30:30,665 --> 00:30:32,540 You know, this is like this leading question. 654 00:30:32,540 --> 00:30:32,980 Yeah? 655 00:30:32,980 --> 00:30:33,480 Question? 656 00:30:33,480 --> 00:30:36,960 AUDIENCE: Maybe for cache localities. 657 00:30:36,960 --> 00:30:37,960 CHARLES LEISERSON: Yeah. 658 00:30:37,960 --> 00:30:38,460 OK. 659 00:30:38,460 --> 00:30:39,520 And you're exactly right. 660 00:30:39,520 --> 00:30:40,910 Cache locality is what it is. 661 00:30:40,910 --> 00:30:44,430 So when we do that, we get-- 662 00:30:44,430 --> 00:30:49,890 the loop order affects the running time by a factor of 18. 663 00:30:49,890 --> 00:30:50,400 Whoa. 664 00:30:50,400 --> 00:30:52,800 Just by switching the order. 665 00:30:52,800 --> 00:30:53,300 OK? 666 00:30:53,300 --> 00:30:54,920 What's going on there? 667 00:30:54,920 --> 00:30:56,030 OK? 668 00:30:56,030 --> 00:30:57,410 What's going on? 669 00:30:57,410 --> 00:31:00,050 So we're going to talk about this in more depth, so I'm just 670 00:31:00,050 --> 00:31:02,008 going to fly through this, because this is just 671 00:31:02,008 --> 00:31:05,330 sort of showing you the kinds of considerations that you do. 672 00:31:05,330 --> 00:31:09,590 So in hardware, each processor reads and writes 673 00:31:09,590 --> 00:31:13,060 main memory in contiguous blocks called cache lines. 674 00:31:13,060 --> 00:31:13,640 OK? 675 00:31:13,640 --> 00:31:15,380 Previously accessed cache lines are 676 00:31:15,380 --> 00:31:17,780 stored in a small memory called cache 677 00:31:17,780 --> 00:31:20,697 that sits near the processor. 678 00:31:20,697 --> 00:31:22,280 When the processor accesses something, 679 00:31:22,280 --> 00:31:24,500 if it's in the cache, you get a hit. 680 00:31:24,500 --> 00:31:27,380 That's very cheap, OK, and fast. 681 00:31:27,380 --> 00:31:30,590 If you miss, you have to go out to either a deeper level 682 00:31:30,590 --> 00:31:32,630 cache or all the way out to main memory. 683 00:31:32,630 --> 00:31:34,370 That is much, much slower, and we'll 684 00:31:34,370 --> 00:31:37,560 talk about that kind of thing. 
685 00:31:37,560 --> 00:31:41,830 So what happens for this matrix problem is, 686 00:31:41,830 --> 00:31:45,630 the matrices are laid out in memory in row-major order. 687 00:31:45,630 --> 00:31:46,817 That means you take-- 688 00:31:46,817 --> 00:31:48,650 you know, you have a two-dimensional matrix. 689 00:31:48,650 --> 00:31:50,420 It's laid out in the linear order 690 00:31:50,420 --> 00:31:53,930 of the addresses of memory by essentially taking row 1, 691 00:31:53,930 --> 00:31:58,100 and then, after row 1, stick row 2, and after that, stick row 3, 692 00:31:58,100 --> 00:31:59,900 and so forth, and unfolding it. 693 00:31:59,900 --> 00:32:02,150 There's another order that things could have been laid 694 00:32:02,150 --> 00:32:03,950 out-- in fact, they are in Fortran-- 695 00:32:03,950 --> 00:32:06,760 which is called column-major order. 696 00:32:06,760 --> 00:32:07,260 OK? 697 00:32:07,260 --> 00:32:11,220 So it turns out C and Fortran operate in different orders. 698 00:32:11,220 --> 00:32:11,720 OK? 699 00:32:11,720 --> 00:32:13,387 And it turns out it affects performance, 700 00:32:13,387 --> 00:32:14,840 which way it does it. 701 00:32:14,840 --> 00:32:17,780 So let's just take a look at the access pattern for order 702 00:32:17,780 --> 00:32:19,800 i, j, k. 703 00:32:19,800 --> 00:32:20,450 OK? 704 00:32:20,450 --> 00:32:24,590 So what we're doing is, once we figure out what i and what j 705 00:32:24,590 --> 00:32:27,740 is, we're going to go through and cycle through k. 706 00:32:27,740 --> 00:32:32,000 And as we cycle through k, OK, C i, j 707 00:32:32,000 --> 00:32:33,710 stays the same for everything. 708 00:32:33,710 --> 00:32:36,530 We get for that excellent spatial locality 709 00:32:36,530 --> 00:32:38,735 because we're just accessing the same location. 710 00:32:38,735 --> 00:32:40,610 Every single time, it's going to be in cache. 711 00:32:40,610 --> 00:32:41,890 It's always going to be there. 712 00:32:41,890 --> 00:32:44,300 It's going to be fast to access C. 713 00:32:44,300 --> 00:32:48,770 For A, what happens is, we go through in a linear order, 714 00:32:48,770 --> 00:32:50,510 and we get good spatial locality. 715 00:32:50,510 --> 00:32:53,330 But for B, it's going through columns, 716 00:32:53,330 --> 00:32:57,260 and those points are distributed far away in memory, 717 00:32:57,260 --> 00:33:00,350 so the processor is going to be bringing in 64 bytes 718 00:33:00,350 --> 00:33:03,170 to operate on a particular datum. 719 00:33:03,170 --> 00:33:03,670 OK? 720 00:33:03,670 --> 00:33:09,770 And then it's ignoring 7 of the 8 floating-point words 721 00:33:09,770 --> 00:33:13,410 on that cache line and going to the next one. 722 00:33:13,410 --> 00:33:15,710 So it's wasting an awful lot, OK? 723 00:33:15,710 --> 00:33:17,600 So this one has good spatial locality 724 00:33:17,600 --> 00:33:20,420 in that it's all adjacent and you would use the cache 725 00:33:20,420 --> 00:33:22,580 lines effectively. 726 00:33:22,580 --> 00:33:25,950 This one, you're going 4,096 elements apart. 727 00:33:25,950 --> 00:33:30,340 It's got poor spatial locality, OK? 728 00:33:30,340 --> 00:33:32,090 And that's for this one. 729 00:33:32,090 --> 00:33:34,800 So then if we look at the different other ones-- 730 00:33:34,800 --> 00:33:37,280 this one, the order i, k, j-- 731 00:33:37,280 --> 00:33:40,910 it turns out you get good spatial locality for both C 732 00:33:40,910 --> 00:33:44,230 and B and excellent for A. OK? 
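In code, that better ordering just permutes the loop nest while leaving the update alone; a sketch of the i, k, j version:

    // Same update as before; only the nesting order changes.
    // The innermost loop now walks j, so C[i][j] and B[k][j] are both
    // traversed in row-major (consecutive) order, and A[i][k] stays
    // fixed inside the inner loop.
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];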
733 00:33:44,230 --> 00:33:47,102 And if you look at even another one, 734 00:33:47,102 --> 00:33:49,060 you don't get nearly as good as the other ones, 735 00:33:49,060 --> 00:33:50,820 so there's a whole range of things. 736 00:33:50,820 --> 00:33:51,320 OK? 737 00:33:51,320 --> 00:33:54,700 This one, you're doing optimally badly in both, OK? 738 00:33:54,700 --> 00:33:57,640 And so you can just measure the different ones, 739 00:33:57,640 --> 00:34:04,430 and it turns out that you can use a tool to figure this out. 740 00:34:04,430 --> 00:34:08,679 And the tool that we'll be using is called Cachegrind. 741 00:34:08,679 --> 00:34:13,156 And it's part of the Valgrind suite of tools. 742 00:34:13,156 --> 00:34:14,739 And what it'll do is, it will tell you 743 00:34:14,739 --> 00:34:17,659 what the miss rates are for the various pieces of code. 744 00:34:17,659 --> 00:34:18,159 OK? 745 00:34:18,159 --> 00:34:20,710 And you'll learn how to use that tool and figure out, 746 00:34:20,710 --> 00:34:21,830 oh, look at that. 747 00:34:21,830 --> 00:34:24,280 We have a high miss rate for some and not for others, 748 00:34:24,280 --> 00:34:27,429 so that may be why my code is running slowly. 749 00:34:27,429 --> 00:34:28,480 OK? 750 00:34:28,480 --> 00:34:30,909 So when you pick the best one of those, 751 00:34:30,909 --> 00:34:36,699 OK, we then got a relative speedup of about 6 and 1/2. 752 00:34:36,699 --> 00:34:38,800 So what other simple changes can we try? 753 00:34:38,800 --> 00:34:40,780 There's actually a collection of things 754 00:34:40,780 --> 00:34:46,373 that we could do that don't even have us touching the code. 755 00:34:46,373 --> 00:34:48,040 What else could we do, for people who've 756 00:34:48,040 --> 00:34:50,909 played with compilers and such? 757 00:34:50,909 --> 00:34:51,420 Hint, hint. 758 00:34:54,199 --> 00:34:54,733 Yeah? 759 00:34:54,733 --> 00:34:56,650 AUDIENCE: You could change the compiler flags. 760 00:34:56,650 --> 00:34:59,530 CHARLES LEISERSON: Yeah, change the compiler flags, OK? 761 00:34:59,530 --> 00:35:02,620 So Clang, which is the compiler that we'll be using, 762 00:35:02,620 --> 00:35:05,800 provides a collection of optimization switches, 763 00:35:05,800 --> 00:35:08,320 and you can specify a switch to the compiler 764 00:35:08,320 --> 00:35:10,520 to ask it to optimize. 765 00:35:10,520 --> 00:35:14,080 So you do minus O, and then a number. 766 00:35:14,080 --> 00:35:16,930 And 0, if you look at the documentation, it says, 767 00:35:16,930 --> 00:35:18,530 "Do not optimize." 768 00:35:18,530 --> 00:35:20,530 1 says, "Optimize." 769 00:35:20,530 --> 00:35:22,930 2 says, "Optimize even more." 770 00:35:22,930 --> 00:35:24,960 3 says, "Optimize yet more." 771 00:35:24,960 --> 00:35:25,960 OK? 772 00:35:25,960 --> 00:35:29,020 In this case, it turns out that even though it optimized more 773 00:35:29,020 --> 00:35:33,890 in O3, it turns out O2 was a better setting. 774 00:35:33,890 --> 00:35:34,780 OK? 775 00:35:34,780 --> 00:35:35,920 This is one of these cases. 776 00:35:35,920 --> 00:35:37,240 It doesn't happen all the time. 777 00:35:37,240 --> 00:35:41,050 Usually, O3 does better than O2, but in this case O2 778 00:35:41,050 --> 00:35:43,030 actually optimized better than O3, 779 00:35:43,030 --> 00:35:46,520 because the optimizations are to some extent heuristic. 780 00:35:46,520 --> 00:35:47,530 OK? 781 00:35:47,530 --> 00:35:49,870 And there are also other kinds of optimization.
782 00:35:49,870 --> 00:35:53,740 You can have it do profile-guided optimization, 783 00:35:53,740 --> 00:35:57,730 where you look at what the performance was and feed that 784 00:35:57,730 --> 00:36:01,630 back into the code, and then the compiler 785 00:36:01,630 --> 00:36:04,083 can be smarter about how it optimizes. 786 00:36:04,083 --> 00:36:05,750 And there are a variety of other things. 787 00:36:05,750 --> 00:36:11,650 So with this simple technology, choosing a good optimization 788 00:36:11,650 --> 00:36:13,750 flag-- in this case, O2-- 789 00:36:13,750 --> 00:36:20,270 we got for free, basically, a factor of 3.25, OK? 790 00:36:20,270 --> 00:36:24,350 Without having to do much work at all, OK? 791 00:36:24,350 --> 00:36:27,650 And now we're actually starting to approach 792 00:36:27,650 --> 00:36:29,510 1% of peak performance. 793 00:36:29,510 --> 00:36:33,680 We've got 0.3% of peak performance, OK? 794 00:36:33,680 --> 00:36:36,515 So what's causing the low performance? 795 00:36:36,515 --> 00:36:38,390 Why aren't we getting most of the performance 796 00:36:38,390 --> 00:36:40,130 out of this machine? 797 00:36:40,130 --> 00:36:40,990 Why do you think? 798 00:36:40,990 --> 00:36:41,700 Yeah? 799 00:36:41,700 --> 00:36:43,500 AUDIENCE: We're not using all the cores. 800 00:36:43,500 --> 00:36:44,730 CHARLES LEISERSON: Yeah, we're not using all the cores. 801 00:36:44,730 --> 00:36:46,310 So far we're using just one core, 802 00:36:46,310 --> 00:36:48,812 and how many cores do we have? 803 00:36:48,812 --> 00:36:49,760 AUDIENCE: 18. 804 00:36:49,760 --> 00:36:52,220 CHARLES LEISERSON: 18, right? 805 00:36:52,220 --> 00:36:53,190 18 cores. 806 00:36:53,190 --> 00:36:54,540 Ah! 807 00:36:54,540 --> 00:36:58,170 18 cores just sitting there, 17 sitting 808 00:36:58,170 --> 00:37:01,260 idle, while we are trying to optimize one. 809 00:37:01,260 --> 00:37:02,460 OK. 810 00:37:02,460 --> 00:37:04,680 So multicore. 811 00:37:04,680 --> 00:37:08,810 So we have 9 cores per chip, and there are 2 of these chips 812 00:37:08,810 --> 00:37:10,510 in our test machine. 813 00:37:10,510 --> 00:37:14,280 So we're running on just one of them, so let's use them all. 814 00:37:14,280 --> 00:37:19,650 To do that, we're going to use the Cilk infrastructure, 815 00:37:19,650 --> 00:37:22,170 and in particular, we can use what's 816 00:37:22,170 --> 00:37:27,960 called a parallel loop, which in Cilk, you'd call cilk_for, 817 00:37:27,960 --> 00:37:30,720 and so you just relabel that outer loop-- for example, 818 00:37:30,720 --> 00:37:32,760 in this case, you say cilk_for, it says, 819 00:37:32,760 --> 00:37:35,670 do all those iterations in parallel. 820 00:37:35,670 --> 00:37:40,470 The compiler and runtime system are free to schedule them 821 00:37:40,470 --> 00:37:41,545 and so forth. 822 00:37:41,545 --> 00:37:42,360 OK? 823 00:37:42,360 --> 00:37:47,800 And we could also do it for the inner loop, OK? 824 00:37:47,800 --> 00:37:54,010 And it turns out you can't also do it for the middle loop, 825 00:37:54,010 --> 00:37:54,990 if you think about it. 826 00:37:54,990 --> 00:37:55,490 OK? 827 00:37:55,490 --> 00:37:57,698 So I'll leave that as a little bit of a homework 828 00:37:57,698 --> 00:38:01,900 problem-- why can't I just do a cilk_for of the middle loop? 829 00:38:01,900 --> 00:38:04,150 OK? 830 00:38:04,150 --> 00:38:07,660 So the question is, which parallel version works best?
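As a sketch of what that looks like (assuming a Cilk-enabled compiler and the cilk/cilk.h header; the exact compiler flags and runtime setup are details covered later in the course), parallelizing just the outer loop of the i, k, j version:

    #include <cilk/cilk.h>

    // cilk_for on the outer i loop: iterations may run in parallel,
    // and each iteration writes a different row of C, so they do not
    // interfere with one another.
    cilk_for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];

The variants compared next differ only in which of these loops gets the cilk_for.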
831 00:38:07,660 --> 00:38:11,290 So we can parallelize the i loop, we can parallelize the j loop, 832 00:38:11,290 --> 00:38:13,250 and we can do i and j together. 833 00:38:13,250 --> 00:38:15,490 You can't do k just with a parallel loop 834 00:38:15,490 --> 00:38:17,050 and expect to get the right thing. 835 00:38:17,050 --> 00:38:18,350 OK? 836 00:38:18,350 --> 00:38:19,570 And that's this one. 837 00:38:19,570 --> 00:38:21,830 So if you look-- 838 00:38:21,830 --> 00:38:22,330 wow! 839 00:38:22,330 --> 00:38:25,010 What a spread of running times, right? 840 00:38:25,010 --> 00:38:25,840 OK? 841 00:38:25,840 --> 00:38:30,640 If I parallelize just the i loop, it's 3.18 seconds, 842 00:38:30,640 --> 00:38:35,260 and if I parallelize the j loop, it actually slows down, 843 00:38:35,260 --> 00:38:37,630 I think, right? 844 00:38:37,630 --> 00:38:40,550 And then, if I do both i and j, it's still bad. 845 00:38:40,550 --> 00:38:42,650 I just want to do the outer loop there. 846 00:38:42,650 --> 00:38:45,350 This has to do, it turns out, with scheduling overhead, 847 00:38:45,350 --> 00:38:47,420 and we'll learn about scheduling overhead 848 00:38:47,420 --> 00:38:49,112 and how you predict that and such. 849 00:38:49,112 --> 00:38:51,320 So the rule of thumb here is, parallelize outer loops 850 00:38:51,320 --> 00:38:53,660 rather than inner loops, OK? 851 00:38:53,660 --> 00:38:55,250 And so when we do parallel loops, 852 00:38:55,250 --> 00:38:59,970 we get an almost 18x speedup on 18 cores, OK? 853 00:38:59,970 --> 00:39:02,910 So let me assure you, not all code 854 00:39:02,910 --> 00:39:04,740 is that easy to parallelize. 855 00:39:04,740 --> 00:39:05,460 OK? 856 00:39:05,460 --> 00:39:07,500 But this one happens to be. 857 00:39:07,500 --> 00:39:13,150 So now we're up to, what, about just over 5% of peak. 858 00:39:13,150 --> 00:39:13,650 OK? 859 00:39:13,650 --> 00:39:18,210 So where are we losing time here? 860 00:39:18,210 --> 00:39:20,280 OK, why are we getting just 5%? 861 00:39:20,280 --> 00:39:21,486 Yeah? 862 00:39:21,486 --> 00:39:25,470 AUDIENCE: So another area of the parallelism that [INAUDIBLE].. 863 00:39:28,956 --> 00:39:31,882 So we could, for example, vectorize the multiplication. 864 00:39:31,882 --> 00:39:32,840 CHARLES LEISERSON: Yep. 865 00:39:32,840 --> 00:39:33,300 Good. 866 00:39:33,300 --> 00:39:34,800 So that's one, and there's one other 867 00:39:34,800 --> 00:39:36,750 that we're not using very effectively. 868 00:39:36,750 --> 00:39:37,250 OK. 869 00:39:37,250 --> 00:39:39,208 That's one, and those are the two optimizations 870 00:39:39,208 --> 00:39:43,370 we're going to do to get a really good code here. 871 00:39:43,370 --> 00:39:44,910 So what's the other one? 872 00:39:44,910 --> 00:39:45,623 Yeah? 873 00:39:45,623 --> 00:39:48,934 AUDIENCE: The multiply and add operation. 874 00:39:52,210 --> 00:39:53,960 CHARLES LEISERSON: That's actually related 875 00:39:53,960 --> 00:39:56,780 to the same question, OK? 876 00:39:56,780 --> 00:40:01,310 But there's another completely different source of opportunity 877 00:40:01,310 --> 00:40:02,090 here. 878 00:40:02,090 --> 00:40:02,833 Yeah? 879 00:40:02,833 --> 00:40:05,500 AUDIENCE: We could also do a lot better on our handling of cache 880 00:40:05,500 --> 00:40:05,620 misses. 881 00:40:05,620 --> 00:40:06,680 CHARLES LEISERSON: Yeah. 882 00:40:06,680 --> 00:40:09,620 OK, we can actually manage the cache misses better. 883 00:40:09,620 --> 00:40:10,340 OK?
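As a concrete sketch of the winning variant-- only the outermost loop becomes a cilk_for-- something like the following (illustrative only, not the course's reference code; it assumes n-by-n row-major matrices and the cache-friendly i, k, j loop ordering):

    #include <cilk/cilk.h>

    // C += A * B for n x n row-major matrices, with only the outer loop parallel.
    // Each cilk_for iteration owns its own row of C, so there are no races.
    void mm_parallel_outer(double *restrict C, const double *restrict A,
                           const double *restrict B, int n) {
      cilk_for (int i = 0; i < n; ++i)      // iterations may run on different cores
        for (int k = 0; k < n; ++k)
          for (int j = 0; j < n; ++j)       // innermost accesses are stride-1
            C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

Putting cilk_for on the j loop instead would be legal but, as the timings above show, the per-iteration work is then too small relative to the scheduling overhead.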
884 00:40:10,340 --> 00:40:13,490 So let's go back to hardware caches, 885 00:40:13,490 --> 00:40:15,740 and let's restructure the computation 886 00:40:15,740 --> 00:40:18,890 to reuse data in the cache as much as possible. 887 00:40:18,890 --> 00:40:21,650 Because cache misses are slow, and hits are fast. 888 00:40:21,650 --> 00:40:23,240 And try to make the most of the cache 889 00:40:23,240 --> 00:40:25,710 by reusing the data that's already there. 890 00:40:25,710 --> 00:40:26,480 OK? 891 00:40:26,480 --> 00:40:28,440 So let's just take a look. 892 00:40:28,440 --> 00:40:34,130 Suppose that we're going to just compute one row of C, OK? 893 00:40:34,130 --> 00:40:37,160 So we go through one row of C. That's going to take us-- 894 00:40:37,160 --> 00:40:41,990 since it's a 4,096-long vector there, 895 00:40:41,990 --> 00:40:46,020 that's going to basically be 4,096 writes that we're going 896 00:40:46,020 --> 00:40:47,130 to do. 897 00:40:47,130 --> 00:40:47,700 OK? 898 00:40:47,700 --> 00:40:50,180 And we're going to get some spatial locality there, 899 00:40:50,180 --> 00:40:52,280 which is good, but we're basically 900 00:40:52,280 --> 00:40:56,150 doing-- the processor's doing 4,096 writes. 901 00:40:56,150 --> 00:41:04,380 Now, to compute that row, I need to access 4,096 reads 902 00:41:04,380 --> 00:41:06,500 from A, OK? 903 00:41:06,500 --> 00:41:11,940 And I need all of B, OK? 904 00:41:11,940 --> 00:41:16,280 Because I go each column of B as I'm 905 00:41:16,280 --> 00:41:21,580 going through to fully compute C. Do people see that? 906 00:41:21,580 --> 00:41:22,290 OK. 907 00:41:22,290 --> 00:41:27,900 So I need-- to just compute one row of C, 908 00:41:27,900 --> 00:41:32,370 I need to access one row of A and all of B. OK? 909 00:41:32,370 --> 00:41:36,000 Because the first element of C needs the whole first column 910 00:41:36,000 --> 00:41:42,180 of B. The second element of C needs the whole second column 911 00:41:42,180 --> 00:41:44,557 of B. Once again, don't worry if you don't fully 912 00:41:44,557 --> 00:41:46,140 understand this, because right now I'm 913 00:41:46,140 --> 00:41:48,110 just ripping through this at high speed. 914 00:41:48,110 --> 00:41:50,610 We're going to go into this in much more depth in the class, 915 00:41:50,610 --> 00:41:52,980 and there'll be plenty of time to master this stuff. 916 00:41:52,980 --> 00:41:54,702 But the main thing to understand is, 917 00:41:54,702 --> 00:41:56,160 you're going through all of B, then 918 00:41:56,160 --> 00:41:58,008 I want to compute another row of C. 919 00:41:58,008 --> 00:41:59,300 I'm going to do the same thing. 920 00:41:59,300 --> 00:42:02,130 I'm going to go through one row of A and all of B 921 00:42:02,130 --> 00:42:07,210 again, so that when I'm done we do about 16 million, 922 00:42:07,210 --> 00:42:10,170 17 million memory accesses total. 923 00:42:10,170 --> 00:42:11,010 OK? 924 00:42:11,010 --> 00:42:13,910 That's a lot of memory access. 925 00:42:13,910 --> 00:42:15,680 So what if, instead of doing that, 926 00:42:15,680 --> 00:42:18,470 I do things in blocks, OK? 927 00:42:18,470 --> 00:42:23,270 So what if I want to compute a 64 by 64 block of C 928 00:42:23,270 --> 00:42:25,760 rather than a row of C? 929 00:42:25,760 --> 00:42:27,570 So let's take a look at what happens there. 930 00:42:27,570 --> 00:42:29,450 So remember, by the way, this number-- 931 00:42:29,450 --> 00:42:31,515 16, 17 million, OK? 932 00:42:31,515 --> 00:42:33,140 Because we're going to compare with it. 
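As a back-of-the-envelope check on that figure, using the 4,096 x 4,096 matrices assumed throughout: computing one row of C touches one row of C, one row of A, and every element of B, so

\[
4096 \;+\; 4096 \;+\; 4096 \times 4096 \;=\; 16{,}785{,}408 \;\approx\; 1.7 \times 10^{7}
\]

memory accesses for a single row, which is the 16-17 million we are about to compare against.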
933 00:42:33,140 --> 00:42:33,640 OK? 934 00:42:33,640 --> 00:42:35,280 So what about to compute a block? 935 00:42:35,280 --> 00:42:38,360 So if I look at a block, that is going to take-- 936 00:42:38,360 --> 00:42:45,140 64 by 64 also takes 4,096 writes to C. Same number, OK? 937 00:42:45,140 --> 00:42:48,890 But now I have to do about 200,000 938 00:42:48,890 --> 00:42:52,910 reads from A because I need to access all those rows. 939 00:42:52,910 --> 00:42:53,750 OK? 940 00:42:53,750 --> 00:42:58,850 And then for B, I need to access 64 columns of B, OK? 941 00:42:58,850 --> 00:43:04,760 And that's another 262,000 reads from B, OK? 942 00:43:04,760 --> 00:43:09,430 Which ends up being half a million memory accesses total. 943 00:43:09,430 --> 00:43:10,000 OK? 944 00:43:10,000 --> 00:43:15,370 So I end up doing way fewer accesses, OK, 945 00:43:15,370 --> 00:43:20,340 if those blocks will fit in my cache. 946 00:43:20,340 --> 00:43:20,840 OK? 947 00:43:20,840 --> 00:43:25,130 So I do much less to compute the same size footprint 948 00:43:25,130 --> 00:43:28,620 if I compute a block rather than computing a row. 949 00:43:28,620 --> 00:43:29,120 OK? 950 00:43:29,120 --> 00:43:30,600 Much more efficient. 951 00:43:30,600 --> 00:43:31,100 OK? 952 00:43:31,100 --> 00:43:33,010 And that's a scheme called tiling, 953 00:43:33,010 --> 00:43:35,810 and so if you do tiled matrix multiplication, what you do 954 00:43:35,810 --> 00:43:39,320 is you bust your matrices into, let's say, 64 955 00:43:39,320 --> 00:43:43,820 by 64 submatrices, and then you do 956 00:43:43,820 --> 00:43:45,530 two levels of matrix multiply. 957 00:43:45,530 --> 00:43:48,410 You do an outer level of multiplying of the blocks using 958 00:43:48,410 --> 00:43:52,750 the same algorithm, and then when you hit the inner, 959 00:43:52,750 --> 00:43:56,510 to do a 64 by 64 matrix multiply, 960 00:43:56,510 --> 00:44:00,950 I then do another three-nested loops. 961 00:44:00,950 --> 00:44:03,810 So you end up with 6 nested loops. 962 00:44:03,810 --> 00:44:04,650 OK? 963 00:44:04,650 --> 00:44:08,190 And so you're basically busting it like this. 964 00:44:08,190 --> 00:44:10,200 And there's a tuning parameter, of course, 965 00:44:10,200 --> 00:44:13,867 which is, you know, how big do I make my tile size? 966 00:44:13,867 --> 00:44:16,200 You know, if it's s by s, what should I do at the leaves 967 00:44:16,200 --> 00:44:16,700 there? 968 00:44:16,700 --> 00:44:17,760 Should it be 64? 969 00:44:17,760 --> 00:44:20,818 Should it be 128? 970 00:44:20,818 --> 00:44:22,110 What number should I use there? 971 00:44:26,040 --> 00:44:29,210 How do we find the right value of s, this tuning parameter? 972 00:44:29,210 --> 00:44:29,970 OK? 973 00:44:29,970 --> 00:44:31,290 Ideas of how we might find it? 974 00:44:34,074 --> 00:44:37,108 AUDIENCE: You could figure out how much there is in the cache. 975 00:44:37,108 --> 00:44:38,650 CHARLES LEISERSON: You could do that. 976 00:44:38,650 --> 00:44:40,660 You might get a number, but who knows what else 977 00:44:40,660 --> 00:44:43,108 is going on in the cache while you're doing this. 978 00:44:43,108 --> 00:44:44,860 AUDIENCE: Just test a bunch of them. 979 00:44:44,860 --> 00:44:46,777 CHARLES LEISERSON: Yeah, test a bunch of them. 980 00:44:46,777 --> 00:44:47,390 Experiment! 981 00:44:47,390 --> 00:44:47,890 OK? 982 00:44:47,890 --> 00:44:48,850 Try them! 983 00:44:48,850 --> 00:44:50,450 See which one gives you good numbers. 
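A minimal sketch of the tiled loop structure being described, with the tile size s left as the tuning parameter to experiment with (illustrative only; it assumes n-by-n row-major matrices, s dividing n evenly, and parallelism on the outermost tile loop only):

    #include <cilk/cilk.h>

    // Tiled matrix multiply: the outer three loops walk over s x s tiles,
    // and the inner three loops do an ordinary multiply within a tile,
    // so the inner loops' working set can stay in cache when s is chosen well.
    void mm_tiled(double *restrict C, const double *restrict A,
                  const double *restrict B, int n, int s) {
      cilk_for (int ih = 0; ih < n; ih += s)       // tile row of C (parallel)
        for (int kh = 0; kh < n; kh += s)          // tile index along the shared dimension
          for (int jh = 0; jh < n; jh += s)        // tile column of C
            for (int i = ih; i < ih + s; ++i)      // ordinary multiply inside the tile
              for (int k = kh; k < kh + s; ++k)
                for (int j = jh; j < jh + s; ++j)
                  C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

Timing runs with s = 32, 64, 128, and so on is exactly the experiment being suggested.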
984 00:44:50,450 --> 00:44:53,560 And when you do that, it turns out 985 00:44:53,560 --> 00:44:57,300 that 32 gives you the best performance, OK, 986 00:44:57,300 --> 00:44:58,990 for this particular problem. 987 00:44:58,990 --> 00:45:00,350 OK? 988 00:45:00,350 --> 00:45:03,050 So you can block it, and then you can get faster, 989 00:45:03,050 --> 00:45:12,020 and when you do that, that gave us a speedup of about 1.7. 990 00:45:12,020 --> 00:45:12,520 OK? 991 00:45:12,520 --> 00:45:13,910 So we're now up to-- 992 00:45:13,910 --> 00:45:14,410 what? 993 00:45:14,410 --> 00:45:20,110 We're almost 10% of peak, OK? 994 00:45:20,110 --> 00:45:23,690 And the other thing is that if you use Cachegrind 995 00:45:23,690 --> 00:45:25,900 or a similar tool, you can figure out 996 00:45:25,900 --> 00:45:28,672 how many cache references there are and so forth, 997 00:45:28,672 --> 00:45:30,130 and you can see that, in fact, it's 998 00:45:30,130 --> 00:45:34,300 dropped quite considerably when you do the tiling versus just 999 00:45:34,300 --> 00:45:36,660 the straight parallel loops. 1000 00:45:36,660 --> 00:45:37,360 OK? 1001 00:45:37,360 --> 00:45:40,600 So once again, you can use tools to help you figure this out 1002 00:45:40,600 --> 00:45:45,370 and to understand the cause of what's going on. 1003 00:45:45,370 --> 00:45:48,490 Well, it turns out that our chips 1004 00:45:48,490 --> 00:45:50,410 don't have just one cache. 1005 00:45:50,410 --> 00:45:52,660 They've got three levels of caches. 1006 00:45:52,660 --> 00:45:53,510 OK? 1007 00:45:53,510 --> 00:45:56,920 There's L1-cache, OK? 1008 00:45:56,920 --> 00:45:58,390 And there's data and instructions, 1009 00:45:58,390 --> 00:45:59,860 so we're thinking about data here, 1010 00:45:59,860 --> 00:46:01,480 for the data for the matrix. 1011 00:46:01,480 --> 00:46:04,180 And it's got an L2-cache, which is also private 1012 00:46:04,180 --> 00:46:07,360 to the processor, and then a shared L3-cache, 1013 00:46:07,360 --> 00:46:09,220 and then you go out to the DRAM-- 1014 00:46:09,220 --> 00:46:13,710 you also can go to your neighboring processors 1015 00:46:13,710 --> 00:46:14,580 and such. 1016 00:46:14,580 --> 00:46:15,100 OK? 1017 00:46:15,100 --> 00:46:16,430 And they're of different sizes. 1018 00:46:16,430 --> 00:46:17,722 You can see they grow in size-- 1019 00:46:20,590 --> 00:46:25,150 32 kilobytes, 256 kilobytes, to 25 megabytes, 1020 00:46:25,150 --> 00:46:28,240 to main memory, which is 60 gigabytes. 1021 00:46:28,240 --> 00:46:32,530 So what you can do is, if you want to do two-level tiling, 1022 00:46:32,530 --> 00:46:36,790 OK, you can have two tuning parameters, s and t. 1023 00:46:36,790 --> 00:46:38,390 And now you get to do-- 1024 00:46:38,390 --> 00:46:41,350 you can't do binary search to find it, unfortunately, 1025 00:46:41,350 --> 00:46:42,960 because it's multi-dimensional. 1026 00:46:42,960 --> 00:46:45,370 You kind of have to do it exhaustively. 1027 00:46:45,370 --> 00:46:47,992 And when you do that, you end up with-- 1028 00:46:47,992 --> 00:46:49,830 [LAUGHTER] 1029 00:46:50,330 --> 00:46:54,880 --with 9 nested loops, OK? 1030 00:46:54,880 --> 00:46:57,040 But of course, we don't really want to do that. 1031 00:46:57,040 --> 00:47:00,870 We have three levels of caching, OK? 1032 00:47:00,870 --> 00:47:05,310 Can anybody figure out the inductive number? 1033 00:47:05,310 --> 00:47:08,070 For three levels of caching, how many levels of tiling 1034 00:47:08,070 --> 00:47:09,252 do we have to do? 
1035 00:47:13,680 --> 00:47:16,220 This is a gimme, right? 1036 00:47:16,220 --> 00:47:17,040 AUDIENCE: 12. 1037 00:47:17,040 --> 00:47:17,957 CHARLES LEISERSON: 12! 1038 00:47:17,957 --> 00:47:19,320 Good, 12, OK? 1039 00:47:19,320 --> 00:47:22,080 Yeah, you then do 12. 1040 00:47:22,080 --> 00:47:25,560 And man, you know, this is what I mean when I say the code gets ugly when you 1041 00:47:25,560 --> 00:47:27,460 start making things go fast. 1042 00:47:27,460 --> 00:47:27,960 OK? 1043 00:47:27,960 --> 00:47:28,460 Right? 1044 00:47:28,460 --> 00:47:30,380 This is like, ughhh! 1045 00:47:30,380 --> 00:47:31,290 [LAUGHTER] 1046 00:47:31,790 --> 00:47:33,460 OK? 1047 00:47:33,460 --> 00:47:38,230 OK, but it turns out there's a trick. 1048 00:47:38,230 --> 00:47:40,270 You can tile for every power of 2 1049 00:47:40,270 --> 00:47:45,190 simultaneously by just solving the problem recursively. 1050 00:47:45,190 --> 00:47:48,010 So the idea is that you do divide and conquer. 1051 00:47:48,010 --> 00:47:52,865 You divide each of the matrices into 4 submatrices, OK? 1052 00:47:52,865 --> 00:47:55,240 And then, if you look at the calculations you need to do, 1053 00:47:55,240 --> 00:47:58,900 you have to solve 8 subproblems of half the size, 1054 00:47:58,900 --> 00:48:02,140 and then do an addition. 1055 00:48:02,140 --> 00:48:03,310 OK? 1056 00:48:03,310 --> 00:48:05,560 And so you have 8 multiplications 1057 00:48:05,560 --> 00:48:09,128 of size n over 2 by n over 2 and 1 addition of n by n matrices, 1058 00:48:09,128 --> 00:48:10,420 and that gives you your answer. 1059 00:48:10,420 --> 00:48:12,128 But then, of course, what you're going to do 1060 00:48:12,128 --> 00:48:14,440 is solve each of those recursively, OK? 1061 00:48:14,440 --> 00:48:16,190 And that's going to give you, essentially, 1062 00:48:16,190 --> 00:48:18,390 the same type of performance. 1063 00:48:18,390 --> 00:48:20,613 Here's the code. 1064 00:48:20,613 --> 00:48:22,280 I don't expect that you understand this, 1065 00:48:22,280 --> 00:48:25,790 but we've written this to run in parallel, 1066 00:48:25,790 --> 00:48:28,850 because it turns out you can do 4 of them in parallel. 1067 00:48:28,850 --> 00:48:33,710 And the Cilk spawn here says, go and do this subroutine, which 1068 00:48:33,710 --> 00:48:39,380 is basically a subproblem, and then, while you're doing that, 1069 00:48:39,380 --> 00:48:41,840 you're allowed to go and execute the next statement-- which 1070 00:48:41,840 --> 00:48:44,450 you'll do another spawn and another spawn and finally this. 1071 00:48:44,450 --> 00:48:46,520 And then this statement says, ah, 1072 00:48:46,520 --> 00:48:48,680 but don't start the next phase until you 1073 00:48:48,680 --> 00:48:50,070 finish the first phase. 1074 00:48:50,070 --> 00:48:50,570 OK? 1075 00:48:50,570 --> 00:48:53,730 And we'll learn about this stuff. 1076 00:48:53,730 --> 00:48:57,410 OK, when we do that, we get a running time 1077 00:48:57,410 --> 00:49:02,630 of about 93 seconds, which is about 50 times slower 1078 00:49:02,630 --> 00:49:04,760 than the last version. 1079 00:49:04,760 --> 00:49:08,660 We're using cache much better, but it turns out, 1080 00:49:08,660 --> 00:49:13,485 you know, nothing is free, nothing is easy, typically, 1081 00:49:13,485 --> 00:49:14,610 in performance engineering. 1082 00:49:14,610 --> 00:49:15,910 You have to be clever. 1083 00:49:17,570 --> 00:49:18,470 What happened here?
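A rough sketch of the recursive scheme just described, written with cilk_spawn and cilk_sync (illustrative only, not the slide's exact code; it assumes n is a power of 2 and that each matrix argument points at a quadrant inside a larger row-major matrix whose row length is stride):

    #include <cilk/cilk.h>

    // C += A * B for n x n submatrices embedded in row-major storage with
    // row length "stride". The explicit addition from the slide is folded
    // into the += accumulation.
    static void mm_dac(double *restrict C, const double *restrict A,
                       const double *restrict B, int n, int stride) {
      if (n == 1) {                 // naive base case: recurse all the way to 1 x 1,
        C[0] += A[0] * B[0];        // so the call overhead lands on every element --
        return;                     // the problem taken up next
      }
      int h = n / 2;
      #define X(M, r, c) ((M) + (r)*h*stride + (c)*h)   // quadrant (r, c) of matrix M
      // First phase: 4 of the 8 half-size multiplications, in parallel.
      cilk_spawn mm_dac(X(C,0,0), X(A,0,0), X(B,0,0), h, stride);
      cilk_spawn mm_dac(X(C,0,1), X(A,0,0), X(B,0,1), h, stride);
      cilk_spawn mm_dac(X(C,1,0), X(A,1,0), X(B,0,0), h, stride);
                 mm_dac(X(C,1,1), X(A,1,0), X(B,0,1), h, stride);
      cilk_sync;   // don't start the next phase until the first phase is done
      // Second phase: the other 4 multiplications, accumulating into the same C.
      cilk_spawn mm_dac(X(C,0,0), X(A,0,1), X(B,1,0), h, stride);
      cilk_spawn mm_dac(X(C,0,1), X(A,0,1), X(B,1,1), h, stride);
      cilk_spawn mm_dac(X(C,1,0), X(A,1,1), X(B,1,0), h, stride);
                 mm_dac(X(C,1,1), X(A,1,1), X(B,1,1), h, stride);
      cilk_sync;
      #undef X
    }

The cilk_sync between the two phases is needed because both phases accumulate into the same quadrants of C.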
1084 00:49:18,470 --> 00:49:21,440 Why did this get worse, even though-- it turns out, 1085 00:49:21,440 --> 00:49:23,720 if you actually look at the caching numbers, 1086 00:49:23,720 --> 00:49:25,940 you're getting great hits on cache. 1087 00:49:25,940 --> 00:49:31,130 I mean, very few cache misses, lots of hits on cache, 1088 00:49:31,130 --> 00:49:32,120 but we're still slower. 1089 00:49:32,120 --> 00:49:34,310 Why do you suppose that is? 1090 00:49:34,310 --> 00:49:35,265 Let me get someone-- 1091 00:49:35,265 --> 00:49:35,765 yeah? 1092 00:49:35,765 --> 00:49:39,438 AUDIENCE: Overhead to start functions and [INAUDIBLE].. 1093 00:49:39,438 --> 00:49:40,980 CHARLES LEISERSON: Yeah, the overhead 1094 00:49:40,980 --> 00:49:43,830 to the start of the function, and in particular the place 1095 00:49:43,830 --> 00:49:47,040 that it matters is at the leaves of the computation. 1096 00:49:47,040 --> 00:49:47,760 OK? 1097 00:49:47,760 --> 00:49:51,090 So what we do is, we have a very small base case. 1098 00:49:51,090 --> 00:49:55,090 We're doing this overhead all the way down to n equals 1. 1099 00:49:55,090 --> 00:49:57,090 So there's a function call overhead even when 1100 00:49:57,090 --> 00:49:58,990 you're multiplying 1 by 1. 1101 00:49:58,990 --> 00:50:04,370 So hey, let's pick a threshold, and below that threshold, 1102 00:50:04,370 --> 00:50:09,350 let's just use a standard good algorithm for that threshold. 1103 00:50:09,350 --> 00:50:14,510 And if we're above that, we'll do divide and conquer, OK? 1104 00:50:14,510 --> 00:50:17,600 So what we do is we call-- 1105 00:50:17,600 --> 00:50:23,990 if we're less than the threshold, we call a base case, 1106 00:50:23,990 --> 00:50:26,690 and the base case looks very much like just ordinary matrix 1107 00:50:26,690 --> 00:50:28,250 multiply. 1108 00:50:28,250 --> 00:50:30,060 OK? 1109 00:50:30,060 --> 00:50:33,990 And so, when you do that, you can once again look 1110 00:50:33,990 --> 00:50:36,210 to see what's the best value for the base case, 1111 00:50:36,210 --> 00:50:40,480 and it turns out in this case, I guess, it's 64. 1112 00:50:40,480 --> 00:50:40,980 OK? 1113 00:50:40,980 --> 00:50:44,010 We get down to 1.95 seconds. 1114 00:50:44,010 --> 00:50:46,260 I didn't do the base case of 1, because I tried that, 1115 00:50:46,260 --> 00:50:48,552 and that was the one that gave us terrible performance. 1116 00:50:48,552 --> 00:50:49,950 AUDIENCE: [INAUDIBLE] 1117 00:50:49,950 --> 00:50:50,190 CHARLES LEISERSON: Sorry. 1118 00:50:50,190 --> 00:50:50,910 32-- oh, yeah. 1119 00:50:50,910 --> 00:50:51,900 32 is even better. 1120 00:50:51,900 --> 00:50:53,100 1.3. 1121 00:50:53,100 --> 00:50:54,900 Good, yeah, so we picked 32. 1122 00:50:54,900 --> 00:50:57,030 I think I even-- oh, I didn't highlight it. 1123 00:50:57,030 --> 00:50:58,980 OK. 1124 00:50:58,980 --> 00:51:01,560 I should have highlighted that on the slide. 1125 00:51:01,560 --> 00:51:07,730 So then, when we do that, we now are getting 12% of peak. 1126 00:51:07,730 --> 00:51:08,450 OK? 1127 00:51:08,450 --> 00:51:15,740 And if you count up how many cache misses we have, 1128 00:51:15,740 --> 00:51:17,530 you can see that-- 1129 00:51:17,530 --> 00:51:23,390 here's the data cache for L1, and with parallel divide 1130 00:51:23,390 --> 00:51:25,370 and conquer it's the lowest, but also now so 1131 00:51:25,370 --> 00:51:27,480 is the last-level caching. 1132 00:51:27,480 --> 00:51:27,980 OK? 
1133 00:51:27,980 --> 00:51:30,950 And then the total number of references is small, as well. 1134 00:51:30,950 --> 00:51:33,770 So divide and conquer turns out to be a big win here. 1135 00:51:33,770 --> 00:51:34,970 OK? 1136 00:51:34,970 --> 00:51:40,970 Now the other thing that we mentioned, which was we're 1137 00:51:40,970 --> 00:51:43,160 not using the vector hardware. 1138 00:51:43,160 --> 00:51:48,260 All of these things have vectors that we can operate on, OK? 1139 00:51:48,260 --> 00:51:51,020 They have vector hardware that process data in what's called 1140 00:51:51,020 --> 00:51:54,020 SIMD fashion, which means Single-Instruction stream, 1141 00:51:54,020 --> 00:51:55,310 Multiple-Data. 1142 00:51:55,310 --> 00:51:57,080 That means you give one instruction, 1143 00:51:57,080 --> 00:52:01,010 and it does operations on a vector, OK? 1144 00:52:01,010 --> 00:52:07,940 And as we mentioned, we have 8 floating-point units 1145 00:52:07,940 --> 00:52:14,050 per core, of which we can also do a fused-multiply-add, OK? 1146 00:52:16,610 --> 00:52:18,560 So each vector register holds multiple words. 1147 00:52:18,560 --> 00:52:24,020 I believe in the machine we're using this term, it's 4 words. 1148 00:52:24,020 --> 00:52:25,110 I think so. 1149 00:52:25,110 --> 00:52:25,610 OK? 1150 00:52:30,350 --> 00:52:32,780 But it's important when you use these-- you can't just 1151 00:52:32,780 --> 00:52:34,100 use them willy-nilly. 1152 00:52:38,510 --> 00:52:44,660 You have to operate on the data as one chunk of vector data. 1153 00:52:44,660 --> 00:52:48,290 You can't, you know, have this lane 1154 00:52:48,290 --> 00:52:51,075 of the vector unit doing one thing and a different lane 1155 00:52:51,075 --> 00:52:51,950 doing something else. 1156 00:52:51,950 --> 00:52:54,390 They all have to be doing essentially the same thing, 1157 00:52:54,390 --> 00:52:57,460 the only difference being the indexing of memory. 1158 00:52:57,460 --> 00:52:58,130 OK? 1159 00:52:58,130 --> 00:52:59,270 So when you do that-- 1160 00:53:03,830 --> 00:53:06,520 so already we've actually been taking advantage of it. 1161 00:53:06,520 --> 00:53:11,390 But you can produce a vectorization report by asking 1162 00:53:11,390 --> 00:53:16,917 for that, and the system will tell you what kinds of things 1163 00:53:16,917 --> 00:53:19,250 are being vectorized, which things are being vectorized, 1164 00:53:19,250 --> 00:53:20,120 which aren't. 1165 00:53:20,120 --> 00:53:22,100 And we'll talk about how you vectorize things 1166 00:53:22,100 --> 00:53:24,310 that the compiler doesn't want to vectorize. 1167 00:53:24,310 --> 00:53:25,000 OK? 1168 00:53:25,000 --> 00:53:27,800 And in particular, most machines don't support the newest sets 1169 00:53:27,800 --> 00:53:29,810 of vector instructions, so the compiler 1170 00:53:29,810 --> 00:53:33,660 uses vector instructions conservatively by default. 1171 00:53:33,660 --> 00:53:37,250 So if you're compiling for a particular machine, 1172 00:53:37,250 --> 00:53:39,630 you can say, use that particular machine. 1173 00:53:39,630 --> 00:53:41,990 And here's some of the vectorization flags. 1174 00:53:41,990 --> 00:53:45,770 You can say, use the AVX instructions if you have AVX. 1175 00:53:45,770 --> 00:53:47,510 You can use AVX2. 1176 00:53:47,510 --> 00:53:49,430 You can use the fused-multiply-add vector 1177 00:53:49,430 --> 00:53:50,630 instructions. 
1178 00:53:50,630 --> 00:53:52,143 You can give a string that tells you 1179 00:53:52,143 --> 00:53:53,810 the architecture that you're running on, 1180 00:53:53,810 --> 00:53:55,250 on that special thing. 1181 00:53:55,250 --> 00:53:58,130 And you can say, well, use whatever machine I'm currently 1182 00:53:58,130 --> 00:54:01,850 compiling on, OK, and it'll figure out 1183 00:54:01,850 --> 00:54:03,440 which architecture is that. 1184 00:54:03,440 --> 00:54:04,370 OK? 1185 00:54:04,370 --> 00:54:08,390 Now, floating-point numbers, as we'll talk about, 1186 00:54:08,390 --> 00:54:10,850 turn out to have some undesirable properties, 1187 00:54:10,850 --> 00:54:15,110 like they're not associative, so if you do A times B 1188 00:54:15,110 --> 00:54:19,220 times C, how you parenthesize that can give you 1189 00:54:19,220 --> 00:54:21,860 two different numbers. 1190 00:54:21,860 --> 00:54:24,890 And so if you give a specification of a code, 1191 00:54:24,890 --> 00:54:27,770 typically, the compiler will not change 1192 00:54:27,770 --> 00:54:32,120 the order of associativity because it says, 1193 00:54:32,120 --> 00:54:34,490 I want to get exactly the same result. 1194 00:54:34,490 --> 00:54:36,200 But you can give it a flag called 1195 00:54:36,200 --> 00:54:39,650 fast math, minus ffast-math, which will allow it 1196 00:54:39,650 --> 00:54:41,390 to do that kind of reordering. 1197 00:54:41,390 --> 00:54:41,990 OK? 1198 00:54:41,990 --> 00:54:43,532 If it's not important to you, then it 1199 00:54:43,532 --> 00:54:46,880 will be the same as the default ordering, OK? 1200 00:54:46,880 --> 00:54:48,540 And when you use that-- 1201 00:54:48,540 --> 00:54:51,230 and particularly using architecture native 1202 00:54:51,230 --> 00:54:55,490 and fast math, we actually get about double the performance 1203 00:54:55,490 --> 00:54:57,090 out of vectorization, just having 1204 00:54:57,090 --> 00:54:59,310 the compiler vectorize it. 1205 00:54:59,310 --> 00:55:01,350 OK? 1206 00:55:01,350 --> 00:55:04,295 Yeah, question. 1207 00:55:04,295 --> 00:55:11,125 AUDIENCE: Are the data types in our matrix, are they 32-bit? 1208 00:55:11,125 --> 00:55:13,190 CHARLES LEISERSON: They're 64-bit. 1209 00:55:13,190 --> 00:55:15,030 Yep. 1210 00:55:15,030 --> 00:55:16,920 These days, 64-bit is pretty standard. 1211 00:55:16,920 --> 00:55:19,730 They call that double precision, but it's pretty standard. 1212 00:55:19,730 --> 00:55:21,900 Unless you're doing AI applications, in which case 1213 00:55:21,900 --> 00:55:26,520 you may want to do lower-precision arithmetic. 1214 00:55:26,520 --> 00:55:29,330 AUDIENCE: So float and double are both the same? 1215 00:55:29,330 --> 00:55:32,690 CHARLES LEISERSON: No, float is 32, OK? 1216 00:55:32,690 --> 00:55:40,700 So generally, people who are doing serious linear algebra 1217 00:55:40,700 --> 00:55:43,340 calculations use 64 bits. 1218 00:55:43,340 --> 00:55:46,760 But actually sometimes they can use less, 1219 00:55:46,760 --> 00:55:48,830 and then you can get more performance 1220 00:55:48,830 --> 00:55:50,270 if you discover you can use fewer 1221 00:55:50,270 --> 00:55:52,460 bits in your representation. 1222 00:55:52,460 --> 00:55:54,490 We'll talk about that, too. 1223 00:55:54,490 --> 00:55:55,760 OK? 
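A tiny demonstration of that reordering issue (this sketch uses addition rather than the multiplication example above, but the non-associativity of floating point is the same phenomenon):

    #include <stdio.h>

    int main(void) {
      double a = 0.1, b = 0.2, c = 0.3;
      // Under IEEE-754 double rounding, the two groupings give slightly
      // different results, so the compiler won't reorder them unless you
      // allow it to, for example with -ffast-math.
      printf("(a + b) + c = %.17g\n", (a + b) + c);
      printf("a + (b + c) = %.17g\n", a + (b + c));
      return 0;
    }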
1224 00:55:55,760 --> 00:55:57,260 So last thing that we're going to do 1225 00:55:57,260 --> 00:56:01,970 is, you can actually use the instructions, the vector 1226 00:56:01,970 --> 00:56:03,950 instructions, yourself rather than 1227 00:56:03,950 --> 00:56:06,150 rely on the compiler to do it. 1228 00:56:06,150 --> 00:56:11,060 And there's a whole manual of intrinsic instructions 1229 00:56:11,060 --> 00:56:17,270 that you can call from C that allow you to do, you know, 1230 00:56:17,270 --> 00:56:20,130 the specific vector instructions that you might want to do. 1231 00:56:20,130 --> 00:56:22,130 So the compiler doesn't have to figure that out. 1232 00:56:25,040 --> 00:56:31,580 And you can also use some more insights to do things 1233 00:56:31,580 --> 00:56:33,650 like-- you can do preprocessing, and you 1234 00:56:33,650 --> 00:56:36,680 can transpose the matrices, which turns out to help, 1235 00:56:36,680 --> 00:56:37,890 and do data alignment. 1236 00:56:37,890 --> 00:56:41,390 And there's a lot of other things using clever algorithms 1237 00:56:41,390 --> 00:56:43,210 for the base case, OK? 1238 00:56:46,730 --> 00:56:48,630 And you do more performance engineering. 1239 00:56:48,630 --> 00:56:50,630 You think about what you're doing, you code, and then 1240 00:56:50,630 --> 00:56:52,820 you run, run, run to test, and that's 1241 00:56:52,820 --> 00:56:55,730 one nice reason to have the cloud, because you can do tests 1242 00:56:55,730 --> 00:56:56,480 in parallel. 1243 00:56:56,480 --> 00:57:01,593 So it takes you less time to do your tests in terms of your, 1244 00:57:01,593 --> 00:57:04,010 you know, sitting around time when you're doing something. 1245 00:57:04,010 --> 00:57:05,850 You say, oh, I want to do 10 tests. 1246 00:57:05,850 --> 00:57:09,530 Let's spin up 10 machines and do all the tests at the same time. 1247 00:57:09,530 --> 00:57:10,480 When you do that-- 1248 00:57:10,480 --> 00:57:12,230 and the main one we're getting out of this 1249 00:57:12,230 --> 00:57:19,970 is the AVX intrinsics-- we get up to 0.41 1250 00:57:19,970 --> 00:57:27,460 of peak, so 41% of peak, and get about a 50,000x speedup. 1251 00:57:27,460 --> 00:57:28,570 OK? 1252 00:57:28,570 --> 00:57:32,820 And it turns out that's where we quit. 1253 00:57:32,820 --> 00:57:33,580 OK? 1254 00:57:33,580 --> 00:57:37,000 And the reason is because we beat 1255 00:57:37,000 --> 00:57:39,490 Intel's professionally engineered Math Kernel 1256 00:57:39,490 --> 00:57:41,750 Library at that point. 1257 00:57:41,750 --> 00:57:42,250 [LAUGHTER] 1258 00:57:42,250 --> 00:57:44,450 OK? 1259 00:57:44,450 --> 00:57:45,970 You know, a good question is, why 1260 00:57:45,970 --> 00:57:47,410 aren't we getting all of peak? 1261 00:57:47,410 --> 00:57:58,100 And you know, I invite you to try to figure that out, OK? 1262 00:57:58,100 --> 00:58:00,080 It turns out, though, Intel MKL is 1263 00:58:00,080 --> 00:58:02,990 better than what we did because we assumed it was a power of 2. 1264 00:58:02,990 --> 00:58:05,870 Intel doesn't assume that it's a power of 2, 1265 00:58:05,870 --> 00:58:11,660 and they're more robust, although we win on the 4,096 1266 00:58:11,660 --> 00:58:19,280 by 4,096 matrices, they win on other sizes of matrices, 1267 00:58:19,280 --> 00:58:21,260 so it's not all things. 1268 00:58:24,080 --> 00:58:28,170 But the end of the story is, what have we done? 1269 00:58:28,170 --> 00:58:32,270 We have just done a factor of 50,000, OK?
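For a flavor of what those intrinsics look like, here is a hedged sketch of one possible inner step (not the code the lecture actually used; it assumes AVX2 and FMA support, e.g., compiling with -march=native):

    #include <immintrin.h>

    // One vectorized step of an inner kernel: c[0..3] += a_ik * b[0..3],
    // using a 256-bit vector of 4 doubles and a fused multiply-add.
    static inline void fma4(double *c, double a_ik, const double *b) {
      __m256d va = _mm256_set1_pd(a_ik);    // broadcast one element of A
      __m256d vb = _mm256_loadu_pd(b);      // load 4 consecutive doubles of B
      __m256d vc = _mm256_loadu_pd(c);      // load 4 consecutive doubles of C
      vc = _mm256_fmadd_pd(va, vb, vc);     // vc = va * vb + vc in one instruction
      _mm256_storeu_pd(c, vc);              // store the updated values back to C
    }

In a full kernel this would sit inside the innermost loop of the base case, advancing b and c by 4 doubles per step.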
1270 00:58:32,270 --> 00:58:39,950 If you looked at the gas economy, OK, of a jumbo jet 1271 00:58:39,950 --> 00:58:41,960 and getting the kind of performance 1272 00:58:41,960 --> 00:58:45,740 that we just got in terms of miles per gallon, 1273 00:58:45,740 --> 00:58:52,280 you would be able to run a jumbo jet on a little Vespa scooter 1274 00:58:52,280 --> 00:58:54,620 or whatever type of scooter that is, OK? 1275 00:58:54,620 --> 00:58:56,990 That's how much we've been able to do it. 1276 00:58:56,990 --> 00:58:59,090 You generally-- let me just caution 1277 00:58:59,090 --> 00:59:01,700 you-- won't see the magnitude of a performance improvement 1278 00:59:01,700 --> 00:59:04,020 that we obtained for matrix multiplication. 1279 00:59:04,020 --> 00:59:04,520 OK? 1280 00:59:04,520 --> 00:59:07,400 That turns out to be one where-- 1281 00:59:07,400 --> 00:59:11,030 it's a really good example because it's so dramatic. 1282 00:59:11,030 --> 00:59:14,390 But we will see some substantial numbers, 1283 00:59:14,390 --> 00:59:16,490 and in particular in 6.172 you'll 1284 00:59:16,490 --> 00:59:19,790 learn how to print this currency of performance 1285 00:59:19,790 --> 00:59:22,610 all by yourself so that you don't have to take 1286 00:59:22,610 --> 00:59:24,590 somebody else's library. 1287 00:59:24,590 --> 00:59:28,380 You can, you know, say, oh, no, I'm an engineer of that. 1288 00:59:28,380 --> 00:59:29,990 Let me mention one other thing. 1289 00:59:29,990 --> 00:59:33,340 In this course, we're going to focus on multicore computing. 1290 00:59:33,340 --> 00:59:36,890 We are not, in particular, going to be doing GPUs 1291 00:59:36,890 --> 00:59:40,280 or file systems or network performance. 1292 00:59:40,280 --> 00:59:43,730 In the real world, those are hugely important, OK? 1293 00:59:43,730 --> 00:59:46,310 What we found, however, is that it's better 1294 00:59:46,310 --> 00:59:48,815 to learn a particular domain, and in particular, 1295 00:59:48,815 --> 00:59:50,630 this particular domain-- 1296 00:59:50,630 --> 00:59:58,530 people who master multicore performance engineering, 1297 00:59:58,530 --> 01:00:01,670 in fact, go on to do these other things 1298 01:00:01,670 --> 01:00:03,650 and are really good at it, OK? 1299 01:00:03,650 --> 01:00:06,510 Because you've learned the sort of the core, the basis, 1300 01:00:06,510 --> 01:00:08,860 the foundation--