The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES LEISERSON: Today, we're going to talk about analyzing task-parallel algorithms, multithreaded algorithms. And this is going to rely on the fact that everybody has taken an algorithms class. So I want to remind you of some of the stuff you learned in your algorithms class. And if you don't remember it, then it's probably good to bone up on it a little bit, because it's going to be essential. And that is the topic of divide-and-conquer recurrences. Everybody remember divide-and-conquer recurrences? There's a general method for solving them that will deal with most of the ones we want, called the master method. It deals with recurrences of the form T of n equals a times T of n over b plus f of n. This is generally interpreted as: I have a problem of size n, I can solve it by solving a subproblems each of size n over b, and it costs me f of n work to do that division and to accumulate the results into my final result. For all these recurrences, the unstated base case is that this is a running time, so T of n is constant if n is small. Does that make sense? Everybody familiar with this? Right? Well, we're going to review it anyway, because I don't like to just assume, and then leave 20% of you or more behind in the woods.
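For reference, here is the recurrence together with its implicit base case. The cutoff constant n_0 is an assumption of this write-up; the lecture just says "n is small":

```latex
T(n) = a \, T\!\left(\frac{n}{b}\right) + f(n),
\qquad
T(n) = \Theta(1) \ \text{for} \ n \le n_0 .
```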
So let's just remind ourselves of what this means. It's easy to understand this in terms of a recursion tree. The idea of a recursion tree is to take the running time and reexpress it using the recurrence. So if I reexpress T of n, I have an f of n. I can put that f of n at the root and have a copies of T of n over b as children. That's exactly the same amount of work, or running time, as I had in the T of n; I've just expressed it with the right-hand side. And then I do it again at every level. So I expand all the leaves. I only expanded one here because I ran out of space. And you keep doing that until you get down to T of 1.

And then the trick of looking at these recurrences is to add across the rows. The first row adds up to f of n. The second row adds up to a times f of n over b. The third one is a squared times f of n over b squared, and so forth. And the height here: since I'm taking n and dividing it by b each time, how many times can I divide by b until I get to something that's constant size? That's just log base b of n.

So far, review. Any questions here? For anybody? OK. So I get the height, and then I look at the leaves: if I've got T of 1 work at every leaf, how many leaves are there? And for this analysis we're going to assume everything works out, n is a perfect power of b and so forth. So if I go down k levels, how many subproblems are there at k levels? a to the k. And how many levels am I going down? h, which is log base b of n. So I end up with a to the log base b of n times what's at each leaf, which is T of 1. And T of 1 is constant. Now, a to the log base b of n is the same as n to the log base b of a. That's just a little bit of exponential algebra, and one way to see it is to take the log base b of both sides; you realize that all that's used there is the commutative law.
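Writing out that little calculation: taking log base b of both quantities gives the same product of factors, just in the opposite order,

```latex
\log_b\!\left(a^{\log_b n}\right) = (\log_b n)(\log_b a)
= (\log_b a)(\log_b n)
= \log_b\!\left(n^{\log_b a}\right),
\qquad\text{hence}\qquad
a^{\log_b n} = n^{\log_b a}.
```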
So that's just a little bit of math. Basically, we're interested in the growth in n, so we prefer not to have log n in the exponent; we'd rather have n raised to a constant power. So that's basically the number of leaves.

And so then the question is, how much work is there if I add up all of these rows all the way down? It turns out there's a trick, and the trick is to compare n to the log base b of a with f of n. There are three cases that arise very commonly, and for the most part, these three cases are all we're going to see.

Case 1 is the case where n to the log base b of a is much bigger than f of n. And by much bigger, I mean bigger by a polynomial amount: there's an epsilon greater than 0 such that the ratio between the two is at least n to the epsilon. In other words, f of n is O of n to the log base b of a minus epsilon, which is the same as n to the log base b of a divided by n to the epsilon. In that case, the row sums are geometrically increasing, and so all the weight, a constant fraction of the weight, is in the leaves. So the answer is that T of n is theta of n to the log base b of a. If n to the log base b of a is bigger than f of n by a polynomial amount, the answer is n to the log base b of a.

Now, case 2 is the situation where n to the log base b of a is approximately equal to f of n. They're very similar in growth. Specifically, we're going to look at the case where f of n is n to the log base b of a times a polylogarithmic factor, log to the k of n for some constant k greater than or equal to 0. That greater than or equal to 0 is very important. You can't do this for negative k. Even though negative k is defined and meaningful, this is not the answer when k is negative.
But if k is greater than or equal to 0, then it turns out that what's happening is that the row sums are growing arithmetically from beginning to end. And so when you solve it, you essentially add an extra log term. So if f of n is n to the log base b of a times log to the k of n, the answer is n to the log base b of a times log to the k plus 1 of n. You kick in one extra log. And the intuition is that the levels are almost all equal, and there are log layers. That's not quite the math, but it's good intuition: they're almost all equal, there are log layers, so you tack on an extra log.

And then finally, case 3 is the case where n to the log base b of a is much less than f of n, and specifically where it is smaller by, once again, a polynomial factor: an n to the epsilon factor for some epsilon greater than 0. It's also the case here that f has to satisfy what's called a regularity condition. And this is a condition that's satisfied by all the functions we're going to look at: polynomials, polynomials times logarithms, and things of that nature. It's not satisfied for weird functions like sines and cosines. More relevantly, it's also not satisfied if you have things like exponentials. But for all the things we're going to look at, it holds. And in that case, the row sums are geometrically decreasing, so all the work is at the root, and the root basically costs f of n. So the solution is theta of f of n.

We're going to hand out a cheat sheet. So if you could conscript some of the TAs to get that distributed as quickly as possible. OK. So here's the cheat sheet; that's basically what's on it.
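In symbols, the three cases on the cheat sheet are the following. The regularity condition in case 3 is stated here in its standard form, a f(n/b) <= c f(n) for some constant c < 1:

```latex
\begin{aligned}
\textbf{Case 1:}\quad & f(n) = O\!\left(n^{\log_b a - \varepsilon}\right) \text{ for some } \varepsilon > 0
  &&\Longrightarrow\; T(n) = \Theta\!\left(n^{\log_b a}\right),\\[2pt]
\textbf{Case 2:}\quad & f(n) = \Theta\!\left(n^{\log_b a} \lg^{k} n\right) \text{ for some } k \ge 0
  &&\Longrightarrow\; T(n) = \Theta\!\left(n^{\log_b a} \lg^{k+1} n\right),\\[2pt]
\textbf{Case 3:}\quad & f(n) = \Omega\!\left(n^{\log_b a + \varepsilon}\right) \text{ for some } \varepsilon > 0,
  \ \text{with}\ a\,f(n/b) \le c\,f(n),\ c < 1
  &&\Longrightarrow\; T(n) = \Theta\!\left(f(n)\right).
\end{aligned}
```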
And we'll do a little in-class self-quiz. We have T of n equals 4 T of n over 2 plus n. And the solution is?

This is a thing that, as a computer scientist, you just memorize, so that in any situation you don't even have to look at the cheat sheet. You just know it. It's one of these basic things that all computer scientists should know. It's kind of like, what's 2 to the 15th? What is it?

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yes, 32,768. And interestingly, that's my office number. I'm in 32-G768. I'm the only one in the Stata Center with a power-of-2 office number, and that was totally unplanned. So if you need to remember my office number: 2 to the 15th.

OK, so what's the solution here?

AUDIENCE: Case 1.

CHARLES LEISERSON: It's case 1. And what's the solution?

AUDIENCE: n squared?

CHARLES LEISERSON: n squared. Very good. n to the log base b of a is n to the log base 2 of 4. Log base 2 of 4 is 2, so that's n squared. That's much bigger than n. So it's case 1, and the answer is theta of n squared. Pretty easy.

How about this one? Yeah.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yeah, it's n squared log n. Once again, the first part is the same: n to the log base b of a is n squared. And n squared is n squared log to the 0 of n. So it's case 2 with k equals 0, and you just tack on an extra log factor. So it's theta of n squared log n.

And then, of course, we've got to do this one. Yeah.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yeah, n cubed, because once again, n to the log base b of a is n squared. That's much less than n cubed. n cubed is bigger, so that dominates, and we have theta of n cubed.

What about this one? Yeah.

AUDIENCE: Theta of n squared.

CHARLES LEISERSON: No, that's not the answer. Which case do you think it is?
AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Case 2?

AUDIENCE: Yeah.

CHARLES LEISERSON: OK. No. Yeah?

AUDIENCE: None of the cases?

CHARLES LEISERSON: It's none of the cases. It's a trick question. Oh, I'm a nasty guy. I'm a nasty guy. This is one where the master method does not apply. This would be case 2, but k has to be greater than or equal to 0, and here k is minus 1. So case 2 doesn't apply. And case 1 doesn't apply either, where we're comparing n squared to n squared over log n, because the ratio there is log n, and log n is smaller than any n to the epsilon. You need to have an n-to-the-epsilon separation.

The actual answer is theta of n squared log log n for that one, by the way, which you can prove by the substitution method. It uses the same ideas; you just do a little bit different math. There's a more general solution to this kind of recurrence, called the Akra-Bazzi method. But for most of what we're going to see, applying the Akra-Bazzi method is more complicated than simply doing the table lookup: which side is bigger, and if it's sufficiently bigger, it's one case or the other, or the common case where they're about the same to within a log factor. So we're going to use the master method, but there are more general ways of solving these kinds of things.
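To recap the quiz in one place, here are the four recurrences and their solutions; the last line is the one the master method can't handle:

```latex
\begin{aligned}
T(n) &= 4T(n/2) + n         &&\Longrightarrow\; \Theta(n^2)          &&\text{(case 1, since } n^{\log_2 4} = n^2 \gg n)\\
T(n) &= 4T(n/2) + n^2       &&\Longrightarrow\; \Theta(n^2 \lg n)    &&\text{(case 2 with } k = 0)\\
T(n) &= 4T(n/2) + n^3       &&\Longrightarrow\; \Theta(n^3)          &&\text{(case 3)}\\
T(n) &= 4T(n/2) + n^2/\lg n &&\Longrightarrow\; \Theta(n^2 \lg\lg n) &&\text{(no case applies; substitution method)}
\end{aligned}
```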
OK. Let's talk about some multithreaded algorithms. The first thing I want to do is talk about loops, because loops are a great thing to analyze and understand, since so many programs have loops. Probably 90% or more of the programs that are parallelized are parallelized by making parallel loops. The spawn-and-sync type of parallelism, the subroutine-type parallelism, is not done that frequently in code. Usually, it's loops.

So what we're going to look at, as an example, is code to do an in-place matrix transpose. If you look at this code, I want to swap the lower side of the matrix with the upper side of the matrix, and here's some code to do it, where I parallelize the outer loop. So we're running the outer index i from 1 up to n; actually, the indexes run from 0 to n minus 1. And then the inner loop goes from 0 up to i minus 1.

Now, I've seen people write transpose code; this is one of these trick questions they give you in interviews, where they say, write the transpose of a matrix with nested loops. And what many people will do is run the inner loop to n rather than running it to i. And what happens if you run the inner loop to n? It's a very expensive identity function. And there's an easier, faster way to compute the identity than with doubly nested loops where you swap everything and then swap it all back.

So it's important to look at the iteration space here. What's its shape? If you look at the i and j values and you map them out on a plane, what's the shape that you get? It's not a square, which it would be if both loops ran from 1 to n, or 0 to n minus 1. What's the shape of this iteration space? Yeah, it's a triangle. We're going to run through all the pairs in this lower area and swap each with the corresponding entry in the upper one. The iteration space runs through just the lower triangle, or, correspondingly, through the upper triangle, if you want to view it from that point of view. But it doesn't go through both triangles, because then you would get an identity. So anyway, that's just a tip for when you're interviewing: double-check that they've got the loop indices to be what they ought to be.
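Here is a minimal sketch of the kind of code on the slide. The element type double and the row-major layout are assumptions of this sketch; the point is that the inner loop stops at i:

```c
#include <stddef.h>
#include <cilk/cilk.h>

// In-place transpose of an n x n row-major matrix A.
// The outer loop is parallel; the inner loop runs j from 0 up to
// i - 1, so the iteration space is the lower triangle. Running j
// all the way to n would swap every pair twice and compute a very
// expensive identity function.
void transpose(double *A, size_t n) {
  cilk_for (size_t i = 1; i < n; ++i) {   // parallel outer loop
    for (size_t j = 0; j < i; ++j) {      // serial inner loop
      double tmp = A[i*n + j];
      A[i*n + j] = A[j*n + i];
      A[j*n + i] = tmp;
    }
  }
}
```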
And here, what we've done is parallelize the outer loop. Which raises the question: how much work is in each iteration of this loop? How much time does it take to execute each iteration? For a given value of i, what does it cost us to execute the loop body? Yeah.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yes, theta of i. Which means that, if you think about this, if you've got a certain number of processors, you don't want to just chunk the loop up so that each processor gets an equal range of i to work on. You need something that's going to load balance. And this is where the Cilk technology is best: when there are these unbalanced things, because it does the right thing, as we'll see.

So let's talk a little bit about how loops are actually implemented by the Open Cilk compiler and runtime system. What happens is, we have this doubly nested loop here, but the only one we're interested in is the outer loop, basically. And the compiler creates this recursive program for the loop. What is this program doing? I'm highlighting, essentially, this part. This is basically the loop body here, which has been lifted into this recursive function. And what it's doing is finding a midpoint and then recursively calling itself on the two sides, until it gets down to, in this case, a one-element range of iterations. And then it executes the body of the loop, which in this case is itself a for loop, but not a parallel for loop. So it's doing divide and conquer. It's just basically tree splitting.

So basically, the loop has this control structure on top of it. And if I take a look at what's going on in the control, it looks something like this. This is using the DAG model that we saw before. What I have highlighted here is the lifted control, and down below, in the purple, I have the lifted body. And what it's doing is basically saying: divide the range into two parts, spawn one recursive call and call the other, and keep dividing like that until I get down to the base condition.
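As a hand-written sketch (the helper name and its exact shape are hypothetical; the compiler's generated code differs in detail), the lowering of the parallel outer loop looks roughly like this:

```c
#include <stddef.h>
#include <cilk/cilk.h>

// Divide-and-conquer lowering of the parallel outer loop of the
// transpose: split the index range at its midpoint, spawn one half,
// call the other, and run the serial loop body at one-iteration leaves.
static void transpose_helper(double *A, size_t n, size_t lo, size_t hi) {
  if (hi - lo > 1) {                             // more than one iteration left
    size_t mid = lo + (hi - lo) / 2;
    cilk_spawn transpose_helper(A, n, lo, mid);  // spawn one side...
    transpose_helper(A, n, mid, hi);             // ...call the other
    cilk_sync;                                   // join before returning
  } else {
    size_t i = lo;                               // base case: a single i
    for (size_t j = 0; j < i; ++j) {             // the serial inner loop
      double tmp = A[i*n + j];
      A[i*n + j] = A[j*n + i];
      A[j*n + i] = tmp;
    }
  }
}

// The cilk_for in transpose() behaves like transpose_helper(A, n, 1, n).
```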
And then the work that I'm doing in each iteration of the loop, as I've sort of illustrated here, is growing from 1 up to n. I'm showing it for 8, but in general, this runs from 1 to n. Is that clear? So that's what's actually going on. The Open Cilk runtime system does not have a loop primitive. It doesn't have loops. It only has, essentially, this ability to spawn and so forth. And so loops, effectively, are translated into this divide and conquer, and that's the way you need to think about loops when you're thinking in parallel. Make sense?

And so one of the questions is: that seems like a lot of code to write for a simple loop. What do we pay for that? How much does it cost us? So let's analyze this a little bit; let's analyze parallel loops. As you know, we analyze things in terms of work and span. So what is the work of this computation?

Well, first, what's the work of the original computation, the doubly nested serial loop? If you just think about it in terms of loops: the outer loop has n iterations, and in iteration i, you're doing theta of i work. Sum of i, for i equals 1 to n. What do you get?

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yes, theta of n squared, for the doubly nested loop. And although you're not doing half the work, you are doing the other half of the n squared work that you might think was there if you wrote the unfortunate identity function.

So the question is, how much work is in this particular computation? Because now I've got this whole tree-spawning business going on in addition to the work that I'm doing in the leaves.
So the leaf work here, along the bottom, is all going to be order n squared work, because that's the same as in the serial loop case. How much does that other stuff up top add? It looks huge. It's bigger than the other stuff, isn't it? How much is there? Basic computer science.

AUDIENCE: Theta of n?

CHARLES LEISERSON: Yeah, it's theta of n. Why is it theta of n in the upper part? Yep.

AUDIENCE: Because it's geometrically decreasing [INAUDIBLE]

CHARLES LEISERSON: Yeah. Going from the leaves to the root, every level is halving, so it's geometric. So the total is proportional to the number of leaves, because there's constant work in each of those internal nodes. So the total amount of work is theta of n squared.

Another way of thinking about this is, we've created a complete binary tree with, let's just say, n leaves. How many internal nodes are there in a complete binary tree with n leaves, that is, nodes that have children? There are exactly n minus 1. That's a property that's true of any full binary tree, that is, any binary tree in which every non-leaf has two children: there are exactly n minus 1 internal nodes. Nice tree properties, nice computer science properties, right? We like computer science. That's why we're here, not because we're going to make a lot of money.

OK. Let's look at the span of this. Hmm. What's the span of this calculation? Because that's how we understand parallelism: by understanding work and span.
I see some familiar hands. OK.

AUDIENCE: Theta n.

CHARLES LEISERSON: Theta of n. Yeah. How did you get that?

AUDIENCE: The largest path would be the [INAUDIBLE] node of size theta n and [INAUDIBLE]

CHARLES LEISERSON: Yeah. The longest path is basically going from the root down to the iteration of size 8, and then back up. And the 8 is really n in the general case. So going down, the span of the loop control is log n. And that's the key takeaway here: the span of loop control is log n. When I do divide and conquer like that, if I had an infinite number of processors, the control could all be done in logarithmic time. But the 8 there is linear; that's order n. In this case, n is 8. So the span is order n plus order log n, which is therefore order n.

So what's the parallelism here?

AUDIENCE: Theta n.

CHARLES LEISERSON: Theta of n. It's the ratio of the two. The ratio of the two is theta of n. Is that good?

AUDIENCE: Theta of n squared?

CHARLES LEISERSON: Well, parallelism of n squared, do you mean? Or, is this good parallelism? Yeah, that's pretty good. That's pretty good, because typically you're going to be working on systems that have maybe, if you're working on a big, big system, 64 cores or 128 cores or something. That's pretty big. Whereas this is saying, if you're using that many cores, you'd better have a problem that's really big to run on. And so typically, n is way bigger than the number of processors for a problem like this. Not always the case, but here it is. Any questions about this?
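Summarizing the outer-loop-parallel transpose in work/span terms:

```latex
\begin{aligned}
\text{Work:}\quad & T_1(n) = \Theta(n^2) && \text{(}\Theta(n^2)\text{ leaf work plus }\Theta(n)\text{ control)}\\
\text{Span:}\quad & T_\infty(n) = \Theta(n) + \Theta(\lg n) = \Theta(n) && \text{(longest leaf plus loop control)}\\
\text{Parallelism:}\quad & T_1(n)/T_\infty(n) = \Theta(n).
\end{aligned}
```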
So we can use our work and span analysis to understand that, hey, the work overhead is a constant factor; we're going to talk more about the overhead of work later. But basically, from an asymptotic point of view, our work is n squared, just like the original code, and we have a fair amount of parallelism: order n.

How about if we make the inner loop parallel as well? So rather than just parallelizing the outer loop, we're also going to parallelize the inner loop. How much work do we have in this situation? Hint: all work questions are trivial, or at least no harder than they were when you were doing ordinary serial algorithms. Maybe we can come up with a trick question on the exam where the work changes, but almost always, the work doesn't change. So what's the work? Yeah, n squared. Parallelizing stuff doesn't change the work. What it hopefully does is reduce the span of the calculation. And by reducing the span, we get big parallelism. That's the idea.

Now, sometimes it's the case that when you parallelize stuff you add work, and that's unfortunate, because it means that if you end up taking your parallel program and running it on one processing core, you're not going to get any speedup. It's going to be a slowdown compared to the original algorithm. So we're generally interested in work-efficient parallel algorithms, which we'll talk more about later. Generally, we're after work efficiency.

OK. What's the span of this?

AUDIENCE: Is it theta n still?

CHARLES LEISERSON: It is not still theta n. What was your thinking in saying it was theta of n?

AUDIENCE: So the path would be similar to 8, and then--

CHARLES LEISERSON: But now notice that the 8 is a for loop itself.

AUDIENCE: Yeah. I'm saying maybe you could extend the path another n so it would be 2n.

CHARLES LEISERSON: OK. Not quite, but this man is commendable.

[APPLAUSE]

Absolutely.
This is commendable, because this is why I try to have a bit of a Socratic method in here, where I'm asking questions as opposed to just sitting here lecturing and having it go over your heads. You have the opportunity to ask questions, and to have your particular misunderstandings or whatever corrected. That's how you learn. And so I'm really in favor of anybody who wants to come here and learn. That's my desire, and that's my job: to teach people who want to learn. So I hope that this is a safe space for you folks to be willing to put yourself out there and not necessarily get stuff right.

I can't tell you how many times I've screwed up, and it's only by airing it and so forth, and having somebody say, no, I don't think it's like that, Charles, it's like this, and me saying, oh yeah, you're right, God, that was stupid. But the fact is that I no longer beat my head when I'm being stupid. Our natural state is stupidity. We have to work hard not to be stupid. Right? It's hard work not to be stupid.

Yeah, question.

AUDIENCE: It's not really a question. My philosophy on talking in mid-lecture is that I don't want to waste other people's time.

CHARLES LEISERSON: Yeah, but usually, and let me tell you from experience, I've been at MIT almost 38 years: my experience is that when one person has a question, there are all these other people in the room who have the same question. And by you articulating it, you're actually helping them out. If things are going too slow, if we're wasting people's time, that's my job as the lecturer to make sure that doesn't happen. And I'll say, let's take this offline; we can talk after class. But I appreciate your point of view, because that's considerate.
But actually, it's more considerate if you're willing to air what you think and have other people say, you know, I had that same question. Certainly there are going to be people in the class who, say, roll their eyes or whatever. But look, I don't teach to the top 10%. I try to teach to the top 90%. And believe me--

[LAUGHTER]

Believe me that I get farther with students, and have more people enjoying the course and learning this stuff, which is not necessarily easy stuff. After the fact, you're going to discover this is easy. But while you're learning it, it's not easy. This is what Steven Pinker calls the curse of knowledge. Once you know something, you have a really hard time putting yourself in the position of what it was like to not know it. And so it's very easy to learn something, and then when somebody doesn't understand it, it's like, oh, whatever. But the fact of the matter is, it's that empathy that makes you a good communicator. And all of you, I know, are at some point going to have to communicate with other people who are not as technically sophisticated as you folks are. And so it's really good to appreciate how important it is to recognize that this stuff isn't necessarily easy when you're learning it. Later, you can learn it, and then it'll be easy. But that doesn't mean it's easy for somebody else.

So those of you who think that some of these answers are like, come on, move along, move along: please be patient with the other people in the class. If they learn better, they're going to be better teammates on projects and so forth, and we'll all learn. Nobody's in competition with anybody here, for grades or anything. Nobody's in competition. We set it up so you're going against benchmarks and so forth; you're not in competition. So we want to make this something where everybody helps everybody learn.
I probably spent too much time on that, but in some sense, not nearly enough. OK. So the span is not order n. We got that much. Who else would like to hazard a guess? OK.

AUDIENCE: Is it log n?

CHARLES LEISERSON: It is log n. What's your reasoning?

AUDIENCE: It's the normal log n from the time before, but since we're expanding the n--

CHARLES LEISERSON: Yup.

AUDIENCE: --again into another tree, it's log n plus log n.

CHARLES LEISERSON: Log n plus log n. Good.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: And then what about the leaves? You've got to add in the span of the leaves; that was just the span of the control.

AUDIENCE: The leaves are just 1.

CHARLES LEISERSON: The leaves are just 1. Boom. So the span of the outer loop control is order log n, the span of the inner loop control is order log n, and the span of the body is order 1, because at the body we're now just doing one iteration of serial execution. It's not doing i iterations; it's only doing one iteration. And so I add all that together, and I get log n. Does that make sense?

So the parallelism is? On this one, every hand in the room should be up, waving: call on me, call on me. Sure.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yeah, n squared over log n. That's the ratio of the two. Good. Any questions about that? OK. So the parallelism is n squared over log n, and this is more parallel than the previous one.
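Again in work/span terms, for the doubly parallel version:

```latex
\begin{aligned}
\text{Work:}\quad & T_1(n) = \Theta(n^2)\\
\text{Span:}\quad & T_\infty(n) =
  \underbrace{\Theta(\lg n)}_{\text{outer control}} +
  \underbrace{\Theta(\lg n)}_{\text{inner control}} +
  \underbrace{\Theta(1)}_{\text{body}} = \Theta(\lg n)\\
\text{Parallelism:}\quad & T_1(n)/T_\infty(n) = \Theta(n^2/\lg n).
\end{aligned}
```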
But it turns out, you've got to remember: even though it's more parallel, is it a better algorithm in practice? Not necessarily, because parallelism is like a thresholding thing. What you need is enough parallelism beyond the number of processors that you have: the parallel slackness, remember? If the amount of parallelism is much greater than the number of processors, you're good. So for something like this, if with order n parallelism you're already way beyond the number of processors, you don't need to parallelize the inner loop. You'll be fine.

And in fact, let's talk a little bit about overheads, and I'm going to do that with an example using vector addition. So here's a really simple piece of code: add two vectors, two arrays, together. All it does is add b into a; you can see, at every position, it adds b into a. And I'm going to parallelize this by putting a cilk_for in front, rather than an ordinary for. And what that does is give us this divide-and-conquer tree once again, with n leaves. The work here is order n, because we've got n iterations of constant time each. And the span is just the control: log n. And so the parallelism is n over log n. So this is basically easier than what we just did.
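A minimal sketch of the example, again assuming double arrays (the slide's types may differ):

```c
#include <stddef.h>
#include <cilk/cilk.h>

// Parallel vector addition: add b into a, elementwise.
void vadd(double *a, const double *b, size_t n) {
  cilk_for (size_t i = 0; i < n; ++i)
    a[i] += b[i];   // two loads, one add, one store per iteration
}
```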
Now, if I look at this, though, the work here includes some substantial overhead, because there are all these function calls. It may be order n, and that's good enough if you're a certain kind of theoretician. For this kind of theoretician, that's not good enough. I want to understand where these overheads are going. In this case, as I do the divide and conquer, if I go all the way down to ranges of size 1, what am I doing in a leaf? How much work is in one of these leaves? It's an add: two memory fetches, a memory store, and an add. The memory operations are going to be the most expensive things there. That's all that's going on. And yet, right before then, I've done a subroutine call, a parallel subroutine call, mind you, and that's going to have substantial overhead. And so the question is, do you do a subroutine call to add two numbers together? That's pretty expensive.

So let's take a look at how we can optimize away some of this overhead. And this gets more into the realm of engineering. The Open Cilk system has a pragma. A pragma is a compiler directive, a suggestion to the compiler, where it can suggest, in this case, that there be a grain size of G, for whatever you set G to. And the grain size shows up in the lowered code: instead of recursing down until the range has a single element (high greater than low plus 1), it recurses until the range has up to G elements (high greater than low plus G), so that at the leaves I have up to G iterations per chunk when I'm doing my divide and conquer. So therefore, I can amortize my subroutine overhead across G iterations rather than across one iteration. That's coarsening.

Now, if the grain size pragma is not specified, the Cilk runtime system makes its best guess to minimize the overhead. What it actually does at runtime is figure out for the loop how many cores it's running on, and make a good guess as to how much to run serially at the leaves and how much to do in parallel. Does that make sense? So it's basically trying to overcome that overhead automatically.
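A sketch of the coarsened lowering; the helper name is made up, G = 256 is an arbitrary illustrative value, and at source level the directive is written as a grainsize pragma placed just before the cilk_for (its exact spelling has varied across Cilk versions):

```c
#include <stddef.h>
#include <cilk/cilk.h>

#define G 256  // hypothetical grain size; the best value is machine-dependent

// Coarsened divide and conquer: recurse only while the range holds
// more than G iterations; otherwise run a serial loop at the leaf.
static void vadd_helper(double *a, const double *b, size_t lo, size_t hi) {
  if (hi - lo > G) {
    size_t mid = lo + (hi - lo) / 2;
    cilk_spawn vadd_helper(a, b, lo, mid);  // spawn one half...
    vadd_helper(a, b, mid, hi);             // ...call the other
    cilk_sync;
  } else {
    for (size_t i = lo; i < hi; ++i)  // up to G serial iterations per
      a[i] += b[i];                   // leaf, amortizing the spawn cost
  }
}
```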
819 00:43:50,400 --> 00:43:54,470 So if I look at the work in this context, 820 00:43:54,470 --> 00:44:02,280 I can view it as I've got T1 work, which 821 00:44:02,280 --> 00:44:07,920 is i here times the number of iterations, n-- 822 00:44:07,920 --> 00:44:12,120 because I've got one, two, three, up to n iterations. 823 00:44:12,120 --> 00:44:15,062 And then I have-- 824 00:44:15,062 --> 00:44:16,770 and those are just the normal iterations. 825 00:44:16,770 --> 00:44:21,210 And then, since there are n over G 826 00:44:21,210 --> 00:44:24,870 leaves here of size G, I have n 827 00:44:24,870 --> 00:44:29,130 over G minus 1 internal nodes, which 828 00:44:29,130 --> 00:44:30,390 are my subroutine overhead. 829 00:44:30,390 --> 00:44:35,610 That's S. So the total work is n times i plus n over G 830 00:44:35,610 --> 00:44:38,460 minus 1 times S. 831 00:44:38,460 --> 00:44:42,390 Now, in the original code, effectively, the work is what? 832 00:44:45,320 --> 00:44:47,360 If I had the code without the Cilk 833 00:44:47,360 --> 00:44:56,640 for loop, how much work is there before I put in all 834 00:44:56,640 --> 00:44:59,490 this parallel control stuff? 835 00:44:59,490 --> 00:45:00,460 What would the work be? 836 00:45:00,460 --> 00:45:00,960 Yeah. 837 00:45:00,960 --> 00:45:02,100 AUDIENCE: n i? 838 00:45:02,100 --> 00:45:03,450 CHARLES LEISERSON: n times i. 839 00:45:03,450 --> 00:45:05,378 We're just doing n iterations. 840 00:45:05,378 --> 00:45:07,170 Yeah, there's a little bit of loop control, 841 00:45:07,170 --> 00:45:10,600 but that loop control is really cheap. 842 00:45:10,600 --> 00:45:15,450 And on a modern out-of-order processor, 843 00:45:15,450 --> 00:45:17,640 the cost of incrementing a variable 844 00:45:17,640 --> 00:45:21,780 and testing against its bound is dwarfed by the stuff going on 845 00:45:21,780 --> 00:45:24,060 inside the loop. 846 00:45:24,060 --> 00:45:25,290 So it's ni. 847 00:45:25,290 --> 00:45:27,480 So this part here-- 848 00:45:27,480 --> 00:45:29,010 oops, what did I do? 849 00:45:29,010 --> 00:45:30,110 Oops, I went back. 850 00:45:30,110 --> 00:45:32,100 I see. 851 00:45:32,100 --> 00:45:35,570 So this part here-- 852 00:45:35,570 --> 00:45:37,700 this part here, there we go-- 853 00:45:37,700 --> 00:45:40,160 is all overhead. 854 00:45:40,160 --> 00:45:41,870 And this part here 855 00:45:41,870 --> 00:45:44,118 is what it cost me originally. 856 00:45:47,170 --> 00:45:53,040 So let's take a look at the span of this. 857 00:45:53,040 --> 00:45:59,940 So the span is going to be, well, 858 00:45:59,940 --> 00:46:05,080 if I add up what's at the leaves, that's just G times i. 859 00:46:05,080 --> 00:46:08,100 And now I've got the overhead here 860 00:46:08,100 --> 00:46:11,890 for any of these paths, which is basically proportional-- 861 00:46:11,890 --> 00:46:14,730 I'm ignoring constants here to make it easier-- 862 00:46:14,730 --> 00:46:19,650 to log of n over G times S, because it's going through log levels. 863 00:46:19,650 --> 00:46:22,890 And I've got n over G chunks, because 864 00:46:22,890 --> 00:46:26,340 I've got G iterations at each leaf, 865 00:46:26,340 --> 00:46:30,240 so therefore the number of leaves is n over G. 866 00:46:30,240 --> 00:46:34,990 And I've got n minus 1 of those-- 867 00:46:34,990 --> 00:46:38,220 sorry, log of n over G of those-- actually, 2 log 868 00:46:38,220 --> 00:46:43,020 of n over G of those times S. Actually, maybe I don't.
869 00:46:43,020 --> 00:46:44,520 Maybe I just have log of n over G, because I'm 870 00:46:44,520 --> 00:46:46,350 going to count it going down and going up. 871 00:46:46,350 --> 00:46:49,260 So actually, a constant of 1 is fine. 872 00:46:49,260 --> 00:46:51,760 Who's confused? 873 00:46:51,760 --> 00:46:52,410 OK. 874 00:46:52,410 --> 00:46:54,610 Let's ask some questions. 875 00:46:54,610 --> 00:46:57,610 You have a question? 876 00:46:57,610 --> 00:46:59,020 I know you're confused. 877 00:46:59,020 --> 00:47:00,760 Believe me, I spend-- 878 00:47:00,760 --> 00:47:03,010 one of my great successes in life 879 00:47:03,010 --> 00:47:08,440 was discovering that, oh, confusion is how I usually am. 880 00:47:08,440 --> 00:47:13,930 And then it's getting unconfused-- 881 00:47:13,930 --> 00:47:16,150 that's the thing, because I see so many people going 882 00:47:16,150 --> 00:47:18,490 through life thinking they're not confused, 883 00:47:18,490 --> 00:47:22,240 but you know what, they're confused. 884 00:47:22,240 --> 00:47:24,100 And that's a worse state of affairs 885 00:47:24,100 --> 00:47:26,448 to be in than knowing that you're confused. 886 00:47:26,448 --> 00:47:27,490 Let's ask some questions. 887 00:47:27,490 --> 00:47:29,920 People who are confused, let's ask some questions, 888 00:47:29,920 --> 00:47:33,650 because I want to make sure that everybody gets this. 889 00:47:33,650 --> 00:47:38,170 And for those of you who think you know it already, sometimes 890 00:47:38,170 --> 00:47:40,360 it helps to know it a little bit even better 891 00:47:40,360 --> 00:47:42,680 when we go through a discussion like this. 892 00:47:42,680 --> 00:47:43,930 So somebody ask me a question. 893 00:47:46,700 --> 00:47:47,555 Yes. 894 00:47:47,555 --> 00:47:50,350 AUDIENCE: Could you explain the second half of that [INAUDIBLE] 895 00:47:50,350 --> 00:47:51,110 CHARLES LEISERSON: Yeah. 896 00:47:51,110 --> 00:47:51,610 OK. 897 00:47:51,610 --> 00:47:53,390 The second half of the work part. 898 00:47:53,390 --> 00:47:54,020 OK. 899 00:47:54,020 --> 00:47:56,900 So the second half of the work part, n over G minus 1. 900 00:47:56,900 --> 00:47:59,630 So the first thing is, if I've got G iterations 901 00:47:59,630 --> 00:48:02,700 at the leaves of a binary tree, how many leaves 902 00:48:02,700 --> 00:48:07,288 do I have if I've got a total of n iterations? 903 00:48:07,288 --> 00:48:08,600 AUDIENCE: Is it n over G? 904 00:48:08,600 --> 00:48:11,470 CHARLES LEISERSON: n over G. That's the first thing. 905 00:48:11,470 --> 00:48:14,390 The second thing is a fact about binary trees-- 906 00:48:14,390 --> 00:48:17,520 about any full binary tree, but in particular complete binary 907 00:48:17,520 --> 00:48:18,150 trees. 908 00:48:18,150 --> 00:48:19,800 How many internal nodes are there 909 00:48:19,800 --> 00:48:22,770 in a complete binary tree? 910 00:48:22,770 --> 00:48:25,410 If n is the number of leaves, it's n minus 1. 911 00:48:25,410 --> 00:48:28,710 Here, the number of leaves is n over G, 912 00:48:28,710 --> 00:48:31,215 so it's n over G minus 1. 913 00:48:31,215 --> 00:48:34,230 Does that clear up something for some people? 914 00:48:34,230 --> 00:48:36,420 OK, good. 915 00:48:36,420 --> 00:48:37,380 So that's where that comes from. 916 00:48:37,380 --> 00:48:39,990 And at each of those, I've got to do those three 917 00:48:39,990 --> 00:48:44,890 colorful operations, which is what I'm calling S. 918 00:48:44,890 --> 00:48:46,560 So you got the work down? 919 00:48:46,560 --> 00:48:47,160 OK.
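Collecting the formulas just derived in one place, using the lecture's variables (constants elided, as in the lecture):

```latex
% i = time per iteration, G = grain size, S = spawn/return overhead
T_1 = n\,i + \left(\frac{n}{G} - 1\right) S
\qquad\text{and}\qquad
T_\infty = G\,i + \Theta\!\left(\log\frac{n}{G}\right) S
```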
920 00:48:47,160 --> 00:48:49,140 Who has a question about span? 921 00:48:49,140 --> 00:48:50,618 Span's my favorite. 922 00:48:53,370 --> 00:48:56,010 Work is good, right? 923 00:48:56,010 --> 00:49:00,060 Work is more important, actually, in most contexts. 924 00:49:00,060 --> 00:49:03,170 But span is so cool. 925 00:49:03,170 --> 00:49:04,868 Yeah. 926 00:49:04,868 --> 00:49:08,756 AUDIENCE: What did you mean when you said [INAUDIBLE] 927 00:49:10,242 --> 00:49:12,200 CHARLES LEISERSON: So what I was saying-- well, 928 00:49:12,200 --> 00:49:13,070 I think what I was saying-- 929 00:49:13,070 --> 00:49:15,480 I think I was mis-saying something, probably, there. 930 00:49:15,480 --> 00:49:17,630 But the point is that the span is basically 931 00:49:17,630 --> 00:49:21,740 starting at the top here, and taking any path down to a leaf 932 00:49:21,740 --> 00:49:24,890 and then going back up. 933 00:49:24,890 --> 00:49:26,900 And so if I look at that, that's going 934 00:49:26,900 --> 00:49:30,050 to be log of the number of leaves. 935 00:49:30,050 --> 00:49:33,860 Well, the number of leaves, as we agreed, was n over G. 936 00:49:33,860 --> 00:49:36,800 And then each of those is, at most, 937 00:49:36,800 --> 00:49:42,200 S to do the subroutine calling and so forth-- the bookkeeping 938 00:49:42,200 --> 00:49:44,480 that's in that node. 939 00:49:44,480 --> 00:49:46,820 That make sense? 940 00:49:46,820 --> 00:49:48,290 Still didn't answer the question? 941 00:49:48,290 --> 00:49:48,790 Or-- 942 00:49:48,790 --> 00:49:52,922 AUDIENCE: Why is that the span? 943 00:49:52,922 --> 00:49:55,140 Why shouldn't it be [INAUDIBLE] 944 00:49:55,140 --> 00:49:57,140 CHARLES LEISERSON: It could be any of the paths. 945 00:49:57,140 --> 00:50:00,950 But take a look at all the paths-- go down, and back up. 946 00:50:00,950 --> 00:50:03,770 There's no path that's going down and around 947 00:50:03,770 --> 00:50:04,980 and up and so forth. 948 00:50:04,980 --> 00:50:06,590 This is a DAG. 949 00:50:06,590 --> 00:50:08,900 So just look at the directions of the arrows. 950 00:50:08,900 --> 00:50:11,480 You've got to follow the directions of the arrows. 951 00:50:11,480 --> 00:50:13,380 You can't go down and up arbitrarily. 952 00:50:13,380 --> 00:50:18,670 You're either going down, or you've started back up. 953 00:50:18,670 --> 00:50:21,610 So it's always going to be, essentially, 954 00:50:21,610 --> 00:50:23,590 down through a set of subroutines and back up 955 00:50:23,590 --> 00:50:25,402 through a set of subroutines. 956 00:50:25,402 --> 00:50:26,870 Does that make sense? 957 00:50:26,870 --> 00:50:31,090 And if you think about the code, the recursive code, what's 958 00:50:31,090 --> 00:50:33,040 happening when you do divide and conquer? 959 00:50:33,040 --> 00:50:34,960 If you were operating with a stack, 960 00:50:34,960 --> 00:50:39,070 how many things would get stacked up and then unstacked? 961 00:50:39,070 --> 00:50:43,930 So the path down and back up would also 962 00:50:43,930 --> 00:50:46,060 be logarithmic at most. 963 00:50:49,336 --> 00:50:51,200 Does that make sense? 964 00:50:51,200 --> 00:50:52,750 So I don't have a dependency-- 965 00:50:52,750 --> 00:50:57,150 if I had one subtree here, for example, dependent on-- 966 00:50:57,150 --> 00:51:00,780 oops, that's not the mode I want to be in-- so one 967 00:51:00,780 --> 00:51:03,880 subtree here dependent on another subtree, 968 00:51:03,880 --> 00:51:07,818 then indeed, the span would grow.
969 00:51:07,818 --> 00:51:10,360 But the whole point is to make 970 00:51:10,360 --> 00:51:12,070 these two things independent, so I 971 00:51:12,070 --> 00:51:14,620 can run them at the same time. 972 00:51:14,620 --> 00:51:17,870 So there's no dependency there. 973 00:51:17,870 --> 00:51:19,970 We good? 974 00:51:19,970 --> 00:51:22,700 OK. 975 00:51:22,700 --> 00:51:25,820 So here I have the work and the span. 976 00:51:25,820 --> 00:51:28,490 I have two things I want out of this. 977 00:51:28,490 --> 00:51:31,740 Number one, I want the work to be small. 978 00:51:31,740 --> 00:51:37,340 I want the work to be close to n times 979 00:51:37,340 --> 00:51:43,310 i, the work of the ordinary serial algorithm. 980 00:51:43,310 --> 00:51:46,130 And I want the span to be small, so it's as parallel 981 00:51:46,130 --> 00:51:47,300 as possible. 982 00:51:47,300 --> 00:51:50,300 Those things are working in opposite directions, 983 00:51:50,300 --> 00:51:53,750 because if you look, the dominant term 984 00:51:53,750 --> 00:52:01,400 involving G in the first equation has G dividing n. 985 00:52:01,400 --> 00:52:08,190 So if I want the work to be small, I want G to be what? 986 00:52:08,190 --> 00:52:08,690 Big. 987 00:52:11,270 --> 00:52:15,050 The dominant term involving G in the span 988 00:52:15,050 --> 00:52:18,800 is the G multiplied by the i. 989 00:52:18,800 --> 00:52:22,410 There is another term there, but that's a lower-order term. 990 00:52:22,410 --> 00:52:29,930 So if I want the span to be small, I want G to be small. 991 00:52:29,930 --> 00:52:33,410 They're going in opposite directions. 992 00:52:33,410 --> 00:52:37,540 So what we're interested in is 993 00:52:37,540 --> 00:52:39,950 finding a happy medium. 994 00:52:39,950 --> 00:52:42,040 We want G to be-- 995 00:52:42,040 --> 00:52:45,760 and in particular, if you look at this, what I want 996 00:52:45,760 --> 00:52:48,910 is G to be at least S over i. 997 00:52:48,910 --> 00:52:49,610 Why? 998 00:52:49,610 --> 00:52:52,975 If I make G be much bigger than S over i-- 999 00:52:55,670 --> 00:52:58,820 so if G is bigger than S over i-- 1000 00:52:58,820 --> 00:53:02,870 then the term multiplied by S ends up 1001 00:53:02,870 --> 00:53:06,160 being much less than the n times i term. 1002 00:53:06,160 --> 00:53:07,205 You see that? 1003 00:53:07,205 --> 00:53:07,830 That's algebra. 1004 00:53:11,770 --> 00:53:15,606 So do you see that if I make G 1005 00:53:15,606 --> 00:53:18,457 much greater than S over i-- 1006 00:53:18,457 --> 00:53:19,540 so get rid of the minus 1. 1007 00:53:19,540 --> 00:53:21,670 That doesn't matter. 1008 00:53:21,670 --> 00:53:28,210 The overhead term is really n times S over G, and S over G 1009 00:53:28,210 --> 00:53:32,630 is basically much smaller than i. 1010 00:53:32,630 --> 00:53:34,790 So I end up with an overhead that is 1011 00:53:34,790 --> 00:53:38,530 much smaller than n i. 1012 00:53:38,530 --> 00:53:41,470 Does that make sense? 1013 00:53:41,470 --> 00:53:42,130 OK. 1014 00:53:42,130 --> 00:53:43,430 How are we doing on time? 1015 00:53:43,430 --> 00:53:44,170 OK. 1016 00:53:44,170 --> 00:53:45,628 I'm going to get through everything 1017 00:53:45,628 --> 00:53:48,130 that I expect to get through, despite my rant. 1018 00:53:52,450 --> 00:53:52,950 OK. 1019 00:53:52,950 --> 00:53:53,825 Does that make sense? 1020 00:53:53,825 --> 00:53:56,560 We want G to be much greater than S over i.
1021 00:53:56,560 --> 00:53:59,350 Then the overhead is going to be small, 1022 00:53:59,350 --> 00:54:03,007 because I'm going to do a whole bunch of iterations that 1023 00:54:03,007 --> 00:54:05,340 are going to make it so that that function call was just 1024 00:54:05,340 --> 00:54:09,000 like, eh, who cares? 1025 00:54:09,000 --> 00:54:09,780 That's the idea. 1026 00:54:14,340 --> 00:54:15,360 So that's the goal. 1027 00:54:15,360 --> 00:54:17,648 So let's take a look at-- 1028 00:54:17,648 --> 00:54:18,690 let's see, what was the-- 1029 00:54:27,580 --> 00:54:28,480 let me just see here. 1030 00:54:28,480 --> 00:54:32,580 Somehow I feel like I have something out of order 1031 00:54:32,580 --> 00:54:45,922 here, because now I have the other implementation. 1032 00:54:48,796 --> 00:54:49,296 Huh. 1033 00:54:55,570 --> 00:54:56,090 OK. 1034 00:54:56,090 --> 00:54:58,170 I think-- maybe that is where I left it. 1035 00:54:58,170 --> 00:54:58,670 OK. 1036 00:54:58,670 --> 00:55:00,410 I think we come back to this. 1037 00:55:00,410 --> 00:55:01,030 Let me see. 1038 00:55:01,030 --> 00:55:02,030 I'm going to lecture on. 1039 00:55:07,110 --> 00:55:13,770 So here's another implementation of the for loop 1040 00:55:13,770 --> 00:55:15,210 to add two vectors. 1041 00:55:15,210 --> 00:55:17,490 And what this is going to use as a subroutine 1042 00:55:17,490 --> 00:55:20,340 is this operator called v add, 1043 00:55:20,340 --> 00:55:24,120 which itself does just a serial vector add. 1044 00:55:24,120 --> 00:55:29,310 And now what I'm going to do is run through the loop here 1045 00:55:29,310 --> 00:55:34,470 and spawn off additions-- 1046 00:55:34,470 --> 00:55:38,010 and the min there is just for a boundary condition. 1047 00:55:38,010 --> 00:55:43,630 I'm going to spawn off things in groups of G. 1048 00:55:43,630 --> 00:55:47,590 So I spawn off a vector add of size G, another vector 1049 00:55:47,590 --> 00:55:52,810 add of size G, another vector add of size G, jumping ahead by G each time. 1050 00:55:52,810 --> 00:55:55,070 So let's take a look at the analysis of that. 1051 00:55:55,070 --> 00:55:58,930 So now what I've got is, I've got G iterations, each of which 1052 00:55:58,930 --> 00:56:00,940 costs me i. 1053 00:56:00,940 --> 00:56:03,520 And this is the DAG structure I've 1054 00:56:03,520 --> 00:56:07,540 got, because the for loop here that 1055 00:56:07,540 --> 00:56:10,360 has the Cilk spawn in it is going along, 1056 00:56:10,360 --> 00:56:16,330 and notice that the Cilk spawn is in a loop. 1057 00:56:16,330 --> 00:56:19,550 And so it's basically going-- it's spawning off G iterations. 1058 00:56:19,550 --> 00:56:23,170 So it's spawning off the vector add, which 1059 00:56:23,170 --> 00:56:25,330 is going to do G iterations-- 1060 00:56:25,330 --> 00:56:28,900 because I'm passing it basically G-- let's 1061 00:56:28,900 --> 00:56:30,610 not worry about the boundary case. 1062 00:56:30,610 --> 00:56:34,220 And then spawn off G, spawn off G, spawn off G, and so forth. 1063 00:56:34,220 --> 00:56:37,476 So what's the work of this? 1064 00:56:37,476 --> 00:56:38,434 Let's see. 1065 00:56:41,310 --> 00:56:44,170 Well, let's make things easy to begin with. 1066 00:56:44,170 --> 00:56:48,595 Let's assume G is 1 and analyze it.
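Before the analysis, here is a hedged sketch of this chunked implementation-- the names and types are assumptions, not the slide's exact code:

```c
#include <cilk/cilk.h>

// Plain serial vector add, used as the leaf routine.
static void v_add(double *a, const double *b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}

void vadd_chunked(double *a, const double *b, int n, int G) {
    for (int j = 0; j < n; j += G)               // jump ahead by G each time
        cilk_spawn v_add(a + j, b + j,
                         n - j < G ? n - j : G); // the min handles the boundary
    cilk_sync;                                   // wait for all spawned chunks
}
```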
1067 00:56:48,595 --> 00:56:50,220 And this is a common thing, by the way-- 1068 00:56:50,220 --> 00:56:52,530 you assume that the grain size is 1 1069 00:56:52,530 --> 00:56:55,170 and analyze it, and then, as a practical matter, 1070 00:56:55,170 --> 00:56:58,390 coarsen it to make it more efficient. 1071 00:56:58,390 --> 00:57:00,492 So if G is 1, what's the work of this? 1072 00:57:14,760 --> 00:57:16,036 Yeah. 1073 00:57:16,036 --> 00:57:18,680 AUDIENCE: [INAUDIBLE] 1074 00:57:18,680 --> 00:57:19,180 CHARLES LEISERSON: Yeah. 1075 00:57:19,180 --> 00:57:22,810 It was order n, because those other two things are constant. 1076 00:57:22,810 --> 00:57:24,790 So exactly right. 1077 00:57:24,790 --> 00:57:27,220 It's order n. 1078 00:57:27,220 --> 00:57:31,180 In fact, this is a technique, by the way, 1079 00:57:31,180 --> 00:57:33,880 that's called strip mining, if you take away 1080 00:57:33,880 --> 00:57:36,970 the parallel thing: you take a loop of length n, 1081 00:57:36,970 --> 00:57:39,010 and you really make nested loops out of it-- 1082 00:57:39,010 --> 00:57:44,200 one that has n over G iterations and one that has G iterations-- 1083 00:57:44,200 --> 00:57:47,290 and you're going through exactly the same stuff. 1084 00:57:47,290 --> 00:57:49,512 That's the same as going through n iterations. 1085 00:57:49,512 --> 00:57:51,220 But you're replacing a singly-nested loop 1086 00:57:51,220 --> 00:57:52,528 by a doubly-nested loop. 1087 00:57:52,528 --> 00:57:54,820 And the only difference here is that in the inner loop, 1088 00:57:54,820 --> 00:57:56,320 I'm actually spawning off work. 1089 00:57:59,970 --> 00:58:04,260 So here, the work is order n, because 1090 00:58:04,260 --> 00:58:07,080 if G is 1, 1091 00:58:07,080 --> 00:58:09,990 then I'm spawning off one piece of work at a time, 1092 00:58:09,990 --> 00:58:12,900 going from 0 up to n minus 1. 1093 00:58:12,900 --> 00:58:14,550 So I've got order n work up here, 1094 00:58:14,550 --> 00:58:18,330 and order n work down below. 1095 00:58:18,330 --> 00:58:19,680 What's the span for this? 1096 00:58:22,870 --> 00:58:24,970 After all, I've got n spans there now-- 1097 00:58:24,970 --> 00:58:27,780 sorry, n spawns, not n spans. 1098 00:58:27,780 --> 00:58:28,280 n spawns. 1099 00:58:35,640 --> 00:58:37,740 What's the span going to be? 1100 00:58:37,740 --> 00:58:38,506 Yeah. 1101 00:58:38,506 --> 00:58:39,500 AUDIENCE: [INAUDIBLE] 1102 00:58:39,500 --> 00:58:41,250 CHARLES LEISERSON: Sorry? 1103 00:58:41,250 --> 00:58:43,305 Sorry, I couldn't hear-- 1104 00:58:43,305 --> 00:58:44,180 AUDIENCE: [INAUDIBLE] 1105 00:58:44,180 --> 00:58:45,810 CHARLES LEISERSON: Theta S? 1106 00:58:45,810 --> 00:58:48,060 No, it's bigger than that. 1107 00:58:48,060 --> 00:58:50,250 Yeah, you'd think, gee, I just have to do 1108 00:58:50,250 --> 00:58:52,050 one thing to go down and up. 1109 00:58:52,050 --> 00:58:54,791 But the span is the longest path in the whole DAG. 1110 00:59:06,580 --> 00:59:08,300 It's the longest path in the whole DAG. 1111 00:59:10,910 --> 00:59:14,880 Where does the longest path in the whole DAG start? 1112 00:59:14,880 --> 00:59:16,350 Upper left, right? 1113 00:59:16,350 --> 00:59:18,810 And where does it end? 1114 00:59:18,810 --> 00:59:19,380 Upper right. 1115 00:59:22,150 --> 00:59:25,330 How long is that path? 1116 00:59:25,330 --> 00:59:26,830 What's the longest one?
1117 00:59:26,830 --> 00:59:30,640 It's going to go all the way down the backbone of the top 1118 00:59:30,640 --> 00:59:34,850 there, and then flip down and back up. 1119 00:59:34,850 --> 00:59:36,550 So how many things are in the-- 1120 00:59:36,550 --> 00:59:40,720 if G is 1, how many things am I spawning off there? 1121 00:59:40,720 --> 00:59:42,674 n things, so the span is? 1122 00:59:48,470 --> 00:59:49,000 Order n? 1123 00:59:51,770 --> 00:59:52,340 So order n. 1124 00:59:52,340 --> 00:59:54,080 It's long. 1125 00:59:54,080 --> 00:59:55,542 So what's the parallelism here? 1126 01:00:01,326 --> 01:00:02,930 AUDIENCE: [INAUDIBLE] 1127 01:00:02,930 --> 01:00:03,930 CHARLES LEISERSON: Yeah. 1128 01:00:03,930 --> 01:00:05,940 It's order 1. 1129 01:00:05,940 --> 01:00:07,644 And what do we call that? 1130 01:00:07,644 --> 01:00:08,792 AUDIENCE: Bad. 1131 01:00:08,792 --> 01:00:09,750 CHARLES LEISERSON: Bad. 1132 01:00:09,750 --> 01:00:11,060 Right. 1133 01:00:11,060 --> 01:00:15,060 But there's a more technical name. 1134 01:00:15,060 --> 01:00:16,758 They call that puny. 1135 01:00:16,758 --> 01:00:19,098 [LAUGHTER] 1136 01:00:19,652 --> 01:00:21,360 It's like, we went through all this work, 1137 01:00:21,360 --> 01:00:24,540 spawned off all that stuff, added all this overhead, 1138 01:00:24,540 --> 01:00:26,040 and it didn't go any faster. 1139 01:00:26,040 --> 01:00:27,540 I can't tell you how many times I've 1140 01:00:27,540 --> 01:00:32,640 seen people do this when they start parallel programming. 1141 01:00:32,640 --> 01:00:35,990 Oh, but I spawned off all this stuff! 1142 01:00:35,990 --> 01:00:38,075 Yeah, but you didn't reduce the span. 1143 01:00:45,710 --> 01:00:51,560 Let's now-- that was analyzing it 1144 01:00:51,560 --> 01:00:55,020 in terms of G equals 1. 1145 01:00:55,020 --> 01:00:58,220 Now let's increase the grain size 1146 01:00:58,220 --> 01:01:01,490 and analyze it in terms of G. So once again, 1147 01:01:01,490 --> 01:01:04,700 what's the work now? 1148 01:01:04,700 --> 01:01:06,050 Work is always a gimme. 1149 01:01:11,230 --> 01:01:11,950 Yeah. 1150 01:01:11,950 --> 01:01:13,117 AUDIENCE: Same as before, n. 1151 01:01:13,117 --> 01:01:13,992 CHARLES LEISERSON: n. 1152 01:01:13,992 --> 01:01:14,790 Same as before. n. 1153 01:01:14,790 --> 01:01:17,415 The work doesn't change when you parallelize things differently 1154 01:01:17,415 --> 01:01:19,360 and stuff like that. 1155 01:01:19,360 --> 01:01:21,885 I'm doing order n iterations. 1156 01:01:21,885 --> 01:01:22,885 Oh, but what's the span? 1157 01:01:26,765 --> 01:01:27,640 This is a tricky one. 1158 01:01:27,640 --> 01:01:28,330 Yeah. 1159 01:01:28,330 --> 01:01:29,590 AUDIENCE: n over G. 1160 01:01:29,590 --> 01:01:31,600 CHARLES LEISERSON: Close. 1161 01:01:31,600 --> 01:01:33,085 That's half right. 1162 01:01:36,560 --> 01:01:39,382 That's half right. 1163 01:01:39,382 --> 01:01:41,830 Good. 1164 01:01:41,830 --> 01:01:43,300 That's half right. 1165 01:01:43,300 --> 01:01:44,042 Yeah. 1166 01:01:44,042 --> 01:01:45,430 AUDIENCE: [INAUDIBLE] 1167 01:01:45,430 --> 01:01:48,130 CHARLES LEISERSON: n over G plus G. 1168 01:01:48,130 --> 01:01:49,420 Don't forget that other term. 1169 01:01:49,420 --> 01:01:55,000 So the path that we care about goes along the top 1170 01:01:55,000 --> 01:01:57,730 here, and then goes down there. 1171 01:01:57,730 --> 01:02:04,630 And this has span G. So we've got n over G 1172 01:02:04,630 --> 01:02:07,270 here, because I'm doing chunks of G, plus G.
1173 01:02:07,270 --> 01:02:13,030 So it's G plus n over G. And now, how can I 1174 01:02:13,030 --> 01:02:15,800 choose G to minimize the span? 1175 01:02:15,800 --> 01:02:18,220 There's nothing to choose to minimize the work, 1176 01:02:18,220 --> 01:02:20,690 except there's some work overhead 1177 01:02:20,690 --> 01:02:21,690 that we're trying to reduce. 1178 01:02:21,690 --> 01:02:25,044 But how can I choose G to minimize the span? 1179 01:02:28,700 --> 01:02:31,230 What's the best value for G here? 1180 01:02:31,230 --> 01:02:31,730 Yeah. 1181 01:02:31,730 --> 01:02:32,610 AUDIENCE: [INAUDIBLE] 1182 01:02:32,610 --> 01:02:33,860 CHARLES LEISERSON: You got it. 1183 01:02:33,860 --> 01:02:36,300 Square root of n. 1184 01:02:36,300 --> 01:02:40,640 So one of these is increasing. 1185 01:02:40,640 --> 01:02:43,590 If G is increasing, n over G is decreasing, 1186 01:02:43,590 --> 01:02:45,180 so where do they cross? 1187 01:02:45,180 --> 01:02:46,680 When they're equal. 1188 01:02:46,680 --> 01:02:52,410 That's when G equals n over G, or G is square root of n. 1189 01:02:54,966 --> 01:02:57,880 So this actually has decent parallelism-- 1190 01:02:57,880 --> 01:03:01,970 for n big enough, square root of n, that's not bad. 1191 01:03:01,970 --> 01:03:04,985 So it is OK to spawn things off in chunks. 1192 01:03:04,985 --> 01:03:06,610 Just don't make the chunks real little. 1193 01:03:10,240 --> 01:03:12,702 What's the parallelism? 1194 01:03:12,702 --> 01:03:14,160 Once again, this is always a gimme. 1195 01:03:14,160 --> 01:03:15,267 It's the ratio. 1196 01:03:15,267 --> 01:03:16,100 So square root of n. 1197 01:03:19,170 --> 01:03:20,460 Quiz on parallel loops. 1198 01:03:25,180 --> 01:03:27,760 I'm going to let you folks do this offline. 1199 01:03:27,760 --> 01:03:28,935 Here's the answers. 1200 01:03:28,935 --> 01:03:30,600 If you quickly write it down, you 1201 01:03:30,600 --> 01:03:33,346 don't have to think about it. 1202 01:03:33,346 --> 01:03:36,280 [RAPID BREATHING] 1203 01:03:38,150 --> 01:03:38,650 OK. 1204 01:03:38,650 --> 01:03:40,542 [LAUGHTER] 1205 01:03:43,380 --> 01:03:45,610 OK. 1206 01:03:45,610 --> 01:03:47,310 So take a look at the notes afterwards, 1207 01:03:47,310 --> 01:03:51,840 and you can try to figure out why those things are so. 1208 01:03:51,840 --> 01:03:56,500 So there are some performance tips that make sense when 1209 01:03:56,500 --> 01:03:57,750 you're programming with loops. 1210 01:03:57,750 --> 01:04:01,450 One is, minimize the span to maximize the parallelism, 1211 01:04:01,450 --> 01:04:04,002 because the span's in the denominator. 1212 01:04:04,002 --> 01:04:05,460 And generally, you want to generate 1213 01:04:05,460 --> 01:04:08,220 10 times more parallelism than processors 1214 01:04:08,220 --> 01:04:11,082 if you want near-perfect linear speed-up. 1215 01:04:11,082 --> 01:04:13,290 So if you have a lot more parallelism than the number 1216 01:04:13,290 --> 01:04:15,630 of processors-- we talked about that last time-- 1217 01:04:15,630 --> 01:04:17,543 you get good speed-up. 1218 01:04:17,543 --> 01:04:18,960 If you have plenty of parallelism, 1219 01:04:18,960 --> 01:04:23,580 try to trade some of it off to reduce the work overhead. 1220 01:04:23,580 --> 01:04:25,710 So the idea is, for any of these things, 1221 01:04:25,710 --> 01:04:31,050 you can fiddle with the numbers, the grain size 1222 01:04:31,050 --> 01:04:32,920 in particular-- 1223 01:04:32,920 --> 01:04:36,720 it reduces the parallelism, but it also reduces the overhead.
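As a worked instance of that tradeoff, take the chunked vector add above: ignoring the constants i and S as before, the span is minimized where its two terms cross:

```latex
T_\infty(G) \approx G + \frac{n}{G},
\qquad
G = \frac{n}{G} \;\Longrightarrow\; G = \sqrt{n},
\qquad
\frac{T_1}{T_\infty} = \Theta\!\left(\frac{n}{\sqrt{n}}\right) = \Theta\!\left(\sqrt{n}\right).
```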
1224 01:04:36,720 --> 01:04:39,810 And as long as you've got sufficient parallelism, 1225 01:04:39,810 --> 01:04:42,870 your code is going to run just fine in parallel. 1226 01:04:42,870 --> 01:04:45,072 It's only when you're in the place where, 1227 01:04:45,072 --> 01:04:46,530 ooh, I don't have enough parallelism, 1228 01:04:46,530 --> 01:04:48,210 and I don't want to pay the overhead-- 1229 01:04:48,210 --> 01:04:50,340 those are the sticky ones. 1230 01:04:50,340 --> 01:04:53,760 But most of the time, you're going to be in the case 1231 01:04:53,760 --> 01:04:55,800 where you've got way more parallelism than you 1232 01:04:55,800 --> 01:04:58,710 need, and the question is, how can you reduce some of it 1233 01:04:58,710 --> 01:05:02,580 in order to reduce the work overhead? 1234 01:05:02,580 --> 01:05:05,790 Generally, you should use divide and conquer recursion 1235 01:05:05,790 --> 01:05:08,940 or parallel loops, rather than spawning one small thing 1236 01:05:08,940 --> 01:05:10,480 after another. 1237 01:05:10,480 --> 01:05:12,150 So it's better to do the Cilk for, 1238 01:05:12,150 --> 01:05:15,030 which already is doing divide and conquer parallelism, 1239 01:05:15,030 --> 01:05:18,270 than doing the spawn off one thing at a time 1240 01:05:18,270 --> 01:05:22,950 type of strategy, unless you can chunk them 1241 01:05:22,950 --> 01:05:25,020 so that you have relatively few things 1242 01:05:25,020 --> 01:05:26,130 that you're spawning off. 1243 01:05:26,130 --> 01:05:27,340 This would be fine. 1244 01:05:27,340 --> 01:05:30,180 The thing I say not to do-- that would be fine if foo of i 1245 01:05:30,180 --> 01:05:32,920 was really expensive. 1246 01:05:32,920 --> 01:05:35,560 Fine, then we'd have lots of parallelism, 1247 01:05:35,560 --> 01:05:37,480 because there's a lot of work there. 1248 01:05:37,480 --> 01:05:42,460 But generally, it's better to do the divide and conquer. 1249 01:05:42,460 --> 01:05:44,740 Generally, you should make sure that the work 1250 01:05:44,740 --> 01:05:48,460 that you're doing per spawn is sufficiently large. 1251 01:05:48,460 --> 01:05:50,620 So the question with spawns is, well, how much 1252 01:05:50,620 --> 01:05:54,580 are you busting your work into in terms of chunks? 1253 01:05:54,580 --> 01:05:57,790 Because the spawn has an overhead, and so the question 1254 01:05:57,790 --> 01:05:59,770 is, well, how big is that? 1255 01:05:59,770 --> 01:06:02,620 And so you can coarsen by using function calls 1256 01:06:02,620 --> 01:06:04,720 and in-lining near the leaves. 1257 01:06:04,720 --> 01:06:06,970 It's generally better to parallelize outer loops as opposed 1258 01:06:06,970 --> 01:06:08,637 to inner loops, if you're forced to make 1259 01:06:08,637 --> 01:06:11,120 a choice, because with inner loops, 1260 01:06:11,120 --> 01:06:13,900 you're incurring the overhead every single time. 1261 01:06:13,900 --> 01:06:15,490 With the outer loop, you can amortize it 1262 01:06:15,490 --> 01:06:17,680 against the work that's going on inside 1263 01:06:17,680 --> 01:06:20,050 that doesn't have the overhead. 1264 01:06:20,050 --> 01:06:22,910 And watch out for scheduling overhead. 1265 01:06:22,910 --> 01:06:28,860 So here's an example of two codes that have parallelism 2, 1266 01:06:28,860 --> 01:06:31,710 and one of them is an efficient code, 1267 01:06:31,710 --> 01:06:35,240 and the other one is lousy code.
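A hedged reconstruction of that pair, not the slide verbatim-- the function f is a stand-in for the real loop body:

```c
#include <cilk/cilk.h>

void f(int i, int j);   // stand-in for the real loop body (assumed)

// Efficient: one parallel loop over 2 big iterations -- the scheduler
// gets involved once, and that cost is amortized over n calls to f.
void efficient(int n) {
    cilk_for (int i = 0; i < 2; ++i)
        for (int j = 0; j < n; ++j)
            f(i, j);
}

// Lousy: the parallel loop is on the inside -- the scheduler gets
// involved n times, once per iteration of the outer serial loop.
void lousy(int n) {
    for (int j = 0; j < n; ++j)
        cilk_for (int i = 0; i < 2; ++i)
            f(i, j);
}
```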
1268 01:06:35,240 --> 01:06:38,280 The top one is efficient, because I 1269 01:06:38,280 --> 01:06:43,920 have two iterations that I run in parallel, and each of them 1270 01:06:43,920 --> 01:06:47,160 does a lot of work. 1271 01:06:47,160 --> 01:06:50,100 There's only one scheduling operation that happens. 1272 01:06:50,100 --> 01:06:55,920 The bottom one, I have n iterations, 1273 01:06:55,920 --> 01:07:00,660 and each iteration only does work 2, 1274 01:07:00,660 --> 01:07:03,780 so I basically have n iterations' worth of overhead. 1275 01:07:03,780 --> 01:07:06,750 And so if you just look at these, look at the overhead, 1276 01:07:06,750 --> 01:07:08,490 you can see what the difference is. 1277 01:07:08,490 --> 01:07:09,458 OK. 1278 01:07:09,458 --> 01:07:11,250 Actually, I have 1279 01:07:11,250 --> 01:07:12,630 a whole bunch of things here 1280 01:07:12,630 --> 01:07:13,710 that I'm not going to get to, but I 1281 01:07:13,710 --> 01:07:14,920 didn't expect to get to them. 1282 01:07:14,920 --> 01:07:19,590 But I do want to get to some matrix multiplication. 1283 01:07:19,590 --> 01:07:22,690 People familiar with this problem? 1284 01:07:22,690 --> 01:07:23,730 OK. 1285 01:07:23,730 --> 01:07:25,230 We're going to assume for simplicity 1286 01:07:25,230 --> 01:07:27,330 that n is a power of 2. 1287 01:07:27,330 --> 01:07:29,940 So here's the typical way of parallelizing matrix 1288 01:07:29,940 --> 01:07:31,350 multiplication. 1289 01:07:31,350 --> 01:07:36,180 I take the two outer loops and I parallelize them. 1290 01:07:36,180 --> 01:07:39,630 I can't easily parallelize the inner loop, because if I do, 1291 01:07:39,630 --> 01:07:42,090 I get a race condition, because I'll 1292 01:07:42,090 --> 01:07:45,470 have two iterations that are both trying to update C of i, 1293 01:07:45,470 --> 01:07:47,450 j. 1294 01:07:47,450 --> 01:07:50,390 So I can't just parallelize k, so I'm just 1295 01:07:50,390 --> 01:07:53,510 going to parallelize i and j. 1296 01:07:53,510 --> 01:07:54,710 The work for this is what? 1297 01:07:57,370 --> 01:07:58,539 Triply-nested loop. 1298 01:08:02,850 --> 01:08:03,360 n cubed. 1299 01:08:03,360 --> 01:08:05,220 Everybody knows-- matrix multiplication, 1300 01:08:05,220 --> 01:08:07,800 unless you do something clever like Strassen, 1301 01:08:07,800 --> 01:08:11,190 or one of the more recent-- 1302 01:08:11,190 --> 01:08:16,290 Vassilevska Williams' algorithm-- you know 1303 01:08:16,290 --> 01:08:20,910 that the running time for the standard algorithm is n cubed. 1304 01:08:20,910 --> 01:08:23,290 The span for this is what? 1305 01:08:23,290 --> 01:08:23,790 Yeah. 1306 01:08:23,790 --> 01:08:27,015 That inner loop is linear size, and then you've got two log 1307 01:08:27,015 --> 01:08:28,380 n's-- 1308 01:08:28,380 --> 01:08:30,689 log n plus log n plus n-- 1309 01:08:30,689 --> 01:08:32,490 so it's order n. 1310 01:08:32,490 --> 01:08:37,020 So the parallelism is around n squared. 1311 01:08:37,020 --> 01:08:39,609 If I ignore constants, and I said 1312 01:08:39,609 --> 01:08:42,960 I was working on matrices of, say, 1,000 by 1,000 or so, 1313 01:08:42,960 --> 01:08:47,069 the parallelism is something like n squared, 1314 01:08:47,069 --> 01:08:49,260 which is about-- 1315 01:08:49,260 --> 01:08:52,970 1,000 squared is a million. 1316 01:08:52,970 --> 01:08:54,350 Wow. 1317 01:08:54,350 --> 01:08:56,120 That's a lot of parallelism. 1318 01:08:56,120 --> 01:08:59,870 How many processors are you running on?
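(For reference, a hedged sketch of the loop parallelization just described-- square n-by-n row-major matrices, with the names assumed:)

```c
#include <cilk/cilk.h>

// Work: Theta(n^3).  Span: Theta(n + log n) = Theta(n).
// Parallelism: Theta(n^2).
void mm_loops(double *restrict C, const double *restrict A,
              const double *restrict B, int n) {
    cilk_for (int i = 0; i < n; ++i)        // parallelize the two outer loops
        cilk_for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)     // keep k serial: parallelizing it
                C[i*n + j] += A[i*n + k]    // would race on the update of C[i][j]
                            * B[k*n + j];
}
```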
1319 01:08:59,870 --> 01:09:03,500 Is it bigger than 10 times the number 1320 01:09:03,500 --> 01:09:06,529 of processors? By a little bit. 1321 01:09:06,529 --> 01:09:08,300 Now, there's another strategy that one 1322 01:09:08,300 --> 01:09:10,609 can use, which is divide and conquer, 1323 01:09:10,609 --> 01:09:12,890 and this is the strategy that's used in Strassen. 1324 01:09:12,890 --> 01:09:14,210 We're not going to do Strassen's algorithm. 1325 01:09:14,210 --> 01:09:16,640 We're just going to use the eight-multiply version of this. 1326 01:09:16,640 --> 01:09:18,660 For people who know Strassen, more power to you. 1327 01:09:18,660 --> 01:09:20,420 It's a great algorithm. 1328 01:09:20,420 --> 01:09:23,029 Really surprising, really amazing. 1329 01:09:23,029 --> 01:09:26,109 And it's actually worthwhile doing in practice, by the way, 1330 01:09:26,109 --> 01:09:28,189 for sufficiently large matrices. 1331 01:09:28,189 --> 01:09:33,740 So the idea here is, I can multiply two n by n matrices 1332 01:09:33,740 --> 01:09:36,770 by doing eight multiplications of n over 2 1333 01:09:36,770 --> 01:09:41,795 by n over 2 matrices, and then add two n by n matrices. 1334 01:09:47,270 --> 01:09:49,855 So when we start talking matrices-- 1335 01:09:49,855 --> 01:09:52,220 this is a little bit of a diversion from the algorithms, 1336 01:09:52,220 --> 01:09:55,520 but it's so important, because the representation of matrices 1337 01:09:55,520 --> 01:09:58,220 is one of the things that gets people into trouble when 1338 01:09:58,220 --> 01:10:04,160 they're doing any kind of two-dimensional coding stuff. 1339 01:10:04,160 --> 01:10:06,580 And so I want to talk a little bit about indexing, 1340 01:10:06,580 --> 01:10:08,900 and we're going to talk about this more later when 1341 01:10:08,900 --> 01:10:12,360 we do cache behavior and such. 1342 01:10:12,360 --> 01:10:15,290 So how do you represent sub-matrices? 1343 01:10:15,290 --> 01:10:17,390 The standard way of representing them is either 1344 01:10:17,390 --> 01:10:19,760 in row-major or column-major order, 1345 01:10:19,760 --> 01:10:21,530 depending upon the language you use. 1346 01:10:21,530 --> 01:10:24,260 Fortran uses column-major ordering, 1347 01:10:24,260 --> 01:10:25,760 so there are a lot of subroutines 1348 01:10:25,760 --> 01:10:26,720 that are column-major. 1349 01:10:26,720 --> 01:10:33,950 But for the most part, C, which we're using, is row-major. 1350 01:10:33,950 --> 01:10:36,170 And so the question is, if I take 1351 01:10:36,170 --> 01:10:38,930 a sub-matrix of a large matrix, how 1352 01:10:38,930 --> 01:10:44,540 do I calculate where the i, j element of that matrix is? 1353 01:10:44,540 --> 01:10:48,650 Here I have the i, j element here. 1354 01:10:48,650 --> 01:10:51,530 I've got a matrix M, which is embedded. 1355 01:10:51,530 --> 01:10:53,120 And row-major, remember, means 1356 01:10:53,120 --> 01:10:56,450 I just take row after row, and I just put them in linear order 1357 01:10:56,450 --> 01:10:58,980 through the memory. 1358 01:10:58,980 --> 01:11:01,260 So every two-dimensional matrix, you can index 1359 01:11:01,260 --> 01:11:04,680 as a one-dimensional matrix-- 1360 01:11:04,680 --> 01:11:06,520 which is exactly what the code is doing-- 1361 01:11:06,520 --> 01:11:08,670 because all you need to know is the beginning of the matrix. 1362 01:11:08,670 --> 01:11:13,240 But if you have a sub-matrix, it's a little more complicated. 1363 01:11:13,240 --> 01:11:14,960 So here's the idea.
1364 01:11:14,960 --> 01:11:17,860 Suppose that you have a sub-matrix M-- 1365 01:11:17,860 --> 01:11:22,390 starting in location M of this outer matrix. 1366 01:11:22,390 --> 01:11:26,230 Here we have the outer matrix, which has row length n sub M. 1367 01:11:26,230 --> 01:11:29,070 This is the big matrix-- 1368 01:11:29,070 --> 01:11:30,660 actually, I should have called that m. 1369 01:11:33,250 --> 01:11:35,755 I should not have called this n instead of m. 1370 01:11:35,755 --> 01:11:37,630 I should have called it m sub something else, 1371 01:11:37,630 --> 01:11:40,750 because this is my M that I'm interested in, which 1372 01:11:40,750 --> 01:11:41,890 is this location here. 1373 01:11:44,740 --> 01:11:50,560 And what I'm interested in doing is finding out-- 1374 01:11:50,560 --> 01:11:53,170 I named these variables stupidly-- 1375 01:11:53,170 --> 01:11:55,750 finding out, where is the i, j-th element 1376 01:11:55,750 --> 01:11:57,220 of this sub-matrix M? 1377 01:11:57,220 --> 01:12:01,450 If I tell you the beginning, what do I add to get to i, j? 1378 01:12:01,450 --> 01:12:05,110 And the answer is that I've got to add the number of rows 1379 01:12:05,110 --> 01:12:06,010 that I come down here. 1380 01:12:06,010 --> 01:12:09,040 Well, that's i times the width of the full matrix 1381 01:12:09,040 --> 01:12:11,770 that you're taking it out of, not the width 1382 01:12:11,770 --> 01:12:15,670 of your local sub-matrix. 1383 01:12:15,670 --> 01:12:19,720 And then you have to add in-- 1384 01:12:19,720 --> 01:12:23,595 then you add in j from that point. 1385 01:12:23,595 --> 01:12:24,430 There we go. 1386 01:12:24,430 --> 01:12:25,090 OK. 1387 01:12:25,090 --> 01:12:30,100 So I have to add in i times the row length of the long matrix, 1388 01:12:30,100 --> 01:12:33,970 plus j. 1389 01:12:33,970 --> 01:12:35,938 Does that make sense? 1390 01:12:35,938 --> 01:12:37,230 Because it's embedded in there. 1391 01:12:37,230 --> 01:12:40,080 You have to skip over full rows of the outer matrix. 1392 01:12:40,080 --> 01:12:43,410 So you can't generally just pass a sub-matrix 1393 01:12:43,410 --> 01:12:45,600 and expect to do indexing on that when it's 1394 01:12:45,600 --> 01:12:47,250 embedded in a large matrix. 1395 01:12:47,250 --> 01:12:49,530 If you make a copy, sure, then you 1396 01:12:49,530 --> 01:12:52,170 can index it according to whatever the new copy is. 1397 01:12:52,170 --> 01:12:54,720 But if you want to operate in place on matrices, which 1398 01:12:54,720 --> 01:12:58,770 is often the case, then you have to understand that for every row, 1399 01:12:58,770 --> 01:13:00,930 you have to jump a row of the outer matrix, not 1400 01:13:00,930 --> 01:13:03,690 a row of whatever your sub-matrix is, when you're 1401 01:13:03,690 --> 01:13:06,240 doing the divide and conquer. 1402 01:13:06,240 --> 01:13:08,950 So when we look at doing divide and conquer-- 1403 01:13:08,950 --> 01:13:13,200 I have a matrix here which I want to now 1404 01:13:13,200 --> 01:13:18,000 divide into four sub-matrices of size n over 2 by n over 2. 1405 01:13:18,000 --> 01:13:20,820 And the question is, where are the starting corners 1406 01:13:20,820 --> 01:13:23,340 of each of those matrices? 1407 01:13:23,340 --> 01:13:29,790 So M 0, 0, that starts at the same place as M-- that upper 1408 01:13:29,790 --> 01:13:30,660 left one. 1409 01:13:30,660 --> 01:13:32,634 Where does M 0, 1 start? 1410 01:13:40,230 --> 01:13:41,320 Where's M 0, 1 start?
1411 01:13:45,685 --> 01:13:46,780 AUDIENCE: [INAUDIBLE] 1412 01:13:46,780 --> 01:13:47,780 CHARLES LEISERSON: Yeah. 1413 01:13:47,780 --> 01:13:50,180 M plus n over 2. 1414 01:13:50,180 --> 01:13:53,101 Where does M 1, 0 start? 1415 01:13:53,101 --> 01:13:54,592 This is the tricky one. 1416 01:14:01,560 --> 01:14:03,690 Here's the answer. 1417 01:14:03,690 --> 01:14:07,650 M plus the row length of the long matrix times n over 2, 1418 01:14:07,650 --> 01:14:10,140 because I'm going down n over 2 rows, 1419 01:14:10,140 --> 01:14:11,940 and I've got to go down a full row 1420 01:14:11,940 --> 01:14:14,490 of the outer matrix each time. 1421 01:14:14,490 --> 01:14:18,250 And then M 1, 1 is the sum of those two offsets. 1422 01:14:18,250 --> 01:14:24,120 So in general, for row and column each being 0 or 1, 1423 01:14:24,120 --> 01:14:27,350 in some sense, this is a general formula 1424 01:14:27,350 --> 01:14:34,710 that matches up with that, where I plug in 0 or 1 for each one. 1425 01:14:34,710 --> 01:14:35,960 And now here's my code. 1426 01:14:35,960 --> 01:14:37,960 And I just want to point out a couple of things, 1427 01:14:37,960 --> 01:14:39,510 and then we'll quit and I'll let you 1428 01:14:39,510 --> 01:14:43,980 take a look at the rest of this on your own. 1429 01:14:46,540 --> 01:14:49,410 Here's my divide and conquer matrix multiply. 1430 01:14:49,410 --> 01:14:50,940 I use restrict. 1431 01:14:50,940 --> 01:14:52,740 Everybody familiar with restrict? 1432 01:14:52,740 --> 01:14:55,320 It tells the compiler these things 1433 01:14:55,320 --> 01:14:59,040 it can assume are not aliased, so that when you change one, 1434 01:14:59,040 --> 01:15:00,210 you're not changing another. 1435 01:15:00,210 --> 01:15:03,090 That lets the compiler produce better code. 1436 01:15:03,090 --> 01:15:06,690 And then the row sizes are going to be n sub c, n sub a, 1437 01:15:06,690 --> 01:15:09,280 and n sub b-- 1438 01:15:09,280 --> 01:15:12,550 those are the row sizes of the outer matrices 1439 01:15:12,550 --> 01:15:14,680 that we're taking the sub-matrices out of. 1440 01:15:14,680 --> 01:15:19,210 The sub-matrices being multiplied are going to have size n by n, because 1441 01:15:19,210 --> 01:15:21,220 when I have my recursion, I want to talk 1442 01:15:21,220 --> 01:15:23,590 about sub-matrices that are embedded in this larger 1443 01:15:23,590 --> 01:15:25,810 outside matrix. 1444 01:15:25,810 --> 01:15:30,190 Here is a great piece of bit tricks. 1445 01:15:30,190 --> 01:15:32,260 This says, n is a power of 2. 1446 01:15:35,630 --> 01:15:39,320 So go back and remind yourself of what the bit tricks are, 1447 01:15:39,320 --> 01:15:42,440 but that's a clever bit trick to say that n is a power of 2. 1448 01:15:42,440 --> 01:15:44,990 Very quick. 1449 01:15:44,990 --> 01:15:47,960 And so take a look at that. 1450 01:15:47,960 --> 01:15:50,480 And then we're going to coarsen leaves with a base case. 1451 01:15:50,480 --> 01:15:53,570 The base case just goes through and solves the problem 1452 01:15:53,570 --> 01:15:59,240 for small n, just with a typical triply-nested loop. 1453 01:15:59,240 --> 01:16:02,120 And what we're going to do is allocate a temporary n 1454 01:16:02,120 --> 01:16:07,750 by n array, and then we're going to define the temporary array 1455 01:16:07,750 --> 01:16:12,360 to have underlying row size n. 1456 01:16:12,360 --> 01:16:17,100 And then here is this fabulous macro that makes all the index 1457 01:16:17,100 --> 01:16:18,240 calculations easy.
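A reconstruction in the spirit of that macro-- not the slide verbatim; the name X and the row-size variables nC, nA, nB are assumptions:

```c
// n ## M token-pastes the matrix name onto n, producing nC, nA, or nB --
// the row size of the outer matrix that the sub-matrix is embedded in.
#define X(M, r, c) (M + (r) * (n ## M) + (c))
// e.g., X(C, i, j) expands to (C + (i) * nC + (j))
```

(And the power-of-2 check mentioned a moment ago is presumably the classic bit trick (n & (-n)) == n, which holds exactly when a positive n is a power of 2.)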
1458 01:16:18,240 --> 01:16:22,860 It uses the sharp-sharp operator, 1459 01:16:22,860 --> 01:16:27,090 which pastes tokens together, so that I can paste n onto the matrix name to get n sub c. 1460 01:16:27,090 --> 01:16:30,220 Whatever matrix I pass, 1461 01:16:30,220 --> 01:16:34,170 it pastes the tokens together to get the right row size. 1462 01:16:34,170 --> 01:16:37,050 So it allows me to do the indexing 1463 01:16:37,050 --> 01:16:41,040 and have the right thing, so that for each of these address 1464 01:16:41,040 --> 01:16:46,590 calculations, I'm able to do them by just saying X of 1465 01:16:46,590 --> 01:16:48,580 the matrix and the indices, and the macro supplies the formula. 1466 01:16:48,580 --> 01:16:51,450 Otherwise, you'd be driven nuts by the formulas. 1467 01:16:51,450 --> 01:16:54,240 So take a look at that macro, because that may help you 1468 01:16:54,240 --> 01:16:55,980 in some of your other things. 1469 01:16:55,980 --> 01:17:00,360 And then I sync, and then add it up. 1470 01:17:00,360 --> 01:17:02,790 And the addition is just going to be 1471 01:17:02,790 --> 01:17:07,650 a doubly-nested parallel addition, and then I free it. 1472 01:17:07,650 --> 01:17:11,520 So what I would like you to do is go home and take a look 1473 01:17:11,520 --> 01:17:13,270 at the analysis of this. 1474 01:17:13,270 --> 01:17:17,500 And it turns out this has way more parallelism than you need, 1475 01:17:17,500 --> 01:17:20,220 and if you reduce the amount of parallelism, 1476 01:17:20,220 --> 01:17:21,630 you get much better performance. 1477 01:17:21,630 --> 01:17:23,130 And there are several other algorithms 1478 01:17:23,130 --> 01:17:25,650 I put in there as well. 1479 01:17:25,650 --> 01:17:29,340 So I'll try to get this posted tonight. 1480 01:17:29,340 --> 01:17:31,430 Thanks very much.