The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES LEISERSON: Today, we're going to talk about analyzing task-parallel algorithms, multithreaded algorithms. And this is going to rely on the fact that everybody has taken an algorithms class. So I want to remind you of some of the stuff you learned in your algorithms class. And if you don't remember it, then it's probably good to bone up on it a little bit, because it's going to be essential. And that is the topic of divide-and-conquer recurrences. Everybody remember divide-and-conquer recurrences? There's a general method for solving them that will deal with most of the ones we want, called the master method. It deals with recurrences of the form T of n equals a times T of n over b plus f of n. This is generally interpreted as: I have a problem of size n, I can solve it by solving a subproblems each of size n over b, and it costs me f of n work to do that division and to accumulate the results into my final result. For all these recurrences, the unstated base case is that this is a running time, so T of n is constant if n is small. Does that make sense? Everybody familiar with this? Right? Well, we're going to review it anyway, because I don't like to just assume, and then leave 20% of you or more behind in the woods.
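For reference, here is the recurrence together with its implicit base case. The cutoff constant n_0 is an assumption of this write-up; the lecture just says "n is small":

```latex
T(n) = a \, T\!\left(\frac{n}{b}\right) + f(n),
\qquad
T(n) = \Theta(1) \ \text{for} \ n \le n_0 .
```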
So let's just remind ourselves of what this means. It's easy to understand this in terms of a recursion tree. The idea of a recursion tree is to take the running time and reexpress it using the recurrence. So if I reexpress T of n, I have an f of n. I can put that f of n at the root and have a copies of T of n over b as children. That's exactly the same amount of work, or running time, as I had in the T of n; I've just expressed it with the right-hand side. And then I do it again at every level. So I expand all the leaves. I only expanded one here because I ran out of space. And you keep doing that until you get down to T of 1.

And then the trick of looking at these recurrences is to add across the rows. The first row adds up to f of n. The second row adds up to a times f of n over b. The third one is a squared times f of n over b squared, and so forth. And the height here: since I'm taking n and dividing it by b each time, how many times can I divide by b until I get to something that's constant size? That's just log base b of n.

So far, review. Any questions here? For anybody? OK. So I get the height, and then I look at the leaves: if I've got T of 1 work at every leaf, how many leaves are there? And for this analysis we're going to assume everything works out, n is a perfect power of b and so forth. So if I go down k levels, how many subproblems are there at k levels? a to the k. And how many levels am I going down? h, which is log base b of n. So I end up with a to the log base b of n times what's at each leaf, which is T of 1. And T of 1 is constant. Now, a to the log base b of n is the same as n to the log base b of a. That's just a little bit of exponential algebra, and one way to see it is to take the log base b of both sides; you realize that all that's used there is the commutative law.
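Writing out that little calculation: taking log base b of both quantities gives the same product of factors, just in the opposite order,

```latex
\log_b\!\left(a^{\log_b n}\right) = (\log_b n)(\log_b a)
= (\log_b a)(\log_b n)
= \log_b\!\left(n^{\log_b a}\right),
\qquad\text{hence}\qquad
a^{\log_b n} = n^{\log_b a}.
```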
So that's just a little bit of math. Basically, we're interested in the growth in n, so we prefer not to have log n in the exponent; we'd rather have n raised to a constant power. So that's basically the number of leaves.

And so then the question is, how much work is there if I add up all of these rows all the way down? It turns out there's a trick, and the trick is to compare n to the log base b of a with f of n. There are three cases that arise very commonly, and for the most part, these three cases are all we're going to see.

Case 1 is the case where n to the log base b of a is much bigger than f of n. And by much bigger, I mean bigger by a polynomial amount: there's an epsilon greater than 0 such that the ratio between the two is at least n to the epsilon. In other words, f of n is O of n to the log base b of a minus epsilon, which is the same as n to the log base b of a divided by n to the epsilon. In that case, the row sums are geometrically increasing, and so all the weight, a constant fraction of the weight, is in the leaves. So the answer is that T of n is theta of n to the log base b of a. If n to the log base b of a is bigger than f of n by a polynomial amount, the answer is n to the log base b of a.

Now, case 2 is the situation where n to the log base b of a is approximately equal to f of n. They're very similar in growth. Specifically, we're going to look at the case where f of n is n to the log base b of a times a polylogarithmic factor, log to the k of n for some constant k greater than or equal to 0. That greater than or equal to 0 is very important. You can't do this for negative k. Even though negative k is defined and meaningful, this is not the answer when k is negative.
But if k is greater than or equal to 0, then it turns out that what's happening is that the row sums are growing arithmetically from beginning to end. And so when you solve it, you essentially add an extra log term. So if f of n is n to the log base b of a times log to the k of n, the answer is n to the log base b of a times log to the k plus 1 of n. You kick in one extra log. And the intuition is that the levels are almost all equal, and there are log layers. That's not quite the math, but it's good intuition: they're almost all equal, there are log layers, so you tack on an extra log.

And then finally, case 3 is the case where n to the log base b of a is much less than f of n, and specifically where it is smaller by, once again, a polynomial factor: an n to the epsilon factor for some epsilon greater than 0. It's also the case here that f has to satisfy what's called a regularity condition. And this is a condition that's satisfied by all the functions we're going to look at: polynomials, polynomials times logarithms, and things of that nature. It's not satisfied for weird functions like sines and cosines. More relevantly, it's also not satisfied if you have things like exponentials. But for all the things we're going to look at, it holds. And in that case, the row sums are geometrically decreasing, so all the work is at the root, and the root basically costs f of n. So the solution is theta of f of n.

We're going to hand out a cheat sheet. So if you could conscript some of the TAs to get that distributed as quickly as possible. OK. So here's the cheat sheet; that's basically what's on it.
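In symbols, the three cases on the cheat sheet are the following. The regularity condition in case 3 is stated here in its standard form, a f(n/b) <= c f(n) for some constant c < 1:

```latex
\begin{aligned}
\textbf{Case 1:}\quad & f(n) = O\!\left(n^{\log_b a - \varepsilon}\right) \text{ for some } \varepsilon > 0
  &&\Longrightarrow\; T(n) = \Theta\!\left(n^{\log_b a}\right),\\[2pt]
\textbf{Case 2:}\quad & f(n) = \Theta\!\left(n^{\log_b a} \lg^{k} n\right) \text{ for some } k \ge 0
  &&\Longrightarrow\; T(n) = \Theta\!\left(n^{\log_b a} \lg^{k+1} n\right),\\[2pt]
\textbf{Case 3:}\quad & f(n) = \Omega\!\left(n^{\log_b a + \varepsilon}\right) \text{ for some } \varepsilon > 0,
  \ \text{with}\ a\,f(n/b) \le c\,f(n),\ c < 1
  &&\Longrightarrow\; T(n) = \Theta\!\left(f(n)\right).
\end{aligned}
```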
And we'll do a little in-class self-quiz. We have T of n equals 4 T of n over 2 plus n. And the solution is?

This is a thing that, as a computer scientist, you just memorize, so that in any situation you don't even have to look at the cheat sheet. You just know it. It's one of these basic things that all computer scientists should know. It's kind of like, what's 2 to the 15th? What is it?

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yes, 32,768. And interestingly, that's my office number. I'm in 32-G768. I'm the only one in the Stata Center with a power-of-2 office number, and that was totally unplanned. So if you need to remember my office number: 2 to the 15th.

OK, so what's the solution here?

AUDIENCE: Case 1.

CHARLES LEISERSON: It's case 1. And what's the solution?

AUDIENCE: n squared?

CHARLES LEISERSON: n squared. Very good. n to the log base b of a is n to the log base 2 of 4. Log base 2 of 4 is 2, so that's n squared. That's much bigger than n. So it's case 1, and the answer is theta of n squared. Pretty easy.

How about this one? Yeah.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yeah, it's n squared log n. Once again, the first part is the same: n to the log base b of a is n squared. And n squared is n squared log to the 0 of n. So it's case 2 with k equals 0, and you just tack on an extra log factor. So it's theta of n squared log n.

And then, of course, we've got to do this one. Yeah.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yeah, n cubed, because once again, n to the log base b of a is n squared. That's much less than n cubed. n cubed is bigger, so that dominates, and we have theta of n cubed.

What about this one? Yeah.

AUDIENCE: Theta of n squared.

CHARLES LEISERSON: No, that's not the answer. Which case do you think it is?
AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Case 2?

AUDIENCE: Yeah.

CHARLES LEISERSON: OK. No. Yeah?

AUDIENCE: None of the cases?

CHARLES LEISERSON: It's none of the cases. It's a trick question. Oh, I'm a nasty guy. I'm a nasty guy. This is one where the master method does not apply. This would be case 2, but k has to be greater than or equal to 0, and here k is minus 1. So case 2 doesn't apply. And case 1 doesn't apply either, where we're comparing n squared to n squared over log n, because the ratio there is log n, and log n is smaller than any n to the epsilon. You need to have an n-to-the-epsilon separation.

The actual answer is theta of n squared log log n for that one, by the way, which you can prove by the substitution method. It uses the same ideas; you just do a little bit different math. There's a more general solution to this kind of recurrence, called the Akra-Bazzi method. But for most of what we're going to see, applying the Akra-Bazzi method is more complicated than simply doing the table lookup: which side is bigger, and if it's sufficiently bigger, it's one case or the other, or the common case where they're about the same to within a log factor. So we're going to use the master method, but there are more general ways of solving these kinds of things.
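To recap the quiz in one place, here are the four recurrences and their solutions; the last line is the one the master method can't handle:

```latex
\begin{aligned}
T(n) &= 4T(n/2) + n         &&\Longrightarrow\; \Theta(n^2)          &&\text{(case 1, since } n^{\log_2 4} = n^2 \gg n)\\
T(n) &= 4T(n/2) + n^2       &&\Longrightarrow\; \Theta(n^2 \lg n)    &&\text{(case 2 with } k = 0)\\
T(n) &= 4T(n/2) + n^3       &&\Longrightarrow\; \Theta(n^3)          &&\text{(case 3)}\\
T(n) &= 4T(n/2) + n^2/\lg n &&\Longrightarrow\; \Theta(n^2 \lg\lg n) &&\text{(no case applies; substitution method)}
\end{aligned}
```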
OK. Let's talk about some multithreaded algorithms. The first thing I want to do is talk about loops, because loops are a great thing to analyze and understand, since so many programs have loops. Probably 90% or more of the programs that are parallelized are parallelized by making parallel loops. The spawn-and-sync type of parallelism, the subroutine-type parallelism, is not done that frequently in code. Usually, it's loops.

So what we're going to look at, as an example, is code to do an in-place matrix transpose. If you look at this code, I want to swap the lower side of the matrix with the upper side of the matrix, and here's some code to do it, where I parallelize the outer loop. So we're running the outer index i from 1 up to n; actually, the indexes run from 0 to n minus 1. And then the inner loop goes from 0 up to i minus 1.

Now, I've seen people write transpose code; this is one of these trick questions they give you in interviews, where they say, write the transpose of a matrix with nested loops. And what many people will do is run the inner loop to n rather than running it to i. And what happens if you run the inner loop to n? It's a very expensive identity function. And there's an easier, faster way to compute the identity than with doubly nested loops where you swap everything and then swap it all back.

So it's important to look at the iteration space here. What's its shape? If you look at the i and j values and you map them out on a plane, what's the shape that you get? It's not a square, which it would be if both loops ran from 1 to n, or 0 to n minus 1. What's the shape of this iteration space? Yeah, it's a triangle. We're going to run through all the pairs in this lower area and swap each with the corresponding entry in the upper one. The iteration space runs through just the lower triangle, or, correspondingly, through the upper triangle, if you want to view it from that point of view. But it doesn't go through both triangles, because then you would get an identity. So anyway, that's just a tip for when you're interviewing: double-check that they've got the loop indices to be what they ought to be.
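Here is a minimal sketch of the kind of code on the slide. The element type double and the row-major layout are assumptions of this sketch; the point is that the inner loop stops at i:

```c
#include <stddef.h>
#include <cilk/cilk.h>

// In-place transpose of an n x n row-major matrix A.
// The outer loop is parallel; the inner loop runs j from 0 up to
// i - 1, so the iteration space is the lower triangle. Running j
// all the way to n would swap every pair twice and compute a very
// expensive identity function.
void transpose(double *A, size_t n) {
  cilk_for (size_t i = 1; i < n; ++i) {   // parallel outer loop
    for (size_t j = 0; j < i; ++j) {      // serial inner loop
      double tmp = A[i*n + j];
      A[i*n + j] = A[j*n + i];
      A[j*n + i] = tmp;
    }
  }
}
```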
And here, what we've done is parallelize the outer loop. Which raises the question: how much work is in each iteration of this loop? How much time does it take to execute each iteration? For a given value of i, what does it cost us to execute the loop body? Yeah.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yes, theta of i. Which means that, if you think about this, if you've got a certain number of processors, you don't want to just chunk the loop up so that each processor gets an equal range of i to work on. You need something that's going to load balance. And this is where the Cilk technology is best: when there are these unbalanced things, because it does the right thing, as we'll see.

So let's talk a little bit about how loops are actually implemented by the Open Cilk compiler and runtime system. What happens is, we have this doubly nested loop here, but the only one we're interested in is the outer loop, basically. And the compiler creates this recursive program for the loop. What is this program doing? I'm highlighting, essentially, this part. This is basically the loop body here, which has been lifted into this recursive function. And what it's doing is finding a midpoint and then recursively calling itself on the two sides, until it gets down to, in this case, a one-element range of iterations. And then it executes the body of the loop, which in this case is itself a for loop, but not a parallel for loop. So it's doing divide and conquer. It's just basically tree splitting.

So basically, the loop has this control structure on top of it. And if I take a look at what's going on in the control, it looks something like this. This is using the DAG model that we saw before. What I have highlighted here is the lifted control, and down below, in the purple, I have the lifted body. And what it's doing is basically saying: divide the range into two parts, spawn one recursive call and call the other, and keep dividing like that until I get down to the base condition.
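As a hand-written sketch (the helper name and its exact shape are hypothetical; the compiler's generated code differs in detail), the lowering of the parallel outer loop looks roughly like this:

```c
#include <stddef.h>
#include <cilk/cilk.h>

// Divide-and-conquer lowering of the parallel outer loop of the
// transpose: split the index range at its midpoint, spawn one half,
// call the other, and run the serial loop body at one-iteration leaves.
static void transpose_helper(double *A, size_t n, size_t lo, size_t hi) {
  if (hi - lo > 1) {                             // more than one iteration left
    size_t mid = lo + (hi - lo) / 2;
    cilk_spawn transpose_helper(A, n, lo, mid);  // spawn one side...
    transpose_helper(A, n, mid, hi);             // ...call the other
    cilk_sync;                                   // join before returning
  } else {
    size_t i = lo;                               // base case: a single i
    for (size_t j = 0; j < i; ++j) {             // the serial inner loop
      double tmp = A[i*n + j];
      A[i*n + j] = A[j*n + i];
      A[j*n + i] = tmp;
    }
  }
}

// The cilk_for in transpose() behaves like transpose_helper(A, n, 1, n).
```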
And then the work that I'm doing in each iteration of the loop, as I've sort of illustrated here, is growing from 1 up to n. I'm showing it for 8, but in general, this runs from 1 to n. Is that clear? So that's what's actually going on. The Open Cilk runtime system does not have a loop primitive. It doesn't have loops. It only has, essentially, this ability to spawn and so forth. And so loops, effectively, are translated into this divide and conquer, and that's the way you need to think about loops when you're thinking in parallel. Make sense?

And so one of the questions is: that seems like a lot of code to write for a simple loop. What do we pay for that? How much does it cost us? So let's analyze this a little bit; let's analyze parallel loops. As you know, we analyze things in terms of work and span. So what is the work of this computation?

Well, first, what's the work of the original computation, the doubly nested serial loop? If you just think about it in terms of loops: the outer loop has n iterations, and in iteration i, you're doing theta of i work. Sum of i, for i equals 1 to n. What do you get?

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yes, theta of n squared, for the doubly nested loop. And although you're not doing half the work, you are doing the other half of the n squared work that you might think was there if you wrote the unfortunate identity function.

So the question is, how much work is in this particular computation? Because now I've got this whole tree-spawning business going on in addition to the work that I'm doing in the leaves.
So the leaf work here, along the bottom, is all going to be order n squared work, because that's the same as in the serial loop case. How much does that other stuff up top add? It looks huge. It's bigger than the other stuff, isn't it? How much is there? Basic computer science.

AUDIENCE: Theta of n?

CHARLES LEISERSON: Yeah, it's theta of n. Why is it theta of n in the upper part? Yep.

AUDIENCE: Because it's geometrically decreasing [INAUDIBLE]

CHARLES LEISERSON: Yeah. Going from the leaves to the root, every level is halving, so it's geometric. So the total is proportional to the number of leaves, because there's constant work in each of those internal nodes. So the total amount of work is theta of n squared.

Another way of thinking about this is, we've created a complete binary tree with, let's just say, n leaves. How many internal nodes are there in a complete binary tree with n leaves, that is, nodes that have children? There are exactly n minus 1. That's a property that's true of any full binary tree, that is, any binary tree in which every non-leaf has two children: there are exactly n minus 1 internal nodes. Nice tree properties, nice computer science properties, right? We like computer science. That's why we're here, not because we're going to make a lot of money.

OK. Let's look at the span of this. Hmm. What's the span of this calculation? Because that's how we understand parallelism: by understanding work and span.
I see some familiar hands. OK.

AUDIENCE: Theta n.

CHARLES LEISERSON: Theta of n. Yeah. How did you get that?

AUDIENCE: The largest path would be the [INAUDIBLE] node of size theta n and [INAUDIBLE]

CHARLES LEISERSON: Yeah. The longest path is basically going from the root down to the iteration of size 8, and then back up. And the 8 is really n in the general case. So going down, the span of the loop control is log n. And that's the key takeaway here: the span of loop control is log n. When I do divide and conquer like that, if I had an infinite number of processors, the control could all be done in logarithmic time. But the 8 there is linear; that's order n. In this case, n is 8. So the span is order n plus order log n, which is therefore order n.

So what's the parallelism here?

AUDIENCE: Theta n.

CHARLES LEISERSON: Theta of n. It's the ratio of the two. The ratio of the two is theta of n. Is that good?

AUDIENCE: Theta of n squared?

CHARLES LEISERSON: Well, parallelism of n squared, do you mean? Or, is this good parallelism? Yeah, that's pretty good. That's pretty good, because typically you're going to be working on systems that have maybe, if you're working on a big, big system, 64 cores or 128 cores or something. That's pretty big. Whereas this is saying, if you're using that many cores, you'd better have a problem that's really big to run on. And so typically, n is way bigger than the number of processors for a problem like this. Not always the case, but here it is. Any questions about this?
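Summarizing the outer-loop-parallel transpose in work/span terms:

```latex
\begin{aligned}
\text{Work:}\quad & T_1(n) = \Theta(n^2) && \text{(}\Theta(n^2)\text{ leaf work plus }\Theta(n)\text{ control)}\\
\text{Span:}\quad & T_\infty(n) = \Theta(n) + \Theta(\lg n) = \Theta(n) && \text{(longest leaf plus loop control)}\\
\text{Parallelism:}\quad & T_1(n)/T_\infty(n) = \Theta(n).
\end{aligned}
```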
So we can use our work and span analysis to understand that, hey, the work overhead is a constant factor; we're going to talk more about the overhead of work later. But basically, from an asymptotic point of view, our work is n squared, just like the original code, and we have a fair amount of parallelism: order n.

How about if we make the inner loop parallel as well? So rather than just parallelizing the outer loop, we're also going to parallelize the inner loop. How much work do we have in this situation? Hint: all work questions are trivial, or at least no harder than they were when you were doing ordinary serial algorithms. Maybe we can come up with a trick question on the exam where the work changes, but almost always, the work doesn't change. So what's the work? Yeah, n squared. Parallelizing stuff doesn't change the work. What it hopefully does is reduce the span of the calculation. And by reducing the span, we get big parallelism. That's the idea.

Now, sometimes it's the case that when you parallelize stuff you add work, and that's unfortunate, because it means that if you end up taking your parallel program and running it on one processing core, you're not going to get any speedup. It's going to be a slowdown compared to the original algorithm. So we're generally interested in work-efficient parallel algorithms, which we'll talk more about later. Generally, we're after work efficiency.

OK. What's the span of this?

AUDIENCE: Is it theta n still?

CHARLES LEISERSON: It is not still theta n. What was your thinking in saying it was theta of n?

AUDIENCE: So the path would be similar to 8, and then--

CHARLES LEISERSON: But now notice that the 8 is a for loop itself.

AUDIENCE: Yeah. I'm saying maybe you could extend the path another n so it would be 2n.

CHARLES LEISERSON: OK. Not quite, but this man is commendable.

[APPLAUSE]

Absolutely.
This is commendable, because this is why I try to have a bit of a Socratic method in here, where I'm asking questions as opposed to just sitting here lecturing and having it go over your heads. You have the opportunity to ask questions, and to have your particular misunderstandings or whatever corrected. That's how you learn. And so I'm really in favor of anybody who wants to come here and learn. That's my desire, and that's my job: to teach people who want to learn. So I hope that this is a safe space for you folks to be willing to put yourself out there and not necessarily get stuff right.

I can't tell you how many times I've screwed up, and it's only by airing it and so forth, and having somebody say, no, I don't think it's like that, Charles, it's like this, and me saying, oh yeah, you're right, God, that was stupid. But the fact is that I no longer beat my head when I'm being stupid. Our natural state is stupidity. We have to work hard not to be stupid. Right? It's hard work not to be stupid.

Yeah, question.

AUDIENCE: It's not really a question. My philosophy on talking in mid-lecture is that I don't want to waste other people's time.

CHARLES LEISERSON: Yeah, but usually, and let me tell you from experience, I've been at MIT almost 38 years: my experience is that when one person has a question, there are all these other people in the room who have the same question. And by you articulating it, you're actually helping them out. If things are going too slow, if we're wasting people's time, that's my job as the lecturer to make sure that doesn't happen. And I'll say, let's take this offline; we can talk after class. But I appreciate your point of view, because that's considerate.
But actually, it's more considerate if you're willing to air what you think and have other people say, you know, I had that same question. Certainly there are going to be people in the class who, say, roll their eyes or whatever. But look, I don't teach to the top 10%. I try to teach to the top 90%. And believe me--

[LAUGHTER]

Believe me that I get farther with students, and have more people enjoying the course and learning this stuff, which is not necessarily easy stuff. After the fact, you're going to discover this is easy. But while you're learning it, it's not easy. This is what Steven Pinker calls the curse of knowledge. Once you know something, you have a really hard time putting yourself in the position of what it was like to not know it. And so it's very easy to learn something, and then when somebody doesn't understand it, it's like, oh, whatever. But the fact of the matter is, it's that empathy that makes you a good communicator. And all of you, I know, are at some point going to have to communicate with other people who are not as technically sophisticated as you folks are. And so it's really good to appreciate how important it is to recognize that this stuff isn't necessarily easy when you're learning it. Later, you can learn it, and then it'll be easy. But that doesn't mean it's easy for somebody else.

So those of you who think that some of these answers are like, come on, move along, move along: please be patient with the other people in the class. If they learn better, they're going to be better teammates on projects and so forth, and we'll all learn. Nobody's in competition with anybody here, for grades or anything. Nobody's in competition. We set it up so you're going against benchmarks and so forth; you're not in competition. So we want to make this something where everybody helps everybody learn.
I probably spent too much time on that, but in some sense, not nearly enough. OK. So the span is not order n. We got that much. Who else would like to hazard a guess? OK.

AUDIENCE: Is it log n?

CHARLES LEISERSON: It is log n. What's your reasoning?

AUDIENCE: It's the normal log n from the time before, but since we're expanding the n--

CHARLES LEISERSON: Yup.

AUDIENCE: --again into another tree, it's log n plus log n.

CHARLES LEISERSON: Log n plus log n. Good.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: And then what about the leaves? You've got to add in the span of the leaves; that was just the span of the control.

AUDIENCE: The leaves are just 1.

CHARLES LEISERSON: The leaves are just 1. Boom. So the span of the outer loop control is order log n, the span of the inner loop control is order log n, and the span of the body is order 1, because at the body we're now just doing one iteration of serial execution. It's not doing i iterations; it's only doing one iteration. And so I add all that together, and I get log n. Does that make sense?

So the parallelism is? On this one, every hand in the room should be up, waving: call on me, call on me. Sure.

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Yeah, n squared over log n. That's the ratio of the two. Good. Any questions about that? OK. So the parallelism is n squared over log n, and this is more parallel than the previous one.
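Again in work/span terms, for the doubly parallel version:

```latex
\begin{aligned}
\text{Work:}\quad & T_1(n) = \Theta(n^2)\\
\text{Span:}\quad & T_\infty(n) =
  \underbrace{\Theta(\lg n)}_{\text{outer control}} +
  \underbrace{\Theta(\lg n)}_{\text{inner control}} +
  \underbrace{\Theta(1)}_{\text{body}} = \Theta(\lg n)\\
\text{Parallelism:}\quad & T_1(n)/T_\infty(n) = \Theta(n^2/\lg n).
\end{aligned}
```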
But it turns out, you've got to remember: even though it's more parallel, is it a better algorithm in practice? Not necessarily, because parallelism is like a thresholding thing. What you need is enough parallelism beyond the number of processors that you have: the parallel slackness, remember? If the amount of parallelism is much greater than the number of processors, you're good. So for something like this, if with order n parallelism you're already way beyond the number of processors, you don't need to parallelize the inner loop. You'll be fine.

And in fact, let's talk a little bit about overheads, and I'm going to do that with an example using vector addition. So here's a really simple piece of code: add two vectors, two arrays, together. All it does is add b into a; you can see, at every position, it adds b into a. And I'm going to parallelize this by putting a cilk_for in front, rather than an ordinary for. And what that does is give us this divide-and-conquer tree once again, with n leaves. The work here is order n, because we've got n iterations of constant time each. And the span is just the control: log n. And so the parallelism is n over log n. So this is basically easier than what we just did.
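A minimal sketch of the example, again assuming double arrays (the slide's types may differ):

```c
#include <stddef.h>
#include <cilk/cilk.h>

// Parallel vector addition: add b into a, elementwise.
void vadd(double *a, const double *b, size_t n) {
  cilk_for (size_t i = 0; i < n; ++i)
    a[i] += b[i];   // two loads, one add, one store per iteration
}
```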
Now, if I look at this, though, the work here includes some substantial overhead, because there are all these function calls. It may be order n, and that's good enough if you're a certain kind of theoretician. For this kind of theoretician, that's not good enough. I want to understand where these overheads are going. In this case, as I do the divide and conquer, if I go all the way down to ranges of size 1, what am I doing in a leaf? How much work is in one of these leaves? It's an add: two memory fetches, a memory store, and an add. The memory operations are going to be the most expensive things there. That's all that's going on. And yet, right before then, I've done a subroutine call, a parallel subroutine call, mind you, and that's going to have substantial overhead. And so the question is, do you do a subroutine call to add two numbers together? That's pretty expensive.

So let's take a look at how we can optimize away some of this overhead. And this gets more into the realm of engineering. The Open Cilk system has a pragma. A pragma is a compiler directive, a suggestion to the compiler, where it can suggest, in this case, that there be a grain size of G, for whatever you set G to. And the grain size shows up in the lowered code: instead of recursing down until the range has a single element (high greater than low plus 1), it recurses until the range has up to G elements (high greater than low plus G), so that at the leaves I have up to G iterations per chunk when I'm doing my divide and conquer. So therefore, I can amortize my subroutine overhead across G iterations rather than across one iteration. That's coarsening.

Now, if the grain size pragma is not specified, the Cilk runtime system makes its best guess to minimize the overhead. What it actually does at runtime is figure out for the loop how many cores it's running on, and make a good guess as to how much to run serially at the leaves and how much to do in parallel. Does that make sense? So it's basically trying to overcome that overhead automatically.
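A sketch of the coarsened lowering; the helper name is made up, G = 256 is an arbitrary illustrative value, and at source level the directive is written as a grainsize pragma placed just before the cilk_for (its exact spelling has varied across Cilk versions):

```c
#include <stddef.h>
#include <cilk/cilk.h>

#define G 256  // hypothetical grain size; the best value is machine-dependent

// Coarsened divide and conquer: recurse only while the range holds
// more than G iterations; otherwise run a serial loop at the leaf.
static void vadd_helper(double *a, const double *b, size_t lo, size_t hi) {
  if (hi - lo > G) {
    size_t mid = lo + (hi - lo) / 2;
    cilk_spawn vadd_helper(a, b, lo, mid);  // spawn one half...
    vadd_helper(a, b, mid, hi);             // ...call the other
    cilk_sync;
  } else {
    for (size_t i = lo; i < hi; ++i)  // up to G serial iterations per
      a[i] += b[i];                   // leaf, amortizing the spawn cost
  }
}
```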
819 00:43:50,400 --> 00:43:54,470 So if I look at the work in this context, 820 00:43:54,470 --> 00:44:02,280 I can view it as I've got T1 work, which 821 00:44:02,280 --> 00:44:07,920 is i here times the number of iterations, n-- 822 00:44:07,920 --> 00:44:12,120 because I've got one, two, three, up to n iterations. 823 00:44:12,120 --> 00:44:15,062 And then I have-- 824 00:44:15,062 --> 00:44:16,770 and those are just the normal iterations. 825 00:44:16,770 --> 00:44:21,210 And then, since there are n over G 826 00:44:21,210 --> 00:44:24,870 leaves here of size G, I have n 827 00:44:24,870 --> 00:44:29,130 over G minus 1 internal nodes, which 828 00:44:29,130 --> 00:44:30,390 are my subroutine overhead. 829 00:44:30,390 --> 00:44:35,610 That's S. So the total work is n times i plus n over G 830 00:44:35,610 --> 00:44:38,460 minus 1 times S. 831 00:44:38,460 --> 00:44:42,390 Now, in the original code, effectively, the work is what? 832 00:44:45,320 --> 00:44:47,360 If I had the code without the Cilk 833 00:44:47,360 --> 00:44:56,640 for loop, how much work is there before I put in all 834 00:44:56,640 --> 00:44:59,490 this parallel control stuff? 835 00:44:59,490 --> 00:45:00,460 What would the work be? 836 00:45:00,460 --> 00:45:00,960 Yeah. 837 00:45:00,960 --> 00:45:02,100 AUDIENCE: n i? 838 00:45:02,100 --> 00:45:03,450 CHARLES LEISERSON: n times i. 839 00:45:03,450 --> 00:45:05,378 We're just doing n iterations. 840 00:45:05,378 --> 00:45:07,170 Yeah, there's a little bit of loop control, 841 00:45:07,170 --> 00:45:10,600 but that loop control is really cheap. 842 00:45:10,600 --> 00:45:15,450 And on a modern out-of-order processor, 843 00:45:15,450 --> 00:45:17,640 the cost of incrementing a variable 844 00:45:17,640 --> 00:45:21,780 and testing against its bound is dwarfed by the stuff going on 845 00:45:21,780 --> 00:45:24,060 inside the loop. 846 00:45:24,060 --> 00:45:25,290 So it's ni. 847 00:45:25,290 --> 00:45:27,480 So this part here-- 848 00:45:27,480 --> 00:45:29,010 oops, what did I do? 849 00:45:29,010 --> 00:45:30,110 Oops, I went back. 850 00:45:30,110 --> 00:45:32,100 I see. 851 00:45:32,100 --> 00:45:35,570 So this part here-- 852 00:45:35,570 --> 00:45:37,700 this part here, there we go-- 853 00:45:37,700 --> 00:45:40,160 is all overhead. 854 00:45:40,160 --> 00:45:41,870 And this part here 855 00:45:41,870 --> 00:45:44,118 is what it cost me originally. 856 00:45:47,170 --> 00:45:53,040 So let's take a look at the span of this. 857 00:45:53,040 --> 00:45:59,940 So the span is going to be, well, 858 00:45:59,940 --> 00:46:05,080 if I add up what's at the leaves, that's just G times i. 859 00:46:05,080 --> 00:46:08,100 And now I've got the overhead here 860 00:46:08,100 --> 00:46:11,890 for any of these paths, which is basically proportional-- 861 00:46:11,890 --> 00:46:14,730 I'm ignoring constants here to make it easier-- 862 00:46:14,730 --> 00:46:19,650 to log of n over G times S, because it's going through log levels. 863 00:46:19,650 --> 00:46:22,890 And I've got n over G chunks, because 864 00:46:22,890 --> 00:46:26,340 I've got G iterations at each leaf, 865 00:46:26,340 --> 00:46:30,240 so therefore the number of leaves is n over G. 866 00:46:30,240 --> 00:46:34,990 And I've got n minus 1 of those-- 867 00:46:34,990 --> 00:46:38,220 sorry, log of n over G of those-- actually, 2 log 868 00:46:38,220 --> 00:46:43,020 of n over G of those times S. Actually, maybe I don't.
869 00:46:43,020 --> 00:46:44,520 Maybe I just have log of n over G, because I'm 870 00:46:44,520 --> 00:46:46,350 going to count it going down and going up. 871 00:46:46,350 --> 00:46:49,260 So actually, a constant of 1 is fine. 872 00:46:49,260 --> 00:46:51,760 Who's confused? 873 00:46:51,760 --> 00:46:52,410 OK. 874 00:46:52,410 --> 00:46:54,610 Let's ask some questions. 875 00:46:54,610 --> 00:46:57,610 You have a question? 876 00:46:57,610 --> 00:46:59,020 I know you're confused. 877 00:46:59,020 --> 00:47:00,760 Believe me, I spend-- 878 00:47:00,760 --> 00:47:03,010 one of my great successes in life 879 00:47:03,010 --> 00:47:08,440 was discovering that, oh, confusion is how I usually am. 880 00:47:08,440 --> 00:47:13,930 And then it's getting unconfused-- 881 00:47:13,930 --> 00:47:16,150 that's the thing, because I see so many people going 882 00:47:16,150 --> 00:47:18,490 through life thinking they're not confused, 883 00:47:18,490 --> 00:47:22,240 but you know what, they're confused. 884 00:47:22,240 --> 00:47:24,100 And that's a worse state of affairs 885 00:47:24,100 --> 00:47:26,448 to be in than knowing that you're confused. 886 00:47:26,448 --> 00:47:27,490 Let's ask some questions. 887 00:47:27,490 --> 00:47:29,920 People who are confused, let's ask some questions, 888 00:47:29,920 --> 00:47:33,650 because I want to make sure that everybody gets this. 889 00:47:33,650 --> 00:47:38,170 And for those of you who think you know it already, sometimes 890 00:47:38,170 --> 00:47:40,360 it helps to know it a little bit even better 891 00:47:40,360 --> 00:47:42,680 when we go through a discussion like this. 892 00:47:42,680 --> 00:47:43,930 So somebody ask me a question. 893 00:47:46,700 --> 00:47:47,555 Yes. 894 00:47:47,555 --> 00:47:50,350 AUDIENCE: Could you explain the second half of that [INAUDIBLE] 895 00:47:50,350 --> 00:47:51,110 CHARLES LEISERSON: Yeah. 896 00:47:51,110 --> 00:47:51,610 OK. 897 00:47:51,610 --> 00:47:53,390 The second half of the work part. 898 00:47:53,390 --> 00:47:54,020 OK. 899 00:47:54,020 --> 00:47:56,900 So the second half of the work part, n over G minus 1. 900 00:47:56,900 --> 00:47:59,630 So the first thing is, if I've got G iterations 901 00:47:59,630 --> 00:48:02,700 at the leaves of a binary tree, how many leaves 902 00:48:02,700 --> 00:48:07,288 do I have if I've got a total of n iterations? 903 00:48:07,288 --> 00:48:08,600 AUDIENCE: Is it n over G? 904 00:48:08,600 --> 00:48:11,470 CHARLES LEISERSON: n over G. That's the first thing. 905 00:48:11,470 --> 00:48:14,390 The second thing is a fact about binary trees-- 906 00:48:14,390 --> 00:48:17,520 about any full binary tree, but in particular complete binary 907 00:48:17,520 --> 00:48:18,150 trees. 908 00:48:18,150 --> 00:48:19,800 How many internal nodes are there 909 00:48:19,800 --> 00:48:22,770 in a complete binary tree? 910 00:48:22,770 --> 00:48:25,410 If n is the number of leaves, it's n minus 1. 911 00:48:25,410 --> 00:48:28,710 Here, the number of leaves is n over G, 912 00:48:28,710 --> 00:48:31,215 so it's n over G minus 1. 913 00:48:31,215 --> 00:48:34,230 Does that clear up something for some people? 914 00:48:34,230 --> 00:48:36,420 OK, good. 915 00:48:36,420 --> 00:48:37,380 So that's where that comes from. 916 00:48:37,380 --> 00:48:39,990 And at each of those, I've got to do those three 917 00:48:39,990 --> 00:48:44,890 colorful operations, which is what I'm calling S. 918 00:48:44,890 --> 00:48:46,560 So you got the work down? 919 00:48:46,560 --> 00:48:47,160 OK.
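Collecting the formulas just derived in one place, using the lecture's variables (constants elided, as in the lecture):

```latex
% i = time per iteration, G = grain size, S = spawn/return overhead
T_1 = n\,i + \left(\frac{n}{G} - 1\right) S
\qquad\text{and}\qquad
T_\infty = G\,i + \Theta\!\left(\log\frac{n}{G}\right) S
```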
920 00:48:47,160 --> 00:48:49,140 Who has a question about span? 921 00:48:49,140 --> 00:48:50,618 Span's my favorite. 922 00:48:53,370 --> 00:48:56,010 Work is good, right? 923 00:48:56,010 --> 00:49:00,060 Work is more important, actually, in most contexts. 924 00:49:00,060 --> 00:49:03,170 But span is so cool. 925 00:49:03,170 --> 00:49:04,868 Yeah. 926 00:49:04,868 --> 00:49:08,756 AUDIENCE: What did you mean when you said [INAUDIBLE] 927 00:49:10,242 --> 00:49:12,200 CHARLES LEISERSON: So what I was saying-- well, 928 00:49:12,200 --> 00:49:13,070 I think what I was saying-- 929 00:49:13,070 --> 00:49:15,480 I think I was mis-saying something, probably, there. 930 00:49:15,480 --> 00:49:17,630 But the point is that the span is basically 931 00:49:17,630 --> 00:49:21,740 starting at the top here, and taking any path down to a leaf 932 00:49:21,740 --> 00:49:24,890 and then going back up. 933 00:49:24,890 --> 00:49:26,900 And so if I look at that, that's going 934 00:49:26,900 --> 00:49:30,050 to be log of the number of leaves. 935 00:49:30,050 --> 00:49:33,860 Well, the number of leaves, as we agreed, was n over G. 936 00:49:33,860 --> 00:49:36,800 And then each of those is, at most, 937 00:49:36,800 --> 00:49:42,200 S to do the subroutine calling and so forth-- the bookkeeping 938 00:49:42,200 --> 00:49:44,480 that's in that node. 939 00:49:44,480 --> 00:49:46,820 That make sense? 940 00:49:46,820 --> 00:49:48,290 Still didn't answer the question? 941 00:49:48,290 --> 00:49:48,790 Or-- 942 00:49:48,790 --> 00:49:52,922 AUDIENCE: Why is that the span? 943 00:49:52,922 --> 00:49:55,140 Why shouldn't it be [INAUDIBLE] 944 00:49:55,140 --> 00:49:57,140 CHARLES LEISERSON: It could be any of the paths. 945 00:49:57,140 --> 00:50:00,950 But take a look at all the paths-- go down, and back up. 946 00:50:00,950 --> 00:50:03,770 There's no path that's going down and around 947 00:50:03,770 --> 00:50:04,980 and up and so forth. 948 00:50:04,980 --> 00:50:06,590 This is a DAG. 949 00:50:06,590 --> 00:50:08,900 So just look at the directions of the arrows. 950 00:50:08,900 --> 00:50:11,480 You've got to follow the directions of the arrows. 951 00:50:11,480 --> 00:50:13,380 You can't go down and up arbitrarily. 952 00:50:13,380 --> 00:50:18,670 You're either going down, or you've started back up. 953 00:50:18,670 --> 00:50:21,610 So it's always going to be, essentially, 954 00:50:21,610 --> 00:50:23,590 down through a set of subroutines and back up 955 00:50:23,590 --> 00:50:25,402 through a set of subroutines. 956 00:50:25,402 --> 00:50:26,870 Does that make sense? 957 00:50:26,870 --> 00:50:31,090 And if you think about the code, the recursive code, what's 958 00:50:31,090 --> 00:50:33,040 happening when you do divide and conquer? 959 00:50:33,040 --> 00:50:34,960 If you were operating with a stack, 960 00:50:34,960 --> 00:50:39,070 how many things would get stacked up and then unstacked? 961 00:50:39,070 --> 00:50:43,930 So the path down and back up would also 962 00:50:43,930 --> 00:50:46,060 be logarithmic at most. 963 00:50:49,336 --> 00:50:51,200 Does that make sense? 964 00:50:51,200 --> 00:50:52,750 So I don't have a dependency-- 965 00:50:52,750 --> 00:50:57,150 if I had one subtree here, for example, dependent on-- 966 00:50:57,150 --> 00:51:00,780 oops, that's not the mode I want to be in-- so one 967 00:51:00,780 --> 00:51:03,880 subtree here dependent on another subtree, 968 00:51:03,880 --> 00:51:07,818 then indeed, the span would grow.
969 00:51:07,818 --> 00:51:10,360 But the whole point is to make 970 00:51:10,360 --> 00:51:12,070 these two things independent, so I 971 00:51:12,070 --> 00:51:14,620 can run them at the same time. 972 00:51:14,620 --> 00:51:17,870 So there's no dependency there. 973 00:51:17,870 --> 00:51:19,970 We good? 974 00:51:19,970 --> 00:51:22,700 OK. 975 00:51:22,700 --> 00:51:25,820 So here I have the work and the span. 976 00:51:25,820 --> 00:51:28,490 I have two things I want out of this. 977 00:51:28,490 --> 00:51:31,740 Number one, I want the work to be small. 978 00:51:31,740 --> 00:51:37,340 I want the work to be close to n times 979 00:51:37,340 --> 00:51:43,310 i, the work of the ordinary serial algorithm. 980 00:51:43,310 --> 00:51:46,130 And I want the span to be small, so it's as parallel 981 00:51:46,130 --> 00:51:47,300 as possible. 982 00:51:47,300 --> 00:51:50,300 Those things are working in opposite directions, 983 00:51:50,300 --> 00:51:53,750 because if you look, the dominant term 984 00:51:53,750 --> 00:52:01,400 involving G in the first equation has G dividing n. 985 00:52:01,400 --> 00:52:08,190 So if I want the work to be small, I want G to be what? 986 00:52:08,190 --> 00:52:08,690 Big. 987 00:52:11,270 --> 00:52:15,050 The dominant term involving G in the span 988 00:52:15,050 --> 00:52:18,800 is the G multiplied by the i. 989 00:52:18,800 --> 00:52:22,410 There is another term there, but that's a lower-order term. 990 00:52:22,410 --> 00:52:29,930 So if I want the span to be small, I want G to be small. 991 00:52:29,930 --> 00:52:33,410 They're going in opposite directions. 992 00:52:33,410 --> 00:52:37,540 So what we're interested in is 993 00:52:37,540 --> 00:52:39,950 finding a happy medium. 994 00:52:39,950 --> 00:52:42,040 We want G to be-- 995 00:52:42,040 --> 00:52:45,760 and in particular, if you look at this, what I want 996 00:52:45,760 --> 00:52:48,910 is G to be at least S over i. 997 00:52:48,910 --> 00:52:49,610 Why? 998 00:52:49,610 --> 00:52:52,975 If I make G be much bigger than S over i-- 999 00:52:55,670 --> 00:52:58,820 so if G is bigger than S over i-- 1000 00:52:58,820 --> 00:53:02,870 then the term multiplied by S ends up 1001 00:53:02,870 --> 00:53:06,160 being much less than the n times i term. 1002 00:53:06,160 --> 00:53:07,205 You see that? 1003 00:53:07,205 --> 00:53:07,830 That's algebra. 1004 00:53:11,770 --> 00:53:15,606 So do you see that if I make G 1005 00:53:15,606 --> 00:53:18,457 much greater than S over i-- 1006 00:53:18,457 --> 00:53:19,540 so get rid of the minus 1. 1007 00:53:19,540 --> 00:53:21,670 That doesn't matter. 1008 00:53:21,670 --> 00:53:28,210 The overhead term is really n times S over G, and S over G 1009 00:53:28,210 --> 00:53:32,630 is basically much smaller than i. 1010 00:53:32,630 --> 00:53:34,790 So I end up with an overhead that is 1011 00:53:34,790 --> 00:53:38,530 much smaller than n i. 1012 00:53:38,530 --> 00:53:41,470 Does that make sense? 1013 00:53:41,470 --> 00:53:42,130 OK. 1014 00:53:42,130 --> 00:53:43,430 How are we doing on time? 1015 00:53:43,430 --> 00:53:44,170 OK. 1016 00:53:44,170 --> 00:53:45,628 I'm going to get through everything 1017 00:53:45,628 --> 00:53:48,130 that I expect to get through, despite my rant. 1018 00:53:52,450 --> 00:53:52,950 OK. 1019 00:53:52,950 --> 00:53:53,825 Does that make sense? 1020 00:53:53,825 --> 00:53:56,560 We want G to be much greater than S over i.
1021 00:53:56,560 --> 00:53:59,350 Then the overhead is going to be small, 1022 00:53:59,350 --> 00:54:03,007 because I'm going to do a whole bunch of iterations that 1023 00:54:03,007 --> 00:54:05,340 are going to make it so that that function call was just 1024 00:54:05,340 --> 00:54:09,000 like, eh, who cares? 1025 00:54:09,000 --> 00:54:09,780 That's the idea. 1026 00:54:14,340 --> 00:54:15,360 So that's the goal. 1027 00:54:15,360 --> 00:54:17,648 So let's take a look at-- 1028 00:54:17,648 --> 00:54:18,690 let's see, what was the-- 1029 00:54:27,580 --> 00:54:28,480 let me just see here. 1030 00:54:28,480 --> 00:54:32,580 Somehow I feel like I have something out of order 1031 00:54:32,580 --> 00:54:45,922 here, because now I have the other implementation. 1032 00:54:48,796 --> 00:54:49,296 Huh. 1033 00:54:55,570 --> 00:54:56,090 OK. 1034 00:54:56,090 --> 00:54:58,170 I think-- maybe that is where I left it. 1035 00:54:58,170 --> 00:54:58,670 OK. 1036 00:54:58,670 --> 00:55:00,410 I think we come back to this. 1037 00:55:00,410 --> 00:55:01,030 Let me see. 1038 00:55:01,030 --> 00:55:02,030 I'm going to lecture on. 1039 00:55:07,110 --> 00:55:13,770 So here's another implementation of the for loop 1040 00:55:13,770 --> 00:55:15,210 to add two vectors. 1041 00:55:15,210 --> 00:55:17,490 And what this is going to use as a subroutine 1042 00:55:17,490 --> 00:55:20,340 is this operator called v add, 1043 00:55:20,340 --> 00:55:24,120 which itself does just a serial vector add. 1044 00:55:24,120 --> 00:55:29,310 And now what I'm going to do is run through the loop here 1045 00:55:29,310 --> 00:55:34,470 and spawn off additions-- 1046 00:55:34,470 --> 00:55:38,010 and the min there is just for a boundary condition. 1047 00:55:38,010 --> 00:55:43,630 I'm going to spawn off things in groups of G. 1048 00:55:43,630 --> 00:55:47,590 So I spawn off a vector add of size G, another vector 1049 00:55:47,590 --> 00:55:52,810 add of size G, another vector add of size G, jumping ahead by G each time. 1050 00:55:52,810 --> 00:55:55,070 So let's take a look at the analysis of that. 1051 00:55:55,070 --> 00:55:58,930 So now what I've got is, I've got G iterations, each of which 1052 00:55:58,930 --> 00:56:00,940 costs me i. 1053 00:56:00,940 --> 00:56:03,520 And this is the DAG structure I've 1054 00:56:03,520 --> 00:56:07,540 got, because the for loop here that 1055 00:56:07,540 --> 00:56:10,360 has the Cilk spawn in it is going along, 1056 00:56:10,360 --> 00:56:16,330 and notice that the Cilk spawn is in a loop. 1057 00:56:16,330 --> 00:56:19,550 And so it's basically going-- it's spawning off G iterations. 1058 00:56:19,550 --> 00:56:23,170 So it's spawning off the vector add, which 1059 00:56:23,170 --> 00:56:25,330 is going to do G iterations-- 1060 00:56:25,330 --> 00:56:28,900 because I'm passing it basically G-- let's 1061 00:56:28,900 --> 00:56:30,610 not worry about the boundary case. 1062 00:56:30,610 --> 00:56:34,220 And then spawn off G, spawn off G, spawn off G, and so forth. 1063 00:56:34,220 --> 00:56:37,476 So what's the work of this? 1064 00:56:37,476 --> 00:56:38,434 Let's see. 1065 00:56:41,310 --> 00:56:44,170 Well, let's make things easy to begin with. 1066 00:56:44,170 --> 00:56:48,595 Let's assume G is 1 and analyze it.
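Before the analysis, here is a hedged sketch of this chunked implementation-- the names and types are assumptions, not the slide's exact code:

```c
#include <cilk/cilk.h>

// Plain serial vector add, used as the leaf routine.
static void v_add(double *a, const double *b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}

void vadd_chunked(double *a, const double *b, int n, int G) {
    for (int j = 0; j < n; j += G)               // jump ahead by G each time
        cilk_spawn v_add(a + j, b + j,
                         n - j < G ? n - j : G); // the min handles the boundary
    cilk_sync;                                   // wait for all spawned chunks
}
```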
1067 00:56:48,595 --> 00:56:50,220 And this is a common thing, by the way-- 1068 00:56:50,220 --> 00:56:52,530 you assume that the grain size is 1 1069 00:56:52,530 --> 00:56:55,170 and analyze it, and then, as a practical matter, 1070 00:56:55,170 --> 00:56:58,390 coarsen it to make it more efficient. 1071 00:56:58,390 --> 00:57:00,492 So if G is 1, what's the work of this? 1072 00:57:14,760 --> 00:57:16,036 Yeah. 1073 00:57:16,036 --> 00:57:18,680 AUDIENCE: [INAUDIBLE] 1074 00:57:18,680 --> 00:57:19,180 CHARLES LEISERSON: Yeah. 1075 00:57:19,180 --> 00:57:22,810 It was order n, because those other two things are constant. 1076 00:57:22,810 --> 00:57:24,790 So exactly right. 1077 00:57:24,790 --> 00:57:27,220 It's order n. 1078 00:57:27,220 --> 00:57:31,180 In fact, this is a technique, by the way, 1079 00:57:31,180 --> 00:57:33,880 that's called strip mining, if you take away 1080 00:57:33,880 --> 00:57:36,970 the parallel thing: you take a loop of length n, 1081 00:57:36,970 --> 00:57:39,010 and you really make nested loops out of it-- 1082 00:57:39,010 --> 00:57:44,200 one that has n over G iterations and one that has G iterations-- 1083 00:57:44,200 --> 00:57:47,290 and you're going through exactly the same stuff. 1084 00:57:47,290 --> 00:57:49,512 That's the same as going through n iterations. 1085 00:57:49,512 --> 00:57:51,220 But you're replacing a singly-nested loop 1086 00:57:51,220 --> 00:57:52,528 by a doubly-nested loop. 1087 00:57:52,528 --> 00:57:54,820 And the only difference here is that in the inner loop, 1088 00:57:54,820 --> 00:57:56,320 I'm actually spawning off work. 1089 00:57:59,970 --> 00:58:04,260 So here, the work is order n, because 1090 00:58:04,260 --> 00:58:07,080 if G is 1, 1091 00:58:07,080 --> 00:58:09,990 then I'm spawning off one piece of work at a time, 1092 00:58:09,990 --> 00:58:12,900 going from 0 up to n minus 1. 1093 00:58:12,900 --> 00:58:14,550 So I've got order n work up here, 1094 00:58:14,550 --> 00:58:18,330 and order n work down below. 1095 00:58:18,330 --> 00:58:19,680 What's the span for this? 1096 00:58:22,870 --> 00:58:24,970 After all, I've got n spans there now-- 1097 00:58:24,970 --> 00:58:27,780 sorry, n spawns, not n spans. 1098 00:58:27,780 --> 00:58:28,280 n spawns. 1099 00:58:35,640 --> 00:58:37,740 What's the span going to be? 1100 00:58:37,740 --> 00:58:38,506 Yeah. 1101 00:58:38,506 --> 00:58:39,500 AUDIENCE: [INAUDIBLE] 1102 00:58:39,500 --> 00:58:41,250 CHARLES LEISERSON: Sorry? 1103 00:58:41,250 --> 00:58:43,305 Sorry, I couldn't hear-- 1104 00:58:43,305 --> 00:58:44,180 AUDIENCE: [INAUDIBLE] 1105 00:58:44,180 --> 00:58:45,810 CHARLES LEISERSON: Theta S? 1106 00:58:45,810 --> 00:58:48,060 No, it's bigger than that. 1107 00:58:48,060 --> 00:58:50,250 Yeah, you'd think, gee, I just have to do 1108 00:58:50,250 --> 00:58:52,050 one thing to go down and up. 1109 00:58:52,050 --> 00:58:54,791 But the span is the longest path in the whole DAG. 1110 00:59:06,580 --> 00:59:08,300 It's the longest path in the whole DAG. 1111 00:59:10,910 --> 00:59:14,880 Where does the longest path in the whole DAG start? 1112 00:59:14,880 --> 00:59:16,350 Upper left, right? 1113 00:59:16,350 --> 00:59:18,810 And where does it end? 1114 00:59:18,810 --> 00:59:19,380 Upper right. 1115 00:59:22,150 --> 00:59:25,330 How long is that path? 1116 00:59:25,330 --> 00:59:26,830 What's the longest one?
1117 00:59:26,830 --> 00:59:30,640 It's going to go all the way down the backbone of the top 1118 00:59:30,640 --> 00:59:34,850 there, and then flip down and back up. 1119 00:59:34,850 --> 00:59:36,550 So how many things are in the-- 1120 00:59:36,550 --> 00:59:40,720 if G is 1, how many things am I spawning off there? 1121 00:59:40,720 --> 00:59:42,674 n things, so the span is? 1122 00:59:48,470 --> 00:59:49,000 Order n? 1123 00:59:51,770 --> 00:59:52,340 So order n. 1124 00:59:52,340 --> 00:59:54,080 It's long. 1125 00:59:54,080 --> 00:59:55,542 So what's the parallelism here? 1126 01:00:01,326 --> 01:00:02,930 AUDIENCE: [INAUDIBLE] 1127 01:00:02,930 --> 01:00:03,930 CHARLES LEISERSON: Yeah. 1128 01:00:03,930 --> 01:00:05,940 It's order 1. 1129 01:00:05,940 --> 01:00:07,644 And what do we call that? 1130 01:00:07,644 --> 01:00:08,792 AUDIENCE: Bad. 1131 01:00:08,792 --> 01:00:09,750 CHARLES LEISERSON: Bad. 1132 01:00:09,750 --> 01:00:11,060 Right. 1133 01:00:11,060 --> 01:00:15,060 But there's a more technical name. 1134 01:00:15,060 --> 01:00:16,758 They call that puny. 1135 01:00:16,758 --> 01:00:19,098 [LAUGHTER] 1136 01:00:19,652 --> 01:00:21,360 It's like, we went through all this work, 1137 01:00:21,360 --> 01:00:24,540 spawned off all that stuff, added all this overhead, 1138 01:00:24,540 --> 01:00:26,040 and it didn't go any faster. 1139 01:00:26,040 --> 01:00:27,540 I can't tell you how many times I've 1140 01:00:27,540 --> 01:00:32,640 seen people do this when they start parallel programming. 1141 01:00:32,640 --> 01:00:35,990 Oh, but I spawned off all this stuff! 1142 01:00:35,990 --> 01:00:38,075 Yeah, but you didn't reduce the span. 1143 01:00:45,710 --> 01:00:51,560 Let's now-- that was analyzing it 1144 01:00:51,560 --> 01:00:55,020 in terms of G equals 1. 1145 01:00:55,020 --> 01:00:58,220 Now let's increase the grain size 1146 01:00:58,220 --> 01:01:01,490 and analyze it in terms of G. So once again, 1147 01:01:01,490 --> 01:01:04,700 what's the work now? 1148 01:01:04,700 --> 01:01:06,050 Work is always a gimme. 1149 01:01:11,230 --> 01:01:11,950 Yeah. 1150 01:01:11,950 --> 01:01:13,117 AUDIENCE: Same as before, n. 1151 01:01:13,117 --> 01:01:13,992 CHARLES LEISERSON: n. 1152 01:01:13,992 --> 01:01:14,790 Same as before. n. 1153 01:01:14,790 --> 01:01:17,415 The work doesn't change when you parallelize things differently 1154 01:01:17,415 --> 01:01:19,360 and stuff like that. 1155 01:01:19,360 --> 01:01:21,885 I'm doing order n iterations. 1156 01:01:21,885 --> 01:01:22,885 Oh, but what's the span? 1157 01:01:26,765 --> 01:01:27,640 This is a tricky one. 1158 01:01:27,640 --> 01:01:28,330 Yeah. 1159 01:01:28,330 --> 01:01:29,590 AUDIENCE: n over G. 1160 01:01:29,590 --> 01:01:31,600 CHARLES LEISERSON: Close. 1161 01:01:31,600 --> 01:01:33,085 That's half right. 1162 01:01:36,560 --> 01:01:39,382 That's half right. 1163 01:01:39,382 --> 01:01:41,830 Good. 1164 01:01:41,830 --> 01:01:43,300 That's half right. 1165 01:01:43,300 --> 01:01:44,042 Yeah. 1166 01:01:44,042 --> 01:01:45,430 AUDIENCE: [INAUDIBLE] 1167 01:01:45,430 --> 01:01:48,130 CHARLES LEISERSON: n over G plus G. 1168 01:01:48,130 --> 01:01:49,420 Don't forget that other term. 1169 01:01:49,420 --> 01:01:55,000 So the path that we care about goes along the top 1170 01:01:55,000 --> 01:01:57,730 here, and then goes down there. 1171 01:01:57,730 --> 01:02:04,630 And this has span G. So we've got n over G 1172 01:02:04,630 --> 01:02:07,270 here, because I'm doing chunks of G, plus G.
1173 01:02:07,270 --> 01:02:13,030 So it's G plus n over G. And now, how can I 1174 01:02:13,030 --> 01:02:15,800 choose G to minimize the span? 1175 01:02:15,800 --> 01:02:18,220 There's nothing to choose to minimize the work, 1176 01:02:18,220 --> 01:02:20,690 except there's some work overhead 1177 01:02:20,690 --> 01:02:21,690 that we're trying to reduce. 1178 01:02:21,690 --> 01:02:25,044 But how can I choose G to minimize the span? 1179 01:02:28,700 --> 01:02:31,230 What's the best value for G here? 1180 01:02:31,230 --> 01:02:31,730 Yeah. 1181 01:02:31,730 --> 01:02:32,610 AUDIENCE: [INAUDIBLE] 1182 01:02:32,610 --> 01:02:33,860 CHARLES LEISERSON: You got it. 1183 01:02:33,860 --> 01:02:36,300 Square root of n. 1184 01:02:36,300 --> 01:02:40,640 So one of these is increasing. 1185 01:02:40,640 --> 01:02:43,590 If G is increasing, n over G is decreasing, 1186 01:02:43,590 --> 01:02:45,180 so where do they cross? 1187 01:02:45,180 --> 01:02:46,680 When they're equal. 1188 01:02:46,680 --> 01:02:52,410 That's when G equals n over G, or G is square root of n. 1189 01:02:54,966 --> 01:02:57,880 So this actually has decent parallelism-- 1190 01:02:57,880 --> 01:03:01,970 for n big enough, square root of n, that's not bad. 1191 01:03:01,970 --> 01:03:04,985 So it is OK to spawn things off in chunks. 1192 01:03:04,985 --> 01:03:06,610 Just don't make the chunks real little. 1193 01:03:10,240 --> 01:03:12,702 What's the parallelism? 1194 01:03:12,702 --> 01:03:14,160 Once again, this is always a gimme. 1195 01:03:14,160 --> 01:03:15,267 It's the ratio. 1196 01:03:15,267 --> 01:03:16,100 So square root of n. 1197 01:03:19,170 --> 01:03:20,460 Quiz on parallel loops. 1198 01:03:25,180 --> 01:03:27,760 I'm going to let you folks do this offline. 1199 01:03:27,760 --> 01:03:28,935 Here's the answers. 1200 01:03:28,935 --> 01:03:30,600 If you quickly write it down, you 1201 01:03:30,600 --> 01:03:33,346 don't have to think about it. 1202 01:03:33,346 --> 01:03:36,280 [RAPID BREATHING] 1203 01:03:38,150 --> 01:03:38,650 OK. 1204 01:03:38,650 --> 01:03:40,542 [LAUGHTER] 1205 01:03:43,380 --> 01:03:45,610 OK. 1206 01:03:45,610 --> 01:03:47,310 So take a look at the notes afterwards, 1207 01:03:47,310 --> 01:03:51,840 and you can try to figure out why those things are so. 1208 01:03:51,840 --> 01:03:56,500 So there are some performance tips that make sense when 1209 01:03:56,500 --> 01:03:57,750 you're programming with loops. 1210 01:03:57,750 --> 01:04:01,450 One is, minimize the span to maximize the parallelism, 1211 01:04:01,450 --> 01:04:04,002 because the span's in the denominator. 1212 01:04:04,002 --> 01:04:05,460 And generally, you want to generate 1213 01:04:05,460 --> 01:04:08,220 10 times more parallelism than processors 1214 01:04:08,220 --> 01:04:11,082 if you want near-perfect linear speed-up. 1215 01:04:11,082 --> 01:04:13,290 So if you have a lot more parallelism than the number 1216 01:04:13,290 --> 01:04:15,630 of processors-- we talked about that last time-- 1217 01:04:15,630 --> 01:04:17,543 you get good speed-up. 1218 01:04:17,543 --> 01:04:18,960 If you have plenty of parallelism, 1219 01:04:18,960 --> 01:04:23,580 try to trade some of it off to reduce the work overhead. 1220 01:04:23,580 --> 01:04:25,710 So the idea is, for any of these things, 1221 01:04:25,710 --> 01:04:31,050 you can fiddle with the numbers, the grain size 1222 01:04:31,050 --> 01:04:32,920 in particular-- 1223 01:04:32,920 --> 01:04:36,720 it reduces the parallelism, but it also reduces the overhead.
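As a worked instance of that tradeoff, take the chunked vector add above: ignoring the constants i and S as before, the span is minimized where its two terms cross:

```latex
T_\infty(G) \approx G + \frac{n}{G},
\qquad
G = \frac{n}{G} \;\Longrightarrow\; G = \sqrt{n},
\qquad
\frac{T_1}{T_\infty} = \Theta\!\left(\frac{n}{\sqrt{n}}\right) = \Theta\!\left(\sqrt{n}\right).
```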
1224 01:04:36,720 --> 01:04:39,810 And as long as you've got sufficient parallelism, 1225 01:04:39,810 --> 01:04:42,870 your code is going to run just fine in parallel. 1226 01:04:42,870 --> 01:04:45,072 It's only when you're in the place where, 1227 01:04:45,072 --> 01:04:46,530 ooh, I don't have enough parallelism, 1228 01:04:46,530 --> 01:04:48,210 and I don't want to pay the overhead-- 1229 01:04:48,210 --> 01:04:50,340 those are the sticky ones. 1230 01:04:50,340 --> 01:04:53,760 But most of the time, you're going to be in the case 1231 01:04:53,760 --> 01:04:55,800 where you've got way more parallelism than you 1232 01:04:55,800 --> 01:04:58,710 need, and the question is, how can you reduce some of it 1233 01:04:58,710 --> 01:05:02,580 in order to reduce the work overhead? 1234 01:05:02,580 --> 01:05:05,790 Generally, you should use divide and conquer recursion 1235 01:05:05,790 --> 01:05:08,940 or parallel loops, rather than spawning one small thing 1236 01:05:08,940 --> 01:05:10,480 after another. 1237 01:05:10,480 --> 01:05:12,150 So it's better to do the Cilk for, 1238 01:05:12,150 --> 01:05:15,030 which already is doing divide and conquer parallelism, 1239 01:05:15,030 --> 01:05:18,270 than doing the spawn off one thing at a time 1240 01:05:18,270 --> 01:05:22,950 type of strategy, unless you can chunk them 1241 01:05:22,950 --> 01:05:25,020 so that you have relatively few things 1242 01:05:25,020 --> 01:05:26,130 that you're spawning off. 1243 01:05:26,130 --> 01:05:27,340 This would be fine. 1244 01:05:27,340 --> 01:05:30,180 The thing I say not to do-- that would be fine if foo of i 1245 01:05:30,180 --> 01:05:32,920 was really expensive. 1246 01:05:32,920 --> 01:05:35,560 Fine, then we'd have lots of parallelism, 1247 01:05:35,560 --> 01:05:37,480 because there's a lot of work there. 1248 01:05:37,480 --> 01:05:42,460 But generally, it's better to do the divide and conquer. 1249 01:05:42,460 --> 01:05:44,740 Generally, you should make sure that the work 1250 01:05:44,740 --> 01:05:48,460 that you're doing per spawn is sufficiently large. 1251 01:05:48,460 --> 01:05:50,620 So the question with spawns is, well, how much 1252 01:05:50,620 --> 01:05:54,580 are you busting your work into in terms of chunks? 1253 01:05:54,580 --> 01:05:57,790 Because the spawn has an overhead, and so the question 1254 01:05:57,790 --> 01:05:59,770 is, well, how big is that? 1255 01:05:59,770 --> 01:06:02,620 And so you can coarsen by using function calls 1256 01:06:02,620 --> 01:06:04,720 and in-lining near the leaves. 1257 01:06:04,720 --> 01:06:06,970 It's generally better to parallelize outer loops as opposed 1258 01:06:06,970 --> 01:06:08,637 to inner loops, if you're forced to make 1259 01:06:08,637 --> 01:06:11,120 a choice, because with inner loops, 1260 01:06:11,120 --> 01:06:13,900 you're incurring the overhead every single time. 1261 01:06:13,900 --> 01:06:15,490 With the outer loop, you can amortize it 1262 01:06:15,490 --> 01:06:17,680 against the work that's going on inside 1263 01:06:17,680 --> 01:06:20,050 that doesn't have the overhead. 1264 01:06:20,050 --> 01:06:22,910 And watch out for scheduling overhead. 1265 01:06:22,910 --> 01:06:28,860 So here's an example of two codes that have parallelism 2, 1266 01:06:28,860 --> 01:06:31,710 and one of them is an efficient code, 1267 01:06:31,710 --> 01:06:35,240 and the other one is lousy code.
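A hedged reconstruction of that pair, not the slide verbatim-- the function f is a stand-in for the real loop body:

```c
#include <cilk/cilk.h>

void f(int i, int j);   // stand-in for the real loop body (assumed)

// Efficient: one parallel loop over 2 big iterations -- the scheduler
// gets involved once, and that cost is amortized over n calls to f.
void efficient(int n) {
    cilk_for (int i = 0; i < 2; ++i)
        for (int j = 0; j < n; ++j)
            f(i, j);
}

// Lousy: the parallel loop is on the inside -- the scheduler gets
// involved n times, once per iteration of the outer serial loop.
void lousy(int n) {
    for (int j = 0; j < n; ++j)
        cilk_for (int i = 0; i < 2; ++i)
            f(i, j);
}
```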
1268 01:06:35,240 --> 01:06:38,280 The top one is efficient, because I 1269 01:06:38,280 --> 01:06:43,920 have two iterations that I run in parallel, and each of them 1270 01:06:43,920 --> 01:06:47,160 does a lot of work. 1271 01:06:47,160 --> 01:06:50,100 There's only one scheduling operation that happens. 1272 01:06:50,100 --> 01:06:55,920 The bottom one, I have n iterations, 1273 01:06:55,920 --> 01:07:00,660 and each iteration only does work 2, 1274 01:07:00,660 --> 01:07:03,780 so I basically have n iterations' worth of overhead. 1275 01:07:03,780 --> 01:07:06,750 And so if you just look at these, look at the overhead, 1276 01:07:06,750 --> 01:07:08,490 you can see what the difference is. 1277 01:07:08,490 --> 01:07:09,458 OK. 1278 01:07:09,458 --> 01:07:11,250 Actually, I have 1279 01:07:11,250 --> 01:07:12,630 a whole bunch of things here 1280 01:07:12,630 --> 01:07:13,710 that I'm not going to get to, but I 1281 01:07:13,710 --> 01:07:14,920 didn't expect to get to them. 1282 01:07:14,920 --> 01:07:19,590 But I do want to get to some matrix multiplication. 1283 01:07:19,590 --> 01:07:22,690 People familiar with this problem? 1284 01:07:22,690 --> 01:07:23,730 OK. 1285 01:07:23,730 --> 01:07:25,230 We're going to assume for simplicity 1286 01:07:25,230 --> 01:07:27,330 that n is a power of 2. 1287 01:07:27,330 --> 01:07:29,940 So here's the typical way of parallelizing matrix 1288 01:07:29,940 --> 01:07:31,350 multiplication. 1289 01:07:31,350 --> 01:07:36,180 I take the two outer loops and I parallelize them. 1290 01:07:36,180 --> 01:07:39,630 I can't easily parallelize the inner loop, because if I do, 1291 01:07:39,630 --> 01:07:42,090 I get a race condition, because I'll 1292 01:07:42,090 --> 01:07:45,470 have two iterations that are both trying to update C of i, 1293 01:07:45,470 --> 01:07:47,450 j. 1294 01:07:47,450 --> 01:07:50,390 So I can't just parallelize k, so I'm just 1295 01:07:50,390 --> 01:07:53,510 going to parallelize i and j. 1296 01:07:53,510 --> 01:07:54,710 The work for this is what? 1297 01:07:57,370 --> 01:07:58,539 Triply-nested loop. 1298 01:08:02,850 --> 01:08:03,360 n cubed. 1299 01:08:03,360 --> 01:08:05,220 Everybody knows-- matrix multiplication, 1300 01:08:05,220 --> 01:08:07,800 unless you do something clever like Strassen, 1301 01:08:07,800 --> 01:08:11,190 or one of the more recent-- 1302 01:08:11,190 --> 01:08:16,290 Vassilevska Williams' algorithm-- you know 1303 01:08:16,290 --> 01:08:20,910 that the running time for the standard algorithm is n cubed. 1304 01:08:20,910 --> 01:08:23,290 The span for this is what? 1305 01:08:23,290 --> 01:08:23,790 Yeah. 1306 01:08:23,790 --> 01:08:27,015 That inner loop is linear size, and then you've got two log 1307 01:08:27,015 --> 01:08:28,380 n's-- 1308 01:08:28,380 --> 01:08:30,689 log n plus log n plus n-- 1309 01:08:30,689 --> 01:08:32,490 so it's order n. 1310 01:08:32,490 --> 01:08:37,020 So the parallelism is around n squared. 1311 01:08:37,020 --> 01:08:39,609 If I ignore constants, and I said 1312 01:08:39,609 --> 01:08:42,960 I was working on matrices of, say, 1,000 by 1,000 or so, 1313 01:08:42,960 --> 01:08:47,069 the parallelism is something like n squared, 1314 01:08:47,069 --> 01:08:49,260 which is about-- 1315 01:08:49,260 --> 01:08:52,970 1,000 squared is a million. 1316 01:08:52,970 --> 01:08:54,350 Wow. 1317 01:08:54,350 --> 01:08:56,120 That's a lot of parallelism. 1318 01:08:56,120 --> 01:08:59,870 How many processors are you running on?
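(For reference, a hedged sketch of the loop parallelization just described-- square n-by-n row-major matrices, with the names assumed:)

```c
#include <cilk/cilk.h>

// Work: Theta(n^3).  Span: Theta(n + log n) = Theta(n).
// Parallelism: Theta(n^2).
void mm_loops(double *restrict C, const double *restrict A,
              const double *restrict B, int n) {
    cilk_for (int i = 0; i < n; ++i)        // parallelize the two outer loops
        cilk_for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)     // keep k serial: parallelizing it
                C[i*n + j] += A[i*n + k]    // would race on the update of C[i][j]
                            * B[k*n + j];
}
```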
1319 01:08:59,870 --> 01:09:03,500 Is it bigger than 10 times the number 1320 01:09:03,500 --> 01:09:06,529 of processors? By a little bit. 1321 01:09:06,529 --> 01:09:08,300 Now, there's another strategy that one 1322 01:09:08,300 --> 01:09:10,609 can use, which is divide and conquer, 1323 01:09:10,609 --> 01:09:12,890 and this is the strategy that's used in Strassen. 1324 01:09:12,890 --> 01:09:14,210 We're not going to do Strassen's algorithm. 1325 01:09:14,210 --> 01:09:16,640 We're just going to use the eight-multiply version of this. 1326 01:09:16,640 --> 01:09:18,660 For people who know Strassen, more power to you. 1327 01:09:18,660 --> 01:09:20,420 It's a great algorithm. 1328 01:09:20,420 --> 01:09:23,029 Really surprising, really amazing. 1329 01:09:23,029 --> 01:09:26,109 And it's actually worthwhile doing in practice, by the way, 1330 01:09:26,109 --> 01:09:28,189 for sufficiently large matrices. 1331 01:09:28,189 --> 01:09:33,740 So the idea here is, I can multiply two n by n matrices 1332 01:09:33,740 --> 01:09:36,770 by doing eight multiplications of n over 2 1333 01:09:36,770 --> 01:09:41,795 by n over 2 matrices, and then add two n by n matrices. 1334 01:09:47,270 --> 01:09:49,855 So when we start talking matrices-- 1335 01:09:49,855 --> 01:09:52,220 this is a little bit of a diversion from the algorithms, 1336 01:09:52,220 --> 01:09:55,520 but it's so important, because the representation of matrices 1337 01:09:55,520 --> 01:09:58,220 is one of the things that gets people into trouble when 1338 01:09:58,220 --> 01:10:04,160 they're doing any kind of two-dimensional coding stuff. 1339 01:10:04,160 --> 01:10:06,580 And so I want to talk a little bit about indexing, 1340 01:10:06,580 --> 01:10:08,900 and we're going to talk about this more later when 1341 01:10:08,900 --> 01:10:12,360 we do cache behavior and such. 1342 01:10:12,360 --> 01:10:15,290 So how do you represent sub-matrices? 1343 01:10:15,290 --> 01:10:17,390 The standard way of representing them is either 1344 01:10:17,390 --> 01:10:19,760 in row-major or column-major order, 1345 01:10:19,760 --> 01:10:21,530 depending upon the language you use. 1346 01:10:21,530 --> 01:10:24,260 Fortran uses column-major ordering, 1347 01:10:24,260 --> 01:10:25,760 so there are a lot of subroutines 1348 01:10:25,760 --> 01:10:26,720 that are column-major. 1349 01:10:26,720 --> 01:10:33,950 But for the most part, C, which we're using, is row-major. 1350 01:10:33,950 --> 01:10:36,170 And so the question is, if I take 1351 01:10:36,170 --> 01:10:38,930 a sub-matrix of a large matrix, how 1352 01:10:38,930 --> 01:10:44,540 do I calculate where the i, j element of that matrix is? 1353 01:10:44,540 --> 01:10:48,650 Here I have the i, j element here. 1354 01:10:48,650 --> 01:10:51,530 I've got a matrix M, which is embedded. 1355 01:10:51,530 --> 01:10:53,120 And row-major, remember, means 1356 01:10:53,120 --> 01:10:56,450 I just take row after row, and I just put them in linear order 1357 01:10:56,450 --> 01:10:58,980 through the memory. 1358 01:10:58,980 --> 01:11:01,260 So every two-dimensional matrix, you can index 1359 01:11:01,260 --> 01:11:04,680 as a one-dimensional matrix-- 1360 01:11:04,680 --> 01:11:06,520 which is exactly what the code is doing-- 1361 01:11:06,520 --> 01:11:08,670 because all you need to know is the beginning of the matrix. 1362 01:11:08,670 --> 01:11:13,240 But if you have a sub-matrix, it's a little more complicated. 1363 01:11:13,240 --> 01:11:14,960 So here's the idea.
1364 01:11:14,960 --> 01:11:17,860 Suppose that you have a sub-matrix M-- 1365 01:11:17,860 --> 01:11:22,390 starting in location M of this outer matrix. 1366 01:11:22,390 --> 01:11:26,230 Here we have the outer matrix, which has row length n sub M. 1367 01:11:26,230 --> 01:11:29,070 This is the big matrix-- 1368 01:11:29,070 --> 01:11:30,660 actually, I should have called that m. 1369 01:11:33,250 --> 01:11:35,755 I should not have called this n instead of m. 1370 01:11:35,755 --> 01:11:37,630 I should have called it m sub something else, 1371 01:11:37,630 --> 01:11:40,750 because this is my M that I'm interested in, which 1372 01:11:40,750 --> 01:11:41,890 is this location here. 1373 01:11:44,740 --> 01:11:50,560 And what I'm interested in doing is finding out-- 1374 01:11:50,560 --> 01:11:53,170 I named these variables stupidly-- 1375 01:11:53,170 --> 01:11:55,750 finding out, where is the i, j-th element 1376 01:11:55,750 --> 01:11:57,220 of this sub-matrix M? 1377 01:11:57,220 --> 01:12:01,450 If I tell you the beginning, what do I add to get to i, j? 1378 01:12:01,450 --> 01:12:05,110 And the answer is that I've got to add the number of rows 1379 01:12:05,110 --> 01:12:06,010 that I come down here. 1380 01:12:06,010 --> 01:12:09,040 Well, that's i times the width of the full matrix 1381 01:12:09,040 --> 01:12:11,770 that you're taking it out of, not the width 1382 01:12:11,770 --> 01:12:15,670 of your local sub-matrix. 1383 01:12:15,670 --> 01:12:19,720 And then you have to add in-- 1384 01:12:19,720 --> 01:12:23,595 then you add in j from that point. 1385 01:12:23,595 --> 01:12:24,430 There we go. 1386 01:12:24,430 --> 01:12:25,090 OK. 1387 01:12:25,090 --> 01:12:30,100 So I have to add in i times the row length of the long matrix, 1388 01:12:30,100 --> 01:12:33,970 plus j. 1389 01:12:33,970 --> 01:12:35,938 Does that make sense? 1390 01:12:35,938 --> 01:12:37,230 Because it's embedded in there. 1391 01:12:37,230 --> 01:12:40,080 You have to skip over full rows of the outer matrix. 1392 01:12:40,080 --> 01:12:43,410 So you can't generally just pass a sub-matrix 1393 01:12:43,410 --> 01:12:45,600 and expect to do indexing on that when it's 1394 01:12:45,600 --> 01:12:47,250 embedded in a large matrix. 1395 01:12:47,250 --> 01:12:49,530 If you make a copy, sure, then you 1396 01:12:49,530 --> 01:12:52,170 can index it according to whatever the new copy is. 1397 01:12:52,170 --> 01:12:54,720 But if you want to operate in place on matrices, which 1398 01:12:54,720 --> 01:12:58,770 is often the case, then you have to understand that for every row, 1399 01:12:58,770 --> 01:13:00,930 you have to jump a row of the outer matrix, not 1400 01:13:00,930 --> 01:13:03,690 a row of whatever your sub-matrix is, when you're 1401 01:13:03,690 --> 01:13:06,240 doing the divide and conquer. 1402 01:13:06,240 --> 01:13:08,950 So when we look at doing divide and conquer-- 1403 01:13:08,950 --> 01:13:13,200 I have a matrix here which I want to now 1404 01:13:13,200 --> 01:13:18,000 divide into four sub-matrices of size n over 2 by n over 2. 1405 01:13:18,000 --> 01:13:20,820 And the question is, where are the starting corners 1406 01:13:20,820 --> 01:13:23,340 of each of those matrices? 1407 01:13:23,340 --> 01:13:29,790 So M 0, 0, that starts at the same place as M-- that upper 1408 01:13:29,790 --> 01:13:30,660 left one. 1409 01:13:30,660 --> 01:13:32,634 Where does M 0, 1 start? 1410 01:13:40,230 --> 01:13:41,320 Where's M 0, 1 start?
1411 01:13:45,685 --> 01:13:46,780 AUDIENCE: [INAUDIBLE] 1412 01:13:46,780 --> 01:13:47,780 CHARLES LEISERSON: Yeah. 1413 01:13:47,780 --> 01:13:50,180 M plus n over 2. 1414 01:13:50,180 --> 01:13:53,101 Where does M 1, 0 start? 1415 01:13:53,101 --> 01:13:54,592 This is the tricky one. 1416 01:14:01,560 --> 01:14:03,690 Here's the answer. 1417 01:14:03,690 --> 01:14:07,650 M plus the row length of the long matrix times n over 2, 1418 01:14:07,650 --> 01:14:10,140 because I'm going down n over 2 rows, 1419 01:14:10,140 --> 01:14:11,940 and I've got to go down a full row 1420 01:14:11,940 --> 01:14:14,490 of the outer matrix each time. 1421 01:14:14,490 --> 01:14:18,250 And then M 1, 1 is the sum of those two offsets. 1422 01:14:18,250 --> 01:14:24,120 So in general, for row and column each being 0 or 1, 1423 01:14:24,120 --> 01:14:27,350 in some sense, this is a general formula 1424 01:14:27,350 --> 01:14:34,710 that matches up with that, where I plug in 0 or 1 for each one. 1425 01:14:34,710 --> 01:14:35,960 And now here's my code. 1426 01:14:35,960 --> 01:14:37,960 And I just want to point out a couple of things, 1427 01:14:37,960 --> 01:14:39,510 and then we'll quit and I'll let you 1428 01:14:39,510 --> 01:14:43,980 take a look at the rest of this on your own. 1429 01:14:46,540 --> 01:14:49,410 Here's my divide and conquer matrix multiply. 1430 01:14:49,410 --> 01:14:50,940 I use restrict. 1431 01:14:50,940 --> 01:14:52,740 Everybody familiar with restrict? 1432 01:14:52,740 --> 01:14:55,320 It tells the compiler these things 1433 01:14:55,320 --> 01:14:59,040 it can assume are not aliased, so that when you change one, 1434 01:14:59,040 --> 01:15:00,210 you're not changing another. 1435 01:15:00,210 --> 01:15:03,090 That lets the compiler produce better code. 1436 01:15:03,090 --> 01:15:06,690 And then the row sizes are going to be n sub c, n sub a, 1437 01:15:06,690 --> 01:15:09,280 and n sub b-- 1438 01:15:09,280 --> 01:15:12,550 those are the row sizes of the outer matrices 1439 01:15:12,550 --> 01:15:14,680 that we're taking the sub-matrices out of. 1440 01:15:14,680 --> 01:15:19,210 The sub-matrices being multiplied are going to have size n by n, because 1441 01:15:19,210 --> 01:15:21,220 when I have my recursion, I want to talk 1442 01:15:21,220 --> 01:15:23,590 about sub-matrices that are embedded in this larger 1443 01:15:23,590 --> 01:15:25,810 outside matrix. 1444 01:15:25,810 --> 01:15:30,190 Here is a great piece of bit tricks. 1445 01:15:30,190 --> 01:15:32,260 This says, n is a power of 2. 1446 01:15:35,630 --> 01:15:39,320 So go back and remind yourself of what the bit tricks are, 1447 01:15:39,320 --> 01:15:42,440 but that's a clever bit trick to say that n is a power of 2. 1448 01:15:42,440 --> 01:15:44,990 Very quick. 1449 01:15:44,990 --> 01:15:47,960 And so take a look at that. 1450 01:15:47,960 --> 01:15:50,480 And then we're going to coarsen leaves with a base case. 1451 01:15:50,480 --> 01:15:53,570 The base case just goes through and solves the problem 1452 01:15:53,570 --> 01:15:59,240 for small n, just with a typical triply-nested loop. 1453 01:15:59,240 --> 01:16:02,120 And what we're going to do is allocate a temporary n 1454 01:16:02,120 --> 01:16:07,750 by n array, and then we're going to define the temporary array 1455 01:16:07,750 --> 01:16:12,360 to have underlying row size n. 1456 01:16:12,360 --> 01:16:17,100 And then here is this fabulous macro that makes all the index 1457 01:16:17,100 --> 01:16:18,240 calculations easy.
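A reconstruction in the spirit of that macro-- not the slide verbatim; the name X and the row-size variables nC, nA, nB are assumptions:

```c
// n ## M token-pastes the matrix name onto n, producing nC, nA, or nB --
// the row size of the outer matrix that the sub-matrix is embedded in.
#define X(M, r, c) (M + (r) * (n ## M) + (c))
// e.g., X(C, i, j) expands to (C + (i) * nC + (j))
```

(And the power-of-2 check mentioned a moment ago is presumably the classic bit trick (n & (-n)) == n, which holds exactly when a positive n is a power of 2.)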
1458 01:16:18,240 --> 01:16:22,860 It uses the sharp-sharp operator, 1459 01:16:22,860 --> 01:16:27,090 which pastes tokens together, so that I can paste n onto the matrix name to get n sub c. 1460 01:16:27,090 --> 01:16:30,220 Whatever matrix I pass, 1461 01:16:30,220 --> 01:16:34,170 it pastes the tokens together to get the right row size. 1462 01:16:34,170 --> 01:16:37,050 So it allows me to do the indexing 1463 01:16:37,050 --> 01:16:41,040 and have the right thing, so that for each of these address 1464 01:16:41,040 --> 01:16:46,590 calculations, I'm able to do them by just saying X of 1465 01:16:46,590 --> 01:16:48,580 the matrix and the indices, and the macro supplies the formula. 1466 01:16:48,580 --> 01:16:51,450 Otherwise, you'd be driven nuts by the formulas. 1467 01:16:51,450 --> 01:16:54,240 So take a look at that macro, because that may help you 1468 01:16:54,240 --> 01:16:55,980 in some of your other things. 1469 01:16:55,980 --> 01:17:00,360 And then I sync, and then add it up. 1470 01:17:00,360 --> 01:17:02,790 And the addition is just going to be 1471 01:17:02,790 --> 01:17:07,650 a doubly-nested parallel addition, and then I free it. 1472 01:17:07,650 --> 01:17:11,520 So what I would like you to do is go home and take a look 1473 01:17:11,520 --> 01:17:13,270 at the analysis of this. 1474 01:17:13,270 --> 01:17:17,500 And it turns out this has way more parallelism than you need, 1475 01:17:17,500 --> 01:17:20,220 and if you reduce the amount of parallelism, 1476 01:17:20,220 --> 01:17:21,630 you get much better performance. 1477 01:17:21,630 --> 01:17:23,130 And there are several other algorithms 1478 01:17:23,130 --> 01:17:25,650 I put in there as well. 1479 01:17:25,650 --> 01:17:29,340 So I'll try to get this posted tonight. 1480 01:17:29,340 --> 01:17:31,430 Thanks very much.