The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JULIAN SHUN: Hi, good afternoon, everyone. So today, we're going to be talking about graph optimizations. And as a reminder, on Thursday, we're going to have a guest lecture by Professor Johnson of the MIT Math Department. And he'll be talking about performance of high-level languages. So please be sure to attend the guest lecture on Thursday.

So here's an outline of what I'm going to be talking about today. We're first going to remind ourselves what a graph is. And then we're going to talk about various ways to represent a graph in memory. And then we'll talk about how to implement an efficient breadth-first search algorithm, both serially and also in parallel. And then I'll talk about how to use graph compression and graph reordering to improve the locality of graph algorithms.

So first of all, what is a graph? A graph contains vertices and edges, where vertices represent certain objects of interest, and edges between objects model relationships between the two objects. For example, you can have a social network, where the people are represented as vertices and edges between people mean that they're friends with each other.

The edges in this graph don't have to be bidirectional. So you could have a one-way relationship. For example, if you're looking at the Twitter network, Alice could follow Bob, but Bob doesn't necessarily have to follow Alice back. The graph also doesn't have to be connected. So here, this graph is connected. But, for example, there could be some people who don't like to talk to other people, and then they're just off in their own component.
You can also use graphs to model protein networks, where the vertices are proteins, and edges between vertices mean that there's some sort of interaction between the proteins. So this is useful in computational biology.

As I said, edges can be directed, so the relationship can go one way or both ways. In this graph here, we have some directed edges and then also some edges that are directed in both directions. So here, John follows Alice. Alice follows Peter. And then Alice follows Bob, and Bob also follows Alice.

If you use a graph to represent the World Wide Web, then the vertices would be websites, and then the edges would denote that there is a hyperlink from one website to another. And again, the edges here don't have to be bidirectional, because website A could have a link to website B, but website B doesn't necessarily have to have a link back.

Edges can also be weighted. So you can have a weight on the edge that denotes the strength of the relationship or some sort of distance measure corresponding to that relationship. So here, I have an example where I'm using a graph to represent cities. And the edges between cities have a weight that corresponds to the distance between the two cities. And if I want to find the quickest way to get from city A to city B, then I would be interested in finding the shortest path from A to B in this graph here.

Here's another example, where the edge weights now are the costs of a direct flight from city A to city B. And here the edges are directed. So, for example, this says that there's a flight from San Francisco to LA for $45. And if I want to find the cheapest way to get from one city to another city, then, again, I would try to find the shortest path in this graph from city A to city B.

Vertices and edges can also have metadata on them, and they can also have types. So, for example, here's the Google Knowledge Graph, which represents all the knowledge on the internet that Google knows about.
And here, the nodes have metadata on them. So, for example, the node corresponding to da Vinci is labeled with his date of birth and date of death. And the vertices also have a color corresponding to the type of knowledge that they refer to. So you can see that some of these nodes are blue, some of them are red, some of them are green, and some of them have other things on them. So in general, graphs can have types and metadata on both the vertices as well as the edges.

Let's look at some more applications of graphs. So graphs are very useful for implementing queries on social networks. Here are some examples of queries that you might want to ask on a social network. So, for example, you might be interested in finding all of your friends who went to the same high school as you on Facebook. So that can be implemented using a graph algorithm. You might also be interested in finding all of the common friends you have with somebody else--again, a graph algorithm. And a social network service might run a graph algorithm to recommend people that you might know and want to become friends with. And they might use a graph algorithm to recommend certain products that you might be interested in. So these are all examples of social network queries. And there are many other queries that you might be interested in running on a social network, and many of them can be implemented using graph algorithms.

Another important application is clustering. So here, the goal is to find groups of vertices in a graph that are well-connected internally and poorly-connected externally. So in this image here, each blob of vertices of the same color corresponds to a cluster. And you can see that inside a cluster, there are a lot of edges going among the vertices. And between clusters, there are relatively fewer edges. And some applications of clustering include community detection in social networks.
So here, you might be interested in finding groups of people with similar interests or hobbies. You can also use clustering to detect fraudulent websites on the internet. You can use it for clustering documents, so you would cluster documents that have similar text together. And clustering is often used for unsupervised learning in machine learning applications.

Another application is connectomics. Connectomics is the study of the network structure of the brain. And here, the vertices correspond to neurons, and an edge between two vertices means that there's some sort of interaction between the two neurons. And recently, there's been a lot of work on trying to do high-performance connectomics. And some of this work has been going on here at MIT in the research groups of Professor Charles Leiserson and Professor Nir Shavit. So recently, this has been a very hot area.

Graphs are also used in computer vision--for example, in image segmentation. So here, you want to segment your image into the distinct objects that appear in the image. And you can construct a graph by representing the pixels as vertices. And then you would place an edge between every pair of neighboring pixels with a weight that corresponds to their similarity. And then you would run some sort of minimum-cost cut algorithm to partition your graph into the different objects that appear in the image.

There are many other applications, and I'm not going to have time to go through all of them today. But here's just a flavor of some of the applications of graphs. So any questions so far?

OK, so next, let's look at how we can represent a graph in memory. So for the rest of this lecture, I'm going to assume that my vertices are labeled in the range from 0 to n minus 1. So they have an integer in this range. Sometimes your graph might be given to you where the vertices are already labeled in this range, sometimes not.
But you can always get these labels by mapping each of the identifiers to a unique integer in this range. So for the rest of the lecture, I'm just going to assume that we have these labels from 0 to n minus 1 for the vertices.

One way to represent a graph is to use an adjacency matrix. So this is going to be an n-by-n matrix, and there's a 1 bit in the i-th row and j-th column if there's an edge that goes from vertex i to vertex j, and a 0 otherwise.

Another way to represent a graph is the edge list representation, where we just store a list of the edges that appear in the graph. So we have one pair for each edge, where the pair contains the two endpoints of that edge.

So what is the space requirement for each of these two representations in terms of the number of edges m and the number of vertices n in the graph? So it should be pretty easy. Yes.

AUDIENCE: n squared for the [INAUDIBLE] and m for the [INAUDIBLE].

JULIAN SHUN: Yes, so the space for the adjacency matrix is order n squared, because you have n squared cells in this matrix, and you have 1 bit for each of the cells. For the edge list, it's going to be order m, because you have m edges, and for each edge, you're storing a constant amount of data in the edge list.

So here's another way to represent a graph. This is known as the adjacency list format. And the idea here is that we're going to have an array of pointers, one per vertex, and each pointer points to a linked list storing the edges for that vertex. And the linked list is unordered in this example.

So what's the space requirement of this representation?

AUDIENCE: It's n plus m.

JULIAN SHUN: Yeah, so it's going to be order n plus m. And this is because we have n pointers, and the number of entries across all of the linked lists is just equal to the number of edges in the graph, which is m.

What's one potential issue with this sort of representation if you think in terms of cache performance?
Does anyone see a potential performance issue here? Yeah.

AUDIENCE: So it could be [INAUDIBLE].

JULIAN SHUN: Right. So the issue here is that if you're trying to loop over all of the neighbors of a vertex, you're going to have to dereference the pointer in every linked list node, because these are not contiguous in memory. And every time you dereference a linked list node, that's going to be a random access into memory. So that can be bad for cache performance.

One way you can improve cache performance is, instead of using a linked list for each of these neighbor lists, you can use an array. So now you can store the neighbors just in this array, and they'll be contiguous in memory. One drawback of this approach is that it becomes more expensive if you're trying to update the graph, and we'll talk more about that later.

So any questions so far?

So what's another way to represent the graph that we've seen in a previous lecture? What's a more compressed or compact way to represent a graph, especially a sparse graph? So does anybody remember the compressed sparse row format? We looked at this in one of the early lectures. And in that lecture, we used it to store a sparse matrix, but you can also use it to store a sparse graph.

As a reminder, we have two arrays in the compressed sparse row, or CSR, format: the Offsets array and the Edges array. The Offsets array stores an offset for each vertex into the Edges array, telling us where the edges for that particular vertex begin in the Edges array. So Offsets of i stores the offset of where vertex i's edges start in the Edges array. So in this example, vertex 0 has an offset of 0, so its edges start at position 0 in the Edges array. Vertex 1 has an offset of 4, so its edges start at index 4 in the Edges array.

So with this representation, how can we get the degree of a vertex? We're not storing the degree explicitly here. Can we get the degree efficiently? Yes.
AUDIENCE: [INAUDIBLE]

JULIAN SHUN: Yeah, so you can get the degree of a vertex just by looking at the difference between the next offset and its own offset. So for vertex 0, you can see that its degree is 4, because vertex 1's offset is 4 and vertex 0's offset is 0. And similarly, you can do that for all of the other vertices.

So what's the space usage of this representation?

AUDIENCE: [INAUDIBLE]

JULIAN SHUN: Sorry, can you repeat?

AUDIENCE: [INAUDIBLE]

JULIAN SHUN: Yeah, so again, it's going to be order m plus n, because you need order n space for the Offsets array and order m space for the Edges array.

You can also store values or weights on the edges. One way to do this is to create an additional array of size m, and then for edge i, you just store the weight or the value in the i-th index of this additional array that you created. If you're always accessing the weight when you access an edge, then it's actually better for cache locality to interleave the weights with the edge targets. So instead of creating two arrays of size m, you have one array of size 2m, and every other entry is the weight. And this improves cache locality, because every time you access an edge, its weight is going to be right next to it in memory, and it's likely going to be on the same cache line. So that's one way to improve cache locality.

Any questions so far?

So let's look at some of the trade-offs in these different graph representations that we've looked at so far. So here, I'm listing the storage costs for each of these representations, which we already discussed. This is also the cost for just scanning the whole graph in one of these representations.

What's the cost of adding an edge in each of these representations? So for the adjacency matrix, what's the cost of adding an edge?

AUDIENCE: Order 1.

JULIAN SHUN: So for the adjacency matrix, it's just order 1 to add an edge.
Because you have random access into this matrix, you just have to access the (i, j)-th entry and flip the bit from 0 to 1.

What about for the edge list? So assume that the edge list is unordered, so you don't have to keep the list in any sorted order. Yeah.

AUDIENCE: I guess it's O of 1.

JULIAN SHUN: Yeah, so again, it's just O of 1, because you can just add it to the end of the edge list. So that's constant time.

What about for the adjacency list? So actually, this depends on whether we're using linked lists or arrays for the neighbor lists of the vertices. If we're using a linked list, adding an edge just takes constant time, because we can just put it at the beginning of the linked list. If we're using an array, then we actually need to create a new array to make space for this edge that we add. And that's going to cost us order degree-of-v work, because we have to copy all the existing edges over to this new array and then add this new edge to the end of that array. Of course, you could amortize this cost across multiple updates. So if you run out of memory, you can double the size of your array so you don't have to create these new arrays too often. But the cost for any individual addition is still relatively expensive compared to, say, an edge list or an adjacency matrix.

And then finally, for the compressed sparse row format, if you add an edge, in the worst case it's going to cost us order m plus n work, because we're going to have to reconstruct the entire Offsets array and the entire Edges array in the worst case. In the Edges array, you have to put something in and then shift all of the values to the right of that over by one location. And then for the Offsets array, we have to modify the offset for the particular vertex we're adding an edge to, and then the offsets for all of the vertices after that.
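To make the CSR layout concrete, here is a minimal sketch in C of the static representation and the degree and neighbor accesses described above. The struct and helper names are illustrative, not from the lecture's code, and Offsets is given n plus 1 entries with Offsets[n] equal to m, matching the convention used later in the lecture.

```c
#include <stdint.h>

// Minimal static CSR graph (a sketch; names are illustrative).
// Offsets has n+1 entries with Offsets[n] == m; Edges has m entries
// storing the target of each edge, contiguous per vertex.
typedef struct {
  int64_t n, m;
  int64_t *Offsets;
  int32_t *Edges;
} Graph;

// The degree of v is the difference between consecutive offsets.
static inline int64_t degree(const Graph *G, int32_t v) {
  return G->Offsets[v + 1] - G->Offsets[v];
}

// Scanning the neighbors of v is a sequential pass over the Edges array.
static inline void for_each_neighbor(const Graph *G, int32_t v,
                                     void (*f)(int32_t ngh)) {
  for (int64_t i = G->Offsets[v]; i < G->Offsets[v + 1]; i++) {
    f(G->Edges[i]);
  }
}
```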
So the compressed sparse row representation is not particularly friendly to edge updates.

What about deleting an edge from some vertex v? So for the adjacency matrix, again, it's going to be constant time, because you just randomly access the correct entry and flip the bit from 1 to 0. What about for an edge list?

AUDIENCE: [INAUDIBLE]

JULIAN SHUN: Yeah, so for an edge list, in the worst case, it's going to cost us order m work, because the edges are not in any sorted order. So we have to scan through the whole thing in the worst case to find the edge that we're trying to delete. For the adjacency list, it's going to take order degree-of-v work, because the neighbors are not sorted, so we have to scan through the whole list to find the edge that we're trying to delete. And then finally, for compressed sparse row, it's going to be order m plus n, because we're going to have to reconstruct the whole thing in the worst case.

What about finding all of the neighbors of a particular vertex v? What's the cost of doing this in the adjacency matrix?

AUDIENCE: [INAUDIBLE]

JULIAN SHUN: Yes, so it's going to cost us order n work to find all the neighbors of a particular vertex, because we just scan the correct row in this matrix, the row corresponding to vertex v. For the edge list, we're going to have to scan the entire edge list, because it's not sorted. So in the worst case, that's going to be order m. For the adjacency list, that's going to take order degree of v, because we can just find the pointer to the linked list for that vertex in constant time, and then we just traverse over the linked list, and that takes order degree-of-v time. And then finally, for the compressed sparse row format, it's also order degree of v, because we have constant-time access into the appropriate location in the Edges array, and then we can just read off the edges, which are consecutive in memory.

So what about finding if a vertex w is a neighbor of v? So I'll just give you the answer.
So for the adjacency matrix, it's going to take constant time, because we just have to check the v-th row and the w-th column and see if the bit is set there. For the edge list, we have to traverse the entire list to see if the edge is there. And then for the adjacency list and compressed sparse row, it's going to be order degree of v, because we just have to scan the neighbor list for that vertex.

So these are some graph representations. But there are actually many other graph representations, including variants of the ones that I've talked about here. So, for example, for the adjacency list, I said you can either use a linked list or an array to store the neighbor list. But you can actually use a hybrid approach, where you store a linked list, but each linked list node actually stores more than one vertex. So you can store maybe 16 vertices in each linked list node, and that gives us better cache locality.

So for the rest of this lecture, I'm going to talk about algorithms that are best implemented using the compressed sparse row format. And this is because we're going to be dealing with sparse graphs. We're going to be looking at static algorithms, where we don't have to update the graph. If we do have to update the graph, then CSR isn't a good choice, but we're just going to be looking at static algorithms today. And then for all the algorithms that we'll be looking at, we're going to need to scan over all the neighbors of a vertex that we visit. And CSR is very good for that, because all of the neighbors for a particular vertex are stored contiguously in memory.

So any questions so far?

OK, I do want to talk about some properties of real-world graphs. So first, we're seeing graphs that are quite large today. But actually, they're not too large. So here are the sizes of some of the real-world graphs out there. So there is the Twitter network.
That's actually a snapshot of the Twitter network from a couple of years ago. It has 41 million vertices and 1.5 billion edges. And you can store this graph in about 6.3 gigabytes of memory. So you can probably store it in the main memory of your laptop.

The largest publicly available graph out there now is this Common Crawl web graph. It has 3.5 billion vertices and 128 billion edges. So storing this graph requires a little over half a terabyte of memory. That is quite a bit of memory. But it's actually not too big, because there are machines out there with main memory sizes on the order of terabytes nowadays. So, for example, you can rent a 2-terabyte or 4-terabyte memory instance on AWS, which you're using for your homework assignments. So if you have any leftover credits at the end of the semester and you want to play around with this graph, you can rent one of these terabyte machines. Just remember to turn it off when you're done, because it's kind of expensive.

Another property of real-world graphs is that they're quite sparse. So m tends to be much less than n squared. So most of the possible edges are not actually there.

And finally, the degree distributions of the vertices can be highly skewed in many real-world graphs. So here, I'm plotting the degree on the x-axis and the number of vertices with that particular degree on the y-axis. And we can see that it's highly skewed. For example, in a social network, most of the people would be on the left-hand side, so their degree is not that high. And then we have some very popular people on the right-hand side, where their degree is very high, but we don't have too many of those.

So this is what's known as a power law degree distribution. And there have been various studies that have shown that many real-world graphs have approximately a power law degree distribution. And mathematically, this means that the number of vertices with degree d is proportional to d to the negative p.
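In symbols, the statement just made is the following, where V is the vertex set:

$$\#\{\, v \in V : \deg(v) = d \,\} \;\propto\; d^{-p}$$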
So negative p is the exponent, and for many graphs, the value of p lies between 2 and 3. And this power law degree distribution does have implications when we're trying to implement parallel algorithms to process these graphs. Because with graphs that have a skewed degree distribution, you could run into load imbalance issues. If you just parallelize across the vertices, the number of edges the vertices have can vary significantly.

Any questions?

OK, so now let's talk about how we can implement a graph algorithm. And I'm going to talk about the breadth-first search algorithm. So how many of you have seen breadth-first search before? OK, so about half of you. I did talk about breadth-first search in a previous lecture, so I was hoping everybody would raise their hands.

OK, so as a reminder, in the BFS algorithm, we're given a source vertex s, and we want to visit the vertices in order of their distance from the source s. And there are many possible outputs that we might care about. One possible output is, we just want to report the vertices in the order that they were visited by the breadth-first search traversal.

So let's say we have this graph here, and our source vertex is D. So what's one possible order in which we can traverse these vertices? Now, I should specify that we should traverse this graph in a breadth-first search manner. So what's the first vertex we're going to explore?

AUDIENCE: D.

JULIAN SHUN: D. So we're first going to look at D, because that's our source vertex. For the second vertex, we can actually choose between B, C, and E, because all we care about is that we're visiting these vertices in the order of their distance from the source, and these three vertices are all at the same distance. So let's just pick B, C, and then E. And then finally, I'm going to visit vertex A, which has a distance of 2 from the source. So this is one possible solution.
There are other possible solutions, because we could have visited E before we visited B, and so on.

Another possible output that we might care about is the distance from each vertex to the source vertex s. So in this example, here are the distances: D has a distance of 0; B, C, and E all have a distance of 1; and A has a distance of 2.

We might also want to generate a breadth-first search tree, where each vertex in the tree has a parent which is a neighbor in the previous level of the breadth-first search. Or in other words, the parent should have a distance of 1 less than that vertex itself. So here's an example of a breadth-first search tree. And we can see that each of the vertices has a parent whose breadth-first search distance is 1 less than its own.

So the algorithms that I'm going to be talking about today will generate the distances as well as the BFS tree.

BFS actually has many applications. So it's used as a subroutine in betweenness centrality, which is a very popular graph mining algorithm used to rank the importance of nodes in a network. And the importance of a node here corresponds to how many shortest paths go through that node. Other applications include eccentricity estimation and maximum flows; some max flow algorithms use BFS as a subroutine. You can use BFS to crawl the web, do cycle detection, garbage collection, and so on.

So let's now look at a serial BFS algorithm. And here, I'm just going to show the pseudocode. So first, we're going to initialize the distances to all INFINITY, and we're going to initialize the parents to be NIL. Then we're going to create a queue data structure. We're going to set the distance of the root to be 0, because the root has a distance of 0 to itself, and then we're going to place the root onto the queue.

And then, while the queue is not empty, we're going to dequeue the first thing in the queue. We're going to look at all the neighbors of the current vertex that we dequeued.
And for each neighbor, we're going to check if its distance is INFINITY. If the distance is INFINITY, that means we haven't explored that neighbor yet, so we're going to go ahead and explore it. And we do so by setting its distance value to be the current vertex's distance plus 1. We're going to set the parent of that neighbor to be the current vertex, and then we'll place the neighbor onto the queue.

So it's a pretty simple algorithm, and we're just going to keep iterating in this while loop until there are no more vertices left in the queue.

So what's the work of this algorithm in terms of n and m? So how much work are we doing per edge? Yes.

AUDIENCE: [INAUDIBLE]

JULIAN SHUN: Yeah, so assuming that the enqueue and dequeue operations are constant time, we're doing a constant amount of work per edge. So summed across all edges, that's going to be order m. And then we're also doing a constant amount of work per vertex, because we have to place it onto the queue and then take it off the queue, and then also initialize its value. So the overall work is going to be order m plus n.

OK, so let's now look at some actual code to implement the serial BFS algorithm using the compressed sparse row format.

So first, I'm going to initialize two arrays, parent and queue, and these are going to be integer arrays of size n. I'm going to initialize all of the parent entries to be negative 1. I'm going to place the source vertex onto the queue, so it's going to appear at queue of 0, the beginning of the queue. And then I'll set the parent of the source vertex to be the source itself. And then I also have two integers that point to the front and the back of the queue. So initially, the front of the queue is at position 0, and the back is at position 1. And then while the queue is not empty--and I can check that by checking if q_front is not equal to q_back--I'm going to dequeue the first vertex in my queue.
I'm going to set current to be that vertex, and then I'll increment q_front. Then I'll compute the degree of that vertex, which I can do by looking at the difference between consecutive offsets. And I also assume that Offsets of n is equal to m, just to deal with the last vertex.

And then I'm going to loop through all of the neighbors of the current vertex. And to access each neighbor, what I do is I go into the Edges array. I know that my neighbors start at Offsets of current, and therefore, to get the i-th neighbor, I just do Offsets of current plus i. That's my index into the Edges array.

Now I'm going to check if my neighbor has been explored yet. And I can check that by checking if parent of neighbor is equal to negative 1. If it is, that means I haven't explored it yet. So I'll set parent of neighbor to be current, and then I'll place the neighbor onto the back of the queue and increment q_back. And I'm just going to keep repeating this while loop until the queue becomes empty.

And here, I'm only generating the parent pointers, but I could also generate the distances if I wanted to, with just a slight modification of this code.

So any questions on how this code works?

OK, so here's a question. What's the most expensive part of this code? Can you point to one particular line here that is the most expensive? Yes.

AUDIENCE: I'm going to guess the [INAUDIBLE] that's going to be all over the place in terms of memory locations--ngh equals Edges.

JULIAN SHUN: OK, so actually, it turns out that that's not the most expensive part of this code. But you're close. So does anyone have any other ideas? Yes.

AUDIENCE: Is it looking up the parent array?

JULIAN SHUN: Yes, so it turns out that this line here, where we're accessing parent of neighbor, turns out to be the most expensive. Because whenever we access this parent array, the neighbor can appear anywhere in memory.
So that's going to be a random access. And if the parent array doesn't fit in our cache, then that's going to cost us a cache miss almost every time.

This Edges array is actually mostly accessed sequentially, because for each vertex, all of its edges are stored contiguously in memory. We do have one random access into the Edges array per vertex, because we have to look up the starting location for that vertex. But it's not one per edge, unlike this check of the parent array, which occurs for every edge. So does that make sense?

So let's do a back-of-the-envelope calculation to figure out how many cache misses we would incur, assuming that we start with a cold cache. We also assume that n is much larger than the size of the cache, so we can't fit any of these arrays in cache. We'll assume that a cache line has 64 bytes, and integers are 4 bytes each.

So let's try to analyze this. The initialization will cost us n/16 cache misses. And the reason is that we're initializing this array sequentially, so we're accessing contiguous locations, and this can take advantage of spatial locality. On each cache line, we can fit 16 of the integers. So overall, we're going to need n/16 cache misses just to initialize this array.

We also need n/16 cache misses across the entire algorithm to dequeue vertices from the front of the queue. Because again, these are sequential accesses into the queue array. And across all vertices, that's going to be n/16 cache misses, because we can fit 16 integers on a cache line.

To compute the degree, that's going to take n cache misses overall. Because each of these accesses into the Offsets array is going to be a random access, since we have no idea what the value of current here is. It could be anything. So across the entire algorithm, we're going to need n cache misses to access the Offsets array.

And then to access the Edges array, I claim that we're going to need at most 2n plus m/16 cache misses.
So does anyone see where that bound comes from? So where does the m/16 come from? Yeah.

AUDIENCE: You have to access that at least once for an edge.

JULIAN SHUN: Right, so you have to pay m/16 because you're accessing every edge once, and you're accessing the edges contiguously. So across all edges, that's going to take m/16 cache misses. But we also have to add 2n. Because whenever we access the edges for a particular vertex, the first cache line might not only contain that vertex's edges, and similarly, the last cache line that we access might also not just contain that vertex's edges. So we're going to waste the first cache line and the last cache line in the worst case for each vertex. And summed across all vertices, that's going to be 2n. So this is the upper bound: 2n plus m/16.

Accessing the parent array is going to be a random access every time, so we're going to incur a cache miss in the worst case every time. Summed across all edge accesses, that's going to be m cache misses. And then finally, we're going to pay n/16 cache misses to enqueue the neighbors onto the queue, because these are sequential accesses.

So in total, we're going to incur at most 51/16 n plus 17/16 m cache misses. And if m is greater than 3n, then the second term is going to dominate. And m is usually greater than 3n in most real-world graphs. And the second term here is dominated by the random accesses into the parent array.

So let's see if we can optimize this code so that we get better cache performance. Let's say we could fit a bit vector of size n into cache, but we couldn't fit the entire parent array into cache. What can we do to reduce the number of cache misses? Does anyone have any ideas? Yeah.

AUDIENCE: Is the bit vector to keep track of which vertices have parents, then [INAUDIBLE]?

JULIAN SHUN: Yeah, so that's exactly correct.
783 00:39:16,760 --> 00:39:19,820 So we're going to use a bit vector 784 00:39:19,820 --> 00:39:23,120 to store whether the vertex has been explored yet or not. 785 00:39:23,120 --> 00:39:24,440 So we only need 1 bit for that. 786 00:39:24,440 --> 00:39:26,960 We're not storing the parent ID in this bit vector. 787 00:39:26,960 --> 00:39:29,480 We're just storing a bit to say whether that vertex has 788 00:39:29,480 --> 00:39:31,640 been explored yet or not. 789 00:39:31,640 --> 00:39:34,552 And then, before we check this parent array, 790 00:39:34,552 --> 00:39:36,260 we're going to first check the bit vector 791 00:39:36,260 --> 00:39:39,200 to see if that vertex has been explored yet. 792 00:39:39,200 --> 00:39:41,150 And if it has been explored, we 793 00:39:41,150 --> 00:39:44,270 don't even need to access this parent array. 794 00:39:44,270 --> 00:39:45,950 If it hasn't been explored, then we 795 00:39:45,950 --> 00:39:50,270 will go ahead and access the parent entry of the neighbor. 796 00:39:50,270 --> 00:39:51,920 But we only have to do this one time 797 00:39:51,920 --> 00:39:55,880 for each vertex in the graph because we can only 798 00:39:55,880 --> 00:39:57,360 visit each vertex once. 799 00:39:57,360 --> 00:39:59,360 And therefore, we can reduce the number of cache 800 00:39:59,360 --> 00:40:01,065 misses from m down to n. 801 00:40:03,890 --> 00:40:07,130 So overall, this might improve the number of cache misses. 802 00:40:07,130 --> 00:40:11,720 In fact, it does if the number of edges 803 00:40:11,720 --> 00:40:15,690 is large enough relative to the number of vertices. 804 00:40:15,690 --> 00:40:18,140 However, you do have to do a little bit more computation 805 00:40:18,140 --> 00:40:22,040 because you have to do bit vector manipulation to check 806 00:40:22,040 --> 00:40:25,250 this bit vector and then also to set the bit vector when 807 00:40:25,250 --> 00:40:27,680 you explore a neighbor. 808 00:40:27,680 --> 00:40:33,320 So here's the code using the bit vector optimization. 809 00:40:33,320 --> 00:40:36,580 So here, I'm initializing this bit vector called visited. 810 00:40:36,580 --> 00:40:38,140 It's of size approximately n/32 words. 811 00:40:40,640 --> 00:40:42,140 And then I'm setting all of the bits 812 00:40:42,140 --> 00:40:45,560 to 0, except for the source vertex, where 813 00:40:45,560 --> 00:40:47,120 I'm going to set its bit to 1. 814 00:40:47,120 --> 00:40:50,570 And I'm doing this bit calculation here 815 00:40:50,570 --> 00:40:54,230 to figure out the bit for the source vertex. 816 00:40:54,230 --> 00:40:57,680 And then now, when I'm trying to visit a neighbor, 817 00:40:57,680 --> 00:41:00,230 I'm first going to check if the neighbor is visited 818 00:41:00,230 --> 00:41:02,540 by checking this bit array. 819 00:41:02,540 --> 00:41:05,900 And I can do this using this computation here-- 820 00:41:05,900 --> 00:41:09,590 ANDing visited of neighbor over 32 with this mask-- 821 00:41:09,590 --> 00:41:14,300 1 left-shifted by neighbor mod 32. 822 00:41:14,300 --> 00:41:16,790 And if that's false, that means the neighbor 823 00:41:16,790 --> 00:41:17,960 hasn't been visited yet. 824 00:41:17,960 --> 00:41:20,420 So I'll go inside this IF clause. 825 00:41:20,420 --> 00:41:22,250 And then I'll set the visited bit 826 00:41:22,250 --> 00:41:24,380 to be true using this statement here. 827 00:41:24,380 --> 00:41:27,260 And then I do the same operations as I did before.
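For concreteness, here is a minimal C sketch of the serial BFS with this bit-vector optimization. It is a reconstruction under assumptions rather than the lecture's exact code: it assumes a CSR graph where Offsets has length n+1 and Edges has length m, and the function and variable names are illustrative.

    #include <stdint.h>
    #include <stdlib.h>

    /* Serial BFS with a bit-vector "visited" array (a sketch;
       assumed CSR layout: Offsets of length n+1, Edges of length m). */
    void bfs_bitvector(int n, const int *Offsets, const int *Edges,
                       int source, int *parent) {
      uint32_t *visited = calloc((n + 31) / 32, sizeof(uint32_t));
      int *queue = malloc(n * sizeof(int));
      for (int i = 0; i < n; i++) parent[i] = -1;  /* sequential writes */
      visited[source / 32] |= 1u << (source % 32); /* mark source explored */
      parent[source] = source;
      int front = 0, back = 0;
      queue[back++] = source;
      while (front < back) {
        int current = queue[front++];              /* sequential dequeue */
        for (int j = Offsets[current]; j < Offsets[current + 1]; j++) {
          int ngh = Edges[j];
          /* Check the cache-resident bit vector before touching the
             much larger parent array; the parent array is then only
             written once per vertex instead of read once per edge. */
          if (!(visited[ngh / 32] & (1u << (ngh % 32)))) {
            visited[ngh / 32] |= 1u << (ngh % 32);
            parent[ngh] = current;
            queue[back++] = ngh;                   /* sequential enqueue */
          }
        }
      }
      free(queue);
      free(visited);
    }

The shifts and masks in the if-test are exactly the extra bit manipulation just described.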
828 00:41:31,220 --> 00:41:32,850 It turns out that this version is 829 00:41:32,850 --> 00:41:35,370 faster for large enough values of m 830 00:41:35,370 --> 00:41:38,370 relative to n because you reduce the number of cache 831 00:41:38,370 --> 00:41:40,350 misses overall. 832 00:41:40,350 --> 00:41:44,240 You still have to do this extra computation here, 833 00:41:44,240 --> 00:41:45,690 this bit manipulation. 834 00:41:45,690 --> 00:41:49,050 But if m is large enough, then the reduction 835 00:41:49,050 --> 00:41:51,180 in number of cache misses outweighs 836 00:41:51,180 --> 00:41:55,050 the additional computation that you have to do. 837 00:41:55,050 --> 00:41:55,870 Any questions? 838 00:42:04,190 --> 00:42:06,400 OK, so that was a serial implementation 839 00:42:06,400 --> 00:42:07,600 of breadth-first search. 840 00:42:07,600 --> 00:42:10,950 Now let's look at a parallel implementation. 841 00:42:10,950 --> 00:42:12,670 So I'm first going to do an animation 842 00:42:12,670 --> 00:42:17,645 of how a parallel breadth-first search algorithm would work. 843 00:42:17,645 --> 00:42:19,270 The parallel breadth-first search algorithm 844 00:42:19,270 --> 00:42:22,050 is going to operate on frontiers, 845 00:42:22,050 --> 00:42:25,540 where the initial frontier contains just the source vertex. 846 00:42:25,540 --> 00:42:27,010 And on every iteration, I'm going 847 00:42:27,010 --> 00:42:29,920 to explore all of the vertices on the frontier 848 00:42:29,920 --> 00:42:31,840 and then place any unexplored neighbors 849 00:42:31,840 --> 00:42:32,980 onto the next frontier. 850 00:42:32,980 --> 00:42:35,400 And then I move on to the next frontier. 851 00:42:35,400 --> 00:42:36,900 So in the first iteration, I'm going 852 00:42:36,900 --> 00:42:38,920 to mark the source vertex as explored, 853 00:42:38,920 --> 00:42:41,050 set its distance to be 0, and then place 854 00:42:41,050 --> 00:42:45,100 the neighbors of that source vertex onto the next frontier. 855 00:42:45,100 --> 00:42:48,190 In the next iteration, I'm going to do the same thing, set 856 00:42:48,190 --> 00:42:49,520 these distances to 1. 857 00:42:49,520 --> 00:42:51,820 I'm also going to generate a parent pointer 858 00:42:51,820 --> 00:42:53,590 for each of these vertices. 859 00:42:53,590 --> 00:42:56,330 And this parent should come from the previous frontier, 860 00:42:56,330 --> 00:42:58,833 and it should be a neighbor of the vertex. 861 00:42:58,833 --> 00:43:00,250 And here, there's only one option, 862 00:43:00,250 --> 00:43:02,480 which is the source vertex. 863 00:43:02,480 --> 00:43:04,692 So I'll just pick that as the parent. 864 00:43:04,692 --> 00:43:06,400 And then I'm going to place the neighbors 865 00:43:06,400 --> 00:43:10,000 onto the next frontier again, mark those as explored, 866 00:43:10,000 --> 00:43:12,520 set their distances, and generate a parent 867 00:43:12,520 --> 00:43:14,678 pointer again. 868 00:43:14,678 --> 00:43:16,720 And notice here, when I'm generating these parent 869 00:43:16,720 --> 00:43:18,580 pointers, there's actually more than one choice 870 00:43:18,580 --> 00:43:19,750 for some of these vertices. 871 00:43:19,750 --> 00:43:21,708 And this is because there are multiple vertices 872 00:43:21,708 --> 00:43:23,630 on the previous frontier. 873 00:43:23,630 --> 00:43:26,170 And some of them explored the same neighbor 874 00:43:26,170 --> 00:43:28,160 on the current frontier.
875 00:43:28,160 --> 00:43:30,310 So a parallel implementation has to be 876 00:43:30,310 --> 00:43:32,440 aware of this potential race. 877 00:43:32,440 --> 00:43:36,890 Here, I'm just picking an arbitrary parent. 878 00:43:36,890 --> 00:43:38,950 So as we see here, you can process each 879 00:43:38,950 --> 00:43:40,345 of these frontiers in parallel. 880 00:43:40,345 --> 00:43:42,970 So you can parallelize over all of the vertices on the frontier 881 00:43:42,970 --> 00:43:45,070 as well as all of their outgoing edges. 882 00:43:45,070 --> 00:43:47,530 However, you do need to process one frontier before you 883 00:43:47,530 --> 00:43:52,530 move on to the next one in this BFS algorithm. 884 00:43:52,530 --> 00:43:54,130 And a parallel implementation has 885 00:43:54,130 --> 00:43:56,920 to be aware of potential races. 886 00:43:56,920 --> 00:43:59,895 So as I said earlier, we could have multiple vertices 887 00:43:59,895 --> 00:44:02,020 on the frontier trying to visit the same neighbors. 888 00:44:02,020 --> 00:44:04,660 So somehow, that has to be resolved. 889 00:44:04,660 --> 00:44:07,060 And also, the amount of work on each frontier 890 00:44:07,060 --> 00:44:11,380 is changing throughout the course of the algorithm. 891 00:44:11,380 --> 00:44:13,420 So you have to be careful with load balancing. 892 00:44:13,420 --> 00:44:15,790 Because you have to make sure that the amount of work 893 00:44:15,790 --> 00:44:20,133 each processor has to do is about the same. 894 00:44:20,133 --> 00:44:21,550 If you use Cilk to implement this, 895 00:44:21,550 --> 00:44:24,190 then load balancing doesn't really become a problem. 896 00:44:28,140 --> 00:44:30,440 So any questions on the BFS algorithm 897 00:44:30,440 --> 00:44:31,940 before I go over the code? 898 00:44:36,010 --> 00:44:39,210 OK, so here's the actual code. 899 00:44:39,210 --> 00:44:43,160 And here I'm going to initialize these four arrays, so 900 00:44:43,160 --> 00:44:45,990 the parent array, which is the same as before. 901 00:44:45,990 --> 00:44:48,230 I'm going to have an array called frontier, which 902 00:44:48,230 --> 00:44:50,462 stores the current frontier. 903 00:44:50,462 --> 00:44:51,920 And then I'm going to have an array 904 00:44:51,920 --> 00:44:54,320 called frontierNext, which is a temporary array 905 00:44:54,320 --> 00:44:57,650 that I use to store the next frontier of the BFS. 906 00:44:57,650 --> 00:44:59,525 And then also I have an array called degrees. 907 00:45:02,128 --> 00:45:04,170 I'm going to initialize all of the parent entries 908 00:45:04,170 --> 00:45:05,045 to be negative 1. 909 00:45:05,045 --> 00:45:08,370 I do that using a cilk_for loop. 910 00:45:08,370 --> 00:45:12,870 I'm going to place the source vertex at the 0-th index 911 00:45:12,870 --> 00:45:14,210 of the frontier. 912 00:45:14,210 --> 00:45:15,960 I'll set the frontierSize to be 1. 913 00:45:15,960 --> 00:45:19,530 And then I set the parent of the source to be the source itself. 914 00:45:19,530 --> 00:45:21,490 While the frontierSize is greater than 0, 915 00:45:21,490 --> 00:45:24,310 that means I still have more work to do. 916 00:45:24,310 --> 00:45:26,530 I'm going to first iterate over all 917 00:45:26,530 --> 00:45:29,700 of the vertices on my frontier in parallel using a cilk_for 918 00:45:29,700 --> 00:45:30,420 loop. 919 00:45:30,420 --> 00:45:34,680 And then I'll set the i-th entry of the degrees array 920 00:45:34,680 --> 00:45:38,220 to be the degree of the i-th vertex on the frontier. 
921 00:45:38,220 --> 00:45:40,620 And I can do this just using the difference 922 00:45:40,620 --> 00:45:43,950 between consecutive offsets. 923 00:45:43,950 --> 00:45:46,950 And then I'm going to perform a prefix sum on this degrees 924 00:45:46,950 --> 00:45:47,850 array. 925 00:45:47,850 --> 00:45:51,420 And we'll see in a minute why I'm doing this prefix sum. 926 00:45:51,420 --> 00:45:54,510 But first of all, does anybody recall what prefix sum is? 927 00:46:03,140 --> 00:46:04,730 So who knows what prefix sum is? 928 00:46:08,750 --> 00:46:10,230 Do you want to tell us what it is? 929 00:46:10,230 --> 00:46:13,610 AUDIENCE: That's the sum array where index i is the sum of 930 00:46:13,610 --> 00:46:15,910 [INAUDIBLE]. 931 00:46:15,910 --> 00:46:17,680 JULIAN SHUN: Yeah, so prefix sum-- 932 00:46:20,680 --> 00:46:24,530 so here I'm going to demonstrate this with an example. 933 00:46:24,530 --> 00:46:27,430 So let's say this is our input array. 934 00:46:27,430 --> 00:46:31,360 The output array would store for each location 935 00:46:31,360 --> 00:46:34,730 the sum of everything before that location in the input 936 00:46:34,730 --> 00:46:35,230 array. 937 00:46:35,230 --> 00:46:38,860 So here we see that the first position has a value of 0 938 00:46:38,860 --> 00:46:40,700 because the sum of everything before it is 0. 939 00:46:40,700 --> 00:46:43,210 There's nothing before it in the input. 940 00:46:43,210 --> 00:46:45,250 The second position has a value of 2 941 00:46:45,250 --> 00:46:48,280 because the sum of everything before it is just 942 00:46:48,280 --> 00:46:49,840 the first location. 943 00:46:49,840 --> 00:46:51,850 The third location has a value of 6 944 00:46:51,850 --> 00:46:54,430 because the sum of everything before it is 2 945 00:46:54,430 --> 00:46:57,590 plus 4, which is 6, and so on. 946 00:46:57,590 --> 00:47:00,730 So I believe this was on one of your homework assignments. 947 00:47:00,730 --> 00:47:04,270 So hopefully, everyone knows what prefix sum is. 948 00:47:04,270 --> 00:47:06,550 And later on, we'll see how we use 949 00:47:06,550 --> 00:47:10,110 this to do the parallel breadth-first search. 950 00:47:10,110 --> 00:47:14,080 OK, so I'm going to do a prefix sum on this degrees array. 951 00:47:14,080 --> 00:47:19,550 And then I'm going to loop over my frontier again in parallel. 952 00:47:19,550 --> 00:47:22,810 I'm going to let v be the i-th vertex on the frontier. 953 00:47:22,810 --> 00:47:25,630 Index is going to be equal to degrees of i. 954 00:47:25,630 --> 00:47:29,200 And then my degree is going to be Offsets of v 955 00:47:29,200 --> 00:47:33,270 plus 1 minus Offsets of v. 956 00:47:33,270 --> 00:47:36,910 Now I'm going to loop through all v's neighbors. 957 00:47:36,910 --> 00:47:38,650 And here I just have a serial for loop. 958 00:47:38,650 --> 00:47:40,870 But you could actually parallelize this for loop. 959 00:47:40,870 --> 00:47:44,470 It turns out that if the number of iterations in the for loop 960 00:47:44,470 --> 00:47:46,720 is small enough, there's additional overhead 961 00:47:46,720 --> 00:47:49,630 to making this parallel, so I just made it serial for now. 962 00:47:49,630 --> 00:47:52,450 But you could make it parallel. 963 00:47:52,450 --> 00:47:55,860 To get the neighbor, I just index into this Edges array. 964 00:47:55,860 --> 00:47:57,850 I look at Offsets of v plus j. 965 00:48:00,812 --> 00:48:02,770 And now I'm going to check if the neighbor has 966 00:48:02,770 --> 00:48:04,030 been explored yet.
967 00:48:04,030 --> 00:48:05,630 And I can check if parent of neighbor 968 00:48:05,630 --> 00:48:07,928 is equal to negative 1. 969 00:48:07,928 --> 00:48:09,970 So that means it hasn't been explored yet, so I'm 970 00:48:09,970 --> 00:48:11,320 going to try to explore it. 971 00:48:11,320 --> 00:48:13,780 And I do so using a compare-and-swap. 972 00:48:13,780 --> 00:48:16,240 I'm going to try to swap in the value of v 973 00:48:16,240 --> 00:48:18,790 with the original value of negative 1 974 00:48:18,790 --> 00:48:21,180 in parent of neighbor. 975 00:48:21,180 --> 00:48:22,570 And the compare-and-swap is going 976 00:48:22,570 --> 00:48:26,540 to return true if it was successful and false otherwise. 977 00:48:26,540 --> 00:48:28,360 And if it returns true, that means 978 00:48:28,360 --> 00:48:31,522 this vertex becomes the parent of this neighbor. 979 00:48:31,522 --> 00:48:32,980 And then I'll place the neighbor 980 00:48:32,980 --> 00:48:35,695 onto frontierNext at this particular index-- 981 00:48:35,695 --> 00:48:36,980 index plus j. 982 00:48:36,980 --> 00:48:41,870 And otherwise, I'll set a negative 1 at that location. 983 00:48:41,870 --> 00:48:45,710 OK, so let's see why I'm using index plus j here. 984 00:48:45,710 --> 00:48:48,850 So here's how frontierNext is organized. 985 00:48:48,850 --> 00:48:51,190 So each vertex on the frontier owns 986 00:48:51,190 --> 00:48:55,060 a subset of these locations in the frontierNext array. 987 00:48:55,060 --> 00:48:58,320 And these are all contiguous memory locations. 988 00:48:58,320 --> 00:49:00,220 And it turns out that the starting location 989 00:49:00,220 --> 00:49:03,160 for each of these vertices in this frontierNext array 990 00:49:03,160 --> 00:49:07,150 is exactly the value in this prefix sum array up here. 991 00:49:07,150 --> 00:49:10,780 So vertex 1 has its first location at index 0. 992 00:49:10,780 --> 00:49:13,750 Vertex 2 has its first location at index 2. 993 00:49:13,750 --> 00:49:18,260 Vertex 3 has its first location at index 6, and so on. 994 00:49:18,260 --> 00:49:21,070 So by using a prefix sum, I can guarantee 995 00:49:21,070 --> 00:49:24,910 that all of these vertices have a disjoint subarray 996 00:49:24,910 --> 00:49:26,200 in this frontierNext array. 997 00:49:26,200 --> 00:49:29,140 And then they can all write to this frontierNext array 998 00:49:29,140 --> 00:49:32,560 in parallel without any races. 999 00:49:32,560 --> 00:49:35,880 And index plus j just gives us the right location 1000 00:49:35,880 --> 00:49:37,490 to write to in this array. 1001 00:49:37,490 --> 00:49:40,300 So index is the starting location, 1002 00:49:40,300 --> 00:49:42,290 and then j is for the j-th neighbor. 1003 00:49:45,160 --> 00:49:48,310 So here is one potential output after we write 1004 00:49:48,310 --> 00:49:50,260 to this frontierNext array. 1005 00:49:50,260 --> 00:49:52,970 So we have some non-negative values. 1006 00:49:52,970 --> 00:49:55,840 And these are vertices that we explored in this iteration. 1007 00:49:55,840 --> 00:49:58,180 We also have some negative 1 values. 1008 00:49:58,180 --> 00:50:01,300 And the negative 1 here means that either the vertex has 1009 00:50:01,300 --> 00:50:03,760 already been explored in a previous iteration, 1010 00:50:03,760 --> 00:50:06,190 or we tried to explore it in the current iteration, 1011 00:50:06,190 --> 00:50:08,020 but somebody else got there before us.
1012 00:50:08,020 --> 00:50:10,623 Because somebody else is doing the compare-and-swap 1013 00:50:10,623 --> 00:50:13,165 at the same time, and they could have finished before we did, 1014 00:50:13,165 --> 00:50:15,960 so we failed on the compare-and-swap. 1015 00:50:15,960 --> 00:50:18,820 So we don't actually want these negative 1 values, so we're 1016 00:50:18,820 --> 00:50:20,530 going to filter them out. 1017 00:50:20,530 --> 00:50:24,160 And we can filter them out using a prefix sum again. 1018 00:50:24,160 --> 00:50:27,070 And this is going to give us a new frontier. 1019 00:50:27,070 --> 00:50:29,680 And we'll set the frontierSize equal to the size 1020 00:50:29,680 --> 00:50:30,940 of this new frontier. 1021 00:50:30,940 --> 00:50:32,530 And then we repeat this while loop 1022 00:50:32,530 --> 00:50:34,960 until there are no more vertices on the frontier. 1023 00:50:38,060 --> 00:50:41,590 So any questions on this parallel BFS algorithm? 1024 00:50:50,420 --> 00:50:51,470 Yeah. 1025 00:50:51,470 --> 00:50:54,410 AUDIENCE: Can you go over like the last [INAUDIBLE]? 1026 00:50:57,243 --> 00:50:58,910 JULIAN SHUN: Do you mean the filter out? 1027 00:50:58,910 --> 00:50:59,780 AUDIENCE: Yeah. 1028 00:50:59,780 --> 00:51:02,240 JULIAN SHUN: Yeah, so what you can do 1029 00:51:02,240 --> 00:51:06,500 is, you can create another array, which stores a 1 1030 00:51:06,500 --> 00:51:11,480 in location i if that location is not a negative 1 and 0 1031 00:51:11,480 --> 00:51:12,710 if it is a negative 1. 1032 00:51:12,710 --> 00:51:14,430 Then you do a prefix sum on that array, 1033 00:51:14,430 --> 00:51:17,070 which gives us unique offsets into an output array. 1034 00:51:17,070 --> 00:51:21,170 So then everybody just looks at the prefix sum array there. 1035 00:51:21,170 --> 00:51:23,320 And then it writes to the output array. 1036 00:51:23,320 --> 00:51:26,540 So it might be easier if I tried to draw this on the board. 1037 00:51:40,950 --> 00:51:44,730 OK, so let's say we have an array of size 5 here. 1038 00:51:44,730 --> 00:51:46,230 So what I'm going to do is I'm going 1039 00:51:46,230 --> 00:51:49,440 to generate another array which stores 1040 00:51:49,440 --> 00:51:54,300 a 1 if the value in the corresponding location 1041 00:51:54,300 --> 00:51:56,778 is not a negative 1 and 0 otherwise. 1042 00:52:00,770 --> 00:52:04,170 And then I do a prefix sum on this array here. 1043 00:52:04,170 --> 00:52:15,300 And this gives me 0, 1, 1, 2, and 2. 1044 00:52:15,300 --> 00:52:20,100 And now each of these values that are not negative 1, 1045 00:52:20,100 --> 00:52:22,710 they can just look up the corresponding index 1046 00:52:22,710 --> 00:52:24,060 in this output array. 1047 00:52:24,060 --> 00:52:28,620 And this gives us a unique index into an output array. 1048 00:52:28,620 --> 00:52:31,530 So this element will write to position 0, 1049 00:52:31,530 --> 00:52:33,210 this element would write to position 1, 1050 00:52:33,210 --> 00:52:38,320 and this element would write to position 2 in my final output. 1051 00:52:38,320 --> 00:52:39,960 So this would be my final frontier. 1052 00:52:51,155 --> 00:52:52,030 Does that make sense? 1053 00:52:58,440 --> 00:53:02,570 OK, so let's now analyze the work and span 1054 00:53:02,570 --> 00:53:06,410 of this parallel BFS algorithm. 1055 00:53:06,410 --> 00:53:09,830 So the number of iterations required by the BFS algorithm 1056 00:53:09,830 --> 00:53:12,880 is upper-bounded by the diameter D of the graph.
1057 00:53:12,880 --> 00:53:17,120 And the diameter of a graph is just the maximum shortest 1058 00:53:17,120 --> 00:53:19,682 path between any pair of vertices in the graph. 1059 00:53:19,682 --> 00:53:21,890 And that's an upper bound on the number of iterations 1060 00:53:21,890 --> 00:53:23,720 we need to do. 1061 00:53:23,720 --> 00:53:26,330 Each iteration is going to take a log m 1062 00:53:26,330 --> 00:53:30,275 span for the cilk_for loops, the prefix sum, and the filter. 1063 00:53:30,275 --> 00:53:32,150 And this is also assuming that the inner loop 1064 00:53:32,150 --> 00:53:36,690 is parallelized, the inner loop over the neighbors of a vertex. 1065 00:53:36,690 --> 00:53:39,420 So to get the span, we just multiply these two terms. 1066 00:53:39,420 --> 00:53:43,820 So we get theta of D times log m span. 1067 00:53:43,820 --> 00:53:45,870 What about the work? 1068 00:53:45,870 --> 00:53:47,750 So to compute the work, we have to figure out 1069 00:53:47,750 --> 00:53:50,960 how much work we're doing per vertex and per edge. 1070 00:53:50,960 --> 00:53:53,300 So first, notice that the sum of the frontier 1071 00:53:53,300 --> 00:53:55,370 sizes across the entire algorithm is going 1072 00:53:55,370 --> 00:53:58,365 to be n because each vertex can be on the frontier at most 1073 00:53:58,365 --> 00:53:58,865 once. 1074 00:54:01,490 --> 00:54:04,340 Also, each edge is going to be traversed exactly once. 1075 00:54:04,340 --> 00:54:06,560 So that leads to m total edge visits. 1076 00:54:09,662 --> 00:54:11,120 On each iteration of the algorithm, 1077 00:54:11,120 --> 00:54:12,980 we're doing a prefix sum. 1078 00:54:12,980 --> 00:54:15,020 And the cost of this prefix sum is 1079 00:54:15,020 --> 00:54:17,600 going to be proportional to the frontier size. 1080 00:54:17,600 --> 00:54:20,660 So summed across all iterations, the cost of the prefix 1081 00:54:20,660 --> 00:54:23,540 sum is going to be theta of n. 1082 00:54:23,540 --> 00:54:25,340 We also have to do this filter. 1083 00:54:25,340 --> 00:54:27,860 But the work of the filter is proportional to the number 1084 00:54:27,860 --> 00:54:30,400 of edges traversed in that iteration. 1085 00:54:30,400 --> 00:54:32,120 And summed across all iterations, that's 1086 00:54:32,120 --> 00:54:34,610 going to give theta of m total. 1087 00:54:34,610 --> 00:54:36,200 So overall, the work is going to be 1088 00:54:36,200 --> 00:54:39,770 theta of n plus m for this parallel BFS algorithm. 1089 00:54:39,770 --> 00:54:41,450 So this is a work-efficient algorithm. 1090 00:54:41,450 --> 00:54:45,810 The work matches that of the serial algorithm. 1091 00:54:45,810 --> 00:54:47,360 Any questions on the analysis? 1092 00:54:53,780 --> 00:54:56,850 OK, so let's look at how this parallel BFS 1093 00:54:56,850 --> 00:54:59,880 algorithm runs in practice. 1094 00:54:59,880 --> 00:55:03,000 So here, I ran some experiments on a random graph 1095 00:55:03,000 --> 00:55:06,660 with 10 million vertices and 100 million edges. 1096 00:55:06,660 --> 00:55:08,670 And the edges were randomly generated. 1097 00:55:08,670 --> 00:55:12,050 And I made sure that each vertex had 10 edges. 1098 00:55:12,050 --> 00:55:14,160 I ran experiments on a 40-core machine 1099 00:55:14,160 --> 00:55:16,063 with 2-way hyperthreading. 1100 00:55:16,063 --> 00:55:17,730 Does anyone know what hyperthreading is? 1101 00:55:21,510 --> 00:55:22,700 Yeah, what is it?
1102 00:55:22,700 --> 00:55:24,850 AUDIENCE: It's when you have like one CPU core that 1103 00:55:24,850 --> 00:55:28,670 can execute two instruction streams at the same time 1104 00:55:28,670 --> 00:55:30,787 so it can hide memory latency. 1105 00:55:30,787 --> 00:55:32,620 JULIAN SHUN: Yeah, so that's a great answer. 1106 00:55:32,620 --> 00:55:35,100 So hyperthreading is an Intel technology 1107 00:55:35,100 --> 00:55:38,320 where for each physical core, the operating system actually 1108 00:55:38,320 --> 00:55:40,450 sees it as two logical cores. 1109 00:55:40,450 --> 00:55:42,100 They share many of the same resources, 1110 00:55:42,100 --> 00:55:43,517 but they have their own registers. 1111 00:55:43,517 --> 00:55:46,360 So if one of the logical cores stalls on a long latency 1112 00:55:46,360 --> 00:55:48,610 operation, the other logical core 1113 00:55:48,610 --> 00:55:53,870 can use the shared resources and hide some of the latency. 1114 00:55:53,870 --> 00:55:56,800 OK, so here I am plotting the speedup 1115 00:55:56,800 --> 00:56:00,670 over the single-threaded time of the parallel algorithm 1116 00:56:00,670 --> 00:56:02,660 versus the number of threads. 1117 00:56:02,660 --> 00:56:04,900 So we see that on 40 threads, we get 1118 00:56:04,900 --> 00:56:08,380 a speedup of about 22 or 23X. 1119 00:56:08,380 --> 00:56:09,970 And when we turn on hyperthreading 1120 00:56:09,970 --> 00:56:13,750 and use all 80 threads, the speedup is about 32 times 1121 00:56:13,750 --> 00:56:14,650 on 40 cores. 1122 00:56:14,650 --> 00:56:18,280 And this is actually pretty good for a parallel graph algorithm. 1123 00:56:18,280 --> 00:56:20,530 It's very hard to get very good speedups 1124 00:56:20,530 --> 00:56:23,350 on these irregular graph algorithms. 1125 00:56:23,350 --> 00:56:26,590 So 32X on 40 cores is pretty good. 1126 00:56:26,590 --> 00:56:29,200 I also compared this to the serial BFS algorithm 1127 00:56:29,200 --> 00:56:30,700 because that's what we ultimately 1128 00:56:30,700 --> 00:56:32,440 want to compare against. 1129 00:56:32,440 --> 00:56:38,610 So we see that on 80 threads, the speedup over the serial BFS 1130 00:56:38,610 --> 00:56:42,250 is about 21, 22X. 1131 00:56:42,250 --> 00:56:47,650 And the serial BFS is 54% faster than the parallel BFS 1132 00:56:47,650 --> 00:56:49,870 on one thread. 1133 00:56:49,870 --> 00:56:52,960 This is because it's doing less work than the parallel version. 1134 00:56:52,960 --> 00:56:55,570 The parallel version has to do extra work with the prefix 1135 00:56:55,570 --> 00:56:57,910 sum and the filter, whereas the serial version doesn't 1136 00:56:57,910 --> 00:57:00,550 have to do that. 1137 00:57:00,550 --> 00:57:02,830 But overall, the parallel implementation 1138 00:57:02,830 --> 00:57:05,050 is still pretty good. 1139 00:57:05,050 --> 00:57:05,850 OK, questions? 1140 00:57:16,990 --> 00:57:22,110 So a couple of lectures ago, we saw this slide here. 1141 00:57:22,110 --> 00:57:23,790 So Charles told us never to write 1142 00:57:23,790 --> 00:57:27,060 nondeterministic parallel programs because it's 1143 00:57:27,060 --> 00:57:29,340 very hard to debug these programs and hard to reason 1144 00:57:29,340 --> 00:57:31,080 about them. 1145 00:57:31,080 --> 00:57:34,230 So is there nondeterminism in this BFS code 1146 00:57:34,230 --> 00:57:35,050 that we looked at? 1147 00:57:37,896 --> 00:57:39,271 AUDIENCE: You have nondeterminism 1148 00:57:39,271 --> 00:57:40,210 in the compare-and-swap.
1149 00:57:40,210 --> 00:57:42,043 JULIAN SHUN: Yeah, so there's nondeterminism 1150 00:57:42,043 --> 00:57:44,740 in the compare-and-swap. 1151 00:57:44,740 --> 00:57:46,015 So let's go back to the code. 1152 00:57:48,880 --> 00:57:50,560 So this compare-and-swap here, there's 1153 00:57:50,560 --> 00:57:54,820 a race there because we get multiple vertices trying 1154 00:57:54,820 --> 00:57:58,270 to write to the parent entry of the neighbor at the same time. 1155 00:57:58,270 --> 00:58:00,800 And the one that wins is nondeterministic. 1156 00:58:00,800 --> 00:58:04,107 So the BFS tree that you get at the end is nondeterministic. 1157 00:58:08,580 --> 00:58:15,510 OK, so let's see how we can try to fix this nondeterminism. 1158 00:58:15,510 --> 00:58:17,640 OK so, as we said, this is the line 1159 00:58:17,640 --> 00:58:22,800 that causes the nondeterminism. 1160 00:58:22,800 --> 00:58:27,360 It turns out that we can actually make the output BFS 1161 00:58:27,360 --> 00:58:30,540 tree be deterministic by going over 1162 00:58:30,540 --> 00:58:33,990 the outgoing edges in each iteration in two phases. 1163 00:58:33,990 --> 00:58:37,830 So how this works is that in the first phase, 1164 00:58:37,830 --> 00:58:40,200 the vertices on the frontier are not actually 1165 00:58:40,200 --> 00:58:42,210 going to write to the parent array. 1166 00:58:42,210 --> 00:58:43,960 Or they are going to write, but they're 1167 00:58:43,960 --> 00:58:48,210 going to be using this writeMin operator. 1168 00:58:48,210 --> 00:58:51,120 And the writeMin operator is an atomic operation 1169 00:58:51,120 --> 00:58:53,250 that guarantees that when we have concurrent writes 1170 00:58:53,250 --> 00:58:54,370 to the same location, 1171 00:58:54,370 --> 00:58:57,090 the smallest value gets written there. 1172 00:58:57,090 --> 00:58:58,590 So the value that gets written there 1173 00:58:58,590 --> 00:58:59,760 is going to be deterministic. 1174 00:58:59,760 --> 00:59:01,260 It's always going to be the smallest 1175 00:59:01,260 --> 00:59:03,990 one that tries to write there. 1176 00:59:03,990 --> 00:59:06,942 Then in the second phase, each vertex 1177 00:59:06,942 --> 00:59:08,400 is going to check for each neighbor 1178 00:59:08,400 --> 00:59:12,390 whether parent of neighbor is equal to v. If it is, 1179 00:59:12,390 --> 00:59:14,940 that means it was the vertex that successfully wrote 1180 00:59:14,940 --> 00:59:17,550 to parent of neighbor in the first phase. 1181 00:59:17,550 --> 00:59:20,490 And therefore, it's going to be responsible for placing 1182 00:59:20,490 --> 00:59:24,030 this neighbor onto the next frontier. 1183 00:59:24,030 --> 00:59:26,100 And we're also going to set parent of neighbor 1184 00:59:26,100 --> 00:59:29,950 to be negative v. This is just a minor detail. 1185 00:59:29,950 --> 00:59:34,110 And this is because when we're doing this writeMin operator, 1186 00:59:34,110 --> 00:59:36,870 we could have a future iteration where a lower vertex tries 1187 00:59:36,870 --> 00:59:39,270 to visit the same vertex that we already explored. 1188 00:59:39,270 --> 00:59:41,297 But if we set this to a negative value, 1189 00:59:41,297 --> 00:59:43,380 then since we're only ever writing non-negative values 1190 00:59:43,380 --> 00:59:44,560 to this location, 1191 00:59:44,560 --> 00:59:48,232 the writeMin on a neighbor that has already been explored 1192 00:59:48,232 --> 00:59:49,065 would never succeed.
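Here is a rough C sketch of one iteration of this two-phase scheme, with writeMin built from a compare-and-swap loop. It is a sketch under stated assumptions, not the lecture's exact code: unexplored parent entries are assumed to hold INT_MAX (instead of the negative 1 used in the nondeterministic version) so that writeMin, which only writes strictly smaller values, can win on them; the source's entry is assumed to have been set to a negative value up front; degrees has already been prefix-summed; __sync_bool_compare_and_swap is a GCC/Clang builtin; and cilk_for assumes an OpenCilk-style compiler.

    #include <limits.h>
    #include <cilk/cilk.h>

    /* Atomically write newval to *addr if it is smaller than the
       current value, retrying on a failed compare-and-swap. */
    static inline void writeMin(int *addr, int newval) {
      int oldval = *addr;
      while (newval < oldval) {
        if (__sync_bool_compare_and_swap(addr, oldval, newval)) return;
        oldval = *addr; /* someone beat us to it; reread and retry */
      }
    }

    void deterministic_iteration(const int *Offsets, const int *Edges,
                                 const int *frontier, int frontierSize,
                                 const int *degrees,
                                 int *parent, int *frontierNext) {
      /* Phase 1: contend with writeMin; the smallest frontier vertex
         deterministically wins each unexplored neighbor.  Explored
         neighbors hold negative values, so writeMin fails on them. */
      cilk_for (int i = 0; i < frontierSize; i++) {
        int v = frontier[i];
        for (int j = Offsets[v]; j < Offsets[v + 1]; j++)
          writeMin(&parent[Edges[j]], v);
      }
      /* The implicit sync after the cilk_for ensures phase 2 sees the
         final (minimum) winner in every parent entry. */
      /* Phase 2: only the winner places the neighbor on the next
         frontier, then negates parent[ngh] so that no later writeMin,
         which writes only non-negative vertex IDs, can succeed. */
      cilk_for (int i = 0; i < frontierSize; i++) {
        int v = frontier[i];
        for (int j = Offsets[v]; j < Offsets[v + 1]; j++) {
          int ngh = Edges[j];
          int k = degrees[i] + (j - Offsets[v]); /* disjoint slot */
          if (parent[ngh] == v) {
            parent[ngh] = -v;      /* mark explored */
            frontierNext[k] = ngh;
          } else {
            frontierNext[k] = -1;  /* removed by the filter step */
          }
        }
      }
    }

The same prefix-sum and filter machinery as before then packs frontierNext into the next frontier.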
1193 00:59:52,420 --> 00:59:58,120 OK, so the final BFS tree that's generated by this code 1194 00:59:58,120 --> 01:00:00,800 is always going to be the same every time you run it. 1195 01:00:00,800 --> 01:00:03,070 I want to point out that this code is still 1196 01:00:03,070 --> 01:00:05,140 nondeterministic with respect to the order 1197 01:00:05,140 --> 01:00:07,610 in which individual memory locations get updated. 1198 01:00:07,610 --> 01:00:10,210 So you still have a determinacy race here 1199 01:00:10,210 --> 01:00:11,380 in the writeMin operator. 1200 01:00:11,380 --> 01:00:14,240 But it's still better than a nondeterministic code 1201 01:00:14,240 --> 01:00:17,770 in that you always get the same BFS tree. 1202 01:00:17,770 --> 01:00:21,037 So how do you actually implement the writeMin operation? 1203 01:00:21,037 --> 01:00:22,870 So it turns out you can implement this using 1204 01:00:22,870 --> 01:00:25,900 a loop with a compare-and-swap. 1205 01:00:25,900 --> 01:00:28,840 So writeMin takes as input two arguments-- 1206 01:00:28,840 --> 01:00:31,060 the memory address that we're trying to update 1207 01:00:31,060 --> 01:00:34,150 and the new value that we want to write to that address. 1208 01:00:34,150 --> 01:00:36,670 We're first going to set oldval equal to the value 1209 01:00:36,670 --> 01:00:38,560 at that memory address. 1210 01:00:38,560 --> 01:00:40,990 And we're going to check if newval is less than oldval. 1211 01:00:40,990 --> 01:00:42,640 If it is, then we're going to attempt 1212 01:00:42,640 --> 01:00:45,580 to do a compare-and-swap at that location, 1213 01:00:45,580 --> 01:00:49,210 writing newval into that address if its initial value 1214 01:00:49,210 --> 01:00:50,350 was oldval. 1215 01:00:50,350 --> 01:00:52,540 And if that succeeds, then we return. 1216 01:00:52,540 --> 01:00:54,130 Otherwise, we failed. 1217 01:00:54,130 --> 01:00:56,380 And that means that somebody else came in the meantime 1218 01:00:56,380 --> 01:00:58,060 and changed the value there. 1219 01:00:58,060 --> 01:01:00,770 And therefore, we have to reread the old value 1220 01:01:00,770 --> 01:01:01,750 at the memory address. 1221 01:01:01,750 --> 01:01:04,090 And then we repeat. 1222 01:01:04,090 --> 01:01:07,840 And there are two ways that this writeMin operator could finish. 1223 01:01:07,840 --> 01:01:10,840 One is if the compare-and-swap was successful. 1224 01:01:10,840 --> 01:01:15,100 The other one is if newval is greater than 1225 01:01:15,100 --> 01:01:16,180 or equal to oldval. 1226 01:01:16,180 --> 01:01:18,700 In that case, we no longer have to try to write anymore 1227 01:01:18,700 --> 01:01:22,013 because the value that's there is already at least as small as what 1228 01:01:22,013 --> 01:01:22,930 we're trying to write. 1229 01:01:25,690 --> 01:01:29,440 So I implemented an optimized version 1230 01:01:29,440 --> 01:01:32,440 of this deterministic parallel BFS code 1231 01:01:32,440 --> 01:01:35,470 and compared it to the nondeterministic version. 1232 01:01:35,470 --> 01:01:37,450 And it turns out on 32 cores, it's 1233 01:01:37,450 --> 01:01:40,090 only a little bit slower than the nondeterministic version. 1234 01:01:40,090 --> 01:01:44,060 So it's about 5% to 20% slower on a range of different input 1235 01:01:44,060 --> 01:01:44,560 graphs. 1236 01:01:44,560 --> 01:01:47,770 So this is a pretty small price to pay for determinism.
1237 01:01:47,770 --> 01:01:51,160 And you get many nice benefits, such as ease 1238 01:01:51,160 --> 01:01:54,310 of debugging and ease of reasoning about the performance 1239 01:01:54,310 --> 01:01:57,070 of your code. 1240 01:01:57,070 --> 01:01:58,030 Any questions? 1241 01:02:05,690 --> 01:02:09,940 OK, so let me talk about another optimization 1242 01:02:09,940 --> 01:02:12,040 for breadth-first search. 1243 01:02:12,040 --> 01:02:15,550 And this is called the direction optimization. 1244 01:02:15,550 --> 01:02:19,570 And the idea is motivated by how the sizes of the frontiers 1245 01:02:19,570 --> 01:02:23,680 change in a typical BFS algorithm over time. 1246 01:02:23,680 --> 01:02:25,510 So here I'm plotting the frontier size 1247 01:02:25,510 --> 01:02:27,920 on the y-axis in log scale. 1248 01:02:27,920 --> 01:02:30,790 And the x-axis is the iteration number. 1249 01:02:30,790 --> 01:02:33,430 And on the left, we have a random graph, on the right, 1250 01:02:33,430 --> 01:02:34,870 we have a power law graph. 1251 01:02:34,870 --> 01:02:37,390 And we see that the frontier size actually 1252 01:02:37,390 --> 01:02:40,240 grows pretty rapidly, especially for the power law graph. 1253 01:02:40,240 --> 01:02:41,710 And then it drops pretty rapidly. 1254 01:02:44,240 --> 01:02:46,690 So this is true for many of the real-world graphs 1255 01:02:46,690 --> 01:02:50,440 that we see because many of them look like power law graphs. 1256 01:02:50,440 --> 01:02:52,540 And in the BFS algorithm, most of the work 1257 01:02:52,540 --> 01:02:55,653 is done when the frontier is relatively large. 1258 01:02:55,653 --> 01:02:57,070 So most of the work is going to be 1259 01:02:57,070 --> 01:02:59,170 done in these middle iterations where 1260 01:02:59,170 --> 01:03:01,510 the frontier is very large. 1261 01:03:04,900 --> 01:03:06,610 And it turns out that there are two ways 1262 01:03:06,610 --> 01:03:08,860 to do breadth-first search. 1263 01:03:08,860 --> 01:03:10,630 One way is the traditional way, which 1264 01:03:10,630 --> 01:03:13,660 I'm going to refer to as the top-down method. 1265 01:03:13,660 --> 01:03:15,160 And this is just what we did before. 1266 01:03:15,160 --> 01:03:17,707 We look at the frontier vertices, 1267 01:03:17,707 --> 01:03:19,540 and explore all of their outgoing neighbors, 1268 01:03:19,540 --> 01:03:22,420 and mark any of the unexplored ones as explored, 1269 01:03:22,420 --> 01:03:25,270 and place them onto the next frontier. 1270 01:03:25,270 --> 01:03:27,820 But there's actually another way to do breadth-first search. 1271 01:03:27,820 --> 01:03:29,887 And this is known as the bottom-up method. 1272 01:03:29,887 --> 01:03:31,470 And in the bottom-up method, I'm going 1273 01:03:31,470 --> 01:03:33,340 to look at all of the vertices in the graph that 1274 01:03:33,340 --> 01:03:34,840 haven't been explored yet, and I'm 1275 01:03:34,840 --> 01:03:37,310 going to look at their incoming edges. 1276 01:03:37,310 --> 01:03:40,870 And if I find an incoming neighbor that's on the current frontier, 1277 01:03:40,870 --> 01:03:43,542 I can just say that that incoming neighbor is my parent. 1278 01:03:43,542 --> 01:03:45,250 And I don't even need to look at the rest 1279 01:03:45,250 --> 01:03:47,170 of my incoming neighbors.
1280 01:03:47,170 --> 01:03:49,867 So in this example here, vertices 9 through 12, 1281 01:03:49,867 --> 01:03:51,700 when they loop through their incoming edges, 1282 01:03:51,700 --> 01:03:54,490 they found an incoming neighbor on the frontier, 1283 01:03:54,490 --> 01:03:57,070 and they chose that neighbor as their parent. 1284 01:03:57,070 --> 01:03:59,140 And they get marked as explored. 1285 01:03:59,140 --> 01:04:02,020 And we can actually save some edge traversals here because, 1286 01:04:02,020 --> 01:04:04,090 for example, if you look at vertex 9, 1287 01:04:04,090 --> 01:04:07,180 and you imagine the edges being traversed 1288 01:04:07,180 --> 01:04:10,450 in a top-to-bottom manner, then vertex 9 is only 1289 01:04:10,450 --> 01:04:12,490 going to look at its first incoming edge 1290 01:04:12,490 --> 01:04:14,943 and find that its incoming neighbor is on the frontier. 1291 01:04:14,943 --> 01:04:16,360 So it doesn't even need to inspect 1292 01:04:16,360 --> 01:04:18,130 the rest of the incoming edges because all 1293 01:04:18,130 --> 01:04:21,430 we care about finding is just one parent in the BFS tree. 1294 01:04:21,430 --> 01:04:25,730 We don't need to find all of the possible parents. 1295 01:04:25,730 --> 01:04:29,020 In this example here, vertices 13 through 15 actually ended up 1296 01:04:29,020 --> 01:04:31,718 wasting work because they looked at all of their incoming edges. 1297 01:04:31,718 --> 01:04:34,010 And none of the incoming neighbors are on the frontier. 1298 01:04:34,010 --> 01:04:36,690 So they don't actually find a parent. 1299 01:04:36,690 --> 01:04:38,200 So the bottom-up approach turns out 1300 01:04:38,200 --> 01:04:40,420 to work pretty well when the frontier is large 1301 01:04:40,420 --> 01:04:42,310 and many vertices have already been explored. 1302 01:04:42,310 --> 01:04:45,170 Because in this case, you don't have to look at many vertices. 1303 01:04:45,170 --> 01:04:46,810 And for the ones that you do look at, 1304 01:04:46,810 --> 01:04:49,060 when you scan over their incoming edges, 1305 01:04:49,060 --> 01:04:50,830 it's very likely that early on, you'll 1306 01:04:50,830 --> 01:04:53,440 find a neighbor that is on the current frontier, 1307 01:04:53,440 --> 01:04:57,520 and you can skip a bunch of edge traversals. 1308 01:04:57,520 --> 01:04:59,080 And the top-down approach is better 1309 01:04:59,080 --> 01:05:03,130 when the frontier is relatively small. 1310 01:05:03,130 --> 01:05:06,070 And in a paper by Scott Beamer in 2012, 1311 01:05:06,070 --> 01:05:07,780 he actually studied the performance 1312 01:05:07,780 --> 01:05:10,180 of these two approaches in BFS. 1313 01:05:10,180 --> 01:05:13,960 And this plot here plots the running time 1314 01:05:13,960 --> 01:05:17,483 versus the iteration number for a power law graph 1315 01:05:17,483 --> 01:05:19,900 and compares the performance of the top-down and bottom-up 1316 01:05:19,900 --> 01:05:20,840 approach. 1317 01:05:20,840 --> 01:05:22,570 So we see that for the first two steps, 1318 01:05:22,570 --> 01:05:25,787 the top-down approach is faster than the bottom-up approach. 1319 01:05:25,787 --> 01:05:27,370 But then for the next couple of steps, 1320 01:05:27,370 --> 01:05:31,150 the bottom-up approach is faster than the top-down approach. 1321 01:05:31,150 --> 01:05:33,400 And then when we get to the end, the top-down approach 1322 01:05:33,400 --> 01:05:36,730 becomes faster again.
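For concreteness, here is a rough C sketch of one bottom-up step. The names here are illustrative assumptions, not the lecture's exact code: InOffsets and InEdges index each vertex's incoming edges (for an undirected graph, these are just Offsets and Edges), and onFrontier and onNextFrontier are dense arrays with one entry per vertex.

    /* One bottom-up step: every unexplored vertex scans its incoming
       edges for a frontier neighbor.  Shown serially; the outer loop
       can be parallelized (e.g., with cilk_for), and no atomics are
       needed, since vertex v is the only writer of parent[v]. */
    for (int v = 0; v < n; v++) {
      if (parent[v] == -1) { /* v is still unexplored */
        for (int j = InOffsets[v]; j < InOffsets[v + 1]; j++) {
          int ngh = InEdges[j];
          if (onFrontier[ngh]) {
            parent[v] = ngh;       /* adopt the first frontier neighbor */
            onNextFrontier[v] = 1;
            break;                 /* skip v's remaining incoming edges */
          }
        }
      }
    }

Whether the early break saves more work than the scan over all n vertices costs depends on how large the frontier is.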
1323 01:05:36,730 --> 01:05:38,620 So the top-down approach, as I said, 1324 01:05:38,620 --> 01:05:41,440 is more efficient for small frontiers, 1325 01:05:41,440 --> 01:05:42,940 whereas a bottom-up approach is more 1326 01:05:42,940 --> 01:05:46,000 efficient for large frontiers. 1327 01:05:46,000 --> 01:05:48,850 Also, I want to point out that in the top-down approach, when 1328 01:05:48,850 --> 01:05:51,430 we update the parent array, that actually has to be atomic. 1329 01:05:51,430 --> 01:05:53,263 Because we can have multiple vertices trying 1330 01:05:53,263 --> 01:05:54,640 to update the same neighbor. 1331 01:05:54,640 --> 01:05:57,190 But in a bottom-up approach, the update to the parent array 1332 01:05:57,190 --> 01:05:58,780 doesn't have to be atomic. 1333 01:05:58,780 --> 01:06:01,390 Because we're scanning over the incoming neighbors 1334 01:06:01,390 --> 01:06:04,420 of any particular vertex v serially. 1335 01:06:04,420 --> 01:06:06,550 And therefore, there can only be one processor 1336 01:06:06,550 --> 01:06:10,030 that's writing to parent of v. 1337 01:06:10,030 --> 01:06:12,160 So we choose between these two approaches based 1338 01:06:12,160 --> 01:06:14,140 on the size of the frontier. 1339 01:06:14,140 --> 01:06:18,870 We found that a threshold of a frontier size of about n/20 1340 01:06:18,870 --> 01:06:20,120 works pretty well in practice. 1341 01:06:20,120 --> 01:06:23,170 So if the frontier has more than n/20 vertices, 1342 01:06:23,170 --> 01:06:24,460 we use a bottom-up approach. 1343 01:06:24,460 --> 01:06:27,160 And otherwise, we use a top-down approach. 1344 01:06:27,160 --> 01:06:30,250 You can also use more sophisticated thresholds, 1345 01:06:30,250 --> 01:06:33,430 such as also considering the sum of out-degrees, 1346 01:06:33,430 --> 01:06:35,830 since the actual work is dependent on the sum 1347 01:06:35,830 --> 01:06:38,830 of out-degrees of the vertices on the frontier. 1348 01:06:38,830 --> 01:06:40,990 You can also use different thresholds 1349 01:06:40,990 --> 01:06:45,370 for going from top-down to bottom-up and then 1350 01:06:45,370 --> 01:06:48,960 another threshold for going from bottom-up back to top-down. 1351 01:06:48,960 --> 01:06:51,220 And in fact, that's what the original paper did. 1352 01:06:51,220 --> 01:06:54,460 They had two different thresholds. 1353 01:06:54,460 --> 01:06:56,830 We also need to generate the inverse graph 1354 01:06:56,830 --> 01:06:59,830 or the transposed graph if we're using this method 1355 01:06:59,830 --> 01:07:01,810 and the graph is directed. 1356 01:07:01,810 --> 01:07:04,407 Because if the graph is directed, 1357 01:07:04,407 --> 01:07:05,990 in the bottom-up approach, we actually 1358 01:07:05,990 --> 01:07:07,610 need to look at the incoming neighbors, not 1359 01:07:07,610 --> 01:07:08,580 the outgoing neighbors. 1360 01:07:08,580 --> 01:07:11,383 So if the graph wasn't already symmetrized, 1361 01:07:11,383 --> 01:07:13,550 then we have to generate both the incoming neighbors 1362 01:07:13,550 --> 01:07:15,510 and outgoing neighbors for each vertex. 1363 01:07:15,510 --> 01:07:18,980 So we can do that as a pre-processing step. 1364 01:07:18,980 --> 01:07:19,940 Any questions? 1365 01:07:26,900 --> 01:07:30,230 OK, so how do we actually represent the frontier? 1366 01:07:30,230 --> 01:07:31,730 So one way to represent the frontier 1367 01:07:31,730 --> 01:07:33,380 is just use a sparse integer array, 1368 01:07:33,380 --> 01:07:36,980 which is what we did before.
1369 01:07:36,980 --> 01:07:39,530 Another way to do this is to use a dense array. 1370 01:07:39,530 --> 01:07:42,560 So, for example, here I have an array of bytes. 1371 01:07:42,560 --> 01:07:45,360 The array is of size n, where n is the number of vertices. 1372 01:07:45,360 --> 01:07:48,470 And I have a 1 in position i if vertex i 1373 01:07:48,470 --> 01:07:51,440 is on the frontier and 0 otherwise. 1374 01:07:51,440 --> 01:07:55,190 I can also use a bit vector to further compress this 1375 01:07:55,190 --> 01:07:59,870 and then use additional bit level operations to access it. 1376 01:07:59,870 --> 01:08:02,870 So for the top-down approach, a sparse representation 1377 01:08:02,870 --> 01:08:05,240 is better because the top-down approach usually 1378 01:08:05,240 --> 01:08:06,950 deals with small frontiers. 1379 01:08:06,950 --> 01:08:08,660 And if we use a sparse array, we only 1380 01:08:08,660 --> 01:08:11,300 have to do work proportional to the number of vertices 1381 01:08:11,300 --> 01:08:12,740 on the frontier. 1382 01:08:12,740 --> 01:08:14,330 And then in the bottom-up approach, 1383 01:08:14,330 --> 01:08:16,370 it turns out that dense representation is better 1384 01:08:16,370 --> 01:08:19,670 because we're looking at most of the vertices anyways. 1385 01:08:19,670 --> 01:08:23,240 And then we need to switch between these two methods based 1386 01:08:23,240 --> 01:08:26,090 on the approach that we're using. 1387 01:08:29,790 --> 01:08:33,050 So here's some performance numbers comparing the three 1388 01:08:33,050 --> 01:08:34,310 different modes of traversal. 1389 01:08:34,310 --> 01:08:36,740 So we have bottom-up, top-down, and then 1390 01:08:36,740 --> 01:08:38,330 the direction optimizing approach 1391 01:08:38,330 --> 01:08:40,975 using a threshold of n/20. 1392 01:08:40,975 --> 01:08:43,069 First of all, we see that the bottom-up approach 1393 01:08:43,069 --> 01:08:45,899 is the slowest for both of these graphs. 1394 01:08:45,899 --> 01:08:48,859 And this is because it's doing a lot of wasted work 1395 01:08:48,859 --> 01:08:52,010 in the early iterations. 1396 01:08:52,010 --> 01:08:55,040 We also see that the direction optimizing approach is always 1397 01:08:55,040 --> 01:08:58,910 faster than both the top-down and the bottom-up approach. 1398 01:08:58,910 --> 01:09:01,670 This is because if we switch to the bottom-up approach 1399 01:09:01,670 --> 01:09:03,140 at an appropriate time, then we can 1400 01:09:03,140 --> 01:09:05,390 save a lot of edge traversals. 1401 01:09:05,390 --> 01:09:07,729 And, for example, you can see for the power law graph, 1402 01:09:07,729 --> 01:09:09,184 the direction optimizing approach 1403 01:09:09,184 --> 01:09:12,380 is almost three times faster than the top-down approach. 1404 01:09:15,380 --> 01:09:17,300 The benefits of this approach are highly 1405 01:09:17,300 --> 01:09:20,140 dependent on the input graph. 1406 01:09:20,140 --> 01:09:24,170 So it works very well for power law and random graphs. 1407 01:09:24,170 --> 01:09:27,649 But if you have graphs where the frontier size is always small, 1408 01:09:27,649 --> 01:09:29,870 such as a grid graph or a road network, 1409 01:09:29,870 --> 01:09:32,490 then you would never use a bottom-up approach. 1410 01:09:32,490 --> 01:09:37,130 So this wouldn't actually give you any performance gains. 1411 01:09:37,130 --> 01:09:38,253 Any questions? 
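Putting these pieces together, here is a small C sketch of the per-iteration switch with the n/20 threshold. The two step functions are hypothetical stand-ins rather than a real API: assume each one runs one BFS iteration in its direction and returns the size of the next frontier.

    /* Hypothetical helpers: one top-down step over a sparse frontier,
       one bottom-up step over a dense frontier. */
    int top_down_step(int n, const int *frontier, int frontierSize, int *parent);
    int bottom_up_step(int n, const unsigned char *dense, int *parent);

    int bfs_step(int n, int *frontier, int frontierSize,
                 unsigned char *dense, int *parent) {
      if (frontierSize > n / 20) {
        /* Large frontier: convert sparse -> dense, then go bottom-up.
           These conversion loops can also be parallelized. */
        for (int i = 0; i < n; i++) dense[i] = 0;
        for (int i = 0; i < frontierSize; i++) dense[frontier[i]] = 1;
        return bottom_up_step(n, dense, parent);
      }
      /* Small frontier: stay top-down on the sparse representation. */
      return top_down_step(n, frontier, frontierSize, parent);
    }

A more sophisticated policy could also weigh the sum of out-degrees, and could use separate thresholds for switching in each direction, as mentioned earlier.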
1412 01:09:43,810 --> 01:09:46,040 So it turns out that this direction optimizing 1413 01:09:46,040 --> 01:09:49,220 idea is more general than just breadth-first search. 1414 01:09:49,220 --> 01:09:51,800 So a couple years ago, I developed 1415 01:09:51,800 --> 01:09:54,710 this framework called Ligra, where I generalized 1416 01:09:54,710 --> 01:09:58,760 the direction optimizing idea to other graph algorithms, 1417 01:09:58,760 --> 01:10:01,730 such as betweenness centrality, connected components, sparse 1418 01:10:01,730 --> 01:10:04,180 PageRank, shortest paths, and so on. 1419 01:10:04,180 --> 01:10:07,280 And in the Ligra framework, we have an EDGEMAP operator 1420 01:10:07,280 --> 01:10:09,950 that chooses between a sparse implementation 1421 01:10:09,950 --> 01:10:13,310 and a dense implementation based on the size of the frontier. 1422 01:10:13,310 --> 01:10:15,920 So the sparse here corresponds to the top-down approach. 1423 01:10:15,920 --> 01:10:19,340 And dense corresponds to the bottom-up approach. 1424 01:10:19,340 --> 01:10:22,070 And it turns out that using this direction optimizing 1425 01:10:22,070 --> 01:10:23,570 idea for these other applications 1426 01:10:23,570 --> 01:10:26,390 also gives you performance gains in practice. 1427 01:10:31,660 --> 01:10:35,760 OK, so let me now talk about another optimization, which 1428 01:10:35,760 --> 01:10:37,680 is graph compression. 1429 01:10:37,680 --> 01:10:41,340 And the goal here is to reduce the amount of memory usage 1430 01:10:41,340 --> 01:10:43,560 in the graph algorithm. 1431 01:10:43,560 --> 01:10:46,800 So recall, this was our CSR representation. 1432 01:10:46,800 --> 01:10:48,690 And in the Edges array, we just stored 1433 01:10:48,690 --> 01:10:52,680 the IDs of the target vertices. 1434 01:10:52,680 --> 01:10:55,080 Instead of storing the actual targets, 1435 01:10:55,080 --> 01:10:59,160 we can actually do better by first sorting the edges so 1436 01:10:59,160 --> 01:11:01,680 that they appear in non-decreasing order 1437 01:11:01,680 --> 01:11:03,900 and then just storing the differences 1438 01:11:03,900 --> 01:11:05,750 between consecutive edges. 1439 01:11:05,750 --> 01:11:08,040 And then for the first edge for any particular vertex, 1440 01:11:08,040 --> 01:11:10,110 we'll store the difference between the target 1441 01:11:10,110 --> 01:11:11,520 and the source of that edge. 1442 01:11:14,330 --> 01:11:17,810 So, for example, here, for vertex 0, 1443 01:11:17,810 --> 01:11:20,568 the first edge is going to have a value of 2 1444 01:11:20,568 --> 01:11:23,110 because we're going to take the difference between the target 1445 01:11:23,110 --> 01:11:23,735 and the source. 1446 01:11:23,735 --> 01:11:26,180 So 2 minus 0 is 2. 1447 01:11:26,180 --> 01:11:28,190 Then for the next edge, we're going 1448 01:11:28,190 --> 01:11:30,420 to take the difference between the second edge 1449 01:11:30,420 --> 01:11:35,270 and the first edge, so 7 minus 2, which is 5. 1450 01:11:35,270 --> 01:11:39,290 And then similarly we do that for all of the remaining edges. 1451 01:11:39,290 --> 01:11:41,400 Notice that there are some negative values here. 1452 01:11:41,400 --> 01:11:45,870 And this is because the target is smaller than the source. 1453 01:11:45,870 --> 01:11:48,810 So in this example, 1 is smaller than 2. 1454 01:11:48,810 --> 01:11:51,560 So if you do 1 minus 2, you get a negative-- 1455 01:11:51,560 --> 01:11:52,880 negative 1.
1456 01:11:52,880 --> 01:11:55,100 And this can only happen for the first edge 1457 01:11:55,100 --> 01:11:57,410 for any particular vertex because for all 1458 01:11:57,410 --> 01:12:00,470 the other edges, we're encoding the difference 1459 01:12:00,470 --> 01:12:02,150 between that edge and the previous edge. 1460 01:12:02,150 --> 01:12:03,710 And we already sorted these edges 1461 01:12:03,710 --> 01:12:06,260 so that they appear in non-decreasing order. 1462 01:12:08,870 --> 01:12:12,770 OK, so this compressed edges array 1463 01:12:12,770 --> 01:12:15,140 will typically contain smaller values 1464 01:12:15,140 --> 01:12:17,610 than this original edges array. 1465 01:12:17,610 --> 01:12:21,050 So now we want to be able to use fewer bits 1466 01:12:21,050 --> 01:12:22,350 to represent these values. 1467 01:12:22,350 --> 01:12:26,490 We don't want to use 32 or 64 bits like we did before. 1468 01:12:26,490 --> 01:12:28,680 Otherwise, we wouldn't be saving any space. 1469 01:12:28,680 --> 01:12:31,190 So one way to reduce the space usage 1470 01:12:31,190 --> 01:12:34,400 is to store these values using what's called a variable length 1471 01:12:34,400 --> 01:12:36,560 code or a k-bit code. 1472 01:12:36,560 --> 01:12:40,400 And the idea is to encode each value in chunks of k bits, 1473 01:12:40,400 --> 01:12:44,630 where for each chunk, we use k minus 1 bits for the data and 1 1474 01:12:44,630 --> 01:12:47,190 bit as the continue bit. 1475 01:12:47,190 --> 01:12:49,820 So for example, let's encode the integer 401 1476 01:12:49,820 --> 01:12:52,490 using 8-bit or byte codes. 1477 01:12:52,490 --> 01:12:55,340 So first, we're going to write this value out in binary. 1478 01:12:55,340 --> 01:12:57,650 And then we're going to take the bottom 7 bits, 1479 01:12:57,650 --> 01:12:59,660 and we're going to place that into the data 1480 01:12:59,660 --> 01:13:01,820 field of the first chunk. 1481 01:13:01,820 --> 01:13:04,133 And then in the last bit of this chunk, 1482 01:13:04,133 --> 01:13:06,050 we're going to check if we still have any more 1483 01:13:06,050 --> 01:13:07,400 bits that we need to encode. 1484 01:13:07,400 --> 01:13:10,280 And if we do, then we're going to set a 1 in the continue bit 1485 01:13:10,280 --> 01:13:11,510 position. 1486 01:13:11,510 --> 01:13:13,000 And then we create another chunk. 1487 01:13:13,000 --> 01:13:16,160 We'll place the next 7 bits into the data 1488 01:13:16,160 --> 01:13:17,278 field of that chunk. 1489 01:13:17,278 --> 01:13:19,820 And then now we're actually done encoding this integer value. 1490 01:13:19,820 --> 01:13:23,660 So we can place a 0 in the continue bit. 1491 01:13:23,660 --> 01:13:25,520 So that's how the encoding works. 1492 01:13:25,520 --> 01:13:29,090 And decoding is just doing this process backwards. 1493 01:13:29,090 --> 01:13:31,970 So you read chunks until you find a chunk with a 0 1494 01:13:31,970 --> 01:13:33,590 continue bit. 1495 01:13:33,590 --> 01:13:35,420 And then you shift all of the data values 1496 01:13:35,420 --> 01:13:37,220 left accordingly and sum them together 1497 01:13:37,220 --> 01:13:42,210 to reconstruct the integer value that you encoded. 1498 01:13:42,210 --> 01:13:45,410 One performance issue that might occur here 1499 01:13:45,410 --> 01:13:47,030 is that when you're decoding, you 1500 01:13:47,030 --> 01:13:49,670 have to check this continue bit for every chunk 1501 01:13:49,670 --> 01:13:52,550 and decide what to do based on that continue bit.
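As a concrete illustration, here is a minimal C sketch of this byte code (an illustrative reconstruction, not the lecture's exact code). It handles only non-negative values; the possibly negative first-edge difference would need separate treatment, for example a sign bit in the first chunk.

    #include <stdint.h>
    #include <stddef.h>

    /* Encode value as chunks of 7 data bits; the top bit of each byte
       is the continue bit.  Returns the number of bytes written.
       For example, 401 encodes as the two bytes 0x91, 0x03. */
    size_t encode_byte_code(uint32_t value, uint8_t *out) {
      size_t k = 0;
      do {
        uint8_t chunk = value & 0x7f;  /* bottom 7 bits into the data field */
        value >>= 7;
        if (value != 0) chunk |= 0x80; /* more bits remain: continue bit = 1 */
        out[k++] = chunk;
      } while (value != 0);
      return k;
    }

    /* Decode one value, reading chunks until a 0 continue bit.
       Returns the number of bytes consumed. */
    size_t decode_byte_code(const uint8_t *in, uint32_t *value) {
      uint32_t result = 0;
      int shift = 0;
      size_t k = 0;
      uint8_t chunk;
      do {
        chunk = in[k++];
        result |= (uint32_t)(chunk & 0x7f) << shift; /* shift data into place */
        shift += 7;
      } while (chunk & 0x80); /* test the continue bit of every chunk */
      *value = result;
      return k;
    }

Note that the decoder's loop condition tests the continue bit of every chunk it reads.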
1502 01:13:52,550 --> 01:13:56,140 And this is actually an unpredictable branch. 1503 01:13:56,140 --> 01:13:59,360 So you can suffer from branch mispredictions 1504 01:13:59,360 --> 01:14:03,420 from checking this continue bit. 1505 01:14:03,420 --> 01:14:06,230 So one way you can optimize this is to get rid of these 1506 01:14:06,230 --> 01:14:08,050 continue bits. 1507 01:14:08,050 --> 01:14:10,220 And the idea here is to first figure out 1508 01:14:10,220 --> 01:14:11,840 how many bytes you need to encode 1509 01:14:11,840 --> 01:14:14,640 each integer in the sequence. 1510 01:14:14,640 --> 01:14:18,020 And then you group together integers 1511 01:14:18,020 --> 01:14:21,500 that require the same number of bytes to encode. 1512 01:14:21,500 --> 01:14:25,190 Use a run-length encoding idea to encode all of these integers 1513 01:14:25,190 --> 01:14:28,445 together by using a header byte, where in the header byte, 1514 01:14:28,445 --> 01:14:31,940 you use the lower 6 bits to store the size of the group 1515 01:14:31,940 --> 01:14:35,780 and the highest 2 bits to store the number of bytes needed 1516 01:14:35,780 --> 01:14:38,690 to encode each of these integers. 1517 01:14:38,690 --> 01:14:42,650 And now all of the integers in this group 1518 01:14:42,650 --> 01:14:44,450 will just be stored after this header byte. 1519 01:14:44,450 --> 01:14:47,690 And we know exactly how many bytes we need to decode each of them. 1520 01:14:47,690 --> 01:14:50,975 So we don't need to store a continue bit in these chunks. 1521 01:14:53,870 --> 01:14:56,000 This does slightly increase the space usage. 1522 01:14:56,000 --> 01:14:58,790 But it makes decoding cheaper because we no longer have 1523 01:14:58,790 --> 01:15:02,030 to suffer from branch mispredictions 1524 01:15:02,030 --> 01:15:06,020 from checking this continue bit. 1525 01:15:06,020 --> 01:15:09,800 OK, so now we have to decode these edge lists on the fly 1526 01:15:09,800 --> 01:15:11,330 as we're running our algorithm. 1527 01:15:11,330 --> 01:15:13,080 If we decoded everything at the beginning, 1528 01:15:13,080 --> 01:15:14,788 we wouldn't actually be saving any space. 1529 01:15:14,788 --> 01:15:16,820 We need to decode these edges as we access them 1530 01:15:16,820 --> 01:15:18,770 in our algorithm. 1531 01:15:18,770 --> 01:15:20,720 Since we encoded all of these edge 1532 01:15:20,720 --> 01:15:22,460 lists separately for each vertex, 1533 01:15:22,460 --> 01:15:24,230 we can decode all of them in parallel. 1534 01:15:26,750 --> 01:15:29,660 And each vertex just decodes its edge list sequentially. 1535 01:15:29,660 --> 01:15:32,240 But what about high-degree vertices? 1536 01:15:32,240 --> 01:15:33,620 If you have a high-degree vertex, 1537 01:15:33,620 --> 01:15:35,830 you still have to decode its edge list sequentially. 1538 01:15:35,830 --> 01:15:37,550 And if you're running this in parallel, 1539 01:15:37,550 --> 01:15:41,400 this could lead to load imbalance. 1540 01:15:41,400 --> 01:15:44,810 So one way to fix this is, instead of just encoding 1541 01:15:44,810 --> 01:15:46,970 the whole thing sequentially, you can chunk it up 1542 01:15:46,970 --> 01:15:50,360 into chunks of size T. And then for each chunk, 1543 01:15:50,360 --> 01:15:52,280 you encode it like you did before, 1544 01:15:52,280 --> 01:15:56,120 where you store the first value relative to the source vertex 1545 01:15:56,120 --> 01:15:59,130 and then all of the other values relative to the previous edge.
1546 01:15:59,130 --> 01:16:01,550 And now you can actually decode the first value 1547 01:16:01,550 --> 01:16:04,460 of each of these chunks all in parallel 1548 01:16:04,460 --> 01:16:09,280 without having to wait for the previous edge to be decoded. 1549 01:16:09,280 --> 01:16:11,730 And then this gives us much more parallelism 1550 01:16:11,730 --> 01:16:15,860 because all of these chunks can be decoded in parallel. 1551 01:16:15,860 --> 01:16:20,690 And we found that a chunk size T 1552 01:16:20,690 --> 01:16:23,360 between 100 and 10,000 works pretty well in practice. 1553 01:16:26,940 --> 01:16:29,600 OK, so I'm not going to have time 1554 01:16:29,600 --> 01:16:30,830 to go over the experiments. 1555 01:16:30,830 --> 01:16:33,080 But at a high level, the experiments 1556 01:16:33,080 --> 01:16:37,430 show that these compression schemes do save space. 1557 01:16:37,430 --> 01:16:40,250 And serially, the compressed version is only slightly slower 1558 01:16:40,250 --> 01:16:41,870 than the uncompressed version. 1559 01:16:41,870 --> 01:16:43,880 But surprisingly, when you run it in parallel, 1560 01:16:43,880 --> 01:16:47,000 it actually becomes faster than the uncompressed version. 1561 01:16:47,000 --> 01:16:49,790 And this is because these graph algorithms are memory bound. 1562 01:16:49,790 --> 01:16:51,500 And since we're using less memory, 1563 01:16:51,500 --> 01:16:54,350 we can alleviate this memory subsystem bottleneck 1564 01:16:54,350 --> 01:16:57,590 and get better scalability. 1565 01:16:57,590 --> 01:17:00,200 And the decoding part of these compressed algorithms 1566 01:17:00,200 --> 01:17:02,240 actually gets very good parallel speedup 1567 01:17:02,240 --> 01:17:04,220 because it's just doing local operations. 1568 01:17:09,050 --> 01:17:11,510 OK, so let me summarize now. 1569 01:17:11,510 --> 01:17:14,750 So we saw some properties of real-world graphs. 1570 01:17:14,750 --> 01:17:17,270 We saw that they're quite large, but they can still 1571 01:17:17,270 --> 01:17:19,250 fit on a multi-core server. 1572 01:17:19,250 --> 01:17:20,900 And they're relatively sparse. 1573 01:17:20,900 --> 01:17:23,990 They also have a power law degree distribution. 1574 01:17:23,990 --> 01:17:26,990 Many graph algorithms are irregular in that they involve 1575 01:17:26,990 --> 01:17:28,700 many random memory accesses. 1576 01:17:28,700 --> 01:17:30,950 So memory access becomes a bottleneck for the performance 1577 01:17:30,950 --> 01:17:32,210 of these algorithms. 1578 01:17:32,210 --> 01:17:35,810 And you can improve performance with algorithmic optimizations, 1579 01:17:35,810 --> 01:17:39,170 such as the direction optimization, 1580 01:17:39,170 --> 01:17:40,820 and also by creating and exploiting 1581 01:17:40,820 --> 01:17:44,930 locality, for example, by using the bit vector optimization. 1582 01:17:44,930 --> 01:17:46,910 And finally, optimizations for graphs 1583 01:17:46,910 --> 01:17:48,440 might work well for certain graphs, 1584 01:17:48,440 --> 01:17:50,480 but they might not work well for other graphs. 1585 01:17:50,480 --> 01:17:52,490 For example, the direction optimization idea 1586 01:17:52,490 --> 01:17:55,268 works well for power law graphs but not for road graphs. 1587 01:17:55,268 --> 01:17:57,560 So when you're trying to optimize your graph algorithm, 1588 01:17:57,560 --> 01:18:00,440 you should definitely test it on different types of graphs 1589 01:18:00,440 --> 01:18:03,842 and see where it works well and where it doesn't work.
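Before I finish, here's a rough sketch of how that chunked parallel decode might look in Cilk. The chunk_offsets array, which records where each chunk's encoded bytes begin, is a helper I'm assuming for this sketch, and for simplicity I'm treating all of the differences as non-negative, even though the first one for a vertex can be negative.

#include <stdint.h>
#include <stddef.h>
#include <cilk/cilk.h>

uint64_t decode_byte_code(const uint8_t **in);  // byte-code decoder sketched earlier

// Decode one vertex's compressed edge list, which was split into
// chunks of size T. chunk_offsets[c] (an assumed helper structure)
// gives the byte offset where chunk c's encoded data begins, so all
// chunks can be decoded independently in parallel.
void decode_vertex_edges(uint64_t source, const uint8_t *encoded,
                         const size_t *chunk_offsets, size_t num_chunks,
                         size_t T, size_t degree, uint64_t *out) {
  cilk_for (size_t c = 0; c < num_chunks; c++) {
    const uint8_t *in = encoded + chunk_offsets[c];
    size_t begin = c * T;
    size_t end = (begin + T < degree) ? begin + T : degree;
    // The first edge in each chunk is stored relative to the source
    // vertex, so no chunk waits on the previous chunk's last edge.
    uint64_t prev = source + decode_byte_code(&in);
    out[begin] = prev;
    for (size_t i = begin + 1; i < end; i++) {
      prev += decode_byte_code(&in);  // later edges: deltas from the previous edge
      out[i] = prev;
    }
  }
}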
1590 01:18:03,842 --> 01:18:05,628 So that's all I have. 1591 01:18:05,628 --> 01:18:07,170 If you have any additional questions, 1592 01:18:07,170 --> 01:18:09,020 please feel free to ask me after class. 1593 01:18:09,020 --> 01:18:12,200 And as a reminder, we have a guest lecture on Thursday 1594 01:18:12,200 --> 01:18:15,770 by Professor Johnson of the MIT Math Department. 1595 01:18:15,770 --> 01:18:17,770 And he'll be talking about high-level languages, 1596 01:18:17,770 --> 01:18:20,380 so please be sure to attend.