ADAM YALA: OK, great. Well, thank you for the great setup. So for this section, I'm going to talk about some of our work in interpreting mammograms for cancer. Specifically, I'll go into cancer detection and triaging mammograms. Next, we'll talk about our technical approach to breast cancer risk. And then finally I'll close with the many, many different ways things can go wrong, and how that [INAUDIBLE] clinical implementation.

So let's look more closely at the numbers of the actual breast cancer screening workflow. As Connie already said, you might see something like 1,000 patients, and all of them get mammograms. Of that 1,000, on average maybe 100 get called back for additional imaging. Of that 100, something like 20 will get biopsied. And you end up with maybe five or six diagnoses of breast cancer. So one very clear thing you see when you look at this funnel is that well over 99% of the people you see on a given day are cancer-free. Your actual incidence is very low. And so there's a natural question that comes up: what can you do in terms of modeling, if you have even an OK cancer detection model, to raise the incidence of this population by automatically reading a portion of the population as healthy? Does everybody follow that broad idea? OK. That's enough head nods.

So the broad idea here is that you're going to train a cancer detection model to find cancer as well as you can. Given that, we're going to ask: what's a threshold on a development set such that below the threshold we can say no one has cancer? And if we used that at test time, simulating clinical implementation, what would that look like? Can we actually do better by doing this kind of process? And here's the broad plan of how I'm going to talk about this; I'm going to do the same for the next project as well.
First, we're going to talk about dataset collection and how we think about what good data is. Next, the actual methodology and the general challenges when you're modeling mammograms for any computer vision task, specifically cancer here, and also, obviously, risk. And lastly, how we thought about the analysis and some of the objectives there.

So to dive into it, we took consecutive mammograms. I'll get back to this later; it's actually quite important. We took consecutive mammograms from 2009 to 2016. This started off with about 280,000 mammograms, and once we filtered for at least one year of follow-up, we ended up with the final setting where we had 220,000 mammograms for training and about 26,000 for development and testing. And the label comes down to: is this a positive mammogram or not? We didn't look only at which cancers were caught by the radiologists; we asked whether a cancer was found by any means within a year. And to find that, we looked through the radiology reports, the EHR, and the Partners five-hospital registry. The idea was that if there's any way we can tell a cancer occurred, let's mark it as such, regardless of whether it was caught on MRI or at some later stage. So the thing we're trying to do here is just mimic the real-world task of trying to catch cancer.

And finally, an important detail: we always split by patient, so that your results aren't just memorizing that this specific patient didn't have cancer. If you had patient overlap across splits, that's a bad bias to have. OK. That's pretty simple.
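As a small illustration of that split-by-patient point, here is a minimal sketch (not the actual pipeline) of assigning exams to train/dev/test by hashing the patient ID, so that all of a patient's mammograms land in the same split. The field names and split fractions are placeholder assumptions.

```python
import hashlib

# Minimal sketch (placeholder field names): split exams by patient so the same
# patient never appears in more than one of train/dev/test.
def split_for(patient_id: str, dev_frac: float = 0.05, test_frac: float = 0.05) -> str:
    # Hash the patient ID to a stable bucket in [0, 100)
    bucket = int(hashlib.md5(patient_id.encode()).hexdigest(), 16) % 100
    if bucket < dev_frac * 100:
        return "dev"
    if bucket < (dev_frac + test_frac) * 100:
        return "test"
    return "train"

# Usage (hypothetical exam records):
# splits = {exam["exam_id"]: split_for(exam["patient_id"]) for exam in exams}
```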
Now let's go into the modeling. This is going to follow two chunks. One chunk is the general challenges, which are shared across the variety of projects, and the next is the more specific analysis for this project.

So a general question you might be asking: I have some image, I have some outcome; obviously this is just image classification. How is it different from ImageNet? Well, it's quite similar. Most lessons are shared, but there are some key differences. So I gave you two examples. One of them is a scene in my kitchen. Can anyone tell me what the object is? This is not a particularly hard question.

AUDIENCE: [Intermingled voices] Dog. Bear.

ADAM YALA: Right.

AUDIENCE: Dog.

ADAM YALA: It is almost all of those things. So that is my dog, the best dog. OK. So can anyone tell me, now that you've had some training with Connie, if this mammogram indicates cancer?

Well, it does. And this is unfair for a couple of reasons. Let's go into why this is hard. It's unfair in part because you don't have the training, but it's also a much harder signal to learn. So first let's delve into it. In this task, the image is really huge. You have something like a 3,200 by 2,600 pixel image, which is a single view of a breast, and in that, the actual cancer you're looking for might be 50 by 50 pixels. So intuitively, your signal-to-noise ratio is very different. Whereas in the other example, my dog is basically the entire image; she's huge in real life and in that photo, and the image itself is much smaller. So in typical object classification, not only are the images much smaller, but the relative size of the object in them is much larger.

To further compound the difficulty, the pattern you're looking for inside the mammogram is really context-dependent. If you saw that pattern somewhere else in the breast, it wouldn't indicate the same thing. So you really care about where in this global context the pattern shows up.
And if you take the mammogram at different times, with different compressions, you get this non-rigid morphing of the image that's much more difficult to model. Whereas the dog is more or less context-independent: you see that kind of frame anywhere and you know it's a dog. So it's a much easier thing to learn in a traditional computer vision setting.

And so the core challenge here is that the data is both too small and too big. If you look at just the number of cancers we have, the cancer might be less than 1% of the mammogram, and only about 0.7% of your images have cancers. Even in this dataset, which is 2009 to 2016 at MGH, a massive imaging center, in total across all of that you will still have less than 2,000 cancers. That is super tiny compared to regular object classification datasets. And this is looking at over a million images if you count all four views of the exams.

At the same time, it's also too big. Even if I downsample these images, I can only really fit three of them on a single GPU, and so this limits the batch size I can work with. By comparison, if I took regular ImageNet-sized images, I could easily fit batches of 128, do all the usual parallelization, and it's just much easier to play with. And finally, the dataset itself is quite large, so there are nuisances to deal with in terms of setting up your server infrastructure to handle these massive datasets and still train efficiently.

So the core challenge across all of these tasks is: how do we make this model actually learn? The core problem is that our signal-to-noise ratio is quite low, so training ends up being quite unstable. And there are a couple of simple levers you can play with. The first lever is initialization.
Next, we're going to talk about the optimization and architecture choice, and how that compares to what people often do in the community, including in a recent paper from yesterday. And then finally, we're going to talk about something more specific to the triage idea: how we actually use this model once it's trained.

OK. So before I go into how we made these choices, I'm just going to say what we chose, to give you context before I dive in. We used ImageNet initialization. We use a relatively large batch size of 24, and the way we do that is by taking four GPUs and doing a couple of rounds of backprop to accumulate gradients before taking an optimizer step. We sample balanced batches at training time. And for the backbone architecture we use ResNet-18, which is fairly standard.

OK. But as I said before, one of the first key decisions is how you think about your initialization. So this is a figure of ImageNet initialization versus random initialization. It's not any particular experiment; I've done this many, many times, and it's always like this. If you use ImageNet initialization, your loss drops immediately, both the train loss and the development loss; you actually learn something. Whereas with random initialization, you kind of don't learn anything. Your loss bounces around at the top for a very long time before it finds some region where it quickly starts learning, and then it plateaus again for a long time before it quickly starts learning again. And to give some context, about 50 epochs takes on the order of 15 or 16 hours. So waiting long enough to even see whether random initialization could perform as well is beyond my level of patience. It just takes too long, and I have other experiments to run.
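To make that setup concrete, here is a minimal PyTorch sketch (not the code from the talk) of starting from an ImageNet-pretrained ResNet-18 and swapping in a two-class head for whole-image prediction. The input size is a placeholder, and the grayscale-to-three-channel handling is an assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal sketch: ImageNet-pretrained ResNet-18 backbone with a fresh 2-class head.
def build_model(num_classes: int = 2) -> nn.Module:
    # ImageNet weights (newer torchvision: weights=models.ResNet18_Weights.DEFAULT)
    model = models.resnet18(pretrained=True)
    # ResNet is fully convolutional up to its global pool, so large mammogram-sized
    # inputs work; only the final linear head is replaced.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_model()
# e.g. a downsampled single-view mammogram, grayscale replicated to 3 channels
dummy = torch.randn(1, 3, 1664, 1280)
logits = model(dummy)  # shape: (1, 2)
```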
So this is more of an empirical observation that ImageNet initialization learns immediately. And there's a question of why that is. Our theoretical understanding of this is not that strong; we have some intuitions about why it might be happening. We don't think it's that some particular filter for this dog happens to be really great for breast cancer; that's quite implausible. But if you look at a lot of the earlier research on the right kind of random initialization for things like ReLU networks, a lot of the focus was on making sure the activations don't blow up as you go further down the network. One of the benefits of starting with a pre-trained network is that a lot of those dynamics are already figured out for a specific task, and shifting from that task to other tasks seems to be not that challenging. Another possible area of explanation is in the BatchNorm statistics. If you remember, we can only fit three images per GPU, and the way BatchNorm is implemented across every deep learning library that I know of, it computes statistics independently per GPU to minimize inter-GPU communication. With so few images per GPU, those statistics are hard to estimate from scratch. But if you start from the ImageNet BatchNorm statistics and just slowly shift them over, that might also give you some stability benefit.

But in general, a true, deeper theoretical understanding still eludes us, as I said, and it isn't something I can give strong conclusions about, unfortunately. OK. So that's initialization. If you don't get this right, nothing works for a very long time. So if you're going to start a project in this space, try this.

Next, another important decision, one that breaks things if you get it wrong, is your optimization and architecture choice. As I said before, a core stability problem here is that our signal-to-noise ratio is really low.
And so a very common approach throughout a lot of the prior work, and something I've tried myself before, is to say, OK, let's just break down this problem. We can train at a patch level first. We take just subsets of a mammogram inside a little bounding box, annotated for radiology findings like benign masses or calcifications and things of that sort. We pre-train on that task to get this kind of patch-level prediction. And then once we're done with that, we fine-tune that initialized model on the entire image. So you have this two-stage training procedure. And actually, another paper that came out just yesterday does the exact same approach with some slightly different details.

But one of the things we wanted to investigate is whether you can just-- oh, and on the base architecture that's used for this: there are quite a few valid options that get reasonable performance on ImageNet, things like VGG, Wide ResNets, and ResNets. In my experience, they all perform fairly similarly, so it's kind of a speed/benefit trade-off. And there's an advantage to using fully convolutional architectures: if you have fully connected layers that assume a specific dimensionality, you can convert them to convolutional layers, but it's just more convenient to start with a fully convolutional architecture, which is going to be resolution-invariant. Yes.

AUDIENCE: In the last slide when you do patches--

ADAM YALA: Yes.

AUDIENCE: How do you label every single patch? Are they just labeled with a global label? Or do you have to actually look at each patch and figure out what's happening?

ADAM YALA: So normally what you do is you have positive patches labeled, and then you randomly sample other patches.
So from your annotations-- for example, a lot of people do this on public datasets like the DDSM dataset, which has entries like: here are benign masses, benign calcs, malignant calcs, et cetera. What people do is take those annotations, randomly select other patches, and say: if there's no annotation there, it's negative, and I'm going to call it healthy. And then they'll say that if an annotated bounding box overlaps a patch by some margin, the patch gets the same label. So it's done heuristically. And other, proprietary datasets play a similar trick. In general, people don't actually label every single pixel. There are relatively minor differences in how people do this, but the results are fairly similar regardless. Yes.

AUDIENCE: When you go from the patch level to the full image, if I understand correctly, the architecture hasn't quite changed because it's just convolution over a larger--

ADAM YALA: Exactly. So the last thing right before the prediction is normally-- ResNet, for example, does a global average pool, channel-wise across the entire feature map. So at the patch level they take in an image that's 250 by 250 and do the global average pool across that to make the prediction. And when they go up to the full-resolution image, now you're taking a global average pool over something like 3,000 by 2,000.

AUDIENCE: And presumably there might be some scaling issue that you might need to adjust. Do you do any of that? Or are you just--

ADAM YALA: So you feed it in at the full resolution the entire time. Do you see what I mean? You're taking a crop, so the resolution isn't changing, so the same filters should be able to scale accordingly.
But if you do things like average pooling, then any one region that has a very high activation will get averaged down. So, for example, in our work we use max pooling to get around that. Any other questions?

But if this looks complicated, have no worries, because we actually think it's totally unnecessary, and that's the next slide. So good for you. As I said before, the core problem is signal to noise. So one obvious thing to think about is: OK, maybe doing SGD with a batch size of three, when the lesion is less than 1% of the image, is a bad idea. If I just take less noisy gradients by increasing my batch size, which means using more GPUs and taking more steps before doing the weight update, we find that the need for this two-stage procedure goes away completely. These are experiments I did on the publicly available dataset a while back while we were figuring this out. If you take this kind of [INAUDIBLE] architecture and fine-tune with a batch size of 2, 4, 10, 16, and compare that to one-stage training where you just do the [INAUDIBLE] from the beginning, initialized with ImageNet, then as you use bigger batch sizes you quickly start to close the gap on the development AUC. And so across the experiments we do, broadly, we find that we get reasonably stable training by just using a batch size of 20 and above. It comes down to the fact that if you use a batch size of one, it's just particularly unstable. Another detail is that we always sample balanced batches, because otherwise you'd be sampling something like 20 batches before you see a single positive sample, and you just don't learn anything. Cool. So if you do that, you don't have to do anything complicated. You don't do any fancy cropping or anything of that sort, or deal with the VGG annotations; we found that using the VGG annotations for this task doesn't actually help.
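As an illustration of that one-stage recipe, the following is a minimal PyTorch sketch (not the speaker's actual code) of class-balanced sampling plus gradient accumulation over small per-GPU batches; `train_dataset`, its `labels` attribute, the optimizer, and the accumulation factor are placeholder assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import models

# Minimal sketch. Assumes `train_dataset` is a Dataset of (image, label)
# pairs with an integer `labels` attribute listing all labels.
model = models.resnet18(pretrained=True)
model.avgpool = nn.AdaptiveMaxPool2d((1, 1))     # global max pool, as mentioned
model.fc = nn.Linear(model.fc.in_features, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Class-balanced sampling: inverse-frequency weights give roughly 50/50 batches.
labels = torch.tensor(train_dataset.labels)
weights = 1.0 / torch.bincount(labels)[labels].float()
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(train_dataset, batch_size=3, sampler=sampler)  # ~3 images fit per GPU

accum_steps = 8  # 3 images x 8 accumulation steps = effective batch of 24
optimizer.zero_grad()
for i, (images, targets) in enumerate(loader):
    loss = criterion(model(images), targets) / accum_steps  # average over effective batch
    loss.backward()                  # gradients accumulate across calls
    if (i + 1) % accum_steps == 0:
        optimizer.step()             # one optimizer step per effective batch
        optimizer.zero_grad()
```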
OK. No questions? Yes.

AUDIENCE: So with the larger batch sizes, you don't use the magnified patches?

ADAM YALA: We don't. We just take the whole image from the beginning. The only annotation is at the whole-image level: was there a cancer within a year. It's a much simpler setup.

AUDIENCE: I don't get it. That's the same thing I thought you said you couldn't do for memory reasons.

ADAM YALA: Oh. So normally when you train the network, the most common approach is to do backprop and then step. But if you do backprop several times, you're accumulating the gradients, at least in PyTorch, and then you can do the step afterwards. So instead of doing the whole batch at one time, you just do it serially. There you're just trading time for space. The minimum, though, is that you have to fit at least a single image per GPU; in our case we can fit three. But to make this actually scale, we use four GPUs at a time. Yes.

AUDIENCE: How much is the trade-off with time?

ADAM YALA: So if I'm going to make the batch size any bigger, I would only do it in increments of, let's say, 12, because that's how much I can fit on my set of GPUs at the same time. But to control the experiments, you want to keep the same number of gradient updates per experiment. So if I want to use a batch size of 48, all my experiments, instead of taking about half a day, take about a day. So there's this natural trade-off as you go along. One of the things I'll mention at the very end is that we're considering an adversarial approach for something, and one of the annoying things about that is that if I have five discriminator steps, oh my god, it'll take three days per experiment.
And the [INAUDIBLE] of someone who's trying to design a better model becomes really slow when the experiments start taking that long. Yes.

AUDIENCE: So you said the annotations did not help with the training. Is that because the actual cancer itself is not really different from the dense tissue, and the location of it is what matters, and not the actual granularity of the-- what is the reason?

ADAM YALA: So in general, when something doesn't help, there are always two possibilities. One is that the whole-image signal subsumes that smaller-scale signal. The other is that there's a better way to use it that I haven't found that would help. And telling those apart is quite hard. As of now, the task we're [INAUDIBLE] on is whole-image classification. And on that task it's possible that the surrounding context is what matters: when you work on a patch with an annotation, you lose the context it appears in. So it's possible that just by looking at the whole context every time, you do as well, and you don't get any benefit from the zoomed-in boxes. However, we're not evaluating on an object-detection-style metric, where you ask how well you're catching the box. If we were, we'd probably have much better luck using the VGG annotations. Because you might be able to make some of those discriminations from something like: this looks like a breast that's likely to develop cancer at all. And the ability of the model to do that is part of why we can do risk modeling, which is going to be the last bit of the talk. Yes.

AUDIENCE: So do you do the object detection after you identify whether there's cancer or not?

ADAM YALA: So as of now we don't do object detection, in part because we're framing the problem as triage. There are quite a few toolkits out there to draw more boxes on the mammogram.
But the insight is that if there are already 1,000 things to look at, drawing more boxes per image just gives you 2,000 things to look at, and that isn't the problem we're trying to solve. There's quite a bit of effort there, and it's something we might look into in the future, but it's not the focus of this work. Yes.

AUDIENCE: So Connie was saying that the same pattern appearing in different parts of the breast can mean different things. But when you're looking at the entire image at once, I would worry intuitively about whether the convolutional architecture is going to be able to pick that up, because you're looking for a very small cancer on a very large image, and then you're looking for the significance of that very small cancer in different parts of the image, or in different contexts of the image. And I'm just-- I mean, it's a pleasant surprise that this works.

ADAM YALA: So there are two pieces that can help explain that. The first is that if you look at the receptive field of any given position in the last feature map at the very end of the network, each of those positions summarizes, through these convolutions, a fairly sizable part of the image. Each position at the very end ends up covering something like a 50 by 50 patch of the original image, after the network's downsampling stages. So each part does summarize its local context decently well. And when you take the max at the very end, you get a not perfect but OK global summary of the context of this image. So some of the dimensions can summarize, say, whether this is a dense breast, or some of the other pattern information that might tell you what kind of breast this is, whereas any one position can tell you that this looks like a cancer given its local context.
So you do get some level of summarization, both because of the channel-wise max at the end, and because each point, through the many, many convolutions of different strides, gives you some of that summary effect. OK, great. I'm going to jump forward.

So we've talked about how to make this learn. It's actually not that tricky if we just do it carefully and tune. Now I'll talk about how to use this model to actually deliver on the triage idea. To recap my choices: ImageNet initialization is going to make your life a happier time, use bigger batch sizes, and the architecture choice doesn't really matter as long as it's convolutional. The overall setup we use in this work, and across many other projects, is training independently per image. Now, this is a harder task, because you're not using any of the other views and you're not using prior mammograms, but there are other reasons we do it this way. We get the prediction for the whole exam by taking the maximum across the different images. So if any image says this breast has cancer, the exam has cancer, and you should get it checked. And at each development epoch we evaluate the ability of the model to do the triage task, which I'll step into in a second, and we take the best model at triage. So your true end metric is what you're measuring during training, and you do model selection and hyperparameter tuning based on that.

And the way we do triage: our goal here is to mark as many people as possible as healthy without missing a single cancer that we otherwise would have caught. So intuitively: take all the cancers that the radiologists would have caught, look at the model's probability of cancer for those exams, take the minimum of those, and call that the threshold. That's exactly what we do.
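Here is a minimal sketch of that exam-level aggregation and threshold selection on the development set; the array names are placeholders, and this is an illustration rather than the authors' code.

```python
import numpy as np

# Minimal sketch (placeholder names).
# image_probs: (num_exams, num_images) model probabilities per image of an exam
# radiologist_caught: boolean per exam, True if the radiologist flagged a cancer
def pick_triage_threshold(image_probs: np.ndarray,
                          radiologist_caught: np.ndarray) -> float:
    exam_probs = image_probs.max(axis=1)        # exam score = max over its images
    caught = exam_probs[radiologist_caught]     # cancers the radiologists caught
    return float(caught.min())                  # lowest score among caught cancers

# At test time, anything scoring below the threshold would be triaged as
# not needing a radiologist read:
# triaged = test_exam_probs < threshold
```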
And another detail that's often quite relevant: if you want these models to output a reasonable probability, as in "this is the probability of cancer," and you train on 50/50-sampled batches, then by default your model thinks the average incidence is 50%, so it's wildly overconfident all the time. To calibrate that, one really simple trick is something called Platt's method, where you fit a two-parameter sigmoid, just a scale and a shift, on the development set so the outputs actually fit the distribution. That way the average predicted probability matches the incidence, and you don't get these crazy, off-kilter probabilities.

OK. So, analysis. The objectives here are similar across all the projects. One, does this thing work? Two, does this thing work across all the people it's supposed to work for? So we did a subgroup analysis. First we looked at the AUC of this model, the ability to discriminate cancer or not, across races, across age groups, and across density categories. And finally, how does this relate to the radiologists' assessments? And if we actually used this at test time on the test set, what would have happened? That's a kind of simulation before a full clinical implementation.

So the overall AUC here was 82, with a confidence interval from 80 to 85. We did the analysis by age and found that the performance was pretty similar across every age group. What's not shown here are the confidence intervals, but the key takeaway is that there was no noticeable gap by age group. We repeated this analysis by race and saw the same trend again: the performance generally ranged around 82, and in places where the gap was bigger, the confidence interval was correspondingly bigger due to smaller sample sizes, because MGH is about 80% white.
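As an illustration of this kind of subgroup check, here is a minimal sketch, assuming scikit-learn is available, of computing AUC per subgroup with a bootstrap confidence interval; the variable names are placeholders, not the authors' analysis pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Minimal sketch (placeholder names): per-subgroup AUC with bootstrap CIs.
def subgroup_auc(y_true, y_score, groups, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        y, s = y_true[mask], y_score[mask]
        point = roc_auc_score(y, s)
        boots = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y), len(y))   # resample with replacement
            if len(np.unique(y[idx])) < 2:          # AUC needs both classes
                continue
            boots.append(roc_auc_score(y[idx], s[idx]))
        lo, hi = np.percentile(boots, [2.5, 97.5])
        results[g] = (point, lo, hi)                # point estimate and 95% CI
    return results
```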
We saw the exact same trend by density. The outlier here is very dense breasts, but there are only about 100 of those in the test set, so that confidence interval goes from something like 60 to 90. So as far as we can tell, for the other three categories the results are well within the confidence intervals and very similar, once again around 82. OK. So we have a decent idea that this model, at least within the population at MGH, actually serves the relevant populations, as far as we know so far.

The next question is: how does the model's assessment relate to the radiologists' assessment? To look at that, on the test set we took the radiologists' true positives, false positives, true negatives, and false negatives, and asked where they fall within the model's distribution of percentile risk. If an exam is below the threshold, we color it in this cyan color, and if it's above the threshold, we color it in this purple color. So this is basically triaged versus not triaged. The first thing to notice, and this is the true positives, is that there's a pretty steep drop-off. Only one true positive fell below the threshold in a test set of 26,000 exams, so that difference wasn't statistically significant. The vast majority of them are in this top 10%; you see a clear trend that they pile up towards the higher percentiles. Whereas if you look at the false positive assessments, this trend is much weaker. You still see some correlation, more false positives at higher percentiles, but it's much less stark. And this actually means that a lot of the radiologists' false positives are placed below the threshold. So because these assessments aren't completely concordant, and we're not just modeling what the radiologist would have said, we get an anticipated benefit of significantly reducing false positives because of the ways they disagree.
And finally, adding to that further, if you look at the true negative assessments, there isn't much of a trend in where they fall. So it shows that the model and the radiologists are picking up on different things, and where they disagree gives you both areas to improve and ancillary benefits, because now we can reduce false positives.

This leads directly into simulating the impact. So one of the things we did was to say, OK, retrospectively on the test set, as a simulation before we truly plug it in: if people didn't read below the triage threshold, so we can't catch any more cancers this way, but we can reduce false positives, what would have happened? At the top we have the original performance: looking at 100% of the mammograms, sensitivity was 98.6 with a specificity of 93. And in the simulation, the sensitivity dropped, not significantly, to 90.1, while the specificity significantly improved to 93.7, while reading 81% of the mammograms. So this is promising preliminary data. But to really evaluate this and go forward, our next step-- let's see if-- oh, I'm going to get to that in a second. Our next step is that we need to do a clinical implementation to really figure this out, because there's a core assumption here that people read the mammograms the same way. But if you have this higher incidence, what does that mean? Can you focus more on the people that are more suspicious? And is the right way to do this just a single threshold below which you don't read? Or should the most suspicious cases get a double read, since they're much more likely to have cancer? So there's quite a bit of exploration here to say: given that we have these tools that give us some probability of cancer, not perfect, but something, how well can we use that to improve care today?
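Here is a minimal sketch of that kind of retrospective simulation, treating triaged-out exams as negative reads and recomputing sensitivity and specificity; the names and the exact accounting are simplifying assumptions, not the study's protocol.

```python
import numpy as np

# Minimal sketch (simplified assumptions).
# y_cancer: 1 if cancer found within a year; rad_positive: 1 if the radiologist
# flagged the exam; exam_prob: model's exam-level probability; threshold: from dev.
def simulate_triage(y_cancer, rad_positive, exam_prob, threshold):
    read = exam_prob >= threshold                  # exams still read by radiologists
    # Below the threshold the simulated read is negative; above it, keep the
    # radiologist's original assessment.
    sim_positive = np.where(read, rad_positive, 0)
    tp = np.sum((sim_positive == 1) & (y_cancer == 1))
    fp = np.sum((sim_positive == 1) & (y_cancer == 0))
    tn = np.sum((sim_positive == 0) & (y_cancer == 0))
    fn = np.sum((sim_positive == 0) & (y_cancer == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    workload = read.mean()                         # fraction of mammograms still read
    return sensitivity, specificity, workload
```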
So as a quiz, can you tell which of these will be triaged? There's no cherry-picking here; I randomly picked four mammograms from below and above the threshold. Can anyone guess which side, left or right, was triaged? This is not graded, Chris, so you know.

AUDIENCE: Raise your hand for--

ADAM YALA: Oh yeah. Raise your hand for the left. OK. Raise your hand for the right. Here we go. Well done. Well done. OK.

And then the next step, as I said before, is that we need to push to a clinical implementation, because that's where the rubber hits the road. We'll identify whether there are any biases we didn't detect, and we need to ask: can we deliver this value?

So the next project is on assessing breast cancer risk. This is the same mammogram I showed you earlier; it was diagnosed with breast cancer in 2014. It's actually my advisor Regina's. And you can see that in 2013 it's there. In 2012 it looks much less prominent. And five years earlier, you're really looking at breast cancer risk. If you're trying to tell something from an image of a breast that's going to be healthy for a long time, you're really trying to model the likelihood of this breast developing cancer in the future. Now, modeling breast cancer risk, as Connie said earlier, is not a new problem; it's been quite well researched in the community. The more classical approach is to look at global health factors: the person's age, their family history, whether or not they've had menopause, and any other facts we can treat as markers of their health, to try to predict whether this person is at risk of developing breast cancer. People have thought before that the image contains something; the way they've captured that is through this kind of subjective breast density marker. And the improvement seen from that is marginal, from 61 to 63. And as before, the sketch we're going to go through is dataset collection, modeling, and analysis.
And as before, the sketch we're going to go through is dataset collection, modeling, and analysis.

For dataset collection we followed a very similar template. We took consecutive mammograms from 2009 to 2012, and we took outcomes, once again, from the EHR and the Partners registry. We didn't do exclusions based on race or anything of that sort, or on implants. But we did exclude negatives without enough follow-up: if someone didn't have cancer within three years but disappeared from the system, we didn't count them as a negative, so that we have some certainty in both the modeling and the analysis. And as always, we split by patient into train, dev, and test.

The modeling is very similar. It's the same kind of template and the same lessons as from triage, except we experimented with a model that uses only the image and, for the sake of analysis, a model where that image model is concatenated with the traditional risk factors at the last layer and trained jointly. Does that make sense for everyone? So I'm going to call those Image-Only and Image+RF, or hybrid. OK. Cool?

Our goals for the analysis: as before, we want to see whether this model actually serves the whole population. Is it discriminative across race, menopause status, and family history? How does it relate to classical notions of risk? And are we actually doing any better? So let's dive directly into that, assuming there are no questions. Good.

Just to remind you, this is the setting. One thing I forgot to mention, which is why I had this slide here to remind me, is that we excluded cancers from the first year from the test set, so that it is truly a negative screening population. That way we disentangle cancer detection from cancer risk. OK. Cool.
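A minimal PyTorch sketch of the kind of late-fusion hybrid described above, where image features and classical risk factors are concatenated at the last layer and trained jointly. The backbone choice, feature sizes, and input shapes here are illustrative assumptions, not the actual architecture from the talk.

```python
import torch
import torch.nn as nn
from torchvision import models

class HybridRiskModel(nn.Module):
    """Image encoder plus risk factors fused at the final layer (the
    'Image+RF' / hybrid setup). Dimensions are illustrative."""
    def __init__(self, num_risk_factors: int, num_classes: int = 2):
        super().__init__()
        backbone = models.resnet18(weights=None)     # any CNN encoder would do
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                  # keep pooled image features
        self.encoder = backbone
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim + num_risk_factors, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, image: torch.Tensor, risk_factors: torch.Tensor) -> torch.Tensor:
        img_feats = self.encoder(image)                      # (B, feat_dim)
        fused = torch.cat([img_feats, risk_factors], dim=1)  # late fusion
        return self.classifier(fused)                        # risk logits

# Usage sketch: mammograms are grayscale, so in practice the single channel
# would be repeated to three channels (or the first conv replaced).
model = HybridRiskModel(num_risk_factors=10)
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 10))
```

The Image-Only variant is the same network with the risk-factor input (and the extra columns of the first linear layer) removed.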
So Tyrer-Cuzick is the prior state-of-the-art model. It's a model based out of the UK; its developer, Sir Cuzick, was knighted for this work, and it's very commonly used. That model had an AUC of 62. Our image-only model had an AUC of about 68, and the hybrid one had an AUC of 70.

So what does this kind of AUC gain give you when you're using a risk model? It gives you the ability to build better high-risk and low-risk cohorts. In terms of high-risk cohorts, our best model placed about 30% of all the cancers in the population in its top 10% of predicted risk, and 3% of all the cancers in its bottom 10%, compared to 18% and 5% for the prior state of the art. What this enables you to do, if you're going to say that this top 10% should qualify for MRI, is start fighting the problem that the majority of people who get cancer never had an MRI, and the majority of people who get MRIs don't need them. It all comes down to whether your risk model actually places the right people into the right buckets.

Now, we saw that this trend of outperforming the prior state of the art held across races. And one of the things that was kind of astonishing was that although Tyrer-Cuzick performed reasonably for white women, which makes sense because it was developed using only white women in the UK, it was worse than random [INAUDIBLE] for African-American women. That emphasizes the importance of this kind of analysis: making sure that the data you have is reflective of the population you're trying to serve, and actually doing the analysis accordingly. So we saw that our model held across races, and we see the same trend across pre- and postmenopausal women and with and without family history.
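The cohort and subgroup views above are straightforward to compute. Here is a minimal sketch; the function names and the assumption that `labels` and `groups` are aligned NumPy arrays are mine, not from the study code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def decile_capture(risk_scores, labels, quantile=0.10):
    """Fraction of all cancers landing in the top and bottom `quantile` of
    predicted risk: the high-risk / low-risk cohort view described above."""
    hi_cut = np.quantile(risk_scores, 1 - quantile)
    lo_cut = np.quantile(risk_scores, quantile)
    n_cancers = labels.sum()
    top = labels[risk_scores >= hi_cut].sum() / n_cancers
    bottom = labels[risk_scores <= lo_cut].sum() / n_cancers
    return top, bottom

def subgroup_aucs(risk_scores, labels, groups):
    """AUC per subgroup (race, menopausal status, family history, ...);
    this is the check that exposes a model doing worse than random for
    a subpopulation even when the overall AUC looks fine."""
    return {g: roc_auc_score(labels[groups == g], risk_scores[groups == g])
            for g in np.unique(groups)}
```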
One thing we did, in terms of a more granular comparison of performance, was to look at the risk thirds for our model and for the Tyrer-Cuzick model: what trend do you see in the cases where the two disagree, where which one is right is ambiguous? What I'm showing in these boxes is the cancer incidence in that part of the population; the darker the box, the higher the incidence. And on the right-hand side are just random images from cases that fall within those boxes. Does that make sense for everyone? Great.

So a clear trend you see is that, for example, if TCv8 calls you high risk but we call you low, that cell has a lower incidence than if we call you medium and they call you low. You see this consistent column-wise pattern, showing that the discrimination truly does follow the deep learning model and not the classical approach. And looking at the random images selected from the cases where we disagree supports the notion that it's not just that our high-risk column is the densest, craziest-looking breasts; there's something more subtle the model is picking up that's actually indicative of breast cancer risk.

We did a very similar analysis by traditional breast density, as labeled by the original radiologist, on the development set and on the test set, and we end up seeing the same trend: someone who is non-dense but who we call high risk is at much higher risk than someone who is dense but who we call low risk.
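The incidence grid described above is just a cross-tabulation. A small sketch, under the assumption that both models' scores and the outcome labels are aligned arrays (the same idea works for the density-versus-model comparison, with density categories in place of one model's tertiles):

```python
import numpy as np

def tertile_incidence_grid(model_scores, tc_scores, labels):
    """3x3 grid of cancer incidence over (our model tertile) x (Tyrer-Cuzick
    tertile). If discrimination follows one model, incidence should vary
    along that model's axis and stay comparatively flat along the other."""
    def to_tertile(scores):
        cuts = np.quantile(scores, [1/3, 2/3])
        return np.digitize(scores, cuts)        # 0 = low, 1 = medium, 2 = high

    ours, theirs = to_tertile(model_scores), to_tertile(tc_scores)
    grid = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            cell = (ours == i) & (theirs == j)
            grid[i, j] = labels[cell].mean() if cell.any() else np.nan
    return grid   # rows: our tertile, columns: TC tertile, values: incidence
```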
And as before, the real next step here, to make this truly valuable and truly useful, is actually implementing it clinically, prospectively and seamlessly, with more centers and a broader population, to see whether it works and whether it delivers the kinds of benefits we care about. And also figuring out what the lever of change is once you know that someone is high risk: perhaps MRI, perhaps more frequent screening. That's the gap between a technology that's useful on paper and a technology that's actually useful in real life.

So, I am moving on schedule. Now I'm going to talk about how to mess up. And it's actually quite interesting; there are so many ways, and I've fallen into them a few times myself. It happens. Following the same sketch, you can mess up in dataset collection, which is probably the most common by far. You can mess up in modeling, which I'm doing right now, and it's very sad. And you can mess up in analysis, which is really preventable.

So, in dataset collection: enriched datasets are the most common thing you see in this space. If you find a public dataset, it's most likely going to be something like 50-50 cancer versus not cancer. And oftentimes these datasets have some sort of bias in the way they were collected. It might be that your negative cases come from fewer centers than your positive cases, or that they were collected in different years. And actually, this is something we ran into earlier in our own work. Once upon a time, Connie and I were in Shanghai for the opening of a cancer center there. At that time we had all the cancers from the MGH dataset, about 2,000, but the mammograms were still being collected year by year from 2009. So at that point we only had about half of the negatives by year, but all of the cancers. And I had just come up with a slightly more complicated model, as one often does; I looked at several images at the same time. My AUC went up to, like, 95, and I was bouncing off the walls. Then I had some suspicion: wait a second, this is too high, this is too good. And we realized that all those numbers were kind of a myth.

So if you do these kinds of case-control constructions, unless you're very careful about the way the dataset was built, you can easily run into these issues, and your test set won't protect you from them. Having a clean dataset that truly follows the spectrum you expect to use the model on, that is, a natural distribution collected through routine clinical care, is important for saying whether it will behave the way we actually want it to when it's used.
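One cheap sanity check for this kind of construction bias is to see whether acquisition metadata alone can predict the label. This is a sketch of that idea, not something from the talk; the metadata columns and function name are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import OneHotEncoder

def metadata_leakage_auc(meta, labels):
    """Fit a classifier on acquisition metadata only (e.g. columns for
    year, site, device) and see how well it predicts cancer. An AUC well
    above 0.5 means the dataset's construction, not the images, can explain
    part of any apparent 'cancer detection' performance."""
    X = OneHotEncoder(handle_unknown="ignore").fit_transform(meta)
    preds = cross_val_predict(LogisticRegression(max_iter=1000), X, labels,
                              cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(labels, preds)
```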
In general, some of this you can think through from first principles. But it stresses the importance of actually testing prospectively and with external validation, to see whether things still work when you take away some of the biases in your dataset, and of being really careful about that. The common approach of just controlling for age or for density is not enough when the model can catch really fine-grained signals.

How to mess up in modeling. There have been adventures in this space as well. One of the things I recently discovered is that the actual mammography device the image was captured on (you saw a bunch of mammograms, probably from different machines) has an unexpected impact on the model. The distribution of cancer probabilities produced by the model is not independent of the device. That's something we're going through now; we actually ran into it while working on clinical implementation, and we're setting up a kind of conditional adversarial training to try to rectify the issue. It's important, and it's much harder to catch from first principles, but it's something to think through as you really start deploying these things. These issues pop up easily, and they're harder to avoid.

And lastly, and I think probably the most important one, is messing up in analysis. As in the previous sections, it's quite common in this field-- yes?

AUDIENCE: With the adversarial setup there, just to understand what you do: do you have a discriminator that predicts the machine, and then you train against that?

ADAM YALA: So my answer is going to be two parts. One, it doesn't work as well as I want it to yet, so really, who knows? But my best hunch, based on what's been done before for other kinds of work, specifically with radio signals, is to use a conditional adversarial.
So the discriminator gets both the label and the image representation, and it tries to predict the device, so that you take away the device information that isn't just contained within the label distribution. That's been shown to be very helpful for people trying to do [INAUDIBLE] detection based not on Wi-Fi but on radio waves. And the [INAUDIBLE], but also, it seems to be the most common approach I've seen in the literature. So it's something I'm going to try soon; I haven't implemented it yet. It's just GPU time and waiting to queue up the experiment.
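To make that answer concrete, here is a minimal sketch of a conditional device adversary using a gradient reversal layer. This is my illustration of the general technique being described, not the speaker's implementation; the layer sizes and class names are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, negated gradient on the backward pass:
    the standard trick for adversarial feature learning."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class ConditionalDeviceAdversary(nn.Module):
    """Predicts the acquisition device from the image representation *and*
    the cancer label, so the encoder is only pushed to remove device
    information beyond what the label distribution itself explains."""
    def __init__(self, feat_dim: int, num_labels: int, num_devices: int, lamb: float = 1.0):
        super().__init__()
        self.lamb = lamb
        self.net = nn.Sequential(
            nn.Linear(feat_dim + num_labels, 128),
            nn.ReLU(),
            nn.Linear(128, num_devices),
        )

    def forward(self, features: torch.Tensor, label_onehot: torch.Tensor) -> torch.Tensor:
        reversed_feats = GradReverse.apply(features, self.lamb)
        return self.net(torch.cat([reversed_feats, label_onehot], dim=1))

# Training sketch: total loss = cancer loss + cross-entropy of the adversary's
# device prediction; the reversed gradient pushes the encoder to drop device cues.
```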
And the last part of how to mess up is the analysis. One thing that's common is that people assume synthetic experiments are the same thing as clinical implementation. People do reader studies very often, and it's quite common that they don't translate: you might find that computer-aided detection makes a huge difference in a reader study, while, as Connie actually showed, it was harmful in real life. So it's important to do these real-world experiments, so we can say what is actually happening and whether we deliver the real benefit we expected.

And a hopefully less common mistake nowadays is that people often exclude all the inconvenient cases. There was a paper that came out just yesterday on cancer detection that used a kind of patch-based architecture, and if you read more closely into the details, they excluded all women with breasts they considered too small by some threshold, for modeling convenience. But that might disproportionately affect Asian women in that population, and they didn't do a subgroup analysis across races, so it's hard to know what is happening there. If your population is mostly white, which it is at MGH and at a lot of the centers where these models get developed, then reporting the average isn't enough to really validate that. That's how you end up with things like the Tyrer-Cuzick model being worse than random, and especially harmful, for African-American women. You can guard against a lot of this from first principles, but some of these things you can only really find by actively monitoring and asking: is there any subpopulation I didn't think about a priori that could be harmed?

And finally, I talked about clinical deployments. We've actually done this a couple of times, and I'm going to switch over to Connie really soon. In general, what you want to do is make it as easy as possible for the in-house IT team to use your tool. We've gone through this, depending on how you count, once for density and then something like three more times around the same time, and I've spent many hours sitting there. The broad way we've set it up so far is that we have a Docker container managing a web app that holds the model, and the web app has a kind of back-end processing toolkit. So the steps all of our deployments follow, under a unified framework, are: the IT application gets some images out of the PACS system and sends them over to our application; we convert them to PNG in the way we expect, because we encapsulate that functionality; we run the models, send the result back, and then it gets written back to the EHR.
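A minimal sketch of what such a containerized scoring service can look like, assuming a Flask app and a serialized PyTorch model; the endpoint path, file name, preprocessing, and model format are all illustrative assumptions, not the deployed system.

```python
import io
import numpy as np
import torch
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)
model = torch.jit.load("model.pt").eval()   # hypothetical serialized model

def preprocess(img: Image.Image) -> torch.Tensor:
    # Grayscale, fixed size; purely illustrative preprocessing.
    img = img.convert("L").resize((224, 224))
    x = torch.from_numpy(np.array(img, dtype=np.float32) / 255.0)
    return x.unsqueeze(0).unsqueeze(0)       # (1, 1, 224, 224)

@app.route("/score", methods=["POST"])
def score():
    # The hospital-side IT application POSTs the image as a multipart form field.
    file = request.files["image"]
    img = Image.open(io.BytesIO(file.read()))
    with torch.no_grad():
        prob = torch.sigmoid(model(preprocess(img)))[0].item()
    return jsonify({"cancer_probability": prob})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Keeping the PNG conversion and preprocessing inside the service, as described above, means the hospital IT side only has to move bytes in and out.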
One of the things I ran into was that they didn't actually know how to use things like HTTP, because it's not actually standard within their infrastructure. So being cognizant that some of these tech-standard things, like plain HTTP requests and responses, are less standard inside their infrastructure, and looking up how to actually do them in C#, or whatever language they have, is really what has enabled us to unblock these things and actually plug the models in.

And that is it for my part. So I'm gonna hand it back-- oh, yes?

AUDIENCE: So you're writing stuff in the IT application in C# to do API requests?

ADAM YALA: So they're writing it; I just meet with them to tell them how to write it. But yes. In general there are libraries, but the entire environment is Windows, and Windows has very poor support for lots of things you would expect it to support well. For example, if you want to send an HTTP request with a multipart form and just put the images in that form, apparently that has bugs in whatever Windows version they use today, so the vanilla version didn't work. Docker for Windows also has bugs: I had to set up a kind of logging function for them to automatically tail the logs inside the container, and it just doesn't work in Docker for Windows.

AUDIENCE: [INAUDIBLE] questions, because he is short on time.

ADAM YALA: Yeah, so we can get to this at the end. I want to hand off to Connie. If you have any questions, grab me after.
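As a companion to the client-side discussion above, this is roughly what the multipart upload looks like, sketched in Python rather than C# purely to illustrate the protocol; the URL and field name just match the hypothetical service sketch earlier.

```python
import requests

# Send one exam view to the scoring service as a multipart form upload.
with open("exam_view.png", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/score",
        files={"image": ("exam_view.png", f, "image/png")},
    )
resp.raise_for_status()
print(resp.json()["cancer_probability"])
```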