1
00:00:04,500 --> 00:00:09,950
In this video, we'll create a
basic scatterplot using ggplot.

2
00:00:09,950 --> 00:00:12,570
Let's start by
reading in our data.

3
00:00:12,570 --> 00:00:14,500
We'll be using the
same data set we

4
00:00:14,500 --> 00:00:18,370
used during week one, WHO.csv.

5
00:00:18,370 --> 00:00:22,850
So let's call it WHO and
use the read.csv function

6
00:00:22,850 --> 00:00:26,550
to read in the
data file WHO.csv.

7
00:00:26,550 --> 00:00:28,670
Make sure you're in the
directory containing

8
00:00:28,670 --> 00:00:30,910
this file first.

9
00:00:30,910 --> 00:00:33,610
Now, let's take a look at
the structure of the data

10
00:00:33,610 --> 00:00:37,040
using the str function.

11
00:00:37,040 --> 00:00:42,110
We can see that we have 194
observations, or countries,

12
00:00:42,110 --> 00:00:47,090
and 13 different variables-- the
name of the country, the region

13
00:00:47,090 --> 00:00:50,750
the country's in, the
population in thousands,

14
00:00:50,750 --> 00:00:55,500
the percentage of the
population under 15 or over 60,

15
00:00:55,500 --> 00:01:00,190
the fertility rate or average
number of children per woman,

16
00:01:00,190 --> 00:01:04,950
the life expectancy in years,
the child mortality rate,

17
00:01:04,950 --> 00:01:08,840
which is the number of children
who die by age five per 1,000

18
00:01:08,840 --> 00:01:15,789
births, the number of cellular
subscribers per 100 population,

19
00:01:15,789 --> 00:01:19,960
the literacy rate among
adults older than 15,

20
00:01:19,960 --> 00:01:23,300
the gross national
income per capita,

21
00:01:23,300 --> 00:01:27,080
the percentage of male children
enrolled in primary school,

22
00:01:27,080 --> 00:01:29,470
and the percentage of
female children enrolled

23
00:01:29,470 --> 00:01:31,620
in primary school.

24
00:01:31,620 --> 00:01:34,940
In week one, the very
first plot we made in R

25
00:01:34,940 --> 00:01:37,750
was a scatterplot
of fertility rate

26
00:01:37,750 --> 00:01:40,320
versus gross national income.

27
00:01:40,320 --> 00:01:43,920
Let's make this plot again,
just like we did in week one.

28
00:01:43,920 --> 00:01:48,170
So we'll use the plot function
and give as the first variable

29
00:01:48,170 --> 00:01:52,680
WHO$GNI, and then give
as the second variable,

30
00:01:52,680 --> 00:01:53,430
WHO$FertilityRate.

31
00:01:58,750 --> 00:02:02,140
This plot shows us that
a higher fertility rate

32
00:02:02,140 --> 00:02:05,520
is correlated with
a lower income.

33
00:02:05,520 --> 00:02:08,380
Now, let's redo
this scatterplot,

34
00:02:08,380 --> 00:02:11,110
but this time using ggplot.

35
00:02:11,110 --> 00:02:14,270
We'll see how ggplot can be
used to make more visually

36
00:02:14,270 --> 00:02:17,770
appealing and
complex scatterplots.

37
00:02:17,770 --> 00:02:23,050
First, we need to install
and load the ggplot2 package.

38
00:02:23,050 --> 00:02:24,950
So first type
install.packages("ggplot2").

39
00:02:32,570 --> 00:02:34,850
When the CRAN mirror
window pops up,

40
00:02:34,850 --> 00:02:36,840
make sure to pick a
location near you.

41
00:02:40,500 --> 00:02:43,070
Then, as soon as the
package is done installing

42
00:02:43,070 --> 00:02:45,260
and you're back at
the blinking cursor,

43
00:02:45,260 --> 00:02:47,170
load the package with
the library function.

44
00:02:51,110 --> 00:02:53,800
Now, remember we need
at least three things

45
00:02:53,800 --> 00:02:58,680
to create a plot using ggplot--
data, an aesthetic mapping

46
00:02:58,680 --> 00:03:02,020
of variables in the data
frame to visual output,

47
00:03:02,020 --> 00:03:04,510
and a geometric object.

48
00:03:04,510 --> 00:03:07,140
So first, let's
create the ggplot

49
00:03:07,140 --> 00:03:10,640
object with the data and
the aesthetic mapping.

50
00:03:10,640 --> 00:03:14,360
We'll save it to the
variable scatterplot,

51
00:03:14,360 --> 00:03:17,329
and then use the
ggplot function, where

52
00:03:17,329 --> 00:03:21,470
the first argument is the
name of our data set, WHO,

53
00:03:21,470 --> 00:03:25,590
which specifies the data to
use, and the second argument

54
00:03:25,590 --> 00:03:28,750
is the aesthetic mapping, aes.

55
00:03:28,750 --> 00:03:31,070
In parentheses,
we have to decide

56
00:03:31,070 --> 00:03:34,960
what we want on the x-axis and
what we want on the y-axis.

57
00:03:34,960 --> 00:03:38,380
We want the x-axis
to be GNI, and we

58
00:03:38,380 --> 00:03:42,810
want the y-axis to
be FertilityRate.

59
00:03:42,810 --> 00:03:47,400
Go ahead and close both sets
of parentheses, and hit Enter.

60
00:03:47,400 --> 00:03:50,440
Now, we need to tell
ggplot what geometric

61
00:03:50,440 --> 00:03:52,480
objects to put in the plot.

62
00:03:52,480 --> 00:03:57,060
We could use bars, lines,
points, or something else.

63
00:03:57,060 --> 00:04:00,560
This is a big difference between
ggplot and regular plotting

64
00:04:00,560 --> 00:04:03,690
in R. You can build
different types of graphs

65
00:04:03,690 --> 00:04:06,670
by using the same ggplot object.

66
00:04:06,670 --> 00:04:08,820
There's no need to learn
one function for bar

67
00:04:08,820 --> 00:04:14,290
graphs, a completely different
function for line graphs, etc.

68
00:04:14,290 --> 00:04:18,839
So first, let's just create a
straightforward scatterplot.

69
00:04:18,839 --> 00:04:22,450
So the geometry we
want to add is points.

70
00:04:22,450 --> 00:04:26,430
We can do this by typing the
name of our ggplot object,

71
00:04:26,430 --> 00:04:30,690
scatterplot, and then adding
the function, geom_point().

72
00:04:34,750 --> 00:04:38,120
If you hit Enter, you should
see a new plot in the Graphics

73
00:04:38,120 --> 00:04:41,080
window that looks similar
to our original plot,

74
00:04:41,080 --> 00:04:43,980
but there are already a
few nice improvements.

75
00:04:43,980 --> 00:04:47,270
One is that we don't have the
data set name with a dollar

76
00:04:47,270 --> 00:04:51,140
sign in front of the
label on each axis, just

77
00:04:51,140 --> 00:04:53,030
the variable name.

78
00:04:53,030 --> 00:04:54,970
Another is that we
have these nice grid

79
00:04:54,970 --> 00:04:57,640
lines in the background
and solid points

80
00:04:57,640 --> 00:05:00,880
that pop out from
the background.

81
00:05:00,880 --> 00:05:03,690
We could have made a
line graph just as easily

82
00:05:03,690 --> 00:05:05,780
by changing point to line.

83
00:05:05,780 --> 00:05:09,750
So in your R console, hit
the up arrow, and then just

84
00:05:09,750 --> 00:05:13,410
delete "point" and type
"line" and hit Enter.

85
00:05:13,410 --> 00:05:17,020
Now, you can see a line
graph in the Graphics window.

86
00:05:17,020 --> 00:05:19,290
However, a line doesn't
really make sense

87
00:05:19,290 --> 00:05:21,880
for this particular plot,
so let's switch back

88
00:05:21,880 --> 00:05:25,200
to our points, just by hitting
the up arrow twice and hitting

89
00:05:25,200 --> 00:05:27,890
Enter.

90
00:05:27,890 --> 00:05:31,630
In addition to specifying that
the geometry we want is points,

91
00:05:31,630 --> 00:05:35,010
we can add other options,
like the color, shape,

92
00:05:35,010 --> 00:05:37,080
and size of the points.

93
00:05:37,080 --> 00:05:41,460
Let's redo our plot with blue
triangles instead of circles.

94
00:05:41,460 --> 00:05:45,240
To do that, go ahead and hit
the up arrow in your R console,

95
00:05:45,240 --> 00:05:48,640
and then in the empty
parentheses for geom_point,

96
00:05:48,640 --> 00:05:51,850
we're going to specify some
properties of the points.

97
00:05:51,850 --> 00:05:57,920
We want the color to be equal
to "blue", the size to equal 3--

98
00:05:57,920 --> 00:06:00,110
we'll make the points
a little bigger --

99
00:06:00,110 --> 00:06:03,190
and the shape equals 17.

100
00:06:03,190 --> 00:06:06,760
This is the shape number
corresponding to triangles.

101
00:06:06,760 --> 00:06:09,760
If you hit Enter, you
should now see in your plot

102
00:06:09,760 --> 00:06:13,320
blue triangles
instead of black dots.

103
00:06:13,320 --> 00:06:15,120
Let's try another option.

104
00:06:15,120 --> 00:06:21,310
Hit the up arrow again, and
change "blue" to "darkred",

105
00:06:21,310 --> 00:06:24,460
and change shape to 8.

106
00:06:24,460 --> 00:06:27,720
Now, you should
see dark red stars.

107
00:06:27,720 --> 00:06:29,840
There are many different
colors and shapes

108
00:06:29,840 --> 00:06:31,480
that you can specify.

109
00:06:31,480 --> 00:06:36,320
We've provided some information
in the text below this video.

110
00:06:36,320 --> 00:06:38,430
Now, let's add a
title to the plot.

111
00:06:38,430 --> 00:06:41,010
You can do that by
hitting the up arrow,

112
00:06:41,010 --> 00:06:45,740
and then at the very end
of everything, add ggtitle,

113
00:06:45,740 --> 00:06:48,210
and then in parentheses
in quotes, the title

114
00:06:48,210 --> 00:06:49,750
you want to give your plot.

115
00:06:49,750 --> 00:06:53,200
In our case, we'll
call it "Fertility Rate

116
00:06:53,200 --> 00:06:56,240
vs. Gross National Income".

117
00:06:59,610 --> 00:07:01,070
If you look at your
plot again, you

118
00:07:01,070 --> 00:07:05,610
should now see that it has
a nice title at the top.

119
00:07:05,610 --> 00:07:08,160
Now, let's save
our plot to a file.

120
00:07:08,160 --> 00:07:12,450
We can do this by first
saving our plot to a variable.

121
00:07:12,450 --> 00:07:15,190
So in your R console,
hit the up arrow,

122
00:07:15,190 --> 00:07:18,430
and scroll to the
beginning of the line.

123
00:07:18,430 --> 00:07:21,260
Before scatterplot,
type fertilityGNIplot

124
00:07:21,260 --> 00:07:29,080
= and then everything else.

125
00:07:29,080 --> 00:07:31,420
This will save our
scatterplot to the variable,

126
00:07:31,420 --> 00:07:32,130
fertilityGNIplot.

127
00:07:35,190 --> 00:07:38,830
Now, let's create a file we
want to save our plot to.

128
00:07:38,830 --> 00:07:41,120
We can do that with
the PDF function.

129
00:07:41,120 --> 00:07:43,700
And then in parentheses
and quotes, type the name

130
00:07:43,700 --> 00:07:45,080
you want your file to have.

131
00:07:45,080 --> 00:07:46,180
We'll call it MyPlot.pdf.

132
00:07:50,159 --> 00:07:53,100
Now, let's just print our plot
to that file with the print

133
00:07:53,100 --> 00:07:54,730
function -- so
print(fertilityGNIplot).

134
00:07:59,930 --> 00:08:07,890
And lastly, we just have to type
dev.off() to close the file.

135
00:08:07,890 --> 00:08:11,670
Now, if you look at the
folder where WHO.csv is,

136
00:08:11,670 --> 00:08:15,330
you should see another
file called MyPlot.pdf,

137
00:08:15,330 --> 00:08:17,850
containing the plot we made.

138
00:08:17,850 --> 00:08:20,350
In the next video,
we'll see how to create

139
00:08:20,350 --> 00:08:23,990
more advanced
scatterplots using ggplot.