library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
# Load data
data(email50)
# View the structure of the data
str(email50)
## tibble [50 x 21] (S3: tbl_df/tbl/data.frame)
## $ spam : num [1:50] 0 0 1 0 0 0 0 0 0 0 ...
## $ to_multiple : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ from : num [1:50] 1 1 1 1 1 1 1 1 1 1 ...
## $ cc : int [1:50] 0 0 4 0 0 0 0 0 1 0 ...
## $ sent_email : num [1:50] 1 0 0 0 0 0 0 1 1 0 ...
## $ time : POSIXct[1:50], format: "2012-01-04 08:19:16" "2012-02-16 15:10:06" ...
## $ image : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ attach : num [1:50] 0 0 2 0 0 0 0 0 0 0 ...
## $ dollar : num [1:50] 0 0 0 0 9 0 0 0 0 23 ...
## $ winner : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ inherit : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ viagra : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ password : num [1:50] 0 0 0 0 1 0 0 0 0 0 ...
## $ num_char : num [1:50] 21.705 7.011 0.631 2.454 41.623 ...
## $ line_breaks : int [1:50] 551 183 28 61 1088 5 17 88 242 578 ...
## $ format : num [1:50] 1 1 0 0 1 0 0 1 1 1 ...
## $ re_subj : num [1:50] 1 0 0 0 0 0 0 1 1 0 ...
## $ exclaim_subj: num [1:50] 0 0 0 0 0 0 0 0 1 0 ...
## $ urgent_subj : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_mess: num [1:50] 8 1 2 1 43 0 0 2 22 3 ...
## $ number : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...
Identify variable types
Recall from the video that the glimpse() function from dplyr provides a handy alternative to str() for previewing a dataset. In addition to the number of observations and variables, it shows the name and type of each column, along with a neatly printed preview of its values.
Let’s have another look at the email50 data, so we can practice identifying variable types.
# Glimpse email50
glimpse(email50)
## Rows: 50
## Columns: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ cc <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ sent_email <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
## $ time <dttm> 2012-01-04 08:19:16, 2012-02-16 15:10:06, 2012-01-04 ...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ attach <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, ...
## $ dollar <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, 0, 0,...
## $ winner <fct> no, no, no, no, no, no, no, no, no, no, no, no, yes, n...
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ password <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, ...
## $ num_char <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809, 5.2...
## $ line_breaks <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167, 198...
## $ format <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
## $ re_subj <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, 10, 0...
## $ number <fct> small, big, none, small, small, small, small, small, s...
Nice! Can you determine the type of each variable?
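One way to check your answers is to list each column's class programmatically. The following is a minimal sketch in base R. The toy_emails data frame is a hypothetical stand-in that mirrors a few email50 column types, so the example runs without loading openintro.

```r
# Hypothetical toy data mirroring a few email50 column types (not the real data)
toy_emails <- data.frame(
  spam        = c(0, 0, 1),              # numeric 0/1 indicator
  line_breaks = c(551L, 183L, 28L),      # integer count
  num_char    = c(21.705, 7.011, 0.631), # continuous numeric
  number      = factor(c("small", "big", "none"),
                       levels = c("none", "small", "big")) # categorical factor
)

# Class of every column, returned as a named character vector
col_classes <- sapply(toy_emails, class)
print(col_classes)
```

Keep in mind that the storage type is not always the statistical type: spam is stored as a number, but it records a yes/no category.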
Filtering based on a factor
Categorical data are often stored as factors in R. In this exercise, we’ll practice working with a factor variable, number, from the email50 dataset. This variable tells us what type of number (none, small, or big) an email contains.
Recall from the video that the filter() function from dplyr can be used to create a subset of a dataset containing only certain levels of a variable. For example, the following code filters the mtcars dataset for cars with 6 cylinders:
mtcars %>%
  filter(cyl == 6)
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")
# Glimpse the subset
glimpse(email50_big)
## Rows: 7
## Columns: 21
## $ spam <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time <dttm> 2012-02-16 15:10:06, 2012-02-04 18:26:09, 2012-01-24 ...
## $ image <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner <fct> no, no, yes, no, no, no, no
## $ inherit <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks <int> 183, 198, 712, 692, 140, 512, 225
## $ format <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number <fct> big, big, big, big, big, big, big
Great work! Seven emails contain big numbers.
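The same filter() pattern extends to more than one level by testing membership with %in%. This is a minimal sketch on a hypothetical toy data frame (toy_emails is not the real email50), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data (not the real email50)
toy_emails <- data.frame(
  id     = 1:6,
  number = factor(c("none", "small", "big", "small", "big", "none"),
                  levels = c("none", "small", "big"))
)

# Keep rows whose number level is either "small" or "big"
toy_with_number <- toy_emails %>%
  filter(number %in% c("small", "big"))

nrow(toy_with_number)
```

Using %in% avoids chaining several == comparisons with |, which becomes harder to read as the list of levels grows.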
Complete filtering based on a factor
The droplevels() function removes unused levels of factor variables from our dataset. As we saw in the video, it’s often useful to determine which levels are unused (i.e., have a count of zero) with the table() function.
In this exercise, we’ll see which levels of the number variable are dropped after applying the droplevels() function.
# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")
# Table of the number variable
table(email50_big$number)
##
## none small big
## 0 0 7
# Drop levels
email50_big$number_dropped <- droplevels(email50_big$number)
# Table of the number_dropped variable
table(email50_big$number_dropped)
##
## big
## 7
Did you notice that dropping the levels of the number variable gets rid of the levels with counts of zero? This will be useful when you’re creating visualizations later on. Great work!
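droplevels() also accepts a whole data frame, in which case it drops unused levels from every factor column at once. Here is a minimal base R sketch, using a hypothetical toy data frame rather than the real email50:

```r
# Hypothetical toy data (not the real email50)
toy_emails <- data.frame(
  number = factor(c("big", "big", "small"), levels = c("none", "small", "big")),
  winner = factor(c("no", "no", "yes"), levels = c("no", "yes"))
)

# Subset to big-number emails; the unused levels are still attached
toy_big <- toy_emails[toy_emails$number == "big", ]
levels(toy_big$number)  # "none" "small" "big"

# droplevels() on a data frame cleans every factor column
toy_big_clean <- droplevels(toy_big)
levels(toy_big_clean$number)  # "big"
levels(toy_big_clean$winner)  # "no"
```

Note that winner loses its "yes" level too, because no remaining row uses it. Apply droplevels() to a single column instead when other factors should keep their full set of levels.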
Discretize a different variable
In this exercise, we’ll create a categorical version of the num_char variable in the email50 dataset. num_char is the number of characters in an email, in thousands. This new variable will have two levels ("below median" and "at or above median") depending on whether an email has fewer characters than the median or at least as many.
The median marks the 50th percentile, or midpoint, of a distribution, so half of the emails should fall in one category and the other half in the other. You will learn more about the median and other measures of center in the next course in this series.
# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)
# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
# Count emails in each category
email50_fortified %>%
  count(num_char_cat)
## # A tibble: 2 x 2
## num_char_cat n
## <chr> <int>
## 1 at or above median 25
## 2 below median 25
Great job! As you can see, half of the observations are below the median and half are at or above it. Makes sense, doesn’t it?
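An even split is not guaranteed in general: when several values tie with the median, or the number of observations is odd, the two groups can differ in size. The sketch below illustrates this in base R with a hypothetical vector (num_char_toy is made up for illustration), applying the same rule as the exercise.

```r
# Hypothetical character counts (in thousands) with ties at the median
num_char_toy <- c(1, 2, 2, 2, 5)

med_toy <- median(num_char_toy)

# Same rule as the exercise: strictly below the median vs. at or above it
cat_toy <- ifelse(num_char_toy < med_toy, "below median", "at or above median")

# Store as a factor with an explicit level order so tables list "below" first
cat_toy <- factor(cat_toy, levels = c("below median", "at or above median"))

split_counts <- table(cat_toy)
print(split_counts)
```

Here only one value falls below the median of 2, while four are at or above it.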
Combining levels of a different factor
Another common way of creating a new variable based on an existing one is by combining levels of a categorical variable. For example, the email50 dataset has a categorical variable called number with levels "none", "small", and "big", but suppose we’re only interested in whether an email contains a number. In this exercise, we will create a variable containing this information and also visualize it.
For now, do your best to understand the code we’ve provided to generate the plot. We will go through it in detail in the next video.
library(ggplot2)
# Create number_yn column in email50
email50_fortified <- email50 %>%
  mutate(
    number_yn = case_when(
      # if number is "none", make number_yn "no"
      number == "none" ~ "no",
      # if number is not "none", make number_yn "yes"
      number != "none" ~ "yes"
    )
  )
# Visualize the distribution of number_yn
ggplot(email50_fortified, aes(x = number_yn)) +
  geom_bar()
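Because number_yn has only two outcomes, the same recoding can also be written as a single two-way condition. The following is a minimal sketch using dplyr's if_else() on a hypothetical toy data frame (toy_emails is not the real email50). It is shown as an alternative, not as the course's method.

```r
library(dplyr)

# Hypothetical toy data (not the real email50)
toy_emails <- data.frame(
  number = factor(c("none", "small", "big", "none"),
                  levels = c("none", "small", "big"))
)

# "no" when number is "none", "yes" for every other level
toy_fortified <- toy_emails %>%
  mutate(number_yn = if_else(number == "none", "no", "yes"))

table(toy_fortified$number_yn)
```

case_when() remains the more natural choice once there are more than two outcomes, since each condition gets its own line.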
Visualizing numerical and categorical data
In this exercise, we’ll visualize the relationship between two numerical variables from the email50 dataset, conditioned on whether or not the email was spam. This means that we will use an aspect of the plot (like color or shape) to identify the levels in the spam variable so that we can compare plotted values between them.
Recall that in the ggplot() function, the first argument is the dataset, then we map the aesthetic features of the plot to variables in the dataset, and finally the geom_*() layer determines how data are represented on the plot. In this exercise, we will make a scatterplot by adding a geom_point() layer to the ggplot() call.
# Load ggplot2
library(ggplot2)
# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()
Excellent work! Note how ggplot2 automatically creates a helpful legend for the plot, telling you which color corresponds to each level of the spam variable.
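Color is only one of the aesthetics that can identify groups; point shape works the same way. This minimal sketch uses the built-in mtcars dataset as a stand-in, with am (0 = automatic, 1 = manual) playing the role of spam, and assumes ggplot2 is installed.

```r
library(ggplot2)

# Map the grouping variable to shape instead of color;
# factor() makes ggplot2 treat the 0/1 values as categories
p <- ggplot(mtcars, aes(x = wt, y = mpg, shape = factor(am))) +
  geom_point() +
  labs(shape = "am")

print(p)
```

Wrapping the variable in factor() matters for either aesthetic: it tells ggplot2 to treat the 0/1 values as discrete groups, and shape cannot be mapped to a continuous variable at all.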