Welcome to the course!

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
# Load data
data(email50)

# View the structure of the data
str(email50)
## tibble [50 x 21] (S3: tbl_df/tbl/data.frame)
##  $ spam        : num [1:50] 0 0 1 0 0 0 0 0 0 0 ...
##  $ to_multiple : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ from        : num [1:50] 1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int [1:50] 0 0 4 0 0 0 0 0 1 0 ...
##  $ sent_email  : num [1:50] 1 0 0 0 0 0 0 1 1 0 ...
##  $ time        : POSIXct[1:50], format: "2012-01-04 08:19:16" "2012-02-16 15:10:06" ...
##  $ image       : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ attach      : num [1:50] 0 0 2 0 0 0 0 0 0 0 ...
##  $ dollar      : num [1:50] 0 0 0 0 9 0 0 0 0 23 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ viagra      : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num [1:50] 0 0 0 0 1 0 0 0 0 0 ...
##  $ num_char    : num [1:50] 21.705 7.011 0.631 2.454 41.623 ...
##  $ line_breaks : int [1:50] 551 183 28 61 1088 5 17 88 242 578 ...
##  $ format      : num [1:50] 1 1 0 0 1 0 0 1 1 1 ...
##  $ re_subj     : num [1:50] 1 0 0 0 0 0 0 1 1 0 ...
##  $ exclaim_subj: num [1:50] 0 0 0 0 0 0 0 0 1 0 ...
##  $ urgent_subj : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num [1:50] 8 1 2 1 43 0 0 2 22 3 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...

Types of variables

Identify variable types

Recall from the video that the glimpse() function from dplyr provides a handy alternative to str() for previewing a dataset. In addition to the number of observations and variables, it shows the name and type of each column, along with a neatly printed preview of its values.

Let’s have another look at the email50 data, so we can practice identifying variable types.

# Glimpse email50
glimpse(email50)
## Rows: 50
## Columns: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ cc           <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ sent_email   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
## $ time         <dttm> 2012-01-04 08:19:16, 2012-02-16 15:10:06, 2012-01-04 ...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ attach       <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, ...
## $ dollar       <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, 0, 0,...
## $ winner       <fct> no, no, no, no, no, no, no, no, no, no, no, no, yes, n...
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ password     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, ...
## $ num_char     <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809, 5.2...
## $ line_breaks  <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167, 198...
## $ format       <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
## $ re_subj      <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, 10, 0...
## $ number       <fct> small, big, none, small, small, small, small, small, s...

Nice! Can you determine the type of each variable?
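
The type printed by glimpse() is only a starting point: a column stored as <dbl> can still be categorical, as with the 0/1 indicator spam. As a minimal sketch, the base R call below lists the storage class of each column. Note that toy_email is a made-up data frame that mirrors a few email50 columns; it is not the real data.

```r
# Toy data frame mimicking a few email50 columns (values are illustrative only)
toy_email <- data.frame(
  spam     = c(0, 0, 1),                 # stored as numeric, but really a yes/no category
  cc       = c(0L, 4L, 1L),              # integer count
  num_char = c(21.705, 7.011, 0.631),    # continuous numeric
  number   = factor(c("small", "big", "none"),
                    levels = c("none", "small", "big"))
)

# First class of each column, returned as a named character vector
col_types <- sapply(toy_email, function(x) class(x)[1])
print(col_types)
```

Comparing this listing with what each variable actually measures is usually the quickest way to spot categorical variables hiding in numeric columns.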

Categorical data in R: factors

Filtering based on a factor

Categorical data are often stored as factors in R. In this exercise, we’ll practice working with a factor variable, number, from the email50 dataset. This variable tells us what type of number (none, small, or big) an email contains.

Recall from the video that the filter() function from dplyr can be used to filter a dataset to create a subset containing only certain levels of a variable. For example, the following code filters the mtcars dataset for cars with 6 cylinders:

mtcars %>%
  filter(cyl == 6)

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")
  
# Glimpse the subset
glimpse(email50_big)
## Rows: 7
## Columns: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email   <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time         <dttm> 2012-02-16 15:10:06, 2012-02-04 18:26:09, 2012-01-24 ...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar       <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner       <fct> no, no, yes, no, no, no, no
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password     <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char     <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks  <int> 183, 198, 712, 692, 140, 512, 225
## $ format       <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number       <fct> big, big, big, big, big, big, big

Great work! Seven emails contain big numbers.
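
To keep more than one level at a time, one option is the %in% operator inside filter(). The sketch below uses the built-in mtcars data rather than email50, and the name small_engines is made up for illustration.

```r
library(dplyr)

# Keep cars whose cylinder count is either 4 or 6
small_engines <- mtcars %>%
  filter(cyl %in% c(4, 6))

# Count the remaining cars by cylinder count
small_engines %>%
  count(cyl)
```

The same pattern should carry over to a factor such as number, for example filter(number %in% c("small", "big")).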

Complete filtering based on a factor

The droplevels() function removes unused levels of factor variables from our dataset. As we saw in the video, it’s often useful to determine which levels are unused (i.e., contain zero observations) with the table() function.

In this exercise, we’ll see which levels of the number variable are dropped after applying the droplevels() function.

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

# Table of the number variable
table(email50_big$number)
## 
##  none small   big 
##     0     0     7
# Drop levels
email50_big$number_dropped <- droplevels(email50_big$number)

# Table of the number variable
table(email50_big$number_dropped)
## 
## big 
##   7

Did you notice that dropping the levels of the number variable gets rid of the levels with counts of zero? This will be useful when you’re creating visualizations later on. Great work!
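
droplevels() also accepts a whole data frame, in which case it drops the unused levels of every factor column at once. Here is a minimal sketch in base R; toy, toy_big, and toy_big_dropped are made-up names and values, not part of email50.

```r
# Toy data with a three-level factor (values are illustrative only)
toy <- data.frame(
  id     = 1:5,
  number = factor(c("small", "big", "none", "big", "small"),
                  levels = c("none", "small", "big"))
)

# Subsetting keeps all three levels, even though only "big" remains
toy_big <- toy[toy$number == "big", ]
print(levels(toy_big$number))

# Drop unused levels from every factor column in the data frame
toy_big_dropped <- droplevels(toy_big)
print(levels(toy_big_dropped$number))
```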

Discretize a variable

Discretize a different variable

In this exercise, we’ll create a categorical version of the num_char variable in the email50 dataset. num_char is the number of characters in an email, in thousands. This new variable will have two levels ("below median" and "at or above median"), depending on whether an email has fewer characters than the median or at least as many.

The median marks the 50th percentile, or midpoint, of a distribution, so about half of the emails should fall in each category. You will learn more about the median and other measures of center in the next course in this series.

# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)

# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
  
# Count emails in each category
email50_fortified %>%
  count(num_char_cat)
## # A tibble: 2 x 2
##   num_char_cat           n
##   <chr>              <int>
## 1 at or above median    25
## 2 below median          25

Great job! As you can see, half of the observations are below the median and half are at or above the median. Makes sense, doesn’t it?
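
The even 25/25 split is not guaranteed in general. When several observations are tied at the median, they all land in the "at or above median" group. Here is a small base R sketch with made-up values (x is not taken from email50):

```r
# Six made-up values with three ties at the median
x <- c(1, 2, 3, 3, 3, 4)

# Median of an even-length vector: mean of the two middle values (3 and 3)
med_x <- median(x)

# Same rule as in the exercise: strictly below vs. at or above
x_cat <- ifelse(x < med_x, "below median", "at or above median")

# Count observations in each category
counts <- table(x_cat)
print(counts)
```

Here only two values fall below the median while four are at or above it, so it is worth checking the counts rather than assuming an equal split.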

Combining levels of a different factor

Another common way of creating a new variable based on an existing one is by combining levels of a categorical variable. For example, the email50 dataset has a categorical variable called number with levels "none", "small", and "big", but suppose we’re only interested in whether an email contains a number. In this exercise, we will create a variable containing this information and also visualize it.

For now, do your best to understand the code we’ve provided to generate the plot. We will go through it in detail in the next video.

library(ggplot2)

# Create number_yn column in email50
email50_fortified <- email50 %>%
  mutate(
    number_yn = case_when(
      # if number is "none", make number_yn "no"
      number == "none" ~ "no",
      # if number is not "none", make number_yn "yes"
      number != "none" ~ "yes"
    )
  )

# Visualize the distribution of number_yn
ggplot(email50_fortified, aes(x = number_yn)) +
  geom_bar()
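
One way to check a recoding like this is to cross-tabulate the original variable against the new one. The sketch below does so in base R, with ifelse() standing in for case_when(); the number and number_yn vectors are made-up toy values, not the email50 columns.

```r
# Toy factor with the same levels as email50$number (values are illustrative only)
number <- factor(c("small", "big", "none", "small", "none"),
                 levels = c("none", "small", "big"))

# Collapse three levels into two: "none" becomes "no", everything else "yes"
number_yn <- ifelse(number == "none", "no", "yes")

# Rows: original levels; columns: recoded values
check <- table(number, number_yn)
print(check)
```

Each row of the table should have a nonzero count in only one column if the recoding worked as intended.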

Visualizing numerical data

Visualizing numerical and categorical data

In this exercise, we’ll visualize the relationship between two numerical variables from the email50 dataset, conditioned on whether or not the email was spam. This means that we will use an aspect of the plot (like color or shape) to identify the levels in the spam variable so that we can compare plotted values between them.

Recall that in the ggplot() function, the first argument is the dataset, then we map the aesthetic features of the plot to variables in the dataset, and finally the geom_*() layer specifies how the data are represented on the plot. In this exercise, we will make a scatterplot by adding a geom_point() layer to the ggplot() call.

# Load ggplot2
library(ggplot2)

# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()

Excellent work! Note how ggplot2 automatically creates a helpful legend for the plot, telling you which color corresponds to each level of the spam variable.
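
Shape can play the same role as color. As a rough sketch, the code below maps a two-level factor to point shape and sets the legend title with labs(). It uses the built-in mtcars data instead of email50; the names cars, am_f, and p are made up for illustration.

```r
library(ggplot2)

# Label the 0/1 transmission indicator (in mtcars, 0 = automatic, 1 = manual)
cars <- mtcars
cars$am_f <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Same layering as before: data, aesthetic mappings, then a geom
p <- ggplot(cars, aes(x = wt, y = mpg, shape = am_f)) +
  geom_point() +
  labs(shape = "Transmission")

print(p)
```

Converting the indicator to a labeled factor first means the legend shows readable names rather than 0 and 1.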

Observational studies and experiments