Introduction to Data in R

Welcome to the course!

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(openintro)

## Loading required package: airports

## Loading required package: cherryblossom

## Loading required package: usdata

# Load data
data(email50)

# View the structure of the data
str(email50)

## tibble [50 x 21] (S3: tbl_df/tbl/data.frame)
##  $ spam        : num [1:50] 0 0 1 0 0 0 0 0 0 0 ...
##  $ to_multiple : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ from        : num [1:50] 1 1 1 1 1 1 1 1 1 1 ...
##  $ cc          : int [1:50] 0 0 4 0 0 0 0 0 1 0 ...
##  $ sent_email  : num [1:50] 1 0 0 0 0 0 0 1 1 0 ...
##  $ time        : POSIXct[1:50], format: "2012-01-04 08:19:16" "2012-02-16 15:10:06" ...
##  $ image       : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ attach      : num [1:50] 0 0 2 0 0 0 0 0 0 0 ...
##  $ dollar      : num [1:50] 0 0 0 0 9 0 0 0 0 23 ...
##  $ winner      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ inherit     : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ viagra      : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ password    : num [1:50] 0 0 0 0 1 0 0 0 0 0 ...
##  $ num_char    : num [1:50] 21.705 7.011 0.631 2.454 41.623 ...
##  $ line_breaks : int [1:50] 551 183 28 61 1088 5 17 88 242 578 ...
##  $ format      : num [1:50] 1 1 0 0 1 0 0 1 1 1 ...
##  $ re_subj     : num [1:50] 1 0 0 0 0 0 0 1 1 0 ...
##  $ exclaim_subj: num [1:50] 0 0 0 0 0 0 0 0 1 0 ...
##  $ urgent_subj : num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ exclaim_mess: num [1:50] 8 1 2 1 43 0 0 2 22 3 ...
##  $ number      : Factor w/ 3 levels "none","small",..: 2 3 1 2 2 2 2 2 2 2 ...

Types of variables

Identify variable types

Recall from the video that the glimpse() function from dplyr provides a handy alternative to str() for previewing a dataset. In addition to the number of observations and variables, it shows the name and type of each column, along with a neatly printed preview of its values.

Let’s have another look at the email50 data, so we can practice identifying variable types.

# Glimpse email50
glimpse(email50)

## Rows: 50
## Columns: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, ...
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ cc           <int> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ sent_email   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
## $ time         <dttm> 2012-01-04 08:19:16, 2012-02-16 15:10:06, 2012-01-04 ...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ attach       <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, ...
## $ dollar       <dbl> 0, 0, 0, 0, 9, 0, 0, 0, 0, 23, 4, 0, 3, 2, 0, 0, 0, 0,...
## $ winner       <fct> no, no, no, no, no, no, no, no, no, no, no, no, yes, n...
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ password     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, ...
## $ num_char     <dbl> 21.705, 7.011, 0.631, 2.454, 41.623, 0.057, 0.809, 5.2...
## $ line_breaks  <int> 551, 183, 28, 61, 1088, 5, 17, 88, 242, 578, 1167, 198...
## $ format       <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, ...
## $ re_subj      <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...
## $ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ exclaim_mess <dbl> 8, 1, 2, 1, 43, 0, 0, 2, 22, 3, 13, 1, 2, 2, 21, 10, 0...
## $ number       <fct> small, big, none, small, small, small, small, small, s...

Nice! Can you determine the type of each variable?

Categorical data in R: factors

Filtering based on a factor

Categorical data are often stored as factors in R. In this exercise, we’ll practice working with a factor variable, number, from the email50 dataset. This variable tells us what type of number (none, small, or big) an email contains.

Recall from the video that the filter() function from dplyr can be used to create a subset of a dataset containing only certain levels of a variable. For example, the following code filters the mtcars dataset for cars with 6 cylinders:

mtcars %>%
  filter(cyl == 6)

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")
  
# Glimpse the subset
glimpse(email50_big)

## Rows: 7
## Columns: 21
## $ spam         <dbl> 0, 0, 1, 0, 0, 0, 0
## $ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ from         <dbl> 1, 1, 1, 1, 1, 1, 1
## $ cc           <int> 0, 0, 0, 0, 0, 0, 0
## $ sent_email   <dbl> 0, 0, 0, 0, 0, 1, 0
## $ time         <dttm> 2012-02-16 15:10:06, 2012-02-04 18:26:09, 2012-01-24 ...
## $ image        <dbl> 0, 0, 0, 0, 0, 0, 0
## $ attach       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ dollar       <dbl> 0, 0, 3, 2, 0, 0, 0
## $ winner       <fct> no, no, yes, no, no, no, no
## $ inherit      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0
## $ password     <dbl> 0, 2, 0, 0, 0, 0, 8
## $ num_char     <dbl> 7.011, 10.368, 42.793, 26.520, 6.563, 11.223, 10.613
## $ line_breaks  <int> 183, 198, 712, 692, 140, 512, 225
## $ format       <dbl> 1, 1, 1, 1, 1, 1, 1
## $ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_subj <dbl> 0, 0, 0, 1, 0, 0, 0
## $ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0
## $ exclaim_mess <dbl> 1, 1, 2, 7, 2, 9, 9
## $ number       <fct> big, big, big, big, big, big, big

Great work! Seven emails contain big numbers.

Complete filtering based on a factor

The droplevels() function removes unused levels of factor variables from our dataset. As we saw in the video, it’s often useful to first determine which levels are unused (i.e. have zero observations) with the table() function.

In this exercise, we’ll see which levels of the number variable are dropped after applying the droplevels() function.

# Subset of emails with big numbers: email50_big
email50_big <- email50 %>%
  filter(number == "big")

# Table of the number variable
table(email50_big$number)

## 
##  none small   big 
##     0     0     7

# Drop levels
email50_big$number_dropped <- droplevels(email50_big$number)

# Table of the number variable
table(email50_big$number_dropped)

## 
## big 
##   7

Did you notice that dropping the levels of the number variable gets rid of the levels with counts of zero? This will be useful when you’re creating visualizations later on. Great work!
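
The same behavior can be sketched on a small made-up factor using only base R. The vector sizes and its values are illustrative assumptions, not part of email50.

```r
# Hypothetical factor with three declared levels
sizes <- factor(c("big", "small", "big", "none", "big"),
                levels = c("none", "small", "big"))

# Keep only the "big" entries; the unused levels stay attached
big_only <- sizes[sizes == "big"]
table(big_only)

# droplevels() removes levels that no longer occur in the data
big_dropped <- droplevels(big_only)
table(big_dropped)
levels(big_dropped)
```

Subsetting alone never shrinks the set of levels, which is why the first table still lists none and small with counts of zero.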

Discretize a variable

Discretize a different variable

In this exercise, we’ll create a categorical version of the num_char variable in the email50 dataset. num_char is the number of characters in an email, in thousands. This new variable will have two levels ("below median" and "at or above median"), depending on whether an email has fewer characters than the median or at least as many.

The median marks the 50th percentile, or midpoint, of a distribution, so half of the emails should fall in one category and the other half in the other. You will learn more about the median and other measures of center in the next course in this series.

# Calculate median number of characters: med_num_char
med_num_char <- median(email50$num_char)

# Create num_char_cat variable in email50
email50_fortified <- email50 %>%
  mutate(num_char_cat = ifelse(num_char < med_num_char, "below median", "at or above median"))
  
# Count emails in each category
email50_fortified %>%
  count(num_char_cat)

## # A tibble: 2 x 2
##   num_char_cat           n
##   <chr>              <int>
## 1 at or above median    25
## 2 below median          25

Great job! As you can see, half of the observations are below the median and half are at or above the median. Makes sense, doesn’t it?
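
To see the median split at a smaller scale, here is a minimal sketch on six made-up values. The data frame toy and its numbers are assumptions for illustration only; the logic mirrors the ifelse() call above.

```r
library(dplyr)

# Hypothetical character counts (in thousands) for six emails
toy <- data.frame(num_char = c(0.6, 2.5, 7.0, 21.7, 41.6, 10.4))

# With an even number of values, the median is the average of the two middle ones
med <- median(toy$num_char)

toy_cat <- toy %>%
  mutate(num_char_cat = ifelse(num_char < med,
                               "below median",
                               "at or above median"))

# Count emails in each category
toy_cat %>%
  count(num_char_cat)
```

Because the comparison is a strict less-than, any value exactly equal to the median lands in the "at or above median" category.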

Combining levels of a different factor

Another common way of creating a new variable based on an existing one is by combining levels of a categorical variable. For example, the email50 dataset has a categorical variable called number with levels "none", "small", and "big", but suppose we’re only interested in whether an email contains a number. In this exercise, we will create a variable containing this information and also visualize it.

For now, do your best to understand the code we’ve provided to generate the plot. We will go through it in detail in the next video.

library(ggplot2)

# Create number_yn column in email50
email50_fortified <- email50 %>%
  mutate(
    number_yn = case_when(
      # if number is "none", make number_yn "no"
      number == "none" ~ "no",
      # if number is not "none", make number_yn "yes"
      number != "none" ~ "yes"
    )
  )

# Visualize the distribution of number_yn
ggplot(email50_fortified, aes(x = number_yn)) +
  geom_bar()
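
One way to check a recode like this is to cross-tabulate the old and new variables. The sketch below does so on a made-up stand-in for the number column; the data frame toy is hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the number column of email50
toy <- data.frame(
  number = factor(c("small", "big", "none", "small", "none"),
                  levels = c("none", "small", "big"))
)

toy_yn <- toy %>%
  mutate(number_yn = case_when(
    number == "none" ~ "no",
    number != "none" ~ "yes"
  ))

# Each level of number should map to exactly one value of number_yn
toy_yn %>%
  count(number, number_yn)
```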

Visualizing numerical data

Visualizing numerical and categorical data

In this exercise, we’ll visualize the relationship between two numerical variables from the email50 dataset, conditioned on whether or not the email was spam. This means that we will use an aspect of the plot (like color or shape) to identify the levels in the spam variable so that we can compare plotted values between them.

Recall that in the ggplot() function, the first argument is the dataset, then we map the aesthetic features of the plot to variables in the dataset, and finally the geom_*() layer informs how data are represented on the plot. In this exercise, we will make a scatterplot by adding a geom_point() layer to the ggplot() call.

# Load ggplot2
library(ggplot2)

# Scatterplot of exclaim_mess vs. num_char
ggplot(email50, aes(x = num_char, y = exclaim_mess, color = factor(spam))) +
  geom_point()

Excellent work! Note how ggplot2 automatically creates a helpful legend for the plot, telling you which color corresponds to each level of the spam variable.

Observational studies and experiments

Identify type of study: Countries

Next, let’s take a look at data from a different study on country characteristics. First, load the data and view it, then identify the type of study. Remember, an experiment requires random assignment.

library(gapminder)

# Load data
data(gapminder)

# Glimpse data
glimpse(gapminder)

## Rows: 1,704
## Columns: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...

# Identify type of study: observational or experimental
type_of_study <- "observational"

Right! Since there is no way to randomly assign countries to attributes, this is an observational study. Nice work!

Random sampling and random assignment

Simpson’s paradox

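Simpson’s paradox arises when an important explanatory variable is omitted: once that variable is included, the relationship between the response and the other explanatory variable can change, or even reverse.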

Number of males and females admitted

The goal of this exercise is to determine how many male and female applicants were admitted and how many were rejected, that is, the count for each combination of gender and admission status.

To do so we will use the count() function from the dplyr package.

In one step, count() groups the data and then tallies the number of observations in each level of the grouping variable. These counts are available under a new variable called n.

# Load packages
library(dplyr)

# Load data
load("_data/ucb_admit.RData")

# Count number of male and female applicants admitted
(ucb_admission_counts <- ucb_admit %>%
  count(Gender, Admit))

##   Gender    Admit    n
## 1   Male Admitted 1198
## 2   Male Rejected 1493
## 3 Female Admitted  557
## 4 Female Rejected 1278

Cool counting! Passing several arguments to count() gives you the number of rows for each combination of those arguments.
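
As a rough illustration of what count() does in one step, the sketch below compares it with an explicit group_by() and summarize() on made-up applicant-level data. The data frame toy is an assumption and is not the ucb_admit data.

```r
library(dplyr)

# Hypothetical applicant-level data, one row per applicant
toy <- data.frame(
  Gender = c("Male", "Male", "Female", "Female", "Male"),
  Admit  = c("Admitted", "Rejected", "Admitted", "Admitted", "Admitted")
)

# count() groups and tallies in one step
by_count <- toy %>%
  count(Gender, Admit)

# The same tally written out as group_by() + summarize()
by_summarize <- toy %>%
  group_by(Gender, Admit) %>%
  summarize(n = n(), .groups = "drop")

by_count
```

Combinations that never occur in the data (here, rejected females) do not get a row.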

Proportion of males admitted overall

Next we’ll calculate the proportion of males and the proportion of females admitted, by creating a new variable called prop (short for proportion) from the counts calculated in the previous exercise, using the mutate() function from the dplyr package.

Proportions for each row of the data frame we created in the previous exercise can be calculated as n / sum(n). Note that since the data are grouped by gender, sum(n) will be calculated for males and females separately.

ucb_admission_counts %>%
  # Group by gender
  group_by(Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for admitted
  filter(Admit == "Admitted")

## # A tibble: 2 x 4
## # Groups:   Gender [2]
##   Gender Admit        n  prop
##   <fct>  <fct>    <int> <dbl>
## 1 Male   Admitted  1198 0.445
## 2 Female Admitted   557 0.304

Fantastic! It looks like about 45% of males were admitted versus only about 30% of females, but as you’ll see in the next exercise, there’s more to the story.
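
To make the effect of grouping on sum(n) concrete, here is a small sketch that rebuilds the four counts from the count(Gender, Admit) output above and computes n / sum(n) with and without group_by(). The object names counts, ungrouped, and grouped are our own.

```r
library(dplyr)

# Counts copied from the count(Gender, Admit) output above
counts <- data.frame(
  Gender = c("Male", "Male", "Female", "Female"),
  Admit  = c("Admitted", "Rejected", "Admitted", "Rejected"),
  n      = c(1198, 1493, 557, 1278)
)

# Ungrouped: sum(n) is the total number of applicants, so each
# proportion is a share of all applicants
ungrouped <- counts %>%
  mutate(prop = n / sum(n))

# Grouped by gender: sum(n) is computed within each gender, so each
# proportion is a share of that gender's applicants
grouped <- counts %>%
  group_by(Gender) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

ungrouped
grouped
```

Only the grouped version answers the question "what share of male applicants were admitted?"; the ungrouped version divides by all 4,526 applicants.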

Proportion of males admitted for each department

Finally we’ll make a table similar to the one we constructed earlier, except we’ll first group the data by department. The goal is to compare the proportions of admitted male students across departments.

Proportions for each row of the data frame we create can be calculated as n / sum(n). Note that since the data are grouped by department and gender, sum(n) will be calculated for males and females separately for each department.

ucb_admission_counts <- ucb_admit %>%
  # Counts by department, then gender, then admission status
  count(Dept, Gender, Admit)

# See the result
ucb_admission_counts

##    Dept Gender    Admit   n
## 1     A   Male Admitted 512
## 2     A   Male Rejected 313
## 3     A Female Admitted  89
## 4     A Female Rejected  19
## 5     B   Male Admitted 353
## 6     B   Male Rejected 207
## 7     B Female Admitted  17
## 8     B Female Rejected   8
## 9     C   Male Admitted 120
## 10    C   Male Rejected 205
## 11    C Female Admitted 202
## 12    C Female Rejected 391
## 13    D   Male Admitted 138
## 14    D   Male Rejected 279
## 15    D Female Admitted 131
## 16    D Female Rejected 244
## 17    E   Male Admitted  53
## 18    E   Male Rejected 138
## 19    E Female Admitted  94
## 20    E Female Rejected 299
## 21    F   Male Admitted  22
## 22    F   Male Rejected 351
## 23    F Female Admitted  24
## 24    F Female Rejected 317

ucb_admission_counts  %>%
  # Group by department, then gender
  group_by(Dept, Gender) %>%
  # Create new variable
  mutate(prop = n / sum(n)) %>%
  # Filter for male and admitted
  filter(Gender == "Male", Admit == "Admitted")

## # A tibble: 6 x 5
## # Groups:   Dept, Gender [6]
##   Dept  Gender Admit        n   prop
##   <chr> <fct>  <fct>    <int>  <dbl>
## 1 A     Male   Admitted   512 0.621 
## 2 B     Male   Admitted   353 0.630 
## 3 C     Male   Admitted   120 0.369 
## 4 D     Male   Admitted   138 0.331 
## 5 E     Male   Admitted    53 0.277 
## 6 F     Male   Admitted    22 0.0590

Amazing admission analyzing! The proportion of males admitted varies wildly between departments.
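
To see the reversal itself, we can put the female and male admission rates side by side for each department. The sketch below rebuilds the counts from the count(Dept, Gender, Admit) output above; the object names are our own.

```r
library(dplyr)

# Counts copied from the count(Dept, Gender, Admit) output above
dept_counts <- data.frame(
  Dept   = rep(c("A", "B", "C", "D", "E", "F"), each = 4),
  Gender = rep(c("Male", "Male", "Female", "Female"), times = 6),
  Admit  = rep(c("Admitted", "Rejected"), times = 12),
  n = c(512, 313,  89,  19,
        353, 207,  17,   8,
        120, 205, 202, 391,
        138, 279, 131, 244,
         53, 138,  94, 299,
         22, 351,  24, 317)
)

# Admission rate for each department and gender
rates <- dept_counts %>%
  group_by(Dept, Gender) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup() %>%
  filter(Admit == "Admitted") %>%
  select(Dept, Gender, prop)

# Departments where the female rate exceeds the male rate
female <- rates %>% filter(Gender == "Female")
male   <- rates %>% filter(Gender == "Male")
female$Dept[female$prop > male$prop]
# "A" "B" "D" "F"
```

In four of the six departments the female admission rate is higher than the male rate, even though the overall male rate is higher. Department is the omitted variable that changes the picture.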

Recap: Simpson’s paradox

Sampling strategies

In cluster sampling, each cluster is heterogeneous within itself, but the clusters resemble one another, so each cluster is roughly representative of the others.

Sampling in R

Simple random sample in R

Suppose we want to collect some data from a sample of eight states. A list of all states and the region each belongs to (Northeast, Midwest, South, West) is given in the us_regions data frame.

load("_data/us_regions.RData")

# Simple random sample
states_srs <- us_regions %>%
  sample_n(8)

# Count states by region
states_srs %>%
  count(region)

##      region n
## 1   Midwest 1
## 2 Northeast 2
## 3     South 3
## 4      West 2

Great work! Notice that this strategy may select an unequal number of states from each region. In the next exercise, you’ll implement stratified sampling to be sure to select an equal number of states from each region.
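
Because the sample is random, the counts above will differ from run to run. A minimal sketch of a reproducible draw follows. It assumes a stand-in data frame built from R's built-in state.name and state.region vectors, since us_regions itself is loaded from a file; note that the built-in data labels the Midwest as "North Central". It also uses slice_sample(), which the dplyr documentation recommends in place of the superseded sample_n().

```r
library(dplyr)

# Stand-in for us_regions built from R's built-in state datasets;
# the real us_regions data may differ (for example in region labels)
states <- data.frame(state = state.name, region = state.region)

# Fixing the seed makes the same eight states come out on every run
set.seed(123)
srs_a <- states %>% slice_sample(n = 8)

set.seed(123)
srs_b <- states %>% slice_sample(n = 8)

# Same seed, same sample
identical(srs_a$state, srs_b$state)

# Count states by region
srs_a %>% count(region)
```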

Stratified sample in R

In the previous exercise, we took a simple random sample of eight states. However, we did not have any control over how many states from each region got sampled. The goal of stratified sampling in this context is to have control over the number of states sampled from each region. Our goal for this exercise is to sample an equal number of states from each region.

# Stratified sample
states_str <- us_regions %>%
  group_by(region) %>%
  sample_n(2)

# Count states by region
states_str %>%
  count(region)

## # A tibble: 4 x 2
## # Groups:   region [4]
##   region        n
##   <fct>     <int>
## 1 Midwest       2
## 2 Northeast     2
## 3 South         2
## 4 West          2

Nice job! In this stratified sample, each stratum (i.e. region) is represented equally.
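
The output above is still grouped by region ("Groups: region [4]"), which can affect later steps. The sketch below draws the same kind of stratified sample and then removes the grouping. As an assumption, it again uses a stand-in built from R's built-in state data rather than us_regions, and slice_sample() rather than sample_n().

```r
library(dplyr)

# Stand-in for us_regions built from R's built-in state datasets
states <- data.frame(state = state.name, region = state.region)

set.seed(42)
str_sample <- states %>%
  group_by(region) %>%
  # n = 2 applies within each region because the data are grouped
  slice_sample(n = 2) %>%
  # Drop the grouping so later verbs act on the whole sample
  ungroup()

# Count states by region
str_sample %>% count(region)
```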

Principles of experimental design

A potential confounding variable in this case is previous programming experience, so it is used as the blocking variable. Within each block (with and without experience), participants are randomly assigned in equal numbers to the two treatment groups.

Connect blocking and stratifying

In random sampling, we use stratifying to control for a variable. In random assignment, we use blocking to achieve the same goal.
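
Blocked random assignment can be sketched with the same group_by() pattern used for stratified sampling. Everything in this example is hypothetical: the participants data frame, the experience labels, and the treatment names "A" and "B".

```r
library(dplyr)

# Hypothetical participants; prior programming experience is the blocking variable
participants <- data.frame(
  id = 1:8,
  experience = rep(c("yes", "no"), each = 4)
)

set.seed(1)
assigned <- participants %>%
  group_by(experience) %>%
  # Within each block, shuffle an equal number of each treatment label
  mutate(treatment = sample(rep(c("A", "B"), length.out = n()))) %>%
  ungroup()

# Each block contributes the same number of participants to each treatment
assigned %>% count(experience, treatment)
```

Because the shuffle happens inside each block, experience is balanced across the two treatments by construction rather than by chance.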

Beauty in the classroom

Inspect the data

The purpose of this chapter is to give you an opportunity to apply and practice what you’ve learned on a real world dataset. For this reason, we’ll provide a little less guidance than usual.

The data from the study described in the video are available in your workspace as evals. Let’s take a look!

load("_data/evals.RData")

# Inspect evals
glimpse(evals)

## Rows: 463
## Columns: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8...
## $ rank          <fct> tenure track, tenure track, tenure track, tenure trac...
## $ ethnicity     <fct> minority, minority, minority, minority, not minority,...
## $ gender        <fct> female, female, female, female, male, male, male, mal...
## $ language      <fct> english, english, english, english, english, english,...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 4...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87....
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, ...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 2...
## $ cls_level     <fct> upper, upper, upper, upper, upper, upper, upper, uppe...
## $ cls_profs     <fct> single, single, single, single, multiple, multiple, m...
## $ cls_credits   <fct> multi credit, multi credit, multi credit, multi credi...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7,...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9,...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9,...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7,...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6,...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6,...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.33...
## $ pic_outfit    <fct> not formal, not formal, not formal, not formal, not f...
## $ pic_color     <fct> color, color, color, color, color, color, color, colo...

# Alternative solutions
dim(evals)

## [1] 463  21

str(evals)

## tibble [463 x 21] (S3: tbl_df/tbl/data.frame)
##  $ score        : num [1:463] 4.7 4.1 3.9 4.8 4.6 4.3 2.8 4.1 3.4 4.5 ...
##  $ rank         : Factor w/ 3 levels "teaching","tenure track",..: 2 2 2 2 3 3 3 3 3 3 ...
##  $ ethnicity    : Factor w/ 2 levels "minority","not minority": 1 1 1 1 2 2 2 2 2 2 ...
##  $ gender       : Factor w/ 2 levels "female","male": 1 1 1 1 2 2 2 2 2 1 ...
##  $ language     : Factor w/ 2 levels "english","non-english": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age          : int [1:463] 36 36 36 36 59 59 59 51 51 40 ...
##  $ cls_perc_eval: num [1:463] 55.8 68.8 60.8 62.6 85 ...
##  $ cls_did_eval : int [1:463] 24 86 76 77 17 35 39 55 111 40 ...
##  $ cls_students : int [1:463] 43 125 125 123 20 40 44 55 195 46 ...
##  $ cls_level    : Factor w/ 2 levels "lower","upper": 2 2 2 2 2 2 2 2 2 2 ...
##  $ cls_profs    : Factor w/ 2 levels "multiple","single": 2 2 2 2 1 1 1 2 2 2 ...
##  $ cls_credits  : Factor w/ 2 levels "multi credit",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bty_f1lower  : int [1:463] 5 5 5 5 4 4 4 5 5 2 ...
##  $ bty_f1upper  : int [1:463] 7 7 7 7 4 4 4 2 2 5 ...
##  $ bty_f2upper  : int [1:463] 6 6 6 6 2 2 2 5 5 4 ...
##  $ bty_m1lower  : int [1:463] 2 2 2 2 2 2 2 2 2 3 ...
##  $ bty_m1upper  : int [1:463] 4 4 4 4 3 3 3 3 3 3 ...
##  $ bty_m2upper  : int [1:463] 6 6 6 6 3 3 3 3 3 2 ...
##  $ bty_avg      : num [1:463] 5 5 5 5 3 ...
##  $ pic_outfit   : Factor w/ 2 levels "formal","not formal": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pic_color    : Factor w/ 2 levels "black&white",..: 2 2 2 2 2 2 2 2 2 2 ...

Nice work! There are many ways to inspect a data frame in R and to find how many observations and variables it contains.

Variables in the data

Identify variable types

It’s always useful to start your exploration of a dataset by identifying variable types. The results from this exercise will help you design appropriate visualizations and calculate useful summary statistics later in your analysis.

# Inspect variable types
glimpse(evals)

## Rows: 463
## Columns: 21
## $ score         <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8...
## $ rank          <fct> tenure track, tenure track, tenure track, tenure trac...
## $ ethnicity     <fct> minority, minority, minority, minority, not minority,...
## $ gender        <fct> female, female, female, female, male, male, male, mal...
## $ language      <fct> english, english, english, english, english, english,...
## $ age           <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 4...
## $ cls_perc_eval <dbl> 55.81395, 68.80000, 60.80000, 62.60163, 85.00000, 87....
## $ cls_did_eval  <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, ...
## $ cls_students  <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 2...
## $ cls_level     <fct> upper, upper, upper, upper, upper, upper, upper, uppe...
## $ cls_profs     <fct> single, single, single, single, multiple, multiple, m...
## $ cls_credits   <fct> multi credit, multi credit, multi credit, multi credi...
## $ bty_f1lower   <int> 5, 5, 5, 5, 4, 4, 4, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 7,...
## $ bty_f1upper   <int> 7, 7, 7, 7, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 9,...
## $ bty_f2upper   <int> 6, 6, 6, 6, 2, 2, 2, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 9,...
## $ bty_m1lower   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7,...
## $ bty_m1upper   <int> 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6,...
## $ bty_m2upper   <int> 6, 6, 6, 6, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 6,...
## $ bty_avg       <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.33...
## $ pic_outfit    <fct> not formal, not formal, not formal, not formal, not f...
## $ pic_color     <fct> color, color, color, color, color, color, color, colo...

str(evals) # Another option

## tibble [463 x 21] (S3: tbl_df/tbl/data.frame)
##  $ score        : num [1:463] 4.7 4.1 3.9 4.8 4.6 4.3 2.8 4.1 3.4 4.5 ...
##  $ rank         : Factor w/ 3 levels "teaching","tenure track",..: 2 2 2 2 3 3 3 3 3 3 ...
##  $ ethnicity    : Factor w/ 2 levels "minority","not minority": 1 1 1 1 2 2 2 2 2 2 ...
##  $ gender       : Factor w/ 2 levels "female","male": 1 1 1 1 2 2 2 2 2 1 ...
##  $ language     : Factor w/ 2 levels "english","non-english": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age          : int [1:463] 36 36 36 36 59 59 59 51 51 40 ...
##  $ cls_perc_eval: num [1:463] 55.8 68.8 60.8 62.6 85 ...
##  $ cls_did_eval : int [1:463] 24 86 76 77 17 35 39 55 111 40 ...
##  $ cls_students : int [1:463] 43 125 125 123 20 40 44 55 195 46 ...
##  $ cls_level    : Factor w/ 2 levels "lower","upper": 2 2 2 2 2 2 2 2 2 2 ...
##  $ cls_profs    : Factor w/ 2 levels "multiple","single": 2 2 2 2 1 1 1 2 2 2 ...
##  $ cls_credits  : Factor w/ 2 levels "multi credit",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bty_f1lower  : int [1:463] 5 5 5 5 4 4 4 5 5 2 ...
##  $ bty_f1upper  : int [1:463] 7 7 7 7 4 4 4 2 2 5 ...
##  $ bty_f2upper  : int [1:463] 6 6 6 6 2 2 2 5 5 4 ...
##  $ bty_m1lower  : int [1:463] 2 2 2 2 2 2 2 2 2 3 ...
##  $ bty_m1upper  : int [1:463] 4 4 4 4 3 3 3 3 3 3 ...
##  $ bty_m2upper  : int [1:463] 6 6 6 6 3 3 3 3 3 2 ...
##  $ bty_avg      : num [1:463] 5 5 5 5 3 ...
##  $ pic_outfit   : Factor w/ 2 levels "formal","not formal": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pic_color    : Factor w/ 2 levels "black&white",..: 2 2 2 2 2 2 2 2 2 2 ...

# Categorical (factor) variables in evals; non-factor variables have been removed
cat_vars <- c("rank", "ethnicity", "gender", "language",
              "cls_level", "cls_profs", "cls_credits",
              "pic_outfit", "pic_color")

Recode a variable

The cls_students variable in evals tells you the number of students in the class. Suppose instead of the exact number of students, you’re interested in whether the class is

"small" (18 students or fewer),
"midsize" (19 - 59 students), or
"large" (60 students or more).

# Recode cls_students as cls_type
evals_fortified <- evals %>%
  mutate(
    cls_type = case_when(
      cls_students <= 18                      ~ "small",
      cls_students >= 19 & cls_students <= 59 ~ "midsize",
      cls_students >= 60                      ~ "large"
    )
  )

Excellent! The cls_type variable is a categorical variable, stored as a character vector. You could have made it a factor variable by wrapping the case_when() call inside factor(). You don’t have to do that now. Let’s move on!
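
For reference, here is a minimal sketch of the factor version. It uses the first ten class sizes shown in the glimpse(evals) output above as a stand-in for the full dataset; the names toy and toy_fct and the explicit level order are our own choices.

```r
library(dplyr)

# First ten class sizes shown in the glimpse(evals) output above
toy <- data.frame(cls_students = c(43, 125, 125, 123, 20, 40, 44, 55, 195, 46))

toy_fct <- toy %>%
  mutate(
    cls_type = factor(
      case_when(
        cls_students <= 18                      ~ "small",
        cls_students >= 19 & cls_students <= 59 ~ "midsize",
        cls_students >= 60                      ~ "large"
      ),
      # Declare the levels so they sort by size rather than alphabetically
      levels = c("small", "midsize", "large")
    )
  )

table(toy_fct$cls_type)
```

Declaring the levels also keeps "small" in the table with a count of zero, even though none of these ten classes is small.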

Create a scatterplot

The bty_avg variable is the professor’s average beauty rating, given by six students who were asked to rate the attractiveness of the faculty. The score variable is the average professor evaluation score, with 1 being very unsatisfactory and 5 being excellent.

# Scatterplot of score vs. bty_avg
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point()

Create a scatterplot, with an added layer

Suppose you are interested in evaluating how the relationship between a professor’s attractiveness and their evaluation score varies across different class types (small, midsize, and large).

# Scatterplot of score vs. bty_avg colored by cls_type
ggplot(evals_fortified, aes(x = bty_avg, y = score, color = cls_type)) +
  geom_point()