Data Manipulation with dplyr

For today’s demonstration (and the practice), we’ll use data tables from the nycflights13 package.

Setup

First, install the package.

#install.packages('nycflights13')

The library is in the set up chunk above as well, but you need to load the library before using the tables

library(nycflights13)

Before we do

head(flights)
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## 5  2013     1     1      554            600        -6      812
## 6  2013     1     1      554            558        -4      740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

Piping

The pipe operator %>% pipes output from one function to the next (like | in Linux). You can use this to chain together as many command/functions as desired. For a simple example, instead of calling head() on flights, we could have piped flights to head()

flights %>% head()
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## 5  2013     1     1      554            600        -6      812
## 6  2013     1     1      554            558        -4      740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

You can chain as many pipes as you want together and you can also pipe into ggplot()

flights %>% filter(sched_dep_time <= 1200) %>% ggplot(aes(x = origin, y = dep_time, color = origin)) + geom_violin()
## Warning: Removed 2170 rows containing non-finite values (stat_ydensity).

dplyr functions

Review

The first two functions are a review from week 2: filter() and select(). Remember filter() works on rows and select() works on columns.

filter()

As a quick review of filter(), filter for carrier F9 and arrivale times after noon.

filter(flights, carrier == 'F9', arr_time > 1200)
## # A tibble: 509 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1     1716           1730       -14     1947
##  2  2013     1     2     1728           1730        -2     1952
##  3  2013     1     3     1933           1730       123     2131
##  4  2013     1     4     1831           1730        61     2029
##  5  2013     1     5     1726           1730        -4     1948
##  6  2013     1     6     1722           1730        -8     1946
##  7  2013     1     7     1723           1730        -7     2012
##  8  2013     1     8     1726           1730        -4     2006
##  9  2013     1     9     1723           1730        -7     2017
## 10  2013     1    10     1730           1730         0     2001
## # ... with 499 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

select()

As a quick review of select(), select the year, month, and day columns.

select(flights, year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

An additional trick, you can select many columns at once using a : like with numbers. For example, when you type 1:5, you get the numbers 1 to 5, and if you select carrier:dest, you get all the columns in flights between them as in the example below.

select(flights, carrier:dest)
## # A tibble: 336,776 x 5
##    carrier flight tailnum origin dest 
##    <chr>    <int> <chr>   <chr>  <chr>
##  1 UA        1545 N14228  EWR    IAH  
##  2 UA        1714 N24211  LGA    IAH  
##  3 AA        1141 N619AA  JFK    MIA  
##  4 B6         725 N804JB  JFK    BQN  
##  5 DL         461 N668DN  LGA    ATL  
##  6 UA        1696 N39463  EWR    ORD  
##  7 B6         507 N516JB  EWR    FLL  
##  8 EV        5708 N829AS  LGA    IAD  
##  9 B6          79 N593JB  JFK    MCO  
## 10 AA         301 N3ALAA  LGA    ORD  
## # ... with 336,766 more rows

New Functions

mutate()

Mutate adds another column onto a table. Let’s say we want to add the average time spent in security on as another column. The syntax is column_name = values.

mutate(flights, avg_secur_time = 1)
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, avg_secur_time <dbl>

If you just add a single value (which could be a number or a character), it’s listed in every row. Sometimes this is useful, for example if you’re adding a sample ID to an entire table, but often you probably want more than one value in your column. For that, use ifelse() and case_when(). For example, there are three origin airports in flights for the three airports around New York City, EWR (Newark), JFK, and LGA (LaGuardia). Maybe the average wait time in the New York airports (JFK and LGA) is 1 hour, but it’s 1.5 hours in Newark. You can add values based on another column using ifelse() when you have two different values

mutate(flights, avg_secur_time = ifelse(origin == 'EWR', 1.5, 1)) #%>% select (origin, avg_secur_time)
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, avg_secur_time <dbl>

If you have more than two values, use case_when(). In the example below, we’ll set different times for each airport.

mutate(flights, avg_secur_time = case_when(origin == 'EWR' ~ 1.5,
                                           origin == 'JFK' ~ 1,
                                           origin == 'LGA' ~ 2)) %>% select(origin, avg_secur_time)
## # A tibble: 336,776 x 2
##    origin avg_secur_time
##    <chr>           <dbl>
##  1 EWR               1.5
##  2 LGA               2  
##  3 JFK               1  
##  4 JFK               1  
##  5 LGA               2  
##  6 EWR               1.5
##  7 EWR               1.5
##  8 LGA               2  
##  9 JFK               1  
## 10 LGA               2  
## # ... with 336,766 more rows

arrange()

arrange() orders the entire table by the column(s) selected.

# this is how it looked originally
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Flights arranged by arrival time

arrange(flights, arr_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     2     2130           2130         0        1
##  2  2013     1    11     2157           2000       117        1
##  3  2013     1    11     2253           2249         4        1
##  4  2013     1    14     2122           2130        -8        1
##  5  2013     1    14     2246           2250        -4        1
##  6  2013     1    15     2304           2245        19        1
##  7  2013     1    16     2018           2025        -7        1
##  8  2013     1    16     2303           2245        18        1
##  9  2013     1    19     2107           2110        -3        1
## 10  2013     1    22     2246           2249        -3        1
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Flights arranged by carrier, then arrival time.

arrange(flights, carrier, arr_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    12    15     2155           2100        55        1
##  2  2013     5    10     2051           1955        56        1
##  3  2013     6    25     2112           1830       162        1
##  4  2013     7    19     2059           2030        29        1
##  5  2013     1    25     2116           1915       121        2
##  6  2013    10    24     2154           2040        74        2
##  7  2013     5     8     2157           2047        70        3
##  8  2013     1    21     2057           1945        72        4
##  9  2013    11    27     2255           2159        56        4
## 10  2013     2    23     2303           2045       138        4
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

By default, arrange() orders by least to greatest. If you want to greatest first, use desc(), which selects greatest values first.

arrange(flights, desc(arr_time))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1     2209           2155        14     2400
##  2  2013     1     5     2116           2130       -14     2400
##  3  2013     1    13     2243           2129        74     2400
##  4  2013     1    16     2138           2107        31     2400
##  5  2013     1    17     2256           2249         7     2400
##  6  2013     1    22     2212           2055        77     2400
##  7  2013     1    22     2249           2125        84     2400
##  8  2013     1    25     2055           1725       210     2400
##  9  2013     1    28     2303           2250        13     2400
## 10  2013     1    30     2155           1915       160     2400
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

group_by() & summarize()/summarise()

summarize() summarizes your table. This can be a summary statistic like mean, or median, or it can be whatever function you want. Also, fyi, for all the tidyverse functions that have multiple English spellings, you can call the function with any version.

summarize(flights, avg_dep_time = mean(sched_dep_time))
## # A tibble: 1 x 1
##   avg_dep_time
##          <dbl>
## 1        1344.

Using group_by(), you can group by one (or more) thing(s) and have a summary function applied to each group individually

flights %>% group_by(origin) %>% summarize(avg_dep_time = mean(sched_dep_time))
## # A tibble: 3 x 2
##   origin avg_dep_time
##   <chr>         <dbl>
## 1 EWR           1322.
## 2 JFK           1402.
## 3 LGA           1308.

All Together Now

flights %>% 
  filter(origin == 'LGA') %>% 
  group_by(carrier) %>% 
  summarize(avg_dep_time = mean(sched_dep_time), avg_arr_time = mean(sched_arr_time)) %>%
  ggplot(aes(x = carrier, y = avg_dep_time, color = avg_arr_time)) + 
    geom_point(size = 4) + 
    scale_color_viridis() +
    labs(x = 'Airline Carrier', y = 'Average Departure Time (24h)', color = 'Average Arrival Time (24h)') +
    theme_classic()