dplyr, tidyr, and the pipe operator

Intro

The packages dplyr and tidyr (along with a bunch of the other things in the tidyverse) are built to help make processing data quick and easy in R.
A “tidy” dataset should be organized like this:

This makes running “vectorized” functions on our dataset easy, since we can always just work with the columns.
Another great cheatsheet for these

What can the `dplyr` and `tidyr` functions do?

library(tidyverse)

hospdata = read_csv("datasets/Weekly_United_States_COVID-19_Hospitalization_Metrics_by_County.csv")

glimpse(hospdata)

## Rows: 9,672
## Columns: 22
## $ state                                                          <chr> "Tennes…
## $ county                                                         <chr> "Greene…
## $ fips_code                                                      <chr> "47059"…
## $ county_population                                              <dbl> 69069, …
## $ health_sa_name                                                 <chr> "Greene…
## $ health_sa_number                                               <dbl> 252, 25…
## $ health_sa_population                                           <dbl> 105073,…
## $ report_date                                                    <date> 2023-0…
## $ week_end_date                                                  <date> 2023-0…
## $ mmwr_report_week                                               <dbl> 20, 20,…
## $ mmwr_report_year                                               <dbl> 2023, 2…
## $ total_adm_all_covid_confirmed_past_7days                       <dbl> 0, 0, 6…
## $ total_adm_all_covid_confirmed_past_7days_per_100k              <dbl> 0.0, 0.…
## $ total_adm_all_covid_confirmed_past_7days_per_100k_level        <chr> "Low (<…
## $ admissions_covid_confirmed_week_over_week_percent_change       <dbl> NA, NA,…
## $ admissions_covid_confirmed_week_over_week_percent_change_level <chr> "Insufi…
## $ avg_percent_inpatient_beds_used_confirmed_covid                <dbl> 0.1, 0.…
## $ avg_percent_inpatient_beds_used_confirmed_covid_level          <chr> "Minima…
## $ abs_chg_avg_percent_inpatient_beds_occupied_covid_confirmed    <dbl> -0.5, -…
## $ avg_percent_staff_icu_beds_covid                               <dbl> 0.0, 0.…
## $ avg_percent_staff_icu_beds_covid_level                         <chr> "Minima…
## $ abs_chg_avg_percent_staff_icu_beds_covid                       <dbl> 0.0, 0.…

Order the data

Order the data by epiweek (oh, reminder there’s an epiweek function in lubridate that converts dates to epiweeks in case that’s useful!).

hospdata = arrange(hospdata, mmwr_report_week)

Filter rows

Filter rows with the filter function!

For example, only Michigan data:

hospdata_MI = filter(hospdata, state == "MI")

Select columns

Select only the variables we care about:

# Data before selecting
glimpse(hospdata_MI)

## Rows: 249
## Columns: 22
## $ state                                                          <chr> "MI", "…
## $ county                                                         <chr> "Monroe…
## $ fips_code                                                      <chr> "26115"…
## $ county_population                                              <dbl> 150500,…
## $ health_sa_name                                                 <chr> "Lucas …
## $ health_sa_number                                               <dbl> 276, 28…
## $ health_sa_population                                           <dbl> 792316,…
## $ report_date                                                    <date> 2023-0…
## $ week_end_date                                                  <date> 2023-0…
## $ mmwr_report_week                                               <dbl> 18, 18,…
## $ mmwr_report_year                                               <dbl> 2023, 2…
## $ total_adm_all_covid_confirmed_past_7days                       <dbl> 35, 6, …
## $ total_adm_all_covid_confirmed_past_7days_per_100k              <dbl> 4.4, 6.…
## $ total_adm_all_covid_confirmed_past_7days_per_100k_level        <chr> "Low (<…
## $ admissions_covid_confirmed_week_over_week_percent_change       <dbl> 6.1, -1…
## $ admissions_covid_confirmed_week_over_week_percent_change_level <chr> "Stable…
## $ avg_percent_inpatient_beds_used_confirmed_covid                <dbl> 1.5, 3.…
## $ avg_percent_inpatient_beds_used_confirmed_covid_level          <chr> "Minima…
## $ abs_chg_avg_percent_inpatient_beds_occupied_covid_confirmed    <dbl> -0.3, 0…
## $ avg_percent_staff_icu_beds_covid                               <dbl> 0.4, 4.…
## $ avg_percent_staff_icu_beds_covid_level                         <chr> "Minima…
## $ abs_chg_avg_percent_staff_icu_beds_covid                       <dbl> -0.8, -…

# Select
hospdata_MI = select(hospdata_MI, c("county", "week_end_date", "avg_percent_inpatient_beds_used_confirmed_covid"))

# Data after selecting
glimpse(hospdata_MI)

## Rows: 249
## Columns: 3
## $ county                                          <chr> "Monroe County", "Onto…
## $ week_end_date                                   <date> 2023-05-06, 2023-05-0…
## $ avg_percent_inpatient_beds_used_confirmed_covid <dbl> 1.5, 3.1, 3.1, 0.3, 0.…

Mutate

Mutate allows you to make a new column/variable in the dataset, for example if the data didn’t already have an admissions/100K column, we could generate it like this:

hospdata = mutate(hospdata, adm_per_100K = total_adm_all_covid_confirmed_past_7days/county_population)

Okay so what’s a pipe?

A pipe is really just a shorthand that lets you chain functions together in a readable way.

The pipe in R looks like this: %>%

The way it works is that the thing you feed into the pipe becomes the first input to the function after the pipe. This makes running a bunch of data tasks simpler to write. For example here’s some data operations with and without the pipe:

Without the pipe:

hospdata = read_csv("datasets/Weekly_United_States_COVID-19_Hospitalization_Metrics_by_County.csv")

hospdata_MI = filter(hospdata, state == "MI")

hospdata_MI = select(hospdata_MI, c("county", "week_end_date", "total_adm_all_covid_confirmed_past_7days", "county_population"))

hospdata_MI = mutate(hospdata, adm_per_100K = total_adm_all_covid_confirmed_past_7days/county_population)

With the pipe:

hospdata_MI = read_csv("datasets/Weekly_United_States_COVID-19_Hospitalization_Metrics_by_County.csv") %>%
  filter(state == "MI") %>%
  select(c("county", "week_end_date", "total_adm_all_covid_confirmed_past_7days", "county_population")) %>%
  mutate(adm_per_100K = total_adm_all_covid_confirmed_past_7days/county_population)

`group_by` and `summarize`

One of the most useful things to combine with the pipe is the group_by function!

This let’s you group your data by a particular variable and then run all your operations split by the groups. You can use this to calculate rolling averages, sums, means, etc. for your groups (e.g. if you wanted to add a new column that’s the rolling average of admissions for each county).

Another super useful one is summarize, which lets you summarize your dataset based on a variable (e.g. taking the mean, SD, sum, etc).

So for example, let’s group by county and then summarize total admissions and total admissions/100K:

hospdata_MI = read_csv("datasets/Weekly_United_States_COVID-19_Hospitalization_Metrics_by_County.csv") %>%
  filter(state == "MI") %>%
  group_by(county) %>%
  summarize(total_admissions = sum(total_adm_all_covid_confirmed_past_7days), 
            total_admissions_per_100K = sum(total_adm_all_covid_confirmed_past_7days_per_100k))

glimpse(hospdata_MI)

## Rows: 83
## Columns: 3
## $ county                    <chr> "Alcona County", "Alger County", "Allegan Co…
## $ total_admissions          <dbl> 1, 15, 12, 1, 14, 16, 3, 19, 16, 14, 15, 23,…
## $ total_admissions_per_100K <dbl> 1.6, 12.0, 2.9, 1.6, 7.0, 9.3, 6.6, 9.7, 9.3…

Lots more

spread and gather
distinct
sample_n

Merge and join

You’ll often want to merge datasets by a variable—the built in merge is actually pretty good for this, but tidyverse also has some special join commands that make things convenient:

IncomeData = read_csv('Datasets/StateIncomeData.csv')
LifespanData = read_csv('Datasets/StateLifeExpectancy.csv')
MergeData = full_join(IncomeData, LifespanData, by = "State")

glimpse(MergeData)

## Rows: 51
## Columns: 10
## $ Rank.x                    <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ State                     <chr> "District of Columbia", "Connecticut", "New …
## $ `Per capita income`       <dbl> 45877, 39373, 37288, 36593, 36338, 34691, 34…
## $ `Median household income` <dbl> 71648, 70048, 69160, 71919, 73971, 66532, 64…
## $ `Median family income`    <dbl> 84094, 88819, 87951, 88419, 89678, 80581, 78…
## $ Population                <dbl> 658893, 3596677, 8938175, 6938608, 5976407, …
## $ `Number of households`    <dbl> 277378, 1355817, 2549336, 3194844, 2165438, …
## $ `Number of families`      <dbl> 117864, 887263, 1610581, 2203675, 1445972, 3…
## $ Rank.y                    <dbl> 43, 3, 9, 5, 26, 8, 24, 6, 18, 34, 2, 11, 13…
## $ Life.Expectancy           <dbl> 76.5, 80.8, 80.3, 80.5, 78.8, 80.3, 79.0, 80…

dplyr, tidyr, and the pipe operator

MI Umbrella

2023-06-01

Intro

What can the `dplyr` and `tidyr` functions do?

Order the data

Filter rows

Select columns

Mutate

Okay so what’s a pipe?

`group_by` and `summarize`

Lots more

Merge and join

dplyr, tidyr, and the pipe operator

MI Umbrella

2023-06-01

Intro

What can the dplyr and tidyr functions do?

Order the data

Filter rows

Select columns

Mutate

Okay so what’s a pipe?

group_by and summarize

Lots more

Merge and join

What can the `dplyr` and `tidyr` functions do?

`group_by` and `summarize`