The packages dplyr
and tidyr
(along
with a bunch of the other things in the tidyverse
) are
built to help make processing data quick and easy in R.
A “tidy” dataset should be organized like this:
This makes running “vectorized” functions on our dataset easy, since we can always just work with the columns.
Another great cheatsheet for these
dplyr
and tidyr
functions
do?library(tidyverse)
hospdata = read_csv("datasets/Weekly_United_States_COVID-19_Hospitalization_Metrics_by_County.csv")
glimpse(hospdata)
## Rows: 9,672
## Columns: 22
## $ state <chr> "Tennes…
## $ county <chr> "Greene…
## $ fips_code <chr> "47059"…
## $ county_population <dbl> 69069, …
## $ health_sa_name <chr> "Greene…
## $ health_sa_number <dbl> 252, 25…
## $ health_sa_population <dbl> 105073,…
## $ report_date <date> 2023-0…
## $ week_end_date <date> 2023-0…
## $ mmwr_report_week <dbl> 20, 20,…
## $ mmwr_report_year <dbl> 2023, 2…
## $ total_adm_all_covid_confirmed_past_7days <dbl> 0, 0, 6…
## $ total_adm_all_covid_confirmed_past_7days_per_100k <dbl> 0.0, 0.…
## $ total_adm_all_covid_confirmed_past_7days_per_100k_level <chr> "Low (<…
## $ admissions_covid_confirmed_week_over_week_percent_change <dbl> NA, NA,…
## $ admissions_covid_confirmed_week_over_week_percent_change_level <chr> "Insufi…
## $ avg_percent_inpatient_beds_used_confirmed_covid <dbl> 0.1, 0.…
## $ avg_percent_inpatient_beds_used_confirmed_covid_level <chr> "Minima…
## $ abs_chg_avg_percent_inpatient_beds_occupied_covid_confirmed <dbl> -0.5, -…
## $ avg_percent_staff_icu_beds_covid <dbl> 0.0, 0.…
## $ avg_percent_staff_icu_beds_covid_level <chr> "Minima…
## $ abs_chg_avg_percent_staff_icu_beds_covid <dbl> 0.0, 0.…
Order the data by epiweek (oh, reminder there’s an
epiweek
function in lubridate that converts dates to
epiweeks in case that’s useful!).
hospdata = arrange(hospdata, mmwr_report_week)
Filter rows with the filter
function!
For example, only Michigan data:
hospdata_MI = filter(hospdata, state == "MI")
Select only the variables we care about:
# Data before selecting
glimpse(hospdata_MI)
## Rows: 249
## Columns: 22
## $ state <chr> "MI", "…
## $ county <chr> "Monroe…
## $ fips_code <chr> "26115"…
## $ county_population <dbl> 150500,…
## $ health_sa_name <chr> "Lucas …
## $ health_sa_number <dbl> 276, 28…
## $ health_sa_population <dbl> 792316,…
## $ report_date <date> 2023-0…
## $ week_end_date <date> 2023-0…
## $ mmwr_report_week <dbl> 18, 18,…
## $ mmwr_report_year <dbl> 2023, 2…
## $ total_adm_all_covid_confirmed_past_7days <dbl> 35, 6, …
## $ total_adm_all_covid_confirmed_past_7days_per_100k <dbl> 4.4, 6.…
## $ total_adm_all_covid_confirmed_past_7days_per_100k_level <chr> "Low (<…
## $ admissions_covid_confirmed_week_over_week_percent_change <dbl> 6.1, -1…
## $ admissions_covid_confirmed_week_over_week_percent_change_level <chr> "Stable…
## $ avg_percent_inpatient_beds_used_confirmed_covid <dbl> 1.5, 3.…
## $ avg_percent_inpatient_beds_used_confirmed_covid_level <chr> "Minima…
## $ abs_chg_avg_percent_inpatient_beds_occupied_covid_confirmed <dbl> -0.3, 0…
## $ avg_percent_staff_icu_beds_covid <dbl> 0.4, 4.…
## $ avg_percent_staff_icu_beds_covid_level <chr> "Minima…
## $ abs_chg_avg_percent_staff_icu_beds_covid <dbl> -0.8, -…
# Select
hospdata_MI = select(hospdata_MI, c("county", "week_end_date", "avg_percent_inpatient_beds_used_confirmed_covid"))
# Data after selecting
glimpse(hospdata_MI)
## Rows: 249
## Columns: 3
## $ county <chr> "Monroe County", "Onto…
## $ week_end_date <date> 2023-05-06, 2023-05-0…
## $ avg_percent_inpatient_beds_used_confirmed_covid <dbl> 1.5, 3.1, 3.1, 0.3, 0.…
Mutate allows you to make a new column/variable in the dataset, for example if the data didn’t already have an admissions/100K column, we could generate it like this:
hospdata = mutate(hospdata, adm_per_100K = total_adm_all_covid_confirmed_past_7days/county_population)
A pipe is really just a shorthand that lets you chain functions together in a readable way.
The pipe in R looks like this: %>%
The way it works is that the thing you feed into the pipe becomes the first input to the function after the pipe. This makes running a bunch of data tasks simpler to write. For example here’s some data operations with and without the pipe:
Without the pipe:
hospdata = read_csv("datasets/Weekly_United_States_COVID-19_Hospitalization_Metrics_by_County.csv")
hospdata_MI = filter(hospdata, state == "MI")
hospdata_MI = select(hospdata_MI, c("county", "week_end_date", "total_adm_all_covid_confirmed_past_7days", "county_population"))
hospdata_MI = mutate(hospdata, adm_per_100K = total_adm_all_covid_confirmed_past_7days/county_population)
With the pipe:
hospdata_MI = read_csv("datasets/Weekly_United_States_COVID-19_Hospitalization_Metrics_by_County.csv") %>%
filter(state == "MI") %>%
select(c("county", "week_end_date", "total_adm_all_covid_confirmed_past_7days", "county_population")) %>%
mutate(adm_per_100K = total_adm_all_covid_confirmed_past_7days/county_population)
group_by
and summarize
One of the most useful things to combine with the pipe is the
group_by
function!
This let’s you group your data by a particular variable and then run all your operations split by the groups. You can use this to calculate rolling averages, sums, means, etc. for your groups (e.g. if you wanted to add a new column that’s the rolling average of admissions for each county).
Another super useful one is summarize
, which lets you
summarize your dataset based on a variable (e.g. taking the mean, SD,
sum, etc).
So for example, let’s group by county and then summarize total admissions and total admissions/100K:
hospdata_MI = read_csv("datasets/Weekly_United_States_COVID-19_Hospitalization_Metrics_by_County.csv") %>%
filter(state == "MI") %>%
group_by(county) %>%
summarize(total_admissions = sum(total_adm_all_covid_confirmed_past_7days),
total_admissions_per_100K = sum(total_adm_all_covid_confirmed_past_7days_per_100k))
glimpse(hospdata_MI)
## Rows: 83
## Columns: 3
## $ county <chr> "Alcona County", "Alger County", "Allegan Co…
## $ total_admissions <dbl> 1, 15, 12, 1, 14, 16, 3, 19, 16, 14, 15, 23,…
## $ total_admissions_per_100K <dbl> 1.6, 12.0, 2.9, 1.6, 7.0, 9.3, 6.6, 9.7, 9.3…
spread
and gather
distinct
sample_n
You’ll often want to merge datasets by a variable—the built in
merge
is actually pretty good for this, but
tidyverse
also has some special join commands that make
things convenient:
IncomeData = read_csv('Datasets/StateIncomeData.csv')
LifespanData = read_csv('Datasets/StateLifeExpectancy.csv')
MergeData = full_join(IncomeData, LifespanData, by = "State")
glimpse(MergeData)
## Rows: 51
## Columns: 10
## $ Rank.x <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ State <chr> "District of Columbia", "Connecticut", "New …
## $ `Per capita income` <dbl> 45877, 39373, 37288, 36593, 36338, 34691, 34…
## $ `Median household income` <dbl> 71648, 70048, 69160, 71919, 73971, 66532, 64…
## $ `Median family income` <dbl> 84094, 88819, 87951, 88419, 89678, 80581, 78…
## $ Population <dbl> 658893, 3596677, 8938175, 6938608, 5976407, …
## $ `Number of households` <dbl> 277378, 1355817, 2549336, 3194844, 2165438, …
## $ `Number of families` <dbl> 117864, 887263, 1610581, 2203675, 1445972, 3…
## $ Rank.y <dbl> 43, 3, 9, 5, 26, 8, 24, 6, 18, 34, 2, 11, 13…
## $ Life.Expectancy <dbl> 76.5, 80.8, 80.3, 80.5, 78.8, 80.3, 79.0, 80…