2023-06-01

ggplot basics

  • ggplot follows a “grammar of graphics” that is a little different from the rest of R’s coding structure
  • Basic components of a ggplot: a dataset (dataframe), aesthetics (aes), geoms, and formatting
    • ggplot takes dataframes as the basic input, not an x vector and y vector
    • geoms are different types of plot objects that you can add to the plot (e.g. points, lines, bars, etc.)
    • The aes (short for aesthetics) command tells ggplot which variables in the dataset represent the x values, y values, color, size, etc.
      • You can set aes either in the main ggplot call or within a geom
  • The official ggplot cheatsheet is great! Use this as a quick reference once you get the general idea
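
As a minimal sketch of how these components fit together, the example below uses R's built-in mtcars dataset (an assumption made here so the code runs without any downloads; it is not part of the workshop data). It sets aes once in the main ggplot call and once inside a geom, and both versions should produce the same scatterplot.

```r
library(ggplot2)

# mtcars ships with R; it stands in for any dataframe

# Option 1: aes in the main ggplot call, shared by every geom added afterwards
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Option 2: aes inside the geom, applying only to that geom
p2 <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg)) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p1)  # p2 draws the same plot
```

Setting aes in the main call saves typing when every geom uses the same variables; setting it per geom is handy when the geoms need different ones.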

Let’s try an example!

Simple example - Bureau of Transportation Statistics mobility data

Try running this code for yourself! (Be sure you have downloaded the datasets folder and placed it in your working directory first!)

library(ggplot2)
library(readr)

# Load data from csv
mobilityData = read_csv('datasets/Trips_by_Distance.csv')

# Calculate the percent of the population staying home
mobilityData$PercentHome = 
  100 * mobilityData$`Population Staying at Home` /
  (mobilityData$`Population Staying at Home` + mobilityData$`Population Not Staying at Home`)

Plot code

ggplot(mobilityData, aes(x = Date, y = PercentHome)) +
  
  geom_point() + 
  
  labs(title="Percent of Michiganders staying home over 2019 - 2020", 
       x="", y="Percent of population staying home")

You should see something like this:

Okay, let’s fancy it up a bit!

Let’s also add a rolling average (this will be another geom!), and format our colors a bit more.

Since we’ll have different y axis variables for our different geoms (regular and rolling average), we’ll move the aes command into our geoms.

Let’s also set the colors for our plot using a hex code!

Quick primer on hex colors

Just in case you haven’t seen hex colors before!

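  • They’re formatted like this: #RRGGBB (or sometimes #RRGGBBAA, where the last two digits set the transparency)

  • The numbering system is hexadecimal, so each digit goes:
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F
    and each two-digit pair runs from 00 (0) to FF (255)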

  • For example: #008800, #00FF00, #9500AB, #00FFFF, #00AAAA
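
If it helps to see the numbers behind a hex code, here is a small sketch using base R functions only. It decodes the #2255AA color used in the plots in this section; nothing here is specific to ggplot.

```r
# Each pair of hex digits is one color channel from 00 (0) to FF (255)
strtoi("AA", base = 16L)                 # 170
strtoi(c("22", "55", "AA"), base = 16L)  # 34 85 170

# col2rgb() decodes a full hex color into its red, green and blue values
channels <- col2rgb("#2255AA")[, 1]
channels                                 # red 34, green 85, blue 170

# rgb() goes the other way and builds the hex code
hex_code <- rgb(34, 85, 170, maxColorValue = 255)
hex_code                                 # "#2255AA"
```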

Plot code

ggplot(mobilityData) +
  
  geom_point(aes(x = Date, y = PercentHome), color = "#2255AA", alpha = 0.25) + 
  
  geom_line(aes(x = Date, y = zoo::rollmean(PercentHome, 7, fill = NA)), 
            color = "#2255AA", linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  
  labs(title="Percent of Michiganders staying home over 2019 - 2020", 
       x="", y="Percent of population staying home")
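
To see what the rolling-average line is built from, here is a small sketch of zoo::rollmean on a made-up vector (the numbers are illustrative, not from the mobility data). With fill = NA the result keeps the same length as the input, which is what lets it sit next to Date as a y aesthetic.

```r
# Illustrative values only
x <- c(1, 2, 3, 4, 5, 6, 7)

# Centered 3-point rolling mean; positions without a full window become NA
smoothed <- zoo::rollmean(x, 3, fill = NA)
smoothed
# NA 2 3 4 5 6 NA
```

With a 7-day window, the first and last three days have no full window, so ggplot will typically warn that it removed a few rows with missing values. That is expected here.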

You should see something like this:

Your turn!

Exercise: Adjust the code to plot the number of trips taken each day over time (the `Number of Trips` column), and change the color of the lines and points! (Don’t forget to adjust the labels too.) You should see something like:

Lines & annotations

You can add horizontal and vertical lines with geom_hline and geom_vline, and you can add text annotations with geom_text:

ggplot(mobilityData) +
  geom_point(aes(x = Date, y = PercentHome), color = "#2255AA", alpha = 0.25) + 
  geom_line(aes(x = Date, y = zoo::rollmean(PercentHome, 7, fill = NA)), 
            color = "#2255AA", linewidth = 1) + 
  
  geom_vline(aes(xintercept = as.Date("2020-03-01"))) + 
     # Note we only give it an x intercept value since the rest is already set!
  
  geom_text(aes(x = as.Date("2020-03-01"), y = 40), 
            label = "Pandemic starts", hjust = "left") + 
  
  labs(title="Percent of Michiganders staying home over 2019 - 2020", x="", 
       y="Percent of population staying home")
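
One caveat, offered as a suggestion rather than a correction: when a constant is placed inside aes, ggplot draws the line or label once for every row of the dataset, which can make text look heavy. A possible alternative is to pass xintercept directly to geom_vline and to use annotate for one-off text. The sketch below uses a small made-up dataframe (demo and its values are hypothetical, not the mobility data) so that it runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in data so the sketch runs without the workshop files
demo <- data.frame(
  Date = seq(as.Date("2020-01-01"), as.Date("2020-06-30"), by = "day")
)
demo$PercentHome <- 20 + 10 * (demo$Date >= as.Date("2020-03-15"))

p <- ggplot(demo, aes(x = Date, y = PercentHome)) +
  geom_point(color = "#2255AA", alpha = 0.25) +

  # A constant outside aes() draws a single vertical line
  geom_vline(xintercept = as.Date("2020-03-01")) +

  # annotate() draws the label once; hjust = 0 left-aligns it at x
  annotate("text", x = as.Date("2020-03-01"), y = 35,
           label = "Pandemic starts", hjust = 0)

print(p)
```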

Formatting: scales

You can use the scales functions to set up how different dimensions of the data behave—the x and y axis behaviors, as well as how line colors and shape fills are set up (more on this in a bit).

For example, let’s make the x-axis breaks occur every 3 months and change the format.

Quick primer on date format strings: UNIX strftime date formats
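
As a quick illustration of how these format strings behave, base R's format function accepts the same strftime codes. The date below is just an example value. Month names such as %b depend on your system locale, so only the locale-independent results are stated as exact.

```r
d <- as.Date("2020-03-01")

format(d, "%Y-%m-%d")  # "2020-03-01" (4-digit year, month, day)
format(d, "%m/%d/%y")  # "03/01/20"   (2-digit year)
format(d, "%b %y")     # abbreviated month name + 2-digit year (e.g. "Mar 20" in an English locale)
format(d, "%B %Y")     # full month name + 4-digit year
```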

Okay, let’s add a scale:

Scale code

ggplot(mobilityData) +
  geom_point(aes(x = Date, y = PercentHome), color = "#2255AA", alpha = 0.25) + 
  geom_line(aes(x = Date, y = zoo::rollmean(PercentHome, 7, fill = NA)), 
            color = "#2255AA", linewidth = 1) + 
  
  geom_vline(aes(xintercept = as.Date("2020-03-01"))) + 
     # Note we only give it an x intercept value since the rest is already set!
  
  geom_text(aes(x = as.Date("2020-03-01"), y = 40), 
            label = "Pandemic starts", hjust = "left") + 
  
  ### New scale ###
  scale_x_date(date_breaks = "3 months", date_labels = "%b %y") + 
  
  labs(title="Percent of Michiganders staying home over 2019 - 2020", x="", 
       y="Percent of population staying home")

A fancier example

ggplot(mobilityData) +
  geom_line(aes(x = Date, y = zoo::rollmean(PercentHome, 7, fill = NA), 
                color = "2020"), linewidth = 1) + 
  
  # shift date by 1 year so we can plot 2019 and 2020 on the same 1 year span
  geom_line(aes(x = Date + 365, y = zoo::rollmean(PercentHome, 7, fill = NA), 
                color = "2019"), linewidth = 1) + 
  
  scale_x_date(date_breaks = "3 months", date_labels = "%b %Y", 
               limits = c(as.Date("2020-01-01"), as.Date("2020-12-31")) ) + 
  
  # scale_y_continuous(limits = c(0,40)) + # try this if you want to play with the y axis too!
  
  scale_color_manual(values = c("#264653", "#2a9d8f", "#e9c46a", "#f4a261", "#e76f51")) + 
  
  labs(title="Percent of Michiganders staying home for 2019 vs. 2020", x="", 
       y="Percent of population staying home", color = "")
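
In the plot above, mapping color inside aes is what creates the legend, and scale_color_manual then hands out its colors to the legend entries in order. A hedged variation is to name the values so that each label is tied to a specific color. The sketch below uses made-up data (demo and its columns are hypothetical) to show the idea.

```r
library(ggplot2)

# Made-up data: one value per month for two "years"
demo <- data.frame(
  Month = rep(1:12, times = 2),
  Value = c(20 + (1:12) * 0.1, 25 + (1:12) * 0.5),
  Year  = rep(c("2019", "2020"), each = 12)
)

p <- ggplot(demo) +
  geom_line(aes(x = Month, y = Value, color = Year)) +
  # Names on the left match the legend labels, so the order no longer matters
  scale_color_manual(values = c("2020" = "#e76f51", "2019" = "#264653")) +
  labs(color = "")

print(p)
```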

Formatting: themes

For overall formatting, you can also add themes! There are a ton of different ones: check out the cheatsheet for more. The ggthemes package has even more options!

Themes also let you adjust overall properties, like font size for the whole plot, etc.

For example, let’s add a different theme to the last plot. Try one of: theme_bw(), theme_gray() (default theme), theme_dark(), theme_classic(), theme_light(), theme_linedraw(), theme_minimal(), or theme_void()

I’ll add this to my ggplot: theme_classic(base_size = 14) to change themes and increase the overall font size
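
Here is a minimal sketch of adding a theme, using the built-in mtcars dataset as a stand-in (an assumption so the code runs on its own; swap in your own plot). Themes are added with + just like geoms and scales.

```r
library(ggplot2)

# mtcars is a built-in dataset, used here only as a stand-in
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "#2255AA") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# base_size sets the overall font size for the whole plot
p_classic <- p + theme_classic(base_size = 14)

# Swapping themes is a one-line change
p_minimal <- p + theme_minimal(base_size = 14)

print(p_classic)
```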

Other kinds of geoms

There are so many other kinds of geoms! The cheatsheet has a fuller list but a lot of the common ones are:

  • histograms
  • column vs bar
  • boxplot, violin
  • geom_smooth, geom_tile, geom_path, geom_ribbon
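
As a rough sketch of a few of these, the example below uses the built-in mtcars dataset (a stand-in chosen so the code is self-contained, not the workshop data).

```r
library(ggplot2)

# factor(cyl) turns the numeric cylinder count into discrete groups
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "#44AAAA") +
  labs(x = "Cylinders", y = "Miles per gallon")

# Same mapping, different geom: a violin shows the shape of each distribution
p_violin <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin(fill = "#44AAAA") +
  labs(x = "Cylinders", y = "Miles per gallon")

# geom_smooth adds a fitted trend; method = "lm" requests a straight-line fit
p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

print(p_box)
```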

Examples

ggplot(mobilityData) +
  geom_histogram(aes(`Number of Trips`))

Examples

ggplot(mobilityData) +
  geom_col(aes(x= Date, y = `Number of Trips`), fill = "#44AAAA") + 
  # note fill vs color
  scale_x_date(limits = c(as.Date("2020-03-01"), as.Date("2020-05-31")) ) + 
  theme_linedraw()

Labels

library(ggrepel)

# Load & merge data
IncomeData = read_csv('datasets/StateIncomeData.csv')
LifespanData = read_csv('datasets/StateLifeExpectancy.csv')
MergeData = merge(IncomeData, LifespanData, by = "State")

# Plot!
ggplot(MergeData) +
  geom_point(aes(x = `Median household income`, y = Life.Expectancy, 
                 size = Population/1000000), color = 'steelblue') +
  
  geom_label_repel(aes(x = `Median household income`,y = Life.Expectancy, label = State)) + 
  
  # geom_label(aes(x = `Median household income`+1000,y = Life.Expectancy, label = State), 
  #            hjust = "left") +
  
  labs(size = "Population (millions)")

Faceting

Faceting lets you make panel plots based on the variables in your data (e.g. make a plot of cases and facet_wrap will let you make a panel of case plots for all counties/preparedness regions/etc.). See the cheatsheet for more!
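
A minimal faceting sketch, assuming the mpg dataset that ships with ggplot2 as a stand-in for case data: facet_wrap takes a formula naming the variable to split on and draws one panel per value.

```r
library(ggplot2)

# mpg ships with ggplot2; `class` is the vehicle type (compact, suv, ...)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "#2255AA", alpha = 0.5) +
  facet_wrap(~ class) +   # one panel per vehicle class
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon")

print(p)
```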

Saving plots to a file

ggsave("plot.png", width = 5, height = 5) saves the last plot as a 5 in x 5 in file named “plot.png” in the working directory (inches are the default unit). You can use all the usual image file formats (.jpg, etc.) and ggsave will work out the format from the file extension.
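
A slightly more explicit sketch, in case you prefer not to rely on "the last plot": store the plot in a variable and pass it to ggsave. The mtcars plot and the temporary output path are assumptions made so the example is self-contained.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# tempdir() keeps this sketch from cluttering the working directory;
# in practice you would use a path such as "plot.png"
out_file <- file.path(tempdir(), "plot.png")

# plot = p names the plot to save; units and dpi are spelled out for clarity
ggsave(out_file, plot = p, width = 5, height = 5, units = "in", dpi = 300)

file.exists(out_file)  # TRUE if the file was written
```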