Data scraping/web scraping

  • Pulling data from the internet (web sites, social media, etc.)
  • Involves: crawling/searching, extraction, parsing, reformatting
  • Often two general approaches:
    • Scraping the site directly (note: this can be rude, since your program or bots send repeated requests to their server)
    • Use an API!
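
To make the extraction, parsing, and reformatting steps concrete, here is a minimal sketch in base R. The HTML snippet and its values are made up for illustration, and real pages are usually better handled with a dedicated HTML parser than with regular expressions.

```r
# Toy HTML standing in for a downloaded page (values are made up).
html <- '<ul><li class="county">Washtenaw: 12</li><li class="county">Wayne: 40</li></ul>'

# Extraction: pull out the text inside each <li class="county"> element.
matches <- gregexpr('(?<=<li class="county">)[^<]+', html, perl = TRUE)
items <- regmatches(html, matches)[[1]]

# Parsing: split each "name: value" string into its two parts.
parts <- strsplit(items, ": ", fixed = TRUE)

# Reformatting: assemble a data frame with a numeric column.
scraped <- data.frame(
  county = vapply(parts, `[`, character(1), 1),
  admissions = as.numeric(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
print(scraped)
```

An API skips most of this work because the server already returns structured data.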

What is an API

  • Application Programming Interface
  • A way for programs/software to communicate
  • Client/server - can be for web, operating system, databases, etc.
  • Web APIs
    • APIs for either web browser or web server
    • Twitter API, Google API, Facebook API…

Using APIs with R

  • Many R packages to make this easier:
    • rtweet, twitteR, Rfacebook, googleAuthR, googleAnalyticsR
  • Or call the API directly from R by making the web requests yourself
  • We will be using RSocrata to access public health data using the Socrata API!
  • Python also has very comprehensive libraries (many directly developed by the companies)
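
As a rough sketch of what "directly using R" can look like: many web APIs return JSON, which the jsonlite package (an assumption here, it is not otherwise used in these notes) can turn into a data frame. The JSON below is made up and stands in for a real response so the example runs without a network connection.

```r
library(jsonlite)

# Made-up JSON in the shape web APIs often return: an array of flat records.
# In a real call you would pass the endpoint URL to fromJSON() instead.
json_text <- '[
  {"state": "MI", "week_end_date": "2023-01-07", "total_adm": "12"},
  {"state": "MI", "week_end_date": "2023-01-14", "total_adm": "40"}
]'

# fromJSON() converts an array of flat records into a data frame.
records <- fromJSON(json_text)

# Values that are quoted in the JSON arrive as character; convert before doing math.
records$total_adm <- as.numeric(records$total_adm)
str(records)
```

Packages such as RSocrata wrap these steps (request, parse, convert) in a single function call.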

Socrata API

  • An API commonly used for public health and government data
    • data.cdc.gov
    • data.michigan.gov
  • Lets you pull this data directly into your R code as a data frame

Let’s try it out!



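library(RSocrata)  # provides read.socrata()

# Download the full dataset from the CDC's Socrata endpoint as a data frame
hospdata = read.socrata("https://data.cdc.gov/resource/akn2-qxic.json")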

Take a look through the data!
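
A few base R functions are handy for a first look. The sketch below uses a tiny stand-in data frame with made-up rows (only the column names come from this lesson) so it runs without downloading anything; the same calls work on hospdata.

```r
# Stand-in for the downloaded data: made-up rows, column names from this lesson.
hospdata_demo <- data.frame(
  state = c("MI", "MI", "OH"),
  week_end_date = as.Date(c("2023-01-07", "2023-01-14", "2023-01-07")),
  total_adm_all_covid_confirmed = c("12", "40", "25"),
  stringsAsFactors = FALSE
)

dim(hospdata_demo)          # number of rows and columns
names(hospdata_demo)        # column names you could filter on
str(hospdata_demo)          # column types (admissions is character here)
head(hospdata_demo)         # first few rows
table(hospdata_demo$state)  # which values a column takes, and how often
```

Checking names() and table() first tells you which columns and values are valid to use in a filter.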

Filtering your query

You can also filter the data before you pull it. This can save a lot of time when the dataset is big and you don't actually need all of it!

For example, if we only want Michigan hospitalizations, we can add ?state=MI to the end of our API endpoint to keep only the rows that have MI as the state. (Note: look at your data first to make sure the query makes sense!)

hospdata = read.socrata("https://data.cdc.gov/resource/akn2-qxic.json?state=MI")
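
If you build filtered URLs often, a small helper can assemble the query string for you. This is only a sketch: add_filters() is a hypothetical function written for this example, not part of RSocrata, and it handles only simple column=value filters.

```r
# Hypothetical helper (not part of RSocrata): append simple column=value
# filters to a Socrata endpoint URL.
add_filters <- function(endpoint, filters) {
  if (length(filters) == 0) return(endpoint)
  # URL-encode each value so spaces and special characters are safe in a URL.
  values <- vapply(
    filters,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste0(names(filters), "=", values, collapse = "&")
  paste0(endpoint, "?", query)
}

endpoint <- "https://data.cdc.gov/resource/akn2-qxic.json"
mi_url <- add_filters(endpoint, list(state = "MI"))
print(mi_url)
```

The resulting URL matches the one typed by hand above and can be passed to read.socrata() in the same way.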

Example plot


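library(dplyr)    # provides %>%, group_by(), summarize()
library(ggplot2)  # provides ggplot(), geom_col()

# RSocrata read the admissions data as a character instead of a number! Fix that:
hospdata$total_adm_all_covid_confirmed = as.numeric(hospdata$total_adm_all_covid_confirmed)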

# Make a summarized data set of total admissions by date across all counties
plotdata = hospdata %>%
  group_by(week_end_date) %>%
  summarize(total_adm = sum(total_adm_all_covid_confirmed, na.rm = TRUE))  # skip missing values

# Plot it!
ggplot(plotdata) + 
  geom_col(aes(x = week_end_date, y = total_adm))
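
One thing to watch when converting and summing: if any admissions values are missing, sum() returns NA unless told to skip them. The sketch below shows this in base R with made-up values, and computes weekly totals with tapply() as a rough base-R analogue of group_by() plus summarize().

```r
# Made-up admissions values as they might arrive from an API: text, with a gap.
week <- c("2023-01-07", "2023-01-07", "2023-01-14", "2023-01-14")
adm_chr <- c("12", "25", NA, "40")

# sum(adm_chr) would fail because sum() does not accept character input,
# so convert to numeric first.
adm_num <- as.numeric(adm_chr)

sum(adm_num)                # NA, because one value is missing
sum(adm_num, na.rm = TRUE)  # 77, ignoring the missing value

# Weekly totals: apply sum() within each week, skipping missing values.
weekly <- tapply(adm_num, week, sum, na.rm = TRUE)
print(weekly)
```

Whether skipping missing values is appropriate depends on why they are missing, so it is worth checking how many there are before summarizing.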