General idea

  • Thanks to Yu-Han Kao; some of these materials are adapted from hers!

Data scraping/web scraping

  • Pulling data from the internet (web sites, social media, etc.)
  • Involves: crawling/searching, extraction, parsing, reformatting
  • Often two general approaches:
    • Scraping directly (note: possibly rude, since your program/bot(s) will send requests to their server)
    • Use an API!
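
The extraction, parsing, and reformatting steps listed above can be sketched in base R. This is only a minimal illustration: the HTML fragment is made up and stands in for a page that has already been downloaded, and the regular expression assumes very simple, well-formed links. For real pages a dedicated HTML parser (for example the rvest package) is usually the safer choice.

```r
# A made-up HTML fragment standing in for a downloaded page (hypothetical content).
html <- paste0(
  "<ul>",
  "<li><a href=\"/post/1\">First post</a></li>",
  "<li><a href=\"/post/2\">Second post</a></li>",
  "</ul>"
)

# Extraction: find every simple <a href="...">...</a> element.
link_pattern <- "<a href=\"[^\"]*\">[^<]*</a>"
links <- regmatches(html, gregexpr(link_pattern, html))[[1]]

# Parsing: pull the URL and the link text out of each match.
urls   <- sub("<a href=\"([^\"]*)\">[^<]*</a>", "\\1", links)
titles <- sub("<a href=\"[^\"]*\">([^<]*)</a>", "\\1", links)

# Reformatting: collect the pieces into a data frame for analysis.
scraped <- data.frame(url = urls, title = titles, stringsAsFactors = FALSE)
print(scraped)
```

Crawling (deciding which pages to request, and how often) is left out here on purpose, since that is the step that puts load on someone else's server.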

What is an API

  • Application Programming Interface
  • A way for programs/software to communicate
  • Client/server - can be for web, operating system, databases, etc.
  • Web APIs
    • APIs for either web browser or web server
    • Twitter API, Google API, Facebook API…

Using APIs with R

  • Many R packages to make this easier:
    • rtweet, twitteR, Rfacebook, googleAuthR, googleAnalyticsR
  • Using R directly (making the web requests yourself)
  • Python also has very comprehensive libraries (many directly developed by the companies)
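
As a rough sketch of what using R directly might involve, a web API request is often just a URL with query parameters attached. The endpoint `https://api.example.com/search` and the parameter names `q` and `count` below are hypothetical, chosen only for illustration; a real API documents its own.

```r
# Hypothetical endpoint and parameters, for illustration only.
base_url <- "https://api.example.com/search"
params <- list(q = "data science", count = 10)

# Percent-encode each value so spaces and symbols are safe inside a URL.
encoded <- vapply(
  params,
  function(v) utils::URLencode(as.character(v), reserved = TRUE),
  character(1)
)

# Join the pairs as name=value&name=value and attach them to the endpoint.
query <- paste(names(encoded), encoded, sep = "=", collapse = "&")
request_url <- paste0(base_url, "?", query)
print(request_url)

# Sending the request is not done here (the endpoint is not real).
# With the httr package installed, it would typically be httr::GET(request_url).
```

The packages listed above wrap this kind of URL building, plus authentication and response parsing, so that you rarely need to do it by hand.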

Getting data from Twitter

  • Getting credentials: consumer key & secret, access token & token secret
    • Set up a Twitter account
    • Create New App - App name, Description, Website, Callback URL:
      ( = localhost, your computer)
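
Once you have the four credentials, one common way to keep them out of your scripts is to store them in environment variables and read them at run time. The sketch below assumes that approach; the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `read_credentials()` are hypothetical, not part of any package.

```r
# Hypothetical environment variable names; use whatever names you prefer.
cred_names <- c(
  consumer_key    = "TWITTER_CONSUMER_KEY",
  consumer_secret = "TWITTER_CONSUMER_SECRET",
  access_token    = "TWITTER_ACCESS_TOKEN",
  access_secret   = "TWITTER_ACCESS_SECRET"
)

# Read the four values and fail early if any of them is not set.
read_credentials <- function(var_names = cred_names) {
  # Sys.getenv() returns "" (the `unset` value) for variables that are missing.
  values <- Sys.getenv(var_names, unset = "")
  names(values) <- names(var_names)
  missing <- names(values)[values == ""]
  if (length(missing) > 0) {
    stop("Missing credentials: ", paste(missing, collapse = ", "))
  }
  as.list(values)
}
```

The variables could be set in your `~/.Renviron` file (one `NAME=value` per line), after which `creds <- read_credentials()` gives a named list whose elements can be passed to whichever Twitter package you use.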

Getting data from Twitter