10/30/2017

General idea

Data scraping/web scraping

What is an API

  • Application Programming Interface
    • a way for software programs to communicate with each other
    • restaurant example
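
The restaurant analogy can be sketched in R: the customer (client code) only talks to the waiter (the interface) and never touches the kitchen. Every name below (`make_restaurant`, `order`, the menu items) is hypothetical and exists only for illustration.

```r
# Hypothetical illustration of the restaurant analogy; none of these names
# come from a real package.
make_restaurant <- function() {
  # The "kitchen": internal data the customer never touches directly.
  menu <- c(pizza = 12, salad = 8)

  # The "waiter": the only way to talk to the kitchen.
  order <- function(item) {
    if (!item %in% names(menu)) {
      return(list(status = "NOT_ON_MENU", item = item, price = NA_real_))
    }
    list(status = "OK", item = item, price = unname(menu[item]))
  }

  list(order = order)  # expose the interface, hide the kitchen
}

restaurant <- make_restaurant()
ok_reply   <- restaurant$order("pizza")
bad_reply  <- restaurant$order("sushi")
print(ok_reply$status)
print(bad_reply$status)
```

The caller gets a structured reply with a status field either way, which is roughly how most web APIs report success or failure.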

What is an API

  • Web APIs
    • APIs for either a web browser or a web server
    • Twitter API, Google API, FB API…
    • Expedia example
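
A web API request is usually just a URL with named parameters attached. The sketch below assembles one in base R; the endpoint `https://api.example.com/flights` and the parameter names are placeholders, not a real travel service, and `build_request` is a made-up helper.

```r
# Sketch: assemble a web API request URL from named parameters.
# The endpoint used below is a placeholder, not a real service.
build_request <- function(endpoint, params) {
  # Percent-encode each value so spaces and symbols are safe in a URL.
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste(names(params), encoded, sep = "=", collapse = "&")
  paste0(endpoint, "?", query)
}

request_url <- build_request(
  "https://api.example.com/flights",
  list(from = "DTW", to = "New York", date = "2017-10-30")
)
print(request_url)
```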

Using APIs with R

  • Using R packages for interacting with specific APIs
    • twitteR, Rfacebook, googleAuthR, googleAnalyticsR…

  • Directly using R
    • Google Maps

Getting data from Twitter

  • Getting credentials: consumer key & secret, access token & token secret

Getting data from Twitter

install twitteR: install.packages("twitteR")


library(twitteR)
consumer_key <- "your consumer key"
consumer_secret <- "your consumer secret"
access_token <- "your token"
access_secret <- "your secret"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
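
Hard-coding keys in a script makes them easy to leak when the script is shared. One common alternative, sketched here under assumptions, is to read them from environment variables. The variable names (`TWITTER_CONSUMER_KEY`, `TWITTER_CONSUMER_SECRET`) and the helper `read_credentials` are made up for this example.

```r
# Sketch: read API credentials from environment variables.
# The variable names are assumptions; any names can be used.
read_credentials <- function(var_names) {
  values <- Sys.getenv(var_names, unset = NA)
  missing_vars <- var_names[is.na(values)]
  if (length(missing_vars) > 0) {
    stop("Missing environment variables: ",
         paste(missing_vars, collapse = ", "))
  }
  values
}

# For demonstration only: set dummy values so the sketch runs on its own.
Sys.setenv(TWITTER_CONSUMER_KEY = "dummy-key",
           TWITTER_CONSUMER_SECRET = "dummy-secret")

creds <- read_credentials(c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET"))
print(names(creds))
```

The values in `creds` could then be passed to `setup_twitter_oauth()` in place of the string literals.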

Getting data from Twitter

following/follower example


seed_user <- getUser('@chekao')
seed_user$getFollowerIDs()

tweets<-searchTwitter('vaccines AND autism', n=3)
target_users<-sapply(tweets, function(x) x$screenName)
target_tweets<-sapply(tweets, function(x) x$getId())
#retweet related functions are buggy, need to report

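# for each tweet, look up its author, the author's location, and the author's followers
for (f in seq_along(target_tweets)){
  temp_name<-paste("@", target_users[f], sep="")
  temp_seed<-getUser(temp_name)
  temp_location<-temp_seed$getLocation()
  temp_followers<-try(temp_seed$getFollowerIDs()) # try() so one failed lookup does not stop the loop
  # store the result in a variable named <screen name>_<tweet id>
  assign(paste(target_users[f],target_tweets[f],sep="_"),
         list(user=temp_seed, location=temp_location, followers=temp_followers))
}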

# try different search operators (check twitter search)
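
The loop above creates one variable per tweet with assign(). An alternative, sketched here, is to collect everything in a single named list, which is easier to iterate over afterwards. The data and `fake_lookup` are made-up stand-ins for the twitteR calls so that the sketch runs without credentials.

```r
# Sketch: keep per-tweet results in one named list instead of using assign().
# The vectors and lookup function are made-up stand-ins for the twitteR calls.
target_users  <- c("user_a", "user_b")
target_tweets <- c("1001", "1002")

# Hypothetical stand-in for getUser() / getLocation() / getFollowerIDs().
fake_lookup <- function(screen_name) {
  list(location = "somewhere", followers = c("11", "22"))
}

results <- list()
for (f in seq_along(target_tweets)) {
  key <- paste(target_users[f], target_tweets[f], sep = "_")
  # tryCatch keeps the loop going if one lookup fails.
  results[[key]] <- tryCatch(
    fake_lookup(target_users[f]),
    error = function(e) NULL
  )
}

print(names(results))
```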

Getting data from FB

  • Getting credentials: app ID & app secret
    • Creating an App https://developers.facebook.com/
    • Note: need to fill in the redirect url in fb log-in
    • can only get very limited data (from users who are using the app)

library(Rfacebook)
fb_oauth <- fbOAuth(
  app_id="your app id",
  app_secret="your app secret",
  extended_permissions = TRUE)

save(fb_oauth, file="token") #next time you can just load the token
#load("token")
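
The save-then-load pattern above can be wrapped in a small helper that reuses a stored token when one exists. This is only a sketch: `get_cached_token` and `fake_create` are hypothetical, `fake_create` stands in for the fbOAuth() call, and saveRDS()/readRDS() are used here instead of save()/load().

```r
# Sketch: reuse a saved token if one exists, otherwise create and save it.
# `create_token` is a hypothetical stand-in for the fbOAuth() call.
get_cached_token <- function(path, create_token) {
  if (file.exists(path)) {
    return(readRDS(path))   # reuse the token from a previous session
  }
  token <- create_token()
  saveRDS(token, path)      # cache it for next time
  token
}

token_path <- tempfile(fileext = ".rds")
calls <- 0
fake_create <- function() {
  calls <<- calls + 1       # count how often a new token is requested
  "fake-token"
}

first  <- get_cached_token(token_path, fake_create)
second <- get_cached_token(token_path, fake_create)
print(calls)
```

The second call reads the file, so the token is only requested once.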

Getting data from FB

getUsers("me",token=fb_oauth) #public profile
head(getLikes(user="me", token=fb_oauth)) #likes
updateStatus("hiho", fb_oauth) #fb post via R
#yhk_friends <- getFriends(fb_oauth, simplify = TRUE) #get friends who are using the app

Getting data from FB

  • Getting credentials: alternatively, use a temporary access token

library(Rfacebook)
fb_oauth <- "your temp token"
getUsers("me",token=fb_oauth) #public profile
head(getLikes(user="me", token=fb_oauth)) #likes
updateStatus("hiho", fb_oauth) #fb post via R
yhk_friends <- getFriends(fb_oauth, simplify = TRUE)

Getting data from FB

pages<-getPage("DonaldTrump",fb_oauth,n=5)
target_post<-getPost(pages$id[1], fb_oauth, n=100, likes = TRUE, comments = FALSE)
target_user<-getUsers(target_post$likes$from_id, fb_oauth)

Getting data from Google Maps



# adapted the script from Jose Gonzalez
library(RCurl)
library(RJSONIO)

#write query
query_url<- function(address, return.call = "json", sensor = "false") {
  root <- "http://maps.google.com/maps/api/geocode/"
  u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
  return(URLencode(u))
}

Getting data from Google Maps

#get and parse json result
geoCode <- function(address,verbose=FALSE) {
  if(verbose) cat(address,"\n")
  u <- query_url(address)
  doc <- getURL(u)
  x <- fromJSON(doc,simplify = FALSE)
  if(x$status=="OK") {
    lat <- x$results[[1]]$geometry$location$lat
    lng <- x$results[[1]]$geometry$location$lng
    location_type  <- x$results[[1]]$geometry$location_type
    formatted_address  <- x$results[[1]]$formatted_address
    Sys.sleep(0.5) # pause before returning so repeated calls do not flood the API
    return(c(lat, lng, location_type, formatted_address))
  } else {
    return(c(NA,NA,NA, NA))
  }
}
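
Because geoCode() returns its values with c(), the latitude and longitude are coerced to character strings, which is why as.numeric() is needed later. A possible variant, sketched below, returns a one-row data frame so the numbers stay numeric. `fake_response` imitates only the fields that geoCode() reads; its values are made up and it is not real API output. `parse_geocode` is a hypothetical helper.

```r
# Sketch: turn a parsed geocoding response into a one-row data frame.
# `fake_response` mimics only the fields geoCode() reads; values are made up.
fake_response <- list(
  status = "OK",
  results = list(list(
    geometry = list(
      location = list(lat = 42.28, lng = -83.73),
      location_type = "APPROXIMATE"
    ),
    formatted_address = "Ann Arbor, MI, USA"
  ))
)

parse_geocode <- function(x) {
  if (!identical(x$status, "OK")) {
    # Same idea as the NA vector above, but with typed columns.
    return(data.frame(lat = NA_real_, lng = NA_real_,
                      location_type = NA_character_,
                      formatted_address = NA_character_,
                      stringsAsFactors = FALSE))
  }
  first <- x$results[[1]]
  data.frame(lat = first$geometry$location$lat,
             lng = first$geometry$location$lng,
             location_type = first$geometry$location_type,
             formatted_address = first$formatted_address,
             stringsAsFactors = FALSE)
}

parsed <- parse_geocode(fake_response)
str(parsed)
```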

Getting data from Google Maps



# adapted the script from Jose Gonzalez
library(ggmap)
target_loc <- geoCode("ann arbor sph")
sphmap <- get_map(location = c(lon = as.numeric(target_loc[2]), 
                  lat = as.numeric(target_loc[1])), zoom = 10,
                  maptype = "roadmap", scale = 2)
ggmap(sphmap)

Other ways to use APIs

Ethics

  • Be nice to people…
  • Don't make evil bots
  • Ask for data nicely
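
One practical way to ask nicely is to pause between requests so a server is not flooded. The helper below is a minimal sketch; `polite_lapply` and its `delay` argument are made up for illustration, and `toupper` stands in for a real request function.

```r
# Sketch: pause between requests so a server is not flooded.
# `fetch_one` is a hypothetical stand-in for whatever function makes the request.
polite_lapply <- function(items, fetch_one, delay = 1) {
  results <- vector("list", length(items))
  for (i in seq_along(items)) {
    results[[i]] <- fetch_one(items[[i]])
    if (i < length(items)) Sys.sleep(delay)  # wait before the next request
  }
  results
}

# Demonstration with a fake "request" and a short delay.
started <- Sys.time()
out <- polite_lapply(c("a", "b", "c"), toupper, delay = 0.1)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
print(unlist(out))
```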

Data scraping without APIs

  • View page source and inspect
  • Simple HTML: RCurl + XML and/or RJSONIO
  • Somewhat complicated HTML, or pages that need additional requests (e.g. GET, POST): rvest (scrapeR)
  • Fancy websites: rvest + RSelenium
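
For the simple-HTML case, the basic workflow with rvest might look like the sketch below. The page is an inline, made-up HTML string so the example needs no network access; on a real site, the CSS selector would come from viewing the page source.

```r
# Sketch: extract link text and targets from simple HTML with rvest.
# The HTML is an inline, made-up page, not a real website.
library(rvest)

page <- read_html('
  <html><body>
    <h1>Course schedule</h1>
    <ul>
      <li><a href="/week1">Week 1</a></li>
      <li><a href="/week2">Week 2</a></li>
    </ul>
  </body></html>')

links  <- html_nodes(page, "li a")   # CSS selector: <a> tags inside <li>
titles <- html_text(links)           # visible text of each link
hrefs  <- html_attr(links, "href")   # value of each href attribute
print(data.frame(title = titles, href = hrefs))
```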


    More on Wednesday