Libraries

We’re going to use two main libraries for this: quanteda, a package for text data analysis, and e1071, a package with a number of statistical/machine-learning functions. (Other useful packages for classification include mlr and caret.)

library(quanteda)
library(e1071)
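
If you don’t already have these, both are on CRAN:

install.packages(c("quanteda", "e1071"))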

Next, let’s load the data set we’re going to start with. We’ll begin with a set that already has labels (so we can check how we do): the alien-related tweets we pulled a couple of classes ago, with the keyword we searched for serving as the label. In this case we’ll use just the ‘alien’ and ‘naruto’ keywords (‘naruto’ being from the Area 51 naruto run thing a while back). Let’s see if our classifier can distinguish between tweets generated from searching for ‘alien’ vs. those from searching for ‘naruto’.

Load data

load('AliensTwitterData.Rdata')

We should have a set of training data (traintweets) and a set of test data (testtweets).
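
A quick sanity check on what we loaded (assuming, as the code below does, that both are data frames with status_id, text, and keyword columns):

# How big is each set, and how balanced are the labels?
dim(traintweets)
dim(testtweets)
table(traintweets$keyword)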

Explore data

Let’s explore our training data a little using quanteda. We’ll start by making a ‘document-feature matrix’, i.e. a matrix with rows as documents, columns as words (features). We’ll also prune off some extra junk in our data that we probably won’t need.

# Turn our data into a text corpus, where each tweet is a document
traindata = corpus(traintweets, docid_field="status_id", text_field="text")

# Make DFM, remove stopwords (highly common words like 'the'), punctuation, and twitter characters (@ and #)
traindfm = dfm(traindata, remove=stopwords("english"), remove_punct=TRUE, remove_twitter = TRUE)

# Summarize the resulting text data
# summary(traindfm,3)
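
A version note: newer releases of quanteda (v3+) dropped these arguments to dfm(), and remove_twitter is gone entirely. If the code above errors for you, a roughly equivalent sketch goes through tokens() first:

# quanteda v3+ sketch: tokenize first, then build the DFM
# (there is no direct replacement for remove_twitter = TRUE)
traintoks = tokens(traindata, remove_punct = TRUE)
traintoks = tokens_remove(traintoks, stopwords("english"))
traindfm = dfm(traintoks)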

Now let’s see what we have:

# What are the top 10 text features (words)?
topfeatures(traindfm, 10)
## naruto   t.co  https  alien      o     de    que     の      e ナルト 
##     88     65     64     31     30     28     22     20     17     15
# Notice some of these are URL fragments (https, t.co) and some are in Japanese (probably because of using 'naruto' as a keyword), but we'll go with it anyhow.
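
If you’d rather drop the URL fragments, dfm_remove() takes exact feature names; it’s left commented out here so the results below match:

# Optional: strip leftover URL pieces from links in the tweets
# traindfm = dfm_remove(traindfm, c("https", "t.co"))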

# Make a word cloud
textplot_wordcloud(traindfm, min_count = 6, random_order = FALSE,
                   rotation = .25, max_words=50,
                   color = RColorBrewer::brewer.pal(8,"Dark2"))
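
If the word cloud is hard to read, a plain frequency table carries the same information. One sketch, noting that textstat_frequency() lives in the separate quanteda.textstats package as of quanteda v3:

# Top 10 features with raw counts and document frequencies
textstat_frequency(traindfm, n = 10)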

Let’s look at some of our keywords in context (kwic):

# KWIC for 'alien'/'aliens'/etc. with a word window of 3:
kwic(traindata, "alien*", window=3) 
##                                                                  
##  [1191176000568266752, 15]                cases against illegal |
##  [1191176010961707009, 11]                 arresting 23 illegal |
##  [1191176047104012289, 10]                          grande q el |
##   [1191176048718864384, 4]                        Deben ser los |
##   [1191176050493022208, 8]         estaba investigando crímenes |
##   [1191176057883222016, 8]                         祭 ライブ で |
##  [1191176070474518528, 11]                       pirate to evil |
##   [1191176071804260353, 6]                     give the illegal |
##  [1191176073377189889, 13]                         of Ripley in |
##  [1191176079458930689, 13]                         de los rotos |
##  [1191176083657220097, 10]                         > COSMOS vs |
##  [1191176097301286912, 57]                        is an illegal |
##  [1191176104381468672, 11]                 arresting 23 illegal |
##  [1191176113495715840, 11]                  . Desafio qualidade |
##  [1191176118205849601, 31]                          Time to get |
##  [1191176118432235525, 15]                     for any explicit |
##  [1191176123905785859, 17]                              , like, |
##  [1191176123905785859, 24]                          Is Jason an |
##  [1191176123905785859, 36]                          Jason is an |
##   [1191176139563261953, 2]                                    [ |
##  [1191176152439767040, 31]                          Time to get |
##  [1191176188292685824, 23]                         land and the |
##  [1191176226213371904, 26]                            que son 9 |
##  [1191176235923202048, 23]                            to be the |
##   [1191180029067243520, 4]                       Awwww Lena has |
##  [1191180034356269056, 21]                           um tipo de |
##  [1191180059211640834, 14]                 rodeada de ciclistas |
##   [1191180065196916736, 8]                         all of those |
##  [1191180066228703233, 24]                               , los" |
##   [1191180072641798145, 7]                     \U0001f44d, this |
##  [1191180096922517504, 32]                         the concept. |
##  [1191180120259584001, 23]                   how they're cloned |
##   [1191180129126375426, 4]                  @ten79ryuu The real |
##   [1191180131303411712, 6]                          Jennie é um |
##  [1191180149091229696, 11]                       pirate to evil |
##  [1191180176509558784, 15]                cases against illegal |
##  [1191180185296670720, 14]                 rodeada de ciclistas |
##   [1191180208688320512, 4] @MichaelShawver1@AGWillliamBarr FAKE |
##   [1191180219018817536, 5]                            has a pet |
##   [1191180224983175169, 3]                               My gay |
##   [1191180248127283200, 4]                    Lena bought those |
##  [1191180257686163457, 27]                       perdemos. Tipo |
##  [1191180262908018689, 10]                            Ed has an |
##  [1191180262908018689, 23]                    ability to handle |
##  [1191180267802812417, 14]                 rodeada de ciclistas |
##   [1191180269572755457, 3]                               My new |
##                                            
##    aliens    | so that they                
##     alien    | gang members 13             
##     alien    | y es la                     
##  alienígenas | , el Foro                   
##  alienígenas | en un campo                 
##     Alien    | と fiction と               
##     alien    | doppelgängers               
##     alien    | community FREE healthcare   
##     Alien    | Resurrection to be          
##  alienígenas | llegó hasta new             
##     ALIEN    | シューティング https:       
##     alien    | https:/                     
##     alien    | gang members 13             
##  alienígena  | bunda suja a                
##     Alien    | Lizard a real               
##     alien    | naughtiness. It's           
##  alien-talk  | right there.                
##     alien    | ? I think                   
##     alien    | .                           
##     Alien    | family passing Earth        
##     Alien    | Lizard a real               
##     alien    | world of the                
##  alienígenas | con superpoderes perseguidos
##     alien    | version of Bobby            
##     alien    | pets petting zoo            
##  alienígena  |                             
##  alienígenas | extranjeros pagados por     
##     alien    | rodents?#Supergirl          
##  alienígenas | " son imparables            
##     alien    | scared me\U0001f633         
##     Alien    | impostors??                 
##     alien    | experiments from exoplanet  
##     alien    | \U0001f47d                  
##  alienígena  | .\U0001f47d                 
##     alien    | doppelgängers               
##    aliens    | so that they                
##  alienígenas | extranjeros pagados por     
##     ALIEN    | AGENDA, SPACE               
##     alien    | collection.#Supergirl       
##     alien    | and I listening             
##     alien    | monsters at PetSmart        
##     alien    | x predador                  
##     alien    | recording the Government    
##    aliens    | they want to                
##  alienígenas | extranjeros pagados por     
##     alien    | map is coming
# Note that we run kwic() on the corpus, not the dfm, since the corpus still contains the stopwords, punctuation, etc.

So not everything is alien-related in the way we intended. Do a little more exploring with some of the other common features of the data and see what you find!
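
For instance, the same call works for the other keyword (output omitted):

# KWIC for 'naruto'/'narutos'/etc. with a word window of 3:
kwic(traindata, "naruto*", window=3)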

Cluster Analysis

We’re focusing on classifiers, but just to see what we get, let’s run a quick k-means on the training data. We’ll make a word cloud for each cluster to see what got clustered where.

# k-means with 2 clusters (results can vary run to run; use set.seed() for reproducibility)
clusters = kmeans(traindfm, centers = 2, nstart = 3)

# Word cloud for each cluster
textplot_wordcloud(traindfm[clusters$cluster==1,], max_words=30)

textplot_wordcloud(traindfm[clusters$cluster==2,], max_words=30)

Interesting! A different breakdown of the data than we might have expected.
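
One way to check how the clusters line up with the labels is a quick cross-tab (this assumes the rows of traindfm are still in the same order as traintweets):

# How do the k-means clusters map onto the search keywords?
table("Cluster" = clusters$cluster, "Keyword" = traintweets$keyword)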

Naive Bayes Classifier

Okay, now to our actual goal—naive Bayes classification. Let’s start by training the classifier:

# We may want to trim down our training data (try both ways):
traindfm = dfm_trim(traindfm, min_termfreq = 10, min_docfreq =10, verbose = TRUE)
## Removing features occurring:
##   - fewer than 10 times: 1,412
##   - in fewer than 10 documents: 1,419
##   Total features removed: 1,419 (99.2%).
# Convert our DFM to a dataframe (you actually don't have to do this, we can chat about this in class)
traindfm.df = convert(traindfm,to="data.frame")

# Train the classifier (drop the first column since it's the doc ID's)
classifier = naiveBayes(traindfm.df[,-1], traintweets$keyword, laplace=1)
# The laplace option is a smoother that assigns non-zero probabilities to feature values not seen in the training sample (important when new data contains previously unobserved words/features). Note that naiveBayes fits Gaussian conditionals for numeric predictors like our word counts, so Laplace smoothing only affects categorical features.

# Let's see what the classifier looks like:
classifier
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = traindfm.df[, -1], y = traintweets$keyword, 
##     laplace = 1)
## 
## A-priori probabilities:
## traintweets$keyword
##     alien    naruto 
## 0.3358209 0.6641791 
## 
## Conditional probabilities:
##                    https
## traintweets$keyword      [,1]      [,2]
##              alien  0.3333333 0.4767313
##              naruto 0.5505618 0.5643024
## 
##                    t.co
## traintweets$keyword      [,1]      [,2]
##              alien  0.3333333 0.4767313
##              naruto 0.5617978 0.5631699
## 
##                    alien
## traintweets$keyword      [,1]     [,2]
##              alien  0.6888889 0.514438
##              naruto 0.0000000 0.000000
## 
##                    de
## traintweets$keyword      [,1]      [,2]
##              alien  0.3333333 0.6741999
##              naruto 0.1460674 0.4408298
## 
##                    o
## traintweets$keyword      [,1]      [,2]
##              alien  0.1333333 0.4572646
##              naruto 0.2696629 0.7502128
## 
##                    que
## traintweets$keyword       [,1]      [,2]
##              alien  0.08888889 0.2877990
##              naruto 0.20224719 0.5873627
## 
##                    の
## traintweets$keyword       [,1]      [,2]
##              alien  0.04444444 0.2084091
##              naruto 0.20224719 0.6248595
## 
##                    e
## traintweets$keyword       [,1]      [,2]
##              alien  0.02222222 0.1490712
##              naruto 0.17977528 0.4899396
## 
##                    naruto
## traintweets$keyword     [,1]      [,2]
##              alien  0.000000 0.0000000
##              naruto 0.988764 0.4883735
## 
##                    ナルト
## traintweets$keyword      [,1]      [,2]
##              alien  0.0000000 0.0000000
##              naruto 0.1685393 0.4581573
## 
##                    ナルステ
## traintweets$keyword      [,1]      [,2]
##              alien  0.0000000 0.0000000
##              naruto 0.1123596 0.3175976
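
You can also pull pieces out of the fitted object directly: classifier$apriori holds the class distribution, and classifier$tables holds, for each feature, the per-class mean and standard deviation of the Gaussian fit to that feature's counts (that's what the [,1] and [,2] columns above are).

# Class distribution and one feature's conditional distribution
classifier$apriori
classifier$tables$naruto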

Now, let’s test out the classifier on our test data:

# Make a test data corpus and DFM
testdata = corpus(testtweets, docid_field="status_id", text_field="text")
testdfm = dfm(testdata, remove=stopwords("english"), remove_punct=TRUE, remove_twitter = TRUE)
testdfm = dfm_trim(testdfm, min_termfreq = 10, min_docfreq = 10, verbose = TRUE)
## Removing features occurring:
##   - fewer than 10 times: 3,547
##   - in fewer than 10 documents: 3,565
##   Total features removed: 3,565 (97.9%).
# Convert to data frame
testdfm.df = convert(testdfm,to="data.frame")
pred = predict(classifier, newdata=testdfm.df[,-1])
table("Predictions"= pred,  "Actual" = testtweets$keyword)
##            Actual
## Predictions alien naruto
##      alien    131    128
##      naruto     3    152
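
To put a single number on it, and to fix one subtle issue: we trimmed the training and test DFMs independently, so their feature sets don’t fully line up (predict effectively skips any training features it can’t find in the new data). A sketch of both, using quanteda’s dfm_match():

# Overall accuracy on the test set
mean(pred == testtweets$keyword)

# Align the test features to the training vocabulary, then re-predict
# (training features absent from the test set become zero-count columns)
testdfm2 = dfm_match(testdfm, features = featnames(traindfm))
pred2 = predict(classifier, newdata = convert(testdfm2, to = "data.frame")[,-1])
table("Predictions" = pred2, "Actual" = testtweets$keyword)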

It does okay: it catches nearly all of the alien-keyword tweets, but does worse with naruto (almost half of the naruto tweets get labeled alien). Try fiddling with the settings and see what you get, but this illustrates the basic idea. Next, let’s adapt it to look at the Twitter data on vaccination!