WordCloud Twitter Text Analysis on CSC using R

These past few days, I have been reading a lot about non-parametric tests on natural language, as some of my current work involves natural language processing via machine learning. This is fairly advanced material, and the Data Science course offered on Coursera has not started yet, so I am relying fully on what I have read in fora and blog articles. Acquiring tweets from Twitter requires a few libraries because of the OAuth authentication that Twitter has implemented. If you have not installed these packages, you need to do so before you can reproduce my code.

install.packages("twitteR")
install.packages("ROAuth")
install.packages("RCurl")
install.packages("tm")
install.packages("wordcloud")
install.packages("RColorBrewer")

Then, load the libraries, as usual.

library(twitteR)
library(ROAuth)
library(RCurl)
library(tm)
library(wordcloud)
library(RColorBrewer)

Set the RCurl option so that SSL connections are verified against the cacert.pem certificate file in the package's CurlSSL folder.

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

Download the certificate file (cacert.pem) and save it to your working directory. It is used later in the OAuth handshake.

download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")

To make the code easier to read, assign these links to objects in your environment. Note that the consumerKey and consumerSecret values are redacted because they are unique to each Twitter API account. You need to create a developer account on Twitter to get your own keys.

reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "MVdO2NE****************"
consumerSecret <- "oRZ9ff2yWvf9*************************c"
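twitCred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret, requestURL = reqURL, accessURL = accessURL, authURL = authURL)
twitCred$handshake(cainfo = "cacert.pem")
save(twitCred, file = "credentials.RData") # Optionally keep the credentials for later sessions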

After you run the code above, the R console will give you a link similar to the one below:

> twitCred$handshake(cainfo = "cacert.pem")
To enable the connection, please direct your web browser to:
https://api.twitter.com/oauth/authorize?oauth_token=pK8JJiDb6j*******************************
When complete, record the PIN given to you and provide it here: 

Copy and paste the link into your browser and click accept to allow Twitter to grant access to the API. R pauses here until you type in the PIN that Twitter gives you. If it succeeds, R, being the introvert as usual, will not give you any message. If the PIN is wrong, you will get an error such as Unauthorized. You can then register the credentials to check that you can now access the Twitter API.

registerTwitterOAuth(twitCred)

This is where you can start scraping the tweets. Twitter only allows you to extract a maximum of 1,500 tweets, covering a limited number of past days. If you want a constant feed, you may need to build a custom function to do it for you. For this analysis, let us just take a sample of the most recent tweets at the time I am writing this code.

csc <- searchTwitter("@CSC", n = 1500)
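A constant feed could be approximated by repeating the search on a schedule and merging each new batch into what you already have. Here is a minimal sketch of the merging step only, assuming each batch has already been converted to a data frame with an id column identifying each tweet. The function name append_tweets and the toy batches are hypothetical and stand in for real search results.

```r
# Hypothetical helper: merge a new batch of tweets into an existing data frame,
# keeping only the first row seen for each tweet id.
append_tweets <- function(existing, batch) {
  combined <- rbind(existing, batch)
  combined[!duplicated(combined$id), , drop = FALSE]
}

# Toy batches standing in for data frames built from two separate searches
batch1 <- data.frame(id = c("101", "102"),
                     text = c("first tweet", "second tweet"),
                     stringsAsFactors = FALSE)
batch2 <- data.frame(id = c("102", "103"),
                     text = c("second tweet", "third tweet"),
                     stringsAsFactors = FALSE)

all.tweets <- append_tweets(batch1, batch2)
print(all.tweets$id)
```

The tweet that appears in both batches is kept only once, so the function can be called again after every new search without inflating the counts.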

Check the first tweet that was collected.

csc.firsttweet <- csc[[1]]
csc.firsttweet$getScreenName()
csc.firsttweet$getText()

Or check the head and tail of your list.

head(csc); tail(csc)

Once you have checked that you have a good number of tweets, prepare your data and convert it to a corpus.

csc.frame <- do.call('rbind', lapply(csc, as.data.frame))
csc.corpus <- Corpus(VectorSource(csc.frame))
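The combined data frame has one row per tweet, but a data frame is also a list of columns. As far as I can tell, this is why values from the non-text columns can end up in a corpus built from the whole data frame. The base R sketch below uses made-up tweets to show the difference (toy.tweets, toy.frame and toy.text are hypothetical names, and the three columns are only a small subset chosen for illustration). Using the text column alone would be an alternative, not the approach taken in this post.

```r
# Toy stand-ins for the one-row data frames produced for each tweet
toy.tweets <- list(
  data.frame(text = "Happy birthday CSC", favorited = FALSE,
             screenName = "user_a", stringsAsFactors = FALSE),
  data.frame(text = "CSC turns a year older", favorited = FALSE,
             screenName = "user_b", stringsAsFactors = FALSE)
)

# Stack the one-row data frames into a single data frame, one row per tweet
toy.frame <- do.call("rbind", toy.tweets)

print(nrow(toy.frame))    # number of tweets (rows)
print(length(toy.frame))  # number of columns, because a data frame is a list of columns

# The text column alone is a character vector with one element per tweet
toy.text <- toy.frame$text
print(toy.text)
```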

Next, normalize the text by converting it to lowercase and removing stop words, punctuation marks and numbers. Take note that I added a few more words to be removed because these are values from the other columns of the data set we got when we downloaded the tweets.

csc.corpus <- tm_map(csc.corpus, tolower) # Convert to lowercase
csc.corpus <- tm_map(csc.corpus, removePunctuation) # Remove punctuation
csc.corpus <- tm_map(csc.corpus, removeNumbers) # Remove numbers
csc.corpus <- tm_map(csc.corpus, removeWords, c(stopwords('english'), 'false', 'buttona', 'hrefhttptwittercomtweetbutton', 
'relnofollowtweet', 'true', 'web', 'relnofollowtwitter', 'april', 'hrefhttptwittercomdownloadiphone', 'iphonea', 
'relnofollowtweetdecka', 'via', 'hrefhttpsabouttwittercomproductstweetdeck', 'hrefhttpwwwhootsuitecom', 'httptcoqqqiaipk', 
'androida', 'cschealth', 'cscanalytics', 'csccloud', 'relnofollowhootsuitea', 'cscmyworkstyle', 'cscaustralia', 'hrefhttptwittercomdownloadandroid')) # Remove stop words
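To see roughly what these cleaning steps do, and where odd words like hrefhttptwittercomtweetbutton come from, here is a base R sketch that imitates them on two made-up strings. It is an approximation written with gsub(), not the tm implementation. The function clean_text, the sample strings and the short stop word list are all hypothetical, and the second string is only shaped like an HTML link to a Twitter client.

```r
# Made-up inputs: a tweet-like sentence and an HTML-link-like source string
toy.text <- c("Happy Birthday, @CSC!!! 3 cheers",
              '<a href="http://twitter.com/tweetbutton" rel="nofollow">Tweet Button</a>')
toy.stopwords <- c("a", "the")  # hypothetical, much shorter than a real stop word list

clean_text <- function(x, stop.words) {
  x <- tolower(x)                       # convert to lowercase
  x <- gsub("[[:punct:]]", "", x)       # remove punctuation
  x <- gsub("[[:digit:]]", "", x)       # remove numbers
  words <- strsplit(x, "[[:space:]]+")  # split each string into words
  vapply(words, function(w) {
    w <- w[nzchar(w) & !(w %in% stop.words)]  # drop empty strings and stop words
    paste(w, collapse = " ")
  }, character(1))
}

cleaned <- clean_text(toy.text, toy.stopwords)
print(cleaned)
```

Once the punctuation is stripped, the pieces of the link collapse into long run-together words, which is why they have to be listed by hand among the words to remove.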

Prepare the document term matrix.

csc.dtm <- DocumentTermMatrix(csc.corpus)
csc.dtm.matrix <- as.matrix(csc.dtm)

Or term document matrix, whichever you prefer.

csc.tdm <- TermDocumentMatrix(csc.corpus)
csc.tdm.sum <- sort(rowSums(as.matrix(csc.tdm)), decreasing = TRUE) # Sum of frequency of words
csc.tdm.sum <- data.frame(keyword = names(csc.tdm.sum), freq = csc.tdm.sum) # Convert keyword frequency to DF
csc.tdm.sum
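The row sums of the term document matrix are simply the number of times each word appears across all documents. A base R sketch of the same idea on made-up, already cleaned documents is below. It does not use tm, and the names toy.docs, toy.words, toy.counts and toy.freq are hypothetical.

```r
# Made-up documents that have already been cleaned
toy.docs <- c("csc cloud birthday", "csc birthday", "csc analytics")

# Split every document into words and pool them into one vector
toy.words <- unlist(strsplit(toy.docs, " ", fixed = TRUE))

# Count each word and sort from most to least frequent
toy.counts <- sort(table(toy.words), decreasing = TRUE)

# Same shape as the keyword/frequency data frame used for the wordcloud
toy.freq <- data.frame(keyword = names(toy.counts),
                       freq = as.integer(toy.counts),
                       stringsAsFactors = FALSE)
print(toy.freq)
```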

Plot the wordcloud.

cloudcolor <- brewer.pal(8, "Paired")
wordcloud(csc.tdm.sum$keyword, csc.tdm.sum$freq, scale = c(8, .2), min.freq = 1, max.words = Inf, random.order = TRUE, rot.per = .3, colors = cloudcolor)

Wordcloud using the twitteR and tm packages in R

Yes! It is CSC's birthday this April! In my next few posts, I will perform some sentiment analysis on this data set, where the keyword "false" was the most frequent and most prominent of the words used by users.
