These past few days, I have been reading a lot about non-parametric tests on natural language, since one of my current projects involves natural language processing via machine learning. This is fairly advanced material, and the Data Science course offered on Coursera has not started yet, so I am relying entirely on what I have read in forums and blog articles. Acquiring tweets from Twitter requires a few libraries because of the OAuth scheme Twitter has implemented. If you have not installed these packages yet, you will need them before you can reproduce my code.
install.packages("twitteR")
install.packages("ROAuth")
install.packages("RCurl")
install.packages("tm")
install.packages("wordcloud")
install.packages("RColorBrewer")
Then, load the libraries, as usual.
library(twitteR)
library(ROAuth)
library(RCurl)
library(tm)
library(wordcloud)
library(RColorBrewer)
Set the RCurl option to point to the CA certificate file you will download in the next step:
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
Download the CA certificate bundle that the OAuth handshake needs, using RCurl, and save it to your working directory.

download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")
For easier code reading, assign these URLs to objects in your environment. Note that the consumerKey and consumerSecret values are redacted because they are unique to your own Twitter API account. You need to create a developer account on Twitter to acquire your own keys.
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "MVdO2NE****************"
consumerSecret <- "oRZ9ff2yWvf9*************************c"
twitCred <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = reqURL,
                             accessURL = accessURL,
                             authURL = authURL)
twitCred$handshake(cainfo = "cacert.pem")
After running the code above, the R console will give you a link similar to the one below:
> twitCred$handshake(cainfo = "cacert.pem")
To enable the connection, please direct your web browser to:
https://api.twitter.com/oauth/authorize?oauth_token=pK8JJiDb6j*******************************
When complete, record the PIN given to you and provide it here:
Copy and paste the link into your browser and click Authorize to allow your app access to the Twitter API. R pauses here so that you can enter the PIN Twitter gives you. If it is successful, R, being the introvert that it is, will not give you any message. On the other hand, if the PIN is wrong, it will give you an error such as Unauthorized. You can now check whether you can access the Twitter API.
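If the handshake succeeds, you can register the credential object with twitteR and save it to disk so you do not have to repeat the handshake in your next session. A minimal sketch, assuming twitteR's registerTwitterOAuth() and the file name credentials.RData (the name is just a convention):

```r
# Register the authorized credential object with twitteR
registerTwitterOAuth(twitCred)

# Save the credentials so future sessions can skip the handshake
save(twitCred, file = "credentials.RData")

# In a later session, reload and re-register:
# load("credentials.RData")
# registerTwitterOAuth(twitCred)
```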
This is where you can start scraping tweets. Twitter only allows a maximum of 1,500 tweets per search, and only over a limited number of past days. If you want a constant feed, you may need to build a custom function to do it for you. For this analysis, let us just grab a sample of the most recent tweets as I write this code.
csc <- searchTwitter("@CSC", n = 1500)
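If you do want a rolling feed rather than a one-off sample, one approach is to poll searchTwitter periodically and pass its sinceID argument so each call only returns tweets newer than the last one seen. A rough sketch under those assumptions; the helper name collectTweets is my own invention:

```r
# Hypothetical helper: poll a search term `times` times, `wait` seconds apart,
# keeping only tweets newer than the last ID seen
collectTweets <- function(term, times = 3, wait = 60) {
  all.tweets <- list()
  since.id <- NULL
  for (i in seq_len(times)) {
    new.tweets <- searchTwitter(term, n = 100, sinceID = since.id)
    if (length(new.tweets) > 0) {
      all.tweets <- c(all.tweets, new.tweets)
      since.id <- new.tweets[[1]]$getId()  # newest tweet comes first
    }
    Sys.sleep(wait)
  }
  all.tweets
}

# csc.feed <- collectTweets("@CSC")
```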
Check the first tweet that was collected.
csc.firsttweet <- csc[[1]]
csc.firsttweet$getScreenName()
csc.firsttweet$getText()
Or check the head and tail of your list.
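searchTwitter returns an ordinary R list, so the usual list helpers work; for example:

```r
head(csc, 3)  # first three tweets in the list
tail(csc, 3)  # last three tweets in the list
length(csc)   # how many tweets were actually returned
```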
Once you have checked that you have a good number of tweets, prepare your data and convert it to a corpus.
csc.frame <- do.call('rbind', lapply(csc, as.data.frame))
csc.corpus <- Corpus(VectorSource(csc.frame))
Also, normalize the text by converting it to lowercase and removing stop words, punctuation marks and numbers. Take note that I added a few more words to be removed, because these are values from metadata columns that came along when we downloaded the tweets.
csc.corpus <- tm_map(csc.corpus, tolower)            # Convert to lowercase
csc.corpus <- tm_map(csc.corpus, removePunctuation)  # Remove punctuation
csc.corpus <- tm_map(csc.corpus, removeNumbers)      # Remove numbers
csc.corpus <- tm_map(csc.corpus, removeWords,        # Remove stop words
                     c(stopwords('english'), 'false', 'buttona',
                       'hrefhttptwittercomtweetbutton', 'relnofollowtweet',
                       'true', 'web', 'relnofollowtwitter', 'april',
                       'hrefhttptwittercomdownloadiphone', 'iphonea',
                       'relnofollowtweetdecka', 'via',
                       'hrefhttpsabouttwittercomproductstweetdeck',
                       'hrefhttpwwwhootsuitecom', 'httptcoqqqiaipk', 'androida',
                       'cschealth', 'cscanalytics', 'csccloud',
                       'relnofollowhootsuitea', 'cscmyworkstyle', 'cscaustralia',
                       'hrefhttptwittercomdownloadandroid'))
Prepare the document-term matrix.
csc.dtm <- DocumentTermMatrix(csc.corpus)
csc.dtm.matrix <- as.matrix(csc.dtm)
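Before moving on, it can be worth peeking at which terms survive the cleaning; tm's findFreqTerms() works directly on the matrix. The threshold of 10 below is an arbitrary choice of mine:

```r
# Terms appearing at least 10 times across the corpus
findFreqTerms(csc.dtm, lowfreq = 10)

# Dimensions of the matrix: documents x terms
dim(csc.dtm.matrix)
```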
Or term document matrix, whichever you prefer.
csc.tdm <- TermDocumentMatrix(csc.corpus)
csc.tdm.sum <- sort(rowSums(as.matrix(csc.tdm)), decreasing = T)            # Sum of frequency of words
csc.tdm.sum <- data.frame(keyword = names(csc.tdm.sum), freq = csc.tdm.sum) # Convert keyword frequency to DF
csc.tdm.sum
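As a quick sanity check on the frequency data frame, a simple base-R bar plot of the top terms works too; the cutoff of 10 is arbitrary:

```r
# Bar plot of the ten most frequent keywords
top10 <- head(csc.tdm.sum, 10)
barplot(top10$freq,
        names.arg = top10$keyword,
        las = 2,  # rotate axis labels for readability
        main = "Top 10 keywords in @CSC tweets")
```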
Plot the wordcloud.
cloudcolor <- brewer.pal(8, "Paired")
wordcloud(csc.tdm.sum$keyword, csc.tdm.sum$freq,
          scale = c(8, .2), min.freq = 1, max.words = Inf,
          random.order = T, rot.per = .3, colors = cloudcolor)
Yes! It is CSC's birthday this April! In my next few posts, I will perform some sentiment analysis on this data set, where the most frequent and prominent keyword users have been posting is false.