WordCloud Twitter Text Analysis on CSC using R

These past few days, I have been reading a lot about non-parametric tests on natural language, as one of the projects I am currently working on is about natural language processing via machine learning. This is fairly advanced, and the Data Science course offered on Coursera has not started yet, so I am relying fully on what I have been reading in fora and blog articles. Acquiring tweets from Twitter requires some libraries because of the OAuth that Twitter has implemented. If you have not installed these packages, you need them before you can reproduce my code.

install.packages("twitteR")
install.packages("ROAuth")
install.packages("RCurl")
install.packages("tm")
install.packages("wordcloud")
install.packages("RColorBrewer")

Then, load the libraries, as usual.

library(twitteR)
library(ROAuth)
library(RCurl)
library(tm)
library(wordcloud)
library(RColorBrewer)

Set the RCurl package option to use a CA certificate file via the CurlSSL option:

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

Next, download the CA certificate bundle that the Twitter OAuth handshake will use for SSL verification, and save it to your working directory:

download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")

For easier code reading, assign these links in your environment. Note that the consumerKey and consumerSecret values are redacted because they are unique to your own Twitter API credentials. You need to create a developer account on Twitter to acquire your own.

reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "MVdO2NE****************"
consumerSecret <- "oRZ9ff2yWvf9*************************c"
twitCred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret,
    requestURL = reqURL, accessURL = accessURL, authURL = authURL)
twitCred$handshake(cainfo = "cacert.pem")
save(twitCred, file = "twitCred.RData") # Save the credential so later sessions can reuse it

After running the code above, your R console will give you a link, somewhat like the one below:

> twitCred$handshake(cainfo = "cacert.pem")
To enable the connection, please direct your web browser to:
https://api.twitter.com/oauth/authorize?oauth_token=pK8JJiDb6j*******************************
When complete, record the PIN given to you and provide it here: 

Copy and paste the link into your browser and click Accept to allow Twitter to provide API access. Pause here to enter the PIN that Twitter gives you. If it is successful, R, being the introvert that it is, will not give you any message, as usual. On the other hand, if your PIN is wrong, it will give you an error like Unauthorized. You can now check whether you can access the Twitter API:

registerTwitterOAuth(twitCred)
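
In a later session, assuming you saved the credential as above, you can skip the handshake by reloading it:

load("twitCred.RData") # Restore the twitCred object saved earlier
registerTwitterOAuth(twitCred)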

This is where you can start scraping the tweets. Twitter only allows you to extract a maximum of 1,500 tweets, and only from a limited number of past days. If you want a constant feed, you may need to build a custom function to do it for you; a sketch follows the next block. For this analysis, let us just get a sample of the most recent tweets as I write this code.

csc <- searchTwitter("@CSC", n = 1500)
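
Here is a minimal sketch of such a custom polling function. The fetchFeed name is mine, and it assumes that searchTwitter returns the newest tweets first, so that sinceID only pulls tweets we have not seen yet:

# Hypothetical helper: poll for new tweets every 15 minutes
fetchFeed <- function(query, interval = 900, rounds = 4) {
  tweets <- list()
  lastID <- NULL
  for (i in seq_len(rounds)) {
    batch <- searchTwitter(query, n = 1500, sinceID = lastID)
    if (length(batch) > 0) {
      lastID <- batch[[1]]$getId() # Newest tweet comes first
      tweets <- c(tweets, batch)
    }
    if (i < rounds) Sys.sleep(interval) # Wait before the next poll
  }
  tweets
}
# csc.feed <- fetchFeed("@CSC")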

Check the first tweet that was collected.

csc.firsttweet <- csc[[1]]
csc.firsttweet$getScreenName()
csc.firsttweet$getText()

Or check the head and tail of your list.

head(csc); tail(csc)

Once you have checked that you have a good number of tweets, prepare your data and convert it to a corpus.

csc.frame <- do.call('rbind', lapply(csc, as.data.frame)) # Flatten the list of status objects into a data frame
csc.corpus <- Corpus(VectorSource(csc.frame)) # Build a corpus; note this pulls in the metadata columns too, hence the extra stop words below
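
If I remember correctly, twitteR also ships a helper that does the same rbind for you:

csc.frame <- twListToDF(csc) # Equivalent one-liner from the twitteR package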

Also, homogenize the text by converting it to lowercase and removing stop words, punctuation marks, and numbers. Take note that I added a few more words to be removed because these are values from the metadata columns that came along when we downloaded the tweets.

csc.corpus <- tm_map(csc.corpus, content_transformer(tolower)) # Convert to lowercase (newer versions of tm require content_transformer here)
csc.corpus <- tm_map(csc.corpus, removePunctuation) # Remove punctuation
csc.corpus <- tm_map(csc.corpus, removeNumbers) # Remove numbers
csc.corpus <- tm_map(csc.corpus, removeWords, c(stopwords('english'), 'false', 'buttona', 'hrefhttptwittercomtweetbutton', 
'relnofollowtweet', 'true', 'web', 'relnofollowtwitter', 'april', 'hrefhttptwittercomdownloadiphone', 'iphonea', 
'relnofollowtweetdecka', 'via', 'hrefhttpsabouttwittercomproductstweetdeck', 'hrefhttpwwwhootsuitecom', 'httptcoqqqiaipk', 
'androida', 'cschealth', 'cscanalytics', 'csccloud', 'relnofollowhootsuitea', 'cscmyworkstyle', 'cscaustralia', 'hrefhttptwittercomdownloadandroid')) # Remove stop words
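
Optionally, you can also collapse the stray whitespace that the removals leave behind:

csc.corpus <- tm_map(csc.corpus, stripWhitespace) # Collapse repeated whitespace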

Prepare the document-term matrix.

csc.dtm <- DocumentTermMatrix(csc.corpus)
csc.dtm.matrix <- as.matrix(csc.dtm)

Or a term-document matrix, whichever you prefer.

csc.tdm <- TermDocumentMatrix(csc.corpus)
csc.tdm.sum <- sort(rowSums(as.matrix(csc.tdm)), decreasing = T) # Sum of frequency of words
csc.tdm.sum <- data.frame(keyword = names(csc.tdm.sum), freq = csc.tdm.sum) # Convert keyword frequency to DF
csc.tdm.sum

Plot the wordcloud.

cloudcolor <- brewer.pal(8, "Paired")
wordcloud(csc.tdm.sum$keyword, csc.tdm.sum$freq, scale=c(8,.2), min.freq=1, max.words=Inf, random.order=T, rot.per=.3, colors=cloudcolor)
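
If you want to keep a copy of the plot, you can wrap the call in a graphics device; the file name here is just an example:

png("csc-wordcloud.png", width = 800, height = 800) # Open a PNG device
wordcloud(csc.tdm.sum$keyword, csc.tdm.sum$freq, scale=c(8,.2), min.freq=1, max.words=Inf, random.order=T, rot.per=.3, colors=cloudcolor)
dev.off() # Write the file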
[Figure: word cloud of @CSC tweets, generated with the twitteR and tm packages in R]

Yes! It is CSC's birthday this April! In my next few posts, I will perform some sentiment analysis on this data set, where the 'false' keyword was the most frequent and most prominent word used.


Randy Blue – Blow (NSFW)

Ever since the boys of Randy Blue recorded a cover of Kylie Minogue, I will not deny that I started collecting their videos. If you want copies, I am very generous.

Then again, after the videos in WeHo, the new breed of Randy Blue made another cover, Beyoncé's Blow, wearing Calvin Klein underwear in multitude. Beware, it is NSFW.

Enjoy!


Homecoming Commercial

This video was created by the Coalition for Equal Marriage, who strongly believe that Equal But Separate Is Not Equal.

Same-sex marriage is still one of the hottest debates, especially in advanced countries where civil rights are weighed against moral values, whether by religion or by social norms. While watching this video, you realize that everyone is created equal and seen equally by justice. So why not provide a legal basis for same-sex marriage, not to celebrate religion, but to be equally right?


T-test: A Parametric Test of Paired Data under the Null Hypothesis

I was working on a sample data set last Friday, testing whether it was really worth spending time on, because someone requested an analysis that I have already revised many times. One of the frustrations I keep encountering is translating these statistical tests into business language, but that is another topic to rant about.

Anyway, as I mentioned, two separate sets of data were collected. You could think of these as pre- and post-tests, in a sense, but the background is that the data was measured again after two weeks. I will start by encoding these into R.

# Load ggplot2 package. Install this if necessary:
# install.packages("ggplot2")
library(ggplot2)

# Creating Dataframe of Paired Data
test.data <- data.frame(Test = as.character(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), Score = c(0.54, 0.573, 0.575, 0.589, 0.639, 
    0.624, 0.64, 0.565, 0.694, 0.605, 0.632, 0.535, 0.556, 0.533, 0.516, 0.575, 
    0.57, 0.608, 0.58, 0.502))
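
As a quick sanity check, here are the group means; these are the 0.6044 and 0.5607 I refer to at the end of this post:

tapply(test.data$Score, test.data$Test, mean) # Mean score per test group
##      1      2 
## 0.6044 0.5607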

As usual, I am a fan of the subset function. I could use the open square brackets, but I am very comfortable with subset; it gets the job done.

# Subset
test1 <- subset(test.data, Test == 1)
test2 <- subset(test.data, Test == 2)
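
For comparison, here is the open-square-bracket equivalent I mentioned:

# The same subsets using bracket indexing
test1 <- test.data[test.data$Test == 1, ]
test2 <- test.data[test.data$Test == 2, ]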

Now that we have subset the data, let us look at how far apart the two groups are. Most people are intimidated by boxplots, and I will not discuss in detail how to read and interpret them, but you can actually see the difference between the mean, the small dot inside each box, and the median, the straight line across each box.

My question is: is the difference between these two data sets statistically significant?

ggplot(data = test.data, aes(x = Test, y = Score)) + stat_boxplot(geom = "errorbar") + 
    geom_boxplot(aes(fill = Test)) + stat_summary(fun.y = mean, geom = "point", 
    aes(group = 1)) + ylab("Scores") + xlab("Test") + theme(legend.position = "none")

[Figure: boxplots of the Test 1 and Test 2 scores, with the mean drawn as a point inside each box]
Of course, let us rely on the simple Student's t-test for paired data.

t.test(x = test1$Score, y = test2$Score, alternative = "two.sided", paired = T, 
    conf.level = 0.95)
## 
##  Paired t-test
## 
## data:  test1$Score and test2$Score
## t = 2.018, df = 9, p-value = 0.07432
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.00528  0.09268
## sample estimates:
## mean of the differences 
##                  0.0437

If you need me to compute these manually, I would love to: starting from the standard deviation of the paired differences, to the standard error, to the degrees of freedom, until we arrive at the p-value for the t statistic. If I plot this on the t distribution, a t value of 2.018 with 9 degrees of freedom gives a two-sided tail probability of about 0.07.
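
Here is that manual computation as a sketch; it should reproduce the t.test() output above:

# Manual paired t-test, step by step
d <- test1$Score - test2$Score      # Paired differences
n <- length(d)                      # 10 pairs
sd.d <- sd(d)                       # Standard deviation of the differences
se.d <- sd.d / sqrt(n)              # Standard error
t.stat <- mean(d) / se.d            # t = 2.018
df <- n - 1                         # 9 degrees of freedom
p.value <- 2 * pt(-abs(t.stat), df) # Two-sided p-value = 0.0743
c(t = t.stat, df = df, p = p.value)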

Even at a 95% confidence level, I would say that they are not different from each other, based on the p-value being greater than 0.05. Why? Let us construct the hypothesis statements first.

H0: Test 1 = Test 2, i.e., the means of Test 1 and Test 2 are equal (two-sided)
HA: Test 1 ≠ Test 2, i.e., the means of Test 1 and Test 2 are not equal (two-sided)

Given the p-value of 0.07 against the 0.05 significance cut-off, we fail to reject the null hypothesis. Therefore, we conclude that we have no evidence that the two tests differ. With all of this language I speak, what does it really mean?

If you look at the means of the two data sets, they are different: 0.6044 and 0.5607, respectively. Still, I can say that the request I am working on is not worth examining at a deeper level. This is where decision errors come into play: what is the implication if I continue looking for answers, versus deciding not to because it is not worth looking at? Decision errors are another topic, maybe for the next few posts.