SlideShare a Scribd company logo
1 of 35
Download to read offline
Text Mining with R { an Analysis of Twitter Data 
Yanchang Zhao 
http://www.RDataMining.com 
30 September 2014 
1 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
2 / 35
Text Mining 
I unstructured text data 
I text categorization 
I text clustering 
I entity extraction 
I sentiment analysis 
I document summarization 
I . . . 
3 / 35
Text mining of Twitter data with R 1 
1. extract data from Twitter 
2. clean extracted data and build a document-term matrix 
3.
nd frequent words and associations 
4. create a word cloud to visualize important words 
5. text clustering 
6. topic modelling 
1Chapter 10: Text Mining, R and Data Mining: Examples and Case Studies. 
http://www.rdatamining.com/docs/RDataMining.pdf 
4 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
5 / 35
Retrieve Tweets 
Retrieve recent tweets by @RDataMining 
## Option 1: retrieve tweets from Twitter 
library(twitteR) 
tweets <- userTimeline("RDataMining", n = 3200) 
## Option 2: download @RDataMining tweets from RDataMining.com 
url <- "http://www.rdatamining.com/data/rdmTweets.RData" 
download.file(url, destfile = "./data/rdmTweets.RData") 
## load tweets into R 
load(file = "./data/rdmTweets-201306.RData") 
6 / 35
(n.tweet <- length(tweets)) 
## [1] 320 
tweets[1:5] 
## [[1]] 
## [1] "RDataMining: Examples on calling Java code from R nht... 
## 
## [[2]] 
## [1] "RDataMining: Simulating Map-Reduce in R for Big Data A... 
## 
## [[3]] 
## [1] "RDataMining: Job opportunity: Senior Analyst - Big Dat... 
## 
## [[4]] 
## [1] "RDataMining: CLAVIN: an open source software package f... 
## 
## [[5]] 
## [1] "RDataMining: An online book on Natural Language Proces... 
7 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
8 / 35
# convert tweets to a data frame 
# tweets.df <- do.call("rbind", lapply(tweets, as.data.frame)) 
tweets.df <- twListToDF(tweets) 
dim(tweets.df) 
## [1] 320 14 
library(tm) 
# build a corpus, and specify the source to be character vectors 
myCorpus <- Corpus(VectorSource(tweets.df$text)) 
# convert to lower case 
myCorpus <- tm_map(myCorpus, tolower) 
Package tm v0.5-10 was used in this example. With tm v0.6, 
content transformer" needs to be used to wrap around normal 
functions. 
# tm v0.6 
myCorpus <- tm_map(myCorpus, content_transformer(tolower)) 
9 / 35
# remove punctuation 
myCorpus <- tm_map(myCorpus, removePunctuation) 
# remove numbers 
myCorpus <- tm_map(myCorpus, removeNumbers) 
# remove URLs 
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) 
myCorpus <- tm_map(myCorpus, removeURL) 
# add two extra stop words: 
available
 and 
via
 
myStopwords <- c(stopwords("english"), "available", "via") 
# remove 
r
 and 
big
 from stopwords 
myStopwords <- setdiff(myStopwords, c("r", "big")) 
# remove stopwords from corpus 
myCorpus <- tm_map(myCorpus, removeWords, myStopwords) 
10 / 35
# keep a copy of corpus to use later as a dictionary for stem completion 
myCorpusCopy <- myCorpus 
# stem words 
myCorpus <- tm_map(myCorpus, stemDocument) 
# inspect the first 5 documents (tweets) inspect(myCorpus[1:5]) 
# The code below is used for to make text fit for paper width 
for (i in 1:5) f 
cat(paste("[[", i, "]] ", sep = "")) 
writeLines(myCorpus[[i]]) 
g 
## [[1]] exampl call java code r 
## 
## [[2]] simul mapreduc r big data analysi use flight data ... 
## [[3]] job opportun senior analyst big data wesfarm indust... 
## [[4]] clavin open sourc softwar packag document geotag g... 
## [[5]] onlin book natur languag process python 
11 / 35
# stem completion 
myCorpus <- tm_map(myCorpus, stemCompletion, 
dictionary = myCorpusCopy) 
## [[1]] examples call java code r 
## [[2]] simulating mapreduce r big data analysis used flights... 
## [[3]] job opportunity senior analyst big data wesfarmers in... 
## [[4]] clavin open source software package document geotaggi... 
## [[5]] online book natural language processing python 
12 / 35
# count frequency of "mining" 
miningCases <- tm_map(myCorpusCopy, grep, pattern = "nn<mining") 
sum(unlist(miningCases)) 
## [1] 82 
# count frequency of "miners" 
minerCases <- tm_map(myCorpusCopy, grep, pattern = "nn<miners") 
sum(unlist(minerCases)) 
## [1] 4 
# replace "miners" with "mining" 
myCorpus <- tm_map(myCorpus, gsub, pattern = "miners", 
replacement = "mining") 
13 / 35
tdm <- TermDocumentMatrix(myCorpus, 
control = list(wordLengths = c(1, Inf))) 
tdm 
## A term-document matrix (790 terms, 320 documents) 
## 
## Non-/sparse entries: 2449/250351 
## Sparsity : 99% 
## Maximal term length: 27 
## Weighting : term frequency (tf) 
14 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
15 / 35
idx <- which(dimnames(tdm)$Terms == "r") 
inspect(tdm[idx + (0:5), 101:110]) 
## A term-document matrix (6 terms, 10 documents) 
## 
## Non-/sparse entries: 4/56 
## Sparsity : 93% 
## Maximal term length: 12 
## Weighting : term frequency (tf) 
## 
## Docs 
## Terms 101 102 103 104 105 106 107 108 109 110 
## r 0 1 1 0 0 0 0 0 1 1 
## ramachandran 0 0 0 0 0 0 0 0 0 0 
## random 0 0 0 0 0 0 0 0 0 0 
## ranked 0 0 0 0 0 0 0 0 0 0 
## rann 0 0 0 0 0 0 0 0 0 0 
## rapidminer 0 0 0 0 0 0 0 0 0 0 
16 / 35
# inspect frequent words 
(freq.terms <- findFreqTerms(tdm, lowfreq = 15)) 
## [1] "analysis" "applications" "big" "book" 
## [5] "code" "computing" "data" "examples" 
## [9] "group" "introduction" "mining" "network" 
## [13] "package" "position" "postdoctoral" "r" 
## [17] "research" "see" "slides" "social" 
## [21] "tutorial" "university" "used" 
term.freq <- rowSums(as.matrix(tdm)) 
term.freq <- subset(term.freq, term.freq >= 15) 
df <- data.frame(term = names(term.freq), freq = term.freq) 
17 / 35
library(ggplot2) 
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") + 
xlab("Terms") + ylab("Count") + coord_flip() 
used 
university 
tutorial 
social 
slides 
see 
research 
r 
postdoctoral 
position 
package 
network 
mining 
introduction 
group 
examples 
data 
computing 
code 
book 
big 
applications 
analysis 
0 50 100 150 
Count 
Terms 
18 / 35
# which words are associated with 
r
? 
findAssocs(tdm, "r", 0.2) 
## r 
## examples 0.32 
## code 0.29 
## package 0.20 
# which words are associated with 
mining
? 
findAssocs(tdm, "mining", 0.25) 
## mining 
## data 0.47 
## mahout 0.30 
## recommendation 0.30 
## sets 0.30 
## supports 0.30 
## frequent 0.26 
## itemset 0.26 
19 / 35
library(graph) 
library(Rgraphviz) 
plot(tdm, term = freq.terms, corThreshold = 0.12, weighting = T) 
analysis 
research r 
applications big 
book code computing 
data 
examples 
group 
introduction 
mining 
network postdoctoral 
package 
position 
see 
slides 
tutorial social 
university used 
20 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
21 / 35
library(wordcloud) 
m <- as.matrix(tdm) 
# calculate the frequency of words and sort it by frequency 
word.freq <- sort(rowSums(m), decreasing = T) 
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3, 
random.order = F) 
language 
postdoctoral 
university 
social 
book 
data 
r 
project 
forecasting 
frequent 
applications 
big 
code 
tutorial 
job 
tried 
edited 
provided 
information 
research examples 
package 
mining 
analysis 
used 
cfp 
network 
positionslides 
computing 
see 
lecture 
introduction 
group 
analytics 
australia 
modelling 
scientist 
time 
association 
free 
online 
parallel 
text 
clustering 
course 
learn 
pdf 
rules 
series 
talk 
document 
now 
programming 
fellow 
statistics 
ausdm 
detection 
outlier 
google 
large 
open 
rdatamining 
techniques 
thanks 
tools 
users 
vacancy 
amp 
chapter 
due 
graphics 
processing 
notes 
map 
presentation 
science 
software 
starting 
studies 
visualizing 
business 
call 
can 
case center 
database 
details 
follower 
functions 
join 
kdnuggets 
page 
poll 
top 
submission spatial 
technology 
twitter 
analyst 
california 
card 
fast 
classification 
find 
get 
graph 
handling 
interactive 
linkedin 
list 
machine 
melbourne 
recent 
reference 
using 
videos 
web 
elsevier 
workshop 
wwwrdataminingcom 
access added 
canada 
canberra 
china 
cloud 
conference 
datasets 
distributed 
dmapps 
events draft 
experience 
high 
ibm 
industrial 
knowledge 
management 
nd 
performance 
predictive 
published 
search 
sentiment 
short 
snowfall 
southern 
sydney 
tenuretrack 
topic 
week 
youtube 
advanced 
also 
answers 
april 
area 
august 
building 
charts 
check 
comments 
containing 
cran 
create 
csiro 
dec 
dynamic 
easier 
engineer 
format 
guide 
hadoop 
igraph 
incl 
itemset 
jan 
labs 
may 
media 
microsoft 
mid 
new 
opportunity 
plot 
postdocresearch 
quick 
realworld 
san 
senior 
simple 
singapore 
state 
titled 
tj 
track 
tweets 
usamaf 
views 
visits 
watson 
website 22 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
23 / 35
# remove sparse terms 
tdm2 <- removeSparseTerms(tdm, sparse = 0.95) 
m2 <- as.matrix(tdm2) 
# cluster terms 
distMatrix <- dist(scale(m2)) 
fit <- hclust(distMatrix, method = "ward") 
24 / 35
plot(fit) 
rect.hclust(fit, k = 6) # cut tree into 6 clusters 
10 20 30 40 50 60 
research 
position 
postdoctoral 
university 
analysis 
network 
social 
examples 
package 
slides 
used 
computing 
tutorial 
big 
applications 
book 
r 
data 
mining 
Cluster Dendrogram 
Height 
25 / 35
m3 <- t(m2) # transpose the matrix to cluster documents (tweets) 
set.seed(122) # set a fixed random seed 
k <- 6 # number of clusters 
kmeansResult <- kmeans(m3, k) 
round(kmeansResult$centers, digits = 3) # cluster centers 
## analysis applications big book computing data examples 
## 1 0.147 0.088 0.147 0.015 0.059 1.015 0.088 
## 2 0.028 0.167 0.167 0.250 0.028 1.556 0.194 
## 3 0.810 0.000 0.000 0.000 0.000 0.048 0.095 
## 4 0.080 0.036 0.007 0.058 0.087 0.000 0.181 
## 5 0.000 0.000 0.000 0.067 0.067 0.333 0.067 
## 6 0.119 0.048 0.071 0.000 0.048 0.357 0.000 
## mining network package position postdoctoral r research 
## 1 0.338 0.015 0.015 0.059 0.074 0.235 0.074 
## 2 1.056 0.000 0.222 0.000 0.000 1.000 0.028 
## 3 0.048 1.000 0.095 0.143 0.095 0.286 0.048 
## 4 0.065 0.022 0.174 0.000 0.007 0.703 0.000 
## 5 1.200 0.000 0.000 0.000 0.067 0.067 0.000 
## 6 0.119 0.000 0.024 0.643 0.310 0.000 0.714 
## slides social tutorial university used 
## 1 0.074 0.000 0.015 0.015 0.029 
## 2 0.056 0.000 0.000 0.000 0.250 
## 3 0.095 0.762 0.190 0.000 0.095 
26 / 35
for (i in 1:k) f 
cat(paste("cluster ", i, ": ", sep = "")) 
s <- sort(kmeansResult$centers[i, ], decreasing = T) 
cat(names(s)[1:5], "nn") 
# print the tweets of every cluster 
# print(tweets[which(kmeansResult$cluster==i)]) 
g 
## cluster 1: data mining r analysis big 
## cluster 2: data mining r book used 
## cluster 3: network analysis social r tutorial 
## cluster 4: r examples package slides used 
## cluster 5: mining tutorial slides data book 
## cluster 6: research position university data postdoctoral 
27 / 35
library(fpc) 
# partitioning around medoids with estimation of number of clusters 
pamResult <- pamk(m3, metric="manhattan") 
k <- pamResult$nc # number of clusters identified 
pamResult <- pamResult$pamobject 
# print cluster medoids 
for (i in 1:k) f 
cat("cluster", i, ": ", 
colnames(pamResult$medoids)[which(pamResult$medoids[i,]==1)], "nn") 
g 
## cluster 1 : examples r 
## cluster 2 : analysis data r 
## cluster 3 : data 
## cluster 4 : 
## cluster 5 : r 
## cluster 6 : data mining r 
## cluster 7 : data mining 
## cluster 8 : analysis network social 
## cluster 9 : data position research 
## cluster 10 : position postdoctoral university 
28 / 35
# plot clustering result 
layout(matrix(c(1, 2), 1, 2)) # set to two graphs per page 
plot(pamResult, col.p = pamResult$clustering) 
−4 −2 0 2 4 6 
−4 −2 0 2 4 
clusplot(pam(x = sdata, k = k, diss = diss, metric = "manhattan")) 
Component 1 
Component 2 
These two components explain 25.23 % of the point variability. 
Silhouette plot of pam(x = sdata, k = k, diss = diss, metric = "manhattan") 
n = 320 10  clusters  Cj 
−0.2 0.0 0.2 0.4 0.6 0.8 1.0 
Silhouette width si 
Average silhouette width : 0.27 
j : nj | aveiÎCj  si 
1 : 28 | 0.13 
2 : 12 | 0.25 
3 : 31 | 0.23 
4 : 61 | 0.54 
5 : 58 | 0.29 
6 : 28 | 0.21 
7 : 43 | 0.15 
8 : 19 | 0.29 
9 : 21 | 0.15 
10 : 19 | 0.14 
layout(matrix(1)) # change back to one graph per page 
29 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
30 / 35
Topic Modelling 
dtm <- as.DocumentTermMatrix(tdm) 
library(topicmodels) 
lda <- LDA(dtm, k = 8) # find 8 topics 
term <- terms(lda, 4) # first 4 terms of every topic 
term 
## Topic 1 Topic 2 Topic 3 Topic 4 
## [1,] "data" "data" "mining" "r" 
## [2,] "mining" "r" "r" "lecture" 
## [3,] "scientist" "mining" "tutorial" "time" 
## [4,] "research" "book" "introduction" "series" 
## Topic 5 Topic 6 Topic 7 Topic 8 
## [1,] "position" "package" "r" "analysis" 
## [2,] "research" "r" "data" "network" 
## [3,] "postdoctoral" "clustering" "computing" "r" 
## [4,] "university" "detection" "slides" "code" 
term <- apply(term, MARGIN = 2, paste, collapse = ", ") 
31 / 35
Topic Modelling 
# first topic identified for every document (tweet) 
topic <- topics(lda, 1) 
topics <- data.frame(date=as.IDate(tweets.df$created), topic) 
qplot(date, ..count.., data=topics, geom="density", 
fill=term[topic], position="stack") 
0.4 
0.2 
0.0 
2011−07 2012−01 2012−07 2013−01 2013−07 
date 
count 
term[topic] 
analysis, network, r, code 
data, mining, scientist, research 
data, r, mining, book 
mining, r, tutorial, introduction 
package, r, clustering, detection 
position, research, postdoctoral, university 
r, data, computing, slides 
r, lecture, time, series 
32 / 35
Outline 
Introduction 
Extracting Tweets 
Text Cleaning 
Frequent Words and Associations 
Word Cloud 
Clustering 
Topic Modelling 
Online Resources 
33 / 35
Online Resources 
I Chapter 10: Text Mining, in book 
R and Data Mining: Examples and Case Studies 
http://www.rdatamining.com/docs/RDataMining.pdf 
I R Reference Card for Data Mining 
http://www.rdatamining.com/docs/R-refcard-data-mining.pdf 
I Free online courses and documents 
http://www.rdatamining.com/resources/ 
I RDataMining Group on LinkedIn (7,000+ members) 
http://group.rdatamining.com 
I RDataMining on Twitter (1,700+ followers) 
@RDataMining 
34 / 35

More Related Content

What's hot

Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSSujit Pal
 
Data communication & computer networking: Huffman algorithm
Data communication & computer networking:  Huffman algorithmData communication & computer networking:  Huffman algorithm
Data communication & computer networking: Huffman algorithmDr Rajiv Srivastava
 
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic ModerationHate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic ModerationUniversitat Politècnica de Catalunya
 
Text classification
Text classificationText classification
Text classificationJames Wong
 
Microsoft vs google ppt for final
Microsoft vs google ppt for finalMicrosoft vs google ppt for final
Microsoft vs google ppt for finalRashedul Amin
 
Scaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and DatabricksScaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and DatabricksDatabricks
 
Intro to Python Programming Language
Intro to Python Programming LanguageIntro to Python Programming Language
Intro to Python Programming LanguageDipankar Achinta
 
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Edureka!
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
C++ programming Unit 5 flow of control
C++ programming Unit 5 flow of controlC++ programming Unit 5 flow of control
C++ programming Unit 5 flow of controlAAKASH KUMAR
 
Intro automata theory
Intro automata theory Intro automata theory
Intro automata theory Rajendran
 
Question Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesQuestion Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesSujit Pal
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLPAnuj Gupta
 

What's hot (20)

Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDS
 
Data communication & computer networking: Huffman algorithm
Data communication & computer networking:  Huffman algorithmData communication & computer networking:  Huffman algorithm
Data communication & computer networking: Huffman algorithm
 
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic ModerationHate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation
 
Text classification
Text classificationText classification
Text classification
 
Microsoft vs google ppt for final
Microsoft vs google ppt for finalMicrosoft vs google ppt for final
Microsoft vs google ppt for final
 
8 python data structure-1
8 python data structure-18 python data structure-1
8 python data structure-1
 
Scaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and DatabricksScaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and Databricks
 
Time complexity
Time complexityTime complexity
Time complexity
 
Intro to Python Programming Language
Intro to Python Programming LanguageIntro to Python Programming Language
Intro to Python Programming Language
 
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
C++ programming Unit 5 flow of control
C++ programming Unit 5 flow of controlC++ programming Unit 5 flow of control
C++ programming Unit 5 flow of control
 
Intro automata theory
Intro automata theory Intro automata theory
Intro automata theory
 
陳宜欣/大數據下的情緒分析
陳宜欣/大數據下的情緒分析陳宜欣/大數據下的情緒分析
陳宜欣/大數據下的情緒分析
 
Question Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other StoriesQuestion Answering as Search - the Anserini Pipeline and Other Stories
Question Answering as Search - the Anserini Pipeline and Other Stories
 
Text Similarity
Text SimilarityText Similarity
Text Similarity
 
Context free grammar
Context free grammarContext free grammar
Context free grammar
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Propositional Logic and Pridicate logic
Propositional Logic and Pridicate logicPropositional Logic and Pridicate logic
Propositional Logic and Pridicate logic
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
 

Similar to Text Mining with R -- an Analysis of Twitter Data

Text Mining of Twitter in Data Mining
Text Mining of Twitter in Data MiningText Mining of Twitter in Data Mining
Text Mining of Twitter in Data MiningMeghaj Mallick
 
Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Johan Blomme
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rYanchang Zhao
 
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Goran S. Milovanovic
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In RRsquared Academy
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...Databricks
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkZalando Technology
 
Machine Learning with Microsoft Azure
Machine Learning with Microsoft AzureMachine Learning with Microsoft Azure
Machine Learning with Microsoft AzureDmitry Petukhov
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 
Dynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteDynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteSriram Natarajan
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"sandinmyjoints
 
Reproducibility with R
Reproducibility with RReproducibility with R
Reproducibility with RMartin Jung
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLPBill Liu
 
Learning deep structured semantic models for web search
Learning deep structured semantic models for web searchLearning deep structured semantic models for web search
Learning deep structured semantic models for web searchhyunsung lee
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0 Databricks
 
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentationCary Millsap
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Serban Tanasa
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ssSatoshi Kume
 

Similar to Text Mining with R -- an Analysis of Twitter Data (20)

Text Mining of Twitter in Data Mining
Text Mining of Twitter in Data MiningText Mining of Twitter in Data Mining
Text Mining of Twitter in Data Mining
 
Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1Text mining and social network analysis of twitter data part 1
Text mining and social network analysis of twitter data part 1
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
 
Machine Learning with Microsoft Azure
Machine Learning with Microsoft AzureMachine Learning with Microsoft Azure
Machine Learning with Microsoft Azure
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Dynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteDynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web site
 
An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"An Introduction to gensim: "Topic Modelling for Humans"
An Introduction to gensim: "Topic Modelling for Humans"
 
Reproducibility with R
Reproducibility with RReproducibility with R
Reproducibility with R
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Learning deep structured semantic models for web search
Learning deep structured semantic models for web searchLearning deep structured semantic models for web search
Learning deep structured semantic models for web search
 
A look ahead at spark 2.0
A look ahead at spark 2.0 A look ahead at spark 2.0
A look ahead at spark 2.0
 
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
“Performance” - Dallas Oracle Users Group 2019-01-29 presentation
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ss
 

More from Yanchang Zhao

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisYanchang Zhao
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classificationYanchang Zhao
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programmingYanchang Zhao
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rYanchang Zhao
 
RDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationRDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationYanchang Zhao
 
RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rYanchang Zhao
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rYanchang Zhao
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-cardYanchang Zhao
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with RYanchang Zhao
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with RYanchang Zhao
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with RYanchang Zhao
 
Data Exploration and Visualization with R
Data Exploration and Visualization with RData Exploration and Visualization with R
Data Exploration and Visualization with RYanchang Zhao
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RYanchang Zhao
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with RYanchang Zhao
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slidesYanchang Zhao
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data MiningYanchang Zhao
 

More from Yanchang Zhao (17)

RDataMining slides-time-series-analysis
RDataMining slides-time-series-analysisRDataMining slides-time-series-analysis
RDataMining slides-time-series-analysis
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
 
RDataMining slides-r-programming
RDataMining slides-r-programmingRDataMining slides-r-programming
RDataMining slides-r-programming
 
RDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-rRDataMining slides-network-analysis-with-r
RDataMining slides-network-analysis-with-r
 
RDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisationRDataMining slides-data-exploration-visualisation
RDataMining slides-data-exploration-visualisation
 
RDataMining slides-clustering-with-r
RDataMining slides-clustering-with-rRDataMining slides-clustering-with-r
RDataMining slides-clustering-with-r
 
RDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-rRDataMining slides-association-rule-mining-with-r
RDataMining slides-association-rule-mining-with-r
 
RDataMining-reference-card
RDataMining-reference-cardRDataMining-reference-card
RDataMining-reference-card
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with R
 
Time Series Analysis and Mining with R
Time Series Analysis and Mining with RTime Series Analysis and Mining with R
Time Series Analysis and Mining with R
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
Data Exploration and Visualization with R
Data Exploration and Visualization with RData Exploration and Visualization with R
Data Exploration and Visualization with R
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Time series-mining-slides
Time series-mining-slidesTime series-mining-slides
Time series-mining-slides
 
R Reference Card for Data Mining
R Reference Card for Data MiningR Reference Card for Data Mining
R Reference Card for Data Mining
 

Recently uploaded

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 

Recently uploaded (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 

Text Mining with R -- an Analysis of Twitter Data

  • 1. Text Mining with R { an Analysis of Twitter Data Yanchang Zhao http://www.RDataMining.com 30 September 2014 1 / 35
  • 2. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 2 / 35
  • 3. Text Mining I unstructured text data I text categorization I text clustering I entity extraction I sentiment analysis I document summarization I . . . 3 / 35
  • 4. Text mining of Twitter data with R 1 1. extract data from Twitter 2. clean extracted data and build a document-term matrix 3.
  • 5. nd frequent words and associations 4. create a word cloud to visualize important words 5. text clustering 6. topic modelling 1Chapter 10: Text Mining, R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining.pdf 4 / 35
  • 6. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 5 / 35
  • 7. Retrieve Tweets Retrieve recent tweets by @RDataMining ## Option 1: retrieve tweets from Twitter library(twitteR) tweets <- userTimeline("RDataMining", n = 3200) ## Option 2: download @RDataMining tweets from RDataMining.com url <- "http://www.rdatamining.com/data/rdmTweets.RData" download.file(url, destfile = "./data/rdmTweets.RData") ## load tweets into R load(file = "./data/rdmTweets-201306.RData") 6 / 35
  • 8. (n.tweet <- length(tweets)) ## [1] 320 tweets[1:5] ## [[1]] ## [1] "RDataMining: Examples on calling Java code from R nht... ## ## [[2]] ## [1] "RDataMining: Simulating Map-Reduce in R for Big Data A... ## ## [[3]] ## [1] "RDataMining: Job opportunity: Senior Analyst - Big Dat... ## ## [[4]] ## [1] "RDataMining: CLAVIN: an open source software package f... ## ## [[5]] ## [1] "RDataMining: An online book on Natural Language Proces... 7 / 35
  • 9. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 8 / 35
  • 10. # convert tweets to a data frame # tweets.df <- do.call("rbind", lapply(tweets, as.data.frame)) tweets.df <- twListToDF(tweets) dim(tweets.df) ## [1] 320 14 library(tm) # build a corpus, and specify the source to be character vectors myCorpus <- Corpus(VectorSource(tweets.df$text)) # convert to lower case myCorpus <- tm_map(myCorpus, tolower) Package tm v0.5-10 was used in this example. With tm v0.6, content transformer" needs to be used to wrap around normal functions. # tm v0.6 myCorpus <- tm_map(myCorpus, content_transformer(tolower)) 9 / 35
  • 11. # remove punctuation myCorpus <- tm_map(myCorpus, removePunctuation) # remove numbers myCorpus <- tm_map(myCorpus, removeNumbers) # remove URLs removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) myCorpus <- tm_map(myCorpus, removeURL) # add two extra stop words: available and via myStopwords <- c(stopwords("english"), "available", "via") # remove r and big from stopwords myStopwords <- setdiff(myStopwords, c("r", "big")) # remove stopwords from corpus myCorpus <- tm_map(myCorpus, removeWords, myStopwords) 10 / 35
  • 12. # keep a copy of corpus to use later as a dictionary for stem completion myCorpusCopy <- myCorpus # stem words myCorpus <- tm_map(myCorpus, stemDocument) # inspect the first 5 documents (tweets) inspect(myCorpus[1:5]) # The code below is used for to make text fit for paper width for (i in 1:5) f cat(paste("[[", i, "]] ", sep = "")) writeLines(myCorpus[[i]]) g ## [[1]] exampl call java code r ## ## [[2]] simul mapreduc r big data analysi use flight data ... ## [[3]] job opportun senior analyst big data wesfarm indust... ## [[4]] clavin open sourc softwar packag document geotag g... ## [[5]] onlin book natur languag process python 11 / 35
  • 13. # stem completion myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy) ## [[1]] examples call java code r ## [[2]] simulating mapreduce r big data analysis used flights... ## [[3]] job opportunity senior analyst big data wesfarmers in... ## [[4]] clavin open source software package document geotaggi... ## [[5]] online book natural language processing python 12 / 35
  • 14. # count frequency of "mining" miningCases <- tm_map(myCorpusCopy, grep, pattern = "nn<mining") sum(unlist(miningCases)) ## [1] 82 # count frequency of "miners" minerCases <- tm_map(myCorpusCopy, grep, pattern = "nn<miners") sum(unlist(minerCases)) ## [1] 4 # replace "miners" with "mining" myCorpus <- tm_map(myCorpus, gsub, pattern = "miners", replacement = "mining") 13 / 35
  • 15. tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf))) tdm ## A term-document matrix (790 terms, 320 documents) ## ## Non-/sparse entries: 2449/250351 ## Sparsity : 99% ## Maximal term length: 27 ## Weighting : term frequency (tf) 14 / 35
  • 16. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 15 / 35
  • 17. idx <- which(dimnames(tdm)$Terms == "r") inspect(tdm[idx + (0:5), 101:110]) ## A term-document matrix (6 terms, 10 documents) ## ## Non-/sparse entries: 4/56 ## Sparsity : 93% ## Maximal term length: 12 ## Weighting : term frequency (tf) ## ## Docs ## Terms 101 102 103 104 105 106 107 108 109 110 ## r 0 1 1 0 0 0 0 0 1 1 ## ramachandran 0 0 0 0 0 0 0 0 0 0 ## random 0 0 0 0 0 0 0 0 0 0 ## ranked 0 0 0 0 0 0 0 0 0 0 ## rann 0 0 0 0 0 0 0 0 0 0 ## rapidminer 0 0 0 0 0 0 0 0 0 0 16 / 35
  • 18. # inspect frequent words (freq.terms <- findFreqTerms(tdm, lowfreq = 15)) ## [1] "analysis" "applications" "big" "book" ## [5] "code" "computing" "data" "examples" ## [9] "group" "introduction" "mining" "network" ## [13] "package" "position" "postdoctoral" "r" ## [17] "research" "see" "slides" "social" ## [21] "tutorial" "university" "used" term.freq <- rowSums(as.matrix(tdm)) term.freq <- subset(term.freq, term.freq >= 15) df <- data.frame(term = names(term.freq), freq = term.freq) 17 / 35
  • 19. library(ggplot2) ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") + xlab("Terms") + ylab("Count") + coord_flip() used university tutorial social slides see research r postdoctoral position package network mining introduction group examples data computing code book big applications analysis 0 50 100 150 Count Terms 18 / 35
  • 20. # which words are associated with r ? findAssocs(tdm, "r", 0.2) ## r ## examples 0.32 ## code 0.29 ## package 0.20 # which words are associated with mining ? findAssocs(tdm, "mining", 0.25) ## mining ## data 0.47 ## mahout 0.30 ## recommendation 0.30 ## sets 0.30 ## supports 0.30 ## frequent 0.26 ## itemset 0.26 19 / 35
  • 21. library(graph) library(Rgraphviz) plot(tdm, term = freq.terms, corThreshold = 0.12, weighting = T) analysis research r applications big book code computing data examples group introduction mining network postdoctoral package position see slides tutorial social university used 20 / 35
  • 22. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 21 / 35
  • 23. library(wordcloud) m <- as.matrix(tdm) # calculate the frequency of words and sort it by frequency word.freq <- sort(rowSums(m), decreasing = T) wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3, random.order = F) language postdoctoral university social book data r project forecasting frequent applications big code tutorial job tried edited provided information research examples package mining analysis used cfp network positionslides computing see lecture introduction group analytics australia modelling scientist time association free online parallel text clustering course learn pdf rules series talk document now programming fellow statistics ausdm detection outlier google large open rdatamining techniques thanks tools users vacancy amp chapter due graphics processing notes map presentation science software starting studies visualizing business call can case center database details follower functions join kdnuggets page poll top submission spatial technology twitter analyst california card fast classification find get graph handling interactive linkedin list machine melbourne recent reference using videos web elsevier workshop wwwrdataminingcom access added canada canberra china cloud conference datasets distributed dmapps events draft experience high ibm industrial knowledge management nd performance predictive published search sentiment short snowfall southern sydney tenuretrack topic week youtube advanced also answers april area august building charts check comments containing cran create csiro dec dynamic easier engineer format guide hadoop igraph incl itemset jan labs may media microsoft mid new opportunity plot postdocresearch quick realworld san senior simple singapore state titled tj track tweets usamaf views visits watson website 22 / 35
  • 24. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 23 / 35
  • 25. # remove sparse terms tdm2 <- removeSparseTerms(tdm, sparse = 0.95) m2 <- as.matrix(tdm2) # cluster terms distMatrix <- dist(scale(m2)) fit <- hclust(distMatrix, method = "ward") 24 / 35
  • 26. plot(fit) rect.hclust(fit, k = 6) # cut tree into 6 clusters 10 20 30 40 50 60 research position postdoctoral university analysis network social examples package slides used computing tutorial big applications book r data mining Cluster Dendrogram Height 25 / 35
  • 27. m3 <- t(m2) # transpose the matrix to cluster documents (tweets) set.seed(122) # set a fixed random seed k <- 6 # number of clusters kmeansResult <- kmeans(m3, k) round(kmeansResult$centers, digits = 3) # cluster centers ## analysis applications big book computing data examples ## 1 0.147 0.088 0.147 0.015 0.059 1.015 0.088 ## 2 0.028 0.167 0.167 0.250 0.028 1.556 0.194 ## 3 0.810 0.000 0.000 0.000 0.000 0.048 0.095 ## 4 0.080 0.036 0.007 0.058 0.087 0.000 0.181 ## 5 0.000 0.000 0.000 0.067 0.067 0.333 0.067 ## 6 0.119 0.048 0.071 0.000 0.048 0.357 0.000 ## mining network package position postdoctoral r research ## 1 0.338 0.015 0.015 0.059 0.074 0.235 0.074 ## 2 1.056 0.000 0.222 0.000 0.000 1.000 0.028 ## 3 0.048 1.000 0.095 0.143 0.095 0.286 0.048 ## 4 0.065 0.022 0.174 0.000 0.007 0.703 0.000 ## 5 1.200 0.000 0.000 0.000 0.067 0.067 0.000 ## 6 0.119 0.000 0.024 0.643 0.310 0.000 0.714 ## slides social tutorial university used ## 1 0.074 0.000 0.015 0.015 0.029 ## 2 0.056 0.000 0.000 0.000 0.250 ## 3 0.095 0.762 0.190 0.000 0.095 26 / 35
  • 28. for (i in 1:k) f cat(paste("cluster ", i, ": ", sep = "")) s <- sort(kmeansResult$centers[i, ], decreasing = T) cat(names(s)[1:5], "nn") # print the tweets of every cluster # print(tweets[which(kmeansResult$cluster==i)]) g ## cluster 1: data mining r analysis big ## cluster 2: data mining r book used ## cluster 3: network analysis social r tutorial ## cluster 4: r examples package slides used ## cluster 5: mining tutorial slides data book ## cluster 6: research position university data postdoctoral 27 / 35
  • 29. library(fpc) # partitioning around medoids with estimation of number of clusters pamResult <- pamk(m3, metric="manhattan") k <- pamResult$nc # number of clusters identified pamResult <- pamResult$pamobject # print cluster medoids for (i in 1:k) f cat("cluster", i, ": ", colnames(pamResult$medoids)[which(pamResult$medoids[i,]==1)], "nn") g ## cluster 1 : examples r ## cluster 2 : analysis data r ## cluster 3 : data ## cluster 4 : ## cluster 5 : r ## cluster 6 : data mining r ## cluster 7 : data mining ## cluster 8 : analysis network social ## cluster 9 : data position research ## cluster 10 : position postdoctoral university 28 / 35
  • 30. # plot clustering result layout(matrix(c(1, 2), 1, 2)) # set to two graphs per page plot(pamResult, col.p = pamResult$clustering) −4 −2 0 2 4 6 −4 −2 0 2 4 clusplot(pam(x = sdata, k = k, diss = diss, metric = "manhattan")) Component 1 Component 2 These two components explain 25.23 % of the point variability. Silhouette plot of pam(x = sdata, k = k, diss = diss, metric = "manhattan") n = 320 10 clusters Cj −0.2 0.0 0.2 0.4 0.6 0.8 1.0 Silhouette width si Average silhouette width : 0.27 j : nj | aveiÎCj si 1 : 28 | 0.13 2 : 12 | 0.25 3 : 31 | 0.23 4 : 61 | 0.54 5 : 58 | 0.29 6 : 28 | 0.21 7 : 43 | 0.15 8 : 19 | 0.29 9 : 21 | 0.15 10 : 19 | 0.14 layout(matrix(1)) # change back to one graph per page 29 / 35
  • 31. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 30 / 35
  • 32. Topic Modelling dtm <- as.DocumentTermMatrix(tdm) library(topicmodels) lda <- LDA(dtm, k = 8) # find 8 topics term <- terms(lda, 4) # first 4 terms of every topic term ## Topic 1 Topic 2 Topic 3 Topic 4 ## [1,] "data" "data" "mining" "r" ## [2,] "mining" "r" "r" "lecture" ## [3,] "scientist" "mining" "tutorial" "time" ## [4,] "research" "book" "introduction" "series" ## Topic 5 Topic 6 Topic 7 Topic 8 ## [1,] "position" "package" "r" "analysis" ## [2,] "research" "r" "data" "network" ## [3,] "postdoctoral" "clustering" "computing" "r" ## [4,] "university" "detection" "slides" "code" term <- apply(term, MARGIN = 2, paste, collapse = ", ") 31 / 35
  • 33. Topic Modelling # first topic identified for every document (tweet) topic <- topics(lda, 1) topics <- data.frame(date=as.IDate(tweets.df$created), topic) qplot(date, ..count.., data=topics, geom="density", fill=term[topic], position="stack") 0.4 0.2 0.0 2011−07 2012−01 2012−07 2013−01 2013−07 date count term[topic] analysis, network, r, code data, mining, scientist, research data, r, mining, book mining, r, tutorial, introduction package, r, clustering, detection position, research, postdoctoral, university r, data, computing, slides r, lecture, time, series 32 / 35
  • 34. Outline Introduction Extracting Tweets Text Cleaning Frequent Words and Associations Word Cloud Clustering Topic Modelling Online Resources 33 / 35
  • 35. Online Resources I Chapter 10: Text Mining, in book R and Data Mining: Examples and Case Studies http://www.rdatamining.com/docs/RDataMining.pdf I R Reference Card for Data Mining http://www.rdatamining.com/docs/R-refcard-data-mining.pdf I Free online courses and documents http://www.rdatamining.com/resources/ I RDataMining Group on LinkedIn (7,000+ members) http://group.rdatamining.com I RDataMining on Twitter (1,700+ followers) @RDataMining 34 / 35
  • 36. The End Thanks! Email: yanchang(at)rdatamining.com 35 / 35