Security Challenges in
Online Social Media
Chun-Ming Tim Lai
賴俊鳴, Assistant Professor
Personal information
2007 – 2011: B.S. NTU CSIE
• Prof. Juin-Ming Chen (Math): Lattice Reduction
• Prof. Der-Tsai Lee (CSIE), Chen-Mou Cheng (EE): Secure Index
2011 – 2012: Military Police second lieutenant
2013 – 2019: Ph.D. study, UC Davis
• Prof. S. Felix Wu
• Dissertation: Attackers’ Intention and Influence
Analysis in Social Media
2019 ~ Cloud Innovation School @ Tunghai University (東海大學)
10/18/2021 2
Social Media
Exerting significant impact on mass communication
Aspect | Traditional Media | Social Media
Data size | Less | More
User type | Reader | Editor/Reporter
Time-based | Delayed | Real time
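Reaction
▪ Timeliness of replies
▪ Whether the reply addresses the key point; opening and tracking a case
▪ Life cycle of a post
▪ About 1.5 hours on average; it affects people's lives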
Security Threat
Severe Threat
• Phishing
• Malware, drive-by-download
Medium to light Threat
• Advertisement
• Spamming (Fund-raising, porn, canned messages, etc.)
New type Threat
• Rumors, Media manipulation, sign up, vote stuffing, etc.
• Fake News
• Crowdturfing = CrowdSourcing + Astroturfing
Outline
▪ Suitable Target, Lifecycle Analysis
▪ Multiple Accounts Detection
▪ Geolocation Identification
▪ Personal words
Facebook.com/63811549237/posts/10153038271604238
2014-12-19, 03:06 am GMT
Social Media: Climate Change
Total: 609 comments (time axis in GMT+0)
Suitable Targets Problem
For any post thread p on a social media platform, predict whether p
contains at least one malicious comment, using a classifier
c: p → {target, nontarget}
Key idea: Life Cycle of Posts
10 hrs
Definition
▪ Time Series (TS)
• TScreated(post): the time at which the original post is published
• TSj: a time period j following the creation time of the original post
• TSfinal: the end of our observation
▪ Accumulated Number of participants (AccNcomment)
• The number of comments on the post between TS(i-1) and TSi
▪ Discussion Atmosphere Vector (DAV)
• The sequence of per-window comment counts from TScreated to TSfinal
Example
TScreated(Climate) = 2014-12-19 03:06:42
Suppose j = 5 minutes and final = 120 minutes
DAV(Climate) = [# of comments 03:06:42 ~ 03:11:42 (1st),
# of comments 03:11:42 ~ 03:16:42 (2nd),
…,
# of comments 05:01:42 ~ 05:06:42 (24th)]
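The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `discussion_atmosphere_vector`, its parameters, and the sample comment timestamps are hypothetical, and the window length and final observation time are taken to be minutes, as the timestamps in the example suggest.

```python
from datetime import datetime, timedelta

def discussion_atmosphere_vector(created, comment_times,
                                 window_minutes=5, final_minutes=120):
    """Count comments in consecutive windows after the post creation time."""
    n_windows = final_minutes // window_minutes
    window = timedelta(minutes=window_minutes)
    dav = [0] * n_windows
    for t in comment_times:
        offset = t - created
        if offset < timedelta(0):
            continue  # comment timestamp precedes the post: skip it
        index = offset // window  # timedelta // timedelta gives an int
        if index < n_windows:     # comments after TSfinal are ignored
            dav[index] += 1
    return dav

# Hypothetical comment timestamps for the Climate post.
created = datetime(2014, 12, 19, 3, 6, 42)
comments = [
    created + timedelta(minutes=1),
    created + timedelta(minutes=4),
    created + timedelta(minutes=7),
    created + timedelta(minutes=119),
    created + timedelta(minutes=130),  # after TSfinal
]
dav = discussion_atmosphere_vector(created, comments)
print(len(dav), dav[0], dav[1], dav[23])
```

With j = 5 and final = 120 the vector has 120 / 5 = 24 entries, matching the 24 windows listed in the example.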
Dataset
2011~2014 Ten Main Media pages on
Facebook
Totally 42,703,463
10/18/2021 15
Feature Engineering
▪ # of comments, # of likes, # of shares
▪ Spanning time (last comment time - first comment time)
▪ Temporal features over a delta time window, up to a final observation time
▪ Context-free: no Natural Language Processing is needed
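The non-temporal features listed above could be collected roughly as follows. This is a minimal sketch, not the pipeline used in the talk: the function `thread_features`, its field names, and the sample values are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def thread_features(created, comment_times, n_likes, n_shares):
    """Context-free features of one post thread (no text is inspected)."""
    ordered = sorted(comment_times)
    if ordered:
        spanning = (ordered[-1] - ordered[0]).total_seconds()
        first_delay = (ordered[0] - created).total_seconds()
    else:
        spanning, first_delay = 0.0, None
    return {
        "n_comments": len(ordered),
        "n_likes": n_likes,
        "n_shares": n_shares,
        "spanning_seconds": spanning,            # last comment - first comment
        "first_comment_delay_seconds": first_delay,
    }

# Hypothetical thread: two comments, 60 s and 600 s after creation.
created = datetime(2014, 12, 19, 3, 6, 42)
features = thread_features(
    created,
    [created + timedelta(seconds=600), created + timedelta(seconds=60)],
    n_likes=10,
    n_shares=3,
)
print(features)
```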
[Figure labels: time elapsed; 1st comments, 1st likes, 1st shares]
Results
Near Real Time
Discussion: Do you understand Facebook enough?
• Attackers' preference
• Selected by Facebook
• Audience reaction
• Bandwagon effect
• Rich get richer
• People love biased and debatable posts
Life Cycle and Influence Ratio
[Figure: CNN, 2012, all post threads; annotations: >70%, mURL]
DAV Predict IR (1/2)
DAV Predict IR (2/2)
Accounts Activity within a week around election date
Active = Count(Activities) within 1 week >= threshold
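The activity rule above translates directly into code. The sketch below assumes activities arrive as (account, timestamp) pairs; the function name `active_accounts`, the week boundaries, and the sample records are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def active_accounts(activities, week_start, threshold):
    """Accounts with at least `threshold` activities in the 7 days from week_start."""
    week_end = week_start + timedelta(weeks=1)
    counts = Counter(
        account for account, ts in activities if week_start <= ts < week_end
    )
    return {account for account, n in counts.items() if n >= threshold}

# Hypothetical activity log.
week_start = datetime(2016, 11, 1)
log = [
    ("a", datetime(2016, 11, 2)),
    ("a", datetime(2016, 11, 5)),
    ("b", datetime(2016, 11, 3)),
    ("b", datetime(2016, 11, 20)),  # outside the week, not counted
]
active = active_accounts(log, week_start, threshold=2)
print(active)
```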
[Figure panels: Clinton, 1st week; Clinton, 2nd week; Trump, 1st week; Trump, 2nd week]
All accounts: periodic
Attacker accounts: random
Conclusion
Suitable targets can be predicted successfully with temporal features
• Attackers: follow or not?
• Defenders: where to deploy resources
Temporal analysis with different variables
• Influence Ratio: will it increase or decrease in the next time window?
• 24-hour pattern: links online and offline behavior
Outline
▪ Suitable Target, Lifecycle Analysis
▪ Multiple Accounts Detection
▪ Geolocation Identification
▪ Personal words
Semi-Supervised Learning on Graphs
Motivation of detecting multiple accounts on FB
[Figure: Crawler 1, Crawler 2, and Crawler 3 each call the Facebook API]
When a crawler calls the Facebook API, the API gives each crawler a different scope ID. The same user therefore appears under different scope IDs in the dataset.
100003468896671 高婷婷
https://www.facebook.com/mayuko.sakamoto.503
100004123536871 賴婷婷
https://www.facebook.com/profile.php?id=100004123536871
100003251795795 陳婷婷 https://www.facebook.com/rika.etoh
100000681128139 高婷婷 https://www.facebook.com/vincenzo.muscari.5
100002630019886 陳婷婷 https://www.facebook.com/sven.erkens.98
813243492 高婷婷 https://www.facebook.com/profile.php?id=813243492
Ting-Ting’s Family
Number of friends Facebook allows: 5000
100003468896671 高婷婷 45xx
100004123536871 賴婷婷 45xx
100003251795795 陳婷婷 4xxx
Multiple Accounts Detection using
Semi-Supervised Learning on Graphs
When data is crawled from FB with multiple crawlers, the API gives each crawler a scope ID instead of the user's primary ID.
For example, a user's primary ID is mohamed.aimane.98. He has multiple scope IDs:
1815396745342476, 1815402648675219, 1815411572007660,
1815468805335270, 1815515615330589, 1815482155333935, 1815488781999939,
1816157185266432. This implies that mohamed's data was crawled by 8 different crawlers.
As a result, our dataset contains many IDs that all share the user name mohamed aimane.
Problem: given 2 scope IDs with the same user name, do they belong to the same user (same
primary ID) or not?
Motivation of detecting multiple accounts on FB
Graph Construction
U: {Users}, V:{Pages}, edge:{u,v} : u had an activity on page v
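The bipartite construction above can be sketched with two adjacency maps. This is an illustrative data structure only; the function `build_bipartite_graph` and the sample activity pairs are hypothetical.

```python
from collections import defaultdict

def build_bipartite_graph(activities):
    """Build user -> pages and page -> users maps from (user, page) activities."""
    user_pages = defaultdict(set)
    page_users = defaultdict(set)
    for user, page in activities:
        user_pages[user].add(page)   # edge {u, v}: u had an activity on page v
        page_users[page].add(user)
    return user_pages, page_users

# Hypothetical activities: two scope IDs active on the same pages.
activities = [("id_1", "page_a"), ("id_1", "page_b"),
              ("id_2", "page_a"), ("id_2", "page_b"),
              ("id_3", "page_c")]
user_pages, page_users = build_bipartite_graph(activities)
print(sorted(page_users["page_a"]))
```

Two scope IDs that touch the same pages end up with overlapping neighborhoods, which is the signal the similarity methods exploit.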
Activities
Main Algorithms
Unsupervised learning using Katz Similarity
P_xy(l) = (x, x_1, x_2, …, y): a path from x to y of length l
u1 and u2 are similar if their activity paths are similar
Katz similarity can be computed by:
$$K = \sum_{l=1}^{\infty} \beta^{l} M^{l} = (I - \beta M)^{-1} - I$$
where M is the adjacency matrix of graph G, $\beta$ is a scalar smaller than $1/\lVert M \rVert_2$ to
ensure convergence, and I is the identity matrix.
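As a hedged illustration, the Katz index in its usual form sums the powers of βM, which converges to (I − βM)^(-1) − I when β is below 1/‖M‖₂. The sketch below truncates that series in pure Python; the function names, the truncation length, and the three-node example graph (two users and one shared page) are assumptions, not the implementation used in the talk.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz_similarity(adj, beta, max_length=100):
    """Truncated Katz index: sum over l = 1..max_length of (beta * M)^l."""
    n = len(adj)
    scaled = [[beta * x for x in row] for row in adj]
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # (beta*M)^0
    total = [[0.0] * n for _ in range(n)]
    for _ in range(max_length):
        power = mat_mul(power, scaled)
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
    return total

# Nodes 0 and 1 are users, node 2 is a page both were active on.
M = [[0, 0, 1],
     [0, 0, 1],
     [1, 1, 0]]
K = katz_similarity(M, beta=0.5)  # the 2-norm of M is sqrt(2), so beta < 0.707
print(round(K[0][1], 6))
```

For a graph of realistic size the closed form with a sparse linear solver would be the more natural choice; the truncated sum is used here only to keep the sketch dependency-free.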
Main Algorithms
Unsupervised learning using Katz Similarity
Katz matrix is
1    0.9  0.2  0.3
0.9  1    0.5  0.5
0.2  0.6  1    0.8
0.3  0.5  0.8  1
The threshold we use is 0.8.
The 1st and 2nd nodes then belong to the same user, and the 3rd and 4th nodes
belong to the same user; the others do not.
Example of Algorithm 1
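The thresholding step can be sketched as below. The function `same_user_pairs` is hypothetical, the comparison is taken to be inclusive (a score of exactly 0.8 counts as a match, as the slide's conclusion implies), and the placement of the off-diagonal values in the sample matrix is an assumption, since the slide's matrix layout did not survive extraction.

```python
def same_user_pairs(similarity, threshold=0.8):
    """Return 1-based node pairs whose similarity reaches the threshold."""
    n = len(similarity)
    return [(i + 1, j + 1)
            for i in range(n)
            for j in range(i + 1, n)
            if similarity[i][j] >= threshold]

# Values from the slide; off-diagonal block placement is assumed.
katz = [
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.5, 0.5],
    [0.2, 0.6, 1.0, 0.8],
    [0.3, 0.5, 0.8, 1.0],
]
pairs = same_user_pairs(katz)
print(pairs)
```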
Main Algorithms
Semi-Supervised Method using Graph Embedding
Classical ML Tasks in Networks
• Node Classification
• Predict type of a node
• Link Prediction
• Predict friends
• Community Detection
• Network Similarity
• How similar are two networks
Node2vec(1/4)
Many possible node features:
• PageRank score, degree, centrality, # of edges, etc.
Node2vec(2/4)
Mixture of BFS and DFS
BFS --- LocalView (u and S1)
DFS --- GlobalView (u and S6)
Node2vec(3/4)
• Two Parameters:
• Return parameter p:
• Return back to the previous node
• In-out parameter q:
• Moving outwards (DFS) vs. inwards (BFS)
• The ratio of BFS vs.DFS
• Biased 2nd-order random walks explore network neighborhoods.
Parameters
Node2vec(4/4)
• Simulate r random walks of length l starting from each node u
• Optimize the node2vec objective using Stochastic Gradient Descent
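A biased second-order walk of the kind node2vec uses can be sketched in pure Python as follows. This is a simplified illustration for unweighted graphs, not the reference implementation: the function names, the default values, and the toy graph are assumptions. The next node gets weight 1/p if it is the previous node, 1 if it is also a neighbor of the previous node, and 1/q otherwise.

```python
import random

def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """One biased 2nd-order random walk visiting up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is unbiased
            continue
        previous = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == previous:
                weights.append(1.0 / p)   # return to the previous node
            elif nxt in graph[previous]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move outwards
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

def simulate_walks(graph, r, length, p=1.0, q=1.0, seed=0):
    """Simulate r walks of the given length starting from every node."""
    rng = random.Random(seed)
    return [node2vec_walk(graph, u, length, p, q, rng)
            for _ in range(r) for u in sorted(graph)]

# Toy undirected graph as a dict of neighbor sets.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
walks = simulate_walks(graph, r=2, length=5, p=0.5, q=2.0)
print(len(walks), walks[0])
```

The resulting walks would then be fed to a skip-gram style model trained with stochastic gradient descent to obtain the embeddings; that step is omitted here.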
Embedding for node 1: (0.1, 0.3, 0.2, 0.4); embedding for node 2: (0.2, 0.3, 0.2, 0.4)
We sample some ground truth, e.g. node 1 and node 2 belong to the same user, etc.
L looks like: ((1,2), 1), ((1,3), 0), ((2,3), 0), ((2,4), -1), ((3,4), -1), …
X comes from the embeddings, for example ((1,2), (0.1, 0, 0, 0)), …
We then feed X and L into a label spreading model, which yields that the 1st and 2nd
nodes belong to the same user, the 3rd and 4th nodes belong to the same user, and the
others do not.
Example of Algorithm 3
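The pair feature in this example is consistent with an element-wise absolute difference of the two embeddings: |(0.1, 0.3, 0.2, 0.4) − (0.2, 0.3, 0.2, 0.4)| = (0.1, 0, 0, 0). The sketch below builds X that way; this reading is an inference from the numbers on the slide, and the function name `pair_feature` is hypothetical. Treating the label -1 as "unlabeled" is likewise an assumption.

```python
def pair_feature(embeddings, a, b):
    """Element-wise absolute difference of two node embeddings."""
    return tuple(abs(x - y) for x, y in zip(embeddings[a], embeddings[b]))

# Embeddings for nodes 1 and 2 as given in the example.
embeddings = {
    1: (0.1, 0.3, 0.2, 0.4),
    2: (0.2, 0.3, 0.2, 0.4),
}
# Pair labels from the example: 1 = same user, 0 = different users,
# -1 = presumably unlabeled (assumption).
L = {(1, 2): 1, (1, 3): 0, (2, 3): 0, (2, 4): -1, (3, 4): -1}

X = {(1, 2): pair_feature(embeddings, 1, 2)}
print([round(v, 6) for v in X[(1, 2)]])
```

The feature rows and labels would then go to a label spreading model, which propagates the known labels to the unlabeled pairs.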
Main Algorithms
Different measurement of Embedding Vectors
Experiments and Evaluation
Comparison among the Three Methods
Two simple datasets : dataset 1: 188 nodes and 262 activities (links);
dataset 2: 4188 accounts and 6715 activities(links).
Outline
▪ Suitable Target, Lifecycle Analysis
▪ Multiple Accounts Detection
▪ Geolocation Identification
▪ Personal words
Page Information and Page-like Graph
[Figure: page-like graph linking Sports Illustrated, Golden State Warriors, Oakland Museum, and Giving Tuesday; edges are labeled "like"]
Field | Example
Page ID | 47657117525
Name | Golden State Warriors
Category | Sports Team
Country | United States
Fan Count | 11,019,236
Description | The Official Facebook page of the Golden State Warriors
• Facebook public pages are public profiles used by local businesses, companies, organizations, or public figures
• Likes: promoting other pages to community participants
Data Collection
Facebook Graph API version 2.8 was used to collect our data [1]
• 38,831,367 pages (for this work)
• 2,430,873 US
• 12,685,090 other countries
• 23,715,404 unknown
[1] https://developers.facebook.com/docs/graph-api/reference/page
Majority Vote Algorithm
• Location means state-level information in this scenario
• A page's location label is the one that receives the most votes
• Overall accuracy is only 59.4%
• The algorithm works well on the page nationality prediction task, with 90.25% accuracy
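A majority vote of this kind can be sketched as follows. The slide does not say who casts the votes; the sketch assumes that a page's neighbors in the page-like graph vote with their known locations. The function `majority_vote` and the sample pages are hypothetical.

```python
from collections import Counter

def majority_vote(page, neighbors, known_location):
    """Label a page with the most common known location among its neighbors."""
    votes = Counter(
        known_location[n] for n in neighbors.get(page, ()) if n in known_location
    )
    if not votes:
        return None  # no neighbor has a known location
    return votes.most_common(1)[0][0]

# Hypothetical neighborhood: three labeled neighbors, one unlabeled.
neighbors = {"P": ["a", "b", "c", "d"]}
known_location = {"a": "California", "b": "California", "c": "Texas"}
label = majority_vote("P", neighbors, known_location)
print(label)
```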
Baseline Algorithm
Utilizes locality of states to find pages belonging to their corresponding states
• Pick out anchored pages, with local property, as multiple seeds to start BFS from
Target classifier: 51 classes
• 50 classes of US states and a class of "others (OT)"
State Distance Vector (SDV)
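One way to picture the State Distance Vector is as the BFS hop distance from each state's anchor page to the page in question. The sketch below is only that picture, under assumptions: the function names, the adjacency-map format, and the toy graph are hypothetical, and the direction of traversal (which the deck distinguishes as IHOP and OHOP) is left to the caller, who can pass either the forward or the reversed like-edges.

```python
from collections import deque

def hop_distances(edges, source):
    """BFS hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def state_distance_vector(page, edges, anchors):
    """Map each state to the hop distance from its anchor page to `page`."""
    return {state: hop_distances(edges, anchor).get(page)  # None if unreachable
            for state, anchor in anchors.items()}

# Toy like-graph with hypothetical anchor pages for three states.
edges = {"anchor_az": {"a"}, "a": {"P"}, "anchor_ca": {"P"}, "anchor_wy": set()}
anchors = {"Arizona": "anchor_az", "California": "anchor_ca", "Wyoming": "anchor_wy"}
sdv = state_distance_vector("P", edges, anchors)
print(sdv)
```

On a graph with tens of millions of nodes it would be cheaper to run one BFS per anchor and read off all pages at once, rather than one call per page as in this sketch.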
[Figure: State Distance Vector of a page P over the states Alabama, Alaska, Arkansas, Arizona, …, Wyoming; e.g. IHOP(P, S_Arizona) == 4 and OHOP(P, S_Arizona) == 3. Graph size: 31M+ nodes, 600M+ edges]
Anchor Page Selection (1/2)
Effectiveness of BFS-based algorithms
• It depends on anchored page selection
Anchored pages have to be local so that the SDV reflects the true tendency of a page's locality
Suitable examples (focusing on local communities)
• State universities, government, park, or police organizations
Ill-suited examples (popular and thus having global impact)
• NBA, MLB, or NFL sports teams
Anchor Page Selection (2/2)
• We adopt all subsidiary pages of "OnlyInYourState.com" as a set of anchored pages
• It has a distinct page for each state
• Each subsidiary page mostly connects local communities
Page Name | Page ID
Only In Alabama | 783744898386760
Only In Alaska | 686107314826906
Only In Southern California | 1840349052857006
Only In Northern California | 856450181102963
Idaho Only | 435099846671531
Only In New York | 386608421546055
Only In Virginia | 1560515737540492
Only In West Virginia | 1509709509286532
Only In Wisconsin | 1390297064627420
Only In Wyoming | 1724174364476381
[Figure: 51 anchors; labeled examples: Arizona, Northern California]
Advanced Algorithm
Baseline algorithm's drawback
• A local page can have a few connections to pages that are far away
• This kind of connection noise can sharply reduce prediction accuracy
State Neighborhood Probability (SNP)
Both SDV and SNP are taken as feature vectors for ML models
• Utilize locality and neighborhood context for better identification
Dataset
California accounts for 20% of all US pages, and about half of all pages (49.49%) are located in the top 5 states
• California, New York, Florida, Illinois, and Texas
Accuracy Summary
Classifier | Precision | Recall | F1 score
Naive Bayes (Baseline BFS) | 0.44 | 0.27 | 0.26
Adaboost (Baseline BFS) | 0.46 | 0.40 | 0.37
Random Forest (Baseline BFS) | 0.69 | 0.69 | 0.68
Random Forest (Advanced BFS) | 0.89 | 0.88 | 0.88
Outline
▪ Suitable Target, Lifecycle Analysis
▪ Multiple Accounts Detection
▪ Geolocation Identification
▪ Personal words
Future Trends -- IT
Thank you!
Q & A

NDU Present

  • 1. Security Challenges in Online Social Media Chun-Ming Tim Lai 賴俊鳴 助理教授
  • 2. Personal information 2007 – 2011: B.S. NTU CSIE • Prof. Juin-Ming Chen (Math): Lattice Reduction • Prof. Der-Tsai. Lee (CSIE), Chen-Mou Cheng (EE): Secure Index 2011 – 2012: Military Police second lieutenant 2013 – 2019: Ph.D. study, UC Davis • Prof. S. Felix Wu • Dissertation: Attackers’ Intention and Influence Analysis in Social Media 2019 ~ Cloud Innovation School @ 東海大學 10/18/2021 2
  • 3. Social Media Exerting significant impact on mass communication 10/18/2021 3 Traditional Media Social Media Datasize Less More User Type Reader Editor/Reporter Time-based Delayed Real time
  • 5. Reaction  回覆的即時性  是否切中要點,立案追蹤  文章的生命週期  平均1.5小時,影響人的生活 10/18/2021 5
  • 6. Security Threat Severe Threat • Phishing • Malware, drive-by-download Medium to light Threat • Advertisement • Spamming (Fund-raising, porn, canned messages, etc.) New type Threat • Rumors, Media manipulation, sign up, vote stuffing, etc. • Fake News • Crowdturfing = CrowdSourcing + Astroturfing 10/18/2021 6
  • 7. Outline  Suitable Target, Lifecycle Analysis  Multiple Accounts Detection  Geolocation Identification  Personal words 10/18/2021 7
  • 11. Suitable Targets Problem Any post thread p in social media platform, predict whether p contains at least one malicious comment via a classifier – c {target,nontarget} 10/18/2021 11
  • 12. Key idea: Life Cycle of Posts 10/18/2021 12 10 hrs
  • 13. Definition  Time Series (TS) • TScreated(post): the time an original article is posted • TSj: a time period j following the time of the original • TSfinal: the end of our observation  Accumulated Number of participants (AccNcomment) • The number of post comments between TSi and TS(i-1)  Discussion Atmosphere Vector (DAV) 10/18/2021 13
  • 14. Example TScreated(Climate) = 2014-12-19 03:06:42 Suppose j = 5, final = 120 DAV(Climate) = [# of comments 03:06:42 ~ 03:11:42 1st # of comments 03:11:42 ~ 03:16:42 2nd … # of comments 05:01:42 ~ 05:06:42] 24th 10/18/2021 14
  • 15. Dataset 2011~2014 Ten Main Media pages on Facebook Totally 42,703,463 10/18/2021 15
  • 16. Feature Engineering  # of comments, # of likes, # of shares  Spanning time (Last comment time – first comment time)  Temporal Feature with Delta Time window, with a final observation time  Context-free, don’t need to address Natural Language Processing 10/18/2021 16 Time Elapsed 1st Comments 1st Likes 1st Shares
  • 18. Discussion: Do you understand Facebook enough? 10/18/2021 18 • Attackers’ preference • Selected by Facebook • Audience reaction • Bandwagon Effect • Rich get Richer • Human loves biased and debating ones
  • 19. Life Cycle and Influence Ratio 10/18/2021 19 CNN 2012 all post threads >70% mURL
  • 20. DAV Predict IR (1/2) 10/18/2021 20
  • 21. DAV Predict IR (2/2) 10/18/2021 21
  • 22. Accounts Activity within a week around election date 10/18/2021 22 Active = Count(Activities) within 1 week >= threshold
  • 25. Conclusion Predict Suitable Targets successfully with temporal features • Attackers: Follow or not? • Defenders: Deploy resource Temporal Analysis with different variables • Influence Ratio, increase or decrease for next time window? • 24 hours pattern, link online and offline behavior 10/18/2021 25
  • 26. Outline  Suitable Target, Lifecycle Analysis  Multiple Accounts Detection  Geolocation Identification  Personal words 10/18/2021 26
  • 27. Semi-Supervised Learning on Graphs Motivation of detecting multiple accounts on FB Crawler 1 Crawler 2 Crawler 3 FaceBook API When Call FaceBook API: API will give each crawler a different scope ID. Thus it leads to same user with different scope ID in the dataset.
  • 28.
  • 29. 100003468896671 高婷婷 https://www.facebook.com/mayuko.sakamoto.503 100004123536871 賴婷婷 https://www.facebook.com/profile.php?id=100004123536871 100003251795795 陳婷婷 https://www.facebook.com/rika.etoh 100000681128139 高婷婷 https://www.facebook.com/vincenzo.muscari.5 100002630019886 陳婷婷 https://www.facebook.com/sven.erkens.98 813243492 高婷婷 https://www.facebook.com/profile.php?id=813243492 Ting-Ting’s Family
  • 30. Facebook 允許朋友數 100003468896671 高婷婷 45xx 100004123536871 賴婷婷 45xx 100003251795795 陳婷婷 4xxx 5000
  • 31. Multiple Accounts Detection using Semi-Supervised Learning on Graphs When crawling data from FB using multiple crawlers, it will give you a scope ID instead of giving you primary ID for each crawler. For example, a user’s primary ID is mohamed.aimane.98. He has multiple scope ID, they are 1815396745342476, 1815402648675219 , 1815411572007660, 1815468805335270 , 1815515615330589 ,1815482155333935 , 1815488781999939 , 1816157185266432. It implies mohamed’s data is crawled by 8 different crawlers. As the result, in our dataset we know their users names are all mohamed aimane, but there are a lot of ID with the same user name. Problem : Given 2 scope ID with the same user name. Are they the same user(same primary ID) or not? Motivation of detecting multiple accounts on FB
  • 32. Graph Construction U: {Users}, V:{Pages}, edge:{u,v} : u had an activity on page v Activities
  • 33. Main Algorithms Unsupervised learning using Katz Similarity Pxy(i) = (x,x1,x2,….y), length I u1, u2 are similar if their activity paths are similar Katz similarity can be computed by: Where M is the adjacency matrix of graph G. 𝛽 is a scalar smaller than 1/ 𝑀 2 to ensure convergence, and I is the identity matrix.
  • 34. Main Algorithms Unsupervised learning using Katz Similarity
  • 35. Katz matrix is 1 0.9 0.9 1 0.2 0.3 0.5 0.5 0.2 0.6 0.3 0.5 1 0.8 0.8 1 The threshold we use is 0.8 Then the 1st node and the 2nd node are belong to the same user, and the 3rd and 4th node are belongs to the same user, others are not. Example of Algorithm 1
  • 36. Main Algorithms Semi-Supervised Method using Graph Embedding
  • 37. Classical ML Tasks in Networks • Node Classification • Predict type of a node • Link Prediction • Predict friends • Community Detection • Network Similarity • Similar with two networks
  • 38. Node2vec(1/4) Many Possible ways: • PageRank score, Degree, centrality, # of edges…etc. Features
  • 39. Node2vec(2/4) Mixture of BFS and DFS BFS --- LocalView (u and S1) DFS --- GlobalView (u and S6)
  • 40. Node2vec(3/4) • Two Parameters: • Return parameter p: • Return back to the previous node • In-out parameter q: • Moving outwards (DFS) vs. inwards (BFS) • The ratio of BFS vs.DFS • Biased 2nd-order random walks explore network neighborhoods. Parameters
  • 41. Node2vec(4/4) • Simulate r random walks of length l starting from each node u • Optimize the node2vec objective using Stochastic Gradient Descent
  • 42. Embedding for node 1 : (0.1, 0.3, 0.2, 0.4), Embedding for node 2 : (0.2, 0.3, 0.2, 0.4) We sample some ground truth that : node 1 and node 2 are belongs to the same node, ect. L looks like :((1,2), 1) ((1,3), 0 ), ((2,3), 0) ((2,4), - 1) ((3,4), -1) ….. X is from embedding : for example, ((1,2), (0.1, 0, 0, 0 )) …. Then feed X and L into label spreading model, we will get, the 1st node and the 2nd node are belong to the same user, and the 3rd and 4th node are belongs to the same user, others are not. Example of Algorithm 3
  • 44. Experiments and Evaluation Comparison among the Three Methods Two simple datasets : dataset 1: 188 nodes and 262 activities (links); dataset 2: 4188 accounts and 6715 activities(links).
  • 45. Outline  Suitable Target, Lifecycle Analysis  Multiple Account Detection  Geolocation Identification  Personal words 10/18/2021 45
  • 46. Page Information and Page-like Graph 10/18/2021 Sport Illustrated Golden State Warriors Oakland Museum Giving Tuesday like like like Field Example Page ID 47657117525 Name Golden State Warriors Category Sports Team Country United States Fan Count 11,019,236 Description The Official Facebook page of the Golden State Warriors 46
  • 47. 10/18/2021 • Facebook public pages are public profiles used by local businesses, companies, organizations or public figures Likes Promoting other pages to community participants 47
  • 48. Data Collection Facebook Graph API version 2.8 used to collect our data [1] • 38,831,367 pages (for this work) • 2,430,873 US • 12,685,090 other countries • 23,715,404 unknown  [1] https://developers.facebook.com/docs/graph-api/reference/page 10/18/2021 48
  • 49. Majority Vote Algorithm 10/18/2021 • location designated as state information in this scenario • The location labeling is determined by the most votes • Overall accuracy is only 59.4% • This algorithm works well in page nationality prediction task, with 90.25% accuracy 49
  • 50. Baseline Algorithm Utilizes locality of states to find pages belonging to their corresponding states • Pick out anchored pages, with local property, as multiple seeds to start BFS from Target classifier: 51 classes • 50 classes of US states and a class of ”others (OT)” State Distance Vector (SDV) 10/18/2021 50
  • 51. Alabama Arkansas Arizona Wyoming …… P IHOP(P, S_Arizona) == 4 OHOP(P, S_Arizona) == 3 31M+ nodes, 600M+ edges 10/18/2021 Alaska 51
  • 52. Anchor Page Selection (1/2) 10/18/2021 Effectiveness of BFS-based algorithms • It depends on anchored page selection Anchored pages have to be local such that SDV can provide authentic tendency of a page’s locality Suitable examples (focusing on local communities) • state universities, government, park or police organizations Ill-suited examples (popular and thus having global impact) • NBA, MLB, or NFL sports teams 52
  • 53. Anchor Page Selection (2/2) • We adopt all subsidiary pages of "OnlyInYourState.com" as the set of anchored pages • It has a distinct page for each state • Each subsidiary page mostly connects local communities. Anchored pages (Page Name: Page ID):
      Only In Alabama: 783744898386760
      Only In Alaska: 686107314826906
      Only In Southern California: 1840349052857006
      Only In Northern California: 856450181102963
      Idaho Only: 435099846671531
      Only In New York: 386608421546055
      Only In Virginia: 1560515737540492
      Only In West Virginia: 1509709509286532
      Only In Wisconsin: 1390297064627420
      Only In Wyoming: 1724174364476381
  • 55. Advanced Algorithm. Baseline algorithm's drawback • A local page can have a few connections to pages far away • This kind of connection noise can sharply reduce prediction accuracy. State Neighborhood Probability (SNP): both SDV and SNP are taken as feature vectors for ML models • Utilize locality and neighborhood context for better identification
  • 56. Dataset: California accounts for 20% of all US pages, and about half of all pages (49.49%) are located in the top 5 states • California, New York, Florida, Illinois, and Texas
  • 57. Accuracy Summary (Classifier: Precision, Recall, F1 score):
      Naive Bayes (Baseline BFS): 0.44, 0.27, 0.26
      Adaboost (Baseline BFS): 0.46, 0.40, 0.37
      Random Forest (Baseline BFS): 0.69, 0.69, 0.68
      Random Forest (Advanced BFS): 0.89, 0.88, 0.88
  • 58. Outline: Suitable Target, Lifecycle Analysis; Multiple Account Detection; Geolocation Identification; Personal words
  • 59. Future Trends -- IT
  • 60. Thank you! Q & A
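
A minimal sketch of the majority vote idea from slide 49, assuming the votes come from a page's neighbors in the page-like graph that already carry a state label. The slides do not say who votes or how ties are broken, so the function name `majority_vote`, the toy graph, and the alphabetical tie-break are all hypothetical.

```python
from collections import Counter


def majority_vote(page, neighbors, known_state):
    """Return the most common state among a page's labeled neighbors.

    Returns None when no neighbor has a known state (assumption: such
    pages are simply left unlabeled).
    """
    votes = Counter(
        known_state[n] for n in neighbors.get(page, ()) if n in known_state
    )
    if not votes:
        return None
    top = max(votes.values())
    # Assumed tie-break: pick the alphabetically first state among the leaders.
    return min(state for state, count in votes.items() if count == top)


# Hypothetical toy data: page P has four neighbors, three with known states.
neighbors = {"P": ["A", "B", "C", "D"], "Q": ["D"]}
known_state = {"A": "CA", "B": "CA", "C": "NY"}

print(majority_vote("P", neighbors, known_state))
print(majority_vote("Q", neighbors, known_state))
```

A vote of this kind uses only direct neighbors, which is one plausible reason it does better on the coarse nationality task than on the 51-class state task.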

Editor's notes

  1. Top-down, Authoritative, vs. distributed, skim SFW – “Editor/Reporter” and “reader”
  2. Sometimes it’s hard to evaluate “spamming” New SFW – Likefarm? Is that ContentFarm?
  3. Every principle has its mind, reason, everything has its causality
  4. SFW – we need to have a better organized presentation for problems. SFW – the defenders concern might be different – we need to consider the risk factor
  5. Shelf life: skimmed messages can only “catch” one's eyes, enlarge the influence https://www.facebook.com/barackobama/posts/10151673679836749 https://www.facebook.com/cnn/posts/313652498762911 SFW – ask the audience “which post has a higher probability of being attacked”?
  6. SFW – watch out for the transition into this slide. SFW – do you want to provide one example for all or most of the slides? SFW – I feel that you should give an example to explain. SFW – Definition**s**
  7. SFW – how to interpret 10 minutes? (what is the total time and attack time)? Naive Bayes: DAV components are not independent of each other. Adaboost: not good with outliers; number of estimators = 50 and learning rate = 1. Decision Tree: good for social network data; we set minimum samples split = 2 and minimum samples leaf = 1; as for depth, nodes are expanded until all leaves are pure.
  8. 1. IR is learnable? 2. No difference between Light and Critical malicious URLs since their performance is quite similar 3. Increase recall result is high
  9. SFW – explain “Exact time after last attack”
  10. Why do you choose similarity? Fast
  11. Read the slide
  12. Our first thought is majority vote algorithm
  13. IHOP(Page, Si) denotes the hop distance between the page and seed Si, using inward edges as connections for BFS; OHOP(Page, Si) denotes the hop distance between the page and seed Si, using outward edges as connections for BFS.
  14. In particular, since California is much larger than other states in terms of population and economy, “OnlyInYourState.com” splits California into Northern and Southern regions, as shown in the table. Therefore, both “Only In Northern California” and “Only In Southern California” are used as anchored pages to calculate IHOP(Page, Si) and OHOP(Page, Si), in addition to the other forty-nine anchored pages. Hence N_{anchored pages} is set to 51. Furthermore, since “Only In Idaho” had already been registered, OnlyInYourState.com named its Idaho counterpart “Idaho Only” instead. In general, involving more anchored pages would enlarge the BFS coverage of pages.
  15. This probability is not high; however, the baseline BFS-based ML algorithm only cares about the hop distances to the anchored pages. INP(Page, Ri) denotes the inward neighborhood location probability between this page and the adjacent pages belonging to region Ri; IE(Page, Ri) is the number of inward edges between this page and the adjacent pages belonging to region Ri.
  16. We took the pages with declared country and city location information as ground truth data. A few pages are excluded because their city names exist in multiple states, which can result in ambiguous city-to-state mapping. There are 29,849 cities in total in the US.
  17. The training set used 80% of the data and the test set used the rest. Since the number of classes is rather large, the Random Forest classifier is preferred over the Gradient Boosting classifier [23]. The default parameter sets were applied when using the implementations available in the scikit-learn package [54]. As shown in Table 4.2, the precision, recall, and F1 score of the Random Forest classifier are at least 20 percentage points better than those of the Naive Bayes and Adaboost classifiers. Thus in the following, we only present results obtained with the Random Forest classifier. The baseline BFS-based ML algorithm with the Random Forest classifier achieved 69% accuracy, about 10 percentage points better than the accuracy of the majority vote algorithm. With the addition of SNP, the advanced BFS-based ML algorithm reached 89% prediction accuracy, a 20-point improvement over the baseline.
  18. SFW – what have been done? Whether you can justify some of your work is fundamental and not just incremental and applied? SFW – balance between contributions to CS versus Social Science
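
Notes 13 and 15 define the hop distances (IHOP, OHOP) and the inward neighborhood quantities (INP, IE) that feed the SDV and SNP feature vectors. The following is a minimal sketch under stated assumptions, not the authors' implementation: an edge u -> v means "u likes v", BFS starts at the seed, unreachable pages get the sentinel -1, and INP is taken as IE(Page, Ri) normalized by the page's total labeled inward edges. The notes do not give that normalization, and every function name and the toy graph are hypothetical.

```python
from collections import Counter, deque


def bfs_hops(seed, adjacency):
    """Hop distance from seed to every page reachable along adjacency."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def reverse(adjacency):
    """Flip every edge so that outward edges become inward edges."""
    rev = {}
    for u, targets in adjacency.items():
        for v in targets:
            rev.setdefault(v, []).append(u)
    return rev


def state_distance_vector(page, seeds, likes, unreachable=-1):
    """[IHOP, OHOP] per seed; -1 marks an unreachable page (assumption)."""
    liked_by = reverse(likes)
    sdv = []
    for seed in seeds:
        ihop = bfs_hops(seed, liked_by).get(page, unreachable)
        ohop = bfs_hops(seed, likes).get(page, unreachable)
        sdv.extend([ihop, ohop])
    return sdv


def inward_neighborhood_probability(page, likes, region_of, regions):
    """Assumed form: INP(Page, Ri) = IE(Page, Ri) / sum over j of IE(Page, Rj)."""
    inward = reverse(likes).get(page, ())
    ie = Counter(region_of[n] for n in inward if n in region_of)
    total = sum(ie.values())
    return [ie[r] / total if total else 0.0 for r in regions]


# Hypothetical toy graph: u -> v means "u likes v".
likes = {
    "seed_AZ": ["a"], "a": ["P"], "c": ["P"],
    "P": ["b"], "b": ["seed_AZ"], "seed_CA": ["P"],
}
region_of = {"a": "AZ", "c": "AZ", "seed_CA": "CA"}

print(state_distance_vector("P", ["seed_AZ", "seed_CA"], likes))
print(inward_neighborhood_probability("P", likes, region_of, ["AZ", "CA"]))
```

For clarity the sketch reruns BFS for every page; on a graph with tens of millions of nodes one would run BFS once per seed and direction and reuse the distance maps for all pages.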