Flume-based news aggregator service on Amazon EC2
Arinto Murdopo
Mário Almeida
Zafar Gilani

SDS, EMDC 2012
Outline
● Introduction
    ○ Cloudera Manager CDH3
    ○ Cloudera Flume
    ○ Hadoop Distributed File System
● Infrastructure setup
● Architecture
● News recommendation
● RSS News aggregator
● Proof of concept
● Issues faced
● Future work
● Conclusions
● References
Introduction
● A Flume-based independent news aggregator
  service.
● Using:
  ○   Amazon EC2 IaaS
  ○   Cloudera Manager CDH3
  ○   Cloudera Flume
  ○   Hadoop Distributed File System
Cloudera Manager CDH3
● Automates the installation and configuration
  process of CDH3 on an entire cluster.
● We used the Free Edition (up to 50 nodes).
Cloudera Flume
● A distributed, reliable and available system.
● To efficiently collect, aggregate and move
  large amounts of log data.
● From many different sources to a centralized
  or distributed data store (such as Hadoop
  HDFS).
Hadoop HDFS (1/2)
● For our purpose Hadoop handles:
  ○ Log receipt and storage.
  ○ Search and log processing.
● Coordinates work among a cluster of
  machines.
Hadoop HDFS (2/2)
Infrastructure setup
● 2 Agent nodes collecting data:
  ○ Source: RSS feed
  ○ Sink: Collector
● 1 Agent node (Collector):
  ○ Source: Agents
  ○ Sink: HDFS
● HDFS NameNode:
  ○ Coordinates the replication of data across DataNodes 1, 2 and 3.
● Cloudera Manager CDH3 node:
  ○ Manages all our nodes (Agents and HDFS nodes).
Architecture
News Recommendation
● We hosted a webpage in which people can
  recommend possible sources for news.
  ○ http://web.ist.utl.pt/~ist156947/sds/
● We compiled a large list of news websites
  and blogs from a variety of countries.
  ○ E.g. Spain, Libya, Russia, Syria, Iran...
RSS News aggregator
● We wrote a Java application to read RSS
  feeds using:
  ○ java.net.URL to open the resource that the feed
    URL points to.
  ○ javax.xml.parsers for XML parsing.
  ○ org.w3c.dom for the DOM interfaces used to process
    the parsed XML.
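A minimal sketch of such a reader, using only the three packages listed above. The class name RssSketch, the "title<TAB>link" output format and the secure-processing setting are assumptions made for illustration; this is not the authors' application, and it assumes a plain RSS 2.0 layout with <item>, <title> and <link> elements.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper, not the authors' code: lists the items of an RSS 2.0 feed. */
final class RssSketch {

    /** Returns one "title<TAB>link" line per <item> element found in the stream. */
    static List<String> readItems(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds are untrusted input, so ask the parser to apply its safety limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder().parse(in);

            NodeList items = doc.getElementsByTagName("item");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                lines.add(childText(item, "title") + "\t" + childText(item, "link"));
            }
            return lines;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse RSS feed", e);
        }
    }

    /** Text of the first descendant with the given tag, or "" when it is missing. */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RssSketch <feed-url>");
            return;
        }
        // java.net.URL opens the resource the feed URL points to.
        try (InputStream in = new URL(args[0]).openStream()) {
            readItems(in).forEach(System.out::println);
        }
    }
}
```

Printing one line per item keeps the reader independent of how an Agent picks the events up; real feeds vary in structure, so the tag names would likely need adapting per source.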
Proof of concept (1/3)
● Each Agent collects the RSS feeds and sends
  them to the Collector Agent.
Proof of concept (2/3)
● The Collector receives the events from both
  Agents and stores them in HDFS.
Proof of concept (3/3)
● Because the replication factor is 3 and we
  have 3 DataNodes, every DataNode holds a
  copy of every block and so ends up with the
  same amount of data.
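The arithmetic behind this observation can be sketched as follows. The class and method names are hypothetical, and the formula gives only the average per DataNode; actual HDFS block placement is not perfectly even.

```java
/** Hypothetical illustration of how replication spreads data over DataNodes. */
final class ReplicationSketch {

    /** Average bytes stored per DataNode when each block is kept `replication` times. */
    static double averagePerNode(long dataBytes, int replication, int dataNodes) {
        if (dataNodes <= 0 || replication <= 0) {
            throw new IllegalArgumentException("replication and dataNodes must be positive");
        }
        // A block cannot have more replicas than there are DataNodes to hold them.
        int effective = Math.min(replication, dataNodes);
        return (double) dataBytes * effective / dataNodes;
    }

    public static void main(String[] args) {
        long data = 600; // arbitrary units
        System.out.println("3 replicas on 3 nodes: " + averagePerNode(data, 3, 3));
        System.out.println("3 replicas on 6 nodes: " + averagePerNode(data, 3, 6));
    }
}
```

With 3 replicas on 3 DataNodes each node stores the full data set; with more DataNodes than replicas, each node would store only a fraction of it.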
Issues faced (1/4)
● The DataNode setting dfs.datanode.du.reserved
  is set by default to 10 GB.
  ○ This means that a DataNode with less than 10 GB of
    capacity has no space left for the file system, and
    writes fail. (Warning: Not able to place enough
    replicas)
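A simplified model of this effect is sketched below. It is an assumption for illustration, not HDFS source code: it treats the space usable by HDFS on a volume as capacity minus the reserved amount, clamped at zero. The names ReservedSpaceSketch and usableForHdfs are hypothetical.

```java
/** Simplified, hypothetical model of reserved space on one DataNode volume. */
final class ReservedSpaceSketch {
    static final long GB = 1024L * 1024L * 1024L;

    /** Bytes left for HDFS blocks: capacity minus the reserved amount, never negative. */
    static long usableForHdfs(long capacityBytes, long reservedBytes) {
        return Math.max(0L, capacityBytes - reservedBytes);
    }

    public static void main(String[] args) {
        long reserved = 10 * GB; // the 10 GB default reported on this slide
        for (long capacityGb : new long[] {8, 10, 30}) {
            long usable = usableForHdfs(capacityGb * GB, reserved);
            System.out.println(capacityGb + " GB volume -> " + usable / GB + " GB usable by HDFS");
        }
    }
}
```

Under this model an 8 GB or 10 GB volume leaves 0 GB for HDFS, while a 30 GB volume leaves 20 GB. On small EC2 volumes, the value of dfs.datanode.du.reserved (a byte count in hdfs-site.xml) would presumably have to be lowered.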
Issues faced (2/4)
● For Cloudera Manager to work, all nodes
  must run either SUSE or Red Hat.
● Cloudera Manager cannot run on an AWS
  EC2 micro instance.
● When an instance restarts, its IP address changes.
   ○ So Cloudera Manager loses track of the node.
● Cloudera Manager operates with private DNS
  names, so any references it makes point to
  these private names.
   ○ From our machines' web browsers, the web UIs are
     only accessible through public DNS names.
Issues faced (3/4)
● Some installation guides do not mention
  the ports that must be open to communicate
  with the services.
  ○ Cloudera provides a page with all the required ports.
● Creating folders and changing user
  permissions is not mentioned in the user
  guide.
  ○ We had to access Hadoop as user hdfs, create the
    flume folder and change its owner to flume with
    the chown command. Otherwise writes failed with
    an AccessControlException.
Issues faced (4/4)
● Although scaling through the addition of new
  Agents is easy, it requires fine-tuning of the
  channel capacity (number of events) and
  transaction size for each Agent.
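One way to catch inconsistent settings before starting an Agent is sketched below. It assumes a Flume 1.x memory channel, whose `capacity` and `transactionCapacity` parameters must satisfy transactionCapacity <= capacity, and a source or sink batch size that must fit into one transaction. The class ChannelTuningSketch and its messages are hypothetical; the check is not part of Flume.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sanity check for channel sizing values; not part of Flume. */
final class ChannelTuningSketch {

    /** Returns a list of problems; an empty list means the three values are consistent. */
    static List<String> check(int capacity, int transactionCapacity, int batchSize) {
        List<String> problems = new ArrayList<>();
        if (capacity <= 0 || transactionCapacity <= 0 || batchSize <= 0) {
            problems.add("all values must be positive");
        }
        // A single transaction cannot hold more events than the whole channel.
        if (transactionCapacity > capacity) {
            problems.add("transactionCapacity exceeds capacity");
        }
        // A batch is moved in one transaction, so it has to fit into it.
        if (batchSize > transactionCapacity) {
            problems.add("batchSize exceeds transactionCapacity");
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(check(1000, 100, 100)); // consistent values
        System.out.println(check(1000, 100, 500)); // batch too large for one transaction
    }
}
```

Such a check only rules out invalid combinations; finding values that keep up with the event rate of each Agent still requires measurement.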
Future work
●   Expand RSS sources.
●   Implement a web UI.
●   Provide search services on the HDFS.
●   Improve the HDFS load balancing.
Conclusions (1/3)
● The default HDFS configuration parameters
  are not suitable for deployment on AWS EC2.
● Cloudera Manager makes the installation and
  configuration process much easier!
  ○ But it also introduces a few constraints that might
    result in higher operating costs (e.g. it cannot run
    on a micro instance).
● Adapting the RSS reader of the Agents is not
  trivial!
  ○ Different RSS sources have different content (e.g.
    posts with ad banners).
Conclusions (2/3)
● In our experience, Amazon EC2 is easier to
  use and more reliable than other platforms
  we have worked with!
  ○ E.g. PlanetLab.
● Flume's architecture, based on streaming
  data flows, makes it easy to add new sources
  and sinks.
  ○ The service can scale by adding new Agents.
● Flume is horizontally scalable!
  ○ Its throughput grows with the number of machines
    on which it is deployed.
Conclusions (3/3)
● Fine-tuning Flume's configuration files is
  not trivial!
● HDFS NameNode is no longer a single point
  of failure!
  ○ since NameNode replication was introduced. Adding
    passive NameNodes affects the overall performance
    of the HDFS cluster though.
References (1/2)
● Cloudera Flume 1.x installation
  ○ https://ccp.cloudera.com/display/CDHDOC/Flume+1.x+Installation
● Cloudera Manager CDH3
  ○ https://ccp.cloudera.com/display/FREE374/Cloudera+Manager+Free+Edition+Installation+Guide
● Cloudera port information
  ○ https://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3
● Cloudera Flume User Guide
  ○ http://archive.cloudera.com/cdh4/cdh/4/flume-
References (2/2)
● Find more detailed information on our setup
  and configuration on our personal blogs:
  ○ http://www.aknahs.pt/
  ○ http://www.otnira.com/
  ○ http://115.186.131.91/~zafar/
Easter Egg: Issues faced
●   One Muslim team member declared love to a female
    Cloudera member and ended up having to marry her
    during the project.
    ○ Turns out it was a male.
●   One member became angry because another team was
    using demos in their project and ended up cutting a
    poor Rastafarian's hair off.
    ○ Turns out that screenshots are better than demos.
●   One member managed to get sunburned while doing
    the project. Before this it was thought that computer
    scientists would only work in caves.
    ○ Turns out that he just took a very hot shower.
Special Thanks
●   Leandro Navarro - UPC
●   Amazon
●   jarcec - #flume on irc.freenode.net
●   mids - #cloudera on irc.freenode.net
    (@mids106)

Hanging out in IRC is useful!
  • 4. Cloudera Manager CDH3 ● Automates the installation and configuration process of CDH3 on an entire cluster. ● We used the free edition (up to 50 nodes).
  • 5. Cloudera Flume ● A distributed, reliable and available system. ● To efficiently collect, aggregate and move large amounts of log data. ● From many different sources to a centralized or distributed data store (such as Hadoop HDFS).
  • 6. Hadoop HDFS (1/2) ● For our purpose Hadoop handles: ○ Log receipt and storage. ○ Search and log processing. ● Coordinates work among a cluster of machines.
  • 8. Infrastructure setup ● 2 Agent nodes collecting data: ○ Source: RSS feed ○ Sink: Collector ● 1 Agent node (Collector): ○ Source: Agents ○ Sink: HDFS ● HDFS NameNode: ○ Replicates data to DataNodes 1, 2 and 3. ● Cloudera Manager CDH3 node: ○ Managing all our nodes (Agents and HDFS nodes).
  • 10. News Recommendation ● We hosted a webpage in which people can recommend possible sources for news. ○ http://web.ist.utl.pt/~ist156947/sds/ ● Retrieved a big compilation of news websites and blogs from a reasonable variety of countries ○ E.g. Spain, Libya, Russia, Syria, Iran...
  • 11. RSS News aggregator ● We wrote a Java application to read RSS feeds using: ○ java.net.URL to handle the resource pointed-to by the URL. ○ javax.xml.parsers for XML parsing. ○ org.w3c.dom provides interfaces for DOM to process XML.
  • 12. Proof of concept (1/3) ● Each Agent collects the RSS feeds and sends them to the Collector Agent.
  • 13. Proof of concept (2/3) ● The Collector receives the events from both Agents and stores them in HDFS.
  • 14. Proof of concept (3/3) ● Because the replication factor is 3 and there are three DataNodes, every block is stored on every DataNode, so each DataNode ends up with the same amount of data.
  • 15. Issues faced (1/4) ● The DataNode setting dfs.datanode.du.reserved is set by default to 10 GB. ○ This means that if a DataNode has less than 10 GB of capacity, no space remains available for the file system. (Warning: Not able to place enough replicas)
  • 16. Issues faced (2/4) ● In order for CDH Manager to work, all nodes must run either SUSE or Red Hat. ● The CDH Manager cannot run on an AWS EC2 micro instance. ● Upon instance restart, its IP changes. ○ So the CDH Manager loses track of the node. ● CDH Manager operates with private DNS names, so any references it makes point to these private names. ○ Web UIs are only accessible from our machines' web browsers through public DNS names.
  • 17. Issues faced (3/4) ● Some installation guides fail to mention the ports that must be open to communicate with the services. ○ Cloudera provides a page with all the required ports. ● Creating folders and changing user permissions are not mentioned in the user guide. ○ We needed to access Hadoop with username hdfs, create the flume folder, and change its owner to flume using the chown command. (AccessControlException)
  • 18. Issues faced (4/4) ● Although scaling through the addition of new Agents is easy, it requires fine-tuning of the channel capacity (number of events) and transaction size for each Agent.
  • 19. Future work ● Expand RSS sources. ● Implement a web UI. ● Provide search services on the HDFS. ● Improve the HDFS load balancing.
  • 20. Conclusions (1/3) ● HDFS default configuration parameters are not suitable for deploying it on AWS EC2. ● Cloudera Manager makes the installation and configuration process much easier! ○ but it also introduces a few constraints that might result in higher operating costs. ● Adapting the RSS reader of the Agents is not trivial! ○ different RSS sources have different contents (e.g. posts with ad banners).
  • 21. Conclusions (2/3) ● Amazon EC2 service is easier to use and more reliable than other cloud providers! ○ E.g. PlanetLab. ● Flume's architecture based on streaming data flows makes it easier to add new sources and sinks. ○ the service can scale by adding new Agents. ● Flume is horizontally scalable! ○ because its performance is proportional to the number of machines on which it is deployed.
  • 22. Conclusions (3/3) ● Fine-tuning Flume's configuration files is not trivial! ● HDFS NameNode is no longer a single point of failure! ○ since NameNode replication was introduced. Adding passive NameNodes affects the overall performance of the HDFS cluster though.
  • 23. References (1/2) ● Cloudera Flume 1.x installation ○ https://ccp.cloudera.com/display/CDHDOC/Flume+1.x+Installation ● Cloudera Manager CDH3 ○ https://ccp.cloudera.com/display/FREE374/Cloudera+Manager+Free+Edition+Installation+Guide ● Cloudera port information ○ https://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3 ● Cloudera Flume User Guide ○ http://archive.cloudera.com/cdh4/cdh/4/flume-
  • 24. References (2/2) ● Find more detailed information on our setup and configuration on our personal blogs: ○ http://www.aknahs.pt/ ○ http://www.otnira.com/ ○ http://115.186.131.91/~zafar/
  • 25. Easter Egg: Issues faced ● One Islamic team member declared love to a female Cloudera member and ended up having to marry her during the project. ○ Turns out it was a male. ● One member became angry because another team was using demos in their project and ended up cutting a poor Rastafarian's hair off. ○ Turns out that screenshots are better than demos. ● One member managed to get sunburned while doing the project. Before this it was thought that computer scientists would only work in caves. ○ Turns out that he just took a very hot shower.
  • 26. Special Thanks ● Leandro Navarro - UPC ● Amazon ● jarcec - #flume on irc.freenode.net ● mids - #cloudera on irc.freenode.net (@mids106) Hanging out in IRC is useful!
  • 27. News aggregator service on Amazon EC2 Arinto Murdopo Mário Almeida Zafar Gilani SDS, EMDC 2012
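
Slide 11 names the three Java packages the RSS reader relies on: java.net.URL, javax.xml.parsers and org.w3c.dom. The following is a minimal sketch of how such a reader could be put together, not the authors' actual application. The class name RssReaderSketch, its method names and the sample feed are hypothetical, and the sketch assumes RSS 2.0 feeds whose entries are <item> elements with <title> and <link> children.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch; names and structure are assumptions, not the authors' code.
public class RssReaderSketch {

    /** Parses an RSS document and returns "title | link" for every <item>. */
    static List<String> readItems(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(in);

        List<String> lines = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            lines.add(childText(item, "title") + " | " + childText(item, "link"));
        }
        return lines;
    }

    /** Trimmed text of the first descendant element named tag, or "" if there is none. */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Convenience wrapper that parses a feed held in memory. */
    static List<String> readItemsFromString(String xml) {
        try {
            return readItems(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    /** Fetches and parses a feed over the network using java.net.URL. */
    static List<String> readFeed(String feedUrl) throws Exception {
        try (InputStream in = new URL(feedUrl).openStream()) {
            return readItems(in);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Pass a feed URL on the command line to read a live feed.
            readFeed(args[0]).forEach(System.out::println);
        } else {
            // Made-up sample feed so the sketch runs without network access.
            String sample = "<rss version=\"2.0\"><channel><title>Sample</title>"
                    + "<item><title>Hello</title><link>http://example.com/hello</link></item>"
                    + "</channel></rss>";
            readItemsFromString(sample).forEach(System.out::println);
        }
    }
}
```

In a setup like the one on slide 8, each printed line would be one unit of data that an Agent forwards to the Collector. As slide 20 notes, real feeds differ in content (for example, posts with ad banners), so a production reader would need per-source cleanup that this sketch leaves out.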
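
The reserved-space issue on slide 15 comes down to simple arithmetic, which can be illustrated with a small sketch. It uses a deliberately simplified model in which the space HDFS may use on a volume is the volume capacity minus dfs.datanode.du.reserved, floored at zero. The class name, the method name and the 8 GiB and 30 GiB volume sizes are hypothetical values chosen for illustration; only the 10 GB default comes from the slide.

```java
// Hypothetical sketch of the arithmetic behind dfs.datanode.du.reserved.
public class ReservedSpaceSketch {

    static final long GIB = 1024L * 1024 * 1024;

    /**
     * Simplified model: bytes HDFS may use on a volume are the volume capacity
     * minus the reserved bytes, and never less than zero.
     */
    static long usableBytes(long volumeCapacityBytes, long reservedBytes) {
        return Math.max(0L, volumeCapacityBytes - reservedBytes);
    }

    public static void main(String[] args) {
        long reserved = 10 * GIB; // default value reported on slide 15

        // Assumed small volume: everything is reserved, nothing is left for HDFS.
        System.out.println(usableBytes(8 * GIB, reserved) / GIB);  // prints 0

        // Assumed larger volume: 20 GiB remain for block storage.
        System.out.println(usableBytes(30 * GIB, reserved) / GIB); // prints 20
    }
}
```

With zero usable bytes on every DataNode, the NameNode has nowhere to place replicas, which matches the "Not able to place enough replicas" warning quoted on the slide. The property is specified in bytes per volume in hdfs-site.xml, so on small instances it would have to be set well below the volume size.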