Cloud Deployments with Apache Hadoop and Apache HBase

8/23/11 NoSqlNow! 2011 Conference
Jonathan Hsieh
jon@cloudera.com
@jmhsieh
Who Am I?

• Cloudera:
   • Software Engineer on the Platform Team
   • Apache HBase contributor
   • Apache Flume (incubating) founder/committer
   • Apache Sqoop (incubating) committer
• U of Washington:
   • Research in Distributed Systems and Programming Languages
Who is Cloudera?

Cloudera, the leader in Apache Hadoop‐based software and services, enables enterprises to easily derive business value from all their data.
“Every two days we create as much information as we did from the dawn of civilization up until 2003.”
                              Eric Schmidt

“I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.”
                              Hal Varian (Google’s chief economist)
Outline

•   Motivation 
•   Enter Apache Hadoop 
•   Enter Apache HBase
•   Real‐World Applications
•   System Architecture
•   Deployment (in the Cloud)
•   Conclusions



What is Apache HBase?

Apache HBase is an open source, horizontally scalable, sorted map data store built on top of Apache Hadoop.
What is Apache Hadoop?

Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers.
Open Source

• Apache 2.0 License
• A community project with committers and contributors from diverse organizations.
   • Cloudera, Facebook, Yahoo!, eBay, StumbleUpon, Trend Micro, …
• The code license means anyone can modify and use the code.
Horizontally Scalable

[Chart: Performance/Storage/Throughput (IOPs) increasing linearly with # of servers]

• Store and access data on 1‐1000’s of commodity servers.
• Adding more servers should linearly increase performance and capacity.
   • Storage capacity
   • Processing capacity
   • Input/output operations
Commodity Servers (circa 2010)

• 2 quad‐core CPUs, running at least 2‐2.5GHz
• 16‐24GB of RAM (24‐32GB if you’re considering HBase)
• 4x 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
• Gigabit Ethernet
• $5k‐10k / machine
Let’s deploy some machines.
(in the cloud!)
We’ll use Apache Whirr

Apache Whirr is a set of tools and libraries for deploying clusters on cloud services in a cloud‐neutral way.
Ready?

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.5.0-incubating/whirr-0.5.0-incubating.tar.gz
tar zxf whirr-0.5.0-incubating.tar.gz
cd whirr-0.5.0-incubating
bin/whirr launch-cluster --config recipes/hbase-ec2.properties
Done. That’s it.

You can go home now.

Ok, ok, we’ll come back to this later.
“The future is already here — it’s just not very evenly distributed.”
                                  William Gibson
Building a web index

• Download all of the web.
• Store all of it.
• Analyze all of it to build the index and rankings.
• Repeat.
Size of Google’s Web Index

• Let’s assume 50k per webpage, 500 bytes per URL.
• According to Google*:
   • 1998: 26Mn indexed pages
      • 1.3TB
   • 2000: 1Bn indexed pages
      • 500 TB
   • 2008: ~40Bn indexed pages
      • 20,000 TB
   • 2008: 1 Tn URLs
      • ~500 TB just in URL names!

[Chart: Estimated size of Google’s web storage in TB, log scale, by year]

* http://googleblog.blogspot.com/2008/07/we‐knew‐web‐was‐big.html
Volume, Variety, and Velocity

• The web has a massive amount of data.
   • How do we store this massive volume?

• Raw web data is diverse, dirty, and semi‐structured.
   • How do we deal with all the compute necessary to clean and process this variety of data?

• There is new content being created all the time!
   • How do we keep up with the velocity of new data coming in?
Telltale signs you might need “noSQL”.
Did you try scaling vertically?

• Upgrading to a beefier machine can be quick.
   • (upgrade that m1.large to a m2.4xlarge)
• This is probably a good idea.
• Not quite time for HBase.

• What if this isn’t enough?
Changed your schema and queries?

• Remove text search queries (LIKE).
   • These are expensive.
• Remove joins.
   • Normalization is more expensive today.
   • Multiple seeks are more expensive than sequential reads/writes.
• Remove foreign keys and encode your own relations.
   • Avoids constraint checks.
• Just put all parts of a query in a single table.
• Lots of full table scans?
   • Good time for Hadoop.
• This might be a good time to consider HBase.
Need to scale reads?

• Use DB replication to make more copies to read from.
• Use memcached.

• Assumes an 80/20 read‐to‐write ratio; this works reasonably well if you can tolerate replication lag.
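The replication-plus-memcached pattern above is usually implemented as cache-aside reads. A minimal sketch, with plain dicts standing in for a read replica and for memcached (all names here are illustrative, not from the slides):

```python
# Cache-aside reads: check the cache first, fall back to the (replicated)
# database on a miss, then populate the cache for subsequent reads.
db = {"user:1": "alice", "user:2": "bob"}  # stands in for a read replica
cache = {}                                 # stands in for memcached

def cached_get(key):
    if key in cache:
        return cache[key]     # cache hit: no database round trip
    value = db.get(key)       # cache miss: read from a replica
    if value is not None:
        cache[key] = value    # next read for this key is a hit
    return value

print(cached_get("user:1"))   # first read misses and hits the replica
print(cached_get("user:1"))   # second read is served from the cache
```

Replication lag shows up here as stale values in the replica; the cache can only be as fresh as the replica it reads from.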
Need to scale writes?

• Unfortunately, eventually you may need more writes.
• Let’s shard and federate the DB.
   • Loses consistency, order of operations.
   • Replication has diminishing returns with more writes.
• HA operational complexity!
   • Gah!

[Chart: Performance (IOPs) flattening out as # of servers grows]

• This is definitely a good time to consider HBase.
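Sharding as described above can be sketched as hash partitioning: each key is owned by exactly one shard, which spreads write load but gives up cross-shard ordering. A toy sketch in which dicts stand in for independent databases (names are illustrative):

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one DB

def shard_for(key):
    # Stable hash so a given key always routes to the same shard.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value  # a write touches exactly one shard

def get(key):
    return shards[shard_for(key)].get(key)

put("user:1", "alice")
put("user:2", "bob")
```

Consistent queries across shards (joins, global ordering) now require application-level work, which is exactly the operational complexity the slide warns about.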
Wait – we “optimized the DB” by discarding some fundamental SQL/relational database features?
Yup.




Assumption #1: Your workload fits on one machine…




                                                     Image: Matthew J. Stinson CC-BY-NC
Massive amounts of storage

• How much data could you collect today?
   • Many companies easily collect 200GB of logs per day.
   • Facebook claims to collect >15 TB per day.

• How do you handle this problem?
   • Just keep a few days’ worth of data and then /dev/null.
   • Sample your data.
   • Move data to write‐only media (tape).

• If you want to analyze all your data, you are going to need to use multiple machines.
Assumption #2:  Machines deserve identities...




Image: Laughing Squid CC BY-NC-SA
Interact with a cluster, not a bunch of machines.

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]
Assumption #3: Machines can be reliable…




Image: MadMan the Mighty CC BY-NC-SA
Disk failures are very common

• USENIX FAST ‘07:
• tl;dr: 2‐8% of new disk drives fail per year.
• For a 100 node cluster with 1200 drives:
   • a drive failure every 15‐60 days
Outline

•   Motivation
•   Enter Apache Hadoop
•   Enter Apache HBase
•   Real‐World Applications
•   System Architecture
•   Cluster Deployment
•   Cloud Deployment
•   Demo
•   Conclusions
What is Apache Hadoop?

Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers.
Goal: Separate complicated distributed fault‐tolerance code from application code

[Diagram: Systems Programmers | Unicorns | Statisticians]
What did Google do?

• SOSP 2003: the Google File System paper

• OSDI 2004: the MapReduce paper
Origin of Apache Hadoop

• 2002: Open source web crawler project created by Doug Cutting
• 2003‐04: Google publishes GFS and MapReduce papers
• 2006: Open source MapReduce & HDFS project created by Doug Cutting
• 2008: Hadoop wins Terabyte sort benchmark
• 2009: Runs 4,000 node Hadoop cluster; SQL support for Hadoop launched
• 2010: Releases CDH3 and Cloudera Enterprise
Apache Hadoop in production today

• Yahoo! Hadoop clusters: >82PB, >40k machines (as of Jun ‘11)

• Facebook: 15TB new data per day; 1200+ machines, 30PB in one cluster
• Twitter: >1TB per day, ~120 nodes

• Lots of 5‐40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research, government)
Case studies: Hadoop World ‘10

• eBay: Hadoop at eBay
• Twitter: The Hadoop Ecosystem at Twitter
• Yale University: MapReduce and Parallel Database Systems
• General Electric: Sentiment Analysis powered by Hadoop
• Facebook: HBase in Production
• Bank of America: The Business of Big Data
• AOL: AOL’s Data Layer
• Raytheon: SHARD: Storing and Querying Large‐Scale Data
• StumbleUpon: Mixing Real‐Time and Batch Processing

More info at http://www.cloudera.com/company/press‐center/hadoop‐world‐nyc/
What is HDFS?

• HDFS is a file system.
   • Just stores (a lot of) bytes as files.
   • Distributes storage across many machines and many disks.
   • Reliable by replicating blocks across the machines throughout the cluster.
   • Horizontally scalable ‐‐ just add more machines and disks.
   • Optimized for large sequential writes.
• Features
   • Unix‐style permissions.
   • Kerberos‐based authentication.
HDFS’s File API

• Dir
   • List files
   • Remove files
   • Copy files
   • Put / Get files
• File
   • Open
   • Close
   • Read
   • Write/Append
Ideal use cases

• Great for storage of massive amounts of raw or uncooked data
   • Massive data files
   • All of your logs
Massive scale while tolerating machine failures
Hadoop gives you agility

• Schema on write
   • Traditional DBs require cleaning and applying a schema.
   • Great if you can plan your schema well in advance.

• Schema on read
   • HDFS enables you to store all of your raw data.
   • Great if you have ad hoc queries on ad hoc data.
   • If you don’t know your schema, you can try new ones.
   • Great if you are exploring schemas and transforming data.
Analyzing data with MapReduce

• Apache Hadoop MapReduce
   • Simplified distributed programming model.
   • Just specify a “map” and a “reduce” function.

• MapReduce is a batch processing system.
   • Optimized for throughput, not latency.
   • The fastest MR job takes 15+ seconds.

• You are not going to use this to directly serve data for your next web site.
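The map/reduce contract described above can be sketched in pure Python. This only illustrates the programming model (no Hadoop, no distribution); the point is that the user writes just the two functions and the framework does the grouping:

```python
# MapReduce in miniature: the user supplies a map function
# (record -> (key, value) pairs) and a reduce function
# (key, values -> result); the framework handles the grouping ("shuffle").
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in map_fn(record):
            groups[key].append(value)      # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

result = run_mapreduce(["the cat", "the hat"], map_fn, reduce_fn)
# result == {"cat": 1, "hat": 1, "the": 2}
```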
Don’t like programming Java?

• Apache Pig
   • Higher‐level dataflow language
   • Good for data transformations
   • Generally preferred by programmers

• Apache Hive
   • SQL‐like language for querying data in HDFS
   • Generally preferred by data scientists and business analysts
There is a catch…

• Files are append‐only.
   • No update or random writes.
   • Data not available for read until the file is flushed.
• Files are ideally large.
   • Enables storage of 10’s of petabytes of data.
   • HDFS splits the file into 64MB or 128MB blocks!
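The block sizes mentioned above are part of why HDFS favors large files: every file occupies at least one block entry. A quick sketch of the arithmetic, using the block sizes from the slide:

```python
# HDFS stores a file as fixed-size blocks (64MB or 128MB in the slide).
BLOCK_SIZE = 64 * 1024 * 1024

def num_blocks(file_size, block_size=BLOCK_SIZE):
    # Ceiling division: a partial final block still needs its own block entry.
    return -(-file_size // block_size)

print(num_blocks(1024 ** 3))   # a 1GB file -> 16 blocks of 64MB
print(num_blocks(10 * 1024))   # a 10KB file still occupies 1 block entry
```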
Outline

•   Motivation 
•   Enter Apache Hadoop 
•   Enter Apache HBase
•   Real‐World Applications
•   System Architecture
•   Deployment (in the Cloud)
•   Conclusions



What is Apache HBase?

Apache HBase is an open source, horizontally scalable, sorted map data store built on top of Apache Hadoop.
Inspiration: Google BigTable

• OSDI 2006 paper

• Goal: Quick random read/write access to massive amounts of structured data.
   • It was the data store for Google’s crawler web table, Orkut, Analytics, Earth, Blogger, …
Sorted Map Datastore

[Diagram: a table of rows keyed 0000000000 … 7777777777]

• It really is just a Big Table!
• Tables consist of rows, each of which has a primary key (row key).
• Each row may have any number of columns.
• Rows are stored in sorted order.
Anatomy of a Row

• Each row has a row key (think primary key)
   • Lexicographically sorted byte[]
   • Timestamp associated for keeping multiple versions of data

• A row is made up of columns.
• Each column has a cell
   • Contents of a cell are an untyped byte[].
   • Apps must “know” types and handle them.
• Columns are logically like a Map<byte[] column, byte[] value>

• Row edits are atomic and changes are strongly consistent (replicas are in sync)
Sorted Map Datastore (logical view)

Row key is the implicit PRIMARY KEY in RDBMS terms. Data is all stored as byte[].

Row key    Data (column : value)
cutting    ‘info:height’: ‘9ft’, ‘info:state’: ‘CA’,
           ‘roles:ASF’: ‘Director’, ‘roles:Hadoop’: ‘Founder’
tlipcon    ‘info:height’: ‘5ft7’, ‘info:state’: ‘CA’,
           ‘roles:Hadoop’: ‘Committer’@ts=2010,
           ‘roles:Hadoop’: ‘PMC’@ts=2011,
           ‘roles:Hive’: ‘Contributor’

Different rows may have different sets of columns (the table is sparse); useful for *‐to‐many mappings. A single cell might have different values at different timestamps.
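The logical view above (sorted row keys, columns mapping byte[] to timestamped byte[] cells) can be sketched with nested Python dicts. This is a conceptual model only, not how HBase actually stores data:

```python
# table -> rows (kept sorted by key) -> {column: {timestamp: value}}.
# Everything is bytes; the application interprets the types itself.
table = {}

def put_cell(row_key, column, value, ts):
    table.setdefault(row_key, {}).setdefault(column, {})[ts] = value

def get_cell(row_key, column):
    versions = table.get(row_key, {}).get(column, {})
    return versions[max(versions)] if versions else None  # newest wins

put_cell(b"tlipcon", b"roles:Hadoop", b"Committer", 2010)
put_cell(b"tlipcon", b"roles:Hadoop", b"PMC", 2011)
put_cell(b"cutting", b"info:height", b"9ft", 2010)

rows_in_order = sorted(table)  # a scan returns [b"cutting", b"tlipcon"]
```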
Apache HBase Depends upon HDFS

• Relies on HDFS for data durability and reliability.

• Uses HDFS to store its Write‐Ahead Log (WAL).
   • Needs flush/sync support in HDFS in order to prevent data loss problems.
HBase in Numbers

•   Largest cluster: ~1000 nodes, ~1PB
•   Most clusters: 5‐20 nodes, 100GB‐4TB
•   Writes: 1‐3ms, 1k‐10k writes/sec per node
•   Reads: 0‐3ms cached, 10‐30ms disk
    • 10‐40k reads / second / node from cache
• Cell size: 0‐3MB preferred




Access data via an API. There is “noSQL”*

• HBase API
   • get(row)
   • put(row, Map<column, value>)
   • scan(key range, filter)
   • increment(row, columns)
   • … (checkAndPut, delete, etc…)

• *Ok, that’s a slight lie.
   • There is work on integrating Apache Hive, a SQL‐like query language, with HBase.
   • This is not optimal; 5x slower than normal Hive+HDFS.
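Of the calls above, scan is the one that exploits the sorted layout: a key range is a contiguous slice of rows rather than N point lookups. A self-contained sketch of that semantics (the keys are illustrative):

```python
import bisect

# Row keys are kept in sorted order, so a range scan is a sequential walk.
row_keys = [b"apple", b"banana", b"cherry", b"date"]  # already sorted

def scan(start_key, stop_key):
    # bisect finds the slice boundaries in O(log n);
    # the scan itself streams the rows in key order.
    lo = bisect.bisect_left(row_keys, start_key)
    hi = bisect.bisect_left(row_keys, stop_key)
    return row_keys[lo:hi]

print(scan(b"b", b"d"))  # [b'banana', b'cherry']
```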
Cost Transparency

• Goal: Want predictable latency of random read and write operations.
   • To do this, you have to understand some of the physical layout of your datastore.
   • Efficiencies are based on locality.

• A few physical concepts to help:
   • Column Families
   • Regions
Column Families

[Diagram: one table of rows split into two per‐family stores]

• Just a set of related columns.
• Each may have different columns and access patterns.
• Each may have parameters set per column family:
   • Block compression (none, gzip, LZO, Snappy)
   • Version retention policies
   • Cache priority
• Improves read performance.
   • CFs stored separately: access one without wasting IO on the other.
   • Store related data together for better compression.
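The per-family separation described above can be sketched as one store per column family, so that a read against one family never touches the other family's data. The structures and names here are illustrative, not HBase's actual on-disk layout:

```python
# Column families partition a row's columns into separately stored groups,
# so reading "info" columns never touches "roles" data.
stores = {"info": {}, "roles": {}}  # one store per column family

def put(row, family, qualifier, value):
    stores[family].setdefault(row, {})[qualifier] = value

def get_family(row, family):
    # Only this family's store is consulted; no IO against the others.
    return stores[family].get(row, {})

put("cutting", "info", "height", "9ft")
put("cutting", "roles", "ASF", "Director")

print(get_family("cutting", "info"))  # {'height': '9ft'}
```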
Sparse Columns

[Diagram: rows with different, sparse column sets]

• Provides schema flexibility
   • Add columns later, no need to transform entire schema.
   • Use for writing aggregates atomically (“prejoins”)
• Improves performance
   • Null columns don’t take space. You don’t need to read what is not there.
   • If you have a traditional DB table with lots of nulls, your data will probably fit well.
Regions

• Tables are divided into sets of rows called regions.
• Read and write loads are scaled by spreading across
  many regions.
[diagram: one table of rows 0–7 split into multiple regions]
                  8/23/11  NoSQLNow! 2011 Conference                66
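Region routing can be sketched as a lookup over sorted region start keys: a row belongs to the region with the greatest start key not exceeding the row key. The region boundaries and server names below are hypothetical.

```python
# Illustrative sketch (not HBase code): a table split into regions by
# row-key ranges. A row is served by the region whose start key is the
# greatest start key <= the row key, so load spreads across servers.
import bisect

region_start_keys = ["", "3", "6"]          # regions ["", "3"), ["3", "6"), ["6", +inf)
region_servers    = ["rs1", "rs2", "rs3"]   # hypothetical server names

def server_for(row_key):
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[idx]
```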
Sorted Map Datastore (physical view)
                                  info Column Family
                Row key   Column key            Timestamp        Cell value
                cutting   info:height           1273516197868    9ft
                cutting   info:state            1043871824184    CA
                tlipcon   info:height           1273878447049    5ft7
                tlipcon   info:state            1273616297446    CA
                                 roles Column Family
                Row key   Column key            Timestamp        Cell value
                cutting   roles:ASF             1273871823022    Director
     Sorted     cutting   roles:Hadoop          1183746289103    Founder
  on disk by    tlipcon   roles:Hadoop          1300062064923    PMC
Row key, Col    tlipcon   roles:Hadoop          1293388212294    Committer
        key,    tlipcon   roles:Hive            1273616297446    Contributor
 descending
 timestamp
                                       Milliseconds since unix epoch
                            8/23/11  NoSQLNow! 2011 Conference                 67
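The physical order above — ascending by row key and column key, descending by timestamp — can be reproduced with one sort. A minimal sketch using a few cells from the table:

```python
# Illustrative sketch: sort cells the way the table above is laid out
# on disk -- ascending (row key, column key), descending timestamp, so
# the newest cell of each (row, column) pair comes first.
cells = [
    ("tlipcon", "roles:Hadoop", 1293388212294, "Committer"),
    ("tlipcon", "roles:Hadoop", 1300062064923, "PMC"),
    ("cutting", "roles:Hadoop", 1183746289103, "Founder"),
]
cells.sort(key=lambda c: (c[0], c[1], -c[2]))
# cutting sorts before tlipcon; for tlipcon's two roles:Hadoop cells,
# the newer (PMC) now precedes the older (Committer).
```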
HBase purposely doesn’t have everything

• No atomic multi‐row operations

• No global time ordering

• No built‐in SQL query language

• No query optimizer



                8/23/11  NoSQLNow! 2011 Conference   68
HBase vs just HDFS
                                    Plain HDFS/MR                        HBase
Write pattern               Append‐only                         Random write, bulk
                                                                incremental
Read pattern                Full table scan, partition table    Random read, small range
                            scan                                scan, or table scan
Hive (SQL) performance      Very good                           4‐5x slower

Structured storage          Do‐it‐yourself / TSV /              Sparse column‐family data
                            SequenceFile / Avro / ?             model
Max data size               30+ PB                              ~1PB


If you have neither random write nor random read, stick to HDFS!


                           8/23/11  NoSQLNow! 2011 Conference                            69
What if I don’t know what my schema should be?
• MR and HBase complement each other.
   • Use HDFS for long sequential writes.
   • Use MR for large batch jobs.
   • Use HBase for random writes and reads.

• Applications need HBase to have data structured in a
  certain way.
• Save raw data to HDFS and then experiment.
• MR for data transformation and ETL‐like jobs from raw
  data.
• Use bulk import from MR to HBase.
                  8/23/11  NoSQLNow! 2011 Conference       70
Outline

•   Motivation 
•   Enter Apache Hadoop 
•   Enter Apache HBase
•   Real‐World Applications
•   System Architecture
•   Deployment (in the Cloud)
•   Conclusions



                  8/23/11  NoSQLNow! 2011 Conference   71
Apache HBase in Production
• Facebook:
   • Messages
• StumbleUpon:
   • http://su.pr
• Mozilla:
   • Socorro ‐‐ receives crash reports
• Yahoo:
   • Web Crawl Cache
• Twitter:
   • stores users and tweets for analytics
• … many others
                    8/23/11  NoSQLNow! 2011 Conference   72
High Level Architecture
                     Your PHP Application

   MapReduce         Thrift/REST Gateway            Your Java Application
   Hive/Pig

                 Java Client
                                                          ZooKeeper
                 HBase

                         HDFS
               8/23/11  NoSQLNow! 2011 Conference                           73
Data‐centric schema design?

• Entity relationship model.
   • Design schema in “Normalized form”.
   • Figure out your queries.
   • DBA sets primary/foreign keys and indexes once query is
     known.

• Issues:
   • Difficult and expensive to change schema.
   • Difficult and expensive to add columns.
                   8/23/11  NoSQLNow! 2011 Conference              74
Query‐centric schema design
• Know your queries, then design your schema

• Pick row keys to spread region load
   • Spreading loads can increase read and write efficiency.
• Pick column‐family members for better reads
   • Create these by knowing fields needed by queries.
   • It’s better to have fewer than many.

• Notice:
   • App developers optimize the queries, not DBAs.
   • If you’ve done the relational DB query optimizations, you are
     mostly there already!
                     8/23/11  NoSQLNow! 2011 Conference                75
Schema design exercises

• URL Shortener
  • Bit.ly, goo.gl, su.pr etc.

• Web table
  • Google BigTable’s example, Yahoo!’s Web Crawl Cache

• Facebook Messages
  • Conversations and Inbox Search
  • Transition strategies
                    8/23/11  NoSQLNow! 2011 Conference    76
Url Shortener Service
[diagram: service flows]
• Lookup hash, track click, and forward to full url
• Enter new long url, generate short url, and store to user’s mapping
• Look up all of a user’s shortened urls and display
• Track historical click counts over time
                 8/23/11  NoSQLNow! 2011 Conference                                       77
Url Shortener schema




• All queries have at least one join.
• Constraints when adding new urls, and short urls.
• How do we delete users?

                 8/23/11  NoSQLNow! 2011 Conference   78
Url Shortener HBase schema

• Most common
  queries are single
  gets

• Use compression
  settings on content
  column families.

• Use composite row
  key to group all of a
  user’s shortened urls
                   8/23/11  NoSQLNow! 2011 Conference   79
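The composite-key idea can be sketched without HBase: if the row key is "userId-shortHash", one user’s urls are contiguous in sort order and a prefix scan lists them without a join. The keys below are made up.

```python
# Illustrative sketch: a composite row key ("userId-shortHash") groups
# all of one user's shortened urls contiguously, so listing them is a
# cheap prefix scan over the sorted keys rather than a join.
rows = sorted([
    "alice-a1b2", "alice-c3d4", "bob-e5f6", "alice-9z8y",
])

def prefix_scan(prefix):
    return [r for r in rows if r.startswith(prefix)]

alice_urls = prefix_scan("alice-")
```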
Web Tables

• Goal: Manage web crawls and their data by keeping
  snapshots of the web.
   • Google used BigTable for its Web table example
   • Yahoo uses HBase for Web crawl cache

                                                       Full scan applications
                               HBase
                                                          Random access
                                                           applications

                               HDFS
                  8/23/11  NoSQLNow! 2011 Conference                            80
Web Table queries

•   Crawler continuously updating links and pages
•   Want to track individual pages over time
•   Want to group related pages from same site
•   Want to calculate PageRank (links and backlinks)
•   Want to build a search index
•   Want to do ad‐hoc analytical queries on page content
                  8/23/11  NoSQLNow! 2011 Conference       81
Google web table schema




             8/23/11  NoSQLNow! 2011 Conference   82
Web table Schema Design
• Want to keep related pages together
   • Reverse url so that related pages are near each other.
   • archive.cloudera.com => com.cloudera.archive
   • www.cloudera.com => com.cloudera.www
• Want to track pages over time
   • reverseurl‐crawltimestamp: put all of same url together
   • Just scan a localized set of pages.
• Want to calculate pagerank (links and backlinks)
   • Just need links, so put raw content in different column family.
   • Avoid having to do IO to read unused raw content.
• Want to index newer pages
   • Use Map Reduce on most recently crawled content.
• Want to do analytical queries
   • We’ll do a full scan with filters.
                        8/23/11  NoSQLNow! 2011 Conference             83
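The url-reversal trick above is a one-liner; a minimal sketch showing that reversed hosts from the same domain end up adjacent in sort order:

```python
# Illustrative sketch: reverse the host portion of a url so pages from
# the same domain sort next to each other on disk.
def reverse_host(host):
    return ".".join(reversed(host.split(".")))

keys = sorted(reverse_host(h) for h in
              ["archive.cloudera.com", "www.cloudera.com", "www.example.org"])
# Both cloudera.com hosts are now adjacent in the sorted key space.
```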
Facebook Messages (as of 12/10)

• 15Bn messages/month (email) × 1KB ≈ 14TB
• 120Bn messages/month (chat) × 100 bytes ≈ 11TB
            Create a new                                            Keyword search of
        message/conversation                                           messages

                             Show full                 List most recent
                           conversation                conversations
                  8/23/11  NoSQLNow! 2011 Conference                                     84
Possible Schema Design
• Show my most recent conversations
   • Have a “conversation” table using user‐revTimeStamp as key
   • Have a Metadata column family
   • Metadata contains date, to/from, one line of most recent
• Show me the full conversation
   • Use same “conversation” table
   • Content column family contains a conversation
   • We already have full row key from previous, so this is just a quick lookup
• Search my inbox for keywords
   • Have a separate “inboxSearch” table
   • Row key design: userId‐word‐messageId‐lastUpdateRevTimestamp
       • Works for type‐ahead / partial messages
       • Show top N message ids with word
• Send new message
   • Update both tables and both users’ rows
   • Update recent conversations and keyword index
                         8/23/11  NoSQLNow! 2011 Conference                       85
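The inboxSearch row-key design makes keyword and type-ahead search a prefix scan over one user’s slice of the index. A minimal sketch with made-up user/word/message ids:

```python
# Illustrative sketch: "userId-word-messageId" row keys make keyword
# and type-ahead search a prefix scan confined to one user's rows.
index_rows = sorted([
    "u1-hadoop-m10", "u1-hbase-m11", "u1-hbase-m12", "u2-hbase-m20",
])

def search(user, partial_word):
    prefix = "%s-%s" % (user, partial_word)
    return [r for r in index_rows if r.startswith(prefix)]

# Typing "hb" in user u1's search box matches both of u1's hbase
# messages without ever touching u2's rows.
```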
Facebook MySQL to HBase transition

• Initially a normalized MySQL email schema sharded over
  1000 production servers with 500M users.
• How do we export users’ emails?
• Direct approach:
   • Big join – point table for TBs of data (500M users!)
   • This would kill the production servers.
• Incremental approach:
   •   Snapshot copy via naïve bulk load into migration HBase cluster.
   •   Have incremental fetches from db for new live data.
   •   Use MR on migration HBase to do join, writing to final cluster.
   •   App writes to both places until migration complete.
                      8/23/11  NoSQLNow! 2011 Conference                 86
Row Key tricks
• Row Key design for schemas is critical
   • Reasonable number of regions.
   • Make sure key distributes to spread write load.
   • Take advantage of lexicographic sort order.

• Numeric Keys and lexicographic sort
   • Store numbers big‐endian.
   • Pad ASCII numbers with 0’s.
• Use reversal to have most significant traits first.
   • Reverse URL.
   • Reverse timestamp to get most recent first.
• Use composite keys to make key distribute nicely and work well
  with sub‐scans
   • Ex: User‐ReverseTimeStamp
   • Do not use current timestamp as first part of row key!
                      8/23/11  NoSQLNow! 2011 Conference            87
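Two of the tricks above can be sketched in a few lines: zero-padding so lexicographic order matches numeric order, and subtracting a timestamp from a large constant so newer rows sort first. The padding widths and the `10**13` constant are arbitrary choices for the illustration.

```python
# Illustrative sketch of two row-key tricks: zero-padding numbers, and
# reversing timestamps so the newest row gets the smallest key.
def pad(n, width=10):
    return str(n).zfill(width)

def reverse_ts(ts, max_ts=10**13):
    return pad(max_ts - ts, 14)

# Unpadded ASCII numbers sort badly: "10" < "9" lexicographically.
unpadded = sorted(["9", "10"])
padded = sorted([pad(9), pad(10)])

# With reversed timestamps, the newer event sorts before the older one.
newest_first = sorted([reverse_ts(1300000000000), reverse_ts(1200000000000)])
```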
Key Take‐aways
• Denormalized schemas localize data for single lookups.

• Rowkey is critical for lookups and subset scans.

• Make sure row keys distribute writes across regions.

• Use bulk loads and MapReduce to re‐organize or
  change your schema (during down time).

• Multiple clusters for different workloads if you can
  afford it.
                  8/23/11  NoSQLNow! 2011 Conference       88
Outline

•   Motivation 
•   Enter Apache Hadoop 
•   Enter Apache HBase
•   Real‐World Applications
•   System Architecture
•   Deployment (in the Cloud)
•   Conclusions



                  8/23/11  NoSQLNow! 2011 Conference   89
A Typical Look...

• 5‐4000 commodity servers
  (8‐core, 24GB RAM, 4‐12 TB, gig‐E)
• 2‐level network architecture
   • 20‐40 nodes per rack




                  8/23/11  NoSQLNow! 2011 Conference   90
Hadoop Cluster nodes
Master nodes (1 each)


              NameNode (metadata server and database)


              JobTracker (scheduler)



Slave nodes (1‐4000 each)


                          DataNodes                                TaskTrackers 
                        (block storage)                          (task execution)



                            8/23/11  NoSQLNow! 2011 Conference                      91
Name Node and Secondary Name Node
• NameNode
   • The most critical node in the system.
   • Stores file system metadata on disk and in memory.
       • Directory structures, permissions
   • Modifications are stored as an edit log.
   • Fault tolerant but not highly available yet.

• Secondary NameNode
   • Not a hot standby!
   • Gets a copy of file system metadata and edit log.
   • Periodically compacts image and edit log and ships to NameNode.

• Make sure your DNS is setup properly!
                       8/23/11  NoSQLNow! 2011 Conference              92
Data nodes

• HDFS splits files into 64MB (or 128MB) blocks.

• Data nodes store and serve these blocks.
   • By default, pipeline writes to 3 different machines.
   • By default, the local machine, plus machines on other racks.

   • Locality helps significantly on subsequent reads and
     computation scheduling.
                   8/23/11  NoSQLNow! 2011 Conference         93
Job Tracker and Task Trackers

• Now, we want to process that data!

• Job Tracker
   • Schedules work and resource usage throughout the cluster.
   • Makes sure work gets done.
      • Controls retry, speculative execution, etc.

• Task Trackers
   • These slaves do the “map” and “reduce” work.
   • Co‐located with data nodes.
                      8/23/11  NoSQLNow! 2011 Conference          94
HBase cluster nodes

 Master nodes (1 each)                          Slave nodes (1‐4000 each)
           NameNode
           (HDFS metadata server)

           HMaster
           (region metadata)                                      RegionServer
             HMaster                                              (table server)
             (hot standby)
                                                                  DataNode
                                                                  (hdfs block server)
        Coordination nodes
        (odd number)

                                                ZooKeeper
                                                Quorum Peer
                         8/23/11  NoSQLNow! 2011 Conference                              95
HMaster and ZooKeeper
• HMaster
  • Controls which Regions are served by which Region Servers.
  • Assigns regions to new region servers when they arrive or go
    down.
  • Can have a hot standby master if main master goes down.
  • All region state kept in ZooKeeper.

• Apache ZooKeeper
  •   Highly Available system for coordination.
  •   Generally 3 or 5 machines (always an odd number).
  •   Uses consensus to guarantee common shared state.
  •   Writes are considered expensive.
                   8/23/11  NoSQLNow! 2011 Conference               96
Region Server

• Tables are chopped up into regions.
• A region is only served by one region server at a time.
• Regions are served by a “region server”.
   • Load balancing if region server goes down.
• Co‐locate region servers with data nodes.
   • Takes advantage of HDFS file locality. 


• Important that clocks are in reasonable sync. Use NTP!


                    8/23/11  NoSQLNow! 2011 Conference      97
Stability and Tuning Hints
• Monitor your cluster.
• Avoid memory swapping.
   • Do not oversubscribe memory.
   • Can suffer from cascading failures.

• Mostly scan jobs?
   • Small read cache, low swappiness.
• Large max region size for large column families.
   • Avoid costly “region splits”.
• Make the ZK timeout higher.
   • Longer to recover from failure, but prevents cascading failure.
                    8/23/11  NoSQLNow! 2011 Conference                98
Outline

•   Motivation 
•   Enter Apache Hadoop 
•   Enter Apache HBase
•   Real‐World Applications
•   System Architecture
•   Deployment (in the Cloud)
•   Conclusions



                  8/23/11  NoSQLNow! 2011 Conference   99
Back to the cloud!




  8/23/11  NoSQLNow! 2011 Conference   100
We’ll use Apache Whirr

                         Apache Whirr is a set of
                          tools and libraries for
                          deploying clusters on
                           cloud services in a
                            cloud‐neutral way.
              8/23/11  NoSQLNow! 2011 Conference   101
This is great for setting up a cluster...


jon@grimlock:~/whirr-0.5.0-incubating$
  bin/whirr launch-cluster --config
  recipes/hbase-ec2.properties

jon@grimlock:~/whirr-0.5.0-incubating$
  bin/whirr launch-cluster --config
  recipes/scm-ec2.properties
                8/23/11  NoSQLNow! 2011 Conference   102
and an easy way to tear down a cluster.


jon@grimlock:~/whirr-0.5.0-incubating$
  bin/whirr destroy-cluster --config
  recipes/hbase-ec2.properties

jon@grimlock:~/whirr-0.5.0-incubating$
  bin/whirr destroy-cluster --config
  recipes/scm-ec2.properties
              8/23/11  NoSQLNow! 2011 Conference   103
But how do you manage a cluster deployment?




             8/23/11  NoSQLNow! 2011 Conference   104
Interact with your cluster with Hue




               8/23/11  NoSQLNow! 2011 Conference   105
What did we just do?

• Whirr
  • Provisioned the machines on EC2.
  • Installed SCM on all the machines.

• Cloudera Service and Configuration Manager Express
  • Orchestrated the deployment of the services in the proper order.
     • ZooKeeper, Hadoop, HDFS, HBase and Hue
  • Set service configurations.
  • Free download for kicking off up to 50 nodes!
  http://www.cloudera.com/products‐services/scm‐express/
                   8/23/11  NoSQLNow! 2011 Conference          106
To Cloud or not to Cloud?

• The key feature of a cloud deployment
   • Elasticity: The ability to expand and contract the number of
     machines being used on demand.

• Things to consider:
   • Economics of machines and people.
      • Capital Expenses vs Operational Expenses.
   • Workload requirements: Performance / SLAs.
   • Previous investments.
                     8/23/11  NoSQLNow! 2011 Conference              107
Economics of a cluster

 EC2 Cloud deployment                             Private Cluster
 • 10 small instances                             • 10 commodity servers
    • 1 core, 1.7GB ram, 160GB disk                     • 8 core, 24 GB ram, 6TB disk
    • $0.085/hr/mchn => $7,446/yr                       • $6500 /machine => $65,000
    • Reserved $227.50/yr/machine                       • + physical space
      +$0.03/hr/mchn => $4,903/yr                       • + networking gear
                                                        • + power
 • 10 Dedicated‐Reserved Quad                           • + admin costs
   XL instances                                         • + more setup time
    • 8 core, 23GB ram, 1690GB disk
    • $6,600/yr/mchn +
      $0.84/hr/mchn +
      $10/hr/region => 66,000 +
      73,584 + 87,600 =>
      $227,184/yr
                      8/23/11  NoSQLNow! 2011 Conference                                     108
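The slide’s yearly figures follow from 8,760 hours per year and 10 machines; a quick arithmetic check (2011-era prices as quoted on the slide):

```python
# Checking the arithmetic behind the slide's yearly EC2 figures
# (24 * 365 = 8,760 hours/year, 10 machines).
HOURS = 24 * 365

small_on_demand = 0.085 * HOURS * 10            # => $7,446/yr
small_reserved = (227.50 + 0.03 * HOURS) * 10   # => $4,903/yr

# Dedicated-Reserved Quad XLs: upfront + hourly + per-region fee.
quad_xl = (6600 * 10) + (0.84 * HOURS * 10) + (10 * HOURS)
# 66,000 + 73,584 + 87,600 => $227,184/yr
```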
Pros of using the cloud

• With virtualized machines, you can install any SW you
  want!
• Great for bursty or occasional loads that expand and shrink.
• Great if your apps and data are already in the cloud.
   • Logs already live in S3, for example.
• Great if you don’t have a large ops team.
   • Save money on people dealing with colos, hardware failures.
   • Steadier ops team personnel requirements (unless catastrophic
     failure).
• Great for experimentation.
• Great for testing/QA clusters at scale.
                     8/23/11  NoSQLNow! 2011 Conference                109
Cons of using the cloud
• Getting data in and out of EC2.
   • Not cost, but amount of time.
   • AWS Direct Connect can help.
• Virtualization causes varying network connectivity.
   • ZooKeeper timeouts can cause cascading failure.
   • Some connections fast, others slow.
   • Dedicated or Cluster‐compute instances could improve this.
• Virtualization causes unpredictable IO performance.
   • EBS is like a SAN and an eventual bottleneck.
   • Ephemeral disks perform better but are not recoverable on failures.
• Still need to deal with disaster recovery.
   • What happens if EC2 or a region goes down? (4/21/11)
                     8/23/11  NoSQLNow! 2011 Conference                110
Cloudera’s Experience with Hadoop in the Cloud
• Some Enterprise Hadoop/MR use the Cloud.
   • Good for daily jobs with moderate amounts of data (GB’s),
     generally computationally expensive.
• Ex: Periodic matching or recommendation applications.
   • Spin up cluster.
   • Upload a data set to S3.
   • Do an n² matching or recommendation job.
      • Mapper expands data.
      • Reducer gets small amount of data back.
      • Write to S3.
   • Download result set.
   • Tear down cluster.
                     8/23/11  NoSQLNow! 2011 Conference           111
Cloudera’s Experience with HBase in the cloud

• Almost all enterprise HBase users use physical hardware.
• Some initially used the cloud, but transitioned to physical
  hardware.
• One story:
   • EC2: 40 nodes in ec2 xl instances.
   • Bought 10 physical machines and got similar or better performance.
• Why?
   • Physical hardware gives more control of machine build out, network
     infrastructure, and locality, which are critical for performance.
   • HBase is up all the time and usually grows over time.
                    8/23/11  NoSQLNow! 2011 Conference               112
Outline

•   Motivation 
•   Enter Apache Hadoop 
•   Enter Apache HBase
•   Real‐World Applications
•   System Architecture
•   Deployment (in the Cloud)
•   Conclusions



                  8/23/11  NoSQLNow! 2011 Conference   113
Key takeaways
• Apache HBase is not a Database!  There are other scalable
  databases.
• Query‐centric schema design, not data‐centric schema
  design.
• In production at 100’s of TB scale at several large
  enterprises.
• If you are restructuring your SQL DB to optimize it, you may
  be a candidate for HBase.
• HBase complements and depends upon Hadoop.
• Hadoop makes sense in the cloud for some production
  workloads.
• HBase in the cloud for experimentation, but generally on
  physical hardware for production.
                   8/23/11  NoSQLNow! 2011 Conference         114
HBase vs RDBMS
                                      RDBMS                             HBase
Data layout         Row‐oriented                               Column‐family‐oriented

Transactions        Multi‐row ACID                             Single row only

Query language      SQL                                        get/put/scan/etc *

Security            Authentication/Authorization               Work in progress

Indexes             On arbitrary columns                       Row‐key only*

Max data size       TBs                                        ~1PB

Read/write          1000s queries/second                       Millions of
throughput limits                                              “queries”/second
                          8/23/11  NoSQLNow! 2011 Conference                            115
HBase vs other “NoSQL”

• Favors Strict Consistency over Availability (but
  availability is good in practice!)
• Great Hadoop integration (very efficient bulk loads,
  MapReduce analysis)
• Ordered range partitions (not hash)
• Automatically shards/scales (just turn on more servers,
  really proven at petabyte scale)
• Sparse column storage (not key‐value)



                 8/23/11  NoSQLNow! 2011 Conference     116
HBase vs just HDFS
                                    Plain HDFS/MR                        HBase
Write pattern               Append‐only                         Random write, bulk
                                                                incremental
Read pattern                Full table scan, partition table    Random read, small range
                            scan                                scan, or table scan
Hive (SQL) performance      Very good                           4‐5x slower

Structured storage          Do‐it‐yourself / TSV /              Sparse column‐family data
                            SequenceFile / Avro / ?             model
Max data size               30+ PB                              ~1PB


If you have neither random write nor random read, stick to HDFS!


                           8/23/11  NoSQLNow! 2011 Conference                            117
More resources?

• Download Hadoop and HBase!
   • CDH ‐ Cloudera’s Distribution including
     Apache Hadoop
     http://cloudera.com/
   • http://hadoop.apache.org/
• Try it out!  (Locally, VM, or EC2)
• Watch free training videos on
  http://cloudera.com/
                    8/23/11  NoSQLNow! 2011 Conference   118
jon@cloudera.com
                                                   @jmhsieh


QUESTIONS?


       8/23/11  NoSQLNow! 2011 Conference                      119
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Introduction to hbase
Introduction to hbaseIntroduction to hbase
Introduction to hbase
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 

Destacado

Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data WarehousingAlexey Grigorev
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsMatteo Romanello
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?Nicolas Robinson-Garcia
 
Efficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matchingEfficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matchingMateusz Fedoryszak
 
Cited Reference Searching
Cited Reference SearchingCited Reference Searching
Cited Reference SearchingSCULibrarian
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawlingDenis Shestakov
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...Victor Giannakouris
 
Towards a Semantic Citation Index for the German Social Sciences
Towards a Semantic Citation Index for the German Social SciencesTowards a Semantic Citation Index for the German Social Sciences
Towards a Semantic Citation Index for the German Social SciencesGESIS
 
How to build your own citation index
How to build your own citation indexHow to build your own citation index
How to build your own citation indexGESIS
 
The Research Paper and Citation Methodology
The Research Paper and Citation MethodologyThe Research Paper and Citation Methodology
The Research Paper and Citation MethodologyOttawa University
 
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...Christian Gügi
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 

Destacado (14)

Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
Exploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in ClassicsExploring Citation Networks to Study Intertextuality in Classics
Exploring Citation Networks to Study Intertextuality in Classics
 
How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?How many citations are there in the Data Citation Index?
How many citations are there in the Data Citation Index?
 
Efficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matchingEfficient blocking method for a large scale citation matching
Efficient blocking method for a large scale citation matching
 
Emerging sources citation index (esci)
Emerging sources citation index (esci)Emerging sources citation index (esci)
Emerging sources citation index (esci)
 
Cited Reference Searching
Cited Reference SearchingCited Reference Searching
Cited Reference Searching
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawling
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
 
Towards a Semantic Citation Index for the German Social Sciences
Towards a Semantic Citation Index for the German Social SciencesTowards a Semantic Citation Index for the German Social Sciences
Towards a Semantic Citation Index for the German Social Sciences
 
How to build your own citation index
How to build your own citation indexHow to build your own citation index
How to build your own citation index
 
The Research Paper and Citation Methodology
The Research Paper and Citation MethodologyThe Research Paper and Citation Methodology
The Research Paper and Citation Methodology
 
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
 
Citation and referencing in research work
Citation and referencing in research workCitation and referencing in research work
Citation and referencing in research work
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 

Similar a Cloud Deployments with Apache Hadoop and Apache HBase

Red Hat Storage Day LA - Performance and Sizing Software Defined Storage
Red Hat Storage Day LA - Performance and Sizing Software Defined Storage Red Hat Storage Day LA - Performance and Sizing Software Defined Storage
Red Hat Storage Day LA - Performance and Sizing Software Defined Storage Red_Hat_Storage
 
Immutable Infrastructure: the new App Deployment
Immutable Infrastructure: the new App DeploymentImmutable Infrastructure: the new App Deployment
Immutable Infrastructure: the new App DeploymentAxel Fontaine
 
Introduction to MySQL
Introduction to MySQLIntroduction to MySQL
Introduction to MySQLTed Wennmark
 
Wicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with CrowbarWicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with CrowbarKamesh Pemmaraju
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for HadoopJoe Crobak
 
Cloud Computing & Scaling Web Apps
Cloud Computing & Scaling Web AppsCloud Computing & Scaling Web Apps
Cloud Computing & Scaling Web AppsMark Slingsby
 
MySQL Cluster as Transactional NoSQL (KVS)
MySQL Cluster as Transactional NoSQL (KVS)MySQL Cluster as Transactional NoSQL (KVS)
MySQL Cluster as Transactional NoSQL (KVS)Ryusuke Kajiyama
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
Hybrid and On-premise AWS workloads using HP Helion Eucalyptus
Hybrid and On-premise AWS workloads using HP Helion EucalyptusHybrid and On-premise AWS workloads using HP Helion Eucalyptus
Hybrid and On-premise AWS workloads using HP Helion EucalyptusVedanta Barooah
 
MySQL 5.6, news in 5.7 and our HA options
MySQL 5.6, news in 5.7 and our HA optionsMySQL 5.6, news in 5.7 and our HA options
MySQL 5.6, news in 5.7 and our HA optionsTed Wennmark
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
 
SQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data WarehouseSQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data WarehouseMark Ginnebaugh
 
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 Openstack - An introduction/Installation - Presented at Dr Dobb's conference... Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...Rahul Krishna Upadhyaya
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)DataWorks Summit
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
 
Chef for OpenStack December 2012
Chef for OpenStack December 2012Chef for OpenStack December 2012
Chef for OpenStack December 2012Matt Ray
 

Similar a Cloud Deployments with Apache Hadoop and Apache HBase (20)

Data Science
Data ScienceData Science
Data Science
 
Red Hat Storage Day LA - Performance and Sizing Software Defined Storage
Red Hat Storage Day LA - Performance and Sizing Software Defined Storage Red Hat Storage Day LA - Performance and Sizing Software Defined Storage
Red Hat Storage Day LA - Performance and Sizing Software Defined Storage
 
Immutable Infrastructure: the new App Deployment
Immutable Infrastructure: the new App DeploymentImmutable Infrastructure: the new App Deployment
Immutable Infrastructure: the new App Deployment
 
Introduction to MySQL
Introduction to MySQLIntroduction to MySQL
Introduction to MySQL
 
Wicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with CrowbarWicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
Wicked Easy Ceph Block Storage & OpenStack Deployment with Crowbar
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Cloud Computing & Scaling Web Apps
Cloud Computing & Scaling Web AppsCloud Computing & Scaling Web Apps
Cloud Computing & Scaling Web Apps
 
MySQL Cluster as Transactional NoSQL (KVS)
MySQL Cluster as Transactional NoSQL (KVS)MySQL Cluster as Transactional NoSQL (KVS)
MySQL Cluster as Transactional NoSQL (KVS)
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Hybrid and On-premise AWS workloads using HP Helion Eucalyptus
Hybrid and On-premise AWS workloads using HP Helion EucalyptusHybrid and On-premise AWS workloads using HP Helion Eucalyptus
Hybrid and On-premise AWS workloads using HP Helion Eucalyptus
 
MySQL 5.6, news in 5.7 and our HA options
MySQL 5.6, news in 5.7 and our HA optionsMySQL 5.6, news in 5.7 and our HA options
MySQL 5.6, news in 5.7 and our HA options
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
SQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data WarehouseSQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data Warehouse
 
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 Openstack - An introduction/Installation - Presented at Dr Dobb's conference... Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Chef for OpenStack December 2012
Chef for OpenStack December 2012Chef for OpenStack December 2012
Chef for OpenStack December 2012
 

Más de DATAVERSITY

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...DATAVERSITY
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceDATAVERSITY
 
Exploring Levels of Data Literacy
Exploring Levels of Data LiteracyExploring Levels of Data Literacy
Exploring Levels of Data LiteracyDATAVERSITY
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsDATAVERSITY
 
Make Data Work for You
Make Data Work for YouMake Data Work for You
Make Data Work for YouDATAVERSITY
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?DATAVERSITY
 
Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?DATAVERSITY
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling FundamentalsDATAVERSITY
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectDATAVERSITY
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at ScaleDATAVERSITY
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?DATAVERSITY
 
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...DATAVERSITY
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?DATAVERSITY
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsDATAVERSITY
 
Data Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayData Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayDATAVERSITY
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise AnalyticsDATAVERSITY
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best PracticesDATAVERSITY
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?DATAVERSITY
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best PracticesDATAVERSITY
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageDATAVERSITY
 

Más de DATAVERSITY (20)

Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le...
 
Data at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and GovernanceData at the Speed of Business with Data Mastering and Governance
Data at the Speed of Business with Data Mastering and Governance
 
Exploring Levels of Data Literacy
Exploring Levels of Data LiteracyExploring Levels of Data Literacy
Exploring Levels of Data Literacy
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business Goals
 
Make Data Work for You
Make Data Work for YouMake Data Work for You
Make Data Work for You
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?
 
Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?Data Catalogs Are the Answer – What Is the Question?
Data Catalogs Are the Answer – What Is the Question?
 
Data Modeling Fundamentals
Data Modeling FundamentalsData Modeling Fundamentals
Data Modeling Fundamentals
 
Showing ROI for Your Analytic Project
Showing ROI for Your Analytic ProjectShowing ROI for Your Analytic Project
Showing ROI for Your Analytic Project
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at Scale
 
Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?Is Enterprise Data Literacy Possible?
Is Enterprise Data Literacy Possible?
 
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re...
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?
 
Data Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and ForwardsData Governance Trends - A Look Backwards and Forwards
Data Governance Trends - A Look Backwards and Forwards
 
Data Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement TodayData Governance Trends and Best Practices To Implement Today
Data Governance Trends and Best Practices To Implement Today
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics
 
Data Strategy Best Practices
Data Strategy Best PracticesData Strategy Best Practices
Data Strategy Best Practices
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best Practices
 
MLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive AdvantageMLOps – Applying DevOps to Competitive Advantage
MLOps – Applying DevOps to Competitive Advantage
 

Último

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...

Cloud Deployments with Apache Hadoop and Apache HBase

  • 1. Cloud deployments with Apache Hadoop and Apache HBase. 8/23/11 NoSqlNow! 2011 Conference. Jonathan Hsieh, jon@cloudera.com, @jmhsieh
  • 2. Who Am I? • Cloudera: Software Engineer on the Platform Team; Apache HBase contributor; Apache Flume (incubating) founder/committer; Apache Sqoop (incubating) committer • U of Washington: Research in Distributed Systems and Programming Languages
  • 3. Who is Cloudera? Cloudera, the leader in Apache Hadoop-based software and services, enables enterprises to easily derive business value from all their data.
  • 7. "Every two days we create as much information as we did from the dawn of civilization up until 2003." (Eric Schmidt) "I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding." (Hal Varian, Google's chief economist)
  • 8. Outline • Motivation • Enter Apache Hadoop • Enter Apache HBase • Real-World Applications • System Architecture • Deployment (in the Cloud) • Conclusions
  • 9. Outline • Motivation • Enter Apache Hadoop • Enter Apache HBase • Real-World Applications • System Architecture • Deployment (in the Cloud) • Conclusions
  • 10. What is Apache HBase? Apache HBase is an open source, horizontally scalable, sorted map data store built on top of Apache Hadoop.
  • 11. What is Apache Hadoop? Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers.
  • 12. Open Source • Apache 2.0 License • A community project with committers and contributors from diverse organizations: Cloudera, Facebook, Yahoo!, eBay, StumbleUpon, Trend Micro, … • The code license means anyone can modify and use the code.
  • 13. Horizontally Scalable • Store and access data on 1-1000s of commodity servers. • Adding more servers should linearly increase performance and capacity: storage capacity, processing capacity, and input/output operations. [Chart: performance/storage/throughput (IOPs) vs. # of servers]
  • 14. Commodity Servers (circa 2010) • 2 quad-core CPUs, running at least 2-2.5GHz • 16-24GB of RAM (24-32GB if you're considering HBase) • 4x 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration • Gigabit Ethernet • $5k-10k / machine
  • 15. Let's deploy some machines (in the cloud!)
  • 17. Ready?
    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...
    curl -O http://www.apache.org/dist/incubator/whirr/whirr-0.5.0-incubating/whirr-0.5.0-incubating.tar.gz
    tar zxf whirr-0.5.0-incubating.tar.gz
    cd whirr-0.5.0-incubating
    bin/whirr launch-cluster --config recipes/hbase-ec2.properties
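The recipes/hbase-ec2.properties file named above holds the cluster definition. As a rough sketch of what such a Whirr recipe looks like (the role layout and values here are illustrative, not the shipped recipe):

```properties
# Illustrative Whirr recipe for an HBase cluster on EC2 (example values only).
whirr.cluster-name=hbase
# One master node plus five slaves; roles follow Whirr's Hadoop/HBase services.
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master,5 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

Whirr reads the cloud credentials from the environment variables exported earlier, so the recipe itself stays free of secrets.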
  • 18. Done. That's it. You can go home now. Ok, ok, we'll come back to this later.
  • 19. "The future is already here — it's just not very evenly distributed." (William Gibson)
  • 21. Building a web index • Download all of the web. • Store all of it. • Analyze all of it to build the index and rankings. • Repeat.
  • 22. Size of Google's Web Index • Let's assume 50KB per webpage, 500 bytes per URL. • According to Google*: 1998: 26M indexed pages (1.3TB); 2000: 1Bn indexed pages (500TB); 2008: ~40Bn indexed pages (20,000TB); 2008: 1Tn URLs (~500TB just in URL names!) [Chart: estimated size of Google's web storage in TB, by year] * http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
  • 23. Volume, Variety, and Velocity • The web has a massive amount of data. How do we store this massive volume? • Raw web data is diverse, dirty, and semi-structured. How do we deal with all the compute necessary to clean and process this variety of data? • There is new content being created all the time! How do we keep up with the velocity of new data coming in?
  • 24. Telltale signs you might need "noSQL".
  • 25. Did you try scaling vertically? • Upgrading to a beefier machine could be quick (upgrade that m1.large to an m2.4xlarge). • This is probably a good idea. Not quite time for HBase. • What if this isn't enough?
  • 26. Changed your schema and queries? • Remove text search queries (LIKE): these are expensive. • Remove joins: normalization is more expensive today; multiple seeks are more expensive than sequential reads/writes. • Remove foreign keys and encode your own relations: avoids constraint checks. • Just put all parts of a query in a single table. • Lots of full table scans? Good time for Hadoop. • This might be a good time to consider HBase.
  • 27. Need to scale reads? • Use DB replication to make more copies to read from. • Use Memcached. • Assumes an 80/20 read-to-write ratio; this works reasonably well if you can tolerate replication lag.
  • 28. Need to scale writes? • Unfortunately, eventually you may need more writes. • Let's shard and federate the DB. • Loses consistency and order of operations. • Replication has diminishing returns with more writes. • HA operational complexity! Gah! [Chart: performance (IOPs) vs. # of servers] • This is definitely a good time to consider HBase.
  • 29. Wait – we "optimized the DB" by discarding some fundamental SQL/relational database features?
  • 31. Assumption #1: Your workload fits on one machine… Image: Matthew J. Stinson CC-BY-NC
  • 32. Massive amounts of storage • How much data could you collect today? Many companies easily collect 200GB of logs per day. Facebook claims to collect >15TB per day. • How do you handle this problem? Just keep a few days' worth of data and then /dev/null. Sample your data. Move data to write-only media (tape). • If you want to analyze all your data, you are going to need to use multiple machines.
  • 33. Assumption #2: Machines deserve identities... Image: Laughing Squid CC BY-NC-SA
  • 34. Interact with a cluster, not a bunch of machines. Image: Yahoo! Hadoop cluster [OSCON '07]
  • 35. Assumption #3: Machines can be reliable… Image: MadMan the Mighty CC BY-NC-SA
  • 36. Disk failures are very common • USENIX FAST '07: tl;dr: 2-8% of new disk drives fail per year. • For a 100-node cluster with 1200 drives: a drive failure every 15-60 days.
  • 37. Outline • Motivation • Enter Apache Hadoop • Enter Apache HBase • Real-World Applications • System Architecture • Cluster Deployment • Cloud Deployment • Demo • Conclusions
  • 38. What is Apache Hadoop? Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers.
  • 39. What is Apache Hadoop? Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers.
  • 40. Goal: Separate complicated distributed fault-tolerance code from application code. [Diagram: Unicorns / Systems Programmers / Statisticians]
  • 41. What did Google do? • SOSP 2003: the GFS paper • OSDI 2004: the MapReduce paper
  • 42. Origin of Apache Hadoop (timeline, 2002-2010): open source web crawler project created by Doug Cutting; Google publishes the GFS paper; Google publishes MapReduce; open source MapReduce & HDFS project created by Doug Cutting; Hadoop wins the Terabyte sort benchmark; runs a 4,000-node Hadoop cluster; launches SQL support for Hadoop; Cloudera releases CDH3 and Cloudera Enterprise.
  • 43. Apache Hadoop in production today • Yahoo! Hadoop clusters: >82PB, >40k machines (as of Jun '11) • Facebook: 15TB of new data per day; 1200+ machines, 30PB in one cluster • Twitter: >1TB per day, ~120 nodes • Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research, government)
  • 44. Case studies: Hadoop World '10 • eBay: Hadoop at eBay • Twitter: The Hadoop Ecosystem at Twitter • Yale University: MapReduce and Parallel Database Systems • General Electric: Sentiment Analysis powered by Hadoop • Facebook: HBase in Production • Bank of America: The Business of Big Data • AOL: AOL's Data Layer • Raytheon: SHARD: Storing and Querying Large-Scale Data • StumbleUpon: Mixing Real-Time and Batch Processing • More info at http://www.cloudera.com/company/press-center/hadoop-world-nyc/
  • 45. What is HDFS? • HDFS is a file system. • Just stores (a lot of) bytes as files. • Distributes storage across many machines and many disks. • Reliable: replicates blocks across the machines throughout the cluster. • Horizontally scalable: just add more machines and disks. • Optimized for large sequential writes. • Features: Unix-style permissions; Kerberos-based authentication.
  • 46. HDFS's File API • Directory operations: list files, remove files, copy files, put/get files • File operations: open, close, read, write/append
  • 47. Ideal use cases • Great for storage of massive amounts of raw or uncooked data • Massive data files • All of your logs
  • 48. Massive scale while tolerating machine failures
  • 49. Hadoop gives you agility • Schema on write: traditional DBs require cleaning and applying a schema; great if you can plan your schema well in advance. • Schema on read: HDFS enables you to store all of your raw data; great if you have ad hoc queries on ad hoc data. If you don't know your schema, you can try new ones; great if you are exploring schemas and transforming data.
  • 50. Analyzing data with MapReduce • Apache Hadoop MapReduce: a simplified distributed programming model. Just specify a "map" and a "reduce" function. • MapReduce is a batch processing system: optimized for throughput, not latency. The fastest MR job takes 15+ seconds. • You are not going to use this to directly serve data for your next web site.
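The map/reduce contract on slide 50 can be sketched without the Hadoop API at all. This toy (all class and method names are mine, not Hadoop's): map emits (word, 1) pairs, the "shuffle" groups the pairs by key, and reduce folds each group into a count.

```java
import java.util.*;

// Toy illustration of the MapReduce model (not the Hadoop API).
public class WordCountSketch {
    // map(): one input record in, zero or more (key, value) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+"))
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        return out;
    }

    // reduce(): all values for one key in, one result out.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of("hadoop stores data", "hbase stores data");
        // The "shuffle": group mapped pairs by key, as the framework would.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
        grouped.forEach((w, c) -> System.out.println(w + "\t" + reduce(w, c)));
    }
}
```

In real Hadoop the map and reduce functions run in parallel across the cluster and the shuffle moves data between machines; the division of labor is exactly the one shown here.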
  • 51. Don't like programming Java? • Apache Pig: a higher-level dataflow language. Good for data transformations; generally preferred by programmers. • Apache Hive: a SQL-like language for querying data in HDFS. Generally preferred by data scientists and business analysts.
  • 52. There is a catch… • Files are append-only: no update or random writes; data is not available for read until the file is flushed. • Files are ideally large: enables storage of 10s of petabytes of data. HDFS splits the file into 64MB or 128MB blocks!
  • 53. Outline • Motivation • Enter Apache Hadoop • Enter Apache HBase • Real-World Applications • System Architecture • Deployment (in the Cloud) • Conclusions
  • 54. What is Apache HBase? Apache HBase is an open source, horizontally scalable, sorted map data store built on top of Apache Hadoop.
  • 55. What is Apache HBase? Apache HBase is an open source, horizontally scalable, sorted map data store built on top of Apache Hadoop.
  • 56. Inspiration: Google BigTable • OSDI 2006 paper • Goal: quick random read/write access to massive amounts of structured data. • It was the data store for Google's crawler web table, Orkut, Analytics, Earth, Blogger, …
  • 57. Sorted Map Datastore • It really is just a big table! • Tables consist of rows, each of which has a primary key (row key). • Each row may have any number of columns. • Rows are stored in sorted order.
  • 58. Anatomy of a Row • Each row has a row key (think primary key): a lexicographically sorted byte[], with a timestamp associated for keeping multiple versions of data. • A row is made up of columns; each column has a cell. • The contents of a cell are untyped byte[]s: apps must "know" the types and handle them. • Columns are logically like a Map<byte[] column, byte[] value>. • Row edits are atomic and changes are strongly consistent (replicas are in sync).
  • 59. Sorted Map Datastore (logical view) • The row key is the implicit PRIMARY KEY in RDBMS terms; data is all stored as byte[]. • Row 'cutting': 'info:height': '9ft', 'info:state': 'CA', 'roles:ASF': 'Director', 'roles:Hadoop': 'Founder' • Row 'tlipcon': 'info:height': '5ft7', 'info:state': 'CA', 'roles:Hadoop': 'Committer'@ts=2010, 'roles:Hadoop': 'PMC'@ts=2011, 'roles:Hive': 'Contributor' • Different rows may have different sets of columns (the table is sparse); useful for *-to-many mappings. • A single cell might have different values at different timestamps.
  • 60. Apache HBase depends upon HDFS • Relies on HDFS for data durability and reliability. • Uses HDFS to store its Write-Ahead Log (WAL). • Needs flush/sync support in HDFS in order to prevent data loss problems.
  • 61. HBase in Numbers • Largest cluster: ~1000 nodes, ~1PB • Most clusters: 5-20 nodes, 100GB-4TB • Writes: 1-3ms, 1k-10k writes/sec per node • Reads: 0-3ms cached, 10-30ms disk • 10-40k reads/second/node from cache • Cell size: 0-3MB preferred
  • 62. Access data via an API. There is "noSQL"* • HBase API: get(row); put(row, Map<column, value>); scan(key range, filter); increment(row, columns); … (checkAndPut, delete, etc.) • *Ok, that's a slight lie. There is work on integrating Apache Hive, a SQL-like query language, with HBase. This is not optimal; 5x slower than normal Hive+HDFS.
  • 63. Cost Transparency • Goal: predictable latency of random read and write operations. • To do this, you have to understand some of the physical layout of your datastore. Efficiencies are based on locality. • A few physical concepts to help: column families and regions.
  • 64. Column Families • Just a set of related columns. Each may have different columns and access patterns. • Each may have parameters set per column family: block compression (none, gzip, LZO, Snappy); version retention policies; cache priority. • Improves read performance: CFs are stored separately, so you can access one without wasting IO on the other. • Store related data together for better compression.
  • 65. Sparse Columns • Provides schema flexibility: add columns later, no need to transform the entire schema. • Use for writing aggregates atomically ("prejoins"). • Improves performance: null columns don't take space, and you don't need to read what is not there. • If you have a traditional DB table with lots of nulls, your data will probably fit well.
  • 66. Regions • Tables are divided into sets of rows called regions. • Read and write load are scaled by spreading across many regions.
  • 67. Sorted Map Datastore (physical view) • Cells are sorted on disk by row key, column key, and descending timestamp (milliseconds since the unix epoch). • info column family: cutting/info:height@1273516197868 = 9ft; cutting/info:state@1043871824184 = CA; tlipcon/info:height@1273878447049 = 5ft7; tlipcon/info:state@1273616297446 = CA • roles column family: cutting/roles:ASF@1273871823022 = Director; cutting/roles:Hadoop@1183746289103 = Founder; tlipcon/roles:Hadoop@1300062064923 = PMC; tlipcon/roles:Hadoop@1293388212294 = Committer; tlipcon/roles:Hive@1273616297446 = Contributor
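The descending-timestamp sort on slide 67 can be mimicked with a plain TreeMap. This sketch (class and method names are mine, not HBase's) shows why a read of tlipcon/roles:Hadoop returns the newest version, "PMC", without scanning older ones:

```java
import java.util.*;

// Toy in-memory model of HBase's physical sort order: versions of a cell
// are kept newest-first, so a read sees the latest value immediately.
public class SortedCellStore {
    // Key: "row/column"; value: versions sorted by descending timestamp.
    private final TreeMap<String, TreeMap<Long, String>> cells = new TreeMap<>();

    public void put(String row, String column, long ts, String value) {
        cells.computeIfAbsent(row + "/" + column,
                k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // Returns the newest version, like an HBase Get with maxVersions=1.
    public String get(String row, String column) {
        TreeMap<Long, String> versions = cells.get(row + "/" + column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        SortedCellStore store = new SortedCellStore();
        store.put("tlipcon", "roles:Hadoop", 1293388212294L, "Committer");
        store.put("tlipcon", "roles:Hadoop", 1300062064923L, "PMC");
        System.out.println(store.get("tlipcon", "roles:Hadoop")); // prints PMC
    }
}
```

Older versions stay in place until the retention policy on the column family expires them, which is what makes time-travel reads cheap.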
  • 68. HBase purposely doesn't have everything • No atomic multi-row operations • No global time ordering • No built-in SQL query language • No query optimizer
  • 69. HBase vs. just HDFS • Write pattern: plain HDFS/MR is append-only; HBase supports random write and bulk incremental loads. • Read pattern: plain HDFS/MR is full table scan or partition table scan; HBase is random read, small range scan, or table scan. • Hive (SQL) performance: very good on plain HDFS/MR; 4-5x slower on HBase. • Structured storage: do-it-yourself / TSV / SequenceFile / Avro on plain HDFS; sparse column-family data model in HBase. • Max data size: 30+ PB for HDFS; ~1PB for HBase. • If you have neither random write nor random read, stick to HDFS!
  • 70. What if I don't know what my schema should be? • MR and HBase complement each other: use HDFS for long sequential writes; use MR for large batch jobs; use HBase for random writes and reads. • Applications need HBase to have data structured in a certain way. • Save raw data to HDFS and then experiment: use MR for data transformation and ETL-like jobs from raw data; use bulk import from MR to HBase.
  • 71. Outline • Motivation • Enter Apache Hadoop • Enter Apache HBase • Real-World Applications • System Architecture • Deployment (in the Cloud) • Conclusions
  • 72. Apache HBase in production • Facebook: Messages • StumbleUpon: http://su.pr • Mozilla: Socorro (receives crash reports) • Yahoo: web crawl cache • Twitter: stores users and tweets for analytics • … many others
  • 73. High-Level Architecture • Your PHP application talks to a Thrift/REST gateway; your Java application, MapReduce, and Hive/Pig use the Java client directly. • All paths go through the Java client to HBase, which stores its data in HDFS and coordinates through ZooKeeper.
  • 74. Data-centric schema design? • Entity-relationship model: design the schema in "normalized form". • Figure out your queries; the DBA sets primary/foreign keys and indexes once the queries are known. • Issues: difficult and expensive to change the schema; difficult and expensive to add columns.
  • 75. Query-centric schema design • Know your queries, then design your schema. • Pick row keys to spread region load: spreading loads can increase read and write efficiency. • Pick column-family members for better reads: create these by knowing the fields needed by queries; it's better to have fewer than many. • Notice: app developers optimize the queries, not DBAs. If you've done the relational DB query optimizations, you are mostly there already!
  • 76. Schema design exercises • URL shortener (bit.ly, goo.gl, su.pr, etc.) • Web table (Google BigTable's example, Yahoo!'s web crawl cache) • Facebook Messages (conversations and inbox search) • Transition strategies
  • 77. URL Shortener Service • Look up a hash, track the click, and forward to the full URL • Enter a new long URL, generate a short URL, and store it to the user's mapping • Look up all of a user's shortened URLs and display them • Track historical click counts over time
  • 78. URL shortener schema (relational) • All queries have at least one join. • Constraints when adding new URLs and short URLs. • How do we delete users?
  • 79. URL shortener HBase schema • Most common queries are single gets. • Use compression settings on the content column families. • Use a composite row key to group all of a user's shortened URLs.
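The composite row key idea on slide 79 can be sketched with a sorted map standing in for the HBase table. A key of the form userId-reverseTimestamp-shortId (the exact layout here is my illustration, not the talk's) groups all of one user's short URLs together, newest first, so "list my recent short URLs" becomes a small prefix scan:

```java
import java.util.*;

// Sketch of a composite row key: userId-reverseTimestamp-shortId.
public class ShortUrlKeys {
    static String rowKey(String userId, long createdMillis, String shortId) {
        long reverse = Long.MAX_VALUE - createdMillis;    // newest sorts first
        return String.format("%s-%019d-%s", userId, reverse, shortId);
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();  // stands in for HBase
        table.put(rowKey("jon", 1000, "aB3"), "http://example.com/a");
        table.put(rowKey("jon", 2000, "xY9"), "http://example.com/b");
        table.put(rowKey("amy", 1500, "qQ1"), "http://example.com/c");

        // Prefix scan for one user, like an HBase Scan bounded by start/stop row.
        // '-' sorts just before '.', so ["jon-", "jon.") covers exactly the prefix.
        for (Map.Entry<String, String> e :
                table.subMap("jon-", true, "jon.", false).entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```

Zero-padding the reversed timestamp to a fixed 19 digits is what makes lexicographic order match numeric order within a user's prefix.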
  • 80. Web Tables • Goal: manage web crawls and their data by keeping snapshots of the web. • Google used BigTable for the web table example; Yahoo uses HBase for its web crawl cache. • Full-scan applications run against HDFS; random-access applications run against HBase.
  • 81. Web table queries • Crawler continuously updating links and pages • Want to track individual pages over time • Want to group related pages from the same site • Want to calculate PageRank (links and backlinks) • Want to build a search index • Want to do ad hoc analytical queries on page content
  • 82. Google web table schema
  • 83. Web table schema design • Want to keep related pages together: reverse the URL so that related pages are near each other (archive.cloudera.com => com.cloudera.archive; www.cloudera.com => com.cloudera.www). • Want to track pages over time: key on reverseurl-crawltimestamp to put all versions of the same URL together; just scan a localized set of pages. • Want to calculate PageRank (links and backlinks): just need links, so put raw content in a different column family; avoid having to do IO to read unused raw content. • Want to index newer pages: use MapReduce on the most recently crawled content. • Want to do analytical queries: we'll do a full scan with filters.
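The URL-reversal trick on slide 83 is a one-liner worth seeing concretely; this helper (the name is mine) flips the hostname's labels so pages from the same site become lexicographic neighbors:

```java
// Reversing the hostname ("www.cloudera.com" -> "com.cloudera.www") makes
// pages from the same site adjacent in HBase's lexicographic row order.
public class ReversedDomain {
    static String reverseDomain(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }
}
```

With this layout, a scan starting at "com.cloudera." visits every *.cloudera.com page before moving on to the next site.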
  • 84. Facebook Messages (as of 12/10) • 15Bn messages/month of email at 1KB each = 14TB • 120Bn/month at 100 bytes = 11TB • Queries: create a new message/conversation; keyword search of messages; show a full conversation; list the most recent conversations
  • 85. Possible schema design • Show my most recent conversations: have a "conversation" table using user-revTimeStamp as the key, with a metadata column family containing the date, to/from, and one line of the most recent message. • Show me the full conversation: use the same "conversation" table; the content column family contains a conversation. We already have the full row key from the previous query, so this is just a quick lookup. • Search my inbox for keywords: have a separate "inboxSearch" table with row key userId-word-messageId-lastUpdateRevTimestamp. Works for type-ahead / partial messages; show the top N message ids with the word. • Send a new message: update both tables and both users' rows; update recent conversations and the keyword index.
  • 86. Facebook MySQL to HBase transition • Initially a normalized MySQL email schema sharded over 1000 production servers with 500M users. • How do we export users' emails? • Direct approach: a big join over TBs of data (500M users!); this would kill the production servers. • Incremental approach: snapshot copy via naïve bulk load into a migration HBase cluster; incremental fetches from the DB for new live data; use MR on the migration HBase cluster to do the join, writing to the final cluster; the app writes to both places until migration is complete.
  • 87. Row key tricks • Row key design for schemas is critical: aim for a reasonable number of regions; make sure the key distributes to spread write load; take advantage of lexicographic sort order. • Numeric keys and lexicographic sort: store numbers big-endian; pad ASCII numbers with 0s. • Use reversal to have the most significant traits first: reverse URLs; reverse timestamps to get the most recent first. • Use composite keys to make the key distribute nicely and work well with sub-scans (e.g., User-ReverseTimeStamp). • Do not use the current timestamp as the first part of the row key!
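The encoding tricks on slide 87 all exist for one reason: HBase compares row keys byte-wise, so the encoding must make byte order match the order you want. Small sketches of each (helper names are mine):

```java
import java.nio.ByteBuffer;

// Row-key encodings whose byte order matches the desired logical order.
public class RowKeyTricks {
    // Big-endian encoding: byte order matches numeric order for non-negative longs.
    static byte[] bigEndian(long n) {
        return ByteBuffer.allocate(8).putLong(n).array();
    }

    // Zero-padding: as strings "2" > "10", but "02" < "10".
    static String padded(int n, int width) {
        return String.format("%0" + width + "d", n);
    }

    // Reverse timestamp: newer rows get smaller keys and sort first.
    static long reverseTs(long millis) {
        return Long.MAX_VALUE - millis;
    }
}
```

The anti-pattern in the last bullet follows directly: a key that starts with the current timestamp sends every write to the same "newest" region, creating a hot spot instead of spreading load.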
  • 88. Key take-aways • A denormalized schema localizes data for single lookups. • The row key is critical for lookups and subset scans. • Make sure row keys are distributed when writing. • Use bulk loads and MapReduce to reorganize or change your schema (during downtime). • Multiple clusters for different workloads if you can afford it.
  • 89. Outline • Motivation • Enter Apache Hadoop • Enter Apache HBase • Real-World Applications • System Architecture • Deployment (in the Cloud) • Conclusions
  • 90. A Typical Look... • 5-4000 commodity servers (8-core, 24GB RAM, 4-12TB, gig-E) • 2-level network architecture • 20-40 nodes per rack
  • 91. Hadoop cluster nodes • Master nodes (1 each): NameNode (metadata server and database); JobTracker (scheduler) • Slave nodes (1-4000 each): DataNodes (block storage); TaskTrackers (task execution)
  • 92. NameNode and Secondary NameNode • NameNode: the most critical node in the system. Stores file system metadata (directory structures, permissions) on disk and in memory; modifications are stored as an edit log. Fault tolerant but not highly available yet. • Secondary NameNode: not a hot standby! Gets a copy of the file system metadata and edit log; periodically compacts the image and edit log and ships them to the NameNode. • Make sure your DNS is set up properly!
  • 93. Data nodes • HDFS splits files into 64MB (or 128MB) blocks. • Data nodes store and serve these blocks. • By default, writes are pipelined to 3 different machines: the local machine, plus machines on other racks. • Locality helps significantly on subsequent reads and computation scheduling.
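The block figures on slide 93 make for quick back-of-the-envelope capacity math: a 1 GiB file at 64 MB blocks is 16 blocks, and with 3x replication that is 48 block replicas (3 GiB of raw disk). A small helper for the ceiling division (names are mine):

```java
// Back-of-the-envelope block math from the slide: 64 MB blocks, 3 replicas.
public class BlockMath {
    static long blocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;  // ceiling division
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long block  = 64L * 1024 * 1024;
        long b = blocks(oneGiB, block);
        System.out.println(b + " blocks, " + (b * 3) + " replicas on disk");
    }
}
```

The same arithmetic also explains why HDFS prefers large files: every block, however small, costs a fixed amount of NameNode memory.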
  • 94. Job Tracker and Task Trackers • Now, we want to process that data! • Job Tracker  • Schedules work and resource usage through out the cluster. • Makes sure work gets done. • Controls retry, speculative execution, etc. • Task Trackers Task Trackers • These slaves do the “map” and “reduce” work. • Co‐located with data nodes. 8/23/11  NoSQLNow! 2011 Conference 94
  • 95. HBase cluster nodes Master nodes (1 each) Slave nodes (1‐4000 each) NameNode (HDFS metadata server) HMaster HM t (region metadata) RegionServer HMaster (table server) (hot standby) (hot standby) DataNode (hdfs block server) Coordination nodes  (odd number) ZooKeeper  Quorum Peer 8/23/11  NoSQLNow! 2011 Conference 95
• 96. HMaster and ZooKeeper • HMaster • Controls which regions are served by which region servers. • Assigns regions to new region servers when they arrive or go down. • Can have a hot standby master if the main master goes down. • All region state kept in ZooKeeper. • Apache ZooKeeper • Highly available system for coordination. • Generally 3 or 5 machines (always an odd number). • Uses consensus to guarantee common shared state. • Writes are considered expensive.
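For reference, a 3-machine ZooKeeper ensemble like the one described above is configured with a zoo.cfg along these lines (hostnames are placeholders; values are illustrative defaults):

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```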
• 97. Region Server • Tables are chopped up into regions. • Regions are served by a "region server". • A region is only served by one region server at a time. • Load balancing if a region server goes down. • Co-locate region servers with data nodes. • Takes advantage of HDFS file locality. • Important that clocks are in reasonable sync. Use NTP!
• 98. Stability and Tuning Hints • Monitor your cluster. • Avoid memory swapping. • Do not oversubscribe memory. • Can suffer from cascading failures. • Mostly scan jobs? • Small read cache, low swappiness. • Large max region size for large column families. • Avoid costly "region splits". • Make the ZK timeout higher. • Longer to recover from failure, but prevents cascading failure.
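Two of these hints map to hbase-site.xml settings; a hedged sketch (the values are illustrative starting points, not recommendations):

```xml
<configuration>
  <!-- Higher ZK session timeout: slower failure detection,
       but less chance of a GC pause triggering a cascading failure -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>180000</value>
  </property>
  <!-- Larger max region size (1GB here) to avoid costly region splits -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>1073741824</value>
  </property>
</configuration>
```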
• 99. Outline • Motivation • Enter Apache Hadoop • Enter Apache HBase • Real-World Applications • System Architecture • Deployment (in the Cloud) • Conclusions
• 101. We’ll use Apache Whirr • Apache Whirr is a set of tools and libraries for deploying clusters on cloud services in a cloud-neutral way.
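A Whirr recipe is a plain properties file; a minimal sketch of an hbase-ec2.properties, based on Whirr 0.5-era conventions (cluster name, role names, and node counts are illustrative):

```properties
whirr.cluster-name=hbase-demo
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master,5 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```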
• 102. This is great for setting up a cluster...
jon@grimlock:~/whirr-0.5.0-incubating$ bin/whirr launch-cluster --config recipes/hbase-ec2.properties
jon@grimlock:~/whirr-0.5.0-incubating$ bin/whirr launch-cluster --config recipes/scm-ec2.properties
• 103. ...and an easy way to tear down a cluster.
jon@grimlock:~/whirr-0.5.0-incubating$ bin/whirr destroy-cluster --config recipes/hbase-ec2.properties
jon@grimlock:~/whirr-0.5.0-incubating$ bin/whirr destroy-cluster --config recipes/scm-ec2.properties
• 104. But how do you manage a cluster deployment?
• 105. Interact with your cluster with Hue
• 106. What did we just do? • Whirr • Provisioned the machines on EC2. • Installed SCM on all the machines. • Cloudera Service and Configuration Manager Express • Orchestrated the deployment of the services in the proper order: ZooKeeper, Hadoop, HDFS, HBase, and Hue. • Set service configurations. • Free download for kicking off up to 50 nodes! http://www.cloudera.com/products-services/scm-express/
• 107. To Cloud or not to Cloud? • The key feature of a cloud deployment • Elasticity: the ability to expand and contract the number of machines being used on demand. • Things to consider: • Economics of machines and people. • Capital expenses vs. operational expenses. • Workload requirements: performance / SLAs. • Previous investments.
• 108. Economics of a cluster
EC2 cloud deployment:
• 10 small instances (1 core, 1.7GB RAM, 160GB disk): $0.085/hr/machine => $7,446/yr
• Reserved: $227.50/yr/machine + $0.03/hr/machine => $4,903/yr
• 10 Dedicated-Reserved Quad XL instances (8 core, 23GB RAM, 1690GB disk): $6,600/yr/machine + $0.84/hr/machine + $10/hr/region => $66,000 + $73,584 + $87,600 => $227,184/yr
Private cluster:
• 10 commodity servers (8 core, 24GB RAM, 6TB disk): $6,500/machine => $65,000
• + physical space, networking gear, power, admin costs, and more setup time
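The yearly figures above fall out of the hourly rates, assuming 8,760 hours/year and 10 machines; a quick sketch:

```shell
#!/bin/sh
# Sanity-check the EC2 yearly cost figures from the slide
# (8760 hours/year, 10 machines; rates as quoted in 2011).

# On-demand small instances: $0.085/hr/machine
awk 'BEGIN { printf "on-demand: $%.0f/yr\n", 0.085 * 8760 * 10 }'

# Reserved small instances: $227.50/yr/machine + $0.03/hr/machine
awk 'BEGIN { printf "reserved:  $%.0f/yr\n", (227.50 + 0.03 * 8760) * 10 }'

# Dedicated-Reserved Quad XL: $6600/yr/machine + $0.84/hr/machine + $10/hr/region
awk 'BEGIN { printf "dedicated: $%.0f/yr\n", 6600*10 + 0.84*8760*10 + 10*8760 }'
```

This reproduces the slide's $7,446, $4,903, and $227,184 yearly totals.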
• 109. Pros of using the cloud • With virtualized machines, you can install any software you want! • Great for bursty or occasional loads that expand and shrink. • Great if your apps and data are already in the cloud. • Logs already live in S3, for example. • Great if you don’t have a large ops team. • Save money on people dealing with colos and hardware failures. • Steadier ops team personnel requirements (unless catastrophic failure). • Great for experimentation. • Great for testing/QA clusters at scale.
• 110. Cons of using the cloud • Getting data in and out of EC2. • Not cost, but amount of time. • AWS Direct Connect can help. • Virtualization causes varying network connectivity. • ZooKeeper timeouts can cause cascading failure. • Some connections fast, others slow. • Dedicated or Cluster-compute instances could improve this. • Virtualization causes unpredictable IO performance. • EBS is like a SAN and an eventual bottleneck. • Ephemeral disks perform better but are not recoverable on failure. • Still need to deal with disaster recovery. • What happens if EC2 or a region goes down? (4/21/11)
• 111. Cloudera’s Experience with Hadoop in the Cloud • Some enterprise Hadoop/MR users use the cloud. • Good for daily jobs with moderate amounts of data (GBs), generally computationally expensive. • Ex: periodic matching or recommendation applications. • Spin up cluster. • Upload a data set to S3. • Do an n² matching or recommendation job. • Mapper expands data. • Reducer gets a small amount of data back. • Write to S3. • Download result set. • Tear down cluster.
• 112. Cloudera’s Experience with HBase in the cloud • Almost all enterprise HBase users use physical hardware. • Some initially used the cloud, but transitioned to physical hardware. • One story: • EC2: 40 nodes on EC2 XL instances. • Bought 10 physical machines and got similar or better performance. • Why? • Physical hardware gives more control of machine build-out, network infrastructure, and locality, which are critical for performance. • HBase is up all the time and usually grows over time.
• 113. Outline • Motivation • Enter Apache Hadoop • Enter Apache HBase • Real-World Applications • System Architecture • Deployment (in the Cloud) • Conclusions
• 114. Key takeaways • Apache HBase is not a database! There are other scalable databases. • Query-centric schema design, not data-centric schema design. • In production at 100s of TB scale at several large enterprises. • If you are restructuring your SQL DB to optimize it, you may be a candidate for HBase. • HBase complements and depends upon Hadoop. • Hadoop makes sense in the cloud for some production workloads. • HBase in the cloud is for experimentation, but it generally runs on physical hardware in production.
• 115. HBase vs RDBMS
                              RDBMS                        | HBase
Data layout                   Row-oriented                 | Column-family-oriented
Transactions                  Multi-row ACID               | Single row only
Query language                SQL                          | get/put/scan/etc.*
Security                      Authentication/Authorization | Work in progress
Indexes                       On arbitrary columns         | Row-key only*
Max data size                 TBs                          | ~1PB
Read/write throughput limits  1000s of queries/second      | Millions of “queries”/second
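The get/put/scan interface looks like this in an illustrative hbase shell session (table, family, and column names are made up for the example):

```
hbase(main):001:0> create 'users', 'info'
hbase(main):002:0> put 'users', 'row1', 'info:email', 'alice@example.com'
hbase(main):003:0> get 'users', 'row1'
hbase(main):004:0> scan 'users', {STARTROW => 'row1', LIMIT => 10}
```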
• 116. HBase vs other “NoSQL” • Favors strict consistency over availability (but availability is good in practice!) • Great Hadoop integration (very efficient bulk loads, MapReduce analysis) • Ordered range partitions (not hash) • Automatically shards/scales (just turn on more servers; really proven at petabyte scale) • Sparse column storage (not key-value)
• 117. HBase vs just HDFS
                        Plain HDFS/MR                                  | HBase
Write pattern           Append-only                                    | Random write, bulk incremental
Read pattern            Full table scan, partition table scan          | Random read, small range scan, or table scan
Hive (SQL) performance  Very good                                      | 4-5x slower
Structured storage      Do-it-yourself / TSV / SequenceFile / Avro / ? | Sparse column-family data model
Max data size           30+ PB                                         | ~1PB
If you have neither random write nor random read, stick to HDFS!
• 118. More resources? • Download Hadoop and HBase! • CDH – Cloudera’s Distribution including Apache Hadoop: http://cloudera.com/ • http://hadoop.apache.org/ • Try it out! (Locally, VM, or EC2) • Watch free training videos on http://cloudera.com/
• 119. jon@cloudera.com @jmhsieh QUESTIONS?