SlideShare una empresa de Scribd logo
1 de 60
Descargar para leer sin conexión
Rainbird:
Real-time Analytics @Twitter
Kevin Weil -- @kevinweil
Product Lead for Revenue, Twitter




                                    TM
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
My Background
‣   Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh
    routing algorithms, GBs of data
‣   Cooliris (web media): Hadoop and Pig for
    analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, Cassandra, data
    viz, social graph analysis, soon to be PBs of data
My Background
‣   Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh
    routing algorithms, GBs of data
‣   Cooliris (web media): Hadoop and Pig for
    analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, Cassandra, data
    viz, social graph analysis, soon to be PBs of data
    Now revenue products!
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Why Real-time Analytics
‣   Twitter is real-time
Why Real-time Analytics
‣   Twitter is real-time
‣   ... even in space
And My Personal Favorite
And My Personal Favorite
Real-time Reporting
‣   Discussion around ad-based revenue model
‣   Help shape the conversation in real-time with
    Promoted Tweets
Real-time Reporting
‣   Discussion around ad-based revenue model
‣   Help shape the conversation in real-time with
    Promoted Tweets
‣   Realtime reporting
    ties it all together
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS

‣   Horizontally scalable (reads, storage, etc)
‣      Needs to scale to 100+ TB
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS

‣   Horizontally scalable (reads, storage, etc)
‣      Needs to scale to 100+ TB

‣   Low latency
‣      Most reads <100 ms (esp. recent data)
Cassandra
‣   Pro: In-house expertise
‣   Pro: Open source Apache project
‣   Pro: Writes are extremely fast
‣   Pro: Horizontally scalable, low latency
‣   Pro: Other startup adoption (Digg, SimpleGeo)
Cassandra
‣   Pro: In-house expertise
‣   Pro: Open source Apache project
‣   Pro: Writes are extremely fast
‣   Pro: Horizontally scalable, low latency
‣   Pro: Other startup adoption (Digg, SimpleGeo)




‣   Con: It was really young (0.3a)
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
‣   A dude from
    Sweden began helping: @skr
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
‣   A dude from
    Sweden began helping: @skr


‣   Now all at Twitter :)
Rainbird
‣   It counts things. Really quickly.
‣   Layers on top of the distributed
    counters patch, CASSANDRA-1072
Rainbird
‣   It counts things. Really quickly.
‣   Layers on top of the distributed
    counters patch, CASSANDRA-1072


‣   Relies on Zookeeper, Cassandra, Scribe, Thrift
‣   Written in Scala
Rainbird Design
‣   Aggregators
    buffer for 1m
‣   Intelligent
    flush to
    Cassandra
‣   Query
    servers read
    once written
‣   1m is
    configurable
Rainbird Data Structures
struct Event
{
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Unix timestamp of event
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Stat category name
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Stat keys (hierarchical)
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Actual count (diff)
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               More later
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Hierarchical Aggregation
‣   Say we’re counting Promoted Tweet impressions
‣   category = pti
‣   keys = [advertiser_id, campaign_id, tweet_id]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [advertiser_id, campaign_id, tweet_id]
‣      [advertiser_id, campaign_id]
‣      [advertiser_id]
‣   Means fast queries over each level of hierarchy
‣   Configurable in rainbird.conf, or dynamically via ZK
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]
‣      [com, amazon]
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 full URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 any music.amazon.com URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 any amazon.com URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 any .com URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Temporal Aggregation
‣   Rainbird also does (configurable) temporal
    aggregation
‣   Each count is kept minutely, but also
    denormalized hourly, daily, and all time
‣   Gives us quick counts at varying granularities
    with no large scans at read time
‣      Trading storage for latency
Multiple Formulas
‣   So far we have talked about sums
‣   Could also store counts (1 for each event)
‣   ... which gives us a mean
‣   And sums of squares (count * count for each event)
‣   ... which gives us a standard deviation
‣   And min/max as well


‣   Configure this per-category in rainbird.conf
Rainbird
‣   Write 100,000s of events per second, each with
    hierarchical structure
‣   Query with minutely granularity over any level of
    the hierarchy, get back a time series
‣   Or query all time values
‣   Or query all time means, standard deviations
‣   Latency < 100ms
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Production Uses
‣   It turns out we need to count things all the time
‣   As soon as we had this service, we started
    finding all sorts of use cases for it
‣      Promoted Products
‣      Tweeted URLs, by domain/subdomain
‣      Per-user Tweet interactions (fav, RT, follow)
‣      Arbitrary terms in Tweets
‣      Clicks on t.co URLs
Use Cases
‣   Promoted Tweet Analytics
Each different metric is part
Production Uses                of the key hierarchy

‣   Promoted Tweet Analytics
Uses the temporal
                               aggregation to quickly show
Production Uses                different levels of granularity

‣   Promoted Tweet Analytics
Data can be historical, or
Production Uses                from 60 seconds ago

‣   Promoted Tweet Analytics
Production Uses
‣   Internal Monitoring and Alerting




‣   We require operational reporting on all internal services
‣   Needs to be real-time, but also want longer-term
    aggregates
‣   Hierarchical, too: [stat,   datacenter, service, machine]
Production Uses
‣   Tweet Button Counts




‣   Tweet Button counts are requested many many
    times each day from across the web
‣   Uses the all time field
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Open Source?
‣   Yes!
Open Source?
‣   Yes!   ... but not yet
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
‣   It will happen
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
‣   It will happen
‣   See http://github.com/twitter for proof of how much
    Twitter    open source
Team
‣   John Corwin (@johnxorz)
‣   Adam Samet (@damnitsamet)
‣   Johan Oskarsson (@skr)
‣   Kelvin Kakugawa (@kelvin)
‣   Chris Goffinet (@lenn0x)
‣   Steve Jiang (@sjiang)
‣   Kevin Weil (@kevinweil)
If You Only Remember One Slide...
‣   Rainbird is a distributed, high-volume counting
    service built on top of Cassandra
‣   Write 100,000s events per second, query it with
    hierarchy and multiple time granularities, returns
    results in <100 ms
‣   Used by Twitter for multiple products internally,
    including our Promoted Products, operational
    monitoring and Tweet Button
‣   Will be open sourced so the community can use and
    improve it!
Questions?
        Follow me: @kevinweil




                       TM

Más contenido relacionado

La actualidad más candente

KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...confluent
 
Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Roopa Tangirala
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidImply
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkDataWorks Summit
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsGuozhang Wang
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.Dan Harvey
 
Distributed Counters in Cassandra (Cassandra Summit 2010)
Distributed Counters in Cassandra (Cassandra Summit 2010)Distributed Counters in Cassandra (Cassandra Summit 2010)
Distributed Counters in Cassandra (Cassandra Summit 2010)kakugawa
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterKevin Weil
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaChris Bailey
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database Systemconfluent
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka IntroductionAmita Mirajkar
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Big Data in Real-Time at Twitter
Big Data in Real-Time at TwitterBig Data in Real-Time at Twitter
Big Data in Real-Time at Twitternkallen
 

La actualidad más candente (20)

KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
KSQL Performance Tuning for Fun and Profit ( Nick Dearden, Confluent) Kafka S...
 
Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup) Polyglot persistence @ netflix (CDE Meetup)
Polyglot persistence @ netflix (CDE Meetup)
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on Druid
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.Change data capture with MongoDB and Kafka.
Change data capture with MongoDB and Kafka.
 
Distributed Counters in Cassandra (Cassandra Summit 2010)
Distributed Counters in Cassandra (Cassandra Summit 2010)Distributed Counters in Cassandra (Cassandra Summit 2010)
Distributed Counters in Cassandra (Cassandra Summit 2010)
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at Twitter
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient Java
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Big Data in Real-Time at Twitter
Big Data in Real-Time at TwitterBig Data in Real-Time at Twitter
Big Data in Real-Time at Twitter
 

Destacado

Scalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on RailsScalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on RailsJared Rosoff
 
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Kevin Weil
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Kevin Weil
 
NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)Kevin Weil
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
MongoDB at the energy frontier
MongoDB at the energy frontierMongoDB at the energy frontier
MongoDB at the energy frontierValentin Kuznetsov
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdKevin Weil
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Kevin Weil
 
Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Kevin Weil
 
How to Start Using Analytics Without Feeling Overwhelmed
How to Start Using Analytics Without Feeling OverwhelmedHow to Start Using Analytics Without Feeling Overwhelmed
How to Start Using Analytics Without Feeling OverwhelmedKissmetrics on SlideShare
 
Email Optimization: A discussion about how A/B testing generated $500 million...
Email Optimization: A discussion about how A/B testing generated $500 million...Email Optimization: A discussion about how A/B testing generated $500 million...
Email Optimization: A discussion about how A/B testing generated $500 million...MarketingSherpa
 
Human resonance for leaders pap 2015
Human resonance for leaders pap 2015Human resonance for leaders pap 2015
Human resonance for leaders pap 2015Bernhard K.F. Pelzer
 
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013Passle
 
Using Single Keyword Ad Groups To Drive PPC Performance
Using Single Keyword Ad Groups To Drive PPC PerformanceUsing Single Keyword Ad Groups To Drive PPC Performance
Using Single Keyword Ad Groups To Drive PPC PerformanceSam Owen
 
The Social Consumer Study
The Social Consumer StudyThe Social Consumer Study
The Social Consumer StudyDon Bulmer
 
Brand Equity Measures
Brand Equity MeasuresBrand Equity Measures
Brand Equity Measuressat2coolguy
 
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...Affinitive
 
The Response Rates of Personalized Cross-Media Marketing Campaigns
The Response Rates of Personalized Cross-Media Marketing CampaignsThe Response Rates of Personalized Cross-Media Marketing Campaigns
The Response Rates of Personalized Cross-Media Marketing CampaignsAaron Corson
 

Destacado (18)

Scalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on RailsScalable Event Analytics with MongoDB & Ruby on Rails
Scalable Event Analytics with MongoDB & Ruby on Rails
 
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)
 
Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)Hadoop and pig at twitter (oscon 2010)
Hadoop and pig at twitter (oscon 2010)
 
NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
MongoDB at the energy frontier
MongoDB at the energy frontierMongoDB at the energy frontier
MongoDB at the energy frontier
 
Hadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant birdHadoop summit 2010 frameworks panel elephant bird
Hadoop summit 2010 frameworks panel elephant bird
 
Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
 
Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010Big Data at Twitter, Chirp 2010
Big Data at Twitter, Chirp 2010
 
How to Start Using Analytics Without Feeling Overwhelmed
How to Start Using Analytics Without Feeling OverwhelmedHow to Start Using Analytics Without Feeling Overwhelmed
How to Start Using Analytics Without Feeling Overwhelmed
 
Email Optimization: A discussion about how A/B testing generated $500 million...
Email Optimization: A discussion about how A/B testing generated $500 million...Email Optimization: A discussion about how A/B testing generated $500 million...
Email Optimization: A discussion about how A/B testing generated $500 million...
 
Human resonance for leaders pap 2015
Human resonance for leaders pap 2015Human resonance for leaders pap 2015
Human resonance for leaders pap 2015
 
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013
Why So Many Businesses are Hopeless at Blogging - New Statistics for 2013
 
Using Single Keyword Ad Groups To Drive PPC Performance
Using Single Keyword Ad Groups To Drive PPC PerformanceUsing Single Keyword Ad Groups To Drive PPC Performance
Using Single Keyword Ad Groups To Drive PPC Performance
 
The Social Consumer Study
The Social Consumer StudyThe Social Consumer Study
The Social Consumer Study
 
Brand Equity Measures
Brand Equity MeasuresBrand Equity Measures
Brand Equity Measures
 
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...
Novartis: Prevacid 24HR Online Consumer Trial Panel and Brand Ambassador Prog...
 
The Response Rates of Personalized Cross-Media Marketing Campaigns
The Response Rates of Personalized Cross-Media Marketing CampaignsThe Response Rates of Personalized Cross-Media Marketing Campaigns
The Response Rates of Personalized Cross-Media Marketing Campaigns
 

Similar a Rainbird: Realtime Analytics at Twitter (Strata 2011)

Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac DawsonCODE BLUE
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleSriram Krishnan
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWSSungmin Kim
 
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformReal-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformScyllaDB
 
Using Event Streams in Serverless Applications
Using Event Streams in Serverless ApplicationsUsing Event Streams in Serverless Applications
Using Event Streams in Serverless ApplicationsJonathan Dee
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
 
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started DemoIan Massingham
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...Amazon Web Services
 
AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
 AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
AWS Cloud Kata 2014 | Jakarta - 2-3 Big DataAmazon Web Services
 
TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkAnirudh Todi
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Functional architectural patterns
Functional architectural patternsFunctional architectural patterns
Functional architectural patternsLars Albertsson
 
Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Introduction to Artificial Intelligence and Machine Learning services at AWS ...Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Introduction to Artificial Intelligence and Machine Learning services at AWS ...Amazon Web Services
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPDaniel Zivkovic
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon TokyoSid Anand
 
Big Data: Mejores prácticas en AWS
Big Data: Mejores prácticas en AWSBig Data: Mejores prácticas en AWS
Big Data: Mejores prácticas en AWSAmazon Web Services
 

Similar a Rainbird: Realtime Analytics at Twitter (Strata 2011) (20)

Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
 
Streams
StreamsStreams
Streams
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
 
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformReal-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
 
Using Event Streams in Serverless Applications
Using Event Streams in Serverless ApplicationsUsing Event Streams in Serverless Applications
Using Event Streams in Serverless Applications
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
 
AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
 AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
 
Tsar tech talk
Tsar tech talkTsar tech talk
Tsar tech talk
 
TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech Talk
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Functional architectural patterns
Functional architectural patternsFunctional architectural patterns
Functional architectural patterns
 
Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Introduction to Artificial Intelligence and Machine Learning services at AWS ...Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Introduction to Artificial Intelligence and Machine Learning services at AWS ...
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
 
Big Data: Mejores prácticas en AWS
Big Data: Mejores prácticas en AWSBig Data: Mejores prácticas en AWS
Big Data: Mejores prácticas en AWS
 

Rainbird: Realtime Analytics at Twitter (Strata 2011)

  • 1. Rainbird: Real-time Analytics @Twitter Kevin Weil -- @kevinweil Product Lead for Revenue, Twitter TM
  • 2. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 3. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
  • 4. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data Now revenue products!
  • 5. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 6. Why Real-time Analytics ‣ Twitter is real-time
  • 7. Why Real-time Analytics ‣ Twitter is real-time ‣ ... even in space
  • 8. And My Personal Favorite
  • 9. And My Personal Favorite
  • 10. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets
  • 11. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets ‣ Realtime reporting ties it all together
  • 12. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 13. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS
  • 14. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS
  • 15. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc) ‣ Needs to scale to 100+ TB
  • 16. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc) ‣ Needs to scale to 100+ TB ‣ Low latency ‣ Most reads <100 ms (esp. recent data)
  • 17. Cassandra ‣ Pro: In-house expertise ‣ Pro: Open source Apache project ‣ Pro: Writes are extremely fast ‣ Pro: Horizontally scalable, low latency ‣ Pro: Other startup adoption (Digg, SimpleGeo)
  • 18. Cassandra ‣ Pro: In-house expertise ‣ Pro: Open source Apache project ‣ Pro: Writes are extremely fast ‣ Pro: Horizontally scalable, low latency ‣ Pro: Other startup adoption (Digg, SimpleGeo) ‣ Con: It was really young (0.3a)
  • 19. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
  • 20. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin
  • 21. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x
  • 22. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x ‣ A dude from Sweden began helping: @skr
  • 23. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x ‣ A dude from Sweden began helping: @skr ‣ Now all at Twitter :)
  • 24. Rainbird ‣ It counts things. Really quickly. ‣ Layers on top of the distributed counters patch, CASSANDRA-1072
  • 25. Rainbird ‣ It counts things. Really quickly. ‣ Layers on top of the distributed counters patch, CASSANDRA-1072 ‣ Relies on Zookeeper, Cassandra, Scribe, Thrift ‣ Written in Scala
  • 26. Rainbird Design ‣ Aggregators buffer for 1m ‣ Intelligent flush to Cassandra ‣ Query servers read once written ‣ 1m is configurable
  • 27. Rainbird Data Structures struct Event { 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 28. Rainbird Data Structures struct Event { Unix timestamp of event 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 29. Rainbird Data Structures struct Event { Stat category name 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 30. Rainbird Data Structures struct Event { Stat keys (hierarchical) 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 31. Rainbird Data Structures struct Event { Actual count (diff) 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 32. Rainbird Data Structures struct Event { More later 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 33. Hierarchical Aggregation ‣ Say we’re counting Promoted Tweet impressions ‣ category = pti ‣ keys = [advertiser_id, campaign_id, tweet_id] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [advertiser_id, campaign_id, tweet_id] ‣ [advertiser_id, campaign_id] ‣ [advertiser_id] ‣ Means fast queries over each level of hierarchy ‣ Configurable in rainbird.conf, or dynamically via ZK
  • 34. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] ‣ [com, amazon] ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 35. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] full URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 36. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] any music.amazon.com URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 37. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] any amazon.com URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 38. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] any .com URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 39. Temporal Aggregation ‣ Rainbird also does (configurable) temporal aggregation ‣ Each count is kept minutely, but also denormalized hourly, daily, and all time ‣ Gives us quick counts at varying granularities with no large scans at read time ‣ Trading storage for latency
  • 40. Multiple Formulas ‣ So far we have talked about sums ‣ Could also store counts (1 for each event) ‣ ... which gives us a mean ‣ And sums of squares (count * count for each event) ‣ ... which gives us a standard deviation ‣ And min/max as well ‣ Configure this per-category in rainbird.conf
  • 41. Rainbird ‣ Write 100,000s of events per second, each with hierarchical structure ‣ Query with minutely granularity over any level of the hierarchy, get back a time series ‣ Or query all time values ‣ Or query all time means, standard deviations ‣ Latency < 100ms
  • 42. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 43. Production Uses ‣ It turns out we need to count things all the time ‣ As soon as we had this service, we started finding all sorts of use cases for it ‣ Promoted Products ‣ Tweeted URLs, by domain/subdomain ‣ Per-user Tweet interactions (fav, RT, follow) ‣ Arbitrary terms in Tweets ‣ Clicks on t.co URLs
  • 44. Use Cases ‣ Promoted Tweet Analytics
  • 45. Each different metric is part Production Uses of the key hierarchy ‣ Promoted Tweet Analytics
  • 46. Uses the temporal aggregation to quickly show Production Uses different levels of granularity ‣ Promoted Tweet Analytics
  • 47. Data can be historical, or Production Uses from 60 seconds ago ‣ Promoted Tweet Analytics
  • 48. Production Uses ‣ Internal Monitoring and Alerting ‣ We require operational reporting on all internal services ‣ Needs to be real-time, but also want longer-term aggregates ‣ Hierarchical, too: [stat, datacenter, service, machine]
  • 49. Production Uses ‣ Tweet Button Counts ‣ Tweet Button counts are requested many many times each day from across the web ‣ Uses the all time field
  • 50. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 52. Open Source? ‣ Yes! ... but not yet
  • 53. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra
  • 54. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8)
  • 55. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source
  • 56. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source ‣ It will happen
  • 57. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source ‣ It will happen ‣ See http://github.com/twitter for proof of how much Twitter open source
  • 58. Team ‣ John Corwin (@johnxorz) ‣ Adam Samet (@damnitsamet) ‣ Johan Oskarsson (@skr) ‣ Kelvin Kakugawa (@kelvin) ‣ Chris Goffinet (@lenn0x) ‣ Steve Jiang (@sjiang) ‣ Kevin Weil (@kevinweil)
  • 59. If You Only Remember One Slide... ‣ Rainbird is a distributed, high-volume counting service built on top of Cassandra ‣ Write 100,000s events per second, query it with hierarchy and multiple time granularities, returns results in <100 ms ‣ Used by Twitter for multiple products internally, including our Promoted Products, operational monitoring and Tweet Button ‣ Will be open sourced so the community can use and improve it!
  • 60. Questions? Follow me: @kevinweil TM

Notas del editor

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. \n
  29. \n
  30. \n
  31. \n
  32. \n
  33. \n
  34. \n
  35. \n
  36. \n
  37. \n
  38. \n
  39. \n
  40. \n
  41. \n
  42. \n
  43. \n
  44. \n
  45. \n
  46. \n
  47. \n
  48. \n
  49. \n
  50. \n
  51. \n
  52. \n
  53. \n
  54. \n
  55. \n
  56. \n
  57. \n
  58. \n
  59. \n
  60. \n