+
Lucas.Waye @ TiVo.com
April 5th, 2018
About Me
What’s using Presto:
Targeted Audience Delivery
TV networks, programmers,
and advertisers
What are my target
viewership segments?
Set-Top box data
Purchasing Behavior
Location-based Consumer Data
Targeted Audience Delivery
Program Metadata
brought to you (in part) by
looking to the past for inspiration for the future
Similar Products at TiVo
[Architecture diagram: ETL, Amazon S3, ETL, Amazon Redshift, MySQL (RDS), Java services on EC2]
transactional and customer-configurable data
semi-aggregated viewership data + sets of households (e.g., “18-24 years old”, “owns minivan”)
New Product, New Challenges…
[Architecture diagram as before: ETL, Amazon S3, ETL, Amazon Redshift, MySQL (RDS), Java services on EC2, now with additional MySQL instances]
Many new data marts popping up in our tech stack
more viewership data
OK, storage is cheap
storage is not cheap…
Need finer-grain data!
Can’t aggregate as much
static, hard to scale
Wait, what about Redshift Spectrum?
Redshift Spectrum
Redshift: Pay per
node-hour
Spectrum: Pay per
data access
How Does it Scale?
Experiment: join on two tables
• Small Joins: join small Redshift table with (filtered-down) large table on S3
  • Join across ~1M rows
• Large Joins: join large Redshift table with (unfiltered) large table on S3
  • Join across ~10M rows
Compare to: both tables on Redshift
Redshift Spectrum for “Simple” Queries (small joins)
[Figure: latency (sec) vs. number of concurrent requests (1 to 15); series: “1 day” and “1 day (Spectrum)”]
Spectrum is faster when the cluster is loaded and can pre-filter/pre-aggregate data
Redshift Spectrum for Complex Queries
[Figure: time (sec) vs. concurrent queries]
Spectrum slower!
Amdahl’s Law in Effect
Memory for the broadcast join is a non-parallelizable resource in the cluster
“Operations that can't be pushed to the Redshift Spectrum
layer include [JOIN], DISTINCT and ORDER BY. …
When large amounts of data are returned from Amazon S3,
the processing is limited by your cluster's resources.”
https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-performance.html
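The limit described above can be made concrete with Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that parallelizes and n is the number of parallel units. The sketch below is only an illustration: the function name is hypothetical and no fraction here was measured in the experiment.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

With half of the work stuck on the cluster (p = 0.5), the speedup stays below 2 no matter how much parallelism the Spectrum layer adds.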
Wait, what about Redshift Spectrum?
Our queries won’t work well on Spectrum.
Well, what about [X] ?
Our Choice:
• Storage/Compute Separation
• Easy to add and remove worker nodes
• Query many different data sources (inside our VPC) without separate load
• Good performance for analytical queries.
  Not so good for transactional and simple queries…
• Managed (e.g., Qubole, Starburst)
Coordinator
Worker Worker Worker
S3 / Hive
metastore
MySQL
Connector
Connector
SELECT SUM(v.seconds_viewed)
FROM hive.db.viewership v
JOIN mysql.db.audiences a ON a.hh_id = v.hh_id
WHERE audience_id = 42
mysql catalog →
hive catalog →
SELECT …
FROM db.audiences
WHERE audience_id = 42
DRAFT - TiVo Confidential 2018
How Presto Works
Data is streamed back to the workers
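The flow can be imitated in plain Python. This is a toy model, not Presto code: the filter on audience_id runs on the MySQL side, the matching rows are streamed back, and the join and SUM happen on the worker side. The table contents and function names are hypothetical, and the sketch assumes each household appears at most once per audience.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-ins for mysql.db.audiences and hive.db.viewership.
AUDIENCES: List[Dict[str, int]] = [
    {"audience_id": 42, "hh_id": 1},
    {"audience_id": 42, "hh_id": 2},
    {"audience_id": 7, "hh_id": 3},
]
VIEWERSHIP: List[Dict[str, int]] = [
    {"hh_id": 1, "seconds_viewed": 120},
    {"hh_id": 2, "seconds_viewed": 30},
    {"hh_id": 3, "seconds_viewed": 999},
    {"hh_id": 1, "seconds_viewed": 60},
]


def scan_audiences(audience_id: int) -> Iterator[Dict[str, int]]:
    """Plays the MySQL connector: the WHERE clause is applied at the source."""
    return (row for row in AUDIENCES if row["audience_id"] == audience_id)


def scan_viewership() -> Iterator[Dict[str, int]]:
    """Plays the Hive connector: rows are streamed without a filter."""
    return iter(VIEWERSHIP)


def total_seconds_viewed(audience_id: int) -> int:
    """Plays the workers: build from the small side, probe with the large side."""
    households = {row["hh_id"] for row in scan_audiences(audience_id)}
    return sum(
        row["seconds_viewed"]
        for row in scan_viewership()
        if row["hh_id"] in households
    )
```

Calling `total_seconds_viewed(42)` mirrors the example query with `audience_id = 42`.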
First Challenge:
What instance types should we use?
Presto Worker Memory
System Memory: reserved-system-memory = 0.4 * JVM Max Memory
Reserved Memory: max-memory-per-node
General Memory: (the rest)
All queries start using memory from the General Memory pool
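The split can be written out as arithmetic. A minimal sketch, assuming sizes in gigabytes and using hypothetical names (`WorkerPools`, `split_worker_memory`); only the 0.4 factor and the three pool definitions come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPools:
    system_gb: float    # reserved-system-memory
    reserved_gb: float  # max-memory-per-node
    general_gb: float   # the rest


def split_worker_memory(
    jvm_max_gb: float,
    max_memory_per_node_gb: float,
    system_fraction: float = 0.4,
) -> WorkerPools:
    """Divide one worker's JVM heap into system, reserved, and general pools."""
    system = system_fraction * jvm_max_gb
    general = jvm_max_gb - system - max_memory_per_node_gb
    if general < 0:
        raise ValueError("max-memory-per-node does not fit beside system memory")
    return WorkerPools(system, max_memory_per_node_gb, general)
```

For example, a 100 GB heap with max-memory-per-node set to 20 GB leaves roughly 40 GB for system memory, 20 GB reserved, and 40 GB general.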
Query needs more memory than in the General Pool → switch to Reserved
Only one query allowed in the Reserved Pool!
Query needs more memory than in the Reserved Pool → fail
But there’s available memory??
Query needs more memory than in the Reserved Pool → keep allocating (resource overcommit)
But now a single query can hog the entire cluster!
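The decision sequence on these slides can be condensed into one function. This is a toy model of the slides, not Presto's actual memory manager: `place_query` and its return labels are hypothetical, and treating a second query as "blocked" while the Reserved Pool is occupied is an assumption about what "only one query allowed" means in practice.

```python
def place_query(
    needed_gb: float,
    general_free_gb: float,
    reserved_gb: float,
    reserved_busy: bool,
    overcommit: bool = False,
) -> str:
    """Return where a query's memory ends up under the slides' simplified rules."""
    if needed_gb <= general_free_gb:
        return "general"
    if reserved_busy:
        # Assumption: the Reserved Pool holds one query at a time.
        return "blocked"
    if needed_gb <= reserved_gb:
        return "reserved"
    # Beyond the Reserved Pool: fail, or keep allocating if overcommit is on.
    return "overcommit" if overcommit else "fail"
```

The last branch shows the trade-off: without overcommit a query fails even though other memory is free, and with it a single query can take over the cluster.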
Multiple Workers
[Diagram: several Presto workers, each with its own worker memory, with queries spread across them]
Total Memory: max-memory = max-memory-per-node * number of nodes
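The total-memory formula makes it easy to compare cluster shapes. In this small sketch the `ClusterShape` type and the example sizes are hypothetical; only the multiplication comes from the slide.

```python
from typing import NamedTuple


class ClusterShape(NamedTuple):
    nodes: int
    max_memory_per_node_gb: float


def cluster_max_memory_gb(shape: ClusterShape) -> float:
    """max-memory = max-memory-per-node * number of nodes."""
    return shape.max_memory_per_node_gb * shape.nodes


# Two hypothetical shapes with the same total but different per-node limits.
many_small = ClusterShape(nodes=8, max_memory_per_node_gb=30.0)
few_large = ClusterShape(nodes=2, max_memory_per_node_gb=120.0)
```

Both shapes give 240 GB in total, so the total alone cannot settle the choice between many inexpensive instances and a few expensive ones.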
Working With Reserved Memory Pool
Conceptually, the reserved memory pool should be the “high water mark”, while most queries complete in the general pool.
How do we achieve that?
• What if memory usage varies a lot between different queries?
• Use many inexpensive instances, or a few expensive instances?
• Compute optimized or memory optimized?
Solution: multiple clusters based on workload
Empirical testing found that a smaller cluster size was slightly faster
Solution: Cost/Benefit Analysis
Choosing the Right Instance Type
r4.4xlarge
  r = Instance Class
  4 = Generation
  4xlarge = Multiplier for CPU and Mem
Other examples: t2.2xlarge, c5.16xlarge
Over 100 to choose from!
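The naming scheme is regular enough to parse mechanically. A minimal sketch with hypothetical names (`InstanceType`, `parse_instance_type`); it covers only the simple class, generation, and size form shown on the slide and does not handle names with extra letters after the generation digit.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    instance_class: str  # e.g. "r"
    generation: int      # e.g. 4
    size: str            # e.g. "4xlarge", the CPU/memory multiplier

# One or more letters, then digits, a dot, and the size suffix.
_PATTERN = re.compile(r"^([a-z]+)(\d+)\.(\w+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split a name such as 'r4.4xlarge' into class, generation, and size."""
    match = _PATTERN.match(name.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    return InstanceType(match.group(1), int(match.group(2)), match.group(3))
```

For example, `parse_instance_type("r4.4xlarge")` yields class `r`, generation `4`, and size `4xlarge`.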
[Figure: instance type comparison chart. Credit: Willard Simmons (DataXu)]
Older generations are inefficient
Better for larger memory clusters
Better for smaller memory clusters
Second Challenge:
Elastic Scaling
More Concurrency? Add More Nodes
[Diagram: a Presto coordinator with two Presto workers, running 1 query. The coordinator asks: “When will queries complete at current rate?”]
10 queries on two workers: Not fast enough! Qubole provisions more nodes up to a limit (around 3 minutes).
1 query on four workers: Too fast! Qubole decommissions nodes up to a limit.
Not so fast…
1 query with both workers at 100% CPU: Not fast enough!
Upscaling only works for new queries: the two original workers stay at 100% CPU while the two new workers sit idle.
Maybe we should have sent this query to a more powerful cluster?
Autoscaling is for concurrency
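The coordinator's question ("when will queries complete at current rate?") can be turned into a sizing rule. This is a hypothetical sketch, not Qubole's algorithm: the function name, the throughput parameter, and the target time are all assumptions made for illustration.

```python
import math


def desired_workers(
    pending_queries: int,
    queries_per_worker_minute: float,
    target_minutes: float,
    min_workers: int,
    max_workers: int,
) -> int:
    """Pick a worker count that would drain the pending queries within the target."""
    if pending_queries <= 0:
        return min_workers
    # Workers needed so that throughput * time covers the backlog.
    needed = math.ceil(
        pending_queries / (queries_per_worker_minute * target_minutes)
    )
    # Clamp to the configured limits, as in "up to a limit" on the slides.
    return max(min_workers, min(max_workers, needed))
```

The result only helps queries that have not started yet; a single running query that saturates its workers is not sped up by nodes added later.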
Third Challenge:
Maturity
Query History
Presto UI is nice for watching queries as they’re happening, but not for historical auditing
Presto Query Auditing
Service administration portal tracks Qubole commands (Presto queries) and links to the Qubole web site
View and download intermediate queries and results
Specific Technical Presto Issues
• Official Presto JDBC driver does not support Prepared Statements
• Worker loss not handled gracefully
  (if one task fails, all tasks fail; we take that risk with retry logic)
• No support for upper-case table names in MySQL (Issue 2863)
• TIMESTAMP behavior does not match SQL standard (Issue 7122)
• Naïve query optimizer (talk to Starburst!)
Moral: you may need to get creative with workarounds
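Retry logic of the kind mentioned for worker loss might look like the sketch below. Everything here is hypothetical (the `QueryFailed` exception, the `run_with_retry` helper, the backoff values); the slides do not show the actual implementation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class QueryFailed(Exception):
    """Hypothetical error raised when a query dies, e.g. after a worker is lost."""


def run_with_retry(
    run_query: Callable[[], T],
    attempts: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Re-run the whole query on failure, doubling the wait between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[QueryFailed] = None
    for attempt in range(1, attempts + 1):
        try:
            return run_query()
        except QueryFailed as err:
            last_error = err
            if attempt < attempts:
                sleep(backoff_seconds * 2 ** (attempt - 1))
    assert last_error is not None
    raise last_error
```

Because one failed task fails the whole query, the retry wraps the entire query rather than an individual task.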
Testing
Presto Docker container using memory connectors
Declarative syntax allows us to mock tables in the Docker container…
…so we can test our generated queries in isolation using Behavior-Driven Development.
Final Takeaways
Setting expectations: Make sure everyone knows Presto is not a full-fledged database.
Biggest (Positive) Surprise
Providing one logical view of the data model across many databases is great!
For this reason, Presto became a favorite for many other workloads beyond its initial scope.
Presto’s simplicity resulted in widespread adoption.
Provocative Ending
Presto feels like an API gateway, but for data.
Behavioral Services :: Data Applications
Interface (REST, WSDL, Thrift, etc.) :: Data Definition Language (DDL)
Requests (HTTP, SOAP, etc.) :: Data Manipulation Language (DML)
Service implementation language :: Database technology
Publishing an endpoint :: Exposing a table or view
Service handler :: CREATE VIEW, CREATE TRIGGER
Service endpoint configuration :: Catalog/connector configuration
What other engineering advancements can we push through the lens from
microservices (behaviors) to databases (state)?
Thanks!
Questions?

convolutional neural network and its applications.pdfSubhamKumar3239
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Último (20)

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Targeted Audience Delivery with Presto

  • 4. TV networks, programmers, and advertisers ask: “What are my target viewership segments?” Targeted Audience Delivery draws on set-top box data, purchasing behavior, location-based consumer data, and program metadata.
  • 5. (Same diagram as slide 4.) Targeted Audience Delivery: brought to you (in part) by [logo]
  • 6. looking to the past for inspiration for the future
  • 7. Similar Products at TiVo. [Architecture diagram: ETL, Amazon S3, ETL, Amazon Redshift, MySQL (RDS), Java services on EC2.]
  • 8. Similar Products at TiVo. [Same architecture diagram.] Annotations: transactional and customer-configurable data; semi-aggregated viewership data plus sets of households (e.g., “18-24 years old”, “owns minivan”).
  • 9. New Product, New Challenges… [Same architecture diagram, with three more MySQL instances.] Many new data marts popping up in our tech stack.
  • 10. New Product, New Challenges… [Same architecture diagram.] More viewership data. “OK, storage is cheap.”
  • 11. New Product, New Challenges… [Same architecture diagram.] More viewership data. Storage is not cheap…
  • 12. New Product, New Challenges… [Same architecture diagram.] Storage is not cheap… Need finer-grained data!
  • 13. New Product, New Challenges… [Same architecture diagram.] Storage is not cheap… Need finer-grained data! Can’t aggregate as much.
  • 14. New Product, New Challenges… [Same architecture diagram.] Static, hard to scale.
  • 15. Wait, what about Redshift Spectrum?
  • 16. Redshift Spectrum. Redshift: pay per node-hour. Spectrum: pay per data access.
  • 17. How Does it Scale?
  • 18. How Does it Scale? Experiment: join on two tables
      • Small joins: join a small Redshift table with a (filtered-down) large table on S3; join across ~1M rows
      • Large joins: join a large Redshift table with an (unfiltered) large table on S3; join across ~10M rows
      • Compare to: both tables on Redshift
  • 19. Redshift Spectrum for “Simple” Queries (small joins). [Chart: Latency (sec) vs. # Concurrent Requests; x-axis: concurrent queries (1 to 15); y-axis: time (sec, 0 to 70); series: 1 day, 1 day (Spectrum).] Spectrum is faster when the cluster is loaded and it can pre-filter/pre-aggregate data.
  • 20. Redshift Spectrum for “Simple” Queries (small joins). [Same chart, annotated “Spectrum faster”.] Spectrum is faster when the cluster is loaded and it can pre-filter/pre-aggregate data.
  • 22. Redshift Spectrum for Complex Queries. [Chart: x-axis: concurrent queries; y-axis: time (sec).] Spectrum slower!
  • 23. Amdahl’s Law in Effect: memory for the broadcast join is a non-parallelizable resource in the cluster.
  • 24. Amdahl’s Law in Effect: memory for the broadcast join is a non-parallelizable resource in the cluster. “Operations that can't be pushed to the Redshift Spectrum layer include [JOIN], DISTINCT and ORDER BY. … When large amounts of data are returned from Amazon S3, the processing is limited by your cluster's resources.” https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-performance.html
  • 25. Wait, what about Redshift Spectrum? Our queries won’t work well on Spectrum.
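
The Amdahl's Law point on the slides above can be made concrete with a small calculation. This is a minimal sketch: the function names and the example fractions are illustrative assumptions, not measurements from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speedup when only `parallel_fraction` of the work scales with `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound on speedup as the worker count grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical: 75% of a query scales out, 25% is bound by one cluster's join memory.
    for n in (1, 2, 8, 64):
        print(n, round(amdahl_speedup(0.75, n), 2))
    print("limit:", max_speedup(0.75))
```

If a quarter of the work is pinned to a resource that does not scale, adding scan capacity can never yield more than a 4x speedup, which is consistent with the complex-query result reported on slide 22.
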
  • 27. Our Choice:
      • Storage/compute separation
      • Easy to add and remove worker nodes
      • Query many different data sources (inside our VPC) without a separate load step
      • Good performance for analytical queries; not so good for transactional and simple queries…
      • Managed (e.g., Qubole, Starburst)
  • 28. How Presto Works. [Diagram: one Coordinator and three Workers, with connectors to S3 / Hive metastore and to MySQL.] Query: SELECT SUM(v.seconds_viewed) FROM hive.db.viewership v JOIN mysql.db.audiences a ON a.hh_id = v.hh_id WHERE audience_id = 42. mysql catalog → SELECT … FROM db.audiences WHERE audience_id = 42. hive catalog → (not shown). Data is streamed back to the workers. DRAFT - TiVo Confidential 2018
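
The flow on the "How Presto Works" slide can be imitated with a toy, single-process model. This is only a sketch under stated assumptions: the in-memory lists stand in for the two catalogs, the helper names are hypothetical, and each household is assumed to appear at most once per audience. It is not how Presto itself is implemented.

```python
from typing import Dict, Iterator, List

# Hypothetical sample rows standing in for mysql.db.audiences and hive.db.viewership.
MYSQL_AUDIENCES: List[Dict[str, int]] = [
    {"audience_id": 42, "hh_id": 1},
    {"audience_id": 42, "hh_id": 3},
    {"audience_id": 7, "hh_id": 2},
]
HIVE_VIEWERSHIP: List[Dict[str, int]] = [
    {"hh_id": 1, "seconds_viewed": 120},
    {"hh_id": 2, "seconds_viewed": 300},
    {"hh_id": 3, "seconds_viewed": 60},
    {"hh_id": 1, "seconds_viewed": 30},
]


def scan_mysql_audiences(audience_id: int) -> List[Dict[str, int]]:
    """The filter is applied at the source, so only matching rows are returned."""
    return [row for row in MYSQL_AUDIENCES if row["audience_id"] == audience_id]


def scan_hive_viewership() -> Iterator[Dict[str, int]]:
    """Rows from the large table are streamed rather than loaded up front."""
    yield from HIVE_VIEWERSHIP


def total_seconds_viewed(audience_id: int) -> int:
    # Build side: the small, pre-filtered set of household ids.
    households = {row["hh_id"] for row in scan_mysql_audiences(audience_id)}
    # Probe side: stream viewership rows and keep those whose household matches.
    return sum(
        row["seconds_viewed"]
        for row in scan_hive_viewership()
        if row["hh_id"] in households
    )


if __name__ == "__main__":
    print(total_seconds_viewed(42))
```

The point of the toy is the division of labour: the selective predicate runs where the small table lives, and only the join and aggregation happen in the engine.
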
  • 29. First Challenge: What instance types should we use?
  • 30. Presto Worker Memory. Three pools: System Memory (reserved-system-memory = 0.4 * JVM Max Memory), Reserved Memory (max-memory-per-node), and General Memory (the rest). All queries start using memory from the General pool.
  • 31. Presto Worker Memory (same three pools). All queries start using memory from the General pool. [A query is shown in the General pool.]
  • 32. Presto Worker Memory (same three pools). A query needs more memory than is in the General pool → switch to Reserved.
  • 34. Presto Worker Memory (same three pools). A query needs more memory than is in the General pool → switch to Reserved. Only one query allowed!
  • 35. Presto Worker Memory (same three pools). A query needs more memory than is in the Reserved pool → fail.
  • 36. Presto Worker Memory (same three pools). A query needs more memory than is in the Reserved pool → fail. But there’s available memory??
  • 37. Presto Worker Memory (same three pools). A query needs more memory than is in the Reserved pool → keep allocating (resource overcommit).
  • 38. Presto Worker Memory (same three pools). But now a single query can hog the entire cluster!
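
The pool arithmetic and the placement rules on the worker-memory slides can be sketched as follows. This is a simplified model, not Presto's memory manager: the class and function names are hypothetical, the "blocked" outcome for a second query that wants the Reserved pool is an assumption (the slides only say one query is allowed), and the overcommit option from slide 37 is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPools:
    system_gb: float    # reserved-system-memory
    reserved_gb: float  # max-memory-per-node
    general_gb: float   # the rest


def split_worker_memory(jvm_max_gb: float, max_memory_per_node_gb: float,
                        system_fraction: float = 0.4) -> WorkerPools:
    """Split one worker's JVM heap into the three pools named on the slides."""
    system_gb = system_fraction * jvm_max_gb
    general_gb = jvm_max_gb - system_gb - max_memory_per_node_gb
    if general_gb < 0:
        raise ValueError("system + reserved memory exceed the JVM heap")
    return WorkerPools(system_gb, max_memory_per_node_gb, general_gb)


def place_query(pools: WorkerPools, query_gb: float,
                general_in_use_gb: float = 0.0,
                reserved_busy: bool = False) -> str:
    """Return where a query of `query_gb` ends up on this worker."""
    if query_gb <= pools.general_gb - general_in_use_gb:
        return "general"            # all queries start here
    if query_gb > pools.reserved_gb:
        return "fail"               # larger than the Reserved pool
    # Assumption: a second candidate for the Reserved pool has to wait.
    return "blocked" if reserved_busy else "reserved"


if __name__ == "__main__":
    pools = split_worker_memory(jvm_max_gb=100, max_memory_per_node_gb=20)
    print(pools)
    print(place_query(pools, 15, general_in_use_gb=30))
```

With a hypothetical 100 GB heap and max-memory-per-node of 20 GB, a 25 GB query fails once the General pool is mostly occupied, even though the worker as a whole still has free memory. That is the situation slide 36 calls out.
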
  • 40. Multiple Workers. [Diagram: two Presto workers, each with its own memory, running queries.]
  • 41. Multiple Workers. [Same diagram.] Total memory: max-memory = max-memory-per-node * number of nodes.
  • 42. Multiple Workers. [Same diagram, with more queries running.] Total memory: max-memory = max-memory-per-node * number of nodes.
  • 43. Multiple Workers. [Same diagram, with more queries running.] Total memory: max-memory = max-memory-per-node * number of nodes.
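
The total-memory formula on the slides above is simple enough to check by hand, but a short sketch makes the sizing question explicit. The function names are hypothetical, and `nodes_needed` assumes a query's memory spreads perfectly evenly across workers, which real workloads rarely do.

```python
import math


def cluster_max_memory_gb(max_memory_per_node_gb: float, number_of_nodes: int) -> float:
    """max-memory = max-memory-per-node * number of nodes (as stated on the slide)."""
    return max_memory_per_node_gb * number_of_nodes


def nodes_needed(query_peak_gb: float, max_memory_per_node_gb: float) -> int:
    """Smallest node count whose total max-memory covers one query's peak usage.

    Assumes perfectly even distribution across nodes (an idealisation).
    """
    return math.ceil(query_peak_gb / max_memory_per_node_gb)


if __name__ == "__main__":
    print(cluster_max_memory_gb(20, 4))
    print(nodes_needed(90, 20))
```
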
  • 44. Working With Reserved Memory Pool. Conceptually, the reserved memory pool should be the “high water mark” while most queries complete in the general pool. How do we achieve that?
      • What if memory usage varies a lot between different queries?
      • Use many inexpensive instances, or a few expensive instances?
      • Compute optimized or memory optimized?
  • 45. Working With Reserved Memory Pool. Conceptually, the reserved memory pool should be the “high water mark” while most queries complete in the general pool. How do we achieve that?
      • What if memory usage varies a lot between different queries?
      • Use many inexpensive instances, or a few expensive instances?
      • Compute optimized or memory optimized?
      • Solution: multiple clusters based on workload
      • Empirical testing found smaller cluster size was slightly faster
      • Solution: Cost/Benefit Analysis
  • 46. Choosing the Right Instance Type. r4.4xlarge = Instance Class (r) + Generation (4) + Multiplier for CPU and Mem (4xlarge). Other examples: t2.2xlarge, c5.16xlarge.
  • 47. Choosing the Right Instance Type. r4.4xlarge = Instance Class (r) + Generation (4) + Multiplier for CPU and Mem (4xlarge). Other examples: t2.2xlarge, c5.16xlarge. Over 100 to choose from!
  • 48. Choosing the Right Instance Type. [Chart. Credit: Willard Simmons (DataXu)]
  • 49. Choosing the Right Instance Type. [Chart. Credit: Willard Simmons (DataXu)] Older generations are inefficient.
  • 50. Choosing the Right Instance Type. [Chart. Credit: Willard Simmons (DataXu)] Better for larger memory clusters. Older generations are inefficient.
  • 51. Choosing the Right Instance Type. [Chart. Credit: Willard Simmons (DataXu)] Better for smaller memory clusters. Older generations are inefficient.
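
The naming scheme from the instance-type slides (class, generation, multiplier) can be pulled apart mechanically when comparing candidates. This is a small sketch with hypothetical names; the pattern only covers simple names like the three on the slide and deliberately rejects anything else, including newer names that carry extra letters after the generation.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    instance_class: str  # e.g. "r" (memory optimized), "c" (compute optimized)
    generation: int      # e.g. 4
    size: str            # multiplier for CPU and memory, e.g. "4xlarge"


# Assumption: <letters><digits>.<size>, as in r4.4xlarge, t2.2xlarge, c5.16xlarge.
_INSTANCE_PATTERN = re.compile(r"^([a-z]+)(\d+)\.([a-z0-9]+)$")


def parse_instance_type(name: str) -> InstanceType:
    match = _INSTANCE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type name: {name!r}")
    instance_class, generation, size = match.groups()
    return InstanceType(instance_class, int(generation), size)


if __name__ == "__main__":
    for name in ("r4.4xlarge", "t2.2xlarge", "c5.16xlarge"):
        print(parse_instance_type(name))
```
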
  • 53. More Concurrency? Add More Nodes
  • 54. More Concurrency? Add More Nodes. [Diagram: Presto Coordinator, two Presto Workers, 1 query.] When will queries complete at current rate?
  • 55. More Concurrency? Add More Nodes. [Diagram: Presto Coordinator, two Presto Workers, 10 queries.] When will queries complete at current rate? Not fast enough!
  • 56. More Concurrency? Add More Nodes. [Diagram: Presto Coordinator, two Presto Workers plus two new Presto Workers, 10 queries.] When will queries complete at current rate? Qubole provisions more nodes up to a limit (around 3 minutes).
  • 57. More Concurrency? Add More Nodes. [Diagram: Presto Coordinator, four Presto Workers, 1 query.] When will queries complete at current rate? Too fast!
  • 58. More Concurrency? Add More Nodes. [Diagram: Presto Coordinator, two Presto Workers, 1 query.] When will queries complete at current rate? Qubole decommissions more nodes up to a limit.
  • 59. Not so fast… [Diagram: Presto Coordinator, two Presto Workers at 100% CPU, 1 query.] When will queries complete at current rate? Not fast enough!
  • 60. Not so fast… [Diagram: Presto Coordinator, two Presto Workers at 100% CPU, two added Presto Workers idle, 1 query.] When will queries complete at current rate? Not fast enough! Upscaling only works for new queries.
  • 61. Not so fast… [Diagram: Presto Coordinator, two Presto Workers at 100% CPU, two added Presto Workers idle, 1 query.] When will queries complete at current rate? Not fast enough! Upscaling only works for new queries. Maybe we should have sent this query to a more powerful cluster? Autoscaling is for concurrency.
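
The scaling question on these slides ("When will queries complete at current rate?") suggests a simple proportional rule. The sketch below is hypothetical and is not Qubole's actual algorithm: the function name, the target-time parameter, and the clamping limits are all assumptions made for illustration.

```python
import math


def desired_worker_count(current_workers: int,
                         estimated_seconds_to_finish: float,
                         target_seconds: float,
                         min_workers: int,
                         max_workers: int) -> int:
    """Scale the worker count in proportion to how far off the target we are.

    "Not fast enough" (estimate above target) grows the cluster;
    "too fast" (estimate below target) shrinks it. Both are clamped to limits.
    """
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    ratio = max(estimated_seconds_to_finish, 0.0) / target_seconds
    desired = math.ceil(current_workers * ratio)
    return max(min_workers, min(max_workers, desired))


if __name__ == "__main__":
    # 10 queued queries: the estimate is 5x the target, so scale up to the limit.
    print(desired_worker_count(2, 600, 120, min_workers=2, max_workers=8))
    # 1 light query: the estimate is well under the target, so scale back down.
    print(desired_worker_count(4, 30, 120, min_workers=2, max_workers=8))
```

A rule like this only helps with concurrency. As slides 60 and 61 point out, workers added mid-flight sit idle for a query that is already running, so one heavy query is better routed to a more powerful cluster up front.
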
  • 63. Query History: the Presto UI is nice for watching queries as they’re happening, but not for historical auditing.
  • 64. Presto Query Auditing: the service administration portal tracks Qubole commands (Presto queries) and links to the Qubole web site. View and download intermediate queries and results.
  • 65. Specific Technical Presto Issues
      • Official Presto JDBC driver does not support Prepared Statements
      • Worker loss not handled gracefully (if one task fails, all tasks fail; we take that risk with retry logic)
      • No support for upper-case table names in MySQL (Issue 2863)
      • TIMESTAMP behavior does not match SQL standard (Issue 7122)
      • Naïve query optimizer (talk to Starburst!)
  • 66. Specific Technical Presto Issues
      • Official Presto JDBC driver does not support Prepared Statements
      • Worker loss not handled gracefully (if one task fails, all tasks fail; we take that risk with retry logic)
      • No support for upper-case table names in MySQL (Issue 2863)
      • TIMESTAMP behavior does not match SQL standard (Issue 7122)
      • Naïve query optimizer (talk to Starburst!)
      • Moral: you may need to get creative with workarounds
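
The retry workaround mentioned for worker loss might look roughly like the sketch below. The talk does not show TiVo's implementation, so everything here is an assumption: `QueryFailed`, `run_with_retries`, the exponential backoff, and the injectable `sleep` are illustrative only. Retrying blindly is only safe for queries that can be re-run without side effects, such as read-only SELECTs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QueryFailed(Exception):
    """Hypothetical error raised when a query dies, e.g. because a worker was lost."""


def run_with_retries(run_query: Callable[[], T],
                     max_attempts: int = 3,
                     base_delay_seconds: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Re-run the whole query on failure, backing off exponentially between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except QueryFailed:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Because one failed task fails the whole query, the unit of retry here is the entire query rather than an individual task.
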
  • 67. Testing: Presto Docker container using memory connectors.
  • 68. Testing: Presto Docker container using memory connectors. Declarative syntax allows us to mock tables in the Docker container…
  • 69. Testing: Presto Docker container using memory connectors. Declarative syntax allows us to mock tables in the Docker container… so we can test our generated queries in isolation using Behavior-Driven Development.
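
One way to mock tables declaratively is to turn a small table description into a `CREATE TABLE … AS SELECT … FROM (VALUES …)` statement aimed at a memory catalog. The sketch below is a guess at the general idea, not the speaker's tooling: the helper names, the `memory.default` schema, and the sample rows are assumptions.

```python
from typing import Sequence, Union

Scalar = Union[bool, int, float, str]


def sql_literal(value: Scalar) -> str:
    """Render a Python scalar as a SQL literal (strings are quoted and escaped)."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


def mock_table_sql(qualified_name: str,
                   columns: Sequence[str],
                   rows: Sequence[Sequence[Scalar]]) -> str:
    """Build one statement that creates and fills a mock table."""
    if not rows:
        raise ValueError("at least one row is required")
    if any(len(row) != len(columns) for row in rows):
        raise ValueError("every row must have one value per column")
    values = ", ".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    column_list = ", ".join(columns)
    return (
        f"CREATE TABLE {qualified_name} AS "
        f"SELECT * FROM (VALUES {values}) AS t ({column_list})"
    )


if __name__ == "__main__":
    print(mock_table_sql("memory.default.audiences",
                         ["audience_id", "hh_id"],
                         [(42, 1), (42, 3)]))
```

A test scenario can then create the tables it needs, run the generated query against the container, and compare the result with the expected rows.
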
  • 71. Setting expectations: make sure everyone knows Presto is not a full-fledged database.
  • 72. Biggest (Positive) Surprise: providing one logical view of the data model across many databases is great! For this reason it became a favorite for many other workloads beyond its initial scope. Presto’s simplicity resulted in widespread adoption.
  • 74. Provocative Ending: Presto feels like an API gateway, but for data. Analogy (Behavioral Services :: Data Applications):
      • Interface (REST, WSDL, Thrift, etc.) :: Data Definition Language (DDL)
      • Requests (HTTP, SOAP, etc.) :: Data Manipulation Language (DML)
      • Service implementation language :: Database technology
      • Publishing an endpoint :: Exposing a table or view
      • Service handler :: CREATE VIEW, CREATE TRIGGER
      • Service endpoint configuration :: Catalog/connector configuration
  • 75. Provocative Ending: Presto feels like an API gateway, but for data. (Same analogy as slide 74.) What other engineering advancements can we push through the lens from microservices (behaviors) to databases (state)?