SlideShare una empresa de Scribd logo
1 de 41
Descargar para leer sin conexión
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in partwithout the express consent of Amazon.com, Inc. 
November 12th, 2014 | Las Vegas, NV 
PFC305Embracing Failure 
Fault Injection and Service Resilience at Netflix 
Josh Evans and NareshGopalani, Netflix
•~50 million members, ~50 countries 
•> 1 billion hours per month 
•> 1000 device types 
•3 AWS regions, hundreds of services 
•Hundreds of thousands of requests/second 
•CDN serves petabytes of data at terabits/second 
Netflix Ecosystem 
Service Partners 
Static Content 
Akamai 
Netflix CDN 
AWS/Netflix 
Control Plane 
Internet
Availability means that members can 
●sign up 
●activate a device 
●browse 
●watch
What keeps us up at night
Failures can happen any time 
•Disks fail 
•Power outages 
•Natural disasters 
•Software bugs 
•Human error
We design for failure 
•Exception handling 
•Fault tolerance and isolation 
•Fall-backs and degraded experiences 
•Auto-scaling clusters 
•Redundancy
Testing for failure is hard 
•Web-scale traffic 
•Massive, changing data sets 
•Complex interactions and request patterns 
•Asynchronous, concurrent requests 
•Complete and partial failure modes 
Constant innovation and change
What if we regularly inject failures into our systems under controlled circumstances?
Blast Radius 
•Unit of isolation 
•Scope of an outage 
•Scope a chaos exercise 
Zone 
Region 
Instance 
Global
An Instance Fails 
Edge Cluster 
Cluster A 
Cluster B 
Cluster D 
Cluster C
Chaos Monkey 
•Monkey loose in your DC 
•Run during business hours 
•What we learned 
–Auto-replacement works 
–State is problematic
A State of Xen -Chaos Monkey & Cassandra 
Out of our 2700+ Cassandra nodes 
•218 rebooted 
•22 did not reboot successfully 
•Automation replaced failed nodes 
•0 downtime due to reboot
An Availability Zone Fails 
EU-West 
US-East 
US-West 
AZ1 
AZ2
Chaos Gorilla 
Simulate an Availability Zone outage 
•3-zone configuration 
•Eliminate one zone 
•Ensure that others can handle the load and nothing breaks 
Chaos Gorilla
Challenges 
•Rapidly shifting traffic 
–LBs must expire connections quickly 
–Lingering connections to caches must be addressed 
•Service configuration 
–Not all clusters auto-scaled or pinned 
–Services not configured for cross-zone calls 
–Mismatched timeouts –fallbacks prevented fail-over
A Region Fails 
EU-West 
US-East 
US-West
AZ1 
AZ2 
AZ3 
Regional Load Balancers 
Zuul –Traffic Shaping/Routing 
Data 
Data 
Data 
Geo-located 
Chaos Kong 
Chaos Kong 
AZ1 
AZ2 
AZ3 
Regional Load Balancers 
Zuul –Traffic Shaping/Routing 
Data 
Data 
Data 
Customer Device
Challenges 
●Rapidly shifting traffic 
○Auto-scaling configuration 
○Static configuration/pinning 
○Instance start time 
○Cache fill time
Challenges 
●Service Configuration 
○Timeout configurations 
○Fallbacks fail or don’t provide the desired experience 
●No minimal (critical) stack 
○Any service may be critical!
A Service Fails 
Zone 
Region 
Global 
Service
Services Slow Down and Fail 
Simulate latent/failed service calls 
•Inject arbitrary latency and errors at the service level 
•Observe for effects 
Latency Monkey
Latency Monkey 
Device 
Zuul 
ELB 
Edge 
Service B 
Service C 
Internet 
Service A
Challenges 
•Startup resiliency is an issue 
•Services owners don’t know all dependencies 
•Fallbacks can fail too 
•Second order effects not easily tested 
•Dependencies are in constant flux 
•Latency Monkey tests function and scale 
–Not a staged approach 
–Lots of opt-outs
More Precise and Continuous
Service Failure Testing:FIT
Distributed Systems Fail 
●Complex interactions at scale 
●Variability across services 
●Byzantine failures 
●Combinatorial complexity
Any service can cause cascading failures 
ELB
Fault Injection Testing (FIT) 
Device 
Service B 
Service C 
Internet 
Edge 
Device or Account Override 
Zuul 
Service A 
Request-level simulations 
ELB
Failure Injection Points 
IPC 
Cassandra Client 
Memcached Client 
Service Container 
Fault Tolerance
FIT Details 
●Common Simulation Syntax 
●Single Simulation Interface 
●Transported via Http Request header
Integrating Failure 
Service 
Filter 
Ribbon 
Service 
Filter 
Ribbon 
ServerRcv 
ServerRcv 
ClientSend 
request 
Service A 
response 
Service B 
[sendRequestHeader] >>fit.failure: 1|fit.Serializer| 
2|[[{"name”:”failSocial, 
”whitelist":false, 
"injectionPoints”: 
[“SocialService”]},{} 
]], 
{"Id": 
"252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
Failure Scenarios 
●Set of injection points to fail 
●Defined based on 
○Past outages 
○Specific dependency interactions 
○Whitelist of a set of critical services 
○Dynamic tracing of dependencies
FIT Insights : Salp 
●Distributed tracing inspired by Dapper paper 
●Provides insight into dependencies 
●Helps define & visualize scenarios
Functional Validation 
●Isolated synthetic transactions 
○Set of devices 
Validation at Scale 
●Dial up customer traffic -% based 
●Simulation of full service failure 
Dialing Up Failure 
Chaos!
Continuous Validation 
Critical 
Services 
Non-critical Services 
Synthetic Transactions
Don’t Fear The Monkeys
Take-aways 
•Don’t wait for random failures 
–Cause failure to validate resiliency 
–Remove uncertainty by forcing failures regularly 
–Better to fail at 2pm than 2am 
•Test design assumptions by stressing them 
Embrace Failure
The Simian Army is part of the Netflix open source cloud platform 
http://netflix.github.com
Netflix talks at re:Invent 
Talk 
Time 
Title 
BDT-403 
Wednesday, 2:15pm 
Next Generation Big Data Platform at Netflix 
PFC-306 
Wednesday, 3:30pm 
Performance Tuning Amazon EC2 
DEV-309 
Wednesday, 3:30pm 
From Asgardto Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services 
ARC-317 
Wednesday, 4:30pm 
Maintaining a Resilient Front-Door at Massive Scale 
PFC-304 
Wednesday, 4:30pm 
Effective InterProcessCommunications in the Cloud: The Pros and Cons of MicroservicesArchitectures 
ENT-209 
Wednesday, 4:30pm 
Cloud Migration, Dev-Ops and Distributed Systems 
APP-310 
Friday 9:00am 
Scheduling using Apache Mesosin the Cloud
Please give us your feedback on this session. 
Complete session evaluations and earn re:Invent swag. 
http://bit.ly/awsevals 
Josh Evans 
jevans@netflix.com 
@josh_evans_nflx 
Naresh Gopalani 
ngopalani@netflix.com

Más contenido relacionado

La actualidad más candente

AWS re:Invent 2016: Amazon Aurora Deep Dive (GPST402)
AWS re:Invent 2016: Amazon Aurora Deep Dive (GPST402)AWS re:Invent 2016: Amazon Aurora Deep Dive (GPST402)
AWS re:Invent 2016: Amazon Aurora Deep Dive (GPST402)Amazon Web Services
 
AWS re:Invent re:Cap - 새로운 관계형 데이터베이스 엔진: Amazon Aurora - 양승도
AWS re:Invent re:Cap - 새로운 관계형 데이터베이스 엔진: Amazon Aurora - 양승도AWS re:Invent re:Cap - 새로운 관계형 데이터베이스 엔진: Amazon Aurora - 양승도
AWS re:Invent re:Cap - 새로운 관계형 데이터베이스 엔진: Amazon Aurora - 양승도Amazon Web Services Korea
 
Gaming on AWS - 2. Amazon Aurora 100% 활용하기 - 신규 기능 및 이전 방법 시연
Gaming on AWS - 2. Amazon Aurora 100% 활용하기 - 신규 기능 및 이전 방법 시연Gaming on AWS - 2. Amazon Aurora 100% 활용하기 - 신규 기능 및 이전 방법 시연
Gaming on AWS - 2. Amazon Aurora 100% 활용하기 - 신규 기능 및 이전 방법 시연Amazon Web Services Korea
 
Amazon (AWS) Aurora
Amazon (AWS) AuroraAmazon (AWS) Aurora
Amazon (AWS) AuroraPGConf APAC
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
Deep Dive on Amazon EC2 Instances (March 2017)
Deep Dive on Amazon EC2 Instances (March 2017)Deep Dive on Amazon EC2 Instances (March 2017)
Deep Dive on Amazon EC2 Instances (March 2017)Julien SIMON
 
컴퓨팅 서비스 업데이트 - EC2, ECS, Lambda (김상필) :: re:Invent re:Cap Webinar 2015
컴퓨팅 서비스 업데이트 - EC2, ECS, Lambda (김상필) :: re:Invent re:Cap Webinar 2015컴퓨팅 서비스 업데이트 - EC2, ECS, Lambda (김상필) :: re:Invent re:Cap Webinar 2015
컴퓨팅 서비스 업데이트 - EC2, ECS, Lambda (김상필) :: re:Invent re:Cap Webinar 2015Amazon Web Services Korea
 
NEW LAUNCH! Introducing PostgreSQL compatibility for Amazon Aurora
NEW LAUNCH! Introducing PostgreSQL compatibility for Amazon AuroraNEW LAUNCH! Introducing PostgreSQL compatibility for Amazon Aurora
NEW LAUNCH! Introducing PostgreSQL compatibility for Amazon AuroraAmazon Web Services
 
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)Amazon Web Services
 
10 tips to improve the performance of your AWS application
10 tips to improve the performance of your AWS application10 tips to improve the performance of your AWS application
10 tips to improve the performance of your AWS applicationAmazon Web Services
 
Amazon Aurora Let's Talk About Performance
Amazon Aurora Let's Talk About PerformanceAmazon Aurora Let's Talk About Performance
Amazon Aurora Let's Talk About PerformanceDanilo Poccia
 
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Amazon Web Services
 
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech TalksDeep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech TalksAmazon Web Services
 
PostgreSQL on Amazon RDS
PostgreSQL on Amazon RDSPostgreSQL on Amazon RDS
PostgreSQL on Amazon RDSPGConf APAC
 
Configuration Management with AWS OpsWorks for Chef Automate
Configuration Management with AWS OpsWorks for Chef AutomateConfiguration Management with AWS OpsWorks for Chef Automate
Configuration Management with AWS OpsWorks for Chef AutomateAmazon Web Services
 
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)Amazon Web Services
 
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)Amazon Web Services
 

La actualidad más candente (20)

AWS re:Invent 2016: Amazon Aurora Deep Dive (GPST402)
AWS re:Invent 2016: Amazon Aurora Deep Dive (GPST402)AWS re:Invent 2016: Amazon Aurora Deep Dive (GPST402)
AWS re:Invent 2016: Amazon Aurora Deep Dive (GPST402)
 
AWS re:Invent re:Cap - 새로운 관계형 데이터베이스 엔진: Amazon Aurora - 양승도
AWS re:Invent re:Cap - 새로운 관계형 데이터베이스 엔진: Amazon Aurora - 양승도AWS re:Invent re:Cap - 새로운 관계형 데이터베이스 엔진: Amazon Aurora - 양승도
AWS re:Invent re:Cap - 새로운 관계형 데이터베이스 엔진: Amazon Aurora - 양승도
 
Gaming on AWS - 2. Amazon Aurora 100% 활용하기 - 신규 기능 및 이전 방법 시연
Gaming on AWS - 2. Amazon Aurora 100% 활용하기 - 신규 기능 및 이전 방법 시연Gaming on AWS - 2. Amazon Aurora 100% 활용하기 - 신규 기능 및 이전 방법 시연
Gaming on AWS - 2. Amazon Aurora 100% 활용하기 - 신규 기능 및 이전 방법 시연
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
Amazon (AWS) Aurora
Amazon (AWS) AuroraAmazon (AWS) Aurora
Amazon (AWS) Aurora
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
Deep Dive on Amazon EC2 Instances (March 2017)
Deep Dive on Amazon EC2 Instances (March 2017)Deep Dive on Amazon EC2 Instances (March 2017)
Deep Dive on Amazon EC2 Instances (March 2017)
 
컴퓨팅 서비스 업데이트 - EC2, ECS, Lambda (김상필) :: re:Invent re:Cap Webinar 2015
컴퓨팅 서비스 업데이트 - EC2, ECS, Lambda (김상필) :: re:Invent re:Cap Webinar 2015컴퓨팅 서비스 업데이트 - EC2, ECS, Lambda (김상필) :: re:Invent re:Cap Webinar 2015
컴퓨팅 서비스 업데이트 - EC2, ECS, Lambda (김상필) :: re:Invent re:Cap Webinar 2015
 
NEW LAUNCH! Introducing PostgreSQL compatibility for Amazon Aurora
NEW LAUNCH! Introducing PostgreSQL compatibility for Amazon AuroraNEW LAUNCH! Introducing PostgreSQL compatibility for Amazon Aurora
NEW LAUNCH! Introducing PostgreSQL compatibility for Amazon Aurora
 
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
 
Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28
 
Oracle on AWS RDS Migration - 성기명
Oracle on AWS RDS Migration - 성기명Oracle on AWS RDS Migration - 성기명
Oracle on AWS RDS Migration - 성기명
 
10 tips to improve the performance of your AWS application
10 tips to improve the performance of your AWS application10 tips to improve the performance of your AWS application
10 tips to improve the performance of your AWS application
 
Amazon Aurora Let's Talk About Performance
Amazon Aurora Let's Talk About PerformanceAmazon Aurora Let's Talk About Performance
Amazon Aurora Let's Talk About Performance
 
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
Deep Dive on the Amazon Aurora MySQL-compatible Edition - DAT301 - re:Invent ...
 
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech TalksDeep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
 
PostgreSQL on Amazon RDS
PostgreSQL on Amazon RDSPostgreSQL on Amazon RDS
PostgreSQL on Amazon RDS
 
Configuration Management with AWS OpsWorks for Chef Automate
Configuration Management with AWS OpsWorks for Chef AutomateConfiguration Management with AWS OpsWorks for Chef Automate
Configuration Management with AWS OpsWorks for Chef Automate
 
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
AWS re:Invent 2016: Deep Dive on Amazon Aurora (DAT303)
 
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
AWS Summit London 2014 | Maximising EC2 and EBC Performance (400)
 

Similar a (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

Embracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at NetflixEmbracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at NetflixJosh Evans
 
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02~Eric Principe
 
(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The CloudAmazon Web Services
 
Engineering Netflix Global Operations in the Cloud
Engineering Netflix Global Operations in the CloudEngineering Netflix Global Operations in the Cloud
Engineering Netflix Global Operations in the CloudJosh Evans
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internalsTokyo Azure Meetup
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...Ruslan Meshenberg
 
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...Amazon Web Services
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenParticular Software
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...Amazon Web Services
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...Amazon Web Services
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudHostedbyConfluent
 
Netflix Massively Scalable, Highly Available, Immutable Infrastructure
Netflix Massively Scalable, Highly Available, Immutable InfrastructureNetflix Massively Scalable, Highly Available, Immutable Infrastructure
Netflix Massively Scalable, Highly Available, Immutable InfrastructureAmer Ather
 
Expect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesExpect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesBhakti Mehta
 
Migrating IBM i Systems to the Cloud: Exploring the Pros and Cons
Migrating IBM i Systems to the Cloud: Exploring the Pros and ConsMigrating IBM i Systems to the Cloud: Exploring the Pros and Cons
Migrating IBM i Systems to the Cloud: Exploring the Pros and ConsPrecisely
 
Database and Public Endpoints redundancy on Azure
Database and Public Endpoints redundancy on AzureDatabase and Public Endpoints redundancy on Azure
Database and Public Endpoints redundancy on AzureRadu Vunvulea
 
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s DilemmaAmazon Web Services
 
Pace of Innovation at AWS - London Summit Enteprise Track RePlay
Pace of Innovation at AWS - London Summit Enteprise Track RePlayPace of Innovation at AWS - London Summit Enteprise Track RePlay
Pace of Innovation at AWS - London Summit Enteprise Track RePlayAmazon Web Services
 
Traffic Control with Envoy Proxy
Traffic Control with Envoy ProxyTraffic Control with Envoy Proxy
Traffic Control with Envoy ProxyMark McBride
 
Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)Govind Kanshi
 

Similar a (PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014 (20)

Embracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at NetflixEmbracing Failure - Fault Injection and Service Resilience at Netflix
Embracing Failure - Fault Injection and Service Resilience at Netflix
 
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
Net flix embracingfailure re-invent2014-141113085858-conversion-gate02
 
(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud
 
Engineering Netflix Global Operations in the Cloud
Engineering Netflix Global Operations in the CloudEngineering Netflix Global Operations in the Cloud
Engineering Netflix Global Operations in the Cloud
 
Tokyo azure meetup #12 service fabric internals
Tokyo azure meetup #12   service fabric internalsTokyo azure meetup #12   service fabric internals
Tokyo azure meetup #12 service fabric internals
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...
 
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves Goeleven
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
Our Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent Cloud
 
Netflix Massively Scalable, Highly Available, Immutable Infrastructure
Netflix Massively Scalable, Highly Available, Immutable InfrastructureNetflix Massively Scalable, Highly Available, Immutable Infrastructure
Netflix Massively Scalable, Highly Available, Immutable Infrastructure
 
Expect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservicesExpect the unexpected: Prepare for failures in microservices
Expect the unexpected: Prepare for failures in microservices
 
Migrating IBM i Systems to the Cloud: Exploring the Pros and Cons
Migrating IBM i Systems to the Cloud: Exploring the Pros and ConsMigrating IBM i Systems to the Cloud: Exploring the Pros and Cons
Migrating IBM i Systems to the Cloud: Exploring the Pros and Cons
 
Mini-Track: Lessons from Public Cloud
Mini-Track: Lessons from Public CloudMini-Track: Lessons from Public Cloud
Mini-Track: Lessons from Public Cloud
 
Database and Public Endpoints redundancy on Azure
Database and Public Endpoints redundancy on AzureDatabase and Public Endpoints redundancy on Azure
Database and Public Endpoints redundancy on Azure
 
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma(SPOT302) Availability: The New Kind of Innovator’s Dilemma
(SPOT302) Availability: The New Kind of Innovator’s Dilemma
 
Pace of Innovation at AWS - London Summit Enteprise Track RePlay
Pace of Innovation at AWS - London Summit Enteprise Track RePlayPace of Innovation at AWS - London Summit Enteprise Track RePlay
Pace of Innovation at AWS - London Summit Enteprise Track RePlay
 
Traffic Control with Envoy Proxy
Traffic Control with Envoy ProxyTraffic Control with Envoy Proxy
Traffic Control with Envoy Proxy
 
Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)
 

Más de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Más de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Último

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Último (20)

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

(PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014

  • 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in partwithout the express consent of Amazon.com, Inc. November 12th, 2014 | Las Vegas, NV PFC305Embracing Failure Fault Injection and Service Resilience at Netflix Josh Evans and NareshGopalani, Netflix
  • 2. •~50 million members, ~50 countries •> 1 billion hours per month •> 1000 device types •3 AWS regions, hundreds of services •Hundreds of thousands of requests/second •CDN serves petabytes of data at terabits/second Netflix Ecosystem Service Partners Static Content Akamai Netflix CDN AWS/Netflix Control Plane Internet
  • 3. Availability means that members can ●sign up ●activate a device ●browse ●watch
  • 4. What keeps us up at night
  • 5. Failures can happen any time •Disks fail •Power outages •Natural disasters •Software bugs •Human error
  • 6. We design for failure •Exception handling •Fault tolerance and isolation •Fall-backs and degraded experiences •Auto-scaling clusters •Redundancy
  • 7. Testing for failure is hard •Web-scale traffic •Massive, changing data sets •Complex interactions and request patterns •Asynchronous, concurrent requests •Complete and partial failure modes Constant innovation and change
  • 8. What if we regularly inject failures into our systems under controlled circumstances?
  • 9.
  • 10. Blast Radius •Unit of isolation •Scope of an outage •Scope a chaos exercise Zone Region Instance Global
  • 11. An Instance Fails Edge Cluster Cluster A Cluster B Cluster D Cluster C
  • 12. Chaos Monkey •Monkey loose in your DC •Run during business hours •What we learned –Auto-replacement works –State is problematic
  • 13. A State of Xen -Chaos Monkey & Cassandra Out of our 2700+ Cassandra nodes •218 rebooted •22 did not reboot successfully •Automation replaced failed nodes •0 downtime due to reboot
  • 14. An Availability Zone Fails EU-West US-East US-West AZ1 AZ2
  • 15. Chaos Gorilla Simulate an Availability Zone outage •3-zone configuration •Eliminate one zone •Ensure that others can handle the load and nothing breaks Chaos Gorilla
  • 16. Challenges •Rapidly shifting traffic –LBs must expire connections quickly –Lingering connections to caches must be addressed •Service configuration –Not all clusters auto-scaled or pinned –Services not configured for cross-zone calls –Mismatched timeouts –fallbacks prevented fail-over
  • 17. A Region Fails EU-West US-East US-West
  • 18. AZ1 AZ2 AZ3 Regional Load Balancers Zuul –Traffic Shaping/Routing Data Data Data Geo-located Chaos Kong Chaos Kong AZ1 AZ2 AZ3 Regional Load Balancers Zuul –Traffic Shaping/Routing Data Data Data Customer Device
  • 19. Challenges ●Rapidly shifting traffic ○Auto-scaling configuration ○Static configuration/pinning ○Instance start time ○Cache fill time
  • 20. Challenges ●Service Configuration ○Timeout configurations ○Fallbacks fail or don’t provide the desired experience ●No minimal (critical) stack ○Any service may be critical!
  • 21. A Service Fails Zone Region Global Service
  • 22. Services Slow Down and Fail Simulate latent/failed service calls •Inject arbitrary latency and errors at the service level •Observe for effects Latency Monkey
  • 23. Latency Monkey Device Zuul ELB Edge Service B Service C Internet Service A
  • 24. Challenges •Startup resiliency is an issue •Services owners don’t know all dependencies •Fallbacks can fail too •Second order effects not easily tested •Dependencies are in constant flux •Latency Monkey tests function and scale –Not a staged approach –Lots of opt-outs
  • 25. More Precise and Continuous
  • 27. Distributed Systems Fail ●Complex interactions at scale ●Variability across services ●Byzantine failures ●Combinatorial complexity
  • 28. Any service can cause cascading failures ELB
  • 29. Fault Injection Testing (FIT) Device Service B Service C Internet Edge Device or Account Override Zuul Service A Request-level simulations ELB
  • 30. Failure Injection Points IPC Cassandra Client Memcached Client Service Container Fault Tolerance
  • 31. FIT Details ●Common Simulation Syntax ●Single Simulation Interface ●Transported via Http Request header
  • 32. Integrating Failure Service Filter Ribbon Service Filter Ribbon ServerRcv ServerRcv ClientSend request Service A response Service B [sendRequestHeader] >>fit.failure: 1|fit.Serializer| 2|[[{"name”:”failSocial, ”whitelist":false, "injectionPoints”: [“SocialService”]},{} ]], {"Id": "252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
  • 33. Failure Scenarios ●Set of injection points to fail ●Defined based on ○Past outages ○Specific dependency interactions ○Whitelist of a set of critical services ○Dynamic tracing of dependencies
  • 34. FIT Insights : Salp ●Distributed tracing inspired by Dapper paper ●Provides insight into dependencies ●Helps define & visualize scenarios
  • 35. Functional Validation ●Isolated synthetic transactions ○Set of devices Validation at Scale ●Dial up customer traffic -% based ●Simulation of full service failure Dialing Up Failure Chaos!
  • 36. Continuous Validation Critical Services Non-critical Services Synthetic Transactions
  • 37. Don’t Fear The Monkeys
  • 38. Take-aways •Don’t wait for random failures –Cause failure to validate resiliency –Remove uncertainty by forcing failures regularly –Better to fail at 2pm than 2am •Test design assumptions by stressing them Embrace Failure
  • 39. The Simian Army is part of the Netflix open source cloud platform http://netflix.github.com
  • 40. Netflix talks at re:Invent Talk Time Title BDT-403 Wednesday, 2:15pm Next Generation Big Data Platform at Netflix PFC-306 Wednesday, 3:30pm Performance Tuning Amazon EC2 DEV-309 Wednesday, 3:30pm From Asgardto Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services ARC-317 Wednesday, 4:30pm Maintaining a Resilient Front-Door at Massive Scale PFC-304 Wednesday, 4:30pm Effective InterProcessCommunications in the Cloud: The Pros and Cons of MicroservicesArchitectures ENT-209 Wednesday, 4:30pm Cloud Migration, Dev-Ops and Distributed Systems APP-310 Friday 9:00am Scheduling using Apache Mesosin the Cloud
  • 41. Please give us your feedback on this session. Complete session evaluations and earn re:Invent swag. http://bit.ly/awsevals Josh Evans jevans@netflix.com @josh_evans_nflx Naresh Gopalani ngopalani@netflix.com