Watching Somebody Else’s Computer
Ronald McCollam
Markus Meier, FSFE, CC-BY-SA 4.0
Cattle vs. Pets
???
Jerem43, CC BY-SA 3.0 Deed
Imgflip/Paramount, Fair Use
Vectorportal, CC BY 4.0
Observability in the Cloud
Basic Principles
Common Challenges
and Best Practices
O11y in IRM
Observability in Incident Response Management
Why Me?
(The least interesting part of any presentation!)
● IT operations management since the “cloud” was new and scary
● 15+ years experience in observability
● Interviewed hundreds of SRE teams
● Literally wrote the book on Grafana, an open source o11y tool
Principles
Why set up monitoring at all?
My cloud provider does this for me, right?
What’s your bottom line?
How much does downtime cost you?
Are you happy with a $2000 SLA for downtime?
The cloud provider defaults are good enough, right?
Maybe!
Defaults cover the average user. If you’re doing something outside of the defaults,
think about it – are you really monitoring what you think you’re monitoring?
RTFM!
Lou Gold, CC BY-NC-SA 2.0 Deed
MLT
Where do I start?
Metrics tell you that something is wrong, but not what
Logs tell you what went wrong
(Traces tell you where a bottleneck is)
You can make logs act like metrics, but not the other way around
Logs are indispensable
Barnaby Dorfman, CC-BY 2.0
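The claim above that logs can be made to act like metrics can be illustrated with a minimal sketch. It assumes a hypothetical log format of `<ISO timestamp> <LEVEL> <message>`; the function name and the sample lines are made up for illustration and are not taken from the talk.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<ISO timestamp> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")

def count_by_minute_and_level(lines):
    """Aggregate raw log lines into (minute, level) -> count, a metric-like series."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        minute = match.group("ts")[:16]  # e.g. "2024-05-01T12:00"
        counts[(minute, match.group("level"))] += 1
    return counts

sample = [
    "2024-05-01T12:00:03Z INFO request served",
    "2024-05-01T12:00:41Z ERROR upstream timeout",
    "2024-05-01T12:00:59Z ERROR upstream timeout",
    "2024-05-01T12:01:10Z INFO request served",
]
series = count_by_minute_and_level(sample)
print(series[("2024-05-01T12:00", "ERROR")])  # 2
```

The reverse direction does not work: once only the counts are stored, the message text ("upstream timeout") is gone and cannot be recovered from the series.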
Infrastructure (k8s, database services, etc.)
Basic services - what does everything else depend on
The basics
Don’t care about things that scale (CPU, memory… even nodes)
At most, warn about a node failure. Never page for it.
Core resources first
You build for HA, so you should monitor that way
CI/CD - did deployments work? What versions are out there? Where are they?
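The advice above (at most warn about a node failure, never page for it) could be sketched roughly as follows. The threshold, the pool and node names, and the action labels "record" and "warn" are assumptions chosen for illustration; the point is only that no code path returns "page".

```python
from collections import Counter

# Hypothetical threshold; tune it for your own environment.
RECURRING_THRESHOLD = 3

def classify_node_failures(failed_nodes, threshold=RECURRING_THRESHOLD):
    """Map each node pool to an action based on how often its nodes failed.

    failed_nodes: iterable of (pool_name, node_name) tuples seen in one window.
    Returns {pool_name: "record" | "warn"}; "page" is deliberately never returned.
    """
    per_pool = Counter(pool for pool, _node in failed_nodes)
    return {
        pool: "warn" if count >= threshold else "record"
        for pool, count in per_pool.items()
    }

window = [
    ("pool-a", "node-1"),
    ("pool-b", "node-7"),
    ("pool-b", "node-8"),
    ("pool-b", "node-9"),
]
print(classify_node_failures(window))  # {'pool-a': 'record', 'pool-b': 'warn'}
```

A single failure is only recorded so that patterns can be found later; a warning is raised when failures recur within the same pool.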
Alert for user impact
(Hint: user experience = “free” SLOs!)
Synthetics can help
Make sure you catch everything!
(Did you remember PagerDuty? VictorOps? Grafana Cloud?)
So what should I alert for?
Symptoms, not causes
Monitor your monitoring
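As a rough sketch of "synthetics" and "monitor your monitoring", the check below looks for an HTTP 200 from each part of an observability stack. The URLs and target names are hypothetical, and the `opener` parameter is an assumption added so the check can be exercised without network access; it defaults to the standard library's `urllib.request.urlopen`.

```python
import urllib.request

# Hypothetical endpoints for the pieces of an observability stack.
TARGETS = {
    "dashboards": "https://grafana.example.com/api/health",
    "alerting": "https://alerts.example.com/healthz",
}

def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False

def run_probes(targets, opener=urllib.request.urlopen):
    """Probe every named target and report which ones are reachable."""
    return {name: probe(url, opener=opener) for name, url in targets.items()}
```

Calling `run_probes(TARGETS)` on a schedule, from a machine outside the system being checked, gives a simple symptom-level signal: any `False` value means a user of that tool would also be failing to load it.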
Quality varies – some are better than others
UX is not a solved problem here
Pick your tool(s)
Do you ever intend to go multi-cloud? (Or migrate?)
COMMUNITY MATTERS
Cloud-specific tools take time to learn
… but so does anything
Cloud provider monitoring
Think about lock in
Automatically onboard
Use a consistent format
Use it!
Semantics → Standards → Automation
Automated SLOs
One dashboard for the whole company
Drive adoption
Build, then automate
Driving adoption through SLOs
Challenges
(Or “Theory vs. Practice”)
Robert Harker, CC-BY-SA 3.0 Gael Mace, CC-BY 3.0
Scaling
The cloud is infinitely scalable, right?
It looks that way until it isn’t.
What could go wrong?
● Object storage - even if the storage is there, are the IOPS?
● Autoscaling - what happens when a zone runs out of a specific machine type?
● Spot pricing - heaven help you on Black Friday
● Serverless … isn’t serverless
BorisFromStockdale, GNU FDL
Monitoring scaling
Observe from outside
If your monitoring system is in the same cloud and AZ as your production
environment, what happens when that AZ goes down?
Watch your overhead
Even when you have scaling defined, keep some capacity
Know what’s important
If you do run out of capacity, what can you afford to lose first? What can’t go down?
(Make sure you’re monitoring that thing if nothing else!)
Scaling monitoring
Remember to scale your observability environment!
[Diagram: Node 1 to Node 3 with one Prometheus; then Node 1 to Node 6 with two Prometheus]
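One way to picture scaling the observability environment alongside the nodes is to split scrape targets across several monitoring instances by hashing their names. The sketch below is an illustration under assumptions: the node names and function names are hypothetical, and it does not reproduce the exact hashing used by Prometheus' `hashmod` relabel action, which serves a similar purpose.

```python
import hashlib

def shard_for(target, shard_count):
    """Pick a stable shard index for a scrape target by hashing its name."""
    digest = hashlib.sha256(target.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

def assign(targets, shard_count):
    """Group targets by the shard (monitoring instance) responsible for them."""
    shards = {index: [] for index in range(shard_count)}
    for target in targets:
        shards[shard_for(target, shard_count)].append(target)
    return shards

# Six hypothetical nodes split across two monitoring instances.
nodes = [f"node-{n}" for n in range(1, 7)]
for index, members in assign(nodes, 2).items():
    print(index, members)
```

Because the assignment depends only on the target name and the shard count, every instance can compute independently which targets it owns. Changing the shard count reshuffles many targets, which is a known trade-off of plain modulo hashing.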
O11y in IRM
Robert Scoble, CC-BY 2.0
Tools, tools, tools
(and process)
You already have tools for this!
(Especially if you think you don’t)
These are mission critical and need to be monitored!
(even if it costs money!)
Consistency is key
Use the same tools in dev and prod
Litlnemo, CC-BY-NC-SA 2.0 Deed Jon Sullivan, Public Domain
Dev Prod
Agree ahead of time on a starting point and process
Create a shared doc/room/channel for communication
Start simple – get everyone on the highest level dashboard to start and work down
Human Intelligence > Artificial Intelligence
Use runbooks over autoremediation
Automate where appropriate…
… but remember humans are great at thinking!
Takeaways
You need to monitor
Engagement matters
Infinity ain’t so
Monitor for scale
Scale your monitoring
Consistency is key
CREDITS: This presentation template was
created by Slidesgo, and includes icons by
Flaticon and infographics & images by Freepik
Thanks!
Contact
ronald@grafana.com
rm@mastodon.amaseto.com
CREDITS
Images are from Adam Y Zhang, Jerem43, Vectorportal,
Lou Gold, Robert Harker, Gael Mace, BorisFromStockdale,
Barnaby Dorfman, Robert Scoble, litlnemo, Jon Sullivan and
are used via the Creative Commons license.
I’m deeply indebted to my colleagues working in
observability, but most especially to Heds Simons and
Goutham Veeramachaneni for their advice and review of
this content.

Watching Somebody Else's Computer: Cloud Native Observability


Editor’s notes

  1. First, let’s get all the cliches used in this presentation out in the open. I’m sure everybody in here has heard this quote: “There is no cloud, just other people’s computers.” And while it’s funny because it’s true, it also means that everyone who uses cloud computing needs to think about how not being in control of their environment affects how they monitor it.
  2. Pets are how we used to think about servers: they each had a name, we knew their quirks, and we cared a lot about keeping them healthy. Cattle are numbered and replaceable. (I hate this analogy, so if anyone can help me come up with a better one I’d appreciate it!)
  3. Nobody’s here to hear about me, but in case you’re wondering why I’m qualified to talk about this… Way, way back in the day I built monitoring systems for on-prem datacenters, when the idea of trusting Amazon to run your computing infrastructure would get you laughed out of the room. I’ve since worked in and around observability doing ITIL/ITSM, logging, metrics and monitoring, and for the last 4.5 years I’ve been at Grafana Labs, where I’ve had the opportunity to talk to companies from tiny startups to the Fortune 50 to understand and advise on their observability strategies. And I’m never the smartest person in the room, but I get to hang out with the folks who are. I’ve drawn heavily on the experiences of my colleagues at Grafana Labs, who run a multi-cloud SaaS environment at massive scale, as well as my friends and colleagues in other companies.
  4. Why care about monitoring at all? The cloud provider does this for me, right? Well, they do it for themselves. You need to know how you were impacted to make an SLA case; without that, you’re reliant on what they decide. Defaults are good enough, right? Maybe! They’ll get you the basics, but if you’re doing something outside of the defaults, you need to think! You need to know what you’re actually monitoring. Is it what you think it is? RTFM!
  5. If you have to pick one tool, pick logs. Always start there. Metrics are great! They are easy to implement and alert off of and you can make pretty pictures with them to make the pointy haired bosses happy. But logs will tell you what’s actually broken. (Traces are a distant third here. They’re fantastically useful, because identifying bottlenecks and hotspots in your environment can take your performance to the next level. But this doesn’t matter if you can’t keep your application running to start with.) It’s sometimes painful and messy, but you can aggregate logs together and make them work like metrics. But you can’t pull detailed tracebacks from a metric – they just don’t exist there.
  6. Where in the environment should we start monitoring? It sounds simplistic, but it’s true: always start with core resources. These are the things that if they break, your whole production environment breaks. Things like k8s, databases, storage… things that you have to have working for everything else to function. Usually people manage to think about this, because that’s what we’ve been looking at since the beginning of time. But also remember that core infrastructure means things like your deployment and orchestration systems! If your CI/CD system is broken, you’re going to have a rough time deploying an emergency fix when something breaks! When talking about monitoring core resources, this doesn’t mean “speeds and feeds”. You almost certainly shouldn’t care about CPU utilization, free memory, disk space, even the health of individual nodes. (Remember “cattle vs. pets”? Don’t try to make a failing node healthy. Shoot it in the head and replace it. Worry about persistent or recurring failures.) It makes sense to track node failures. You’ll want that data to look for patterns. But don’t alert anyone for it! You’ve spent a ton of time and money getting into a self-healing environment, so let it self-heal!
  7. Focus on symptoms, not causes. Engineers are really good at finding the cause of an issue. If you try to build that intelligence into your monitoring, you’ll catch the things you thought about ahead of time but not the things that you didn’t expect. As an aside, if you think about what impacts your users, that will guide you pretty clearly to building service level objectives, which we’ll look at more in a moment. SLOs can tell you if you’re meeting expectations, sure, but also if you’re making changes too slowly! It’s a great way to know if you can move faster and break more things without pissing off your users. Synthetics: even a simple page load test to look for a 200 response on your observability system is huge. (You probably want to do this from outside of that system itself, of course!) Remember to look at all the systems that are part of your observability suite! (Is your alerting environment working? Ticketing? If you’re using Datadog, Grafana Cloud, etc. consider having a local environment ping that from time to time…!)
  8. What should you use? Cloud provider monitoring varies. CloudWatch is really powerful and you can probably (after beating your head against the docs for a while) get it to do what you want. (This is not necessarily the case for other providers’ tools.) UX is inconsistent and often hard to figure out in these tools. There’s not a ton of motivation for cloud providers to improve this, because you’ll take what you get and like it. You can’t switch from Azure Monitor to CloudWatch or vice versa without moving your whole environment. If you are multi-cloud, your tools should be too. You don’t want to have to learn three different mutually incompatible systems to monitor your app! Having a solid community around your tools matters. GCP had continuous profiling before anyone else, but nobody really knew about it because it wasn’t talked about widely. A middling tool that you know about and can get help with is better than the best tool that you’ve never heard of and can’t figure out. Datadog, Grafana, Splunk, etc. have great resources online to help you figure out how things work. (Search for “How do I do X in Stackdriver” vs. “How do I do X in Grafana or Datadog”: Stackdriver is all questions, Datadog/Grafana are all answers.) You can get great results out of cloud provider tooling if you put in the effort. But again, now you’re locked into that toolset, and if you want to move or expand you’re stuck. If you had put that effort into learning something platform-agnostic, you’d be home by now.
  9. Monitoring is only as good as the use it’s put to! You can build the best monitoring system in the world, but if nobody adopts it, it’s worthless. Automatically onboard people and make it simple for them to use: sensible defaults, provided as part of your infrastructure. Keep your logs/metrics/traces in a consistent format with a consistent naming convention. Make it simple for people who aren’t familiar with a service to know what it’s doing and what its state is. Once you’ve got this baseline, you can build automation and tooling on top. Semantic conventions lead to standardization, which leads to automation. Then you can derive SLOs from your data without much hassle. This lets you build ONE SLO dashboard to see everything in the company and enable interested parties to drill down.
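As a rough illustration of why consistent naming pays off: if every service reports the same two counters, one small function can produce an SLO view for all of them. The names below (`ServiceCounts`, `requests_total`, `errors_total`, `slo_report`) and the 99.9% target are assumptions for this sketch, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceCounts:
    service: str
    requests_total: int  # same counter name for every service
    errors_total: int    # same counter name for every service


def availability(counts: ServiceCounts) -> float:
    """Fraction of requests that succeeded (1.0 when there was no traffic)."""
    if counts.requests_total == 0:
        return 1.0
    return 1 - counts.errors_total / counts.requests_total


def error_budget_remaining(counts: ServiceCounts, target: float) -> float:
    """1.0 = budget untouched, 0.0 = budget spent, negative = overspent."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_errors = (1 - target) * counts.requests_total
    if allowed_errors == 0:
        return 1.0
    return 1 - counts.errors_total / allowed_errors


def slo_report(all_counts, target=0.999):
    """One table for every service: (availability, error budget remaining)."""
    return {
        c.service: (availability(c), error_budget_remaining(c, target))
        for c in all_counts
    }
```

With 10,000 requests and 5 errors against a 99.9% target, a service has used about half of its error budget. A budget that stays mostly unspent suggests you could be shipping changes faster.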
  10. There’s a simple trick that I learned from our CTO Tom Wilkie. He wanted to drive toward that standardization of SLO-based monitoring. But it’s hard to get a bunch of engineers to really buy into something by just mandating it. (Cat herding!) So he picked a few services and instrumented them the way he wanted, and started sending out a weekly report to the senior leadership team and every engineering manager. After a few weeks, people started asking “how can I get my team’s status into that report?” Leverage FOMO!
  11. … although I have worked for a couple of startups that began in an academic setting and designed everything in a pristine theoretical environment before going to the real world. So cynically I might call this “why everything you think you know is wrong”.
  12. Infinitely scalable? This is the promise of cloud computing. And mostly, you can pretend that it’s true. But when it isn’t, it’s a disaster. Suddenly you go from not knowing or caring about the underlying infrastructure you’re running on to being completely limited by it, but unable to fix it! Object storage IOPS - this is part of knowing and understanding the SLAs your provider gives. Read the fine print! Do they promise performance or just writable bytes? Autoscaling - sure, there might be more resources available in your cloud provider, but are they the machine type you specified? Spot pricing - someone will always have more money than you and be able to outbid you on compute! Serverless isn’t serverless - it’s still somebody else’s computer! A lambda function can fail because there’s still an EC2 instance underneath it running things, and hardware does sometimes break. Don’t blindly assume that your serverless functions will always just work. They need to be monitored too!
  13. If you're monitoring your environment from inside that environment, you're not monitoring! What happens when the DC goes down? Use a separate cluster and a separate AZ for o11y. Ensure you have some overhead capacity, even if you scale. Apply rate limits where appropriate: if you're running at 90% and there's a >10% spike that you can't scale for, you're dead. (Plus scaling still takes time!) Know which thing you're going to kill first if you have to, and make sure you're monitoring the critical thing.
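The 90% example is simple arithmetic, sketched below. One assumption to flag: the spike is measured here in points of total capacity, so "10%" means 0.10 of the whole. The function names are made up for this illustration.

```python
def headroom(utilization: float) -> float:
    """Unused share of total capacity, e.g. 0.9 utilization leaves about 0.1."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 - utilization


def max_relative_spike(utilization: float) -> float:
    """Largest growth relative to current load that still fits without scaling."""
    return headroom(utilization) / utilization


def survives(utilization: float, spike_points: float) -> bool:
    """True if a spike (in points of total capacity) fits in the headroom."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return utilization + spike_points <= 1.0
```

At 90% utilization the headroom is 10 points of capacity, which is only about 11% growth over the current load. `survives(0.9, 0.05)` is true and `survives(0.9, 0.15)` is false, and that is before accounting for the time scaling takes.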
  14. The flip side of monitoring scaling is scaling monitoring. Something I see really commonly for folks who are rolling their own: When you build your service environment you probably set up scaling groups, but did you remember to set up scaling for your observability? If not, what’s going to happen when you grow that production environment on Black Friday?
  15. You scale up to handle the load, but your Prometheus instance was sized to handle normal traffic. So what happens when it starts taking double the load it was scoped for? It goes down! And when your metric or log system goes down before you have an incident, now you’re really hosed! I might be biased, but: scale your observability systems _before_ you scale your services!
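A back-of-the-envelope check along these lines can be run before a planned scale-up. It is a sketch under stated assumptions: ingest load is approximated as replicas times series per replica divided by the scrape interval, and the capacity figure and 20% safety margin are hypothetical numbers you would replace with your own measurements.

```python
def ingest_rate(replicas: int, series_per_replica: int, scrape_interval_s: float) -> float:
    """Approximate samples per second the metrics backend must ingest."""
    return replicas * series_per_replica / scrape_interval_s


def observability_ready(
    planned_replicas: int,
    series_per_replica: int,
    scrape_interval_s: float,
    capacity_samples_per_s: float,
    safety_margin: float = 0.2,
) -> bool:
    """True if the backend can take the planned load plus a safety margin."""
    needed = ingest_rate(planned_replicas, series_per_replica, scrape_interval_s)
    return needed * (1 + safety_margin) <= capacity_samples_per_s
```

With 100 replicas exposing 1,500 series each at a 15 second interval, the load is roughly 10,000 samples per second, which fits a backend sized for 15,000. Double the replicas for Black Friday and the same backend no longer fits, so it has to be scaled first.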
  16. Facebook’s “War Room” in 2010
  17. This is the one point where I’m going to diverge a bit and talk about tools first. It’s not because process isn’t important, but because when we get to IRM, it’s the one area where I see people actually really think through process first. (This is awesome!) But tools are still important.
  18. Communications tools: where you’ll stay in sync. Pick one AND ONLY ONE. It should be the same one you use daily, but with a separate incident channel. Documentation tools: ideally something collaborative that you can update in real time. THESE ARE CRITICAL TOOLS AND SHOULD BE MONITORED.
  19. First and foremost, involve developers in observability. This goes back to the “adoption” points earlier. Your developers are the ones who know the most about the internals of systems, and thus they should have a good idea how they can fail. But they need to share a common language and tools with your production environment to be useful. The biggest mistake I see in incident response happens way before bad code sneaks into production. It’s when someone says “Dynatrace/AppD/Splunk costs a lot of money, so we’ll use it in Prod but not in Dev”. Now you have two completely different ways of monitoring and thinking about your environment. Your ops team has a set of queries and alerts, but your devs have no idea how those work. Or worse yet, you’re doing DevOps and your developers spend 90% of their time in the Dev environment and don’t know how to use the Prod tools. Having carried a pager in a past life, I can tell you that issues always occur at 3AM when you’re half awake and not thinking straight. If you don’t intimately know how to use the tools you have, you’ll struggle to make them work in an emergency.
  20. Developers should be taking the first crack at defining alerts, and should include information about how to resolve them. This is usually best done as a runbook rather than trying to automate a fix. Automation can hide recurring issues, and things like a quick reboot can disguise the root cause of issues. Having a runbook attached to an alert tells responders the most likely place to look, but allows for investigation and human intervention when appropriate. (This is specific to your application, not the infrastructure. You still want a pod to restart when something crashes, but if you hit a CrashLoopBackoff you definitely want to dig in and understand why.)
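One lightweight way to enforce "every alert comes with a runbook" is a check that runs when alert definitions are reviewed. This is only a sketch: the `AlertRule` fields and the example URL are hypothetical and not tied to any particular alerting product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    owner_team: str        # the developers who defined the alert
    summary: str           # the user-facing symptom being detected
    runbook_url: str = ""  # where a responder should look first


def missing_runbooks(rules):
    """Names of alert rules that would page someone without any guidance."""
    return [rule.name for rule in rules if not rule.runbook_url.strip()]


rules = [
    AlertRule(
        name="CheckoutLatencyHigh",
        owner_team="payments",
        summary="Checkout requests are slower than the SLO allows",
        runbook_url="https://runbooks.example.com/checkout-latency",  # placeholder
    ),
    AlertRule(
        name="SearchErrorRateHigh",
        owner_team="search",
        summary="Search is returning errors to users",
    ),
]
```

Here `missing_runbooks(rules)` returns `["SearchErrorRateHigh"]`, which could fail a review step until the owning team writes the runbook.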
  21. You need to monitor even in the cloud, because downtime costs you more than your cloud provider cares about. If nothing else, you need logs. Community! Your tools are useless if nobody can tell you how to use them, and they’re worse than useless if nobody in your organization uses them. Pick tools that you can use, show people what they can do, and make it easy to engage. The cloud looks infinite but it’s not! Think about less obvious failure modes and check for them. Think about what happens when scaling fails and what you’ll do as a result! Monitor for scale: alert for user impact, not for infrastructure impact. But scale your monitoring: be sure you have the infrastructure to monitor your infrastructure effectively. Consistency is key: use your tools, and use the _same_ tools everywhere.
  22. Apply the same ideas about architecture to observability. Don’t care about CPU utilization or memory utilization. Care about service health.
  23. It looks that way… until it isn’t!
  27. Throw a few charts in place and you’re good, right? [examples]