In this talk we explore some of the tools we built at Hailo to monitor our microservices platform. By combining instrumentation, in-depth service monitoring, request tracing, event correlation and automation frameworks, we are able to present a holistic view of our infrastructure.
2. Outline
• Intro to the Hailo world
• Platform Overview
• Monitoring Evolution
4. The Platform
Troll A platform by Swinsto101 / CC BY-SA 3.0 / Desaturated from original
5. Platform specifics
• SOA based on Go (and Java…)
• 1000+ AWS instances spanning multiple regions
• 160+ services in production
• Designed specifically for the cloud: different building blocks and components will constantly be in flux, broken or unavailable.
7. Provisioning Service
[Diagram, "Inside an environment". Components: CI Pipeline (Janky/Jenkins), Amazon S3, Docker Registry, Provisioning Manager, Provisioning Service (multiple instances)]
8. A micro-service under the hood
[Diagram of a service: Handler, Logic, Storage]
platform-layer: Library for abstracting service-to-service comms
service-layer: Self-configuring external service adapters
Any service gets for free:
• Provisioning
• Discovery
• Configuration
• Authentication/Authorization
• A/B testing capabilities
• Self-configuring connectivity to third-party services
• Monitoring
• Instrumentation
9. Mission:
Define high level platform and business metrics
Gather as many insights as possible
Add automatic failover and recovery capabilities
“Apollo 8 Launch Control Room” by Tfawls / Desaturated from original
11. Challenges
• A single StatsD instance and a generic Graphite setup cannot cope with all the traffic (surprise!)
• No easy way of generating and searching for graphs quickly
• We didn’t instrument everything
• “Traditional” monitoring systems can only give basic app insights
• Setting up app templates is a daunting manual process and does not scale
• No in-depth visibility into our main KPIs
• No way of identifying platform / release / config / cloud infrastructure changes
13. Iterate on what we already know
[Diagram. Components: Host Instance (CollectD, StatsD, Zabbix Agent); Graphite (Relay, three Caches); Zabbix; CloudWatch]
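The diagram places StatsD on the host instance, next to the services it measures. As a rough illustration only, the sketch below shows how a Go service could emit metrics to such a local agent using StatsD's plain-text line format (`bucket:value|type`) over UDP. The bucket name and the address `127.0.0.1:8125` (StatsD's conventional default port) are assumptions, not details taken from the talk.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// formatTiming renders a StatsD timer line, e.g. "api.login:12|ms".
func formatTiming(bucket string, d time.Duration) string {
	return fmt.Sprintf("%s:%d|ms", bucket, d.Milliseconds())
}

// formatCounter renders a StatsD counter line, e.g. "api.errors:1|c".
func formatCounter(bucket string, n int64) string {
	return fmt.Sprintf("%s:%d|c", bucket, n)
}

// send writes one metric line to the agent over UDP. UDP needs no
// handshake, so an absent agent does not stall the request path.
func send(addr, line string) error {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(line))
	return err
}

func main() {
	// "hypothetical-service.login" is a made-up bucket name.
	line := formatTiming("hypothetical-service.login", 12*time.Millisecond)
	fmt.Println(line)

	// Assumed address of the per-host StatsD agent.
	if err := send("127.0.0.1:8125", line); err != nil {
		fmt.Println("send failed:", err)
	}
	fmt.Println(formatCounter("hypothetical-service.login.errors", 1))
}
```

Sending to a local agent keeps each metric write on the loopback interface, which is one plausible reason a per-host StatsD scales better than a single shared instance.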
14. Result
• Scaling up Graphite and moving StatsD to every box allowed us to collect millions of metrics
• Instrumenting everything gives us a lot of insights
• Grafana allows us to quickly build, store and search for important graphs. Widely adopted by the whole development team!
Tip: Focus on upper 95th and 99th percentiles and work out from there.
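To make the tip concrete, here is a small sketch of why the upper percentiles are more revealing than an average. It uses the nearest-rank definition of a percentile and made-up latency samples; neither the data nor the helper names come from the talk.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100).
// The input is copied so the caller's slice keeps its order.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// mean is included only as a contrast to the percentiles.
func mean(samples []float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// sampleLatencies builds 100 made-up request latencies in milliseconds:
// 90 fast requests, 8 slow ones and 2 very slow ones.
func sampleLatencies() []float64 {
	var out []float64
	for i := 0; i < 90; i++ {
		out = append(out, 10)
	}
	for i := 0; i < 8; i++ {
		out = append(out, 200)
	}
	for i := 0; i < 2; i++ {
		out = append(out, 900)
	}
	return out
}

func main() {
	l := sampleLatencies()
	// Prints: mean=43.0 p95=200 p99=900
	fmt.Printf("mean=%.1f p95=%.0f p99=%.0f\n",
		mean(l), percentile(l, 95), percentile(l, 99))
}
```

In this sample the mean of 43 ms describes no request that actually happened, while the 95th and 99th percentiles surface the slow tail that some users really experience.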
17. Monitoring V2
[Diagram. Components: Provisioning Manager; Host Instances, each running a Provisioning Service; New Service; Monitoring Service; Message bus; Binding; Discovery. Label: Publish Healthchecks]
20. Result
• Service health checks give us in-depth service performance details
• The monitoring service has a holistic view of our platform health and can identify degraded availability zones
• Developers can identify what is important for their service and track & alert on it.
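One way to picture developer-defined health checks is a small registry that each service fills with its own probes and runs periodically. The sketch below is an assumption about the general shape of such a mechanism, not Hailo's actual library: `Registry`, `Result` and `maxValue` are hypothetical names, and publishing the results to a monitoring service over the message bus is left out.

```go
package main

import "fmt"

// Result is what a service would report for one check.
type Result struct {
	Name    string
	Healthy bool
	Detail  string
}

type check struct {
	name string
	run  func() error // a nil error means healthy
}

// Registry holds the checks a service has declared as important.
type Registry struct {
	checks []check
}

// Register adds a named check to the registry.
func (r *Registry) Register(name string, run func() error) {
	r.checks = append(r.checks, check{name: name, run: run})
}

// RunAll executes every check in registration order.
func (r *Registry) RunAll() []Result {
	results := make([]Result, 0, len(r.checks))
	for _, c := range r.checks {
		res := Result{Name: c.name, Healthy: true}
		if err := c.run(); err != nil {
			res.Healthy = false
			res.Detail = err.Error()
		}
		results = append(results, res)
	}
	return results
}

// maxValue builds a check that fails once current() exceeds limit.
func maxValue(current func() int, limit int) func() error {
	return func() error {
		if v := current(); v > limit {
			return fmt.Errorf("value %d exceeds limit %d", v, limit)
		}
		return nil
	}
}

func main() {
	var reg Registry
	// Both check names and readings are made up for illustration.
	reg.Register("queue-depth", maxValue(func() int { return 3 }, 100))
	reg.Register("open-connections", maxValue(func() int { return 950 }, 500))

	for _, res := range reg.RunAll() {
		fmt.Printf("%-18s healthy=%v %s\n", res.Name, res.Healthy, res.Detail)
	}
}
```

Keeping the threshold next to the service code lets the team that owns the service decide what "unhealthy" means, which matches the point of the last bullet.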
21. Trace++
Monitoring & Instrumentation
“Abstract conception of network and communication” by Leszekglasner / Desaturated from original
26. Result
• Trace incoming requests and pinpoint bottlenecks & SLA offenders
• Easily identify problems on the request/response path
• Quickly find out exactly which services participate in the request path
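The results above rest on every hop of a request carrying the same trace identifier. As a hedged sketch of the idea (the talk does not show the real trace format), assume each service emits a span recording the trace ID, its own name, and the time it spent excluding downstream calls. Grouping spans by trace ID then answers both "who took part?" and "who was slowest?". All type, function and service names here are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Span is one service's record of handling part of a traced request.
// Duration is assumed to exclude time spent waiting on downstream calls.
type Span struct {
	TraceID  string
	Service  string
	Duration time.Duration
}

// participants lists the services seen for a trace, in first-seen order.
func participants(spans []Span, traceID string) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range spans {
		if s.TraceID == traceID && !seen[s.Service] {
			seen[s.Service] = true
			out = append(out, s.Service)
		}
	}
	return out
}

// slowest returns the longest span of a trace, or false if none exist.
func slowest(spans []Span, traceID string) (Span, bool) {
	var best Span
	found := false
	for _, s := range spans {
		if s.TraceID != traceID {
			continue
		}
		if !found || s.Duration > best.Duration {
			best, found = s, true
		}
	}
	return best, found
}

// sampleSpans returns made-up spans for two traces.
func sampleSpans() []Span {
	return []Span{
		{"t1", "api", 4 * time.Millisecond},
		{"t1", "login", 12 * time.Millisecond},
		{"t1", "customer-store", 85 * time.Millisecond},
		{"t2", "api", 3 * time.Millisecond},
		{"t1", "login", 2 * time.Millisecond},
	}
}

func main() {
	spans := sampleSpans()
	fmt.Println("participants:", participants(spans, "t1"))
	if s, ok := slowest(spans, "t1"); ok {
		fmt.Printf("slowest hop: %s (%v)\n", s.Service, s.Duration)
	}
}
```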
29. Result
• Identify business-impacting issues immediately
• Highlight the service on the critical path that is most likely responsible for the problems
31. Platform events
[Diagram. Components: Host Instance (CollectD, StatsD, Zabbix Agent, Provisioning Service); Phosphor; Monitoring; SNS; Platform Events; Whisper Service; Persistent Storage; Dashboards. Label: Publish]
34. Result
• An answer to the most important question: “Did anything change?”
• Audit trail for any platform changes
• Holistic view of our platform status
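A minimal sketch of how a stream of platform events can answer "Did anything change?": keep every event with a timestamp and a source, then list the ones that fall inside a window before an incident. The `Event` fields, the source labels and the sample data are assumptions made for illustration, and the fixed date is arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// Event is one recorded platform change.
type Event struct {
	Time   time.Time
	Source string // assumed labels, e.g. "release", "config", "cloud"
	Detail string
}

// changesBefore returns the events that happened within the given
// window up to and including the incident time, in their stored order.
func changesBefore(events []Event, incident time.Time, window time.Duration) []Event {
	start := incident.Add(-window)
	var out []Event
	for _, e := range events {
		if !e.Time.Before(start) && !e.Time.After(incident) {
			out = append(out, e)
		}
	}
	return out
}

// at returns an arbitrary base time plus the given number of minutes.
func at(minutes int) time.Time {
	base := time.Date(2015, 1, 1, 12, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(minutes) * time.Minute)
}

// sampleEvents returns a made-up audit trail.
func sampleEvents() []Event {
	return []Event{
		{at(5), "release", "hypothetical-service deployed"},
		{at(20), "config", "timeout changed"},
		{at(28), "cloud", "instance terminated"},
		{at(40), "release", "another deploy"},
	}
}

func main() {
	incident := at(30)
	for _, e := range changesBefore(sampleEvents(), incident, 15*time.Minute) {
		fmt.Printf("%s  %-8s %s\n", e.Time.Format("15:04"), e.Source, e.Detail)
	}
}
```

The same stored events double as the audit trail mentioned above, since nothing is discarded after the query.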
35. It is not over yet!
++ Machine Learning
++ Event source weighting
36. Thanks!
PS. We’re hiring!
@nathariel
boyan@hailocab.com London DevOps