With the rise of micro-services and large-scale distributed architectures, software systems have grown increasingly complex and hard to understand. Adding to that complexity, the velocity of software delivery has also dramatically increased, resulting in failures being harder to predict and contain.
While the cloud allows for high availability, redundancy and fault tolerance, no single component can guarantee 100% uptime. Therefore, we have to understand availability and, above all, learn how to design architectures with failure in mind.
And since failures have become more and more chaotic in nature, we must turn to chaos engineering in order to identify failures before they become outages.
In this talk, I will take a deep dive into availability, reliability and large-scale architectures, and introduce chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more resilient systems.
65. What is Steady State?
• "normal" behavior of your system
• Ideally a business metric (for Netflix, SPS – stream starts per second)
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
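As a rough sketch of what a steady-state check might look like in practice (the fetch_sps() hook into your monitoring system and the tolerance band are illustrative assumptions, not a real API):

```python
# Minimal sketch of a steady-state check around a business metric,
# in the spirit of Netflix's SPS (stream starts per second).

def fetch_sps() -> float:
    """Hypothetical hook: query your monitoring system for the current
    stream-starts-per-second value."""
    raise NotImplementedError

def steady_state_ok(baseline: float, tolerance: float = 0.05) -> bool:
    """True if the metric stays within +/- tolerance of its baseline,
    i.e. the system still exhibits its 'normal' behavior."""
    current = fetch_sps()
    return abs(current - baseline) <= tolerance * baseline
```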
81. The Conveyor Belt Accident
Question: Why did the associate damage his thumb?
Answer: Because his thumb got caught in the conveyor.
Question: Why did his thumb get caught in the conveyor?
Answer: Because he was chasing his bag, which was on a running conveyor belt.
Question: Why did he chase his bag?
Answer: Because he had placed his bag on the conveyor, and it then unexpectedly turned on.
Question: Why was his bag on the conveyor?
Answer: Because he was using the conveyor as a table.
Conclusion: So the likely root cause of the associate's damaged thumb is that he simply needed a table; there wasn't one around, so he used the conveyor as a table.
https://www.linkedin.com/pulse/use-5-whys-find-root-causes-peter-abilla/
Hands up - how many of you can relate to this story? Great – so this session is dedicated to you
Short intro on the move from monolith to micro-services.
With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. As a result, “random” failures have grown difficult to predict. At the same time, our dependence on these systems has only increased.
Traditionally, these sensible measures to gain confidence are taken before systems or applications reach production. Once in production, the traditional approach is to rely on monitoring and logging to confirm that everything is working correctly. If it is behaving as expected, then you don't have a problem. If it is not, and it requires human intervention (troubleshooting, triage, resolution, etc.), then you need to react to the incident and get things working again as fast as possible.
This implies that once a system is in production, "Don't touch it!"—except, of course, when it's broken, in which case touch it all you want, under the time pressure inherent in an outage response.
https://queue.acm.org/detail.cfm?id=2353017
GameDays were coined by Jesse Robbins when he worked at Amazon and was responsible for availability. Jesse created GameDays with the goal of increasing reliability by purposefully creating major failures on a regular basis.
Superpower: Docker (Dockerfiles) instead of Chef or Puppet.
Invest time to save time
Writes and updates
Counters! Don't keep them in the DB – use Redis!
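A minimal sketch of that idea, assuming a local Redis instance; the key naming and flush strategy are illustrative only:

```python
# Keep hot counters (page views, likes, etc.) in Redis instead of
# issuing an UPDATE against the relational database on every hit.
import redis

r = redis.Redis(host="localhost", port=6379)

def record_page_view(page_id: str) -> int:
    # INCR is atomic, so many web servers can bump it concurrently.
    return r.incr(f"pageviews:{page_id}")

# Periodically (e.g. from a background job) flush the counters back to
# the database in one batch write instead of thousands of tiny updates.
```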
Database Federation is where we break up the database by function.
In our example, we have broken out the Forums DB from the User DB from the Products DB
Of course, cross-functional queries are harder to do, and you may need to perform the joins at the application layer for these types of queries (see the sketch below).
This will shrink each database's footprint for a while, and the great thing is it postpones having to shard until much further down the line.
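As a rough illustration of an application-layer join across federated databases (the connection objects, table and column names here are assumptions for the sketch):

```python
# Sketch: join forum posts with their authors when posts and users live
# in two separate (federated) databases.

def posts_with_authors(forums_conn, users_conn, thread_id):
    posts = forums_conn.execute(
        "SELECT id, user_id, body FROM posts WHERE thread_id = ?",
        (thread_id,),
    ).fetchall()
    if not posts:
        return []

    user_ids = tuple({p[1] for p in posts})
    placeholders = ",".join("?" for _ in user_ids)
    users = users_conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})",
        user_ids,
    ).fetchall()
    names = {u[0]: u[1] for u in users}

    # The "join" happens here, in application code, not in SQL.
    return [(post_id, names.get(uid), body) for post_id, uid, body in posts]
```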
This isn’t going to help for single large tables; for this we will need to shard.
Sharding is where we break up that single large database into multiple DBs. We might need to do this because of database or table size, or potentially for high write IOPS as well.
Here is an example of breaking up a database with a large table into 3 databases. Above we show where each userID is located, but the easiest way to describe how this works is: all users A–H go into one DB, I–M into another, and N–Z into the third.
Typically this is done by key space and your application has to be aware of where to read from, update and write to for a particular record. ORM support can help here.
This does create operational complexity, so if you can federate first, do that.
This can be done with SQL or NoSQL; DynamoDB does it for you under the covers as your data size and reads/writes per second grow.
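A minimal sketch of that key-space routing, following the A–H / I–M / N–Z split above (the shard map and names are illustrative assumptions; in practice an ORM's sharding support can hide this):

```python
# Pick the shard that owns a given user record, based on the first
# letter of the username (the key space).
SHARDS = {
    "users_shard_1": "ABCDEFGH",
    "users_shard_2": "IJKLM",
    "users_shard_3": "NOPQRSTUVWXYZ",
}

def shard_for_user(username: str) -> str:
    first = username[:1].upper()
    for shard, letters in SHARDS.items():
        if first in letters:
            return shard
    return "users_shard_3"  # fallback for non-alphabetic names

# The application (or ORM) must route every read, update and write for
# a record to the shard returned here, e.g. shard_for_user("alice").
```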
Route your website visitors to an alternate location to avoid site outages
Does a region fail?
• A full region: no
• Individual services can fail region-wide
• Most of the time it is a configuration issue
• Leading to cascading failures
Eventual consistency, also called optimistic replication, is widely deployed in distributed systems and has its origins in early mobile computing projects. A system that has achieved eventual consistency is often said to have converged, or achieved replica convergence. Eventual consistency is a weak guarantee – most stronger models, such as linearizability, are trivially eventually consistent, but a system that is merely eventually consistent does not usually fulfill these stronger constraints.
Eventual consistency
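A toy Python sketch of what "convergence" means under asynchronous replication (the in-memory replicas and replication queue are stand-ins for real nodes):

```python
from collections import deque

replicas = [{}, {}, {}]   # three copies of the same key/value data
pending = deque()         # asynchronous replication queue

def write(key, value):
    replicas[0][key] = value              # accept the write on one replica
    for r in replicas[1:]:
        pending.append((r, key, value))   # propagate to the others later

def replicate_one():
    r, key, value = pending.popleft()
    r[key] = value

write("user:42", "renamed")
print([r.get("user:42") for r in replicas])   # stale reads are possible here
while pending:
    replicate_one()
print([r.get("user:42") for r in replicas])   # converged: all replicas agree
```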
The stronger the relationship between the metric and the business outcome you care about, the stronger the signal you have for making actionable decisions.
Partial release to a subset of production nodes with sticky sessions turned on. That way you can control and minimize the number of users/customers impacted if you end up shipping a bad bug.
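One possible way to keep that subset sticky is to hash the user ID into a stable bucket (the 5% split and function name are illustrative assumptions):

```python
import hashlib

CANARY_PERCENT = 5  # share of users routed to the partial release

def serves_canary(user_id: str) -> bool:
    # Hashing keeps the assignment stable ("sticky"): the same user
    # always lands in the same bucket across requests.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < CANARY_PERCENT

# Only ~5% of users ever see the new build; rolling back a bad bug is
# as simple as setting CANARY_PERCENT to 0.
```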
Go fix it!
After running your first experiment, there will be one of two outcomes: either you've verified that your system is resilient to the failure you introduced, or you've found a problem you need to fix. Both of these are good outcomes. On the one hand, you've increased your confidence in the system and its behavior; on the other, you've found a problem before it caused an outage.