Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2noNSfO.
Michelle Brush discusses modeling complex systems and architectural changes that could introduce new modes of failure, using examples from embedded systems to large stream processing pipelines. Filmed at qconsf.com.
Michelle Brush works as an Engineering Director for Cerner Corporation, where she is responsible for the data ingestion and processing platform for Cerner’s Population Health solutions. She also leads several engineering education programs and culture initiatives including Cerner’s software architect development program and internal developer conference.
2. InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
architecture-failure-modes
3. Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
4. Michelle Brush
Engineering Director, Cerner Corporation
Chapter Leader, Kansas City Girl Develop It
Conference Organizer, Midwest.io
@michellebrush
5. Michelle Brush
Engineering Director, Cerner Corporation
Chapter Leader, Kansas City Girl Develop It
Conference Organizer, Midwest.io
@michellebrush
65. 1. Develop system model.
2. Record known risk areas.
3. Publish model and risk areas.
4. Perform regular risk reviews. (Premortems)
5. Dissect and document missed risks.
66. 1. Develop system model.
2. Record known risk areas.
3. Publish model and risk areas.
4. Perform regular risk reviews. (Premortems)
5. Dissect and document missed risks.
68. The behavior of a system cannot be known
just by knowing the elements of which the
system is made.
69. “Accidents occur due to relationships
not components.”
- Sidney Dekker
Drift Into Failure: From Hunting Broken
Components to Understanding Complex
Systems
95. A*
graph: nodes and edges that need to be
searched.
open list: a list of nodes that haven’t
been explored.
closed list: a list of nodes that have been
explored.
a graph search
algorithm
96. Open List Closed List
A
B
2
C
3
2
D
4
2
E
4
1
4
My Example Graph
0
2
2
1
3
97. Open List Closed List
A 3
A
B
2
C
3
D
4
My Example Graph
2
2
1
3
98. Open List Closed List
B 4
C 4
D 6
A 3
A
B
2
C
3
D
4
My Example Graph
2
2
1
3
99. Open List Closed List
A 3
A
B
2
C
3
2
D
4
E
4
My Example Graph
0
2
2
1
3
B 4
C 4
D 6
100. Open List Closed List
A 3
B 4
A
B
2
C
3
2
D
4
E
4
My Example Graph
0
2
2
1
3
C 4
D 6
E 6
101. 1 GB of hard disk
1 MB of main memory
CPU that’s competitive with the average
computer in the late 90’s.
116. thousands a day
everyday for a year
In the file were identifiers that needed to be globally unique.
117. thousands a day
everyday for a year
In the file were identifiers that needed to be globally unique.
You could change anything in the file passively -
except that identifier.
118. thousands a day
everyday for a year
In the file were identifiers that needed to be globally unique.
You could change anything in the file passively -
except that identifier.
If you change the identifier , you have to
reprocess the file.
119. thousands a day
everyday for a year
In the file were identifiers that needed to be globally unique.
You could change anything in the file passively -
except that identifier.
If you change the identifier , you have to
reprocess the file.
Someone changed all of the identifiers.
144. 1. Develop system model.
2. Record known risk areas.
3. Publish model and risk areas.
4. Perform regular risk reviews. (Premortems)
5. Dissect and document missed risks.
146. 1. Develop system model.
2. Record known risk areas.
3. Publish model and risk areas.
4. Perform regular risk reviews. (Premortems)
5. Dissect and document missed risks.
153. 1. Develop system model.
2. Record known risk areas.
3. Publish model and risk areas.
4. Perform regular risk reviews. (Premortems)
5. Dissect and document missed risks.
154. Premortems
1. Inform everyone the system has failed.
2. Ask the group to identify most likely causes.
3. Adjust the plan to account for those risks.
155. 1. List ALL changes going in the release.
2. Review each change in terms of the risk matrix.
3. If it’s RED, it doesn’t go. (Pull the cord.)
4. If it’s ORANGE, we need a monitoring & mitigation plan.
5. If it’s YELLOW, we need at least a mitigation.
Risk Reviews
156. 1. List ALL changes going in the release.
2. Review each change in terms of the risk matrix.
3. If it’s RED, it doesn’t go. (Pull the cord.)
4. If it’s ORANGE, we need a monitoring & mitigation plan.
5. If it’s YELLOW, we need at least a mitigation.
Risk Reviews
157. 1. List ALL changes going in the release.
2. Review each change in terms of the risk matrix.
3. If it’s RED, it doesn’t go. (Pull the cord.)
4. If it’s ORANGE, we need a monitoring & mitigation plan.
5. If it’s YELLOW, we need at least a mitigation.
Risk Reviews
162. 1. List ALL changes going in the release.
2. Review each change in terms of the risk matrix.
3. If it’s RED, it doesn’t go. (Pull the cord.)
4. If it’s ORANGE, we need a monitoring & mitigation plan.
5. If it’s YELLOW, we need at least a mitigation.
Risk Reviews
164. 1. List ALL changes going in the release.
2. Review each change in terms of the risk matrix.
3. If it’s RED, it doesn’t go. (Pull the cord.)
4. If it’s ORANGE, we need a monitoring & mitigation plan.
5. If it’s YELLOW, we need at least a mitigation.
166. 1. List ALL changes going in the release.
2. Review each change in terms of the risk matrix.
3. If it’s RED, it doesn’t go. (Pull the cord.)
4. If it’s ORANGE, we need a monitoring & mitigation plan.
5. If it’s YELLOW, we need at least a mitigation.
180. 1. Develop system model.
2. Record known risk areas.
3. Publish model and risk areas.
4. Perform regular risk reviews. (Premortems)
5. Dissect and document missed risks.
183. 1. List ALL changes going in the release.
2. Review each change in terms of the risk matrix.
3. If it’s RED, it doesn’t go. (Pull the cord.)
4. If it’s ORANGE, we need a monitoring & mitigation plan.
5. If it’s YELLOW, we need at least define a mitigation.
It’s really, really hard to pull the cord.
184. “I think we’re not ready, and I
already told leadership we’re
going to be late, should I go tell
them I’m wrong?”
186. 1. Develop system model.
2. Record known risk areas.
3. Publish model and risk areas.
4. Perform regular risk reviews. (Premortems)
5. Dissect and document missed risks.
As you go through this process…
194. 1. Develop system model.
2. Record known risk areas.
3. Publish model and risk areas.
4. Perform regular risk reviews. (Premortems)
5. Dissect and document missed risks.
Process
200. My mom is a very extroverted person.
She thinks out loud.
I learned to think like her.
201. Thank you for letting me think out loud.
Michelle Brush
Engineering Director, Cerner Corporation
Chapter Leader, Kansas City Girl Develop It
Conference Organizer, Midwest.io
@michellebrush