GOTO Chicago 2017: Bryan Cantrill, "Debugging Under Fire"
Raw notes.
Recounting the story of a Joyent outage, in which an operator intending to reboot a handful of servers accidentally rebooted an entire data center.
Here is The Register piece on the outage.
How did we get here?
He points out the flip side of such highly sophisticated automation: the stress on the humans in the loop is amplified. Human fallibility in a semi-automated system is worse than human fallibility in a non-automated system.
Human fallibility in semi-automated systems
Recounted the story of the Air Canada flight that ran out of fuel mid-flight: a 767-200 in 1983, the "Gimli Glider". The fuel mishap was due to an imperial-to-metric conversion error: the fuel load was calculated in pounds instead of kilograms during Canada's transition to the metric system.
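An aside of my own, not from the talk: one software-side defense against this class of unit mix-up is to give each unit its own type, so the compiler rejects any implicit crossing between them. A minimal Go sketch (the names Pounds, Kilograms, and logFuel are illustrative assumptions, and the number is arbitrary):

```go
package main

import "fmt"

// Distinct types make it a compile-time error to pass pounds
// where kilograms are expected.
type Kilograms float64
type Pounds float64

// ToKilograms is the one place the two units meet, so the
// conversion factor cannot silently be skipped or misapplied.
func (p Pounds) ToKilograms() Kilograms {
	return Kilograms(float64(p) * 0.45359237)
}

func logFuel(fuel Kilograms) {
	fmt.Printf("fuel on board: %.0f kg\n", float64(fuel))
}

func main() {
	loaded := Pounds(22300)
	// logFuel(loaded)              // compile error: Pounds is not Kilograms
	logFuel(loaded.ToKilograms()) // conversion must be explicit
}
```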
Amazon S3 outage (February 2017): an operator working from a playbook mistyped a command and removed far more capacity than intended.
Whither microservices?
Microservices suffer from the amplification problem mentioned above.
Some non-IT illustrations
1965 power outage in the Northeast
This illustrates the notion that the load has to go somewhere: when one part of the grid trips, its load shifts to neighbors, which can cascade.
Used the example of Three Mile Island. When you have auxiliary systems, those systems sit unexercised and unchecked until the moment you need them. And the more alarms and alerts you have, the more likely they are to overload the operators.
We are gleefully deploying these distributed systems and telling ourselves they will not fail.
Debugging in the abstract
Debugging is the process by which we understand pathological behavior in the system.
I like how he acknowledges that we have it easy in the software world compared to the real world. He is a very entertaining speaker, but I don't like how he yells at us.
Debugging is the ability to ask the right questions. He described working within a continually narrowing set of constraints until only the actual explanation remains.
The craft of debuggable software
One slide as a nod to what you need to do to make things debuggable.
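The slide itself isn't in my notes, but the gist, as a minimal sketch of my own in Go (the transfer function and its names are hypothetical, not from the talk): when an operation fails, report every piece of state that led to the failure, so a postmortem starts from the failing invariant rather than a bare error message.

```go
package main

import (
	"fmt"
	"log"
)

// transfer debits one account and credits another. On failure it
// carries the full state it was working with, so debugging can
// begin from the violated invariant, not from "transfer failed".
func transfer(balances map[string]int, from, to string, amount int) error {
	if balances[from] < amount {
		// Fail loudly, with every input that led here.
		return fmt.Errorf("transfer: insufficient funds: from=%q balance=%d amount=%d to=%q",
			from, balances[from], amount, to)
	}
	balances[from] -= amount
	balances[to] += amount
	return nil
}

func main() {
	balances := map[string]int{"alice": 50, "bob": 0}
	if err := transfer(balances, "alice", "bob", 100); err != nil {
		// In a real system this might abort and preserve a core dump
		// so the full state survives for postmortem debugging.
		log.Fatal(err)
	}
}
```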
A culture of debugging
We must have an organizational culture that supports taking the extra time for building for debuggability.
When you have an outage, you need to harvest all the useful information and learn from it. Every outage presents an opportunity to advance understanding.