I've read the Amazon Cloud outage blog entry
The analysis of the recent Amazon EBS outage has been posted on the AWS blog [1]. I read it and was incredibly impressed with the detail, depth, clarity and competency shown. Of course, they need to impress the users with those things because they're asking us to trust them with our crown jewels. Even knowing that they are in the business of instilling confidence, it's clear the AWS team has their act together.
The document explains exactly what went wrong, why it went wrong, and lists detailed steps they say they will take to fix it. Their approach to resolution spans several dimensions, from technical changes to human communication to education. (If only the Final Report on the Investigation of the Macondo Well Blowout were so clear on the steps to take to prevent failure. [3]) On the education front, Amazon evidently recognizes that developer education is a key element of building their platform. To that end, they are offering a webinar series, starting Monday 2 May, on making it easier to take advantage of multiple Availability Zones. You can see the schedule at [2].
[1] http://aws.amazon.com/message/65648/
[2] http://aws.amazon.com/architecture/
[3] http://ccrm.berkeley.edu/pdfs_papers/bea_pdfs/DHSGFinalReport-March2011-tag.pdf
Technorati Tags: edburns