Steve Loughran: "If you read the thread, a lot of people are upset. What's going
on? they demand; fix it! they say. I sympathise with their point of
view, but I don't agree with it. AWS can't say what's going on, not
until they know. They can't fix it until after that. I've been on
the receiving end of these 'fix it now' crises, and having lots of
people on the phone doesn't help you find the problem any faster.
So well done to the AWS team to (a) fixing it fairly quickly and
(b) having so many users that the outage got such publicity!"
Lister's law: "people under time pressure don't think faster". I agree with Steve. Within his point is a general anti-pattern: when you're in the weeds, trying to go faster only adds stress. Yet that's standard operating procedure. The most important thing seems to be to design a system that can be fixed in place when bad things happen. That means attending to a lot of things that aren't normally considered part of software "design": build management, release management, issue tracking, testing, configuration, deployment, rollbacks, logistics, cross-departmental processes, SLAs, findable documents and stacktraces. I really hope Steve writes a continuous deployment book.