Incidents
- incidents
- five whys
- blameless postmortems
- identify causes (not culprits)
- assume good will
- take your time
- dangers of automation
- observability
- human bias
- human factors
- put percentages on work
- what percentage of work should go to which kind of work?
- figure out how to track, make it easy/automated
- identify real vs. desired, figure out how to get (closer) to desired
- "Just Culture" (as in, justice, it is just)
- bad apple theory = remove the small percentage of bad apples and the problem goes away
- bad apple theory = debunked
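On the notes above about putting percentages on work: a minimal sketch of comparing the real split with the desired one, assuming work is already logged as (category, hours) pairs. The category names, the target numbers, and the function names are all hypothetical placeholders, not a recommended split.

```python
from collections import Counter

# Hypothetical target split in percent; categories and numbers are placeholders.
DESIRED_SHARE = {"feature": 50.0, "toil": 20.0, "incident": 10.0, "improvement": 20.0}


def actual_share(entries):
    """Turn (category, hours) pairs into the percentage of total hours per category."""
    hours = Counter()
    for category, spent in entries:
        hours[category] += spent
    total = sum(hours.values())
    if total == 0:
        # Nothing logged yet, so there is no split to report.
        return {}
    return {category: 100.0 * spent / total for category, spent in hours.items()}


def gap_to_desired(entries, desired=DESIRED_SHARE):
    """Return actual minus desired, in percentage points, for every known category."""
    actual = actual_share(entries)
    categories = set(actual) | set(desired)
    return {c: actual.get(c, 0.0) - desired.get(c, 0.0) for c in sorted(categories)}


# Example input: one week of logged hours (made-up numbers).
week = [("feature", 20), ("toil", 16), ("incident", 4)]
gap = gap_to_desired(week)
```

A positive gap means more time went to that category than intended, a negative gap means less. Feeding this from an existing ticket or time tracker keeps the tracking itself cheap.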
Notes
- most incidents happen near updates/upgrades/deployments
- make deployments a non-event
- small increments, high frequency, automated, tested
- it is not systems, it is not humans, but humans within systems
- humans will make mistakes
- processes are (almost) always part of the problem
- automation needs to include sanity checks (ranges of sane values)
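The last note, sanity checks with ranges of sane values, can be sketched as a guard that automation calls before it acts. This is a minimal sketch under assumptions: the parameter names, the bounds, and the `check_sane` helper are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaneRange:
    """Inclusive bounds that an automated change must stay within."""
    minimum: float
    maximum: float

    def contains(self, value: float) -> bool:
        return self.minimum <= value <= self.maximum


# Hypothetical limits for illustration; real bounds depend on the system.
SANE_RANGES = {
    "replica_count": SaneRange(1, 50),
    "servers_to_remove": SaneRange(0, 5),
}


class OutOfSaneRangeError(ValueError):
    """Raised when automation is asked to apply a value it cannot vouch for."""


def check_sane(name: str, value: float) -> float:
    """Return the value unchanged if it is within its sane range, otherwise raise."""
    if name not in SANE_RANGES:
        # Fail closed: a parameter without a defined range is not applied.
        raise OutOfSaneRangeError(f"no sane range defined for {name!r}")
    allowed = SANE_RANGES[name]
    if not allowed.contains(value):
        raise OutOfSaneRangeError(
            f"{name}={value} is outside the sane range "
            f"[{allowed.minimum}, {allowed.maximum}]"
        )
    return value
```

With these placeholder bounds, `check_sane("replica_count", 3)` returns 3, while `check_sane("servers_to_remove", 500)` raises instead of letting a mistyped input take out most of a fleet. Failing closed on unknown parameters is a design choice here: the automation stops and asks a human rather than guessing.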
References
Below is a sizeable collection of references to resources that tackle the different parts of incident management. They explain it better than I ever could, so use them to improve your own understanding, just as I have.
Books
Talks
- Ironies Of Automation
- Google SRE: Postmortems and Retrospectives
- John Allspaw: Blameless Post Mortems
Papers
- Patient Safety and the "Just Culture" - Marx D
- How do systems manage their adaptive capacity to successfully handle disruptions - M Branlat & D Woods
- Ironies Of Automation - Lisanne Bainbridge
- Managing The Development Of Large Software Systems - Winston Royce
Articles
- Weathering The Unexpected - Kripa Krishnan
- John Allspaw: a mature role for automation
- John Allspaw: Resilience Engineering: Part I
- John Allspaw: getting the messy details is critical
- John Allspaw: Ask Me Anything
- John Allspaw: Blameless PostMortems And A Just Culture
- Etsy's Postmortem Process
- Etsy's Winning Secret: Don't Play The Blame Game
- Blameless Postmortems at Etsy
- Google SRE - Postmortem Culture: Learning from Failure
- Honeycomb.io - Incident Review
- GitHub Outage Incident Analysis
- Google Postmortem
- AWS Postmortem (S3 outage)
- GitHub Page Listing Public Post Mortems
- Charity Majors: I Test In Prod
- The Network Is Reliable: an informal survey of real-world communications failures
- Charity Majors: Shipping Software Should Not Be Scary
- Circle CI: A brief history of devops part I: waterfall
- Circle CI: A brief history of devops part II: agile development
- Circle CI: A brief history of devops part III: automated testing and continuous integration
- Circle CI: A brief history of devops part IV: continuous delivery and deployment
Last update: 2019-09-21 21:59:17