Incidents

incidents
five whys
blameless postmortems
- identify causes (not culprits)
- assume good will
- take your time
dangers of automation
observability
human bias
human factors
split work by percentage
- what percentage of work should go to which kind of work?
- figure out how to track it; make tracking easy/automated
- identify real vs. desired split, figure out how to get (closer) to desired
"Just Culture" ("just" as in justice: the culture is just)
bad apple theory = debunked
- bad apple theory = remove the small percentage of bad apples and the problem goes away
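
The work-percentage item above (track the real split, compare it with the desired split) could be sketched roughly as follows. This is only an illustration: the function name `allocation_gaps`, the category names, and the numbers are all made up for the example and do not come from any particular tracking tool.

```python
def allocation_gaps(hours_by_category, desired_share):
    """Return actual share minus desired share for every category.

    hours_by_category: tracked hours per kind of work (hypothetical input).
    desired_share: target fraction (0.0 to 1.0) per kind of work.
    A positive gap means more time is spent there than desired.
    """
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no tracked hours to compare against")
    categories = sorted(set(hours_by_category) | set(desired_share))
    return {
        c: hours_by_category.get(c, 0) / total - desired_share.get(c, 0.0)
        for c in categories
    }


# Illustrative numbers only.
tracked = {"incidents": 10, "features": 20, "toil": 10}
desired = {"incidents": 0.10, "features": 0.60, "toil": 0.30}

for category, gap in allocation_gaps(tracked, desired).items():
    print(f"{category}: {gap:+.0%} relative to desired")
```

The hard part is not the arithmetic but getting the tracked hours with little effort, which is why the tracking itself should be easy or automated.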

Notes

most incidents happen near updates/upgrades/deployments
- make deployments a non-event
- small increments, high frequency, automated, tested
it is not systems, it is not humans, but humans within systems
- humans will make mistakes
- processes are (almost) always part of the problem
- automation needs to include sanity checks (ranges of sane values)
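
A sanity check on ranges of sane values, as mentioned in the last item, might look something like the sketch below. The setting names (`replica_count`, `timeout_seconds`), the limits, and the helper `check_change` are hypothetical and chosen only for illustration; real limits depend on the system being automated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaneRange:
    """Inclusive bounds that an automated change must stay within."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


# Hypothetical limits, for illustration only.
SANE_RANGES = {
    "replica_count": SaneRange(1, 50),
    "timeout_seconds": SaneRange(0.1, 300),
}


def check_change(name: str, value: float) -> None:
    """Raise ValueError instead of applying a value outside its sane range."""
    sane = SANE_RANGES.get(name)
    if sane is None:
        # Unknown settings are rejected rather than silently allowed.
        raise ValueError(f"no sane range defined for {name!r}; refusing to apply")
    if not sane.contains(value):
        raise ValueError(
            f"{name}={value} is outside the sane range [{sane.low}, {sane.high}]"
        )


check_change("replica_count", 3)  # within range, passes silently
try:
    check_change("replica_count", 0)  # scaling to zero is rejected
except ValueError as err:
    print(f"blocked: {err}")
```

Failing closed on unknown settings is a design choice in this sketch: an automation step that cannot prove a value is sane stops and asks a human instead of proceeding.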

References

Below is a sizable collection of references to resources that tackle the different parts of incident management. They explain it better than I ever could, so use them to improve your own understanding, just as I have.

Books

Talks

Papers

Articles

Last update: 2019-09-21 21:59:17