Incidents

  • incidents
  • five whys
  • blameless postmortems
    • identify causes (not culprits)
    • assume good will
    • take your time
  • dangers of automation
  • observability
  • human bias
  • human factors
  • allocate work by percentage
    • what percentage of the work should go to which kind of task?
    • figure out how to track, make it easy/automated
    • identify real vs. desired, figure out how to get (closer) to desired
  • "Just Culture" (as in justice: the culture is just)
  • bad apple theory = debunked
    • bad apple theory = remove the small percentage of bad apples and the problem goes away
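The bullets above on splitting work by percentage (track it, compare real vs. desired) can be sketched roughly as follows. This is a minimal illustration, not a prescribed method: the category names, the target split in `DESIRED`, and the helper names `actual_percentages` and `allocation_gap` are all assumptions made up for the example.

```python
# Hypothetical target split in percent; categories and numbers are
# illustrative assumptions, not recommendations from these notes.
DESIRED = {"feature": 50.0, "toil": 20.0, "incident": 10.0, "improvement": 20.0}


def actual_percentages(hours_by_category):
    """Turn logged hours per category into percentages of the total."""
    total = sum(hours_by_category.values())
    if total == 0:
        # Nothing logged yet: report 0% everywhere instead of dividing by zero.
        return {category: 0.0 for category in hours_by_category}
    return {
        category: 100.0 * hours / total
        for category, hours in hours_by_category.items()
    }


def allocation_gap(desired, hours_by_category):
    """Return actual minus desired percentage points for every category.

    A positive value means more time is spent there than desired.
    """
    actual = actual_percentages(hours_by_category)
    categories = sorted(set(desired) | set(actual))
    return {c: actual.get(c, 0.0) - desired.get(c, 0.0) for c in categories}


# Hypothetical hours logged in one iteration (for example, exported from a tracker).
logged = {"feature": 30, "toil": 40, "incident": 20, "improvement": 10}
gap = allocation_gap(DESIRED, logged)

for category, delta in gap.items():
    print(f"{category:12s} {delta:+.1f} percentage points vs. desired")
```

Feeding `logged` from whatever tracker the team already uses keeps the tracking cheap; the gap per category then shows where to steer to get closer to the desired split.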

Notes

  • most incidents happen near updates/upgrades/deployments
    • make deployments a non-event
    • small increments, high frequency, automated, tested
  • it is not systems, and it is not humans, but humans within systems
    • humans will make mistakes
    • processes are (almost) always part of the problem
    • automation needs to include sanity checks (ranges of sane values)
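As a rough sketch of the last point, an automated change can be checked against ranges of sane values before it is applied. The setting names, the limits in `LIMITS`, and the `validate_change` helper are hypothetical and only serve the illustration; real limits would come from what is known to be safe for the system in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaneRange:
    """Inclusive range of values considered sane for one setting."""
    low: float
    high: float

    def contains(self, value):
        # A NaN fails both comparisons and is therefore rejected as well.
        return self.low <= value <= self.high


# Hypothetical limits; the names and numbers are assumptions for this example.
LIMITS = {
    "replicas": SaneRange(1, 50),
    "timeout_seconds": SaneRange(0.1, 120),
}


def validate_change(change):
    """Return a list of problems; an empty list means the change looks sane."""
    problems = []
    for name, value in change.items():
        limit = LIMITS.get(name)
        if limit is None:
            # Unknown settings are flagged rather than silently accepted.
            problems.append(f"{name}: no sane range defined")
        elif not limit.contains(value):
            problems.append(
                f"{name}: {value} is outside the sane range "
                f"[{limit.low}, {limit.high}]"
            )
    return problems


# A typo such as 5000 instead of 5 is caught before the automation acts on it.
print(validate_change({"replicas": 5000, "timeout_seconds": 30}))
```

Returning every problem at once, instead of stopping at the first one, gives the person reviewing a blocked change the full picture in a single pass.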

References

Below is a substantial collection of references that cover the different parts of incident management. They explain it better than I ever could, so use them to improve your own understanding, just as I have.

Books

Talks

Papers

Articles


Last update: 2019-09-21 21:59:17