Picking up the pieces – A look at how to run post incident reviews.

We all know that the beautiful world of software is like a basketball balanced on a Jenga game. It’s fine while it’s fine, but all it takes is a semicolon in the wrong place, a bad database failover or an unlucky road worker cutting a cable and the whole world comes crashing down. Fortunately, we’re all good people and when the pieces are on the floor we pull together and get it all balanced again. But then what?

It’s not good enough to just forget about it, sweep it under the rug and pretend it never happened. Production incidents are part of life. More than that, they are great learning opportunities. By examining what went wrong and identifying the contributing factors, we can not only learn how to prevent this happening again, but also pre-empt other issues within our system.

This talk looks at strategies for Post Incident Review, starting with commonplace practices including Fishbone Diagrams, Sticky Note Brainstorming and the Five Whys method of root cause analysis. It then goes beyond these practices to show how, by leveraging modern monitoring practices alongside ChatOps, we can identify the contributing factors, not just a single cause.

These contributing factors can be used to find actionable outcomes that not only help prevent the same thing happening again but feed back into the incident life cycle to make the team better prepared to Detect, Respond to, Remediate and Analyse the next incident.

Last year on Christmas Eve, Klee got a phone call and had to jump onto his laptop and fix problems… glass of champagne in hand. But he realised this kept happening and needed to change.

Like any other team, NIB are trying their best to do a good job, with all the buzzwords (agile, pairing, TDD, CI/CD, devops…). But as with everyone else, customers always want more and legacy code keeps growing. Things will inevitably go wrong. You need to build a culture that copes with failure and is prepared to recover.

Post incident review determines if you learn from incidents or not.

Incident life cycle:

  • detection – working out that something is happening
  • response – you work out what you’re going to do
  • resolution – going ahead with a fix
  • analysis – pulling apart what happened (this feeds into all the other steps in the life cycle)
  • readiness – preparation for next time
  • …then back to detection

When should you run a PIR? ASAP! Within 2 days. People forget very quickly and replace real memories with their own version of events. Run PIRs regularly, on smaller incidents as well as big ones – so you’re practised and know what to do.

The path to a great PIR

Root cause analysis – get right down to the thing that caused the system to break. The best solution is the Five Whys technique: keep asking ‘why’ until you get past the surface reasons for an incident. But beware of blaming an individual, and beware that the people you ask will determine the answer you reach. Blame culture leads to fear, and fear leads to people hiding what really went on.

Blame is weird. People don’t blame one person for a big project’s success, so why do we blame a single person for a big project’s failures?

Blame the process, not the people. – W. Edwards Deming

Go back to Norm Kerth’s prime directive:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

A good tool is a fishbone (Ishikawa) diagram, which lets you identify primary and secondary causes for a specific problem.

But you have to guard against biases:

  • anchoring – the first piece of evidence is the most relevant
  • availability – I can think of it so it’s true (juniors do this a lot)
  • bandwagon effect – getting swept up in the crowd
  • others – hindsight, outcome, confirmation

…many of them quickly lead back to blame culture. Yet again, go back to the prime directive of retrospectives.

Other useful things to do:

  • Create a TLDR summary of an incident and its resolution.
  • Create a timeline of events, with multiple points of view.
  • Elaborate
    • don’t hide what happened
    • don’t ask ‘why X happened’ if you can instead ask ‘what contributed to X happening’; look at the factors that led to a decision rather than the decision itself
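
As a rough illustration of building a timeline from multiple points of view, here is a minimal Python sketch that merges events from two sources into one ordered list. The `TimelineEvent` class, the `merge_timelines` function, the source labels and all of the sample events are assumptions invented for this example; they are not from the talk or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime   # when it happened
    source: str    # who or what reported it (hypothetical labels)
    note: str      # what was observed or done


def merge_timelines(*timelines):
    """Combine several per-source event lists into one list ordered by time."""
    merged = [event for timeline in timelines for event in timeline]
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(merged, key=lambda event: event.at)


# Hypothetical data, invented purely for illustration.
monitoring = [
    TimelineEvent(datetime(2024, 1, 1, 9, 0), "monitoring", "error-rate alert fired"),
    TimelineEvent(datetime(2024, 1, 1, 9, 40), "monitoring", "error rate back to normal"),
]
chat = [
    TimelineEvent(datetime(2024, 1, 1, 9, 5), "chat", "on-call engineer acknowledges the alert"),
    TimelineEvent(datetime(2024, 1, 1, 9, 30), "chat", "rollback started"),
]

timeline = merge_timelines(monitoring, chat)
for event in timeline:
    print(f"{event.at:%H:%M} [{event.source}] {event.note}")
```

Keeping the `source` on every event means the merged timeline still shows whose point of view each entry came from, which helps when two accounts disagree.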

Key metrics:

  • who was involved
  • time to acknowledge
  • time to resolve
  • severity
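
The two time-based metrics above can be derived from three timestamps. The sketch below is one possible way to do it in Python; the function name `incident_metrics`, its parameter names and the sample times are all hypothetical. It assumes both durations are measured from the moment of detection, which is only one convention (some teams measure time to resolve from acknowledgement instead).

```python
from datetime import datetime, timedelta


def incident_metrics(detected_at, acknowledged_at, resolved_at):
    """Return time to acknowledge and time to resolve, both measured from detection."""
    if not detected_at <= acknowledged_at <= resolved_at:
        raise ValueError("timestamps must be in order: detected, acknowledged, resolved")
    return {
        "time_to_acknowledge": acknowledged_at - detected_at,
        "time_to_resolve": resolved_at - detected_at,
    }


# Hypothetical incident times, invented purely for illustration.
metrics = incident_metrics(
    detected_at=datetime(2024, 1, 1, 9, 0),
    acknowledged_at=datetime(2024, 1, 1, 9, 5),
    resolved_at=datetime(2024, 1, 1, 9, 40),
)
for name, duration in metrics.items():
    print(f"{name}: {duration}")
```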

Remember to go through what went well – something has gone well, because you’re in a PIR and not still trying to fix it!

Action items:

  • Document them as they come up
  • Identify impact and urgency
  • Commit to some but usually not all
  • Actually do the ones you commit to – put them into tickets and schedule them
  • Feed back into all stages of the life cycle
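
One possible way to turn "identify impact and urgency" and "commit to some but usually not all" into something concrete is to score each action item and commit to the top few. The Python sketch below is only an illustration: the 1 to 3 scales, the impact-times-urgency score, the `capacity` cut-off and the sample items are assumptions made for this example, not a method described in the talk.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str
    impact: int   # 1 (low) to 3 (high); hypothetical scale
    urgency: int  # 1 (low) to 3 (high); hypothetical scale

    @property
    def score(self):
        # Simple product of the two ratings; an assumed scoring rule.
        return self.impact * self.urgency


def choose_commitments(items, capacity):
    """Rank items by score and split them into (committed, deferred)."""
    ranked = sorted(items, key=lambda item: item.score, reverse=True)
    return ranked[:capacity], ranked[capacity:]


# Hypothetical action items, invented purely for illustration.
items = [
    ActionItem("Refactor legacy billing module", impact=2, urgency=1),
    ActionItem("Add alert on database failover lag", impact=3, urgency=3),
    ActionItem("Write runbook for rollback", impact=3, urgency=2),
]

committed, deferred = choose_commitments(items, capacity=2)
for item in committed:
    print(f"commit: {item.title} (score {item.score})")
for item in deferred:
    print(f"defer:  {item.title} (score {item.score})")
```

The committed items are the ones that would then go into tickets and get scheduled; the deferred list stays documented so it can be revisited after the next incident.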

If you can do all this well, your post-mortem can become a pre-mortem: something that will help you avoid doing this again.