Once upon a time, in a galaxy far away, we had a week where things just kept breaking.

Some of the broken things were in production environments. Some were in engineering environments.

Some of them were incidents. Some of them were merely… issues.

The broken things were roughly like the following:

The details aren’t what matters. The point is that there were a lot of technical issues in a short span of time.

This was exacerbated by a series of other factors:

I started to think that our incident response strategies were fundamentally built to handle one issue at a time.

So: What do you do when there are more incidents than there are people on call?

What we did: Improvisation

We did not have an explicit plan for this situation. Our process assumed there would basically be one incident at a time. If only incidents were so well behaved.

So we improvised. We entered a zone of undefined organizational behavior.

We were a group of experienced software engineers, relatively committed to our work. We leaned on each other for support. People who were not on call got pulled in.

Soon there were people losing sleep, staying up late out of concern, and just getting worn out.

Later, things did start to calm down.

What would I do next time?

What can one do to make it better?

I think it starts with recognizing that incident fatigue is just its own thing, categorically distinct from other kinds of problems. It is a particular state one can, unfortunately, get into.

Of course, tech business culture gives us some general strategies about what to do.

I’m not against any of these strategies, in general terms, but I’m not sure they help enough. I keep thinking that there’s more we could do.

Maybe there could be some other division of labor, where we move beyond the “first responder” model, and try to solve issues more directly as a team.

Maybe there could be some way to acknowledge the fatigue and get some extra rest.

Maybe it would help to organize a retrospective, not for specific issues, but for the whole set of issues. A space to think about how we handled things overall.

A little bit of collective recognition seems like it would help.

Coda: What we did afterwards

We did end up having a retrospective session to discuss these issues, which was helpful. We were able to identify certain common root causes among our issues.

I came away thinking: I would do more to call attention to incident fatigue sooner, if we ever get there again.

I would do more to ensure people get enough rest and don’t overwork.

