Once upon a time, in a galaxy far away, we had a week where things just kept breaking.
Some of the broken things were in production environments. Some were in engineering environments.
Some of them were incidents. Some of them were merely… issues.
The broken things looked roughly like this:
- Request timeouts being hit in critical flows where we had never expected to hit them.
- Health checks failing because of seemingly unrelated configuration changes.
- Mysterious CPU spikes in certain database instances. (Well, they were mysterious before we knew the explanation, anyway.)
- Background jobs not running as scheduled.
- Networking problems caused by configuration problems caused ultimately by cross-team misalignments.
- A series of other awkward things that we can’t even get into here.
The details aren’t what matters. The point is that there were a lot of technical issues in a short span of time.
This was exacerbated by a series of other factors:
- Nontechnical stakeholders were clamoring for answers.
- Key people were out of the office.
- We were dealing with really complicated causal chains. The symptoms were far removed from their causes, and every issue required tracing a long, tangled chain of effects.
- There are several different logging systems, and not all requests are tracked in all of them.
- We faced jurisdictional ambiguities about which team owns what.
- We were spread out across so many different time zones.
- We have so many different Slack channels.
I started to think that our incident response strategies were fundamentally built to handle one issue at a time.
So: What do you do when there are more incidents than there are people on call?
What we did: Improvisation
We did not have an explicit plan for this situation. Our process basically assumes that incidents will arrive one at a time. If only incidents were so well behaved.
So we improvised. We entered a zone of undefined organizational behavior.
We were a group of experienced software engineers, relatively committed to our work. We leaned on each other for support. People who weren't on call got pulled in.
Soon there were people losing sleep, staying up late out of concern, and just getting worn out.
Later, things did start to calm down.
What would I do, next time?
What can one do to make it better?
I think it starts with recognizing that incident fatigue is just its own thing, categorically distinct from other kinds of problems. It is a particular state one can, unfortunately, get into.
Of course, tech business culture gives us some general strategies about what to do.
- Strategy 1: Prioritization. When there are a lot of issues, rank them by urgency and handle only the most critical ones first. (Problem: Too many issues are critical, and/or their criticality is hard to assess. Eventually, major issues are left unfixed.)
- Strategy 2: Deflection. When there are a lot of issues, try to send them elsewhere and see who else can handle them. (Problem: Dealing with organizational jurisdiction issues can take as much time and effort as just fixing the issue yourself, and the latter actually delivers some practical value. Problem 2: Over time, too much deflection can erode the sense of commitment we call "ownership" of technical systems.)
- Strategy 3: Self-care. A very American cultural theory, it seems to me: the idea that you should probably do something extra nice for yourself to make up for the extra stress. (Problem: It doesn't really solve anything; it just distracts.)
I’m not against any of these strategies, in general terms, but I’m not sure they help enough. I keep thinking that there’s more we could do.
Maybe there could be some other division of labor, where we move beyond the “first responder” model, and try to solve issues more directly as a team.
Maybe there could be some way to acknowledge the fatigue and get some extra rest.
Maybe it would help to organize a retrospective, not for specific issues, but for the whole set of issues. A space to think about how we handled things overall.
A little bit of collective recognition seems like it would help.
Coda: What we did afterwards
We did end up having a retrospective session to discuss these issues, which was helpful. We were able to identify some common root causes across them.
I came away thinking: if we ever get there again, I would call attention to incident fatigue much sooner.
I would do more to ensure people get enough rest and don’t overwork.