Sometimes it seems like the closer you look at a running system, the more problems you find with it.

It can feel like the very act of looking creates weird anomalies that wouldn’t have been there if you hadn’t looked.

Of course, they were there already. You just weren’t noticing them.

This seems to be true across many domains of life.

Anomalies in a human body

Indulge me in a personal story. I often have funny results when I get my bloodwork done at my primary care clinic. It’s common in my family to have a slightly low platelet count, and other values somewhat outside the norms.

One year, my primary care provider got worried about my lab results and sent me to a hematologist.

The hematologist was intrigued by my case. Specialists love mysteries. He ran a lot of blood tests. He started to speculate about numerous obscure syndromes that I could hypothetically have had. He found nothing with the easy tests, so he moved on to the complicated, obscure tests. Some of them involved PCR gene sequencing. Before I knew it, he had ordered more than $30,000 of medical tests.

He never found any actual disease or any other problem that needed treatment. He sent me home with instructions to take more Vitamin B12 and come back again to be re-tested later.

What I have taken from this is: If you look closely enough at something as complicated as a human body, you will usually find anomalies. And these anomalies don’t necessarily represent symptoms or illnesses. They might just… exist.

Eventually, doctors started to use the word “idiopathic” to describe my case. Idiopathic just means “we don’t know why it happens, but no need to look into it further.”

Medical analysis, which is all encoded in software these days, typically relies on population-level benchmarks for anomaly detection, like “platelet count below 120 is not normal.” But in fact, the same reporting thresholds don’t always produce useful data for everyone. Context-free rules frequently produce false positives.

Beware of letting specialists go down rabbit holes that they can bill you for.

Anomalies in a building

The same effect applies in other areas too. Enough about bodies. Let’s talk about houses.

We live in a relatively new house; it is about ten years old. If I look at it from a distance, it seems to be in pretty good shape. When people visit, they often say it looks good.

But if I look at it more closely, I start to see lots of issues.

The paint is peeling on a drawer. There is a smudge on the doorframe where our kids frequently touch. There is a small crack between a concrete surface and a brick surface. Little vines are climbing up the foundation. The drainage pipe in the yard looks clogged by leaves. Dust accumulates on top of the light fixtures where you can’t see it. The caulk in the bathtub needs fixing. A post is starting to get rotted out where it gets wet. The windows need washing, even where it’s very hard to reach them.

Some of these are easy maintenance issues, once they are noticed. I go around the foundation a few times a year and remove the vines.

Others, I have no idea how to fix, since I am not a carpenter. The drainage pipe is a conundrum.

Other issues I don’t even know how to detect by myself, even if I try. That rotted post was detected by a professional; we missed it, even though it was potentially a structural problem.

When I was younger, I naively imagined that houses mostly just keep existing once they are built, as long as the roof is intact to keep the rain out. Now I realize that all structures are constantly changing in their environment, and they always require attention. Permanently.

Nevertheless, some of the issues, like the smudged doorframe and the dusty light fixtures, are very, very minor. You can drive yourself crazy if you try to fix them all.

As with the human body, the issues show up as soon as you look for them. But when you aren’t looking, they seem to vanish again. Some anomalies are safe to ignore; others not so much.

Anomalies in software systems

Now back to software.

There’s a kind of iterative cycle in observing our systems at work. It goes like this:

  1. Due to a problem or crisis, we start looking closely at our logs and stats.
  2. While looking at the logs, numerous other issues jump out at us. These issues were usually there for a while, but no one noticed.
  3. Some of the new issues turn out to be minor or just noise. Others require fixing.
  4. We create some new alerts or metrics to filter for whatever signal we want to keep track of next time.
  5. We stop looking at the raw logs.
  6. New issues gradually appear in the logs, but no one notices them if they don’t hit the alert thresholds.
  7. The whole cycle repeats.

One time during an incident call, we noticed an absolute flood of errors in our web request logs that had been going on for a long while. It turned out that some load balancer healthchecks were failing in a loop. This never caused any real problem, but it spammed us with errors.

Alert rules are not a solution to this general problem. In audio-processing terms, an alert rule is a bandpass filter applied to a signal. Unfortunately, every filter, and therefore every alert rule, has a finite signal-to-noise ratio: it will pass some noise and block some signal. Especially for simple threshold-based alerts, there is a tradeoff between too little filtering (you spam yourself with noise) and too much filtering (you overlook real issues until they become incidents). Setting alert thresholds is informed guesswork (the polite phrase is “an empirical problem”).
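To make that tradeoff concrete, here is a toy sketch (the function, series, and numbers are hypothetical, not from any real system): the same error-rate series run through a low threshold and a high one.

```python
def alerts(error_rates, threshold):
    """Return the indices (say, minutes) where an alert would fire."""
    return [i for i, rate in enumerate(error_rates) if rate > threshold]

# Mostly background noise (0-4 errors/min), one real incident at minutes 6-8.
series = [2, 4, 1, 3, 4, 2, 40, 55, 47, 3, 1, 4]

print(alerts(series, threshold=3))   # fires on the incident, but also on noise
print(alerts(series, threshold=30))  # quiet until real trouble, misses slow burns
```

The low threshold catches the incident but also pages you for ordinary noise; the high threshold is quiet, but an incident that only doubled the error rate would slip under it entirely.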

In fact, it requires expertise to even figure out if something is signal or noise. It’s not an easy problem. It requires context and often demands a judgment call about each case: Does this matter?

My current bandaid for this problem is to look at the raw data at random intervals, just to see what I notice. Take random walks through your data. Randomly inspect your house to see what you see.
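One cheap way to take those random walks through a log stream, sketched here as a hypothetical helper (not from the original post), is reservoir sampling: it keeps a uniform random sample of k lines from a stream of unknown length, using memory proportional to k rather than to the size of the log.

```python
import random

def sample_lines(lines, k, seed=None):
    """Reservoir sampling: return k lines chosen uniformly at random
    from an iterable of unknown length, in O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Replace an existing sample with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir

# e.g. sample_lines(open("app.log"), k=20) for a quick skim of a big log
```

The point is not the algorithm so much as the habit: a handful of genuinely random lines will surface things that no dashboard was built to show.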

There’s no real substitute for looking at a log stream and trying to make sense of it. I almost always find something interesting. I also look every so often at dashboards that don’t alert and check for any interesting trends in the graphs (see also reading diagrams).

It’s probably becoming more viable to ask an LLM to watch a log stream and check for anomalies. That would be a huge help if it could be done at scale. But our log streams are large and there can be policy concerns, so I’m not sure we’re ready for this yet. We also don’t know, at this point, how reliable LLM analysis is.

Anyway, what I take from this is: The act of looking closely at something can be dangerous. It can lead you to a lot of new places. Some of them are useless. But you have to figure out how to do it, because otherwise, you only find out about the issues when they do become incidents. It’s better to stay ahead of the risks, if you can.


Posted under: debugging incidents reading