Bug Triage

The number of people who know how to effectively debug and triage problems in a complex software product is upsettingly small.

I don’t know why this is. Debugging has always seemed to me a very simple, straightforward task. Start at the top: figure out if the problem is reliably reproducible. If it is, start eliminating codepaths. It’s basically the Holmes Principle applied – when you’ve eliminated every other explanation as impossible, whatever is left, however improbable, is the explanation.

Of course, some bugs are harder than others – usually, the worst are the ones where you can’t actually describe the conditions under which they happen. But even if we forget about those, I still see people who should know better whose preferred approach seems to be simply flailing at the problem until it goes away.

Computers, and software, are complex things. No one can keep the entirety of a complex software product in their mind at once when debugging. The desire to simplify the problem-solving regimen is understandable, and even correct. But there’s a right way and a wrong way to do it. The right way, as I described it, is to break the problem down into subcomponents and perform tests that either validate or eliminate a hypothesis about each subcomponent. Here are some of the wrong ways.
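Before getting to those, here is a minimal sketch of what eliminating codepaths can look like in practice. It assumes the bug reproduces reliably, that the failing operation can be modeled as an ordered list of steps, and that once the bug appears it stays visible in every later step. The names first_bad_step and is_broken_after, and the step names in the usage example, are hypothetical and made up for illustration; they are not taken from any real product.

```python
from typing import Callable, Sequence


def first_bad_step(
    steps: Sequence[str],
    is_broken_after: Callable[[int], bool],
) -> str:
    """Return the name of the first step after which the bug is observable.

    is_broken_after(n) is a hypothetical probe: it runs the first n steps
    and reports whether the bug can be seen at that point.
    """
    if not steps:
        raise ValueError("nothing to bisect")
    # Step zero of any triage: confirm the problem reproduces at all.
    if not is_broken_after(len(steps)):
        raise ValueError("bug does not reproduce; bisection would be guesswork")

    # Invariant: state is good after `lo` steps and broken after `hi` steps.
    lo, hi = 0, len(steps)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken_after(mid):
            hi = mid  # hypothesis "bug is in a later step" eliminated
        else:
            lo = mid  # hypothesis "bug is in an earlier step" eliminated
    return steps[hi - 1]


# Usage: pretend the bug is introduced by the third step, "transform".
pipeline = ["parse", "validate", "transform", "serialize"]
culprit = first_bad_step(pipeline, lambda n: n >= 3)
print(culprit)  # prints "transform"
```

Each probe throws away half of the remaining explanations, so even a long pipeline needs only a handful of tests. The point is not the binary search itself but that every test is chosen to rule something out.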

There’s an old saying: “If architects built buildings the way programmers wrote programs, the first woodpecker that came along would destroy civilization.” I’d suggest a corollary: “If doctors diagnosed patients the way programmers debug programs, no one would ever risk going into a hospital.”

I should also admit that I’m blurring the line between “triage,” which is coming up with a rough idea of what component of a software system is failing, and “debugging,” or fixing the bugs. One is really a special case of, or a part of, the other. I’d argue that getting triage wrong is worse than getting a bug fix wrong, because a misclassified bug wastes the time of everyone it gets routed to before it reaches the right person.

So I’ve given you some sarcastic ranting about how not to triage problems. How about some constructive suggestions? I’ve already talked about trying to ensure that you can reproduce problems, but that won’t be possible for every bug. What else can we do?