Bug Triage

On September 29, 2004, in Computers, by peterb

The number of people who know how to debug and triage problems effectively in a complex software product is upsettingly small.

I don’t know why this is. Debugging has always seemed to me a very simple, straightforward task. Start at the top: figure out if the problem is reliably reproducible. If it is, start eliminating codepaths. It’s basically the Holmes Principle applied — when you’ve eliminated every other explanation as impossible, whatever is left, however improbable, is the explanation. Of course there are some bugs that are harder than others — usually, the worst are the ones where you can’t actually describe the conditions under which it happens. But even if we forget about those, I still see people who should know better whose preferred approach seems to be simply flailing at the problem until it goes away.
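
The "reproduce it, then eliminate" loop can be sketched in a few lines of Python. This is only an illustration under some assumptions: `repro` and `is_bad` are hypothetical callables you would write for your own product, the candidates (builds, revisions, config flags) are ordered, the first is known good, the last is known bad, and once the bug appears it stays.

```python
from typing import Callable, Sequence


def failure_rate(repro: Callable[[], bool], attempts: int = 20) -> float:
    """Run a reproduction attempt several times; return the fraction that failed.

    `repro` is a hypothetical callable returning True when the bug shows up.
    """
    failures = sum(1 for _ in range(attempts) if repro())
    return failures / attempts


def first_bad(candidates: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an ordered sequence for the first candidate showing the bug.

    Assumes candidates[0] is known good and candidates[-1] is known bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # bug already present here: look earlier
        else:
            lo = mid  # still good here: look later
    return candidates[hi]


# Illustrative data only: pretend the bug was introduced in "r4".
revisions = ["r1", "r2", "r3", "r4", "r5", "r6"]
culprit = first_bad(revisions, lambda rev: int(rev[1:]) >= 4)  # "r4"
```

Each test halves the set of suspects, which is the whole point: every step either validates or eliminates something. If `failure_rate` comes back somewhere between 0 and 1, the bug is intermittent and a single run of `is_bad` can't be trusted, so repeat each probe before believing it.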

Computers, and software, are complex things. No one can keep the entirety of a complex software product in their mind at once when debugging. The desire to simplify the problem solving regimen is understandable, and even correct. But there’s a right way and a wrong way to do it. The right way, as I described it, is to break the problem down into subcomponents and perform tests that either validate or eliminate a hypothesis about that subcomponent. Here are some of the wrong ways.

  • Magical Thinking – This is something the Mac community had a lock on for many many years. Every time someone asks me “Did you zap your PRAM?” I want to punch them in the face. “Zapping the PRAM” was recommended as the solution to every problem from system lockups, to color profile mismatches, heartbreak, psoriasis, and as a substitute for Apple developing an operating system with protected memory and virtual addressing. You get magical thinking in the Windows world as well, of course — “Defrag your hard disk!” say people who are diagnosing a problem while using NTFS. “Run Norton!” “Scan for viruses!” Why do they say these things? Well, they worked once. Maybe that’ll work again!
  • Denial/Blame the Victim – Yes, naive Slashdot addicts and new Linux users, I’m looking at you. “That never happens to me. You must be doing something wrong,” is not, in fact, a useful thing to say. Ever. Inevitably, if these people are or become developers, they’re the guys who write code that doesn’t check for errors because “that error is really unlikely.”
  • Broken Logic – “Aristotle and Plato are both Men. A Man killed Sophocles. Aristotle did not kill Sophocles. Therefore, Plato killed Sophocles.” You’d think you wouldn’t see this sort of logic in the real world. But you’d be wrong. The clearest cases of this are when you see someone jump to a conclusion after eliminating a single possibility. What about the other possibilities? It’s one thing to pick a hypothesis to test next, and it’s quite another to declare “Through the process of elimination, the bug is here” when you haven’t actually eliminated the other likely candidates.
  • Punish Good Deeds – Figure out which person on your project is the best at triaging bugs. Declare that the bug is in their code. Then they have to debug the problem for you in order to prove that it’s not.
  • Print statements everywhere! – Ok, really, we’ve all done this at some time or another, but it should be a last resort. Particularly in a multithreaded system, anything you do that might change the timing can make debugging harder, not easier.
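
If you do need a trace in threaded code, one way to perturb timing less than a print statement is to append events to a bounded in-memory buffer and only format them after the fact. The sketch below is an assumption-laden illustration, not a recommendation of a specific tool: `TraceBuffer` is a made-up name, and it leans on the documented thread-safety of `collections.deque.append`. It reduces the disturbance; it does not remove it.

```python
import threading
import time
from collections import deque


class TraceBuffer:
    """Bounded in-memory event log (hypothetical helper, not a library API)."""

    def __init__(self, capacity: int = 1024):
        # A deque with maxlen silently drops the oldest events when full.
        self._events = deque(maxlen=capacity)

    def record(self, tag, value=None):
        # No I/O, no formatting, no explicit lock: just one append.
        self._events.append(
            (time.perf_counter_ns(), threading.get_ident(), tag, value)
        )

    def dump(self):
        # Call this after the worker threads have stopped.
        return sorted(self._events, key=lambda event: event[0])


trace = TraceBuffer(capacity=100)


def worker(name):
    for step in range(10):
        trace.record(name, step)


threads = [threading.Thread(target=worker, args=(f"w{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for timestamp, thread_id, tag, value in trace.dump():
    print(timestamp, thread_id, tag, value)
```

The printing still happens, but only once the interesting part of the run is over, so it can no longer reorder the threads you are trying to observe.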

If you ever go to the hospital for a serious condition, you’ll see proper problem solving in action. A succession of medical students, residents, and interns will march into your room and start asking you questions. Typically, they will ask the same questions, in the same order, until you are ready to scream from the monotony. This is because they are running down the diagnostic evaluation, and each answer is helping them (hopefully) eliminate some possibilities from the universe of things that might be wrong with you. As software developers, we don’t have the collective knowledge that the medical community has about the human body, but we can at least aspire to their commitment to proceed logically.

There’s an old saying: “If architects built buildings the way programmers wrote programs, the first woodpecker that came along would destroy civilization.” I’d suggest a corollary: “If doctors diagnosed patients the way programmers debug programs, no one would ever risk going into a hospital.”

I should also admit that I’m blurring the line between “triage,” which is coming up with a rough idea of which component of a software system is failing, and “debugging,” or fixing the bugs. One is really a special case of, or a part of, the other. I’d argue that getting triage wrong is worse than getting a bug fix wrong, because of the inherently time-wasting nature of misclassified bugs.

So I’ve given you some sarcastic ranting about how not to triage problems. How about some constructive suggestions? I’ve already talked about trying to ensure that you can reproduce problems, but that won’t be possible for every bug. What else can we do?

  • Always have a hypothesis – Hypothesize what is going wrong early, and often. Don’t get attached to your hypotheses. Expect to be wrong most of the time, at least early on. A good hypothesis is one that you can construct a test to disprove.
  • Triage bugs to your limit – There’s nothing wrong with talking to other people about a bug, or asking for help. But when a bug comes your way, you should not hand it off to someone else until you can give a narrative of what you think is happening, and why you can’t trace it any further. It’s all just code. Even if it’s code you’re not familiar with, most problems you encounter will be of a pattern you’ve seen before. And if you find a type of error you haven’t seen before, congratulations! You’re learning something new.
  • Write everything down – Keep a running explanation in your bug tracking system of what tests you’re trying, what machines you’re using, what your results are, theories, conversations, etc. If the bug is handed off to someone else, they’ll appreciate it, and even if it isn’t you’ll be able to search that database later when the exact same bug crops up in a different place.
  • Use the best tools – Learn how to use every tool at your disposal, and master it. Debuggers, profilers, and fault injection are your friends.
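
To make "a hypothesis you can construct a test to disprove" and "fault injection" concrete, here is a minimal sketch. Everything in it is hypothetical: `FailOnCall` is an illustrative wrapper, and `save_all` stands in for whatever code you suspect. The hypothesis being tested is "a failed write in the middle of a batch is reported and does not stop the rest of the batch."

```python
class FailOnCall:
    """Wrap a callable so that its Nth call raises a chosen exception."""

    def __init__(self, func, fail_on, error):
        self.func = func
        self.fail_on = fail_on  # 1-based index of the call that should fail
        self.error = error
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls == self.fail_on:
            raise self.error  # the injected fault
        return self.func(*args, **kwargs)


def save_all(records, write):
    """Hypothetical code under test: write each record, collect the failures."""
    failed = []
    for record in records:
        try:
            write(record)
        except OSError:
            failed.append(record)
    return failed


# Inject a "disk full" error on the second write and see what happens.
written = []
flaky_write = FailOnCall(written.append, fail_on=2, error=OSError("disk full"))
failed = save_all(["a", "b", "c"], flaky_write)

print("written:", written)
print("failed:", failed)
```

If the hypothesis is wrong, the run shows it immediately: either "c" never gets written or "b" goes missing from the failure list. That is also a cheap way to exercise the error paths that someone decided were "really unlikely."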

I hope in some small way this little rant has proven helpful. Good luck. Now go squash some bugs.

 

8 Responses to “Bug Triage”

  1. You realize that NTFS *still* needs defragmenting sometimes, right? So that magic sometimes works…

    Another wrong way: use GDB.

    Another right way: start explaining the bug to someone who knows nothing whatsoever about the code, and ideally nothing about programming at all. Half the time you’ll blather on for ten minutes, and then stop and slap your head in disgust at missing the blindingly obvious. Spouse/significant others/main squeezes are particularly good for this. Some claim it also works with cats, but I’ve never got one to feign interest for long enough.

  2. Alex Groce says:

    Print statements in non multithreaded code deserve more credit than you’re giving them, especially given the state of the debuggers many of us have to use. In multithreaded code, sure, they’re bad. But in multithreaded code, everything is bad except miraculous luck or Sherlock Holmes reasoning.

  3. Peter said:
    “If architects built buildings the way programmers wrote programs, the first woodpecker that came along would destroy civilization.”

    Yeah, but nobody asks architects to build buildings in 2 weeks from a crayon drawing of a house.

    Jonathan said:
    Another right way: start explaining the bug to someone who knows nothing whatsoever about the code, and ideally nothing about programming at all.

    Amen. Long walks, hot showers, and doing the dishes are also good for bubbling those insights to the top. (Plus, you get cleaner dishes and cleaner programmers.)

    My worst bugs are usually GUI bugs. Specifically, I have yet to find a debugger or debugging approach that makes it significantly easier to deal with Swing doing unexpected things. Automated testing doesn’t apply well to GUIs either, at least not in any affordable package that I’ve found.

  4. Peter said:
    > Apple developing an operating system with
    > protected memory and virtual addressing

    On the bright side, having to reboot every time you write to address 0x000000 will cure you forever of failing to check for null pointers.

    > “If architects built buildings the way programmers wrote programs,
    > the first woodpecker that came along would destroy civilization.”

    Yeah, but nobody asks architects to build buildings in 2 weeks from a crayon drawing of a house. (Stop me before I drag the analogy out further.)

    Jonathan said:
    > Another right way: start explaining the bug
    > to someone who knows nothing whatsoever about
    > the code, and ideally nothing about programming
    > at all.

    Amen. Long walks, hot showers, and doing the dishes are also good for bubbling those insights to the top. (Plus, you get cleaner dishes and cleaner programmers.)

    My worst bugs are usually GUI bugs. Specifically, I have yet to find a debugger or debugging approach that makes it significantly easier to deal with Swing doing unexpected things. Automated testing doesn’t apply well to GUIs either, at least not in any affordable package that I’ve found.

  5. Whoops. Sorry about the double post — I blame my tools, of course.

  6. psu says:

    There are classes of UI bugs that can be tracked using your standard debugger and just stepping the code. But, there are other times when running in the debugger just won’t work because the problem is timing based or a redraw problem.

    For those problems, I generally prefer targeted logging over any debugger built by man.

    What I really want is a general purpose trace system that has no effect on timing and will allow me to replay the exact execution of the app over and over again backwards and forwards.

  7. Eric Tilton says:

    Hey, Peter, I think this one is in your code; can you take a look at it?

    Seriously, though, I also think you’re being too harsh on logging. Having said that, it was only two days ago that I tried to pinpoint a particularly nasty crasher and, et voilà, adding logging made the problem nonreproducible. My luxuriously full head of hair is now a mere shadow of its former glory.

    I like the hospital analogy; we spent a little too much time in hospitals a year ago, and it actually added to my frustration to realize that the medical professionals were just debugging, since it completely demystified the medical aura of authority for me, and I started thinking about them as irritating coders trying to pass off a particularly troublesome problem with stock answers.

  8. > What I really want is a general purpose
    > trace system that has no effect on timing
    > and will allow me to replay the exact
    > execution of the app over and over
    > again backwards and forwards.

    Me Too. Of course, I’d also like a way to observe particles without letting them be affected by the observation…