The number of people who know how to effectively debug and triage problems in a complex software product is upsettingly small.
I don’t know why this is. Debugging has always seemed to me a very simple, straightforward task. Start at the top: figure out if the problem is reliably reproducible. If it is, start eliminating codepaths. It’s basically the Holmes Principle applied: when you’ve eliminated every other explanation as impossible, whatever is left, however improbable, is the explanation. Of course, some bugs are harder than others; usually, the worst are the ones where you can’t actually describe the conditions under which they happen. But even if we forget about those, I still see people who should know better whose preferred approach seems to be simply flailing at the problem until it goes away.
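To make those two steps concrete, here is a minimal sketch, not a prescription. It assumes you can wrap your reproduction in a callable, and that you have an ordered list of suspects (commits, config changes, components) where the bug stays present once it has been introduced. The names `failure_rate`, `first_bad`, `reproduces`, and the `CHANGES` list are all made up for illustration.

```python
from typing import Callable


def failure_rate(attempt: Callable[[], bool], runs: int = 20) -> float:
    """Run the reproduction `runs` times; return the fraction that failed."""
    failures = sum(1 for _ in range(runs) if attempt())
    return failures / runs


def first_bad(count: int, reproduces: Callable[[int], bool]) -> int:
    """Return the 1-based index of the first change that introduces the bug.

    Assumes reproduces(0) is False, reproduces(count) is True, and the bug
    stays present once introduced (the same assumption `git bisect` makes).
    """
    good, bad = 0, count  # invariant: bug absent at `good`, present at `bad`
    while bad - good > 1:
        mid = (good + bad) // 2
        if reproduces(mid):
            bad = mid   # bug already present: the culprit is at or before mid
        else:
            good = mid  # bug not yet present: the culprit is after mid
    return bad


# Hypothetical example: eight suspect changes, the bug arrives with the sixth.
CHANGES = ["a", "b", "c", "d", "e", "f", "g", "h"]
tested = []


def reproduces(n: int) -> bool:
    """Stand-in for 'build with the first n changes and run the repro'."""
    tested.append(n)
    return n >= 6


culprit = first_bad(len(CHANGES), reproduces)
print(f"first bad change: {CHANGES[culprit - 1]} after {len(tested)} tests")
```

Each test either eliminates half of the remaining suspects or confirms the bug is in that half, which is the whole point: every step rules something out. If `failure_rate` comes back well below 1.0, a single passing run proves nothing, and each bisection step needs enough repetitions to be believable.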
Computers, and software, are complex things. No one can keep the entirety of a complex software product in their mind at once when debugging. The desire to simplify the problem-solving regimen is understandable, and even correct. But there’s a right way and a wrong way to do it. The right way, as I described it, is to break the problem down into subcomponents and perform tests that either validate or eliminate a hypothesis about each subcomponent. Here are some of the wrong ways.
- Magical Thinking – This is something the Mac community had a lock on for many, many years. Every time someone asks me “Did you zap your PRAM?” I want to punch them in the face. “Zapping the PRAM” was recommended as the solution to every problem from system lockups to color profile mismatches to heartbreak and psoriasis, and as a substitute for Apple developing an operating system with protected memory and virtual addressing. You get magical thinking in the Windows world as well, of course: “Defrag your hard disk!” say people who are diagnosing a problem while using NTFS. “Run Norton!” “Scan for viruses!” Why do they say these things? Well, they worked once. Maybe they’ll work again!
- Denial/Blame the Victim – Yes, naive Slashdot addicts and new Linux users, I’m looking at you. “That never happens to me. You must be doing something wrong,” is not, in fact, a useful thing to say. Ever. Inevitably, if these people are or become developers, they’re the guys who write code that doesn’t check for errors because “that error is really unlikely.”
- Broken Logic – “Aristotle and Plato are both Men. A Man killed Sophocles. Aristotle did not kill Sophocles. Therefore, Plato killed Sophocles.” You’d think you wouldn’t see this sort of logic in the real world. But you’d be wrong. The clearest cases of this are when someone eliminates one possibility and jumps straight to a conclusion. What about the other possibilities? It’s one thing to pick a hypothesis to test next, and it’s quite another to declare “Through the process of elimination, the bug is here” when you haven’t actually eliminated the other likely candidates.
- Punish Good Deeds – Figure out which person on your project is the best at triaging bugs. Declare that the bug is in their code. Then they have to debug the problem for you in order to prove that it’s not.
- Print statements everywhere! – OK, really, we’ve all done this at one time or another, but it should be a last resort. Particularly in a multithreaded system, anything you do that might change the timing can make debugging harder, not easier.
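
If you do end up instrumenting a multithreaded program, one way to disturb the timing less is to record events in memory and format them only after the run. This is my own illustration, not a cure: it still perturbs timing, just less than formatted I/O on every call. The names `TRACE`, `trace`, `dump_trace`, and `worker` are hypothetical.

```python
import threading
import time
from collections import deque

# Bounded in-memory trace: old entries fall off the front, and nothing is
# formatted or written while the threads under test are running.
TRACE = deque(maxlen=10_000)


def trace(event: str, value: object = None) -> None:
    """Record a timestamped event; deque.append is thread-safe."""
    TRACE.append((time.monotonic(), threading.current_thread().name, event, value))


def dump_trace() -> list:
    """Format the recorded events after the run, in the order they were recorded."""
    return [
        f"{ts:.6f} {thread} {event} {value!r}"
        for ts, thread, event, value in list(TRACE)
    ]


def worker(n: int) -> None:
    """Stand-in for the real work being debugged."""
    trace("start", n)
    trace("done", n * n)


threads = [threading.Thread(target=worker, args=(i,), name=f"w{i}") for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

lines = dump_trace()
print("\n".join(lines))
```

The interleaving of threads in the output will vary from run to run, which is exactly the kind of information a print statement tends to destroy.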
If you ever go to the hospital for a serious condition, you’ll see proper problem solving in action. A succession of medical students, residents, and interns will march into your room and start asking you questions. Typically, they will ask the same questions, in the same order, until you are ready to scream from the monotony. This is because they are running down the diagnostic evaluation, and each answer is helping them (hopefully) eliminate some possibilities from the universe of things that might be wrong with you. As software developers, we don’t have the collective knowledge that the medical community has about the human body, but we can at least aspire to their commitment to proceed logically.
There’s an old saying: “If architects built buildings the way programmers wrote programs, the first woodpecker that came along would destroy civilization.” I’d suggest a corollary: “If doctors diagnosed patients the way programmers debug programs, no one would ever risk going into a hospital.”
I should also admit that I’m blurring the line between “triage,” which is coming up with a rough idea of which component of a software system is failing, and “debugging,” or fixing the bugs. One is really a special case of, or a part of, the other. I’d argue that getting triage wrong is worse than getting a bug fix wrong, because misclassified bugs are inherently time-wasting.
So I’ve given you some sarcastic ranting about how not to triage problems. How about some constructive suggestions? I’ve already talked about trying to ensure that you can reproduce problems, but that won’t be possible for every bug. What else can we do?
- Always have a hypothesis – Hypothesize about what is going wrong early and often. Don’t get attached to your hypotheses. Expect to be wrong most of the time, at least early on. A good hypothesis is one that you can construct a test to disprove.
- Triage bugs to your limit – There’s nothing wrong with talking to other people about a bug, or asking for help. But when a bug comes your way, you should not hand it off to someone else until you can give a narrative of what you think is happening, and why you can’t trace it any further. It’s all just code. Even if it’s code you’re not familiar with, most problems you encounter will be of a pattern you’ve seen before. And if you find a type of error you haven’t seen before, congratulations! You’re learning something new.
- Write everything down – Keep a running explanation in your bug tracking system of what tests you’re trying, what machines you’re using, what your results are, theories, conversations, etc. If the bug is handed off to someone else, they’ll appreciate it, and even if it isn’t, you’ll be able to search that database later when the exact same bug crops up in a different place.
- Use the best tools – Learn how to use every tool at your disposal, and master it. Debuggers, profilers, and fault injection are your friends.
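
As a small illustration of a disprovable hypothesis combined with fault injection, here is a sketch using Python’s standard `unittest.mock`. The function `read_config` and the path are hypothetical stand-ins for whatever code you suspect. The hypothesis is “this code survives a disk error,” and the test tries to disprove it by forcing the error to happen instead of waiting for it.

```python
import unittest.mock as mock


def read_config(path: str) -> str:
    """Code under test: falls back to a default when the file can't be read."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        return "default"


# Hypothesis: "read_config survives a disk error."
# Inject the error on every call to open() and see whether the claim holds.
with mock.patch("builtins.open", side_effect=OSError("injected I/O failure")):
    result = read_config("/hypothetical/app.conf")

print(result)
```

If `read_config` had no `except` clause, the injected `OSError` would propagate and the hypothesis would be disproved on the spot, which is a lot cheaper than finding out from a customer.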
I hope in some small way this little rant has proven helpful. Good luck. Now go squash some bugs.