A programming classic
There’s a classic programmer joke - the stages of debugging:
- That can’t happen
- That doesn’t happen on my machine
- That shouldn’t happen
- Why does that happen
- Oh, I see…
- How did that ever work?
It’s funny because it’s true.
But why do we laugh at this? It’s a pretty terrible state of affairs.
There’s a lot to unpack in this joke.
“That can’t happen”
First off, why is our immediate reaction to deny the very existence of the bug? It’s unlikely that someone has gone to the effort of cooking up an elaborate lie to waste our time looking for a non-existent bug.
Bugs are a violation of expectation: someone expected the system to behave in a certain way and it didn’t.
From a developer’s point of view, our expectations have also been violated. We told the computer to do one thing, and it has decided to do something completely different.
This leads to the classic - “there must be a bug in the compiler” or “must be user error” and the equally popular blame the tools: “it’s because we’re using XYZ language or framework - everyone knows it’s buggy/broken”.
Developers take a lot of pride in their work - we’re generally compensated well because we are considered to be experts in our field - suddenly we’re exposed as being just as fallible as the next person.
Obviously, “That can’t happen” is a foolish response. A computer just does what it is told to do. It is not an evil, mischievous imp that is deliberately trying to sabotage our work.
We need to change our immediate response to one of acceptance - there’s a bug, no point in pretending it doesn’t exist.
“That doesn’t happen on my machine”
What kind of a developer just chucks code over the wall without testing it? Of course it works on my machine!
Well, what can we say about this one? We could just lump this in with the denial of the bug’s existence, but it’s worth breaking it out into its own discussion.
In complex systems this is a real possibility: code that works in isolation may not work when deployed. Interactions between different parts of a system can cause the behaviour to change in unexpected ways.
However, in a well-architected system this should be rare, and if it’s really happening then it’s a sign that something is wrong.
There’s no point saying “it works on my machine” until you’ve actually gone and tried to reproduce the bug on your machine.
Once you can prove categorically that it works on your machine then you can add that to the evidence pile for debugging the problem.
“That shouldn’t happen”
“Umm, yes, that’s why I’ve reported it as a bug” would be the facetious reply.
But this is another facet of “I told the computer to do this, but instead it’s doing that”. It’s a violation of expectations on the developer’s side of things.
This is usually the phase of acceptance. We’ve now reached the point where we agree that something is wrong: we’ve seen the bug with our own eyes, it’s something wrong with what we’ve done, and there are no more excuses to hide behind.
“Why does that happen?”
This is where things start to get interesting. This is the fun part of bug bashing.
Why is this bug happening? What’s our hypothesis for what we are seeing and how do we test it?
“Oh, I see”
The lightbulb moment of insight. Through investigation you’ve developed a hypothesis of why the bug is happening and you have an idea of how to fix it.
The problem now becomes impossible not to see. It’s obvious. How did this code ever ship? Which moves us nicely on to the next stage.
“How did that ever work?”
Hindsight is a wonderful thing.
Now that you know how to create the bug, and you know how the code is wrong, you’re wondering which idiot wrote it (spoiler alert - git blame will point the finger at you).
The code could never have worked properly. You’ll start to wonder how many other bits of the codebase are complete nonsense.
Adjusting our approach
Let’s turn the programming classic on its head and rewrite the stages of debugging:
- This is happening
- Research
- Create a hypothesis
- Test hypothesis
- Fix the problem
- How do we stop this happening again?
“This is happening”
No point denying it - there’s a bug, I’m glad you found it.
Research
We need to gather information on the bug:
- how do we reproduce it?
- what test data do we need?
- how much of the system do we need to run to recreate it?
- what’s the minimum I need to recreate to debug it?
- which part of the codebase is it happening in?
- which bit of code is the likely problem?
- do we have any relevant logs from when the problem occurred?
The more information we can gather the better.
Create a hypothesis
Our research should have pointed us at the potential problem, and we should have developed enough knowledge to form a working hypothesis on what is causing the bug. We should hopefully be looking at the bit of code that is wrong and have an idea of how to fix it.
Test hypothesis
How are you going to test that your fix works?
Before jumping in and changing code, can you definitely recreate the bug? Does it happen consistently in your test environment?
Can you write a unit test to recreate the bug?
When you apply your fix, does the test now pass? When you run through the steps to recreate the bug, does it now consistently work?
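As a rough sketch of what that can look like, here is a small Python example using the standard unittest module. The function average_order_value and the empty-list bug are made up for illustration. The idea is simply that the first test fails before the fix and passes after it, while the second test checks we haven’t broken the case that already worked.

```python
import unittest


def average_order_value(order_totals):
    """Return the mean of the order totals, or 0.0 when there are no orders.

    Hypothetical example: imagine the buggy version divided by
    len(order_totals) unconditionally, so it raised ZeroDivisionError
    for a customer with no orders.
    """
    if not order_totals:  # the fix: handle the empty case explicitly
        return 0.0
    return sum(order_totals) / len(order_totals)


class AverageOrderValueRegressionTest(unittest.TestCase):
    def test_no_orders_returns_zero(self):
        # Recreates the reported bug: this raised ZeroDivisionError before the fix.
        self.assertEqual(average_order_value([]), 0.0)

    def test_existing_behaviour_is_unchanged(self):
        # Guards against the fix breaking the case that already worked.
        self.assertEqual(average_order_value([10.0, 20.0, 30.0]), 20.0)
```

Saved to a file, this can be run with python -m unittest. Keeping the test around after the fix means the same bug can’t quietly come back.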
Fix the problem
If we’re lucky, the previous step proves that our thinking was correct: we’ve changed the code and everything works.
Clean up any debugging code, go through code review and deploy - everyone is happy!
Don’t forget to check that you’ve not broken anything else…
The bug is fixed when the person who raised the bug in the first place is happy.
How do we stop this happening again?
This is the real value in finding and fixing bugs. The bug should never have happened in the first place.
- Are we missing unit tests for this part of the codebase?
- Have we missed a whole class of unit tests across the codebase that make this kind of bug more likely?
- Are we missing integration tests?
- Do we have automated tests to catch these bugs?
- Is there something wrong in our process that allowed this bug to slip through the net?
Types of Bugs
What kind of bugs do we encounter? And how do we fix them?
Easy(?) Bugs
There’s a set of bugs that can be classed as “easy(?)”. There’s a question mark next to the “easy” because a bug being obvious or repeatable does not necessarily mean that finding the underlying cause and fixing it is easy.
- UI Bugs
- Bugs of Omission or Misinterpretation
- Repeatable bugs
UI Bugs
It functions but it doesn’t look right.
UI bugs tend to revolve around the styling and positioning of elements.
Well-organised companies will have wireframes and high-fidelity mockups that you should be working from. They should have style guides and component libraries that tell you how things should look and behave.
Sometimes we are working in the dark. There may not have been time or resources to design wireframes and mockups, and you may be working from some scribbles on a napkin - make sure you fit with the rest of the application. Don’t break people’s expectations!
Another source of these bugs is different device formats - maybe it’s fine on your large desktop monitor, but on small laptops or mobile devices the UI you’ve created just doesn’t work.
There’s also a class of bugs around accessibility issues - these often get overlooked, and unless attention is paid to this area it’s easy to forget about it, only to have it flagged by a diligent QA person.
Solving these bugs should be straightforward:
- What is it supposed to look like?
- Make it look right
- Test on the correct target devices and sizes
There may be some fundamental process issues to be addressed here - someone knows what it should look like as they have raised the bug.
Why didn’t you know what it was supposed to look like when you built it?
Bugs of Omission or Misinterpretation
You thought you’d built the right thing.
You didn’t…
In theory, this should be an easy one to fix - find out what was supposed to be built, build it…
There are some questions to be asked around what went wrong in this situation - was the task not specified in enough detail, is there a communication gap between the product managers and the dev team that leads to the wrong thing being built?
Or did you just fundamentally misunderstand what was being asked of you?
Sometimes it’s simply a case of trying to hit a moving target. By the time you’ve finished building something everyone’s understanding of what should be built has changed. Expectations have changed and someone forgot to tell you…
Something is broken in your process - it’s important to work out what it is if this class of bug keeps occurring.
Repeatable bugs
Every time I do these steps, this thing happens. It’s not what I expect to happen; it should do this instead.
This is a nice class of bugs - repeatable with a clear set of steps to recreate the problem.
Should be an easy fix:
- Look at the application logs whilst recreating the bug
- Inspect any relevant crash logs and stack traces
- Run through the steps with a debugger attached and have it break on exceptions
- Simply walk through the code and sanity check it - does it make sense?
However, for new developers or people unfamiliar with the codebase, these can also be extremely frustrating bugs.
I can happily recreate the bug, it breaks on my machine, but I have no idea where to even start looking in the codebase to fix it.
A senior developer strolls over, takes one look at the bug and immediately brings up the line of code that is the problem.
Someone who knows the codebase intimately will probably know where most data in the system is coming from and will appear to have some magical power for identifying where a bug is.
This is why fixing a few simple bugs can be such a good onboarding process.
What can we do if we don’t know the codebase?
We’ll need to start employing our powers of detection and deduction.
Look at the architecture of the system. How does data flow from one place to another? Where are good places to inspect the current state of the system? Where does data get transformed? These are all good places to start tracking down bugs.
We can work backwards from the UI: search for a string that is near the problem area and, hopefully, the bit of code showing the value will be nearby. Now work backwards from where the value is displayed to where it is generated.
Keep working backwards, sanity checking the code as you go. You should eventually work your way to the bug or at least a place where you can start debugging the code.
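Most editors have a “find in files” feature for this, but as a minimal sketch of the idea, here is a small Python helper that walks a source tree looking for a string seen in the UI. The function name find_string, the file extensions and the example string are all assumptions for illustration - adjust them for your own codebase.

```python
from pathlib import Path


def find_string(root, needle, suffixes=(".py", ".html", ".js")):
    """Yield (path, line_number, line) for each line under root containing needle."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, line_number, line.strip()


def print_matches(root, needle):
    """Print each match as path:line so it can be opened directly in an editor."""
    for path, line_number, line in find_string(root, needle):
        print(f"{path}:{line_number}: {line}")
```

Calling print_matches("src", "Total due") lists every place the label appears. Each hit is a starting point for working backwards to where the value is generated.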
Hard Bugs
Now we start getting onto the more difficult class of bugs:
- User/Data Specific
- Heisenbugs/Rare/Weird Bugs
User/Data Specific
One user or one subset of users has a problem and everyone else is ok. You can’t recreate it locally and you can’t recreate it with any of your test user accounts.
This is a really nasty kind of bug. Sometimes it can help to have the user demonstrate exactly what they are doing - there may be some subtleties in what they do that aren’t captured in the steps to reproduce.
On some systems you may be able to get permission to login as the user.
What is specific about their environment? You need to try and recreate the exact user environment to reproduce the issue.
Remote logging can be a life saver in these situations. Pull the logs for when the bug happens - is there anything out of the ordinary?
If you have good logging in place then you should be able to see deviations from the happy path.
You need to become a detective. What is special about this user that makes them different from the other users?
Bring in other people to help create and test different hypotheses about what could be causing the issue.
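One cheap way to generate hypotheses is to compare the affected user’s data with that of a user for whom everything works. Here is a minimal Python sketch, assuming you can export both accounts as plain dictionaries. The field names below are invented for illustration.

```python
def diff_records(working, failing):
    """Return {field: (working_value, failing_value)} for every field that differs."""
    missing = object()  # sentinel for fields present on only one side
    differences = {}
    for field in sorted(set(working) | set(failing)):
        left = working.get(field, missing)
        right = failing.get(field, missing)
        if left != right:
            differences[field] = (
                "<missing>" if left is missing else left,
                "<missing>" if right is missing else right,
            )
    return differences


# Hypothetical account data: one user who is fine, one who sees the bug.
working_user = {"locale": "en-GB", "plan": "pro", "timezone": "UTC"}
failing_user = {"locale": "en-GB", "plan": "pro", "timezone": None, "legacy_id": 42}

for field, (good, bad) in diff_records(working_user, failing_user).items():
    print(f"{field}: working={good!r} failing={bad!r}")
```

Each difference is a candidate hypothesis, not an answer: here we would want to test whether the empty timezone or the extra legacy_id field is what triggers the problem.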
Heisenbugs/Rare/Weird Bugs
- When you try and debug it, it stops happening
- Only happens in release mode
- Time/Date specific
- Network state specific
- Race conditions/threading issues
- Memory corruption/runaway pointers
You’re now into deep detective work. You may need to run soak tests for days on end to recreate the problem.
You’ll need to add detailed logging - but you might find adding the logging moves the problem, or even worse, the problem disappears when you turn on logging!
There’s only one way to track these down and that’s to apply brain power to the problem. You’ll need to keep generating and eliminating hypotheses until you hit upon the correct one.
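Time and date specific bugs are one case where a hypothesis can often be tested without waiting for the calendar to come round again. As a sketch, assuming a made-up function that works out “the same day next month”, passing the date in as a parameter (rather than calling date.today() inside the function) lets us pin the exact day the bug was reported.

```python
import calendar
from datetime import date


def same_day_next_month(today):
    """Return the date one month after today, clamped to the month's last day.

    A naive today.replace(month=today.month + 1) raises ValueError on
    31 January (there is no 31 February) and on any date in December.
    """
    year = today.year + (today.month // 12)  # roll over into the next year after December
    month = today.month % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return today.replace(year=year, month=month, day=min(today.day, last_day))


# Because the date is passed in, the awkward days can be recreated on demand.
print(same_day_next_month(date(2023, 1, 31)))   # 2023-02-28
print(same_day_next_month(date(2023, 12, 15)))  # 2024-01-15
```

The same trick applies to other hidden inputs such as random seeds or network responses: if they can be passed in, they can be controlled in a test.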
The tools of the trade
What tools do we have at our disposal?
Logging
I cannot emphasise enough how important good logging is to debugging. Good logging should show the happy path through the code and any errors that can occur. The beauty of this is that when something is going wrong you should be able to see when the code deviates from the happy path. Why does it suddenly stop half way through processing this request?
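As a minimal sketch of what this can look like, here is an example using Python’s standard logging module. The process_order function and the shape of the order dictionary are hypothetical. The point is that every step of the happy path leaves a line in the log, so a request that stops half way is easy to spot.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("orders")


def process_order(order):
    """Process a (hypothetical) order dict, logging each step of the happy path."""
    logger.info("order %s: received", order["id"])
    if not order["items"]:
        # Deviation from the happy path: logged as an error with the reason.
        logger.error("order %s: rejected, no items", order["id"])
        return False
    total = sum(item["price"] * item["quantity"] for item in order["items"])
    logger.info("order %s: priced at %.2f", order["id"], total)
    logger.info("order %s: completed", order["id"])
    return True
```

If the log for a request shows “received” but never “completed”, you know roughly where to start looking. Including an identifier such as the order id in every message makes it possible to follow one request through a busy log.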
The debugger
A lot of people just don’t seem to know how to use the debugger for their language!
This is one of the nuclear weapons in our arsenal - learn how to use it.
If anyone tells you that it’s not possible to use a debugger for your particular language - don’t believe them! Check for yourself and learn how to use the tools at your disposal.
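To take Python as one example, a debugger called pdb ships in the standard library. The sketch below shows one way to “break on exceptions”: it wraps a call and, if an environment variable is set, opens the debugger at the point of failure. The helper name and the DEBUG_ON_ERROR variable are assumptions made up for this example, not a standard convention.

```python
import os
import pdb
import sys


def run_with_post_mortem(func, *args, **kwargs):
    """Call func; if it raises and DEBUG_ON_ERROR=1 is set, open pdb at the failure."""
    try:
        return func(*args, **kwargs)
    except Exception:
        if os.environ.get("DEBUG_ON_ERROR") == "1":
            # Opens an interactive pdb session in the frame that raised,
            # so local variables can be inspected at the moment of failure.
            pdb.post_mortem(sys.exc_info()[2])
        raise  # always re-raise so normal error handling still applies
```

For quick investigations, Python’s built-in breakpoint() function pauses execution at a chosen line, and python -m pdb script.py runs a whole script under the debugger. Most other languages have an equivalent.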
Our brains
Computers are only doing what we tell them to do. Debugging is simply the art of tracking down where the instructions we’ve given them are incorrect.
We can reason about bugs: we can hypothesise about why the system is behaving in a certain way, and then test the hypothesis.
You have everything you need at your disposal. Take a step back from the coal face and the solution will generally present itself.