A programming classic
There's a classic programmer joke - the stages of debugging:
- That can't happen
- That doesn't happen on my machine
- That shouldn't happen
- Why does that happen
- Oh, I see...
- How did that ever work?
It's funny because it's true.
But why do we laugh at this? It's a pretty terrible state of affairs.
There's a lot to unpack in this joke.
"That can't happen"
First off, why is our immediate reaction to deny the very existence of the bug? It's unlikely that someone has gone to the effort of cooking up an elaborate lie to waste our time looking for a non-existent bug.
Bugs are a violation of expectation: someone expected the system to behave in a certain way and it didn't.
From a developer's point of view, our expectations have also been violated. We told the computer to do one thing, and it has decided to do something completely different.
This leads to the classic "there must be a bug in the compiler" or "must be user error", and the equally popular blaming of the tools: "it's because we're using XYZ language or framework - everyone knows it's buggy/broken".
Developers take a lot of pride in their work. We're generally compensated well because we are considered to be experts in our field, and suddenly we're exposed as being just as fallible as the next person.
Obviously "That can't happen" is a foolish response. A computer just does what it is told to do. It is not an evil, mischievous imp that is deliberately trying to sabotage our work.
We need to change our immediate response to one of acceptance - there's a bug, no point in pretending it doesn't exist.
"That doesn't happen on my machine"
What kind of a developer just chucks code over the wall without testing it? Of course it works on my machine!
Well, what can we say about this one? We could just lump this in with the denial of the bug's existence, but it's worth breaking it out into its own discussion.
In complex systems this is a remote possibility: code that works in isolation may not work when deployed. Interactions between different parts of a system can cause the behaviour to change in unexpected ways.
However, in a well-architected system this should be rare, and if it's really happening then it's a sign that something is wrong.
There's no point saying "it works on my machine" until you've actually gone and tried to reproduce the bug on your machine.
Once you can prove categorically that it works on your machine then you can add that to the evidence pile for debugging the problem.
"That shouldn't happen"
"Umm, yes, that's why I've reported it as a bug" would be the facetious reply.
But this is another facet of "I told the computer to do this, but instead it's doing that". It's a violation of expectations on the developer's side of things.
This is usually the phase of acceptance. We've now reached the point where we agree that something is wrong: we've seen the bug with our own eyes, it's something wrong with what we've done, and there are no more excuses to hide behind.
"Why does that happen?"
This is where things start to get interesting. This is the fun part of bug bashing.
Why is this bug happening? What's our hypothesis for what we are seeing and how do we test it?
"Oh, I see"
The lightbulb moment of insight: through investigation you've developed a hypothesis of why the bug is happening and you have an idea of how to fix it.
The problem now becomes impossible not to see. It's obvious. How did this code ever ship? Which moves us nicely on to the next stage.
"How did that ever work?"
Hindsight is a wonderful thing.
Now that you know how to create the bug, and you know how the code is wrong, you're wondering which idiot wrote it (spoiler alert - git blame will point the finger at you).
The code could never have worked properly. You'll start to wonder how many other bits of the codebase are complete nonsense.
Adjusting our approach
Let's turn the programming classic on its head and rewrite the stages of debugging:
- This is happening
- Research
- Create a hypothesis
- Test hypothesis
- Fix the problem
- How do we stop this happening again?
"This is happening"
No point denying it - there's a bug, I'm glad you found it.
Research
We need to gather information on the bug:
- how do we reproduce it?
- what test data do we need?
- how much of the system do we need to run to recreate it?
- what's the minimum I need to recreate to debug it?
- which part of the codebase is it happening in?
- which bit of code is the likely problem?
- do we have any relevant logs from when the problem occurred?
The more information we can gather the better.
Create a hypothesis
Our research should have pointed us at the potential problem, and we should have developed enough knowledge to form a working hypothesis about what is causing the bug. We should hopefully be looking at the bit of code that is wrong and have an idea of how to fix it.
Test hypothesis
How are you going to test that your fix works?
Before jumping in and changing code can you definitely recreate the bug? Does it happen consistently in your test environment?
Can you write a unit test to recreate the bug?
When you apply your fix does the test now pass? When you run through the steps to recreate does it now consistently work?
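As a minimal sketch of that workflow in Python, assume a hypothetical `average` function that used to crash on an empty list. The function name, the bug, and the decision to return 0.0 for empty input are all assumptions made up for illustration. The first test recreates the bug and fails until the fix is applied; the second checks that the fix has not changed the normal case.

```python
import unittest


def average(values):
    """Return the arithmetic mean of values.

    Hypothetical function used for illustration. The assumed buggy version
    divided by len(values) unconditionally and raised ZeroDivisionError on
    an empty list. Returning 0.0 for empty input is an assumed requirement.
    """
    if not values:
        return 0.0
    return sum(values) / len(values)


class AverageRegressionTest(unittest.TestCase):
    def test_empty_list_does_not_crash(self):
        # Recreates the reported bug: this raised ZeroDivisionError before the fix.
        self.assertEqual(average([]), 0.0)

    def test_existing_behaviour_is_unchanged(self):
        # Guards against the fix breaking the normal case.
        self.assertEqual(average([2, 4, 6]), 4.0)
```

Saved in a test module, this can be run with `python -m unittest`. Keeping the test in the suite afterwards means the same bug cannot quietly come back.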
Fix the problem
If we're lucky the previous step proves that our thinking was correct: we've changed the code and everything works.
Clean up any debugging code, go through code reviews and deploy - everyone is happy!
Don't forget to check that you've not broken anything else...
The bug is fixed when the person who raised the bug in the first place is happy.
How do we stop this happening again?
This is the real value in finding and fixing bugs. The bug should never have happened in the first place.
- Are we missing unit tests for this part of the codebase?
- Have we missed a whole class of unit tests across the codebase that make this kind of bug more likely?
- Are we missing integration tests?
- Do we have automated tests to catch these bugs?
- Is there something wrong in our process that allowed this bug to slip through the net?
Types of Bugs
What kind of bugs do we encounter? And how do we fix them?
Easy(?) Bugs
There's a set of bugs that can be classed as "easy(?)". There's a question mark next to the "easy" because a bug being obvious or repeatable does not necessarily mean that finding the underlying cause and fixing it is easy.
- UI Bugs
- Bugs of Omission or Misinterpretation
- Repeatable bugs
UI Bugs
It functions but it doesn't look right.
UI bugs tend to revolve around the styling and positioning of elements.
Well organised companies will have wireframes and high definition mockups that you should be working from. They should have style guides and component libraries that tell you how things should look and behave.
Sometimes we are working in the dark. There may not have been time or resources to design wireframes and mockups, and you may be working from some scribbles on a napkin - make sure you fit with the rest of the application. Don't break people's expectations!
Another source of these bugs is different device formats - maybe it's fine on your large desktop monitor, but on small laptops or mobile devices the UI you've created just doesn't work.
There's also a class of bugs around accessibility issues. These often get overlooked: unless attention is paid to this area it's easy to forget about it, only to have it flagged by a diligent QA person.
Solving these bugs should be straightforward:
- What is it supposed to look like?
- Make it look right
- Test on the correct target devices and sizes
There may be some fundamental process issues to be addressed here - someone knows what it should look like as they have raised the bug.
Why didn't you know what it was supposed to look like when you built it?
Bugs of Omission or Misinterpretation
You thought you'd built the right thing.
You didn't...
In theory, this should be an easy one to fix - find out what was supposed to be built, build it...
There are some questions to be asked around what went wrong in this situation. Was the task not specified in enough detail? Is there a communication gap between the product managers and the dev team that leads to the wrong thing being built?
Or did you just fundamentally misunderstand what was being asked of you?
Sometimes it's simply a case of trying to hit a moving target. By the time you've finished building something everyone's understanding of what should be built has changed. Expectations have changed and someone forgot to tell you...
Something is broken in your process - it's important to work out what it is if this class of bug keeps occurring.
Repeatable bugs
Every time I do these steps, this thing happens. It's not what I expect to happen; it should do this instead.
This is a nice class of bugs - repeatable with a clear set of steps to recreate the problem.
Should be an easy fix:
- Look at the application logs whilst recreating the bug
- Inspect any relevant crash logs and stack traces
- Run through the steps with a debugger attached and have it break on exceptions
- Simply walk through the code and sanity check it - does it make sense?
However, for new developers or people unfamiliar with the codebase these can also be extremely frustrating bugs.
I can happily recreate the bug, it breaks on my machine, I have no idea where to even start looking in the codebase for where to fix it.
Senior developer strolls over, takes one look at the bug and immediately brings up the line of code that is the problem.
Someone who knows the codebase intimately will probably know where most data in the system is coming from and will appear to have some magical power for identifying where a bug is.
This is why bug fixing a few simple bugs can be such a good onboarding process.
What can we do if we don't know the codebase?
We'll need to start employing our powers of detection and deduction.
Look at the architecture of the system: how does data flow from one place to another? What are good places to inspect the current state of the system? Where does data get transformed? These are all good places to start tracking down bugs.
We can work backwards from the UI: search for a string that is near the problem area and, hopefully, the bit of code showing the value will be nearby. Now work backwards from where the value is displayed to where it is generated.
Keep working backwards sanity checking the code as you go. You should eventually work your way to the bug or at least a place where you can start debugging the code.
Hard Bugs
Now we start getting onto the more difficult class of bugs:
- User/Data Specific
- Heisenbugs/Rare/Weird Bugs
User/Data Specific
One user or one subset of users has a problem; everyone else is OK. You can't recreate it locally and you can't recreate it with any of your test user accounts.
This is a really nasty kind of bug. Sometimes it can help to have the user demonstrate exactly what they are doing - there may be some subtleties about the steps they are doing that aren't captured in the steps to reproduce.
On some systems you may be able to get permission to log in as the user.
What is specific about their environment? You need to try and recreate the exact user environment to reproduce the issue.
Remote logging can be a life saver in these situations. Pull the logs for when the bug happens: is there anything out of the ordinary?
If you have good logging in place then you should be able to see deviations from the happy path.
You need to become a detective. What is special about this user that makes them different from the other users?
Bring in other people to help create and test different hypotheses about what could be causing the issue.
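One cheap way to generate hypotheses is to compare the affected user's data against a user for whom everything works. The Python sketch below assumes each user can be exported as a flat dictionary of settings; the field names and values are hypothetical and only there for illustration.

```python
def diff_records(known_good, affected):
    """Return {key: (good_value, affected_value)} for every key that differs."""
    missing = object()  # sentinel so a key absent on one side counts as a difference
    differences = {}
    for key in sorted(set(known_good) | set(affected)):
        good_value = known_good.get(key, missing)
        bad_value = affected.get(key, missing)
        if good_value != bad_value:
            differences[key] = (
                "<missing>" if good_value is missing else good_value,
                "<missing>" if bad_value is missing else bad_value,
            )
    return differences


# Hypothetical account data, invented purely for illustration.
working_user = {"locale": "en-GB", "plan": "pro", "timezone": "UTC"}
affected_user = {"locale": "tr-TR", "plan": "pro"}

for field, (good, bad) in diff_records(working_user, affected_user).items():
    print(f"{field}: working={good!r} affected={bad!r}")
```

Each difference it reports is a candidate hypothesis to test, not a diagnosis: here you would try the working account with the other locale, then with the timezone removed, and see which change recreates the problem.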
Heisenbugs/Rare/Weird Bugs
- When you try and debug it, it stops happening
- Only happens in release mode
- Time/Date specific
- Network state specific
- Race conditions/threading issues
- Memory corruption/runaway pointers
You're now into deep detective work. You may need to run soak tests for days on end to recreate the problem.
You'll need to add detailed logging - but you might find adding the logging moves the problem, or even worse, the problem disappears when you turn on logging!
There's only one way to track these down and that's to apply brain power to the problem. You'll need to keep generating and eliminating hypotheses until you hit upon the correct one.
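For the time/date-specific variety, one way to turn a rare bug into a repeatable one is to pass the current date in as a parameter instead of reading the system clock inside the function. The Python sketch below uses a hypothetical `is_last_day_of_month` function as the code under suspicion; the function and the chosen dates are assumptions for illustration.

```python
from datetime import date, timedelta


def is_last_day_of_month(today=None):
    """Return True if today is the last day of its month.

    `today` is an injected clock value; when omitted the real date is used.
    """
    if today is None:
        today = date.today()
    # The day after the last day of a month is always the 1st.
    return (today + timedelta(days=1)).day == 1


# Pin the date to cases that only occur a few times a year.
assert is_last_day_of_month(date(2024, 2, 29))      # leap-year February
assert not is_last_day_of_month(date(2023, 2, 27))  # ordinary mid-month day
assert is_last_day_of_month(date(2023, 12, 31))     # year boundary
```

With the clock under your control there is no need to wait for the end of the month, or for a leap year, to test a hypothesis.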
The tools of the trade
What tools do we have at our disposal?
Logging
I cannot emphasise enough how important good logging is to debugging. Good logging should show the happy path through the code and any errors that can occur. The beauty of this is that when something is going wrong you should be able to see when the code deviates from the happy path. Why does it suddenly stop half way through processing this request?
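As a rough illustration of happy-path logging, here is a sketch using Python's standard `logging` module. The `process_order` function, the logger name and the order fields are hypothetical; the point is only that each step announces itself, so a log that stops partway through shows where processing went off the rails.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def process_order(order):
    """Hypothetical request handler that logs each step of the happy path."""
    logger.info("order %s: received", order["id"])

    if order["quantity"] <= 0:
        # Deviation from the happy path: record why processing stopped.
        logger.error(
            "order %s: rejected, invalid quantity %r", order["id"], order["quantity"]
        )
        return False

    logger.info("order %s: validated", order["id"])
    total = order["quantity"] * order["unit_price"]
    logger.info("order %s: completed, total=%s", order["id"], total)
    return True


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    process_order({"id": "A1", "quantity": 2, "unit_price": 5})
    process_order({"id": "A2", "quantity": 0, "unit_price": 5})
```

In the output, order A1 shows received, validated and completed, while order A2 shows received followed by an error. That missing "validated" line is the deviation from the happy path.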
The debugger
A lot of people just don't seem to know how to use the debugger for their language!
This is one of the nuclear weapons in our arsenal - learn how to use it.
If anyone tells you that it's not possible to use a debugger for your particular language - don't believe them! Check for yourself and learn how to use the tools at your disposal.
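To take Python as one example, the standard library ships a debugger, `pdb`, and the built-in `breakpoint()` drops you into it at any line. The sketch below is one possible way to get "break on exceptions" behaviour for a script: it routes uncaught exceptions to `pdb.post_mortem`. The function names `debug_on_exception` and `install_debug_hook` are made up for this example.

```python
import pdb
import sys
import traceback


def debug_on_exception(exc_type, exc_value, exc_traceback):
    """Print the traceback, then open pdb at the frame that raised."""
    traceback.print_exception(exc_type, exc_value, exc_traceback)
    if sys.stdin is not None and sys.stdin.isatty():
        # Only start an interactive session when there is a terminal to type into.
        pdb.post_mortem(exc_traceback)


def install_debug_hook():
    """Route uncaught exceptions to debug_on_exception."""
    sys.excepthook = debug_on_exception
```

Calling `install_debug_hook()` at the start of a script means the next crash lands you in the debugger with the failing frame's local variables available to inspect. Most IDEs offer the same thing as a "break on exception" setting.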
Our brains
Computers are only doing what we tell them to do. Debugging is simply the art of tracking down where the instructions we've given them are incorrect.
We can reason about bugs: we can hypothesise about why the system is behaving in a certain way, and then test the hypothesis.
You have everything you need at your disposal, take a step back from the coal face and the solution will generally present itself.