March 9, 2011

Understanding someone else's source code

Who has the luxury of working on a greenfield project, where you can write everything from scratch and thus feel in control of your code? I usually don’t. There are several good reasons why programmers need to develop the skill of reading code they have not written themselves. You might want to add a feature (extending existing code), fix a bug (maintaining existing code), look for inspiration (learning from existing code) or integrate functions into another project (reusing existing code). All of these goals require you to read code.

Reading a book is rather easy. The author has already arranged the topics and pieces of information so that one can start at the beginning and finish on the last page. Whenever something is unclear, it is usually sufficient to turn back a few pages and then return to the current page. Straightforward. Unfortunately, reading source code is not like reading a book. First, source code has no intrinsic order, so we often face the problem of which parts to tackle first. It is a chicken-and-egg problem. Second, the code base might be far too large to read in full. We have to decide which parts of the code are significant to us. Third, reading code is generally not much fun. We have to find a comfortable method so that we do not learn reluctantly. Fourth, the terminology used in code is often specific to the author’s background. We have to be able to work out the relationships between new words and their meaning. Fifth, a program has mature parts and buggy parts. We have to cope with uncertainty.

Everything starts with uncertainty. There are a number of discussions on Stack Overflow ([1], [2], [3]), which at least shows that many programmers find it challenging to investigate code they did not write themselves. The linked discussions also provide plenty of ideas. The approach one chooses surely depends on the task at hand. Personal preferences and strengths also heavily influence the choice of methods. But whatever our reason for reading the source code, uncertainty is all we have at the beginning. In order to understand a piece of software, we need to know about the inputs, the outputs, the responsibilities of the different parts and, most often, the algorithms in use. This is what helped me in several projects of this kind:

  • use a top-down approach
    • learn from the code in a breadth-first manner
    • find software entry points
    • see parts as black boxes to avoid getting distracted by details
    • read through overview and tutorials first
    • ask the original author to learn about high-level concepts
    • take breaks to avoid being distracted by details
  • create evidence
    • come up with small theories
    • take small steps in examining code
    • manipulate code in order to create evidence
    • increase verbosity in order to create evidence
    • write tests to validate expectations
    • make use of the debugger
  • collect evidence
    • draw a map on a big sheet of paper
    • comment code to keep track of learned evidence
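
To make "write tests to validate expectations" concrete, here is a minimal sketch in Python. The function `legacy_slugify` is a hypothetical stand-in for something found in a foreign code base, and the class and test names are made up for illustration. Each test records one small theory about how the code behaves; a failing test is evidence too, because it means the theory was wrong.

```python
import unittest


def legacy_slugify(title):
    """Hypothetical stand-in for a function found in someone else's code."""
    return "-".join(title.lower().split())


class SlugifyTheories(unittest.TestCase):
    # Theory: runs of whitespace collapse into a single hyphen.
    def test_whitespace_collapses(self):
        self.assertEqual(legacy_slugify("Hello   World"), "hello-world")

    # Theory: leading and trailing whitespace is dropped.
    def test_edges_are_trimmed(self):
        self.assertEqual(legacy_slugify("  Hello "), "hello")

    # Theory: punctuation is passed through untouched.
    def test_punctuation_survives(self):
        self.assertEqual(legacy_slugify("C++ rocks!"), "c++-rocks!")


def run_theories():
    """Run all theory tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTheories)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_theories()` runs the three theories against the code under study. In a real project the tests would import the foreign function instead of defining it, and they would stay in the repository as a written record of what has been learned.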

The top-down approach helps to avoid being trapped by questions about details. In my experience, the interesting parts of the software are hidden in the high-level logic, and this way we can decide early which parts of the source code to prune in the course of our evaluation. The process of creating and collecting evidence is well known. It’s nothing other than the scientific method. Consider the outline of the scientific method as stated by Wikipedia ([4]):

  1. Use your experience
  2. Form a conjecture
  3. Deduce a prediction from that explanation
  4. Test

It boils down to this. Look at the code: is there some uncertainty remaining? If yes, come up with a small theory. What would be the consequences if this theory proved true? Collect evidence to support or disprove this theory. Make sure you note what you’ve learned. Most probably, new questions have come up in the process. Start anew. That’s the scientific method, applied to the very concrete task of understanding someone else’s code.
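
One pass through this loop might look like the following sketch in Python. Everything in it is assumed for illustration: `parse_line` and `load_config` stand in for the code under study, and `traced` is a small helper written for the experiment, not part of any library. The theory is "`load_config` calls `parse_line` once per non-blank line"; the prediction is that an input with two non-blank lines and one blank line produces exactly two recorded calls.

```python
import functools


def traced(log):
    """Return a decorator that records each call as (name, args, result) in log."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            log.append((func.__name__, args, result))
            return result
        return wrapper
    return decorator


calls = []  # the evidence collected during the experiment


# Hypothetical code under study, instrumented for the experiment.
@traced(calls)
def parse_line(line):
    return line.strip().split("=", 1)


def load_config(text):
    return dict(parse_line(line) for line in text.splitlines() if line.strip())


def experiment():
    """Test the prediction: two non-blank lines lead to two parse_line calls."""
    calls.clear()
    config = load_config("a=1\n\nb=2\n")
    return config, len(calls)
```

If `experiment()` reports two calls, the theory survives for now and is worth noting as a comment next to `load_config`. With real foreign code, the same helper could be applied without editing the source by rebinding the name, for example `module.parse_line = traced(calls)(module.parse_line)`, where `module` is whatever module holds the function.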

In my opinion, it is on the one hand all about not worrying about uncertainty in parts of the source code I am currently not concerned with, and on the other hand about creating evidence and removing uncertainty in the parts I am currently focusing on. This keeps me on track and keeps me collecting evidence. If I have managed to compile and run the software with a certain level of success, I am confident that in the end the pieces of the puzzle will fit together.