Trust No One: An Introduction to Large Codebases for New Engineers
A series on best practices in large codebases for new engineers. This one focuses on how to think about large codebases via an introduction to the practice of commenting. Some variable names have been changed to protect the innocent.
The first thing I advise new software engineers fresh out of school is this: Trust no one.
It is certainly absurdly harsh, but I’ve found it serves a dual purpose: it gives engineers the confidence to share their own ideas (since existing assumptions are always suspect), and it helps motivate them to write robust, readable, and maintainable code.
Most engineers at the beginning of their career have coded primarily for themselves or their TAs or at most a small group. Much of that code is temporary and used only until you hand in your assignment or scrape and analyze your data.
Habits that work very well for academic or personal code will bite you in the ass in a professional environment.
Imagine an environment where dozens (maybe hundreds) are contributing alongside you. Your code will live on for years, and people will be trying to read it long after you’ve forgotten what you’ve written and maybe long after you’ve left the project or company.
In this codebase things will break. Code that once worked will regress. It will regress in a thousand tiny and unpredictable ways. Your nice clean elegant API will have crap added to it, parameters slapped in and deprecated, and semantics inverted until it ends up like this real actual call from Windows XP-era code for handling symlinks (it reads a directory’s reparse point data):
BOOL br= ::DeviceIoControl(hDir, FSCTL_GET_REPARSE_POINT, NULL, 0,
  &ReparseBuffer, MAXIMUM_REPARSE_DATA_BUFFER_SIZE, &dwRet, NULL);
Source: http://www.flexhex.com/docs/articles/hard-links.phtml
Part 1: Comments to the rescue?
In this article, I’m going to discuss comments as a way to illustrate how code can evolve badly and how you can try to protect against future mistakes.
Often people turn to comments to combat the uncontrolled growth of codebases. But comments are ripe for abuse. As with all code, there are good ways and bad ways to write them.
Comments can do terrible damage
Let’s look at a practical example. Everybody knows it’s good to comment your code to “make it readable”. Some people also know it’s bad to comment your code “too much”.
Consider this comment:
def user_history(days=90):
  # Do 90 days of history by default
  for i in xrange(days):
    …
Let’s say you add this comment to your fresh new function. In a large codebase with many contributors, hundreds of people could end up reading this comment. They don’t learn anything new from reading it that they didn’t learn from the default parameter above it. That means it’s wasting engineers’ time. But it’s not just a time waster.
Over the course of a year, hundreds of programmers will make changes to your codebase. Let’s say some new code is added:
def user_history(days=90):
  prep_user_history_processing()
  do_some_more_stuff()
  …
  # Do 90 days of history by default
  for i in xrange(days):
    …
There are now many lines of code separating the function’s declaration from your comment. They might not even fit on the screen at the same time.
Now let’s say that in the intervening year, the user history processing has changed so that now we often read 1000 days of user history. Eventually one of the many programmers using this function is sick of writing user_history(days=1000). They go ahead and change the default.
def user_history(days=1000):
  prep_user_history_processing()
  do_some_more_stuff()
  …
  # Do 90 days of history by default
  for i in xrange(days):
    …
Boom. Tests pass. Code pushed.
A month later, another coder comes along and skims this function. They’re skipping through the file, and they see the comment but don’t actually look at the function declaration. They call the function with the default argument assuming it’s 90 days. Thus a bug is born.
This is how large codebases actually work. This is how many bugs are written.
Learn to program defensively
Comments that don’t add anything, like the one above, aren’t useful and are potentially dangerous. But more than that, in a large codebase the evolution of your code is simply beyond your control. To write truly robust code you must ensure it’s not only correct (which itself can be challenging) but also resistant to skimming, misreading, carelessness, and growth.
When the author added the “90 days” comment above it was the second line of the function just below the default argument. They may have reasoned that if somebody were to change the number they would likely change the comment. They didn’t take into account that in the intervening year enough code could be added that the two would no longer even fit on the screen together at the same time.
It might seem lazy to you that someone would just skip to this part of the code without looking at the top of it. But coders are people and people make mistakes. Reading code is hard. With enough commits and enough time, mistakes are inevitable.
To be a great programmer, you must take into account these future mistakes when you write your code today.
Writing useful comments
Thoughtful code that keeps in mind and takes care of its many hapless readers can help turn the tide toward order in a large codebase. In our user history case, the coder can fix things by simply not having the comment in the first place. It’s not needed.
In the more general case, for comments and for all code, the coder should ask:
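Here is a minimal sketch of what that fix could look like, written for Python 3 (so range replaces xrange). The constant name DEFAULT_HISTORY_DAYS and the placeholder body are my own assumptions for illustration; the point is only that the default is stated in exactly one place, so there is no second copy to go stale.

```python
# Hypothetical constant name; the original function uses a bare literal.
DEFAULT_HISTORY_DAYS = 90


def user_history(days=DEFAULT_HISTORY_DAYS):
    """Return the day offsets that would be processed (placeholder body)."""
    processed = []
    # No comment restating the default: the signature is the only source.
    for day_offset in range(days):
        processed.append(day_offset)
    return processed
```

If somebody later bumps the default to 1000, there is nothing further down the function left to contradict it.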
- How will this be useful over time? 
- What misinterpretations and misconfigurations will this cause? 
- Is this simple enough to be understood even in the face of a barrage of commits, our penchant for finding the easiest way, and our inexhaustible capacity to forget? 
- Who is this code for? 
To answer the last question it’s useful to distinguish whether a comment is for someone calling this function or for somebody modifying the function.
Public comments
If your function is public-facing in a large codebase then the caller shouldn’t have to read the function body to understand how to call it. A clear docstring should obviate the need to read (or likely misread) the code.
def user_history(days=90):
  """Remotely fetch user events for the specified number of days.

  Args:
    days (int): the number of days to fetch for, rounded down by
        day to 12am

  Returns:
    list(UserEvent): all events found, empty list if none are found

  Raises:
    AccessException: raised if caller does not have permission
        to access this data
  """
  for i in xrange(days):
    …
This comment has useful information upfront. The reader doesn’t have to trawl through many lines of Python to figure out (possibly incorrectly) something like the return type. (Note that static typing or something like Python’s type hints can help with this.)
Take care with your docstring. Take time to consider whether it’s really useful, whether and how it can be misinterpreted, and whether it’s likely to still be useful a month or a year from now, after many more commits.
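As a rough illustration of the type hint idea, here is a Python 3 sketch of the same function with annotations. UserEvent and AccessException are hypothetical stand-ins for whatever your codebase defines, and the body returns placeholder data instead of doing a remote fetch so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import List


class AccessException(Exception):
    """Hypothetical error raised when the caller lacks permission."""


@dataclass
class UserEvent:
    """Hypothetical stand-in for the UserEvent type in the docstring."""
    name: str
    days_ago: int


def user_history(days: int = 90) -> List[UserEvent]:
    """Fetch user events for the specified number of days.

    Args:
        days: the number of days to fetch for

    Returns:
        All events found, empty list if none are found.
    """
    # Placeholder data instead of a remote call (an assumption of this sketch).
    all_events = [UserEvent("login", 1), UserEvent("purchase", 120)]
    return [event for event in all_events if event.days_ago < days]
```

The annotations carry the parameter and return types, so the docstring no longer has to repeat them, and a type checker can flag a caller who gets them wrong.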
The comment above is written in Google’s Python docstring style. If your large codebase doesn’t have a standardized format for docstrings, you should invest time in starting one. Standards help protect against laziness and mistakes by providing a correct and easy habit.
Because it’s the standard used by every function and it’s parsed for use on the docs page, everybody is accustomed to updating it. You could even deploy a linter that warns you if you’ve changed a function without updating its description.
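To give a feel for what such a check could look like, here is a small Python 3 sketch. It is not a real linter and it is narrower than "the function changed but the description didn't": it only reports parameters that are missing from the "Args:" section of a Google-style docstring. The helper name undocumented_params and the extra include_deleted parameter are made up for this example, and the parsing is deliberately simplistic.

```python
import inspect


def undocumented_params(func):
    """Return parameter names missing from func's docstring "Args:" section."""
    doc = inspect.getdoc(func) or ""
    documented = set()
    in_args = False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
        elif in_args and stripped.endswith(":") and " " not in stripped:
            # A new section header such as "Returns:" ends the Args section.
            in_args = False
        elif in_args and ":" in stripped:
            # "days (int): description" -> "days"
            documented.add(stripped.split(":")[0].split("(")[0].strip())
    params = inspect.signature(func).parameters
    return [name for name in params if name not in documented]


def user_history(days=90, include_deleted=False):
    """Fetch user events for the specified number of days.

    Args:
        days (int): the number of days to fetch for

    Returns:
        list: all events found
    """
    return []


# include_deleted was added to the signature but never documented.
missing = undocumented_params(user_history)
```

A check like this could run in CI over every public function, so that adding a parameter without documenting it fails the build instead of relying on a reviewer to notice.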
Public comments do not address the implementation of your function. People only calling the function don’t need to know about that. That’s where internal comments come in.
Internal comments
In theory good code should speak for itself. Variables and functions should be clearly named and control flow should be obvious. But there will always be places where the code can’t speak for itself because of something very subtle or very complex.
Internal comments are written inline with the code and won’t be parsed to show in public docs. They’re intended for somebody intimate enough with the code that they’re actually going to go through and read it line by line. Well, let’s be honest; they’ll probably skip a lot of lines.
Here’s an example of an internal comment:
# DO NOT CHANGE THIS TO A SET!!! IT WILL BREAK!!1!!
number_list = [number for number in xrange(10)]
do_something(number_list)
This comment is pretty useful: it says something that’s very hard to glean from the code, and if other coders read it then it could prevent bugs.
But as we know, many coders won’t read it because they’ll be skimming or they’ll make a mistake. This comment would be even more useful as an assert:
number_list = [number for number in xrange(10)]
assert type(number_list) is list
do_something(number_list)
Now we have the same information, but even if the coder is lazy and skims the comments, they still can’t cause a bug. Their testing will fail. (Yes, there might not be adequate testing. That’s a subject for another time.)
Still, we didn’t explain why, and it’s really not obvious from reading the code why we have to have a list there. Somebody might come in and try to clean up the code by removing this seemingly superfluous assert. Here you can put in a useful comment to explain what’s going on.
number_list = [number for number in xrange(10)]
# As of version 1.2.1, do_something has a bug that causes it to 
# crash if non-list collections are used for certain inputs.
assert type(number_list) is list
do_something(number_list)
Now, we have a precise internal comment that
- tells the reader something useful that they can’t easily glean from the code 
- concerns the internals of the function that callers of this function don’t care about
- explains why it’s there and why you shouldn’t remove it 
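The same pattern as a runnable Python 3 sketch. The do_something body here is a hypothetical stand-in (the real one is whatever your codebase calls), and process_numbers is a wrapper name I made up so the check has somewhere to live.

```python
def do_something(numbers):
    """Hypothetical stand-in for the do_something in the example above."""
    return sum(numbers)


def process_numbers(numbers):
    # do_something is assumed to crash on non-list collections for certain
    # inputs, so fail loudly here instead of somewhere deep inside it.
    assert type(numbers) is list, "do_something requires a list"
    return do_something(numbers)
```

One caveat worth knowing: Python skips assert statements when run with the -O flag, so if the check has to hold in production you may prefer an explicit raise.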
This by no means covers all places you should use comments, but the general rules hold. Your code should be concise and explicit, and when it can’t easily speak for itself you can write a comment that tells the reader something useful. Often people won’t bother to read comments, so you shouldn’t rely on them too heavily.
Compassion without trust
The truth is that reading and understanding somebody else’s code is hard. Given enough people with enough code and enough time, mistakes and carelessness are guaranteed to happen regularly. Trust no one to be mistake-free.
When producing code for others you should keep these traits of your fellow coders (and yourself) in mind and defend against them as consistently as possible. By being thoughtful and concise and using well-understood standards, and by generally being aware of our own faults as we write code, we can begin to guard against our own mistakes. In future articles I’ll discuss how this applies to other aspects of software engineering including testing and global state.
Article 1: Trust No One: An Introduction to Large Codebases for New Engineers
Article 2: A Beginner’s Guide to Automated Testing
Article 3: Quick Tips for Gitting on a Team
