Let's Make Software

Fun and surprises with closures, references, and inner functions

Kat Busch — Sun, 28 Jan 2024 20:58:00 GMT

What happens if you try to reassign a hashmap[1] in an inner function, then modify the hashmap with that function’s return value? It might surprise you to know that it depends on the language. It turns out that JavaScript, Ruby, and Python each handle this differently!

That means given these 3 virtually identical code blocks below, 3 different things are going on:

[Image: Reassigning a hashmap in an inner function in JavaScript, Ruby, and Python]

Above we have the same code in each language. In English here’s what it does:

Define a hashmap in a function named outer
Define an inner function that will set the empty hashmap to a new hashmap with my_key as 9. Then inner returns 7
Call inner and set my_key to the return value of inner
Print my_key

Feel free to take a second to make sure you understand what’s going on in the code in the language you’re most comfortable with. Then let’s dive in to find out why this sequence behaves differently in each of the 3 languages. We’ll review all three closely to see what each prints and why.

JavaScript

Let’s get started with JavaScript. Here’s the same code we saw above.

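function outer() {
  let my_obj = {};
  // declare inner with const so it doesn't leak out as an implicit global
  const inner = () => {
    my_obj = {my_key: 9}; // reassigns outer's my_obj to a new object
    return 7;
  };
  my_obj['my_key'] = inner();
  console.log(my_obj.my_key);
}
outer();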

So what does JavaScript do? Drumroll please… it prints 9.

Wait what? Why does it print my_obj['my_key'] as 9 when the last thing we clearly did was assign 7 to my_obj['my_key']? Feel free to copy and paste the above code into the JavaScript console on this very page to test it out if you don’t believe me!

It turns out that JavaScript looked up my_obj on the left-hand side first, while it still pointed at the original object. Then inner reassigned my_obj to a totally new object containing 9. Finally, 7 was written into the original object, and the new version with 9 is what we’re printing out. The old version that gets 7 is no longer referenced.

We can see this more clearly if we save that old version and then print it. You can see the old version prints 7, but the new version has 9.

function outer() {
  let my_obj = {};
  let my_old_obj = my_obj;
  function inner() {
    console.log(my_obj); // prints {}
    my_obj = {my_key: 9};
    return 7;
  }
  my_obj['my_key'] = inner();
  console.log(my_old_obj['my_key']); // prints 7
  console.log(my_obj['my_key']); // prints 9
}
outer()

Here’s what’s going on: a closure is capturing my_obj. A closure is a function that you can save and pass around and that captures references to the variables in its environment at the time the function is defined. That means that inside inner, we have access to my_obj from outer. When we reassign my_obj inside of inner, we’re reassigning the actual my_obj we created in outer.

The tricky part is that JavaScript is using a reference to the original my_obj to write inner’s return value into. So we execute inner which changes my_obj, but we’re updating my_old_obj instead.

So to summarize, JavaScript reassigns your variable to a new object, but it has already grabbed a reference to the original object and writes the return value there. Then the original object becomes unreachable, leaving you with only the one you created in inner.

Ruby

Now let’s take a look at Ruby:

def outer
  my_hash = {}
  def inner
    my_hash = {:my_key => 9}
    7
  end
  my_hash[:my_key] = inner
  puts my_hash[:my_key] # prints 7
end
outer

It looks just like the JavaScript code, but turns out this prints 7!

What’s going on here? We have an inner function, just like in JavaScript, but we don’t have a closure. That means we’re not capturing variables from the environment. It turns out my_hash is a totally new local variable in inner. You can especially see that if you try to print out my_hash in inner:

def outer
  my_hash = {}
  def inner
    puts my_hash # errors out!
    7
  end
  my_hash[:my_key] = inner
  puts my_hash[:my_key]
end
outer

This actually errors out with undefined local variable or method `my_hash' for main:Object.

The variable doesn’t exist because it hasn’t been created in inner and we don’t have access to anything in outer.

We can actually create a closure in Ruby, but not like this. We have to use a lambda. This code below creates a closure and actually prints 9, just like our JavaScript code:

def outer
  my_hash = {}
  inner = lambda do
    my_hash = {:my_key => 9} # overwrites my_hash
    7
  end
  my_hash[:my_key] = inner.call # writes 7 into the original hash, not the new one
  puts my_hash[:my_key] # prints 9
end
outer

Python

Finally let’s take a look at Python. Python prints 7.

def outer():
  my_dict = {}
  def inner():
    my_dict = {'my_key':  9}
    return 7
  my_dict['my_key'] = inner()
  print(my_dict['my_key'])
outer()

So it seems like we’ve got the same situation as we do in Ruby. Probably we don’t have a closure, so if we try to print, we’ll get an error, right?

Wrong! In Python we can print out the variable:

def outer():
  my_dict = {}
  
  def inner():
    print(my_dict) # prints {}, doesn't error!
    return 7
  
  my_dict['my_key'] = inner()
  
  print(my_dict['my_key'])
outer()

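So what’s going on? Is it a closure or not? It turns out that in Python it’s a closure that lets you read the captured variable but not reassign it. Assigning to a name anywhere inside a function makes that name local for the whole function, so if we try to read it before the assignment, we get an UnboundLocalError: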

def outer():
  my_dict = {}
  
  def inner():
    print(my_dict) # errors out!
    my_dict = {'my_key':  9}
    return 7
  
  my_dict['my_key'] = inner()
  
  print(my_dict['my_key'])
outer()

So we can read the outer variable, we can write a new variable, but we can’t do both at once!

There is a workaround though in the form of the Python nonlocal keyword. If we explicitly say that the variable is not supposed to be local and instead it’s supposed to reference the outer my_dict using this nonlocal keyword, then we’ll actually be able to both read and reassign my_dict.

def outer():
  my_dict = {}
  
  def inner():
    nonlocal my_dict
    print(my_dict) # works this time!
    my_dict = {'my_key':  9}
    return 7
  
  my_dict['my_key'] = inner()
  
  print(my_dict['my_key'])
outer() # prints 7
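Why does the nonlocal version still print 7 rather than 9 like JavaScript? A minimal sketch, borrowing the my_old_obj trick from the JavaScript section (the name my_old_dict is introduced here just for illustration): Python evaluates the right-hand side of an assignment before it looks up the target, so inner runs and rebinds my_dict first, and the 7 then lands in the new dict.

```python
def outer():
    my_dict = {}
    my_old_dict = my_dict  # second reference to the original dict

    def inner():
        nonlocal my_dict
        my_dict = {'my_key': 9}  # rebinds outer's my_dict to a new dict
        return 7

    # Python evaluates inner() first, then looks up my_dict for the
    # subscript assignment, so the 7 is written into the NEW dict.
    my_dict['my_key'] = inner()
    return my_old_dict, my_dict


old, new = outer()
print(old)  # {}
print(new)  # {'my_key': 7}
```

So the original dict is never touched, and the 9 is overwritten by the 7.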

References, closures, and inner functions

Don’t expect reference reassignment in closures to behave consistently from language to language. It’s best to be careful when relying on closures since they can often have unintuitive behavior in edge cases. You don’t want to confuse other developers (or yourself) in your codebase!

Which behavior do you think is best? If you were designing a programming language, which would you choose?


[1] Object in JavaScript, hash in Ruby, dictionary in Python.

Looking Under the Hood: The Basics of Relational Databases

Kat Busch — Mon, 16 Oct 2023 15:24:15 GMT

If you use databases often and want to understand how they work, this article is for you. It isn’t a guide on how to use databases or how to design schemas; it’s just about how standard relational databases are implemented. For a relational database like MySQL or PostgreSQL, how do things actually work under the hood? How is data laid out on disk? What is this “Query Planner” I’ve heard so much about? Understanding the nuts and bolts of relational databases is a great foundation for optimizing performance and troubleshooting issues.

How are rows stored on a disk?

Relational databases store data by row, or as rows are called in relational database-land, tuples or records. This means all the data in a particular row of a table sits next to each other on disk. This is different from some types of databases (such as those designed for analytics like Snowflake) that store all the data in a given column together.

Say we have a table called Students that looks like this:

Remember, a disk is just a flat sequence of bytes, so on disk the data is stored like this:

1 Harry Potter                  Holly              3   2 Hermione Granger              Vine               3

Or another way to look at that same data on disk…

Notice the gaps of just \0 (null, empty) repeated? There are a few reasons for gaps.

Easy edits

First, since Wand Wood could be 12 characters, the database leaves extra space at the end in case you edit the wood later to be a longer name. Imagine if you just stored Harry Potter in exactly the 12 characters his name takes up, then in 4th year he decided to change his official name to Harry J. Potter. You would have no space left, so you’d have to slide all the remaining characters down on disk, meaning you’d rewrite the entire rest of the table after his row! Sounds slow. Better to leave a gap since storage is cheap.

Easy navigation

It’s also important for all tuples to take up the same amount of room so that you can easily hop between them. If you wanted to know where row 5 starts and all the rows were a different length, you’d need to know the size of each row before row 5 and add them up. Now we can just say it’s 4*ROW_LENGTH which is the same for every row, letting you easily navigate the table.

Within a specific tuple, since the fields are all consistent sizes, you can easily locate a field by looking up the size of each one before it (or a precomputed offset) and hopping to it.
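To make the hopping concrete, here is a minimal sketch using Python’s struct module. The column widths (a 4-byte id, 30 bytes for the name, 19 bytes for the wand wood, a 4-byte year) are assumptions made up for illustration, not the layout any real database uses, and the sketch ignores alignment padding.

```python
import struct

# Hypothetical fixed-width layout: id, name, wand wood, year.
# "<" means little-endian with no alignment padding.
ROW_FORMAT = "<i30s19si"
ROW_LENGTH = struct.calcsize(ROW_FORMAT)


def pack_row(student_id, name, wood, year):
    # struct pads short byte strings with \0 up to the declared width
    return struct.pack(ROW_FORMAT, student_id, name.encode(), wood.encode(), year)


def read_row(table, row_number):
    # Row N starts at (N - 1) * ROW_LENGTH; no need to inspect earlier rows
    offset = (row_number - 1) * ROW_LENGTH
    student_id, name, wood, year = struct.unpack_from(ROW_FORMAT, table, offset)
    return student_id, name.rstrip(b"\0").decode(), wood.rstrip(b"\0").decode(), year


table = pack_row(1, "Harry Potter", "Holly", 3) + pack_row(2, "Hermione Granger", "Vine", 3)
print(ROW_LENGTH)          # 57
print(read_row(table, 2))  # (2, 'Hermione Granger', 'Vine', 3)
```

Because every row is the same length, finding row 2 is a single multiplication rather than a walk over row 1.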

Memory alignment

Another more subtle reason for the gaps is that data is stored to align to the space it will take in memory, making it faster to read into memory: if the data is aligned to memory size on disk, then you can read data directly into memory from disk without having to shift it around. On modern 64-bit machines, data generally sits in chunks of size 64 bits for fastest access unless you have some pressing reason to try to squeeze in extra data—like varints in protobufs trying to reduce the size of data sent over a wire.

Data in your database will be spaced out so that it aligns into 64-bit chunks and can easily be read into memory for fast access.

Organizing tuples in a table

Going up a layer, hardware storage is generally divided into segments called blocks, traditionally 512 bytes each (4 kilobytes on newer drives). Databases group these into larger pages, which are 8 kilobytes by default in Postgres. A block is read all at once and transferred to memory in one go. For this reason, related tuples are stored sequentially on a block so they can all be read in together. And you wouldn’t want a tuple to start on one block and end on another. If you did that, you would have to read in two whole blocks to read in one tuple. How data is organized within a page is beyond the scope of this article, but here’s some documentation on how it’s done in Postgres for those interested. (Though one thing to note for later is there’s usually some header data associated with each tuple.)

If a table spans multiple blocks, the blocks will point to each other. It’s helpful to have all of a given table as close together on disk as possible so that it can all be read into memory in as few loads as possible.

It’s important to remember that these databases were designed in an era when hard disks prevailed, not SSDs. Hard disks had an interesting property: it took MUCH longer to find a particular block (a seek) than to read in data from that block (and any subsequent blocks). That means database reads would be way, way slower if data that tended to be accessed together was spread out on the disk. While seeks are still slower than sequential reads on SSDs, the difference is much less pronounced: a seek is ~9 ms on a modern HDD vs ~0.1 ms on a modern SSD. Many of the design decisions that went into the major relational databases are vestigial consequences of this old technology.

That being said, you can tune databases to be more optimized for SSDs. For instance, see random_page_cost and effective_io_concurrency for Postgres.

Some common operations

We can go over some simple operations on databases with just the information we have so far.

Adding a row

Adding a row is pretty simple in a regular table. You navigate to the end of the existing rows (there would usually be a nice pointer to get you here) and write the relevant data.

Adding a row is much trickier if your table is stored in sorted order. Then you must use a table scan or index lookup (more on this later) to find the right place in the table. If there happens to be an empty slot (for example from a previously deleted row), you can add your row there, but if not you’ll need to make room, either by shifting over everything that comes after it or by splitting the full page in two (called a page split). Fancy pointer magic can avert having to shift things, but remember that might result in more disk seeks, which are slow.

This is why databases are not usually stored sorted and instead rely on indexes to maintain ordering (again, more on indexes shortly!).

Deleting a row

The logical thing to do when deleting a row might be to slide everything after it back to conserve space, but remember that, as with adding to a sorted table, that would mean rewriting the entire table after that row. Instead, a flag in the tuple’s header is usually just flipped to indicate that the row is now deleted. The tuple is now morbidly called a tombstone or dead row. Its space can later be reused, and if you have a lot of dead rows, it might be a good idea to consolidate your table, which is called vacuuming in Postgres and is done with OPTIMIZE TABLE in MySQL.
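As a rough illustration of the tombstone idea, here is a toy in-memory table. The Table class and its method names are hypothetical and only mimic the behavior described above; real databases track much more in each tuple header.

```python
class Table:
    """Toy table where each slot is [deleted_flag, data]."""

    def __init__(self):
        self.slots = []

    def insert(self, data):
        # Reuse a dead slot if one exists, otherwise append at the end
        for slot in self.slots:
            if slot[0]:
                slot[0], slot[1] = False, data
                return
        self.slots.append([False, data])

    def delete(self, predicate):
        for slot in self.slots:
            if not slot[0] and predicate(slot[1]):
                slot[0] = True  # flip the flag; no other row moves

    def scan(self):
        # Readers simply skip tombstones
        return [slot[1] for slot in self.slots if not slot[0]]

    def vacuum(self):
        # Consolidate: physically drop the dead slots
        self.slots = [slot for slot in self.slots if not slot[0]]


students = Table()
students.insert("Harry Potter")
students.insert("Hermione Granger")
students.delete(lambda name: name == "Harry Potter")
print(len(students.slots), students.scan())  # 2 ['Hermione Granger']
students.vacuum()
print(len(students.slots))  # 1
```

Deleting costs one flag flip, and the space is only reclaimed when a later insert reuses the slot or the table is vacuumed.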

Indexes

Okay so we’ve got our tables laid out on disk. Generally, if you want to access data now, you’ll need to scan the entire table and check each row to see if it matches your search query. This can obviously be very slow for a large table… Enter indexes.

Indexes are pretty simple. They store relevant columns in a smaller, faster data structure that you can use to hop quickly to the desired part of your table. As part of designing a database schema, an engineer thinks about which columns and combinations of columns would benefit from this fast lookup structure and creates an index for them. You can always add indexes later on too as you add new columns and data access patterns.

Indexes are usually stored in a tree structure because trees offer super fast O(log(n)) lookup. The most common structure is a B-tree, though you can configure this.

For instance, in our students table above we could add an index on year that would let you find a given year or a cutoff point in O(log(n)) and walk through students in year order. More on this in the next section.

So to summarize, without an index you need to scan a table to find a row. An index lets you use a fast data structure to look up where your data is without having to scan the whole table. It also lets you walk through data in a sorted order without having to read the whole table into memory or store it sorted.
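Here is a small sketch of that idea. It uses a sorted list and binary search as a stand-in for a B-tree (a real index is a tree of disk pages, not a Python list), and the third student row is a made-up placeholder added so the cutoff has something to exclude.

```python
import bisect

# The unsorted table; list position plays the role of a row pointer
students = [
    (1, "Harry Potter", "Holly", 3),
    (2, "Hermione Granger", "Vine", 3),
    (3, "Example Student", "Oak", 1),  # hypothetical row for illustration
]

# The "index": (year, row position) pairs kept in sorted order
year_index = sorted((row[3], position) for position, row in enumerate(students))


def rows_with_year_greater_than(cutoff):
    # Binary search for the first index entry whose year exceeds the cutoff
    start = bisect.bisect_right(year_index, (cutoff, float("inf")))
    # Follow the pointers from the index back into the table
    return [students[position] for _, position in year_index[start:]]


print(rows_with_year_greater_than(2))
# [(1, 'Harry Potter', 'Holly', 3), (2, 'Hermione Granger', 'Vine', 3)]
```

The table itself never has to be sorted; only the much smaller index is.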

Query planning and execution

So now the real purpose of a database: querying! What happens between writing SELECT * FROM students and actually getting the data we need?

This happens in 4 stages:

Parsing: Parsing is relatively straightforward. Just take the text and turn it into a structured representation, as any compiler or interpreter would.
Query re-writing: This is the logical query planning stage, where the sophisticated optimizations can happen to change the basic query you wrote into one that is very speedy. This is where a lot of the magic (and confusion!) happens.
Physical query plan creation: Take the logical query plan from step 2 and turn it into a physical query plan: that means deciding what you actually read from which tables, columns, indexes, etc. While step 2 seems like it has most of the “optimization”, step 3 actually has some as well. For instance, how should the data be passed between stages: via memory or disk?
Execution: Execute the query and return the results.

The query planner generates multiple possible plans, scores each one based on how long it thinks the plan will take and how many resources it will consume, and then picks the best one.

Let’s dig in a little bit more to steps 2 and 3, collectively called the query optimizer. Let’s look at a few simple examples. To get a basic sense of why this would be needed, consider the query

SELECT * FROM students WHERE year > 2

on our students table above. Remember, we have an index on year. So here are two options:

Read every row in students and if the year is > 2, copy that row into our result set. This is called a table scan, and if you’re scanning tables of any large size and you’re trying to make an interactive application, it’s almost always bad to do this because you have to read in the entire table to memory and do computations on it.
Use an index on year to find pointers to rows where year > 2, and copy those rows into our result set.

It might seem like there’s an obvious choice: isn’t O(log(n)) (for option 2) better than O(n)? Often, and that’s why we make indexes in the first place! But there are a few circumstances where it might not be. What if 99.9% of rows have year > 2? Then it might actually cost us time to read in the index rather than just go straight to the table scan. Or maybe the data is very small, and it’s faster just to run through the data than hop between the index and the data. Databases actually keep statistics internally on the contents of tables so that they can make intelligent decisions about these types of tradeoffs.
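A toy cost model shows how those statistics could feed the decision. This is a deliberately simplified sketch, not how any real planner scores plans: the function names, the 100 rows per page, and the assumption that every matching row costs one random page read are all made up for illustration, with the two cost constants loosely modeled on the default values of Postgres’s seq_page_cost and random_page_cost.

```python
def estimate_costs(total_rows, matching_fraction, rows_per_page=100,
                   sequential_page_cost=1.0, random_page_cost=4.0):
    # Table scan: read every page once, sequentially
    table_pages = max(1, total_rows // rows_per_page)
    scan_cost = table_pages * sequential_page_cost
    # Index scan: pessimistically assume one random page read per matching row
    matching_rows = total_rows * matching_fraction
    index_cost = matching_rows * random_page_cost
    return scan_cost, index_cost


def choose_plan(total_rows, matching_fraction):
    scan_cost, index_cost = estimate_costs(total_rows, matching_fraction)
    return "index scan" if index_cost < scan_cost else "table scan"


print(choose_plan(1_000_000, 0.0001))  # index scan
print(choose_plan(1_000_000, 0.999))   # table scan
print(choose_plan(50, 0.5))            # table scan
```

With few matching rows the index wins easily; when nearly every row matches, or the table fits in a single page, the plain scan comes out cheaper.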

The above example is super simple. As queries increase in complexity, so too do query plans. There are several ways to plan join operations, and which one to pick can depend on many factors, from how big the tables are to the amount of memory to the cardinality of different fields.

The query optimizer can get extremely fancy, and it can sometimes do a bad job. You’re probably most accustomed to actually looking at the query planner’s results when it is doing a bad job: that’s when a query that should be fast isn’t and you have to go in and debug why with an EXPLAIN command.

Finally, since finding the optimal query plan is NP-hard in general (read: slow), there’s a tradeoff: how much time do you spend searching for an optimal plan vs just going for it? This tradeoff is often configurable.

Databases in Practice

I hope this guide helps you feel a little more confident delving into the world of relational databases. Relational databases, while simple on the surface, are extremely complex. They have managed to power most software for decades. Postgres has dozens of parameters just to configure the query planner, not to mention everything else. You’ll have no shortage of interesting intricacies to delve into as you continue your journey, from write-ahead logging to replication to transactions.

Sources and Further reading

I drew from a variety of sources from discussions with experts to reading Wikipedia, but I mostly used these:

In addition to the above, a few other interesting things in database-land:

Read about Dremel, a distributed analytics database by Google (the basis for BigQuery and an influence on Snowflake and many modern OLAP databases)
Read about Aurora, an innovative use of hardware by AWS
Read about Spanner, a globally consistent database created at Google

Please leave other recommendations! And huge thanks to my reviewers, Julie Tibshirani, Dylan Visher, and Stuart Cornuelle.

TypeScript enums explained

Kat Busch — Sun, 13 May 2018 21:42:54 GMT

This article explains the difference between TypeScript’s enum, const enum, declare enum, and declare const enum declarations. Caveat: I don’t recommend you use any of these enums most of the time. See the recommendation section at the end for details.

You can follow along with these explanations in this TypeScript playground.

enum

enum Cheese { Brie, Cheddar }

First, a plain old enum. The TypeScript compiler will emit a lookup table in JavaScript. The lookup table looks like this:

var Cheese;
(function (Cheese) {
  Cheese[Cheese["Brie"] = 0] = "Brie";
  Cheese[Cheese["Cheddar"] = 1] = "Cheddar";
})(Cheese || (Cheese = {}));

Then when you have Cheese.Brie in TypeScript, it emits Cheese.Brie in JavaScript, which evaluates to 0. Cheese[0] emits Cheese[0] and evaluates to “Brie”. This reverse lookup is unique to numeric enums and can be pretty convenient.

const enum

const enum Bread { Rye, Wheat }

No code is actually emitted for this! Its values are inlined. The following variations emit the value 0 itself in JavaScript:

Bread.Rye
Bread['Rye']

Inlining might be useful for performance reasons, although as with all performance optimizations be sure to take note of the trade-off of readability that you’re signing up for.

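But what about Bread[0]? The compiler rejects this, because a const enum member can only be accessed with a string literal. There’s no lookup table and nothing for the compiler to inline, so if the code were emitted anyway it would error out at runtime.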

Note that in the above case, the --preserveConstEnums flag will cause Bread to emit a lookup table. Its values will still be inlined though.

declare enum

declare enum Wine { Red, White }

As with other uses of declare, declare emits no code and expects you to have defined the actual code elsewhere. This enum version emits no lookup table.

Wine.Red emits Wine.Red in JavaScript, but there won’t be any Wine lookup table to reference so it’s an error unless you’ve defined the actual enum elsewhere.

declare const enum

declare const enum Fruit { Apple, Pear }

This emits no lookup table, but it does inline! Fruit.Apple emits 0.

But as with const enum, Fruit[0] will error out at runtime because it’s not inlined and there’s no lookup table.

String enums and const string enums

As of TypeScript 2.4, you can also create string enums. String enums are like standard int enums but you specify string initializers when you declare your enum.

enum Beer {
    Lager = 'Lager',
    Ale = 'Ale',
}

The above string enum will generate this lookup table in JavaScript:

var Beer;
(function (Beer) {
    Beer["Lager"] = "Lager";
    Beer["Ale"] = "Ale";
})(Beer || (Beer = {}));

So Beer.Lager and Beer['Lager'] both evaluate to 'Lager'. There are no longer any numbers associated with your enum at all.

Just as with number enums, you can declare a const string enum that will inline the string literals themselves in transpiled JavaScript.

Recommendation

So of the many types of enums which do I recommend using?

Trick question: none.

I recommend string literal types. It’s more TypeScripty. It’s extremely readable. You can see it in the code as is. If you print the value of a variable you won’t get a mysterious 0 or 1; you’ll get the actual string itself every time.

Because of that, it’s also easily JSONifiable and produces JSON that you can read in other languages without having to maintain the enum mapping in other languages.

It also lets you easily do cool TypeScript things like mapped types.

Here’s an example:

type Nut = 'walnut' | 'almond';

Now, just as with enums, if your function accepts Nut, the compiler will error when a different string is passed in. Check out the playground to see that in action.

But what if I have to use an enum?

If you go with an enum because you need it for some legacy reason, I recommend a string enum if that fits your needs. It’s the most similar to a string literal type. If not, a plain old enum will likely cause the least confusion. If you need to go with enums for performance reasons and you’re truly sure that’s your bottleneck, const enum is the way to go.

So your website is slow? Let’s fix that.

Kat Busch — Thu, 09 Nov 2017 05:12:33 GMT

A lot of websites are slow.

Many web developers don’t even know our websites are slow because we’re on fast internet with great hardware and we live close to our servers.

But what if you have users on the opposite side of the globe from your server? What if their bandwidth is minuscule? What if they’re on a phone? It could take them ages to load your site. And users’ slow CPUs might spend an eternity computing the next animation frame.

What’s an engineer to do? Here I’ll introduce how to think about web performance and how to go about identifying and fixing web performance problems. This article will familiarize you with web performance concepts. I’ll point you in the right direction so you can figure out where to get started and where to invest your time when you’re tackling your website’s performance.

What is a page load?

To understand why websites can be slow, let’s go through the stages of everything that goes into loading a web page. This article focuses on page load speed, not how responsive your website is once it’s done loading, but some of the same tools apply to both stages.

Getting connected

Before you can start loading any website, you’ve got to open a connection between the browser and the target site. This includes a DNS lookup to find your website’s IP address; a TCP handshake to establish a connection; and an SSL handshake to set up encryption (I hope!). That means you’ve already got several round trips to and from your servers before the user even begins to load your content. If your servers are on the other side of the planet from your client, each round trip is going to be over 100 milliseconds. You can’t beat the speed of light! If you’re looking for a page load of less than a second, you’ve already used up almost a third of your time budget on the three round trips needed to initiate a secure connection.

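Note that there’s a quick fix here if your stack is up to date. TLS session resumption uses cached session state to shorten the SSL handshake on repeat connections, and HTTP/2 sends many requests over a single connection so you pay the setup cost only once. That’s just one of many reasons to consider HTTP/2 if you don’t use it already.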

Server response

Okay, you’ve connected! Now your servers have to start doing some work to deliver bytes to the user. Depending on the design and nature of your service, response time can vary wildly. Is your website a bit of static content, or do you need to do all sorts of database lookups and computation to prepare a response? Do you compute the whole page and push it at once, or do you send a shell and push other content as it becomes available? Are you rendering React on the server? Your server response could take anywhere from a few milliseconds to hundreds of milliseconds or more.

Content download

So some sort of response has been prepared. Now clients have to download that response from your servers. Transferring a large response could take a while on a slow or flaky connection.

The browser can quickly begin to parse the response. As soon as it starts parsing, it’ll probably receive instructions to go and download a bunch more stuff: CSS, images, and JavaScript. At this point, many websites tap content delivery networks (CDNs) that are experts at delivering static content quickly around the globe.

And thus the great download of JavaScript, CSS, and images begins. Some large modern websites tend to measure their JavaScript alone in hundreds of kilobytes to megabytes, even after compression. If you have a 100 megabit connection, that will only take tens or hundreds of milliseconds. But if you’re on a 5 megabit connection (like some mobile networks), you’re looking at over a second of content download time, and if you’re on a slower or flakier network, this could take many seconds.
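The back-of-the-envelope numbers for connection setup and download are easy to reproduce. This sketch assumes a 1 megabyte JavaScript payload as the example size and ignores latency, packet loss, and TCP slow start, so real transfers will be slower; the function names are made up for illustration.

```python
def setup_ms(round_trip_ms, round_trips=3):
    # DNS lookup + TCP handshake + SSL handshake, one round trip each
    return round_trip_ms * round_trips


def transfer_ms(payload_bytes, bandwidth_megabits_per_s):
    # Convert bytes to bits, then divide by the link speed in bits per second
    bits = payload_bytes * 8
    return bits * 1000 / (bandwidth_megabits_per_s * 1_000_000)


ONE_MEGABYTE = 1_000_000  # assumed example payload size

print(setup_ms(100))                   # 300
print(transfer_ms(ONE_MEGABYTE, 100))  # 80.0
print(transfer_ms(ONE_MEGABYTE, 5))    # 1600.0
```

That is 300 milliseconds gone before the first byte of content, plus 80 milliseconds of download on a fast connection versus 1.6 seconds on a 5 megabit one.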

Parsing and execution

Luckily, the browser usually parallelizes this download step with parsing and execution. Once your CSS is in, rendering begins even if JavaScript is still loading. When any JavaScript comes down, the browser begins to go through the rather expensive tasks of parsing and then executing it. It will also lazily parse JavaScript — parsing is CPU intensive and on a slow client it can add seconds to page load time.

Everything else

Your JavaScript might kick off requests to your server that download more data. Websites can have post-load pipelines to show you more stories on your newsfeed, more products on your store, load a new menu, download higher res images, etc. You might have some CPU-heavy JavaScript computations that need to be done while your user is interacting with your page, like if you have too much going on in a React render function. Maybe you offload some decoding to a Web Worker. Websites diverge a lot after the initial page load.

What to do about it

Wow! That’s a lot of stuff that goes on just to load a page! The good news is there are a lot of great tools to help you dig in and understand your performance.

Profile, profile, profile

You might be tempted after reading this to go tear out a bunch of JavaScript code because your number of kilobytes of JavaScript seems high. Stop!

As with any performance problem, the first step is simply to profile. No point putting your codebase through the wringer if it turns out performance isn’t a pressing problem for you.

You need to understand your performance so that you can decide whether it’s a problem. There are a lot of great tools for profiling web pages. On your own machine you can use the Chrome, Firefox and Edge profilers. These can give you an idea of how much time is spent in various stages of the page load, from network requests to JavaScript execution. Sometimes performance will vary wildly from one browser to the next because of implementation differences.

You can use these browser dev tools to emulate slower connections, so you can see what some of your users might be experiencing. Pretending to be on 2G is a fun one.

Make sure you test your page with static resources both uncached and cached.

You can also use tools like WebPageTest to see what page loads are like from different browsers and different places around the world.

Log, log, log

On top of that, strongly consider logging page load times for your users. The Navigation Timing API and Resource Timing API are your friends. Send the information from these APIs on your users’ machines to your servers and collect it for analysis.

But careful when interpreting results: these APIs measure many different stages of your page load. Make sure you figure out which event actually lines up with when the user can view or interact with your page.

This user data will let you see the actual performance real users are experiencing and help you get a breakdown of where time is being spent. Logging will help you home in on problems by looking at which countries, browsers, and devices are suffering.
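Once the timings are on your servers, the analysis can start very simply. The sketch below groups page load samples by one dimension and reports the median and the worst case; the sample records and field names are invented purely for illustration and are not real measurements.

```python
from collections import defaultdict
from statistics import median

# Made-up samples for illustration only
samples = [
    {"country": "US", "device": "desktop", "load_ms": 900},
    {"country": "US", "device": "phone", "load_ms": 1500},
    {"country": "IN", "device": "phone", "load_ms": 4200},
    {"country": "IN", "device": "phone", "load_ms": 5100},
    {"country": "IN", "device": "desktop", "load_ms": 2600},
]


def summarize(samples, key):
    # Bucket load times by the chosen dimension (country, device, ...)
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample["load_ms"])
    return {
        group: {"count": len(times), "median_ms": median(times), "worst_ms": max(times)}
        for group, times in groups.items()
    }


by_country = summarize(samples, "country")
by_device = summarize(samples, "device")
print(by_country)
print(by_device)
```

Looking at medians and worst cases per group, rather than one global average, is what surfaces the users who are actually suffering.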

Some teams find it helpful to log performance metrics on every commit or run perf tests on every commit to identify degrading performance quickly. After your profiling, you probably have a good idea of which metrics you need to track and log for each commit: bytes of JavaScript, number of images, etc.

Now for the fun part: make your page fly

If your investigations reveal that your website is slower than you want it to be in a way that you believe is impacting user experience, it’s time to get to work. Your profiling information should already have revealed to you where to get started.

There are lots of small changes that can speed things up: remove redirects, preconnect to the CDN, etc.

For many web pages, the main thing you can do to make your website faster is load less stuff. Google recommends you keep your page weight to 1 megabyte uncompressed.

Minify JavaScript, compress your content, turn on tree shaking. Use async and defer whenever possible to load scripts that aren’t needed right away. Remove unnecessary images. Load only the JavaScript you actually need. Remove code for inaccessible features. Redesign your page to load fewer images! Lots of huge images on your homepage might look really snazzy on your fast office internet, but what does it look like on a slower connection where they load pixel by pixel?

But be careful to keep the big picture in mind instead of only adding on some hacks that will make your code more complicated and will break in a few months. Upgrading to HTTP/2 won’t solve all your problems if you’re just loading too much data. On the other hand, a thoughtful rearchitecting to make your website’s performance stable for a long period might be the best investment of your time. For instance, you might want to make it so that your page’s components load in parallel, use GraphQL to load only necessary data, or implement server-side rendering.

Eyes on the prize: user experience

But don’t miss the forest for the trees. Your end goal should always be focused on your users and improving their experience. Performance is just one aspect of the overall experience of your website. If you’re smart about your development process, you’ll make a page that’s not only snappy but also delightful to use.

Quick Tips for Gitting on a Team

Kat Busch — Sat, 17 Jun 2017 22:44:14 GMT

A series on best practices in large codebases for new engineers. This one focuses on version control etiquette.

If you’re new to using git on a team instead of solo, this super fast primer will help you get started with standard workflows and etiquette.

Hopefully not what your commit history looks like! By austrini [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

Use feature branches

Use feature branches. Don’t push your work directly to the master branch until it’s finalized and (depending on your team’s culture) code reviewed. This is basic and discussed in many great tutorials so I won’t go into detail, but it’s a precursor to pretty much all work in larger teams.

Don’t mix bug fixes with feature work

If there’s a bug in the master branch, it’s a good idea to create a new branch (perhaps with a name starting with “fix/”) and file a pull request. An independent branch ensures that the fix can be merged independently of other work. If you tack it on to a pull request for a less important feature you might hold up others who are waiting for the fix and create many angry team members.

Write clear commit messages

When you’re working on a team, the repo’s commit history is a vital record of the progress of a project and an essential tool for finding bugs and understanding code. In three weeks or three years, somebody (even you!) might look back at your commit message to figure out why you wrote something. Help them out! Try to make your commit messages as clear and concise as possible.

Not clear:

“Fix bug with colors”

Clear:

“Fix issue #16: make send button change colors on press”

Not very specific:

“Add list endpoint”

Specifies the relevant part of the repo:

“[Photos controller] Add functionality to list recent photos”

Squash commits before merging

Squashing commits is the process of taking several commits and rewriting them into one commit. Squashing creates a clean commit history. When you and your teammates are looking back at the log on master, instead of seeing a jumble of crap like this:

Thu May 18 13:52 lint errors

Mon May 22 14:47 Feedback from CR

Thu May 18 16:04 Lololol oops

Thu May 18 16:03 Fix tests

Thu May 18 13:52 Fix lint errors

Thu May 18 13:39 Forgot to add file

Thu May 18 13:36 Start on compression

They’ll only see one commit with a clear description of your change:

Mon May 22 15:47 Add video compression option to uploader

Squashing means you can feel free to have lots of crazy work-in-progress commits on your feature branch. Just clean it up with a squash before submitting your pull request and/or before merging with master.

There are several ways to squash with git. GitHub actually now allows you to squash commits with the GitHub UI. If your repo has the correct settings turned on you’ll see a “squash and merge” button on your pull request.

Rebase before merging

Consider rebasing your branch on top of master before you merge. If you merge, you’re combining master and your branch by creating a new commit that corrects conflicts:

[on master] Merge branch ‘video_compression’

[from your branch, now on master] Add video compression option to uploader

[from master] Fix issue #17: remove uploader exit button

You could end up with a lot of merge commits this way. If instead you rebase on top of master, you correct the merge conflicts in your commit and end up with this cleaner history once you merge:

[from your branch, now on master] Add video compression option to uploader

[from master] Fix issue #17: remove uploader exit button

One warning: it’s always better to squash THEN rebase instead of rebase then squash. Git rebases commits one by one. If you haven’t squashed, you may need to resolve conflicts with master for every commit in your branch. If you squash, you’ll only need to resolve one commit’s conflicts.

And that’s it!

Keep in mind that these tips all describe fairly common practices, but customs vary from team to team. If you’re working on a codebase with an existing repo, ask your teammates about their git conventions.

If you have any other tips for gitting on a team, feel free to leave them in a response.

Article 1: Trust No One: An Introduction to Large Codebases for New Engineers

Article 2: A Beginner’s Guide to Automated Testing

Article 3: Quick Tips for Gitting on a Team

The Software Engineer’s Essential Time Estimation Guide

Kat Busch — Sat, 25 Feb 2017 01:04:34 GMT

Hofstadter’s Law: It always takes longer than you expect, even when you take into account Hofstadter’s Law. — Douglas Hofstadter

By Rogerborrell (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons

A Product Manager friend of mine recently told me about a problem she was having: “Software engineers can never estimate how long their projects will take. What can I do?” Two CEOs recently told me the same thing.

We engineers have all witnessed this too. I once saw a project estimated at two days take four months. In that case even the “just double it” heuristic would still be off by an order of magnitude. This can have real implications for the business. I’ve seen a whole company move mountains for a launch event that had to be pushed out months.

At a high level the problem is a difference between what engineers mean when we estimate time and what PMs, managers, PR, and really everybody else mean. Most engineers instinctively think about the minimum time to write a working prototype if everything goes pretty much as planned. But those blocked downstream want to know when the project will be ready for launch — and that’s a totally different story.

For engineers, mastering estimation is a lifelong journey. Neglecting it will plague you and everyone you directly or indirectly interface with. Mastering estimation will set you apart and your colleagues will associate you with professionalism, stability, and quality work.

Why we need to estimate

Let me start by answering the question I most often get from engineers: “Why bother?” Many engineers complain (correctly) that it’s an overhead cost. “I’ll finish sooner if I just power through on it until it’s done!”

There are two main reasons: external dependencies and prioritization.

External dependencies

Nothing impactful operates in a vacuum. Projects often have external dependencies like coordination with non-engineering teams (comms, finance, PR, customer support), other engineering teams, or even end-users themselves. It’s typically the job of the manager, PM, or CEO to coordinate with these external dependencies. That means that the one who is best qualified to make a time estimate (the engineer) isn’t the one who needs it most. This asymmetry leads to a fundamental tension.

Prioritization

Time estimates are also key for prioritizing work. “Bang for the buck” is an important metric in engineering and there’s no “buck” without real estimation. Even if the feature you’re working on is the most awesome thing in the world, if you take the time to do a full estimate, you might realize it will take way too long to finish.

Say you’re working on a project that will make the website 50% faster but in the same amount of time you could have finished two projects that will each make the website 40% faster. If you don’t take time to do an initial estimate, you’ll never know that you could have ended up with a much faster website!
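To make that arithmetic concrete, here is a rough sketch. It assumes that “X% faster” means the load time shrinks by X%, and that the two smaller projects are independent so their savings compound. Both assumptions, and the 10-second baseline, are made up for illustration.

```python
# Illustrative only: assumes "X% faster" means load time drops by X%,
# and that independent improvements compound multiplicatively.

def remaining_time(baseline_seconds, reductions):
    """Return the load time left after applying each fractional reduction."""
    remaining = baseline_seconds
    for reduction in reductions:
        remaining *= (1 - reduction)
    return remaining


baseline = 10.0  # hypothetical page load time in seconds

one_big_project = remaining_time(baseline, [0.50])
two_small_projects = remaining_time(baseline, [0.40, 0.40])

print(f"One 50% project:  {one_big_project:.1f}s")     # 5.0s
print(f"Two 40% projects: {two_small_projects:.1f}s")  # 3.6s
```

Under those assumptions the two smaller projects leave you with a 3.6-second page instead of a 5-second one, which you would only find out by estimating first.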

Time estimation 101

Now that we all agree that time estimation is necessary the vast majority of the time, let’s talk about techniques.

We underestimate time because we think “How long would it take me to write a basic version of this?”

But shipping is much more than a basic version. You will need to account for the time it takes you to write, test, debug, and polish. Don’t forget the time you’ll be in meetings, interviews, doing code reviews, sending emails, etc.

Another reason we underestimate is that we almost always encounter “unknown unknowns” in the coding process itself. And those are impossible to anticipate fully and account for. Maybe your IDE will get an update that breaks your project and you’ll burn a day fixing it. There’s no way your estimate could have taken that into account.

But we can still do much better than our initial instincts. Here’s what I do:

Step 1: Make a technical plan

You should already have a technical plan or design doc ironed out for any nontrivial project before you begin. You use this to let others know what you’re doing and get feedback. The technical plan is the ideal place to start the time estimate. As you work through the technical details, you’ll already magically be improving your estimate as you uncover unknown unknowns. Maybe you’ll realize that you probably will need to upgrade to a new version of a library that you’re using and that could add a day. You might even realize the library you were planning to use doesn’t actually exist and you’ll need to write it.

Granularity is important here. If any step feels murky or vague you’re either hand-waving (and should learn more) or you need to break it down into smaller steps. At the same time if a step is too fine-grained it might be brittle enough to invalidate the whole plan in practice.

For a good guide on what sort of thinking should go into your technical plan, check out this article by Alicia Chen. One key point is to iron out any potential ambiguities with the PM or other stakeholders so that you don’t end up building the wrong thing and having to start over.

Step 2: Add a time estimate to each step

Estimate how long each step in your technical plan will take to implement. This will often involve research into the details (“is there already a library to do this or not?”). Depending on the nature of the project, throwing together a simple prototype might help reveal a lot of potential future pain points.

Step 3: Add in a bunch of extra time

Now that you have the bare bones of your estimate, there are all those things we mentioned earlier to account for.

Debugging as you go: Bugs always come up. A lot of this depends on your experience with a specific codebase and the codebase’s maturity.
Meetings, interviews, vacations, etc: You probably won’t be at your desk coding the whole time. How many hours will you REALLY have to code? You should at least look at your calendar when estimating.
Final testing and bug-bashing: You should generally be writing tests as you go, but a lot of teams need to do an extra round of polish work or integration testing before launch. Give ample budget for it in your estimates. If you’re doing a staged rollout, the initial 1% rollout will probably reveal bugs that need fixing. Account for that.
Code review: How many rounds are typical for you in this codebase? How long do they usually take? Be sure to verify an ample supply of reviewers (and maybe check their calendars). If it’s the kind of project where there’s only one possible reviewer you should solicit a commitment in advance and ask them for a backup in case they’re on vacation or way too busy at a critical point.

Once you start adding in all of these costs to shipping, you’ll start to see your time estimates match up a lot better with when your projects actually launch. Yes, they’ll be longer. Yes, you might feel pressured to shorten them. But when people figure out they can depend on you they’ll come to appreciate your estimates.
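Steps 2 and 3 can be sketched as a small calculation. Every number below is hypothetical (the step names, the overhead fractions, and the four coding hours per day), so treat it as a template to fill in from your own plan and calendar rather than a formula.

```python
import math

# Step 2: hypothetical coding hours for each step of the technical plan.
plan_hours = {
    "upgrade library": 8,
    "build endpoint": 16,
    "wire up UI": 12,
}

# Step 3: overheads as fractions of raw coding time (assumed values).
overheads = {
    "debugging as you go": 0.30,
    "final testing and bug-bashing": 0.25,
    "code review rounds": 0.15,
}

# Hours actually left for coding after meetings, interviews, email, etc.
coding_hours_per_day = 4


def estimate_days(plan_hours, overheads, coding_hours_per_day):
    """Return the working days needed, rounded up to whole days."""
    raw_hours = sum(plan_hours.values())
    total_hours = raw_hours * (1 + sum(overheads.values()))
    return math.ceil(total_hours / coding_hours_per_day)


# The instinctive estimate: raw coding hours at eight hours a day.
naive_days = math.ceil(sum(plan_hours.values()) / 8)
realistic_days = estimate_days(plan_hours, overheads, coding_hours_per_day)

print(f"Naive estimate:     {naive_days} days")      # 5 days
print(f"Realistic estimate: {realistic_days} days")  # 16 days
```

With these made-up inputs the realistic number is roughly three times the instinctive one, which is the kind of gap the extra costs tend to explain.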

Step 4: Review your estimate after you’ve launched

Yes, it sounds like a pain to go back to your time estimate after you’ve finished a project and review what you’ve done. But this review is how you learn and get better next time.

What ended up taking a different amount of time from expected? If integration testing took twice as long as you thought, write that down and leave more time for it next time. Or try to improve your integration testing system.

You’ll definitely see your estimates improve with time. You might even come up with some great insights here that will help your whole team.

In the end, it’s all about communication

Communicate your timetable and changes to it early and often. If you let your manager know a month before launch that there’s a new security bug in the library you were using and you’ll have to start over from scratch, they’ll in turn have time to notify PR, finance, or users that there’ll be a delay.

Communication to relevant parties also lets them give you important information that can affect your estimate. A designer might say, “Oh, if that fancy animation is going to take a whole week we can just cut it completely.” A PM might add, “This is just a prototype to experiment on in user studies. We don’t need to do much bug bashing for this iteration.” A manager might say, “You’re spending half your time in meetings? Let me fix that!”

For engineers, don’t give in to pressure to report a shorter time than is realistic to appease the higher-ups. It is more professional to be honest about your estimates and how they’re changing.

For everybody else involved, respect that estimation is hard and that it’s going to be a process. You can only cut down time estimates by sitting down and removing features or stages that aren’t actually going to be needed for launch. You can’t cut it down by nagging.

We’re never going to be able to perfectly estimate the time a project will take. The only way around this is open communication, compassion, and relentless prioritization.

A Beginner’s Guide to Automated Testing

Kat Busch — Sat, 03 Dec 2016 20:32:45 GMT

This is the second entry in my series on navigating large codebases for new software engineers. It’s not necessary to read the first article to understand this one, but you can find it here.

Software tests are the best gift an engineer can give themselves. Tests make your life and everybody else’s easier.

When I started as a software engineer I had absolutely no idea why tests were important and I felt pretty lost on where to start. From my experience mentoring and teaching new engineers, I know I’m not alone. So this is where you start. Here, with real-world examples, I will make the case for automated software testing and I will explain how to get started writing tests in a disciplined and thoughtful way.

Part I: The case for testing

Making your life easier

Years ago, when I was starting on my first project in my first job out of school, I didn’t write any tests. I knew that theoretically tests were good (and I’d written them during my internships) but I didn’t really have a solid understanding of why. I didn’t think it mattered much if I didn’t write tests at all or at least didn’t write them until the last minute. I was on a deadline and I didn’t want to get bogged down writing all that unnecessary extra code.

I was creating a mobile A/B testing system for Dropbox’s Android app. I submitted the Java component of the system for code review, with a note: “TODO: write tests”. As I developed the code, I engaged in an incredibly tedious manual testing process that involved setting up a test server on my computer, installing the app on a test phone, and manually creating an A/B test on the test phone. Obviously I didn’t test very many code paths because it was just so tedious.

After much back and forth, the code reviewer said “You can’t merge this without tests.” So I wrote some basic tests and merged it.

Lo and behold, I soon needed to fix a small bug. I then had to test that everything worked after the fix. “Hmm…,” I thought. “I’ve written some tests for this. I suppose I could use those.” I ran the tests. Within a few seconds, I knew that everything still worked! Not just a single code path (as in a manual test), but all code paths for which I’d written tests! It was magical. It was so much faster than my manual testing. And I knew I didn’t forget to test any edge cases, since they were all still covered in the automated tests.

This gets to one of the most important reasons for tests: tests can make your development process faster. If I’d written those tests earlier then I could have frequently just run the tests in under a minute instead of tediously poking and printlning my code over and over again to verify it was doing what I expected.

Preventing new bugs over time

For large codebases preventing bugs over time is really the most important part of tests. In codebases with many engineers contributing, you do not necessarily have control over changes to your code and especially changes to code that your code depends on. With dozens of commits a day, just by chance one of those commits will likely inadvertently change a behavior that your code relies on.

If you haven’t written tests, then there’s no reliable way for other coders to know that their commit has impacted yours. Good tests are an explicit signal to other engineers (or a future version of yourself) of your assumptions about the behavior of your code and its dependencies. In an ever-evolving codebase of millions of lines, how else could others possibly know that your important (but only semi-related) code will no longer work because of their change? If your code is still in the codebase a year (or five) after you’ve committed it and there are no tests for it, bugs will creep in and nobody will notice for a long time.

I witnessed an obscure but important user-facing feature broken for weeks because of a change in a seemingly unrelated part of the code. It took a build-up of user reports to trickle through customer support to engineering before anybody noticed. Afterwards, the team wrote a post-mortem about how the unclear dependency structure allowed this seemingly unrelated code to break their feature and go unnoticed. The postmortem didn’t even mention the simple obvious thing that would have avoided the whole breakage in the first place: a test! (Of course better code structure is also very important.)

Tests are your best weapon against the complexity of large codebases. Although you try to keep your code clean and clear, something will always break. Remember the rule: If it matters that the code works you should write a test for it. There is no other way you can guarantee it will work.

Part II: Tips for effective testing

Here are some basic tips to make your tests maximally useful and minimally painful. I will introduce some testing concepts and terminology. This is by no means complete. There are entire books worth reading on the topic, but it’s a good start.

Many small tests vs one big test

A lot of times I come across monolithic tests like “fn integer_tests() {}” that test fifteen behaviors of Integer in a row (like this example). I recommend having fifteen separate tests here for two reasons.

The first is purely practical: if that test fails halfway through, you won’t know whether the second half of your fifteen checks are passing or not because they won’t be run. This can make it harder to debug. If you know which three of fifteen exactly are failing, you might be able to identify the problem immediately.

The second reason is a subtler readability issue. With a glob of fifteen checks in a row, it’s harder for another engineer reading them to grok exactly what’s being tested. It’s also harder for them to see whether they need to add a new test case after a change or whether their case is already covered. If there are fifteen separate tests, it’s pretty easy to read through the neat names:

def test_integer_addition():
   ...


def test_integer_subtraction():
   ...

It’s also very easy for readers to look at how each function is set up and add a new test.

Make it easy to add new tests

It should be obvious to somebody editing your code how to test their changes. I recommend making helper functions in your test file to make setup and teardown simple. If adding a new test case is extremely easy then you can be more confident that your library will have high test coverage over time. But if the author has to spend a lot of time investigating exactly how to create the appropriate inputs, etc, then if they’re in a time crunch they might decide to skip a test altogether. You should always make it easy for people to do the right thing.
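Here is one way such a helper might look. ShoppingCart and make_cart are hypothetical names invented for this sketch; the point is only that each new test becomes a couple of lines once the setup lives in one place.

```python
# Hypothetical code under test.
class ShoppingCart:
    def __init__(self):
        self.items = {}

    def add(self, name, price, quantity=1):
        self.items[name] = (price, quantity)

    def total(self):
        return sum(price * quantity for price, quantity in self.items.values())


def make_cart(**prices):
    """Test helper: build a cart holding one of each named item."""
    cart = ShoppingCart()
    for name, price in prices.items():
        cart.add(name, price)
    return cart


def test_empty_cart_total_is_zero():
    assert make_cart().total() == 0


def test_total_sums_item_prices():
    cart = make_cart(apple=2, bread=3)
    assert cart.total() == 5


def test_quantity_multiplies_price():
    cart = make_cart(apple=2)
    cart.add("milk", 4, quantity=2)
    assert cart.total() == 10
```

Somebody adding a fourth test only has to copy one of the short functions above and change the inputs.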

Unit tests vs integration tests

Unit tests exercise one piece of code, like a class, module, or function. They shouldn’t need to set up an entire environment or complex dependencies (like databases). They’re usually very thorough and very quick to run.

Integration tests, sometimes called UI tests or system tests, will test the end-to-end functionality of your project. They’re usually slower to run because they have to initialize an environment, and they’re prone to be flakier because often small changes (like fixing a UI bug) can cause them to fail.

Both kinds of tests are important, but day-to-day you should focus on unit tests because they’re more modular and maintainable and they’re cheaper to run. This usually involves faking things that your code depends on so they don’t have to be created. That way you can be pretty sure that if a test fails it’s from your code breaking, not the thing it depends on. These fake things are often called mocks.
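As a minimal sketch of faking a dependency, the example below uses Mock from Python’s standard unittest.mock module. The greeting_for function and the database’s get_user method are hypothetical, made up for this illustration.

```python
from unittest.mock import Mock


# Hypothetical code under test: greets a user looked up from a database.
def greeting_for(user_id, database):
    user = database.get_user(user_id)
    if user is None:
        return "Hello, stranger!"
    return f"Hello, {user['name']}!"


def test_greets_known_user_by_name():
    # The mock stands in for a real database, so nothing has to be set up.
    fake_db = Mock()
    fake_db.get_user.return_value = {"name": "Ada"}

    assert greeting_for(42, fake_db) == "Hello, Ada!"
    fake_db.get_user.assert_called_once_with(42)


def test_greets_unknown_user_as_stranger():
    fake_db = Mock()
    fake_db.get_user.return_value = None

    assert greeting_for(7, fake_db) == "Hello, stranger!"
```

If one of these tests fails, the database can’t be the culprit, because there isn’t one.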

Integration tests are important as a last line of defense because they usually test what the user actually sees, and in the end that’s what matters. All your unit tests could be working, but there could be a problem where your libraries aren’t interacting together as expected. You won’t catch this without an integration test.

Which to write first: code or tests?

There’s a formal development methodology called Test-Driven Development that dictates writing tests before any feature code. This can be hard to do, because you really don’t know exactly what your interface should look like until you start writing it and realize some of your assumptions need to be changed. On the other hand if you’re not focused on testing during development it’s easy to end up with code that’s hard to use and hard to test.

The sweet spot is to write code and tests in parallel. It’s an iterative process. Bouncing back and forth between your perspective and an outsider’s in a test helps guarantee your code’s interface is usable. Think of it like a painter who stands back from the canvas and pretends to be an ordinary viewer. You’ll find that writing tests as you go makes your interfaces better and makes your code more testable. If you find yourself writing something hard to test, you’ll notice it early on when there’s still time to improve the design.

Avoid Flakey Tests

Flakey tests are tests that fail some percent of the time, either because of a rare bug or because there’s a problem with the way the test is written. Say the test relies on the order things are inserted in a hashmap. Hashmaps are unordered, but maybe the iterator will return them in order 98% of the time. The other 2% of the time your test will fail. This can lead other people to think their change broke the build and waste time hunting bugs that don’t exist. If your test is flakey do whatever you can to remove the flakiness. To achieve this stability tests should be deterministic. Don’t depend on a given ordering of threads, and if you need a random number generator make sure to specify a fixed seed in your tests.
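A small sketch of both ideas, using hypothetical functions: pass in a seeded random number generator so the result is repeatable, and compare unordered results as sets instead of relying on iteration order.

```python
import random


def pick_winners(entrants, count, rng):
    """Return `count` distinct entrants chosen with the supplied generator."""
    return rng.sample(entrants, count)


def unique_tags(posts):
    """Return the set of tags used across all posts."""
    return {tag for post in posts for tag in post["tags"]}


def test_pick_winners_is_repeatable_with_fixed_seed():
    entrants = ["ana", "bo", "cy", "di", "ed"]
    # Two generators with the same seed produce the same sequence.
    first = pick_winners(entrants, 2, random.Random(1234))
    second = pick_winners(entrants, 2, random.Random(1234))
    assert first == second
    assert len(set(first)) == 2


def test_unique_tags_ignores_ordering():
    posts = [{"tags": ["git", "testing"]}, {"tags": ["testing", "perf"]}]
    # Compare as a set rather than depending on the order of iteration.
    assert unique_tags(posts) == {"git", "testing", "perf"}
```

Accepting the generator as a parameter is a design choice for the sketch: production code can pass an unseeded random.Random() while tests pass a seeded one.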

Test the interface, not the implementation

When you’re testing, it might seem easiest to dig into your interface and test that private variables have the right value at the right time. While this seems convenient, it has some problems:

Your tests will need to be rewritten if the internal implementation is changed. That is, a tree should have the same behavior whether it’s implemented as a flat array or with pointers. If your tests are accessing the array, then if the implementation is changed to pointers you’ll need to totally rewrite your tests.
Mucking with internals of the interface takes focus away from what really matters: do the functions that the user of this interface will call work as expected?

I’m not advocating that someone unfamiliar with the code writes the tests. You know the internals and you know what types of edge cases you should test for. You shouldn’t have to reach into private data to test that. Reaching in can cause you to butcher your interface and add functionality whose only purpose is to verify state in your test. It’s better to focus your energy on the correctness of real use-cases.
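To illustrate, here is a hypothetical Stack whose backing list is an implementation detail. The tests only call the public methods, so they would keep passing if the list were swapped for a linked list.

```python
class Stack:
    """A hypothetical stack; the list in _items is an implementation detail."""

    def __init__(self):
        self._items = []

    def push(self, value):
        self._items.append(value)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    def is_empty(self):
        return not self._items


# Brittle alternative (avoid): reaching into private state, e.g.
#   assert stack._items == [1, 2]
# This breaks as soon as the internal storage changes.

def test_pop_returns_items_in_reverse_push_order():
    stack = Stack()
    stack.push(1)
    stack.push(2)
    assert stack.pop() == 2
    assert stack.pop() == 1
    assert stack.is_empty()


def test_pop_on_empty_stack_raises():
    stack = Stack()
    try:
        stack.pop()
    except IndexError:
        pass
    else:
        raise AssertionError("expected IndexError from pop on an empty stack")
```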

Try it yourself

As with all my posts in this series, my goal is to provide new engineers a practical framework for being successful in large codebases. You’ll still need to build your own intuition and learn from your own mistakes but hopefully hearing about mine will help the process go faster.

As you grow as an engineer, you realize more and more that testing is a great tool, not an annoying overhead. It’s a way of helping you write your code faster, make fewer mistakes, and avoid bug creep. Next time you write code, set aside time to treat yourself with some tests.

Article 1: Trust No One: An Introduction to Large Codebases for New Engineers

Article 2: A Beginner’s Guide to Automated Testing

Article 3: Quick Tips for Gitting on a Team

The Rules of Optimization: Why So Many Performance Efforts Fail

Kat Busch — Mon, 26 Sep 2016 13:32:31 GMT

The First Rule of Program Optimization: Don’t do it. The Second Rule of Program Optimization (for experts only!): Don’t do it yet. — Michael Jackson

In our era of modern, speedy machines with oodles of memory, performance is something that few coders ever need to think about; but we think about it anyway.

We think about performance even when we don’t need to, and it’s much to our detriment. It complicates our lives and the lives of other coders in our codebases. Thus the first rule of optimization: Don’t do it.

In this post, I’ll illustrate the reasoning behind the first and second rules of optimization, and I’ll provide some tips for the times when you really truly do need to undertake performance improvements.

Note: When we talk about performance we’re often referring to the speed at which a program runs, but the same rules apply to optimizing memory usage, battery usage, or whatever other resource might be important to your program.

Part I: The Bottleneck

My first important real-world performance problem happened during a summer of research in college. A program for helping biologists identify genes was too slow. My advisor suggested that there was a part of the graph coloring algorithm that took a long time, and I should start with optimizing that.

So I did. I had a great time squeezing CPU cycles out of C++. It’s a fun form of puzzle solving. I improved the performance of the graph coloring algorithm by 70%.

After a week of intricate optimization, I ran the whole program on the full dataset. It was only 15% faster — hardly noticeable. What was going on?

I dug around a little and discovered that when a process finished, it wrote long results to a file. The program caused many processes to write at once. As the operating system’s scheduler switched among the processes, the disk had to do tons of seeks to write each file. Disk seeks are expensive.

I added a few lines of code that created a file system lock so that only one process could write to disk at once. Sequential writes are much faster than seeks, so with that change the program was about 60% faster.
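As a rough illustration of the idea (not the original code, which isn’t shown here), a lock like this might look as follows in Python on a POSIX system. The function name and file paths are hypothetical, and fcntl is not available on Windows.

```python
import fcntl  # POSIX only
import os
import tempfile


def write_results(lock_path, output_path, data):
    """Write data to output_path while holding an exclusive lock on lock_path."""
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock, so writers take turns.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            with open(output_path, "w") as out:
                out.write(data)
                out.flush()
                os.fsync(out.fileno())  # push the data to disk before unlocking
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


# Usage with throwaway paths; every process would share the same lock file.
workdir = tempfile.mkdtemp()
lock_path = os.path.join(workdir, "results.lock")
output_path = os.path.join(workdir, "results_0.txt")
write_results(lock_path, output_path, "gene results\n")
```

Each process still writes its own output file; the shared lock file only decides whose turn it is to use the disk.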

With a few lines of code, I’d more than cut the runtime in half. My week of C++ optimization had not. My advisor had pointed me in the C++ direction because he’d spent the most time on it. He’d probably only spent a couple minutes writing the code that writes to files.

I can’t tell you how many times in my years of performance work across many platforms I’ve seen people waste time on optimizations that do almost nothing. The most common failed performance effort is optimizing something that’s not the bottleneck. You often think you’re optimizing the bottleneck but you’re not looking at the whole picture.

Ironically it’s often the most experienced engineers who make this mistake first. It’s easy to get caught up in our own systems and forget there’s a whole other world out there. I once saw a group of senior backend engineers spend an entire hack week rewriting a complex web endpoint to use Go instead of Python. At the end, they found out that the bottleneck for the page was on the browser side of the app (and not the server side), making their performance improvement completely irrelevant.

These were experienced engineers who knew extremely well what made the server fast and slow, but their detailed knowledge of one part of the system prevented them from looking at the bigger picture.

Always ask yourself, what performance does the end-user care about? How does your code impact that?

Part II: Complexity

Performance improvements that don’t get at the bottleneck can actively harm your codebase. Almost all performance improvements increase the complexity of the code. Unneeded complexity creates a lot of problems that are not performance-related by making code harder to understand.

Let’s take a look at another example (slightly simplified) from a friend who used to use Python for graphics:

for i in xrange(1000):
    my_mesh.draw()

Imagine that before pushing this, the author noticed that draw() is being called 1000 times. Since the Python interpreter has to look up methods on each invocation, time could be saved by caching the reference to the draw method in a local variable:

draw_method = my_mesh.draw
for i in xrange(1000):
    draw_method()

This seems simple enough but it’s liable to cause a lot of problems down the line. While right now it might not seem like it will cause bugs, imagine that it’s in a large code base. Over time, many new lines of code might be added to the loop by many different engineers. Eventually draw_method might be far away from draw_method = my_mesh.draw, and the engineer will have to skip around and lose context to figure out what that means as they read through the code.

It will definitely slow people down and it could cause bugs. For instance, somebody might add live updating:

draw_method = my_mesh.draw

for i in xrange(1000):
    …
    if my_mesh.outdated():
        my_mesh = updated_mesh
    …
    draw_method()

Now there is a bug. draw_method will call draw on the first mesh, not the updated one, because the cached reference to my_mesh.draw wasn’t updated when my_mesh was reassigned.
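The behavior is easy to reproduce in isolation. The Mesh class here is a hypothetical stand-in whose draw method just returns a string, so the effect of the stale reference is visible.

```python
class Mesh:
    """Hypothetical stand-in for a graphics mesh."""

    def __init__(self, name):
        self.name = name

    def draw(self):
        return f"drawing {self.name}"


my_mesh = Mesh("old mesh")
draw_method = my_mesh.draw      # bound method tied to the current object

my_mesh = Mesh("updated mesh")  # rebinding the name does not touch draw_method

print(draw_method())   # drawing old mesh
print(my_mesh.draw())  # drawing updated mesh
```

A bound method holds a reference to the object it was taken from, so reassigning the variable afterwards has no effect on it.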

It’s quite common to see small performance “improvements” thrown in as part of larger changes, when those improvements haven’t been evaluated for how much they help. Even what seem like simple performance improvements add a cost. The cost is complexity. Complexity means bugs and more time reading, testing, and maintaining code. Particularly in large or long-lived code bases, complexity is the biggest impediment to progress.

The change above is consistently faster on my machine by about .05ms on average. Is .05ms enough of an improvement to warrant the complexity? Most of the time, no.
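If you want to check a number like this on your own machine, the standard timeit module is one way to do it. This is a sketch with a hypothetical do-nothing Mesh, written for Python 3 (so range rather than xrange); the absolute numbers will vary from machine to machine.

```python
import timeit


class Mesh:
    def draw(self):
        pass  # placeholder: a real draw call would do actual work


my_mesh = Mesh()


def attribute_lookup_each_time():
    for _ in range(1000):
        my_mesh.draw()


def cached_bound_method():
    draw_method = my_mesh.draw
    for _ in range(1000):
        draw_method()


runs = 200
lookup_seconds = timeit.timeit(attribute_lookup_each_time, number=runs)
cached_seconds = timeit.timeit(cached_bound_method, number=runs)

print(f"lookup each time: {lookup_seconds / runs * 1000:.4f} ms per loop")
print(f"cached method:    {cached_seconds / runs * 1000:.4f} ms per loop")
```

Because draw does nothing here, this measures the lookup overhead at its most visible. With a real draw call, the difference would be a much smaller share of the total.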

You are simply not in a position to evaluate whether the complexity cost is worth the improvement in performance unless you have a big picture view of the performance of the system. Maybe you can optimize, but you certainly shouldn’t do it yet.

Part III: How to approach performance problems

But let’s say you think your system is really too slow, and you really do need to improve its performance for empirical reasons. Here are a few rules to get you started in a disciplined way.

Get the right metric

A lot of the problems I covered above come from measuring the wrong things. In my project in college, I measured the time for the graph algorithm to finish and not the time for the results to be written out. The hack week team was measuring server response time and not the time for the web page to appear to a user. Measuring the wrong thing leads to solving the wrong problem.

Making sure you’re measuring the right number is the single most important thing to do when you’re tackling a performance problem. Without that, you have no idea how much you’re helping. Similarly, don’t necessarily trust those more experienced than you to tell you the bottleneck. Make sure you evaluate it yourself.
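One low-tech way to check where the time goes is to time the whole run and each phase inside it. The sketch below uses a small context manager; compute and write_output are hypothetical stand-ins, and the sleep only simulates a slow disk.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(phase):
    """Add the wall-clock duration of the enclosed block to timings[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[phase] = timings.get(phase, 0.0) + elapsed


# Hypothetical stand-ins for the real phases of a program.
def compute():
    return sum(i * i for i in range(100_000))


def write_output(result):
    time.sleep(0.05)  # pretend the disk is slow


with timed("total"):
    with timed("compute"):
        result = compute()
    with timed("write"):
        write_output(result)

# Print phases from slowest to fastest.
for phase, seconds in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{phase:>8}: {seconds * 1000:.1f} ms")
```

Timing the total alongside the phases is the important part: it shows how much of what the user waits for each phase actually accounts for.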

How important are marginal gains in this system?

Once you’re measuring the right thing, it will be a lot easier to tell which improvements are worthwhile. Nonetheless, it’s worth it to spend some time thinking about what your goals are.

You need to know not only how fast (or memory-intensive, etc) the system is, but also how much marginal gain you’ll get from improvements. Do you save your company money? Do you save your users time? If it’s a script that runs once a week that nobody is dependent on, even savings of an entire minute (basically forever in computer time) might not be worth adding complexity. But if it’s a function run a million times per second across a fleet of thousands of servers, savings of microseconds could save a lot of money.
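A quick back-of-the-envelope calculation shows how different those two cases are. The figures below are assumptions chosen for illustration: one minute saved per weekly run, and one microsecond saved per call at a million calls per second in total across the fleet.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

# Weekly script: one minute saved per run, 52 runs a year.
weekly_script_saved_seconds = 52 * 60

# Hot function: assumed 1,000,000 calls per second in total across the
# fleet, and an assumed saving of 1 microsecond per call.
calls_per_second = 1_000_000
saved_microseconds_per_call = 1
saved_seconds_per_second = calls_per_second * saved_microseconds_per_call / 1_000_000
hot_function_saved_seconds = saved_seconds_per_second * SECONDS_PER_YEAR

print(weekly_script_saved_seconds / 60)    # minutes saved per year: 52.0
print(hot_function_saved_seconds / 3600)   # CPU-hours saved per year: 8760.0
```

Under these assumptions the weekly script saves under an hour a year, while the hot function frees up roughly one CPU core around the clock.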

If you understand what your performance goals are before beginning your work, you can make the right call on performance/complexity tradeoffs later on. If you’re being honest with yourself, you’ll often see that you should scrap marginal gains and focus on major wins.

Measure right

Lots of things can affect your performance measurement. You might be a mobile developer with a slick test device that has only one program running on it. If your goal is to produce a response in < 10ms on average, that’ll be a lot easier on your test device than on your user’s four-year-old phone with low battery and a hundred apps running in the background.

It could also cause you to misunderstand your program’s characteristics. You might be I/O bound on a test device, when on most devices you’re actually CPU bound.

All sorts of external factors can affect performance including:

Memory usage
Server load
Network bandwidth and latency
Battery level
CPU usage
Disk usage
Various caches at all layers

When you’re measuring, try to make sure as many factors as possible are held constant so you can accurately compare different approaches. But also make an effort to understand what these factors usually look like, so you can make sure your appraisal of bottlenecks is realistic.

One simple strategy for cutting out the natural ebbs and flows of resources on a machine is to run your code many times. Python’s easy-to-use timeit library defaults to running your sample a million times. This can help average out some fluctuations in system resource availability.
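As a small sketch of what that looks like, the snippet below times a throwaway function with timeit. The function join_digits is a made-up workload, and the run count is lowered from the default only so the example finishes quickly.

```python
import timeit


def join_digits():
    # Small sample workload to time.
    return "".join(str(i) for i in range(100))


# number defaults to 1,000,000; a smaller count keeps this example quick.
runs = 10_000
total_seconds = timeit.timeit(join_digits, number=runs)
print("average per call: %.3f microseconds" % (total_seconds / runs * 1e6))

# repeat() runs the whole measurement several times. The minimum is
# usually the run least disturbed by other activity on the machine.
best_seconds = min(timeit.repeat(join_digits, number=runs, repeat=5))
print("best of 5: %.3f microseconds per call" % (best_seconds / runs * 1e6))
```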

Try to integrate performance tests into your build

Some software projects fail their builds if a commit causes a performance regression. If performance is important for your project, consider adding performance as part of your continuous integration. It can let you prevent performance regressions before they get shipped. The sooner you are aware of regressions, the easier they are to fix. In large codebases, small regressions can build up over time unless they’re aggressively kept in check. Tests can keep them out of your system.
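One possible shape for such a check is an ordinary test that fails when a function exceeds a time budget. This is only a sketch: build_index is a hypothetical function, and the budget is an assumed number that a real project would derive from measurements on its own build machines.

```python
import timeit

# Assumed budget for this example; a real one would come from measurements
# taken on the machines that run the build.
BUDGET_SECONDS_PER_CALL = 0.001


def build_index(words):
    # Hypothetical function whose speed the project cares about.
    index = {}
    for position, word in enumerate(words):
        index.setdefault(word, []).append(position)
    return index


def test_build_index_stays_within_budget():
    words = ["w%d" % (i % 50) for i in range(500)]
    runs = 200
    # Best of several repeats, to reduce noise from a busy build machine.
    best = min(timeit.repeat(lambda: build_index(words), number=runs, repeat=5))
    assert best / runs < BUDGET_SECONDS_PER_CALL
```

A test runner such as pytest would collect the test_ function along with the rest of the suite. Timing tests can be flaky on shared hardware, so a generous budget and a best-of-several measurement help avoid false alarms.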

Finally, the optimization

Armed with the right philosophy and information about your system, you’re ready to begin performance optimization. Don’t do it until you’ve profiled and analyzed and figured out a coherent strategy — then go forth and code!

Once you understand the basics of how to tackle general performance problems, you can start to delve deeper into your system. This is where things can get really fun and exciting. Small systems are elegant logic puzzles. Large systems are universes of slowness whose complex interlocking parts can take months to unravel. Improving performance in both environments can be beautiful.

It’s just not often the right thing to do.

For my guidance on web performance see So your website is slow? Let’s fix that.

Trust No One: An Introduction to Large Codebases for New Engineers

Kat Busch — Fri, 22 Jul 2016 03:58:59 GMT

A series on best practices in large codebases for new engineers. This one focuses on how to think about large codebases via an introduction to the practice of commenting. Some variable names have been changed to protect the innocent.

The first thing I advise new software engineers fresh out of school is this: Trust no one.

It is certainly absurdly harsh, but I’ve found it serves a dual purpose: it gives engineers the confidence to share their own ideas (since existing assumptions are always suspect), and it helps motivate them to write robust, readable, and maintainable code.

Most engineers at the beginning of their career have coded primarily for themselves or their TAs or at most a small group. Much of that code is temporary and used only until you hand in your assignment or scrape and analyze your data.

Habits that work very well for academic or personal code will bite you in the ass in a professional environment.

Imagine an environment where dozens (maybe hundreds) are contributing alongside you. Your code will live on for years, and people will be trying to read it long after you’ve forgotten what you’ve written and maybe long after you’ve left the project or company.

In this codebase things will break. Code that once worked will regress. It will regress in a thousand tiny and unpredictable ways. Your nice clean elegant API will have crap added to it, parameters slapped in and deprecated, and semantics inverted until it ends up like this real actual function from Windows XP used to create a symlink:

BOOL br= ::DeviceIoControl(hDir, FSCTL_GET_REPARSE_POINT, NULL, 0,
  &ReparseBuffer, MAXIMUM_REPARSE_DATA_BUFFER_SIZE, &dwRet, NULL);

Source: http://www.flexhex.com/docs/articles/hard-links.phtml

Part 1: Comments to the rescue?

In this article, I’m going to discuss comments as a way to illustrate how code can evolve badly and how you can try to protect against future mistakes.

Often people turn to comments to combat the uncontrolled growth of codebases. But comments are ripe for abuse. As with all code, there are good ways and bad ways to write them.

Comments can do terrible damage

Let’s look at a practical example. Everybody knows it’s good to comment your code to “make it readable”. Some people also know it’s bad to comment your code “too much”.

Consider this comment:

def user_history(days=90):
  # Do 90 days of history by default
  for i in xrange(days):
    …

Let’s say you add this comment to your fresh new function. In a large codebase with many contributors, hundreds of people could end up reading this comment. They don’t learn anything new from reading it that they didn’t learn from the default parameter above it. That means it’s wasting engineers’ time. But it’s not just a time waster.

Over the course of a year, hundreds of programmers will make changes to your codebase. Let’s say some new code is added:

def user_history(days=90):
  prep_user_history_processing()
  do_some_more_stuff()
  …

  # Do 90 days of history by default
  for i in xrange(days):
    …

There are now many lines of code separating the function’s declaration from your comment. They might not even fit on the screen together at the same time.

Now let’s say that in the intervening year, the user history processing has changed so that now we often read 1000 days of user history. Eventually one of the many programmers using this function is sick of writing user_history(days=1000). They go ahead and change the default.

def user_history(days=1000):
  prep_user_history_processing()
  do_some_more_stuff()
  …

  # Do 90 days of history by default
  for i in xrange(days):
    …

Boom. Tests pass. Code pushed.

A month later, another coder comes along and skims this function. They’re skipping through the file, and they see the comment but don’t actually look at the function declaration. They call the function with the default argument assuming it’s 90 days. Thus a bug is born.

This is how large code bases actually work. This is how many bugs are written.

Learn to program defensively

Comments that don’t add anything, like the one above, aren’t useful and are potentially dangerous. But more than that, in a large codebase the evolution of your code is simply beyond your control. To write truly robust code you must ensure it’s not only correct (which itself can be challenging) but also resistant to skimming, misreading, carelessness, and growth.

When the author added the “90 days” comment above it was the second line of the function just below the default argument. They may have reasoned that if somebody were to change the number they would likely change the comment. They didn’t take into account that in the intervening year enough code could be added that the two would no longer even fit on the screen together at the same time.

It might seem lazy to you that someone would just skip to this part of the code without looking at the top of it. But coders are people and people make mistakes. Reading code is hard. With enough commits and enough time, mistakes are inevitable.

To be a great programmer, you must take into account these future mistakes when you write your code today.

Writing useful comments

Thoughtful code that keeps in mind and takes care of its many hapless readers can help turn the tide toward order in a large codebase. In our user history case, the coder can fix things by simply not having the comment in the first place. It’s not needed.

In the more general case the coder should consider for comments and for all code:

How will this be useful over time?
What misinterpretations and misconfigurations will this cause?
Is this simple enough to be understood even in the face of a barrage of commits, our penchant for finding the easiest way, and our inexhaustible capacity to forget?
Who is this code for?

To answer the last question it’s useful to distinguish whether a comment is for someone calling this function or for somebody modifying the function.

Public comments

If your function is public-facing in a large codebase then the caller shouldn’t have to read the function body to understand how to call it. A clear docstring should obviate the need to read (or likely misread) the code.

def user_history(days=90):
  """Remotely fetch user events for the specified number of days.

  Args:
    days (int): the number of days to fetch for, rounded down by
        day to 12am

  Returns:
    list(UserEvent): all events found, empty list if none are found

  Raises:
    AccessException: raised if caller does not have permission
        to access this data
  """
  for i in xrange(days):
    …

This comment has useful information upfront. The reader doesn’t have to trawl through many lines of Python to figure out (possibly incorrectly) something like the return type. (Note that static typing or something like Python’s type hints can help with this.)
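For instance, a type-hinted version of the signature might look like the sketch below. It is written for Python 3, and the UserEvent class is a hypothetical placeholder, since the article never shows the real one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserEvent:
    # Hypothetical minimal event type, for illustration only.
    name: str


def user_history(days: int = 90) -> List[UserEvent]:
    """Fetch user events for the specified number of days."""
    events: List[UserEvent] = []
    # The real fetching logic is elided in the article; this stub returns
    # an empty list so the example runs.
    return events
```

The hints put the argument and return types in the signature itself, where a type checker can verify them and where they are less likely to drift than prose.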

Take care with your docstring. Take time to consider whether it’s really useful, whether and how it can be misinterpreted, and whether it’s likely to still be useful a month or a year from now, after many more commits.

The comment above is written in the Google Python docstring style. If your large codebase doesn’t have a standardized format for docstrings, you should invest time in starting one. Standards help protect against laziness and mistakes by providing a correct and easy habit.

Because it’s the standard used by every function and it’s parsed for use on the docs page, everybody is accustomed to updating it. You could even deploy a linter that warns you if you’ve changed a function without updating its description.
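A full linter of that kind needs access to your version control history, but a much simpler check already catches a common kind of drift: parameters that exist in the signature and are never mentioned in the docstring. The sketch below is one rough way to do it with the standard inspect module. The helper undocumented_params and the include_deleted parameter are invented for this example.

```python
import inspect


def undocumented_params(func):
    """Return parameter names that the docstring never mentions.

    A name counts as documented if it appears as "name (" or "name:".
    """
    doc = inspect.getdoc(func) or ""
    missing = []
    for name in inspect.signature(func).parameters:
        if ("%s (" % name) not in doc and ("%s:" % name) not in doc:
            missing.append(name)
    return missing


def user_history(days=90, include_deleted=False):
    """Fetch user events.

    Args:
      days (int): the number of days to fetch for
    """
    return []


print(undocumented_params(user_history))  # ['include_deleted']
```

The string matching here is deliberately crude. It only shows that a standardized docstring format makes this kind of automated check possible at all.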

Public comments do not address the implementation of your function. People only calling the function don’t need to know about that. That’s where internal comments come in.

Internal comments

In theory good code should speak for itself. Variables and functions should be clearly named and control flow should be obvious. But there will always be places where the code can’t speak for itself because of something very subtle or very complex.

Internal comments are written inline with the code and won’t be parsed to show in public docs. They’re intended for somebody intimate enough with the code that they’re actually going to go through and read it line by line. Well, let’s be honest; they’ll probably skip a lot of lines.

Here’s an example of an internal comment:

# DO NOT CHANGE THIS TO A SET!!! IT WILL BREAK!!1!!
number_list = [number for number in xrange(10)]
do_something(number_list)

This comment is pretty useful — it says something that’s very hard to glean from the code, and if other coders read it then it could prevent bugs.

But as we know, many coders won’t read it because they’ll be skimming or they’ll make a mistake. This comment would be even more useful as an assert:

number_list = [number for number in xrange(10)]
assert type(number_list) is list
do_something(number_list)

Now we have the same information, but even if the coder is lazy and skims the comments, they still can’t cause a bug. Their testing will fail. (Yes, there might not be adequate testing. That’s a subject for another time.)

Still, we didn’t explain why, and it’s really not obvious from reading the code why we have to have a list there. Somebody might come in and try to clean up the code by removing this seemingly superfluous assert. Here you can put in a useful comment to explain what’s going on.

number_list = [number for number in xrange(10)]
# As of version 1.2.1, do_something has a bug that causes it to 
# crash if non-list collections are used for certain inputs.
assert type(number_list) is list
do_something(number_list)

Now, we have a precise internal comment that

tells the reader something useful that they can’t easily glean from the code
concerns the internals of the function that callers of this function don’t care about
explains why it’s there and why you shouldn’t remove it

This by no means covers all places you should use comments, but the general rules hold. Your code should be concise and explicit, and when it can’t easily speak for itself you can write a comment that tells the reader something useful. Often people won’t bother to read comments, so you shouldn’t rely on them too heavily.

Compassion without trust

The truth is that reading and understanding somebody else’s code is hard. Given enough people with enough code and enough time, mistakes and carelessness are guaranteed to happen regularly. Trust no one to be mistake-free.

When producing code for others you should keep these traits of your fellow coders (and yourself) in mind and defend against them as consistently as possible. By being thoughtful and concise and using well-understood standards, and by generally being aware of our own faults as we write code, we can begin to guard against our own mistakes. In future articles I’ll discuss how this applies to other aspects of software engineering including testing and global state.

Article 1: Trust No One: An Introduction to Large Codebases for New Engineers

Article 2: A Beginner’s Guide to Automated Testing

Article 3: Quick Tips for Gitting on a Team