Be Nice and Write Stable Code (Stop rearchitecting your code!)

Be Nice And Write Stable Code

Jul 4 2018

Stop rearchitecting your code! The professional developer values stability over "code purity." Instead of pursuing a Shangra-La vision of code perfection with each and every release, just be nice and write stable APIs . In this post, I talk about taking practical steps toward writing code that remains stable over time.

The Non-Goal

It is completely unhelpful to begin with a suggestion like this:

When you write code, make it future-proof.

If history teaches us anything, it's that we are lousy predictors of what the future holds. Of course it is a good practice to strive toward clean APIs, flexible design, and thoughtful defaults. But inevitably, your users and emerging requirements will surprise you.

What I want to talk about is how to best deal with those surprises, and how to avoid introducing surprises (and frustration) to those who use your code.

Versions and SemVer

In software, we use version numbers to signal that something has changed. Version numbering schemes go from dead simple (integers that increment with every release, or date stamps) to surprisingly complex ( 1.0~pre3+dfsg-0.1+b2 , 2.1.1+git20160721~8efc468-2 , and 1.2.0+LibO5.2.7-1+deb9u4 are a few versions spotted in the wild).

But when it comes to software version numbers, the current leader in version numbering schemes is SemVer (or Semantic Versioning). Don't be fooled, though! Many people claim to know how SemVer works, but have never read the specification . Since this is a critical piece of what we are about to talk about, here is a summary of the spec:

Version numbers take the form X.Y.Z , sometimes augmented with additional pre-release and build information: X.Y.Z-AAA#BBB . And each of those fields means something well defined and specific .

X is the major number. Changes in this indicate breaking changes to the API (and/or behavior).
Y is the minor number. Changes to this number indicate that new features were added, but that no APIs are broken as a result.
Z is the patch version. Changes to this indicate that internal changes were made, but that no changes (even compatible changes) were made to the API.

These three are the important ones for us. Again, I suggest taking 15 minutes to read the entire spec.

Countless projects use a format that looks like SemVer, but many of them ignore the semantics behind the version number. Often, it seems that version numbers are incremented by "gut feel" instead of any consistent semantic: "This feels like a minor version update."

The intention of this post is to explain how to write software in a way that actually adheres to semantic versioning. Get rid of "gut feel" version numbers and give your users some peace of mind.

But Why?

Why bother using a semantic versioning scheme? What's wrong with just updating numbers arbitrarily? The reason is simple: Version numbers help your users understand something about the nature of the changes they can expect. If you don't follow a pattern, they are left guessing. And this frustrates people.

Following SemVer introduces rigor on two fronts:

It sends clear signals to users about the depth of changes they can expect in a release.
It sends a clear signal to your developers about what is, and what is not, allowed when it comes to changing the code.

I cannot understate the importance of (2). SemVer helps us impose self-discipline, which in turn minimizes both internal and external disruption.

Patterns of Change

With the SemVer discussion out of the way, we can now talk about the actual patterns of change.

Remember, the usability of code is a focal point of the professional software developer. Predictable patterns of change are a boon to usability.

Reorganizing, Refactoring, and Renaming

There is no clearer way to state the point than this: If you reorganize the package structure of your public API, or if you do a major renaming, or if you choose to change the methods/structs/classes/etc of your public API, you must increment the major version number .

That's it. There is no grey area here. Such changes mean that anyone who's using your code will experience breakage.

When it comes to working on minor updates, dealing with this means exercising discipline. Yes, the package structure might be poor. Yes, the code might be ugly. But you must wait until the right moment to fix that.

Of course, it's okay to make internal changes that don't touch any public API items. So minor internal-only refactoring can be done in minor, and even patch, releases (though we don't recommend doing it in patch releases).

Note:Stop trying to justify your refactoring with the "public but internal" argument. If the language spec says it's public, it's public. Your intentions have nothing to do with it.

So in effect, the following are not be be changed except during major updates:

Package structure
Public class, struct, enum, trait, interface, etc. names, nor the names of any of the items on these
Constants or public variable names or values
Function/method names
Function/method signatures for existing functions except where the change is additive and the added argument is optional. Return value types and exceptions must also not change.

The bottom line: Refactoring, renaming, and reorganizing is a sweet temptation. But this is a temptation that must be resisted when doing a minor/patch releases. Part of being a professional software developer is creatively coping with imperfect code.

But, you might be saying, how do you add new features without changing any of these? That is the subject of the next few sections.

Introducing New Features

Minor versions may introduce new features, but features must be introduced without breaking existing APIs.

Features are additive in nature: They bring new things, but do not modify or delete existing things. To that end, these are safe as part of a feature release:

Adding a field or method to a struct/class/enum/etc.
Adding a new struct/class/enum/etc. or adding new variables, constants, functions, packages, etc.
Adding new configuration options (but see the next section)
Making something that was non-public into something that is public (e.g. exposing a private API as a public API)

However, there are a few changes that are sometimes done under the guise of a feature, but which are breaking changes that must be avoided:

Changing values of constants, variables, etc.
Changing a function or method signature (e.g. adding more params or changing the return type)
- There is an exception here if a language supports adding optional parameters in a way that will still make old calls to the function to work exactly the same.
Changing an item from public-scoped to non-public (hiding an API)

Modifying by Adding Alternatives

Consider the case where you begin with code like this:

func ListItems(query Query) Items {
  // Code to fetch and list the items
}

The code above might, for example, do a database query and fetch all of the results.

Now a feature request rolls in: "We need to add paging to the list functions." The temptation is to do this:

func ListItems(query Query, limit int, offset int) Items {
  // ...
}

But that is an API breaking change. The correct way to handle this is to introduce a new function:

func ListItems(query Query) Items {
  ListItemsWithLimit(query, 0, 0)
}

func ListItemsWithLimit(query Query, limit int, offset int) Items {
  // ...
}

Note that we have adjusted the internals on the old to replicate the exact behavior of the old function, but by calling into the new function.

It is fine to mark functions as deprecated when you do this. In fact, this is why languages like Java cleverly added built-in support for deprecation. Professional-grade development involves a strategy of deprecation with eventual removal, even if that removal is years down the road.

Importantly, if the newly introduced function cannot replicate the old feature set, you are obligated (unless security concerns dictate otherwise) to provide the old API's functionality to the greatest extent possible.

Beware The Dubious Work-Around

There is an accepted pattern that works well for many things, but which some developers employ to "get around" the SemVer constraints on modifying function signatures:

func ListItems(query Query, options Map) Items {
   // ...
}

In this pattern, the options map is an arbitrary set of key/value options. This pattern itself is fine unless the default behavior changes when a new option is introduced. When adding a new option, the professional developer ensures that when that option is not present, the code behaves the same as it did in the last release.

Deprecating

We touched on deprecation above. But I want to summarize the deprecation strategy:

Mark a thing as deprecated as soon as it is considered deprecated , even if that is a patch or minor release. Deprecation, after all, is a warning condition, not an error condition.
Do not change the behavior of the deprecated thing during minor or patch releases
Remove deprecated things only at major version changes. Until that time, you're still on the hook for supporting them.

Deprecation is a signal that in the future a thing will be removed. But it is not an excuse to change, delete, or ignore the functionality of that bit of code outside of the SemVer constraints.

Errors and Exceptions

One of the most frustrating outages I ever experienced occurred because of a seemingly innocuous change to an upstream library: During a minor release update, the library changed the exception type that it threw on a particular error.

One of the functions in the library threw an IOException whenever a network error occurred, and other exceptions for other problems.

We used the throwing of an IOException to kick off our retry logic. Given that network failures were a frequent occurrence for our particular conditions, this was an important feature.

But during a minor version change, the developers decided to simplify the API by catching all of the different exceptions (including the IOException ) and wrap them in a single generic exception. (Incidentally, the API itself did not change because it was something like func Read(in Reader) error , where error was a parent of all exceptions).

When we upgraded, all our tests passed (because our test fixtures emulated the old behavior and our network was not unstable enough to trigger bad conditions), and production rolled out just fine. But our customers began complaining that the product was much less reliable. Why? Because the retry logic was never triggered. So our app suddenly was as unstable as the network it was on.

The bottom line: Even error handling is part of your public API.

Resisting Subtle Changes

Sometimes subtle but ill-planned changes can cause major breakages for your users.

Here's a short story of how a trivial change in one of our dependencies led to a series of production catastrophes for our users:

We depended on a library that provided a client/server RPC-like protocol. This library had long been marked stable, and indeed stability is one of the touted features of this library. But the developers introduced a very subtle change that appeared to follow the stability requirements, but which actually introduced a serious compatibility flaw. The change went something like this:

The library allowed us to set a maximum message size. The default was 256k, but we wanted it to be significantly larger. So we set this option:

config.MaxMesageSize = 1024

This made it possible for client and server to send each other messages up to 1M. But at some point, a minor release of the package made a very subtle, but killer, change: They split upstream and downstream message sizes into two variables. And here's the killer: They did this by merely creating a second variable ( config.MaxInboundMessageSize ) and changing the behavior of the first to impact only outbound message size.

When we upgraded the package, all seemed to work well. Code compiled. Tests passed. Early users saw no problems. Then we shipped our new version with this updated dependency. And suddenly angry users started filing issues. Stuff that worked yesterday was broken today.

Why? Because behind the scenes, the inbound message size had dropped from 1M to it's default 256k. And while nothing in our early testing sent messages larger than 256k, there were plenty of production instances that did.

The upstream library maintainers had introduced a serious bug into our code by silently changing the behavior of their code, even though they didn't (in a pedantic sense) "break" SemVer.

What should they have done?

I would argue that breaking the size limits into an inbound and an outbound is a totally legitimate thing to do during a minor release. It was just done wrong.

The right way to address a configuration change like this is to introduce two new variables and then add default support for the old one.

Thus, in practice, it would look like this:

type Config struct {
   // Old one
   MaxMessageSize int
   // New settings
   MaxInboundMessageSize int
   MaxOutboundMessageSize int
}

And then the internal code in the library would handle the legacy case while still offering the new functionality:

// Support clients who set the old config
if config.MaxMessageSize > 0 {
    config.MaxInboundMessageSize = config.MaxMessageSize
    config.MaxOutboundMessageSize = config.MaxMessageSize
}

With something like this in the base library, any tools that use it would get the old behavior if they used the old configuration param, but could opt into using the newer options instead.

Bugs and Security Fixes: How To Handle Real Life

Guidance that you should never change certain things is all well and good until reality comes a-callin'. But what if that public constant or variable introduces a security issue, or causes the server to crash?

When the real world comes crashing in, we make exceptions. But professional software developers make them wisely and carefully.

The important concept here is the minimally invasive change . That is, when patching bugs or security releases, we may need to change the API, but we should do it by changing the absolute minimum number of things we can get away with. And we do that even if it means sacrificing our "architectural purity" .

I will plead guilty for introducing a global variable as a stop-gap to fix the internals of a function without changing the function signature. It was ugly. I was ashamed. But it ensured backward compatibility, and that was the important thing. If I had changed the function call, thousands of users would have had to change their code. But with the ugly global, only the few who needed to tune that particular parameter were impacted (and only by having the option to set something they could not previously control).

But a security issue or major bug is a legitimate reason to change things like default values or even larger macro behaviors. If the change is big enough, you're still obligated to change the major version of your code. SemVer doesn't give a free pass on that, and failing to do so still undermines user confidence.

But for less intrusive changes, I personally feel like you can make some minor SemVer transgressions provided:

You make this very clear in your release notes.

"The value of MaxBufferSize was adjusted downward to 2048 because we discovered a buffer overflow in a lower level library for any larger buffer size. See issue #4144"
The code is clearly commented:

// MaxBufferSize sets the maximum size of the network buffer.
// Prior to version 2.5.1, this was 4096. Due to a security flaw
// reported in #4144 that resulted in a buffer overflow, we
// lowered this to 2048.
MaxBufferSize = 2048

Conclusion

The professional software developer has long-term usability and stability as a goal. Yes, well-architected code is important. But there is a time and place for making that your focus. And maintenance releases (minor and patch versions) are not an occasion to refactor, re-organize, or make sweeping modifications.

Be conscientious about how much effort the users of your code put into using your code. I can tell you from experience what we do when the maintenance burden you impose on us gets wearying: We stop using your tools (or we fork them).

SemVer is a communications tool. But to use it well, we must use it accurately. And that means writing code focused on stability.