Your migrations are bad, and you should feel bad

Robert Lechte / @djrobstep on Twitter and GitHub

This talk is about making changes to the structure of your database: schema changes, schema migrations, whatever you want to call them.

The title of this talk is just a Futurama reference; I don't want anybody to actually feel bad. Rather, I hope that by the end of this talk you feel better about schema migrations than you did before.

Let's begin with an inspirational quote.

Linus on data:

I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful...

Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

So this is Linus Torvalds, talking about the importance of structuring your data in the right way. I'm not a huge fan of the mean-spirited way Linus Torvalds sometimes says things, but I'm absolutely a huge fan of this quote and what he's saying here.

Bad database → bad backend → bad frontend

I probably don't need to say this at a database conference, but getting the structure of our database right is really important. It isn't just an issue for DBAs and backend developers. What users of software want on the frontend is reliability and fast, responsive performance, and that's impossible without a well-structured and well-functioning backend.

Schema changes are hard

Working with database schemas always seems a little more difficult than it should be. It always seems like a chore to make changes to your tables, and even more of a chore to roll them out. Even if it's just changing a column or two, it's time-consuming and error-prone.

NoSQL

TFW you make a whole new category of database because you hate making schema changes

I have a pet theory that the main reason for the sudden rise in popularity of NoSQL databases like Mongo a few years back was frustration at schema management. NoSQL databases don't really enforce much structure onto your data: You could start writing code to save your data without having to deal with schema changes. And that advantage was enough to get thousands of developers to use it!

Misguided maybe (after all, Mongo sucks and you should never use it) but a totally understandable reaction.

Schemas are actually good

Enforcing consistency and structure is good

But in most cases, having a strongly enforced schema is really important and a really good thing. Fundamentally, a schema is a guarantee that your data conforms to the structure you intend. And being able to know that and trust that is really good, and really important. It seems foolish to just throw that away.

Solution

Keep schemas, make better tools for working with schemas

But the tools aren't great, so how can we make them better?

Schema changes aren't conceptually complex. It's just an extra table here, one less column there, whatever. So why are schema changes so hard to work with?

The problematic status quo

Let's take a look at the way the tools work now, so we can see what needs fixing.

Rails migrations, Django migrations, Alembic migrations, Go migrations, etc etc

You might have worked with one of these before, or something similar in another language. What's common to almost all of them is that every time you make a schema change, you've got to create a new migration file, and each new version of your schema gets a version number, used for tracking what version your database is up to.

Problem 1

A history of mistakes

Over time you build up a long chain of these files, and after a while you have a folder that looks something like this.

After a while:

  • 01_make_a_table.py
  • 02_make_another_table.py
  • 03_add_a_column.py
  • 04_rename_that_column.py
  • bla bla bla
  • 97_drop_that_old_table.py

#realtalk

Nobody cares about the old files

The trouble is that most of these files are totally irrelevant. It's very unlikely you'll want to go back to a version of an app and database from three months ago. I'm usually pretty embarrassed about the code I wrote three months ago, and never want to see it again.

History is already in version control.

Why keep it hanging around?

We have established tools for managing code history. Why keep old migration files hanging around cluttering up the working copy of your code?

Problem 2

A file every time you make a change

The overall assumption these tools make is that migrations are a Big Deal and a fairly rare operation.

A heavyweight process for crafting migration files

This is totally against the ethos of modern-day iterative development, particularly if you're in the early stages of a project and doing rapid prototyping. It doesn't let you experiment or try out different configurations, and it makes it harder to change your mind.

Not easy, not convenient, not intuitive, not agile.

Unnecessary version number management.

None of this feels very easy, or intuitive. The version numbers in particular are really annoying, and just another thing you need to keep track of and worry about.

Problem 3

Testability

Another problem is that these migrations aren't very testable. There are many things we want to be sure of when we make changes.

So many questions

  • Does our development schema match live?
  • Does our test schema match live?
  • Will our migration run on staging and live without breaking?
  • How long will our migration take to run on the production data set?
  • Will our migration cause downtime between deploy and migration?

Usual answer

We don't know

It's possible to test and check these things, but it's hard and not included out of the box.

Ain't nobody got time for that

In practice, the proper checks tend to be skipped. People just deploy and hope for the best.

Problem 4

Framework lock-in

Object-relational mappers: Maybe good, maybe bad, but shouldn't be compulsory.

The next issue is that existing migration tools are usually built to work with one particular object-relational mapper library. So if you want to use that migration functionality, you're coupled to that particular ORM.

Conscious uncoupling

Some people are fans of ORMs, some aren't. But it shouldn't matter. Good migration tools should work with any setup, ORM or not, or a mix. The right tools should be available no matter how you're otherwise interacting with the database.

Problem 5

All-or-nothing design assumption

The tools all assume you're using them to manage the entire lifetime of your database, and that you're managing the entire database with them. Say you want to do a one-off schema change with Django's migration tools, but you haven't built that database with Django from the start. You're going to have to spend time retrofitting Django models onto that database before you can take advantage of Django's migration features.

Problem 6

Database as primitive table store

Also, they all assume you're using the database as nothing more than a primitive table store and not using any of the really powerful features PostgreSQL has.

Party like it's SQL99?

SQL has come a long way since the 90s, but we're not taking advantage of the modern features.

PostgreSQL has more than just tables

Views! Functions! Upserts! Hypercubes! Extensions!

When you want to use these features, you're on your own. Even basic SQL views aren't handled by most of these tools.

I'll stop complaining now

Let's rethink this

It's all very well to complain, but how could we actually solve these problems and improve the situation? Let's look at some ideas.

Before: The status quo

Schema 1 → Schema 2 → Schema 3 → Schema 4 → Schema 5 → Schema 6 → Schema 7 → Schema 8 → Schema 9 → Schema 10 → Schema 11 → ... → Schema 591 → Schema 592 → Schema 593 → Schema 594

What can we do about the messy chain of schema versions we talked about before? Instead of this enormous sequence of versioned schemas, we can simplify the problem to just a few.

After: Simplified

Empty schema → Dev schema → Live schema

These are the only three states we really need to care about. The current live schema is obviously pretty important.

So is the development schema, which every so often becomes our live schema.

And so is the empty schema: We need to be able to set up an empty database with the development schema so we can work with and test the database and the application.

And that's it.

How to get rid of migration files and version numbers?

What about all these migration files and version numbers? How could we replace these with something better? It turns out there's something more straightforward and fundamental we can use to define database changes.

A familiar concept: Diffs

You've almost certainly heard of diffs before. Diffs are a ubiquitous concept in software development. Diffs show changes between versions of text, code, maps, and lots of other things.

Text diffs look like this:

+ the new text
- the old text

Plain old text diffs look like this. But changes to database schemas can also be represented as diffs.

Database schema diffs look like this:

alter table "author" add column "date_of_birth" date;

A database diff is a list of the SQL statements you need to run to get from one database state to another.

For this diff, we have one database where the author table has a date of birth column, and another where it doesn't. The diff is the statement required to change the database without the column so that it matches the structure of the one with it.

A basic equation:

old database + diff = new database

Fundamentally it works the same as any other diff: Old plus diff equals new. Apply the diff to the current database and you've got the new database.

migra

A diff tool for your database schemas

So what I thought would be really good would be a diff tool for PostgreSQL. In fact I wanted one so badly that I wrote one in Python.

Command line tool, Python library

$ pip install migra

It's a command-line tool and also a Python library. The way it works is best illustrated with a simple example.

Basic example

First create two databases...

$ psql
# create database a; create database b;
# \c b
# create table t(id serial primary key, name varchar);

To demonstrate, let's start with two databases, "a" and "b", and create a table in "b".

Basic example

Compare "a" to "b"


$ migra postgresql:///a postgresql:///b
create sequence "public"."t_id_seq";

create table "public"."t" (
    "id" integer not null default nextval('t_id_seq'::regclass),
    "name" character varying
);

CREATE UNIQUE INDEX t_pkey ON t USING btree (id);

alter table "public"."t" add constraint "t_pkey" PRIMARY KEY using index "t_pkey";

Migra outputs the SQL needed to make the schema of database "a" match database "b". In this case it's equivalent to the create table statement from before, but decomposed, with the primary key constraint, sequence, and index created separately.
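The same diff is available from Python as well. Here's a minimal sketch using migra's Migration class together with sqlbag for the database connections, comparing the same two databases:

from migra import Migration
from sqlbag import S

with S('postgresql:///a') as current, S('postgresql:///b') as goal:
    m = Migration(current, goal)
    m.set_safety(False)   # allow drop statements to appear in the output
    m.add_all_changes()
    print(m.sql)          # the same SQL the command-line tool prints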

Migration script example

First decide on a goal schema

Goal schema can be created by ORM models, setup scripts, whatever

Now we move on to a more comprehensive example. Here we'll generate a real migration script for a live database.

Migration script example

Goal setup script (again into database "b")

create table author(id serial primary key, name varchar, dob date);
create table book(id serial primary key, title varchar, page_count integer, author_id integer references author(id));

Migration script example: dump your live schema

pg_dump --no-owner --no-privileges --schema-only prod -f prod.schema.dump.sql

Migration script example: load the dumped schema into a scratch database (here, "a"), compare current to target, and generate a migration file

Result:

$ migra postgresql:///a postgresql:///b
create sequence "public"."book_id_seq";

create table "public"."book" (
    "id" integer not null default nextval('book_id_seq'::regclass),
    "title" character varying,
    "page_count" integer,
    "author_id" integer
);

alter table "public"."author" alter column "dob" set data type date;

CREATE UNIQUE INDEX book_pkey ON book USING btree (id);

alter table "public"."book" add constraint "book_pkey" PRIMARY KEY using index "book_pkey";

alter table "public"."book" add constraint "book_author_id_fkey" FOREIGN KEY (author_id) REFERENCES author(id);

Check it matches

$ psql a -1 -f migration_script.sql
$ migra postgresql:///a postgresql:///b
(EMPTY OUTPUT)

Migration script example: review the autogenerated migration file, edit as necessary, apply

$ psql prod -1 -f migration_script.sql

And that's the migration done. Because we've compared directly to the production schema, there's no need for version numbers, and no need to build up an enormous cluttered chain of migration files: Just a single script to review.
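The "check it matches" step is also easy to automate as a pre-deploy check. A small sketch, assuming the migration script has already been applied to a copy of production (database "a", as above):

from migra import Migration
from sqlbag import S

# Fail loudly if the migrated copy still differs from the goal schema.
with S('postgresql:///a') as current, S('postgresql:///b') as goal:
    m = Migration(current, goal)
    m.add_all_changes()
    assert not m.statements, 'Schemas still differ:\n' + '\n'.join(m.statements)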

Autosync: Changing your schema during development

Another thing you might want to do is make changes to your database models, but keep your local development database in sync. And it turns out we can script this up fairly easily.

Autosync example: script it up

# Assumes sqlbag for connections and temporary databases, and migra for diffing;
# DB_URL, set_up_goal_database, and prompt are defined elsewhere in your project.
from sqlbag import S, temporary_database as temporary_db
from migra import Migration

def sync():
    with temporary_db() as TEMP_DB_URL:
        # build the goal schema in a scratch database
        with S(TEMP_DB_URL) as s_goal:
            set_up_goal_database(s_goal)

        # diff the real development database against the goal
        with S(DB_URL) as s_current, S(TEMP_DB_URL) as s_goal:
            m = Migration(s_current, s_goal)
            m.set_safety(False)  # permit drop statements in the generated SQL
            m.add_all_changes()

            if m.statements:
                print('CHANGES ARE PENDING:\n\n {}'.format(m.sql))
                if prompt('Apply these changes?'):
                    m.apply()
            else:
                print('Already synced.')
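DB_URL, set_up_goal_database, and prompt aren't part of migra; they stand in for your own configuration and helpers. A hypothetical wiring might look like this:

# Illustrative only: point DB_URL at your local development database, and have
# set_up_goal_database build the goal schema however you like
# (a schema.sql file, ORM models, etc.).
DB_URL = 'postgresql:///myapp_dev'

def prompt(question):
    # simple yes/no confirmation before applying anything
    return input(question + ' [y/N] ').strip().lower() == 'y'

if __name__ == '__main__':
    sync()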

No files, no version numbers: One operation

Instead of creating a migration file with your changes, you just run this sync command, confirm, and your database is in sync.

Fast enough to keep you in your flow

That means you can keep working and keep focussed on the feature you're actually working on, rather than being distracted by version numbers.

Testing: Before and after!

I was complaining before about testing. If you want to carry out a migration with no errors and no downtime, you really need to know that your app works with both the before and after versions of the database. You can write a test for this pretty easily.

Testing example

# Assumes sqlbag's temporary_database; load_pre_migration and load_post_migration
# are project-specific helpers that build each schema state in a given database.
from pathlib import Path

import pytest
from sqlbag import temporary_database

# By convention, any migration awaiting deployment lives in pending.sql.
pending_sql = Path('pending.sql').read_text().strip()

if pending_sql:
    # A migration is pending: test against both the current (pre-migration)
    # and the goal (post-migration) schema.
    DATABASE_SETUPS_TO_TEST = [
        load_pre_migration,
        load_post_migration
    ]
else:
    DATABASE_SETUPS_TO_TEST = [
        load_post_migration
    ]

@pytest.fixture(params=DATABASE_SETUPS_TO_TEST)
def db(request):
    # Each test gets its own throwaway database, built by one of the setup
    # methods above, and receives its URL.
    with temporary_database() as test_db_url:
        setup_method = request.param
        setup_method(test_db_url)
        yield test_db_url

Here's how that might work in practical terms. If we have a pending migration, we want to run every test twice: once before and once after. The details will depend on your setup, but here's how you might do it in a typical Python project using PostgreSQL and pytest.

If we use the convention that any pending migration code is stored in pending.sql, then we can check that file for pending changes. If there are changes, the fixture is parametrized with two different setups, pre- and post-migration. The fixture creates a temporary database for each test, so each test runs against an isolated database: once with the old schema, and once with the migrated schema. Once both sets of tests pass, we know we can deploy our app, check that it's successful, and then run our migration, without errors in either state.
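As a concrete, hypothetical example, a test written against this fixture might look like the following; it assumes the author table from the earlier example exists in both schema states:

import psycopg2

def test_author_listing(db):
    # db is the URL of a freshly created temporary database, set up either
    # pre- or post-migration depending on the fixture parameter
    with psycopg2.connect(db) as conn, conn.cursor() as cur:
        cur.execute('select count(*) from author')
        assert cur.fetchone()[0] == 0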

Migrations we can trust

This test setup should give us a lot more confidence that we won't have any errors while migrating, and encourage us to make our apps a lot more robust.

Talk directly to PostgreSQL

ORM independence, tracking of views, functions, extensions, and other sweet PostgreSQL features.

You'll note that in all these examples we're connecting to PostgreSQL directly, working beneath the layer of any object-relational mapper. This means the setup works independently of your application layer: you could change frontends, or even have multiple frontends, and your migration code wouldn't need to change.

Thanks for listening!

