On handling outages: a case study

If you happen to use Basecamp 3 to manage your projects, you might have noticed a huge outage on November 8th, 2018; it lasted almost 5 hours.

The issue was that they had not used bigint for the primary keys of their tables, so they ran out of IDs. The TL;DR solution, in the words of David Heinemeier Hansson (DHH), creator of Ruby on Rails and founder and CTO at Basecamp:

We took half of our replicas offline, did the 3h migration, put them back online, will now be converting the other half of the fleet.
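
To put "ran out of IDs" in numbers, here is a quick back-of-the-envelope sketch in Python. It assumes the affected keys were standard signed 32-bit integer columns (the tweets only say that bigint was not used), and the insert rate is a made-up figure for illustration, not anything Basecamp published.

```python
# Illustrative arithmetic only. The 32-bit assumption and the insert rate
# are hypothetical; they are not taken from Basecamp's own reports.
INT_MAX = 2**31 - 1      # largest value of a signed 32-bit integer key
BIGINT_MAX = 2**63 - 1   # largest value of a signed 64-bit (bigint) key

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_until_exhausted(max_id: int, rows_per_second: float) -> float:
    """Years until an auto-incrementing key that starts at 1 reaches max_id."""
    return max_id / (rows_per_second * SECONDS_PER_YEAR)


if __name__ == "__main__":
    rate = 100  # hypothetical sustained inserts per second on one table
    print(f"int max:    {INT_MAX:,}")
    print(f"bigint max: {BIGINT_MAX:,}")
    print(f"int lasts    {years_until_exhausted(INT_MAX, rate):.2f} years at {rate} rows/s")
    print(f"bigint lasts {years_until_exhausted(BIGINT_MAX, rate):.2e} years at {rate} rows/s")
```

The exact rate does not matter much. What matters is the gap in scale: a busy table can burn through about 2.1 billion IDs in a few years or less, while a bigint key is out of reach for any realistic workload.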

And I'm not writing this to expose and/or throw **** at them.

I'm writing this to applaud their communication and openness about the whole outage.

I'm writing this to show how over-communication, honesty, humility and clarity DO make a difference, especially in difficult situations.

To give you some context, the first notice on their Twitter account about something going wrong was at 5:40 AM on November 8th:

Basecamp (@basecamp)

Basecamp 3 is having trouble right now. Sorry about that! We're working on a fix and will keep you updated as we go.

13:40 - 08 Nov 2018

Between that tweet and the moment everything was working again, there were 15 more tweets with constant updates! The last one came at 10:47 AM on November 8th, signed by DHH himself:

Basecamp (@basecamp)

Basecamp 3 is back up at the moment. We had to switch to a backup set of caching servers, and they're holding up at the moment. It's obviously been touch and go, so not out of the woods yet. Pains us to ask for even more patience on such a trying day. So sorry :cry: ^DHH

18:47 - 08 Nov 2018

All that information is a huge deal. You know they are working really hard to get everything up and running, and you might also know that outages can get really messy.

Despite all the chaos that was probably happening, they kept posting updates with specific details of the cause and the fix in progress on their Twitter account, their status page and their blog. Not only that, DHH was also posting more technical details about the outage, to the point of linking to the pull request that could have saved everything:

DHH (@dhh)

I'm not often ashamed of our work at @basecamp. But today is one such day. To be stuck in read-only mode for hours due to a failure to use bigint for our primary keys on every table is embarrassing. It's been the default in Rails since 5.1 :see_no_evil: github.com/rails/rails/pu…

16:10 - 08 Nov 2018

I find all this incredibly valuable and reassuring. Even though it was a really long outage, they handled each and every customer interaction gracefully. I could not get upset with them when so much information about the problem and the solution was being provided.

Hell, that morning I was even more productive because Basecamp remained read-only; I could check what was pending on my side and just get to it with no distractions.

I've been part of, and the cause of, outages at my company, and it's really stressful. And we don't even handle the amount of traffic Basecamp 3 does.

So, as DHH puts it, this is a reminder to stay humble. We could be the next ones involved in a situation like this. We all make mistakes, that's inevitable, but knowing how to communicate them properly is what matters in the long run.

Hope you enjoyed this short ramble :heart:
