
On handling outages: a case study

 5 years ago
source link: https://www.tuicool.com/articles/hit/ZBZv22Y

If you happen to use Basecamp 3 to manage your projects, you might have noticed a huge outage on November 8th, 2018; it lasted almost 5 hours.

The issue was that they failed to use bigint for the primary keys of their tables, so they ran out of IDs. The TL;DR solution, taken from David Heinemeier Hansson (DHH), creator of Ruby on Rails and founder and CTO at Basecamp:

We took half of our replicas offline, did the 3h migration, put them back online, will now be converting the other half of the fleet.
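
To get a feel for how a table "runs out of IDs", here is a minimal sketch in Python. It assumes the affected keys were signed 32-bit integers (the usual non-bigint integer type); the helper names, the sample last ID and the insert rate are all hypothetical and are not figures from Basecamp.

```python
# Illustrative only: these are the standard ranges of signed 32-bit and
# 64-bit integers, not numbers reported by Basecamp.
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer key can hold
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit (bigint) key can hold


def remaining_ids(last_id: int, max_id: int = INT32_MAX) -> int:
    """Return how many more auto-increment IDs fit before max_id is reached."""
    if last_id < 0:
        raise ValueError("last_id must be non-negative")
    return max(max_id - last_id, 0)


def days_until_exhaustion(last_id: int, inserts_per_day: int,
                          max_id: int = INT32_MAX) -> float:
    """Estimate the days of headroom left at a constant, hypothetical insert rate."""
    if inserts_per_day <= 0:
        raise ValueError("inserts_per_day must be positive")
    return remaining_ids(last_id, max_id) / inserts_per_day


if __name__ == "__main__":
    print(f"32-bit limit: {INT32_MAX:,}")
    print(f"64-bit limit: {INT64_MAX:,}")
    # Hypothetical table whose newest row already has ID 2,147,000,000.
    print(remaining_ids(2_147_000_000))
    print(days_until_exhaustion(2_147_000_000, inserts_per_day=1_000_000))
```

Running a check like this against your own busiest tables from time to time is one cheap way to spot the problem before inserts start failing.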

And I'm not writing this to expose and/or throw **** at them.

I'm writing this to applaud their communication and openness about the whole outage.

I'm writing this to show how over-communication, honesty, humility and clarity DO make a difference, especially in difficult situations.

To give you some context, the first notice on their Twitter account about something going wrong was at 5:40 AM on November 8th:

Basecamp (@basecamp)

Basecamp 3 is having trouble right now. Sorry about that! We're working on a fix and will keep you updated as we go.

13:40 - 08 Nov 2018

Between that tweet and the moment everything was working again, there were 15 more tweets with constant updates! The last one went out at 10:47 AM on November 8th, signed by DHH himself:

Basecamp (@basecamp)

Basecamp 3 is back up at the moment. We had to switch to a backup set of caching servers, and they're holding up at the moment. It's obviously been touch and go, so not out of the woods yet. Pains us to ask for even more patience on such a trying day. So sorry 😢 ^DHH

18:47 - 08 Nov 2018

All that information is a huge deal. You know they are working really hard to get everything up and running, and you might also know that outages can get really messy.

Despite all the chaos that was probably happening, they kept posting updates with specific details of the cause and of the fix in progress on their Twitter account, their status page and their blog. On top of that, DHH was posting more technical details about the outage, to the point of linking to the pull request that could have saved everything:

DHH (@dhh)

I'm not often ashamed of our work at @basecamp. But today is one such day. To be stuck in read-only mode for hours due to a failure to use bigint for our primary keys on every table is embarrassing. It's been the default in Rails since 5.1 🙈 github.com/rails/rails/pu…

16:10 - 08 Nov 2018

I find all this incredibly valuable and reassuring. Even though it was a really long outage, they handled each and every customer interaction gracefully. I could not get upset with them when so much information about the problem and its solution was being provided.

Hell, that morning I was even more productive because Basecamp remained read-only; I could check what was pending on my side and just get to it with no distractions.

I've been part of, and the cause of, outages at my company, and it's really stressful. And we don't even handle the amount of traffic Basecamp 3 does.

So, as DHH puts it, this is a reminder to stay humble. We could be the next ones involved in a situation like this. We all make mistakes, that's inevitable, but knowing how to communicate them properly is what matters in the long run.

Hope you have enjoyed this short ramble ❤️

