SourceHut outage post-mortem

source link: https://lwn.net/Articles/958794/

[Posted January 19, 2024 by daroc]

SourceHut has published a post-mortem of its outage earlier this month. The post-mortem covers the causes of the outage and what steps SourceHut took to mitigate it, ending by saying:

As unfortunate as these events were, we welcome opportunities to stress-test our emergency procedures; we found them to be compatible with our objectives for the alpha and we learned a lot of ways to improve our reliability further for the future. We are going to continue working on our post-incident tasks, building up our infrastructure’s resilience, reliability, and scalability as planned. Once we address the high-priority tasks, though, our first order of business in the immediate future will be to get some rest.

SourceHut outage post-mortem

Posted Jan 19, 2024 20:51 UTC (Fri) by kleptog (subscriber, #1183) [Link]

One particularly amusing mis-step occurred when we configured the NAT through OVH: we naively NAT’d all traffic through to AMS [...], which made our brand new OVH account look like the source of an outgoing DDoS, with predictable consequences that took some work to resolve with OVH.

The use of "amusing" here is the mark of someone who has seen the trenches and lived to tell the tale.

While they keep apologising for their infrastructure not having been fully redundant, the fact they came out of this with all their data and business intact puts them way beyond the vast majority of businesses. Well done.

SourceHut outage post-mortem

Posted Jan 20, 2024 16:10 UTC (Sat) by rsidd (subscriber, #2582) [Link]

I agree, they recovered from a catastrophic situation (caused by malice not natural disaster) with almost everything intact. This should build faith in them for future customers. Also, this post gives valuable lessons on redundancy management for other organizations. Thanks Drew et al for that (I know of Drew and Simon because of sway/wlroots which I use).

SourceHut outage post-mortem

Posted Jan 23, 2024 9:45 UTC (Tue) by intgr (subscriber, #39733) [Link]

> recovered from a catastrophic situation (caused by malice not natural disaster) with almost everything intact.

The what?... This was a DDoS attack, not a missile destroying their datacenter.

Their service was null routed, so the servers got no traffic and were sitting idle during the whole event.

Why *wouldn't* things be intact once they mitigate inbound traffic? Granted, incompetence has no bounds, but it would be a serious ops achievement to get data loss from DDoS.

SourceHut outage post-mortem

Posted Jan 23, 2024 12:12 UTC (Tue) by ddevault (subscriber, #99589) [Link]

Ordinarily a DDoS is not a catastrophic problem, but our primary datacenter provider really dropped the ball here. We were not just null routed for a while; we were *permanently* null routed, and communication from our provider on the matter was next to none. I still can't reach our original subnet there, and we had to abandon the site entirely (and all 10 servers running there, now almost all shut down). We used to be very happy customers with this datacenter, but after two back-to-back acquisitions the quality of service dropped off a cliff and we didn't realize how vulnerable we were in their hands. We're terminating our contract, demanding a refund, and packing our shit up and shipping it somewhere else.

The DDoS wasn't really a big deal but our provider's response to the DDoS made the incident less like a random DDoS attack and more like a datacenter bombing in terms of consequences for our team.

SourceHut outage post-mortem

Posted Jan 23, 2024 22:41 UTC (Tue) by ms-tg (subscriber, #89231) [Link]

While I appreciate the grace and gentle politeness, would it be more helpful to everyone to share who the service provider was?

SourceHut outage post-mortem

Posted Jan 24, 2024 9:14 UTC (Wed) by ddevault (subscriber, #99589) [Link]

Maybe later. Right now we're still working with them to finish our contract there, and they're making some concessions and providing an opportunity to discuss our grievances. Out of respect for that, I'm not going to name them publicly yet.

SourceHut outage post-mortem

Posted Jan 21, 2024 4:04 UTC (Sun) by Paf (subscriber, #91811) [Link]

Makes me wish I had business to send their way - very nicely written up

SourceHut outage post-mortem

Posted Jan 21, 2024 4:20 UTC (Sun) by lutchann (subscriber, #8872) [Link]

It's always refreshing to see an honest self-evaluation from a service provider. I myself am in no position to say "this is the best that could be done" versus "a competent team would have handled things better", but either way, I applaud the transparency.
