SourceHut outage post-mortem
source link: https://lwn.net/Articles/958794/
SourceHut has published a post-mortem of its outage earlier this month. The post-mortem covers the causes of the outage and what steps SourceHut took to mitigate it, ending by saying:
As unfortunate as these events were, we welcome opportunities to stress-test our emergency procedures; we found them to be compatible with our objectives for the alpha and we learned a lot of ways to improve our reliability further for the future. We are going to continue working on our post-incident tasks, building up our infrastructure’s resilience, reliability, and scalability as planned. Once we address the high-priority tasks, though, our first order of business in the immediate future will be to get some rest.
Posted Jan 19, 2024 20:51 UTC (Fri) by kleptog (subscriber, #1183)
One particularly amusing mis-step occurred when we configured the NAT through OVH: we naively NAT’d all traffic through to AMS [...], which made our brand new OVH account look like the source of an outgoing DDoS, with predictable consequences that took some work to resolve with OVH.
The use of "amusing" here is someone who has seen the trenches and lived to tell the tale.
While they keep apologising for their infrastructure not having been fully redundant, the fact they came out of this with all their data and business intact puts them way beyond the vast majority of businesses. Well done.
Posted Jan 20, 2024 16:10 UTC (Sat) by rsidd (subscriber, #2582)
Posted Jan 23, 2024 9:45 UTC (Tue) by intgr (subscriber, #39733)
The what?... This was a DDoS attack, not a missile destroying their datacenter.
Their service was null routed, so the servers got no traffic and were sitting idle during the whole event.
Why *wouldn't* things be intact once they mitigate inbound traffic? Granted, incompetence has no bounds, but it would be a serious ops achievement to get data loss from DDoS.
Posted Jan 23, 2024 12:12 UTC (Tue) by ddevault (subscriber, #99589)
The DDoS wasn't really a big deal, but our provider's response to it made the incident less like a random DDoS attack and more like a datacenter bombing in terms of consequences for our team.
Posted Jan 23, 2024 22:41 UTC (Tue) by ms-tg (subscriber, #89231)
Posted Jan 24, 2024 9:14 UTC (Wed) by ddevault (subscriber, #99589)
Posted Jan 21, 2024 4:04 UTC (Sun) by Paf (subscriber, #91811)
Posted Jan 21, 2024 4:20 UTC (Sun) by lutchann (subscriber, #8872)