How We Sustain DNS Outages at Grofers

 2 years ago
source link: https://lambda.grofers.com/how-we-sustain-dns-outages-at-grofers-800731e5280c

A comprehensive blog by our security team explaining our in-house solution to deal with DNS outages

Design by Asif Jamal

Cloudflare is one of the most popular DNS and CDN service providers, currently used by over 16 million internet sites. Every day, these sites rely on Cloudflare’s services for performance enhancement, DDoS mitigation, and more.

We do too.

So when Cloudflare suffered multiple outages, it affected websites around the globe. And Grofers was no exception.

The first outage happened on 24th June, when the Cloudflare proxy went down. The second outage happened on 2nd July, and this time the WAF was down for about half an hour. As a result, websites around the globe returned 502 Bad Gateway errors.

This downtime was a good lesson for developers:

Relying on third-party services can lead to issues that are outside your control. If you haven’t planned for them, you can’t just hope that everything will work.

First of all, we have been very happy with Cloudflare’s product and services. While these outages impacted us and many others, we don’t plan to move away from Cloudflare at all.

Adding Cloudflare to our infrastructure quickly solved a lot of problems that we were not capable of solving ourselves.

Outages happen. That’s not ideal, but it is a reality you have to live with. This outage was critical because Cloudflare handles one of the most crucial tasks for us: serving all of our traffic.

It was a single point of failure. Being an e-commerce company serving millions of customers, we cannot live with such outages.

So we went to our whiteboards and started articulating the problem scenarios. We had to tackle three major service breakdowns:

  1. DNS: if the DNS goes down, our services would continue to work, but end users would not be able to discover them
  2. Cloudflare proxy: provides performance optimization and CDN capabilities. Where we use this feature, Cloudflare becomes the frontend server for all of our traffic. If the Cloudflare proxy goes down, no request gets proxied to our backend servers
  3. Web Application Firewall (WAF): an outage at the proxy level means that the WAF wouldn’t work either

What was desired?

  1. High availability of DNS: this could be achieved by replicating our records to other DNS providers as well.
  2. Failover for Cloudflare (CF) proxy: proxy downtime could be mitigated by disabling the proxy (the orange cloud on the CF dashboard) whenever there is a proxy-level outage. That would remove Cloudflare’s WAF and performance optimizations while the proxy is down or disabled, but would at least keep our applications running. We can afford suboptimal performance and no WAF capabilities during the outage period.

Figure: The Cloudflare Cloud for proxying requests

In short, we had to find a solution that would ensure continuity during both the Cloudflare DNS and Cloudflare proxy outages.
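The proxy failover described above comes down to flipping the `proxied` flag on each DNS record. As a rough sketch, assuming the Cloudflare v4 API's DNS record PATCH endpoint, the request could be prepared as follows. The helper name `build_proxy_toggle_request` and the zone ID, record ID and token values are hypothetical placeholders, and nothing is sent over the network:

```python
import json

CLOUDFLARE_API = "https://api.cloudflare.com/client/v4"


def build_proxy_toggle_request(zone_id, record_id, proxied, api_token):
    """Describe (but do not send) a request that turns the orange cloud on or off."""
    return {
        "method": "PATCH",
        "url": f"{CLOUDFLARE_API}/zones/{zone_id}/dns_records/{record_id}",
        "headers": {
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        # proxied=False makes Cloudflare answer with the record's real value
        # instead of the address of its proxy.
        "body": json.dumps({"proxied": proxied}),
    }


# Hypothetical identifiers, for illustration only.
request = build_proxy_toggle_request(
    "ZONE_ID", "RECORD_ID", proxied=False, api_token="TOKEN"
)
print(request["method"], request["url"])
print(request["body"])
```

Keeping the request as plain data makes it easy to review or log before handing it to whichever HTTP client you prefer.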

For high availability of DNS, we had to set up more than one DNS provider, i.e. use AWS Route53 alongside our existing Cloudflare DNS setup.

We started looking for an all-in-one alternative. As our infrastructure runs entirely on AWS, the first service that came to mind was AWS Route53.

In the AWS ecosystem, we had the following options:

  • AWS Route53 provides the DNS service
  • AWS CloudFront can be used as the CDN
  • AWS WAF can be used as the WAF

We stumbled upon an article by the StackExchange team explaining how they approached the DNS high availability problem. Although their solution integrates with various cloud providers, it has no way to handle the Cloudflare proxy setting, and hence it wouldn’t have helped us in the case of a proxy outage.

Choosing Nameservers

We wanted to use Cloudflare’s DNS service to serve maximum traffic because of Cloudflare’s performance and security benefits.

We decided to have 2 Cloudflare NS records and 1 AWS Route53 NS record so that the majority of requests go to Cloudflare.

The good folks at StackExchange did the hard work that guided this decision for us. Read their post for the details of DNS behaviour, particularly these two sections:

  • What is the performance impact for our users if one of the providers is offline?
  • What is the best number of nameservers to be using?

Handling All The Cases

Case 1: Cloudflare DNS up, Proxy up

As mentioned earlier, we wanted to use Cloudflare as our primary DNS and also be able to leverage Cloudflare’s performance and security benefits.

To do so, we made sure that even the DNS entries on Route53 resolve to Cloudflare’s proxy.

Cloudflare and AWS together make this possible. When Cloudflare is the DNS provider and you want www.example.com to be served via the Cloudflare proxy, you enable the orange cloud in the Cloudflare management dashboard, which lets Cloudflare resolve www.example.com to its proxy. When Route53 is the DNS provider, you instead need to set up a CNAME record on Route53 with the following value: www.example.com.cdn.cloudflare.net.

This would ensure that even when serving requests from Route53, we are able to leverage Cloudflare proxy’s performance and security features.

Please note: this does not apply to the apex (root level) domain, since a CNAME record cannot be placed at the apex and Route53 supports CNAME flattening only for AWS resources.
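To make Case 1 concrete, here is a minimal sketch that builds the Route53 record for a proxied subdomain and rejects the apex domain. The function names and the 60-second TTL are our own assumptions for illustration; the dictionary follows the ChangeBatch shape accepted by Route53's ChangeResourceRecordSets API, and the code only builds it without calling AWS:

```python
def cloudflare_proxy_cname(hostname, apex):
    """Return the Cloudflare proxy CNAME target for a subdomain of `apex`."""
    hostname = hostname.rstrip(".")
    apex = apex.rstrip(".")
    if hostname == apex:
        # A CNAME cannot live at the apex, so this trick does not apply there.
        raise ValueError("a CNAME cannot be used at the apex domain")
    if not hostname.endswith("." + apex):
        raise ValueError(f"{hostname} is not a subdomain of {apex}")
    return f"{hostname}.cdn.cloudflare.net"


def route53_cname_change(hostname, apex, ttl=60):
    """Build a ChangeBatch that points `hostname` at the Cloudflare proxy."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": hostname,
                "Type": "CNAME",
                "TTL": ttl,  # assumed value; pick one that suits your failover needs
                "ResourceRecords": [
                    {"Value": cloudflare_proxy_cname(hostname, apex)}
                ],
            },
        }]
    }


batch = route53_cname_change("www.example.com", "example.com")
print(batch["Changes"][0]["ResourceRecordSet"]["ResourceRecords"][0]["Value"])
# prints: www.example.com.cdn.cloudflare.net
```

With boto3, a batch like this could be passed as the `ChangeBatch` argument of `change_resource_record_sets`, together with your hosted zone ID.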

Case 2: Cloudflare DNS up, Proxy down

In this case, Cloudflare DNS is up but the proxy is down, which means Cloudflare’s CDN and WAF services won’t be functional, so we need to use the AWS CDN and WAF in their place.

Just like in the case above, one can set up a record of the form www.example.com.cdn.cloudfront.net (this is CloudFront, not Cloudflare). To use the AWS CDN, i.e. CloudFront, and AWS WAF, the subdomains in Cloudflare must be changed to the following format: <subdomain.domain.com>.cdn.cloudfront.net.

Please note: be careful to choose TTLs low enough that DNS changes propagate as fast as you need. This setup will not keep you up all the time. The only benefit you get is that you are not completely handicapped during an outage.

Case 3: Cloudflare DNS down

In case of a Cloudflare DNS outage, the proxy servers might or might not be reachable via the Route53 route, since the DNS infrastructure for the canonical hostnames of the subdomains is different from the primary DNS infrastructure. We confirmed this by checking the nameservers for the domains used in these cases.

We chose not to handle these two cases (proxy reachable or not reachable via Route53) separately in order to reduce complexity, although a solution handling both might be more optimized.

When Cloudflare DNS is down, we change all the DNS entries in Route53 that point to the Cloudflare proxy (as described in Case 1) from www.example.com.cdn.cloudflare.net to their corresponding actual values (actual A records, actual CNAME records pointing to ELBs) or to the corresponding CloudFront CNAME records.
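The record swap for Case 3 can be sketched as a pure function over a list of records. Everything named here is hypothetical: the `ORIGINS` mapping, the origin hostnames, and the example IP address (taken from a documentation range) stand in for whatever your real backend values are.

```python
CLOUDFLARE_SUFFIX = ".cdn.cloudflare.net"

# Hypothetical origin values: what each hostname resolves to without Cloudflare.
ORIGINS = {
    "www.example.com": ("CNAME", "origin-elb.example.net"),
    "api.example.com": ("A", "203.0.113.10"),
}


def failover_records(route53_records, origins):
    """Swap records that point at the Cloudflare proxy for their origin values."""
    updated = []
    for name, rtype, value in route53_records:
        if rtype == "CNAME" and value.rstrip(".").endswith(CLOUDFLARE_SUFFIX):
            # Raises KeyError if no origin is known, rather than guessing one.
            rtype, value = origins[name]
        updated.append((name, rtype, value))
    return updated


current = [
    ("www.example.com", "CNAME", "www.example.com.cdn.cloudflare.net"),
    ("api.example.com", "CNAME", "api.example.com.cdn.cloudflare.net"),
    ("static.example.com", "CNAME", "static-origin.example.net"),
]
for record in failover_records(current, ORIGINS):
    print(record)
```

Records that do not point at the Cloudflare proxy pass through untouched, so the same function can be run over the whole zone.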

Case 4: AWS DNS is down

DNS clients retry against other nameservers in case the nameserver they tried is not available. This is where adding nameservers for both Cloudflare and Route53 comes in handy. A client retries a request after it has failed a certain number of times and then moves on, in round-robin fashion, to the other NS records.

All the DNS requests will continue to be served by Cloudflare.

Since we set 2 Cloudflare NS records and 1 AWS NS record, around 33% of DNS requests will face a slight delay. This performance degradation is acceptable to us compared with a full downtime.
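The 33% figure follows from simple arithmetic if we assume that resolvers pick their first nameserver uniformly at random among the NS records. That is a simplification, since real resolvers often favour the servers that respond fastest. A small sketch, with a function name of our own choosing:

```python
from collections import Counter
from fractions import Fraction


def first_attempt_share(ns_providers):
    """Share of first attempts that land on each provider, assuming uniform choice."""
    counts = Counter(ns_providers)
    total = len(ns_providers)
    return {provider: Fraction(count, total) for provider, count in counts.items()}


shares = first_attempt_share(["cloudflare", "cloudflare", "route53"])
# Requests that first try the unavailable provider pay the retry delay.
print(f"route53: {float(shares['route53']):.0%}")        # prints: route53: 33%
print(f"cloudflare: {float(shares['cloudflare']):.0%}")  # prints: cloudflare: 67%
```

The same calculation shows why a Cloudflare DNS outage is the costlier case under this assumption: about two thirds of first attempts would need a retry.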

Summary of all the cases

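This is what all the cases look like:

  • Case 1 (Cloudflare DNS up, proxy up): records on Route53 are CNAMEs of the form www.example.com.cdn.cloudflare.net, so traffic goes through the Cloudflare proxy whichever provider answers.
  • Case 2 (Cloudflare DNS up, proxy down): records are switched to CloudFront so that the AWS CDN and AWS WAF take over.
  • Case 3 (Cloudflare DNS down): Route53 records pointing to the Cloudflare proxy are changed to their actual values or to the corresponding CloudFront CNAME records.
  • Case 4 (AWS DNS down): no change is needed; Cloudflare serves all DNS requests, with a slight delay for requests that first try the Route53 nameserver.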