
How Checkly Achieves Zero Downtime Deployments With Terraform

 5 years ago
source link: https://www.tuicool.com/articles/hit/bqUJvuE

Checkly, a monitoring tool that validates the correctness of API endpoints and browser click flows, shared its experience of using Terraform to perform zero-downtime deployments of its Docker-based infrastructure on AWS.

Checkly uses "workers" to run jobs submitted by users. Each worker runs in a Docker container, and each EC2 instance hosts five such containers. Checkly's challenge was to deploy to AWS without affecting the user experience, to support multiple versions of its code concurrently, and to upgrade the worker code independently. It met these goals with Terraform's modules, rolling updates and custom remote-exec code.

Checkly uses the Puppeteer framework to automate browser actions. Puppeteer is a Node API for driving headless Chrome. Each Checkly worker is a Node process that accepts parameters and runs its tests without saving any state, which makes it easy to scale horizontally in response to request traffic. A cron job pushes user requests into an AWS SQS queue; the workers pick them up from there and push the results into another queue. A failed job never calls the SQS API to delete its message, so the message becomes visible again and is retried. A new version is deployed to AWS through a Docker-based lifecycle, followed by a rolling update built on Terraform primitives. The code passes through three environments: dev, test and production. InfoQ got in touch with Tim Nolet, Founder at Checkly, to find out more:

The unit tested code is built into a Docker container, with the build, tag and push Docker commands in the package.json as scripts. We push the container (tagged with a version and with "test") to our private Docker repository and then cycle the test EC2 instances which pull the latest "test" container using the Terraform "taint" command.
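The cycle Nolet describes can be sketched as an ordered list of commands. This is only an illustration, not Checkly's actual scripts: the image name, version number and Terraform resource address below are hypothetical, and the function only builds the command list rather than running anything.

```python
def deploy_commands(image, version, channel, instance_addresses):
    """Return the commands for one build/tag/push/taint cycle, in order.

    `channel` is the floating tag the instances pull (e.g. "test").
    All names passed in are placeholders chosen by the caller.
    """
    versioned = f"{image}:{version}"
    floating = f"{image}:{channel}"
    commands = [
        ["docker", "build", "-t", versioned, "."],
        ["docker", "tag", versioned, floating],
        ["docker", "push", versioned],
        ["docker", "push", floating],
    ]
    # Tainting marks each instance so the next apply destroys and recreates it,
    # at which point the fresh instance pulls the floating tag again.
    commands += [["terraform", "taint", address] for address in instance_addresses]
    commands.append(["terraform", "apply"])
    return commands


if __name__ == "__main__":
    # Hypothetical registry, version and resource address.
    for command in deploy_commands(
        "registry.example.com/worker", "1.4.2", "test", ["aws_instance.worker_test"]
    ):
        print(" ".join(command))
```

Newer Terraform releases recommend `terraform apply -replace=ADDRESS` over `terraform taint`, so the last two steps may collapse into one depending on the Terraform version in use.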

The "taint" command in Terraform forces a resource - EC2 instances in this case - to be destroyed and recreated. Checkly’s team lets the test instance run for a couple of days. If all goes well, the Docker image is re-tagged with "latest" and the "taint" is repeated for all production EC2 instances, which completes the rolling update. One of Checkly’s goals is to allow for multiple versions of the app to co-exist, which can require additional handling in either the code or in the data stores and message queues. For example, if the JSON format used in the SQS messages changes, both formats have to be handled for a short period of time while the old infra goes down and the new one comes up. Nolet elaborates on their approach:

As we are quite young, we have not had huge changes yet in the overall data transfer objects or messaging schemes. But I would always solve that in the code. The queuing bus, the storage and all other middleware are just not the right place for it. So if that means a bunch of extra "if" statements or case switches to handle two message types, so be it. We use Postgres as our main datastore, so the JSON fields are very welcome to handle small tweaks to the data model without too much hassle.
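A minimal sketch of what "solving it in the code" could look like while two message formats coexist on the queue. The field names and the `version` key are assumptions made up for illustration; the article does not describe Checkly's real message schema, and Checkly's workers are Node processes rather than Python.

```python
import json


def normalize_check_request(raw):
    """Parse a queue message in either format into one internal shape."""
    message = json.loads(raw)
    # Assumption: old messages carry no version field, new ones say 2.
    version = message.get("version", 1)
    if version == 1:
        # Hypothetical old format: flat fields.
        return {"check_id": message["checkId"], "url": message["url"]}
    if version == 2:
        # Hypothetical new format: nested objects.
        return {
            "check_id": message["check"]["id"],
            "url": message["target"]["url"],
        }
    raise ValueError(f"unsupported message version: {version}")


if __name__ == "__main__":
    old = '{"checkId": "abc", "url": "https://example.com"}'
    new = '{"version": 2, "check": {"id": "abc"}, "target": {"url": "https://example.com"}}'
    print(normalize_check_request(old) == normalize_check_request(new))
```

Once every producer emits the new format and the old instances are gone, the version 1 branch can be deleted.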

Terraform offers primitives like create_before_destroy and the remote-exec provisioner, both of which Checkly uses. The create_before_destroy lifecycle flag is available on all Terraform-managed resources and ensures that a replacement resource is created before the old one is removed. When Terraform creates an instance through the AWS provider, the remote-exec command keeps checking whether the Node process is running in the container and returns once it is, signalling to Terraform that the resource is ready. It uses a simple grep command to achieve this. Checkly's Terraform code is organized into modules, with one module per AWS region.
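The readiness check amounts to a poll-until-present loop. The sketch below shows that logic under stated assumptions: Checkly's version is a grep-based shell command run by remote-exec, whereas this one scans `ps` output from Python, and the timeout and interval values are arbitrary.

```python
import subprocess
import time


def list_processes():
    """Return the command line of every running process, one per line."""
    result = subprocess.run(
        ["ps", "-A", "-o", "args"], capture_output=True, text=True, check=True
    )
    return result.stdout


def wait_for_process(name, list_fn=list_processes, timeout=120.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until a process whose command line contains `name` appears.

    Returns True once it is found, or False if `timeout` seconds pass first.
    `list_fn`, `sleep` and `clock` are injectable so the loop can be tested.
    """
    deadline = clock() + timeout
    while True:
        if any(name in line for line in list_fn().splitlines()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Returning only once the process is up is what lets create_before_destroy be safe: Terraform does not consider the replacement created, and so does not remove the old instance, until the provisioner has finished.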

Terraform code can be tested with toolkits like Terratest, which can validate infrastructure managed by Terraform. However, Checkly does not use any test frameworks for this, depending instead on the fact that "the test and the production environments are identical, and any major issues will be caught in the former", says Nolet.

Checkly’s base Docker image is Ubuntu-based and includes all the packages needed to run Puppeteer and headless Chrome, among them some extra libraries and fonts. The Docker container runs a PM2 process which launches a Node process. This part of the Docker strategy is stable, and errors that might lead to a deployment rollback are usually in the actual product code, according to Nolet. Checkly uses both AWS CloudWatch and AppOptics for monitoring. CloudWatch alerts on AWS queue sizes and delays as well as basic instance health. AppOptics is more application specific, checking metrics like the number of runs in a given region in the last 10 minutes, or the run times in a given region. Checkly's status dashboard is publicly available.
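As an illustration of the application-level metric mentioned above, the number of runs per region in the last 10 minutes is a windowed count. This sketch assumes runs are available as (region, finish time) pairs; that input shape and the region names are hypothetical, and the code does not use the AppOptics API.

```python
from datetime import datetime, timedelta, timezone


def runs_in_window(runs, now, window=timedelta(minutes=10)):
    """Count runs per region that finished within `window` before `now`.

    `runs` is an iterable of (region, finished_at) pairs.
    """
    cutoff = now - window
    counts = {}
    for region, finished_at in runs:
        if cutoff < finished_at <= now:
            counts[region] = counts.get(region, 0) + 1
    return counts


if __name__ == "__main__":
    now = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
    sample = [
        ("eu-west-1", now - timedelta(minutes=2)),
        ("eu-west-1", now - timedelta(minutes=9)),
        ("us-east-1", now - timedelta(minutes=11)),  # outside the window
    ]
    print(runs_in_window(sample, now))
```

A region that drops out of the result, or whose count falls well below its usual level, is the kind of signal such a metric can be alerted on.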

