Ansible at Grofers (Part 4) — Manageable auto-scaling with Ansible

Auto-Scaling Groups (ASGs) in AWS EC2 provide an easy way to scale your services horizontally based on different kinds of alarms. An ASG launches instances from AMIs that you have baked for your services.

While ASGs are easy to understand at a high level, they come with their own challenges. A fairly difficult one is how to bring the latest code, packages and configuration into the AMI. You can automate building AMIs, but this quickly becomes challenging in a micro-services architecture where code bases can potentially be updated every hour.

How did we solve it

If you have been following our blog, you would know how much we love Ansible. Our entire engineering team uses Ansible for setting up and deploying services. So our natural inclination was towards using Ansible for simplifying our auto-scaling setup as well.

Any EC2 instance in our infrastructure goes through at least two separate runs of Ansible:

The first playbook handles the basic setup of the instance (installing common tools like the Consul agent, rsyslog, collectd, etc.).
The second playbook sets up the service built or maintained by a development team.

Having these two run on a new EC2 instance would make sure that the service is ready for use.

When a new instance is added to an ASG, we want both of these runs to happen, but slightly differently, so that the services essential for the business start as fast as possible.

Introducing Bootstrap

Bootstrap is our internal tooling that developers use to prepare easily maintainable AMIs for their services which helps them bring their services up as fast as possible in the latest desired state.

Written in Bash, it is basically a wrapper on top of Ansible that lets you configure how playbooks are executed using a simple configuration file. When an AMI is baked with the bootstrap configuration for a service, bootstrap runs Ansible on startup to bring that service up in the desired state. This way you have to rebake your AMI only when you want to change your bootstrap configuration. There is no need to rebake AMIs when code or Ansible playbooks change.

A bootstrap configuration file includes information such as the Ansible repository, branch, playbook name and a few other values required to execute the Ansible playbook. Using this information, the playbook is executed locally on the instance. We have our own Ansible module which places the file with the required values. Setting up an AMI takes two steps, listed after this example task:

- name: Add bootstrap file for consumer web
  asg_bootstrap:
    ansible_git_repo: [email protected]:grofers/ansible-repo.git
    ansible_git_branch: master
    ansible_playbook: my_playbook.yml
    ansible_roles_requirements: roles.yml
    ansible_playbook_args: "-t tag1,tag2"
    file_name: 00_setup_abc
    state: present

Create a task in your Ansible playbook that writes bootstrap’s configuration file using our custom Ansible module, asg_bootstrap. The task would look something like the one above.
Run the Ansible playbook on a new instance and create the AMI from it.
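
To make the idea concrete, here is a minimal Python sketch of how a wrapper could turn one bootstrap configuration into the commands it runs locally. Bootstrap itself is written in Bash and its on-disk configuration format is not shown in this post, so the dictionary keys (which simply mirror the asg_bootstrap arguments), the checkout directory and the function name build_commands are assumptions for illustration only.

```python
import shlex


def build_commands(config, checkout_dir="/tmp/bootstrap-checkout"):
    """Return the argv lists a bootstrap-like wrapper might run, in order."""
    commands = [
        # Fetch only the tip of the configured branch to keep startup fast.
        ["git", "clone", "--depth", "1",
         "--branch", config["ansible_git_branch"],
         config["ansible_git_repo"], checkout_dir],
    ]
    requirements = config.get("ansible_roles_requirements")
    if requirements:
        # Install the roles listed in the requirements file.
        commands.append(["ansible-galaxy", "install", "-r", requirements])
    # Run the playbook against the instance itself, without SSH.
    playbook = ["ansible-playbook", "-i", "localhost,", "--connection", "local",
                config["ansible_playbook"]]
    # Extra arguments such as "-t tag1,tag2" are split like a shell would.
    playbook += shlex.split(config.get("ansible_playbook_args", ""))
    commands.append(playbook)
    return commands
```

In such a wrapper, the last two commands would be run from inside the checkout directory, for example with subprocess.run(cmd, cwd=checkout_dir, check=True).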

How does it work

When an instance starts, there are a bunch of tasks that need to be performed like tagging of the instance, configuring rsyslog, collectd, etc. But during traffic peaks, we can’t wait for services like collectd and rsyslog agent to come up before our business critical backend services.

To solve this, bootstrap lets you set the priority of tasks, so that you can choose which critical setup must be done before the business facing service is started and which can wait until afterwards. We have a directory structure like this for the configuration files:

./
└── bootstrap.d/
    ├── prerun/
    │   ├── 00_tagging
    │   └── 01_set_hostname
    ├── 00_setup_abc
    ├── 01_setup_xyz
    └── postrun/
        └── 00_setup_common

Configuration files get their name from the file_name argument of the asg_bootstrap module, as shown in the code snippet above. Bootstrap picks up configuration files in the bootstrap.d/prerun directory first, then those at the root level of bootstrap.d/, and finally those in the bootstrap.d/postrun directory. We prefix the configuration files with numbers so that we can easily control the order of execution. The order of execution for the configuration files in the above example will be:

prerun/00_tagging: Responsible for setting the Name tag for that instance
prerun/01_set_hostname: Responsible for setting the FQDN on the basis of the Name tag
00_setup_abc: Responsible for setting up business facing critical service ABC
01_setup_xyz: Responsible for setting up business facing critical service XYZ
postrun/00_setup_common: Responsible for setting up common tooling like rsyslog, collectd, etc.
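
The ordering rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not bootstrap's actual Bash code; the function name ordered_configs is hypothetical.

```python
import os


def ordered_configs(root):
    """List bootstrap config files in execution order: prerun, root, postrun."""
    ordered = []
    for subdir in ("prerun", "", "postrun"):
        directory = os.path.join(root, subdir) if subdir else root
        if not os.path.isdir(directory):
            continue
        # Sorting by name is enough because files carry a numeric prefix.
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isfile(path):  # skips the prerun/ and postrun/ folders
                ordered.append(os.path.relpath(path, root))
    return ordered
```

Calling ordered_configs("bootstrap.d") on the layout above would yield the five files in the order listed.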

Challenges & fixes

Setting dynamic Name tags for identifying instances

A single service can have many instances, and every instance must be uniquely identifiable, meaning it needs a unique Name tag. ASGs do not support dynamic Name tags for new instances: every new instance started in an ASG always gets the same Name tag. Our 00_tagging script solves this by querying a key in Consul to get the last index number used for that service, incrementing it, writing the new index number back to Consul with a check-and-set (CAS) operation, and retagging the instance with the resulting Name tag.
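
The read, increment and check-and-set loop looks roughly like the Python sketch below. To keep it runnable without a Consul agent, it uses a small in-memory stand-in that mimics Consul's CAS behaviour (a write succeeds only if the key's modify index is unchanged, and an index of 0 means "create only if absent"). The class, the function names and the "service-index" Name tag format are all assumptions, not the actual 00_tagging script.

```python
class InMemoryKV:
    """Stand-in for Consul's KV store with the same check-and-set semantics."""

    def __init__(self):
        self._data = {}      # key -> (value, modify_index)
        self._counter = 0

    def get(self, key):
        """Return (value, modify_index), or (None, 0) if the key is absent."""
        return self._data.get(key, (None, 0))

    def put_cas(self, key, value, cas):
        """Write only if the key's modify index still equals `cas`."""
        _, current = self._data.get(key, (None, 0))
        if current != cas:
            return False
        self._counter += 1
        self._data[key] = (value, self._counter)
        return True


def claim_next_index(kv, key, max_attempts=10):
    """Atomically increment the counter at `key` and return the claimed index."""
    for _ in range(max_attempts):
        value, modify_index = kv.get(key)
        next_index = int(value or 0) + 1
        if kv.put_cas(key, str(next_index), cas=modify_index):
            return next_index
        # Another instance won the race; re-read and try again.
    raise RuntimeError("could not claim an index for %s" % key)


def name_tag(service, index):
    """Hypothetical Name tag format: service name plus the claimed index."""
    return "%s-%d" % (service, index)
```

The CAS write is what keeps two instances that boot at the same moment from claiming the same index: the slower one's write is rejected and it retries with a fresh read.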

Missing Tags

Name tags are required by our Ansible dynamic inventory script to function properly. We started seeing intermittent failures because AWS auto-scaling sometimes takes a while to tag a new instance, and the delay is not deterministic. So when Ansible runs at instance start, the Name tag may not yet be present on the instance. To fix this, if the instance doesn’t have a Name tag, we check which ASG the instance is attached to, read the Name tag directly from the ASG and forcefully apply it to the instance.
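
A minimal sketch of that fallback is shown below. It is written against boto3-style clients passed in as arguments (for example boto3.client("ec2") and boto3.client("autoscaling")), so it can be exercised with fakes; the function name ensure_name_tag and the error handling are assumptions rather than our actual script.

```python
def ensure_name_tag(instance_id, ec2, autoscaling):
    """Return the instance's Name tag, copying it from its ASG if missing."""
    found = ec2.describe_tags(Filters=[
        {"Name": "resource-id", "Values": [instance_id]},
        {"Name": "key", "Values": ["Name"]},
    ])["Tags"]
    if found:
        return found[0]["Value"]

    # The tag has not propagated yet: find the ASG this instance belongs to.
    instances = autoscaling.describe_auto_scaling_instances(
        InstanceIds=[instance_id])["AutoScalingInstances"]
    if not instances:
        raise LookupError("%s is not attached to an ASG" % instance_id)
    group = instances[0]["AutoScalingGroupName"]

    # Read the Name tag defined on the ASG itself.
    group_tags = autoscaling.describe_tags(Filters=[
        {"Name": "auto-scaling-group", "Values": [group]},
        {"Name": "key", "Values": ["Name"]},
    ])["Tags"]
    if not group_tags:
        raise LookupError("ASG %s has no Name tag" % group)
    name = group_tags[0]["Value"]

    # Apply the tag to the instance without waiting for AWS to propagate it.
    ec2.create_tags(Resources=[instance_id],
                    Tags=[{"Key": "Name", "Value": name}])
    return name
```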

AWS API Limit Exceeded

We have our own dynamic inventory script which fetches the instances from AWS and creates groups on the basis of their tags. This led to us hitting the AWS API rate limits, which caused failures. To reduce the number of AWS API calls, the bootstrap script sets some environment variables (IP address, Name tag, etc.) which Ansible uses, instead of the dynamic inventory script, to create the groups locally.
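
As a rough illustration, an inventory for just the local instance can be built from environment variables without any AWS call. The variable names BOOTSTRAP_IP_ADDRESS and BOOTSTRAP_NAME_TAG and the tag_Name_ group naming are hypothetical; only the overall JSON shape (groups with a hosts list, plus _meta.hostvars) is the standard Ansible dynamic inventory format.

```python
def build_inventory(env):
    """Build Ansible dynamic-inventory data for the local instance only."""
    ip = env["BOOTSTRAP_IP_ADDRESS"]      # hypothetical variable name
    name = env["BOOTSTRAP_NAME_TAG"]      # hypothetical variable name
    return {
        # One group named after the tag, containing just this host.
        "tag_Name_" + name.replace("-", "_"): {"hosts": [ip]},
        # _meta lets Ansible skip calling the script once per host.
        "_meta": {"hostvars": {ip: {"ansible_connection": "local"}}},
    }
```

An inventory script built on this would print json.dumps(build_inventory(os.environ)) when Ansible invokes it with --list.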

Changes in bootstrap script

Changes to bootstrap itself are a little troublesome: we have to bake a new AMI for every service with the new bootstrap script. One way to fix this would be to keep the script at a remote location, from which it is fetched and executed when the instance comes up. This problem is still unsolved for us, but we have iterated on bootstrap enough times that it is no longer a very big problem.

Conclusion

Though we faced some challenges in the beginning, this setup has helped us achieve auto-scaling without the overhead of an AMI build pipeline. It has also given developers tooling that lets them implement auto-scaling quickly and with a lot of control, improving the overall DevOps culture in our team.

If you would like to solve similar problems at scale, we are always looking for new talent. Check for open positions here.
