Ansible at Grofers (Part 3) — Managing EC2 Instances

Credits: https://www.ansible.com/blog/ansible-ec2-tags

This post is part 3 of the series on how we use Ansible at Grofers to manage our infrastructure. This post explains the issues we faced before we started managing our infrastructure with Ansible, the steps we took to solve all those issues and the infrastructure state we are in after the change.

Background

Until two years ago, our infrastructure management was in poor shape. This was especially true of EC2 instances (probably the most used resource in any AWS setup): the state of an instance depended on who had launched it and set it up. No security hardening was ever done, and almost all the instances were public. Launching instances was painful, and managing them caused us a lot of problems.

Here are a few of the problems we often faced:

Launching EC2 instances was a privilege reserved for the few team members who had access to the AWS console. Everyone else had to personally ask them to launch a new instance. Setting up new infrastructure was painful and required a lot of planning, often days. This defeated the entire purpose of running our infrastructure on the cloud.
Almost everybody had root access to our production instances. We used to SSH using the PEM key file generated through the EC2 dashboard, which was shared among everyone who needed to deploy code on a production instance.
No audit logging was possible since the same credentials were used by everyone. Also, there was no easy way to rotate those credentials.
There were no automated deployments. Developers used to SSH into the instances, pull their code and restart the services to get the latest code deployed.
A lot of infrastructure is common to every service. For example, log centralization, metrics collection and monitoring are all essential for developers to manage their services in production. Further, depending on business requirements, a few generic tools may be needed on every instance (like email infrastructure for sending internal emails for reporting or monitoring). Basic security hardening is also critical and needs to be done on each instance. Setting all of this up for every new instance is cumbersome. And while AWS makes this a lot simpler by letting you create AMIs and use them as base images for every new host, some configuration always changes from cluster to cluster: for example, cluster-wise access control, or setting the FQDN so that metrics and logs can be identified at host and cluster level.
There were no conventions at all. Lack of conventions left no scope for automation of managing AWS resources.

Considering all the above issues, we began charting out a plan for building a stable infrastructure that is easier to manage, cost optimized, accessible to everyone but still secure, and that leverages the power of the cloud.

Grouping those steps at a high level, we targeted the following, in this order:

Stable consistent state across all instances
User Access Management
Restricting EC2 Console Access and Resource Management

Naming Conventions

Before tackling the issues above, we wanted to standardise our EC2 instance naming conventions. At that time we had more than 10 internal teams. To achieve sane configuration management, automation, autoscaling (seamless scaling required standardised names that Ansible’s dynamic inventory could use for grouping hosts), budget allocation and cost monitoring, and proper user access and resource management, we needed a standard naming pattern that carried enough relevant information about the resource to drive configuration.

Considering our product and team structure, we came up with this pattern:

<environment>-<product>-<service>-<serial-number>

Where:

environment is one of production (prod), staging (stage) or testing (test).
product stands for the internal product org at Grofers. Example: consumer, lastmile, cms, warehouse, etc.
service stands for the service in a product setup. For example: api, web, redis, db, kafka, broker, etc.
serial-number is the sequence number for an instance in the cluster of the service, starting from 1 and increasing by 1 per new instance.

For example: prod-consumer-api-1, stage-lastmile-mqtt-1, test-warehouse-dashboard-1, etc.
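A naming pattern like this is straightforward to validate and parse mechanically. The following is a minimal sketch, not our actual tooling: the regular expression, the `InstanceName` type and the `parse_instance_name` helper are illustrative names, and the sketch assumes that product and service names contain only lowercase letters and digits (no hyphens).

```python
import re
from typing import NamedTuple

# Environments are the three listed above; the character classes for
# product and service are an assumption made for this sketch.
NAME_PATTERN = re.compile(
    r"^(?P<environment>prod|stage|test)-"
    r"(?P<product>[a-z0-9]+)-"
    r"(?P<service>[a-z0-9]+)-"
    r"(?P<serial>[1-9][0-9]*)$"
)


class InstanceName(NamedTuple):
    environment: str
    product: str
    service: str
    serial: int


def parse_instance_name(name: str) -> InstanceName:
    """Split an instance name into its four components or raise ValueError."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(
            f"{name!r} does not follow "
            "<environment>-<product>-<service>-<serial-number>"
        )
    return InstanceName(
        match.group("environment"),
        match.group("product"),
        match.group("service"),
        int(match.group("serial")),
    )


if __name__ == "__main__":
    for example in ("prod-consumer-api-1", "stage-lastmile-mqtt-1"):
        print(parse_instance_name(example))
```

Rejecting names that do not match, rather than guessing at their parts, keeps every later step (grouping, access, cost reports) working from the same four fields.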

This standardization let us use the dynamic inventory feature in Ansible and was our first step towards proper configuration management on the cloud. The dynamic inventory uses the instance name to create additional groups that can be targeted with group_vars. The following additional groups are created:

<environment>
<product>
<product>_<service>

With this capability, we could define clusters at different levels and layer configuration from generic to specific. For example, all instances in the stage environment should allow every user to SSH, while all instances in the prod environment should allow all senior engineers to SSH. All developers working on the consumer product should have SSH access to all instances running consumer product services (following the pattern prod-consumer-<service>-XX).
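To make the grouping concrete, here is a rough sketch of how instance names could be turned into those three groups. It is not the inventory script we run: `groups_for_instance` and `build_inventory` are hypothetical helpers, and the sketch assumes names have already been validated against the convention. The output follows the JSON layout Ansible expects from an inventory script (group names mapping to a `hosts` list, plus a `_meta` entry).

```python
import json
from typing import Dict, Iterable, List


def groups_for_instance(name: str) -> List[str]:
    """Return the <environment>, <product> and <product>_<service> groups."""
    environment, product, service, _serial = name.split("-")
    return [environment, product, f"{product}_{service}"]


def build_inventory(names: Iterable[str]) -> Dict[str, dict]:
    """Build an inventory dict in the shape of Ansible's script inventory JSON."""
    inventory: Dict[str, dict] = {"_meta": {"hostvars": {}}}
    for name in names:
        for group in groups_for_instance(name):
            inventory.setdefault(group, {"hosts": []})["hosts"].append(name)
    return inventory


if __name__ == "__main__":
    names = ["prod-consumer-api-1", "prod-consumer-api-2", "stage-lastmile-mqtt-1"]
    print(json.dumps(build_inventory(names), indent=2))
```

With groups like `prod`, `consumer` and `consumer_api` in the inventory, files of the same names under group_vars can carry configuration for each level.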

Also, this naming convention and some additional tags on EC2 instances helped us with our cost allocation and management on AWS as well. We could generate cost reports at environment, product and product/service cluster level.

EC2 Instance State Management

To ensure a consistent state across all instances and to set basic expectations for all developers, we started preparing AMIs baked with a common set of tools (curl, wget, telnet, vim, rsyslog, consul, collectd, etc.). This made sure that developers don’t have to spend time setting up basic infrastructure. Tools like rsyslog, consul, collectd tied back into our central infrastructure for maintaining production services and were configured on instance startup.

All developers would use these base AMIs to set up their services and to prepare service-specific AMIs for use with auto-scaling groups. But we soon realized that this did not scale: even a small fix in our common tools, or any change in configuration, required re-baking not only the base AMI but also every service-specific AMI built on top of it. So we wrote a common set of roles and playbooks that run on every EC2 instance boot to ensure the state is up to date. The dynamic inventory script needs access to the AWS EC2 API, and therefore credentials. To avoid managing AWS credentials on EC2 instances altogether, we created an IAM role that is attached when the instance is launched. This IAM role has very specific EC2 read privileges, enough to fetch only the information the dynamic inventory script needs.
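As an illustration of what "very specific EC2 read privileges" might look like, the sketch below renders a small read-only IAM policy document. The exact actions in our role are not listed in this post, so the two `ec2:Describe*` actions here are an assumption about what a name-and-tag based inventory script would typically need; `INVENTORY_READ_POLICY` and `render_policy` are illustrative names.

```python
import json

# Assumed minimal policy: read instance metadata and tags, nothing else.
# The actual actions granted by the role described above may differ.
INVENTORY_READ_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:DescribeInstances", "ec2:DescribeTags"],
            "Resource": "*",
        }
    ],
}


def render_policy(policy: dict) -> str:
    """Serialize the policy document as JSON, ready to attach to a role."""
    return json.dumps(policy, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_policy(INVENTORY_READ_POLICY))
```

Because the role is attached to the instance at launch, the inventory script can pick up temporary credentials from the instance itself and no access keys need to be stored on disk.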

The problem of re-baking AMIs on configuration change was not limited to the base state of the instance. It also affected the services deployed on these instances when they were auto-scaled. Auto-scaling launch configurations use a particular version of the service’s AMI, but when the service’s configuration changed, we wanted a newly launched instance to bring the service to the latest desired state. The solution of running Ansible playbooks on instance boot evolved into more sophisticated tooling that lets developers overcome the problem of their service-specific AMIs going out of sync with the desired configuration state. More on this in another blog post.

This setup helped us start new instances from an internal base AMI and bring them to the latest expected state as committed in our Ansible code base.

User Access Management

Our vision was to let everybody be able to launch instances in any environment for fulfilling their requirements. But while everybody should be able to launch instances, how do we give access to them in a secure way?

While one person launches an instance, there is always more than one person (a team) who needs to SSH into it and work on it. So when an instance is launched, more than one person needs SSH access to it. How do we get there? For security reasons, you cannot give everyone access to every instance. So we had to give a specific team access to every new instance, depending on the name of the instance.

To do this, we decided to first have every developer share their public key with us. Having every developer’s unique key, we could restrict access at user level and also monitor individual user activity for auditing purposes.

We wrote an Ansible role which would create a set of users and add their public keys on remote instances which would allow passwordless SSH access. This role expects a variable which is a list of users who should be given access on an instance. We started using this role for giving access to specific developers on the hosts relevant to their work.

Next was giving a group of developers access to certain instances. Using group_vars, we started targeting groups of instances to give one or more developers access to them. Whenever a developer launches an instance, the developer (or their team, depending on the configuration in group_vars) gets access to the newly launched instance. Which groups Ansible picks depends on the name tag of the instance, as described in the Naming Conventions section.
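The effect of this group-based access can be sketched as follows. This only illustrates the idea and is not how the role is implemented: the `GROUP_USERS` mapping and the user names are made up, and the sketch takes the union of users across the environment, product and product/service groups. Ansible itself, by default, lets a more specific group's value replace a less specific one rather than merging lists, so a real setup has to arrange for the combination explicitly (for example with one variable per level).

```python
from typing import Dict, List

# Hypothetical group_vars content: group name -> users who get SSH access.
GROUP_USERS: Dict[str, List[str]] = {
    "stage": ["alice", "bob", "carol"],
    "consumer": ["alice"],
    "consumer_api": ["bob"],
}


def users_for_instance(name: str, group_users: Dict[str, List[str]]) -> List[str]:
    """Collect users from the generic groups first, then the specific ones."""
    environment, product, service, _serial = name.split("-")
    users: List[str] = []
    for group in (environment, product, f"{product}_{service}"):
        for user in group_users.get(group, []):
            if user not in users:  # keep first occurrence, drop duplicates
                users.append(user)
    return users


if __name__ == "__main__":
    for instance in ("prod-consumer-api-1", "stage-lastmile-mqtt-1"):
        print(instance, users_for_instance(instance, GROUP_USERS))
```

The resulting list is the kind of variable the user management role expects: the users to create on the instance, each with their own public key.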

We included this user management role in our common set of tasks that runs on every instance boot up as mentioned above. Coupled with our naming convention and dynamic inventory, this became the biggest enabler in letting developers launch new instances by themselves without any intervention.

Managing EC2 Console Access and Resources

So far we had done everything to let our developers launch new EC2 infrastructure as easily as possible. We didn’t want to create human bottlenecks for launching new infrastructure. With all the prep done so far, developers could launch instances of their choice and get access to them in a secure way. But a minimum requirement for all this to work is that new EC2 instances are configured properly in the EC2 dashboard at launch time, and unfortunately the dashboard offers no way to set validation rules for proper configuration. For example, one of the most important things to get right for our setup to work is the name of the instance, which must follow the defined naming convention. But the EC2 dashboard doesn’t provide any way to define validation rules for instance names.

Similarly, even though we were managing and restricting SSH access to our EC2 instances using Ansible as described in the previous section, developers could still launch new instances with PEM keys from the dashboard and bypass access management. Not that they would do this on purpose, but it’s hard to remember every option exactly as it must be set for this setup to work.

To address this problem, we created another Ansible role to launch EC2 instances using the ansible-playbook command itself. Developers run a playbook that prompts them for information such as the environment in which the instance should be launched, the name of the product for which the instance is being launched, the name of the service (or a unique cluster name), the instance type (t2.small, c4.large, etc.), the EBS volume size, the number of instances to launch, etc. Based on this input, the role launches instances with a proper EC2 configuration and a name that follows the naming convention, so that all the automation works to bring the instance to the latest correct base state and give developers access.

The tasks we have written in this role and the provisions we have made helped us manage the process of launching instances in a number of ways:
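Deriving instance names from the prompted input is the part that makes the convention self-enforcing. A minimal sketch of that step is shown below, assuming the names of existing instances are available as a list. `next_instance_names` is a hypothetical helper, not part of our role, and continuing from the highest serial number in use (rather than filling gaps) is an assumption of this sketch.

```python
from typing import Iterable, List

ENVIRONMENTS = {"prod", "stage", "test"}


def next_instance_names(
    environment: str,
    product: str,
    service: str,
    count: int,
    existing_names: Iterable[str],
) -> List[str]:
    """Return `count` new names that continue the cluster's serial numbers."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if count < 1:
        raise ValueError("count must be at least 1")
    prefix = f"{environment}-{product}-{service}-"
    # Serial numbers already used by instances of this exact cluster.
    used = [
        int(name[len(prefix):])
        for name in existing_names
        if name.startswith(prefix) and name[len(prefix):].isdigit()
    ]
    start = max(used, default=0) + 1
    return [f"{prefix}{serial}" for serial in range(start, start + count)]


if __name__ == "__main__":
    existing = ["prod-consumer-api-1", "prod-consumer-api-2", "prod-consumer-web-1"]
    print(next_instance_names("prod", "consumer", "api", 2, existing))
```

Since the user only supplies the environment, product and service, there is no free-form name field left to get wrong.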

Enforcing naming conventions on the basis of input.
Launching new instances in proper VPC and subnet based on the info provided (this post explains how we have structured our VPCs).
Restricting users to launching instances only for a specific set of products/teams.
Adding the required permissions to the instance so it can update its state based on the tag and other EC2 info.
Balancing the number of instances of a specific service of a product across EC2 availability zones (unless the user has asked for a specific availability zone in the launch prompt), based on the existing instance count per availability zone for that service.
Making instance type recommendations based on the service name and the type of instance being launched. For example, if a developer is trying to launch an instance for elasticsearch, we recommend a suitable instance type during the launch process itself by pointing them to documentation that contains recommendations for a number of common use cases (like databases, stateless web services, celery clusters, etc.).
Pushing CLI usage for all EC2 related tasks rather than giving console access. This helped us enforce conventions and apply some useful restrictions in a way that was not possible on the EC2 dashboard, which in turn helps us better manage resources and keep our infrastructure secure.
Another big win for us was pushing the use of Ansible as the one tool for managing infrastructure and getting it adopted in all our teams. Since developers needed to launch instances and started using Ansible on a daily basis, they started adopting Ansible for other infrastructure related tasks as well.
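The availability zone balancing mentioned in the list above can be sketched as a simple greedy assignment: each new instance goes to the zone that currently has the fewest instances of that service. This illustrates the idea rather than the code in our role. `assign_zones` and the zone names are placeholders, and breaking ties by the order of the `zones` list is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def assign_zones(
    zones: Sequence[str],
    existing_zones: Iterable[str],
    count: int,
) -> List[str]:
    """Pick a zone for each of `count` new instances, least-loaded zone first.

    `existing_zones` holds one entry per existing instance of the service,
    naming the zone that instance runs in.
    """
    if not zones:
        raise ValueError("at least one availability zone is required")
    counts = Counter({zone: 0 for zone in zones})
    counts.update(zone for zone in existing_zones if zone in counts)
    assignments: List[str] = []
    for _ in range(count):
        # min() returns the first zone with the lowest count, so ties
        # are broken by the order of `zones`.
        zone = min(zones, key=lambda z: counts[z])
        counts[zone] += 1
        assignments.append(zone)
    return assignments


if __name__ == "__main__":
    zones = ["zone-a", "zone-b", "zone-c"]
    running = ["zone-a", "zone-a", "zone-b"]
    print(assign_zones(zones, running, 3))
```

When the user asks for a specific availability zone in the launch prompt, this step is simply skipped and the requested zone is used for every new instance.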

What next?

Although we have almost reached the state we had in mind before automating instance management, there is no final state for a secure, stable and efficient infrastructure. We continuously strive to make our infrastructure management more seamless and scalable, to ensure an amazing developer experience along with sanity and security.

This setup has served us pretty well over the past 2 years. It has enabled us to move fast by allowing developers to provision resources, while still maintaining enough control to keep things sane. However, a lot has happened over the past 2 years. Containers have become mainstream. We stand to benefit a lot from containerization as we work on micro-services architecture. So we are actively working on containerization of our systems. More on that in a future post!
