
Discovering Continuous Automation With Request Mirroring

source link: https://tech.ebayinc.com/engineering/discovering-continuous-automation-with-request-mirroring/

Setting the stage

Maintaining zero P0/P1 production issues is of the utmost importance, and this speaks volumes about the effort that goes into the development and testing of each release. There are also thousands of runtime configurations that can change how the code executes. The item page carries a lot of historical context as well, covering countries across the globe and all kinds of devices and networks. Any change on the item page can create a bad user experience.

Despite these challenges, our goal is to be able to release to production every day. Given these conditions, that goal can only be achieved by covering all possible combinations of use cases with continuous automation and delivery pipelines. No single person can know every combination of use cases that must be covered to reach full automation coverage.

This requires unconventional, out-of-the-box thinking.

Identifying the evil

There are three issues to consider.

  1. Each item page release needs an extensive cycle of QA and testing to prove feature functionality and reduce potential bugs in production. Each configuration change has to go through multiple layers of approval to make sure nothing happens in production that could impact users. All of this is costly, time consuming, and slow paced, and it still does not cover every scenario from the production environment.
  2. QA cycles happen before every release with a defined set of use cases, without randomization or stepping outside a defined set of rules. In production, however, there can be use cases that fall outside this defined environment. Today, there is no way to capture these unknown production use cases before releasing a new build.
  3. Testing happens in staging environments that might not have the correct data to test all the corner cases, and the code is tested against a static dataset in static conditions, while the same code runs against a different dataset in production in dynamic conditions. As the software gets bigger, the dependency graph increases, use cases expand, and the combination of all these scenarios grows exponentially, which adds uncertainty.

Taming the bull

To be able to release every day, we need to discover and identify unknown use cases and capture them from the dynamic set of data in production. To do this, we built “Neo,” which lets us run new code against production traffic without any impact on production.

We took the request mirroring concept from networking and used it to build a continuous automation pipeline. We compare [request, response] pairs between the production environment and a mirrored environment [N vs. (N+1)] and generate hourly reports. This comparison can be done for the full lifecycle of a request, starting from the front-end pool down to any nth-level downstream pool. This tool allows us to expand the envelope of QA and testing outside the defined set of use cases.
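As a rough illustration of the unit Neo compares (the type names below are ours, not from the actual tool), each production capture is paired with its mirrored capture by a shared RequestMirrorId, and the hourly report is essentially a list of deltas:

```java
import java.util.List;

// Minimal sketch, not the actual Neo classes: a capture is one [request, response]
// pair from either the Nth (production) or (N+1)th (mirrored) environment.
record Capture(String requestMirrorId, String request, String response) {}

// One entry of the hourly report: the differences found for a single RequestMirrorId.
record MirrorDelta(String requestMirrorId,
                   List<String> requestDiffs,
                   List<String> responseDiffs) {
    boolean isClean() {
        return requestDiffs.isEmpty() && responseDiffs.isEmpty();
    }
}
```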

How it works

To make this tool work, we need a production machine with the Nth build and a test machine with the (N+1)th build.

[Diagram: request mirroring between the production (Nth build) machine and the test ((N+1)th build) machine]

  • Neo in action on Production Machine, (N)th build

    1. When a request is received by a production machine with the (N)th build of the item page, Neo intercepts the incoming request and assigns a “RequestMirrorId” to identify each request uniquely.

    2. Neo then mirrors each and every aspect of the incoming request in production and sends this mirrored data to a test machine with the (N+1)th build.

    3. Neo then stores a copy of the production request in a file (let's call this file “ProductionRequest”) in a central storage location.

    4. When the production machine is ready to return the response, Neo also creates a copy of this response and stores it in a file (let's call it “ProductionResponse”) in central storage.

  • Neo in action on Test Machine, (N+1)th build

  1. Neo intercepts the mirrored request on a test machine with the (N+1)th build and stores this request in a file (let's call it “MirroredRequest”) in central storage.

  2. When the test machine is ready to return the response, Neo creates a copy of the response and stores it in a file (let's call it “MirroredResponse”) in central storage.

  • Once the data is collected, Neo can compare these files based on “RequestMirrorId” by running different comparators, and can then generate a report of the delta and mark any use cases that are not covered in comparators or automation.

  • This comparison can be done all the way to the last dependent service in the call hierarchy of dependencies. The following diagram illustrates how each payload is stored and compared using “requestMirrorId.”

[Diagram: how each payload is stored and compared using “requestMirrorId”]

  • Neo then processes this delta against the acceptance criteria, and anything that is unacceptable or unknown is marked as a potential issue.

The following sample diagram illustrates the full approach; a rough code-level sketch of the interception steps follows it.

[Diagram: the full request mirroring pipeline]
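A minimal sketch of what the production-side interception could look like, assuming a Servlet 4.0-style filter; the MirrorDispatcher and CentralStorage helpers are hypothetical placeholders, not the actual Neo code. The test machine would run the same logic but write under the /test path instead.

```java
import java.io.IOException;
import java.util.UUID;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

// Illustrative sketch only: tag the request, forward a copy to the (N+1)th test
// machine, and store the production request/response in central storage.
// Response-body capture, async dispatch, and error handling are simplified away.
public class MirrorFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) req;

        // 1. Assign a RequestMirrorId so production and mirrored captures can be matched.
        String mirrorId = UUID.randomUUID().toString();

        // 2. Mirror the incoming request to a test machine (hypothetical helper).
        MirrorDispatcher.sendToTestMachine(httpReq, mirrorId);

        // 3. Store the production request ("ProductionRequest") in central storage.
        CentralStorage.storeRequest("/production/ItemPage/" + mirrorId, httpReq);

        chain.doFilter(req, res);

        // 4. Store a copy of the production response ("ProductionResponse").
        CentralStorage.storeResponse("/production/ItemPage/" + mirrorId, res);
    }
}
```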

More details on the pipeline

Neo has many components that required integrating different code bases and pipelines. Below are the components and the steps of the full pipeline.

  • ItemPage code base: As part of the tool, HttpRequest and HttpResponse interceptors were introduced in the ItemPage backend code to intercept requests and responses so that the data can be mirrored and copied. We can choose and filter which requests to mirror based on header, queryParams, or request path criteria. These interceptors are customizable with the properties shown in the tables below.

  • Picking machines from Production: The next step is to pick a couple of healthy production machines (Nth build) at random from a production pool, configure them for mirroring, and point them to the test machines ((N+1)th build) to send mirrored data. This step is automated to encourage randomization and to pick up healthy machines in case previously chosen machines are no longer part of a pool or are not available. The following configurations are set on a production machine to start mirroring (a rough usage sketch follows the table).

    MIRROR_ENABLED: true
    MIRROR_TARGET_HOSTNAME: "," separated target test machines
    MIRROR_TARGET_PORT: target port
    MIRROR_PERCENTAGE: % of traffic to be mirrored
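A hedged sketch of how these properties might gate mirroring on a production machine; the property names come from the table above, but reading them through System.getProperty is purely an assumption for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: decide per request whether to mirror, and where to send the copy.
// How Neo actually loads these runtime configurations is not described in the article.
public final class MirrorConfig {

    static boolean shouldMirror() {
        boolean enabled = Boolean.parseBoolean(System.getProperty("MIRROR_ENABLED", "false"));
        double percentage = Double.parseDouble(System.getProperty("MIRROR_PERCENTAGE", "0"));
        // Mirror only the configured percentage of incoming traffic.
        return enabled && ThreadLocalRandom.current().nextDouble(100.0) < percentage;
    }

    static String[] targetHosts() {
        // MIRROR_TARGET_HOSTNAME is a "," separated list of target test machines.
        return System.getProperty("MIRROR_TARGET_HOSTNAME", "").split(",");
    }

    static int targetPort() {
        return Integer.parseInt(System.getProperty("MIRROR_TARGET_PORT", "443"));
    }
}
```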

  • Connection to Central Storage: The next step in the pipeline is to change the configuration on the production and staging machines to connect to central storage and store data from the Nth and (N+1)th builds. MIRROR_RESPONSE_DISPATCH_PATH is used to connect to central storage. The production machine connects to the dispatch path with /BASE, and the test machine connects to the dispatch path with /NEW.

    MIRROR_RESPONSE_DISPATCH_PATH: API path to collect data
    MIRROR_RESPONSE_DISPATCH_ENABLED: true
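As a small illustration of the /BASE vs. /NEW convention described above (the boolean flag is our own stand-in for however a machine knows whether it is running the Nth or (N+1)th build):

```java
// Illustrative only: production (Nth) machines dispatch their captures to
// <MIRROR_RESPONSE_DISPATCH_PATH>/BASE, test ((N+1)th) machines to .../NEW.
public final class DispatchPaths {
    static String dispatchUrl(boolean isProductionMachine) {
        String base = System.getProperty("MIRROR_RESPONSE_DISPATCH_PATH");
        return base + (isProductionMachine ? "/BASE" : "/NEW");
    }
}
```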

  • Central Storage: Central storage is a file-based system where production and mirrored requests, responses, and header data are stored. Data from each service pool is stored in its own file, keyed by the same “RequestMirrorId” that was assigned at the time of mirroring on the source production machine, e.g. /production/ItemPage/<RequestMirrorId> vs. /test/ItemPage/<RequestMirrorId>.

  • Central Storage Wrapper: Each service pool makes a call to the “/Base” and “/New” services to set up data storage in central storage, using the path provided in “MIRROR_RESPONSE_DISPATCH_PATH.” These APIs accept the Nth and (N+1)th responses and map them to the “/production” and “/test” folders, respectively.

  • Comparators: A series of HTML, JSON, and header comparators integrated into a Jenkins pipeline that runs every hour, reads the data from central storage, compares it, and generates a temporary delta file.

  • Report Generator: A separate Jenkins job that is kicked off at the end of the comparator pipeline to pick up the temporary delta file and generate a readable HTML report from it. This report contains all the diffs in headers, requests, responses, etc. The diff matters because headers can get encoded, request parameters can be added or dropped, and responses can change while moving from one pool to another or through downstream services.
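A minimal sketch of what one hourly comparator run could look like, assuming captures are laid out under /production/ItemPage/<RequestMirrorId> and /test/ItemPage/<RequestMirrorId> as in the Central Storage example above; the real comparators are format-aware (HTML, JSON, headers) and integrated into Jenkins, which this sketch glosses over:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;

// Illustrative comparator sketch: for every RequestMirrorId captured in production,
// load the matching mirrored capture and record any response differences.
// A real comparator would diff structurally and ignore expected noise (timestamps, etc.).
public class SimpleComparator {

    public static void main(String[] args) throws IOException {
        Path prodDir = Paths.get("/production/ItemPage");
        Path testDir = Paths.get("/test/ItemPage");
        List<String> deltas = new ArrayList<>();

        try (DirectoryStream<Path> captures = Files.newDirectoryStream(prodDir)) {
            for (Path prodCapture : captures) {
                String mirrorId = prodCapture.getFileName().toString();
                Path testCapture = testDir.resolve(mirrorId);
                if (!Files.exists(testCapture)) {
                    deltas.add(mirrorId + ": no mirrored capture found");
                    continue;
                }
                String prodResponse = Files.readString(prodCapture.resolve("response"));
                String mirroredResponse = Files.readString(testCapture.resolve("response"));
                if (!prodResponse.equals(mirroredResponse)) {
                    deltas.add(mirrorId + ": response differs between Nth and (N+1)th builds");
                }
            }
        }
        // The delta list would be handed to the report generator job as a temporary file.
        deltas.forEach(System.out::println);
    }
}
```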

Important nuances

  • Total item page (VI) production traffic is more than 350 million requests per day. It is not possible to mirror all requests, because each mirrored request adds load on all downstream services and on storage as well. Typically we pick two to three machines at random from a production pool and mirror their traffic.

  • Mirrored requests can fire tracking events just like production requests. These tracking events are suppressed by passing a “nocache=true” value in the tracking header or by pointing the test machines to test tracking pools.

  • Important user-specific and sensitive information is masked out while storing HTML and JSON responses and creating reports.
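Masking can be as simple as redacting the values of known sensitive fields before a payload is stored or diffed; a rough sketch, with the field list purely as an assumption:

```java
import java.util.regex.Pattern;

// Illustrative only: blank out values of known sensitive JSON fields before storage.
// The actual field list and masking rules used by Neo are not described in the article.
public final class ResponseMasker {

    private static final Pattern SENSITIVE_FIELDS =
            Pattern.compile("\"(email|buyerName|shippingAddress)\"\\s*:\\s*\"[^\"]*\"");

    static String mask(String jsonResponse) {
        return SENSITIVE_FIELDS.matcher(jsonResponse).replaceAll("\"$1\":\"***\"");
    }
}
```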

Conclusion

This tool can run continuously in a production environment for long periods to cover the unknown and random use cases without impacting production traffic or users. It provides continuous automation and can capture issues that would otherwise have been missed due to changes in dependencies or other factors. We use similar approaches to test our key pages, such as search, and to automate and verify other efforts such as platform migration.

