Load testing microservices and identifying scalability issues

Source: https://eng.fitbit.com/load-testing-microservices-and-identifying-scalability-issues/

Engineering at Fitbit prioritizes quality in the products we build. We do a number of things to support this; one of them is pre-release load testing. This ensures not only that our services can handle the traffic generated by millions of users, but also that they handle it with acceptable latencies. Fitbit commonly uses two strategies to load test services: dark testing in production and simulation in an isolated environment dedicated to performance testing.

Dark testing in production is typically the lighter strategy to implement, as long as the request pattern to the service under test closely parallels that of an existing production service. Sometimes, however, the work involved in funneling traffic from an existing production service to the service under test is significant. In that case, simulation in an isolated environment becomes the more attractive alternative, from both a flexibility and a production-stability perspective.

[Image: See how you rank among your friends on the Fitbit leaderboard!]

The leaderboard service is a prime example of a service load tested at Fitbit with minimal implementation overhead using dark reads. The creation of this service supported a larger goal at Fitbit of breaking up our monolithic application into smaller, more manageable microservices. As one of these new microservices, the leaderboard service was not so much a new feature as it was a re-implementation of existing leaderboard logic in a new context. Conveniently, this sister leaderboard code in the monolith provided an easy source of fake (but realistic!) traffic to the new service.

[Image: A monolith can be hard to change and deploy safely once it gets too large. Smaller microservices are much easier to manage and deploy separately, as needed.]

At the start of each call to the monolith’s leaderboard endpoints, a request to the new leaderboard service was added to an independent thread pool. An independent pool was used because dark reads should have as little impact as possible on the service handling real production traffic. This allows the dark requests to succeed or fail as quickly or as slowly as the service under test can manage, with minimal negative impact on the performance of the production service.

The dark requests were also placed behind a feature flag to control the percentage of production traffic for which a dark read should be performed. This enabled us to ramp up traffic to the new microservice slowly and scale its dedicated resources accordingly (not to mention providing a killswitch if anything were to go wrong!). Instrumentation was added both to track the latency of these dark reads and to compare their results with the responses sent to users.
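
None of the monolith’s actual code appears in this post, so the sketch below is only a hypothetical illustration of the dark-read plumbing described above: a small dedicated thread pool, a percentage gate standing in for the feature flag, and a comparison of the dark response against the production response. All class and method names are invented for illustration.

    import java.util.Objects;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class LeaderboardDarkReader {

        /** Hypothetical client for the new leaderboard microservice. */
        public interface LeaderboardClient {
            String getLeaderboardJson(long userId);
        }

        // Small, bounded pool dedicated to dark reads. If it fills up, additional
        // dark reads are discarded rather than queueing behind production work.
        private final ExecutorService darkReadPool = new ThreadPoolExecutor(
                4, 16, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(1_000),
                new ThreadPoolExecutor.DiscardPolicy());

        private final LeaderboardClient newService;

        public LeaderboardDarkReader(LeaderboardClient newService) {
            this.newService = newService;
        }

        /** Stand-in for a feature-flag lookup; dialing this to 0 acts as a killswitch. */
        private int darkReadPercent() {
            return Integer.getInteger("leaderboard.darkread.percent", 0);
        }

        /**
         * Called from the monolith's leaderboard endpoint after it has computed the
         * production response. Fire-and-forget: the production path never blocks on it.
         */
        public void maybeDarkRead(long userId, String productionResponse) {
            if (ThreadLocalRandom.current().nextInt(100) >= darkReadPercent()) {
                return; // this request was not sampled for a dark read
            }
            darkReadPool.submit(() -> {
                long start = System.nanoTime();
                try {
                    String darkResponse = newService.getLeaderboardJson(userId);
                    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                    boolean matches = Objects.equals(productionResponse, darkResponse);
                    // A real system would emit these as metrics rather than log lines.
                    System.out.printf("dark read took %d ms, matches production: %b%n",
                            elapsedMs, matches);
                } catch (Exception e) {
                    // Failures of the service under test must never affect production traffic.
                    System.out.println("dark read failed: " + e.getMessage());
                }
            });
        }
    }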

The success criteria for this test were quite simple: handle the full load of dark traffic representing what’s currently served in production, and return an equivalent response at least as fast as the existing monolithic endpoint. There was of course some tuning along the way, but the service’s independent nature allowed us to learn about, tune, and release it with quality and confidence.

[Image: But you never know when that meteor is going to hit…]

Sometimes a new service does not have an easy, safe, or relevant source of dark traffic. For these cases, Fitbit has a server environment dedicated to performance testing. Any or all microservices can be deployed and tested as a system in isolation there. We used this environment to ensure that the new in-app Dashboard experience (rolling out worldwide at the time of this post!) was ready to handle the traffic from millions of users.

Testing in this way is often more difficult, not only in implementation but also in defining success criteria. When a microservice is meant to embody existing logic elsewhere, it should generally perform as well as or better than its predecessor. In the case of the new Dashboard experience, however, we were moving from a client-driven Dashboard to a server-driven Dashboard. It would have been more trouble than it was worth to replicate behavior precisely across systems with such fundamental differences. Instead we were tasked with designing a new user experience, complete with many new error cases and new requirements for acceptable latencies.

The process of load testing this Dashboard service in isolation involved three steps:

  • We deployed the service itself with its (many) dependencies into the performance environment, where tests are regulated to prevent interference with one another and metrics can be tracked in isolation from any outside noise.
  • We wrote a script in JMeter to programmatically generate traffic mimicking a user’s typical request pattern when interacting with each of the service’s public endpoints (a simplified stand-in for this script is sketched just after this list).
  • We ran the script, pointed it at the instances deployed to the performance environment, and observed how the service behaved under progressively increasing load. This is the part where things got interesting, and where we learned the most about how the system was functioning.
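
The test itself was written as a JMeter test plan, which does not lend itself to a short excerpt, so the sketch below is a plain-Java stand-in for what each JMeter user-thread was doing at this stage. The endpoint paths, thread count, and base URL are all invented for illustration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class DashboardLoadDriver {

        // Hypothetical paths standing in for the Dashboard service's public endpoints.
        private static final List<String> ENDPOINTS = List.of(
                "/dashboard/layout", "/dashboard/tiles", "/dashboard/stats");

        public static void main(String[] args) throws InterruptedException {
            String baseUrl = args.length > 0 ? args[0] : "http://localhost:8080";
            int userThreads = args.length > 1 ? Integer.parseInt(args[1]) : 50;

            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(5))
                    .build();

            ExecutorService pool = Executors.newFixedThreadPool(userThreads);
            for (int i = 0; i < userThreads; i++) {
                pool.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        // One "user iteration": hit each public endpoint once, roughly
                        // the way a real client session loads the Dashboard.
                        for (String path : ENDPOINTS) {
                            HttpRequest request = HttpRequest.newBuilder()
                                    .uri(URI.create(baseUrl + path))
                                    .timeout(Duration.ofSeconds(10))
                                    .GET()
                                    .build();
                            try {
                                client.send(request, HttpResponse.BodyHandlers.discarding());
                                // JMeter records the status code and timing of each request here.
                            } catch (InterruptedException stop) {
                                Thread.currentThread().interrupt();
                                return; // shutting down
                            } catch (Exception e) {
                                // A failed request is recorded, and the loop keeps going.
                            }
                        }
                        // Note: no pause between iterations yet; this becomes important below.
                    }
                });
            }

            Thread.sleep(Duration.ofMinutes(10).toMillis()); // run for a fixed window
            pool.shutdownNow();
        }
    }
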
[Image: An example Dashboard in the new app experience]

We ran the script for the first time, but strangely a majority of requests were failing according to the JMeter results. Even more strange was the lack of evidence of these failures in the service logs, meaning that the requests must have been rejected before even reaching the service itself. At Fitbit, we have a request routing service in place to collect general metrics on the traffic to all Fitbit APIs. This service seemed like a good place to begin the investigation into where these requests were disappearing.

[Image: An upper limit has been reached in the number of open connections to our service under test.]

Alas! This graph of active connections between the request routing service and the Dashboard service made it clear that we were hitting a ceiling of 300 simultaneous open connections allowed between these two services. When no quota was available, incoming requests simply failed on the spot until other in-flight requests completed. Once we configured the request routing service to allow more active connections to our Dashboard service in the performance environment, these bursts of requests were no longer denied in subsequent tests.
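
Fitbit’s request routing service is internal and its configuration is not shown in this post, so purely as a generic illustration of the kind of per-route connection ceiling involved, here is how a cap of 300 connections to a single backend might be expressed with Apache HttpClient’s connection pool. This is an assumption for illustration only, not the routing service’s actual mechanism.

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

    public class RoutingClientConfig {

        /** Builds an HTTP client allowing at most 300 concurrent connections per backend route. */
        public static CloseableHttpClient dashboardClient() {
            PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
            pool.setDefaultMaxPerRoute(300); // the kind of ceiling we hit in the performance environment
            pool.setMaxTotal(1_000);         // overall cap across all routes
            return HttpClients.custom()
                    .setConnectionManager(pool)
                    .build();
        }
    }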

While this was merely an inconvenience during the test, it did reveal a need to carefully estimate and configure the request routing service’s request rate limit for our service in production. With too low a quota, requests would be dropped in this same way in production! With too high a quota, our service could be sent more incoming requests than it can handle. But this would have to come later, once the load test’s results were available to inform the estimate.

We ran the script a second time, simulating linearly increasing load over a long period to observe when the service’s performance would degrade. But again, something strange happened: traffic to the service increased very quickly and then remained roughly constant for the rest of the test, even though the number of user-threads kept increasing linearly for the entire duration.

[Image: That’s not linear…]

One key realization explains this progression: a lack of any sort of rate limiting in the JMeter user-thread loops. Although each user-thread is executed independently and in parallel, each individual thread still needs to finish an iteration of its loop before beginning its next iteration. When the only thing preventing each thread from proceeding to its next iteration is the latency of the request being made, a negative feedback loop forms. A very small number of user-threads can create a large amount of traffic when each loop iteration takes a very short amount of time. But with a large amount of traffic comes increased latencies on the service handling the traffic. Increased latencies mean longer loop iterations on each of the user-threads, and thus less traffic!
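
Put in rough numbers: if N user-threads each take t seconds to complete one loop iteration (dominated here by request latency) and each iteration issues r requests, the offered load settles at about N × r / t requests per second, which is essentially Little’s Law with the concurrency fixed at N. As the service slows down, t grows and the offered load shrinks, producing exactly the self-limiting curve in the graph above.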

Eventually (quite quickly, in this case), an equilibrium is reached where latencies have increased and request rate has decreased to a point where the service is fully saturated. If traffic to the service were to increase, then its latencies would increase due to lack of resources and traffic would decrease. Conversely, if traffic to the service were to decrease, then its latencies would decrease due to extra resources becoming available and traffic would increase. This equilibrium provides an interesting metric on the service, representing the saturation point in requests per second that it can handle if it consumes all of its available resources. In practice, this is not a point a healthy service should ever approach, since it would likely breach its SLOs well before reaching this state. Regardless, we discovered that this service, as tested, could handle no more than 1,250 requests per second with the resources it was allocated.

In order to get a more useful result from the load test, we had to ensure that each of JMeter’s user-threads produced a roughly constant amount of traffic over time. One simple way of achieving this is to inject into each of the loop’s iterations an artificial pause of the same magnitude as the acceptable latencies of the requests themselves. This ensures that each loop will take about the same amount of time, assuming latencies remain smaller than what is acceptable from the service.
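
In JMeter this pause can be added with a timer, such as the built-in Constant Timer. Expressed in the same plain-Java terms as the earlier load-driver sketch, the change amounts to something like the following; the 1.5-second figure is the per-loop pause quoted later in the post.

    import java.time.Duration;

    public class PacedUserLoop {

        // Fixed pause injected into every iteration. With request latencies well
        // under this value, each iteration takes roughly the same amount of time,
        // so each user-thread offers a roughly constant request rate.
        private static final Duration LOOP_PAUSE = Duration.ofMillis(1_500);

        /** One user-thread's loop body, paced so its request rate stays roughly constant. */
        static void runUserLoop(Runnable oneIterationOfRequests) throws InterruptedException {
            while (!Thread.currentThread().isInterrupted()) {
                oneIterationOfRequests.run();        // the same endpoint calls as before
                Thread.sleep(LOOP_PAUSE.toMillis()); // pacing, independent of observed latency
            }
        }
    }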

[Image: That’s more like it!]

With the pause added to the JMeter script, we finally had a load test providing useful results! Let’s take a look at some of the graphs.

[Image: p99 Endpoint Latencies]

[Image: p90 Endpoint Latencies]

[Image: p50 Endpoint Latencies]

In these graphs, p99 Dashboard load times rise above 2 seconds after about 15:51. This was an unacceptable amount of latency: user research shows that users lose interest in as little as 2 seconds if the information they are looking for does not load. The service was deemed degraded beyond this point because the summed endpoint latencies were consistently over 2 seconds on the p99 graph, and the p90 and p50 graphs also showed relatively large increases in latency. Furthermore, the HTTP status total graph began to diverge from a linear trajectory at this time, suggesting that the total request latency within a single user loop was becoming more significant than the injected pause duration (chosen to be 1.5 seconds per loop). This puts the throughput of a single instance of the service at about 900 requests per second.
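
To spell out the reasoning: with the fixed pause in place, each user-thread’s iteration takes roughly 1.5 seconds plus the total latency of the requests it issues. While that latency is small relative to the pause, every thread contributes a nearly constant rate and the cumulative request count grows linearly; once latencies become comparable to the pause, per-thread throughput drops and the curve bends, which is the divergence noted above.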

To test this hypothesis, a second instance of the Dashboard service was deployed to the performance environment, and the test was run again. If the previous results were accurate, and the service scales linearly with the number of instances deployed as expected, then the service ought to degrade at roughly twice the load observed in the previous test.

[Image: Load testing, now with twice the resources!]

Instead, although there were twice the number of instances, each appeared to degrade at about half the request rate seen before, resulting in a total throughput roughly equal to that of the first test! A quick look at the latencies of some of the Dashboard service’s upstream dependencies revealed what might have been the issue:

[Image: p99 latencies of upstream requests made by the Dashboard service to other APIs]

[Image: p99 latencies of end-to-end Dashboard service requests]

It appeared that a significant portion of the service’s degradation was due to its upstream dependencies. But even though the increasing latencies of the upstream requests were clearly having an impact, they did not account for the entire increase in latency within the Dashboard service. To be safe, we stuck with a conservative estimate of 900 requests per second per instance, based on the results of this load test.

Now we had all of the information we needed to estimate the number of instances required to support full production traffic! Based on traffic to the current Dashboard, we estimated that this new service would handle about 24,000 requests per second in production. With each instance able to handle 900 requests per second, plus roughly 30% headroom to be safe, we would need 34 instances to safely handle production traffic. Of course, this may be an overestimate given the degradation of upstream dependencies that was observed, but it’s better to overestimate than underestimate! We can always release resources down the road if it looks like we have more capacity than expected.
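
For reference, the arithmetic behind that estimate works out roughly as follows: 24,000 requests per second divided by 900 requests per second per instance is about 27 instances at full saturation, and adding roughly 30% headroom on top of that lands in the mid-thirties, in line with the 34 instances quoted above.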

With these results, we were confident in beginning the rollout of our new system. Of course there were bumps in the road along the way, but with the solid base provided by these tests, those issues were mitigated quickly and with minimal user impact.

Soon all users will be able to enjoy the new in-app Dashboard experience, and all of the potential that comes with it. New features, new stats, and new experiences will all be supported by this platform to empower and inspire the world to live a healthier, more active lifestyle.
If you also appreciate quality software and inspiring a healthier, more active lifestyle, come join us at fitbit.com/careers! And may your services run smoothly, even when the going gets tough.

About the Author

Jonathan Farr – Senior Software Engineer


Jonathan has been a backend engineer at Fitbit for 2½ years (and an internship!) and has been focused on the design and development of the new in-app Dashboard experience for the past year. He plays board games every chance he can find. Some of his favorites include Hanabi, Spirit Island, Dominion, Sagrada, and The Mind. He also organizes a bi-weekly board game night at Fitbit to convince more people to play board games with him. When not playing board games, he also enjoys playing Pokémon GO around San Francisco and climbing at Dogpatch Boulders or Berkeley Ironworks.

