
source link: https://developer.ibm.com/blogs/how-we-overcame-performance-nightmares-in-our-monolith-app/

Blog Post

How we overcame performance nightmares in our monolith app

Steps for performance improvement


Subscriber and Subscription Management (SSM) is the system that funnels orders for IBM SaaS offerings, placed through IBM and third-party marketplaces, to the appropriate endpoints. It provisions orders for customers and manages the entire subscriber and subscription lifecycle, handling about 2,000 requests per hour.

SSM is a legacy monolith app, and dealing with such a mission-critical application with millions of lines of code can be a nightmare. What makes it more complex is that transaction handling is implemented at the smallest service-layer unit. To support high-end business use cases, SSM exposes dozens of composite APIs. These composite APIs internally call the smallest-unit APIs, so a single composite API request holds multiple DB connections.
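The following hypothetical sketch (service and method names are invented for illustration, not taken from SSM) shows what this pattern looks like in Java: each smallest-unit service manages its own transaction, so a single composite request ends up acquiring and holding multiple pooled DB connections before it completes.

    // Hypothetical sketch of the composite-API pattern described above.
    // Each smallest-unit service runs in its own transaction and therefore
    // checks out its own pooled DB connection.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    interface SubscriberService   { String findSubscriberByOrder(String orderId); }
    interface SubscriptionService { List<String> findSubscriptionsByOrder(String orderId); }
    interface EntitlementService  { List<String> findEntitlementsBySubscriber(String subscriberId); }

    class CompositeOrderApi {
        private final SubscriberService subscribers;
        private final SubscriptionService subscriptions;
        private final EntitlementService entitlements;

        CompositeOrderApi(SubscriberService s, SubscriptionService su, EntitlementService e) {
            this.subscribers = s;
            this.subscriptions = su;
            this.entitlements = e;
        }

        // One composite request fans out to three smallest-unit APIs; with
        // transaction handling at the unit level, each call acquires and holds
        // its own DB connection, multiplying the connection load per request.
        Map<String, Object> getOrderDetails(String orderId) {
            Map<String, Object> result = new HashMap<>();
            String subscriberId = subscribers.findSubscriberByOrder(orderId);
            result.put("subscriber", subscriberId);
            result.put("subscriptions", subscriptions.findSubscriptionsByOrder(orderId));
            result.put("entitlements", entitlements.findEntitlementsBySubscriber(subscriberId));
            return result;
        }
    }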

This eventually exhausted the DB memory, causing myriad live transactions to be lost. You might be asking:

  • Can’t transaction handling be implemented at the composite API level rather than at the smallest API unit? No, because the data access layer is tightly coupled with the lower-level APIs, and moving transaction handling to a higher level would introduce many stale object-state exceptions.
  • Can’t the monolith app be broken down into a microservices architecture, which is the current market trend? No, because that is a costly affair in terms of resources and time; moreover, developers were busy tending to the issue above, leaving no room to think about and invest time in this approach.

It was critical to find a fast and efficient solution, since the problem impacted the business. To make things worse, with SSM at the core of the marketplace ordering flow, both upstream and downstream systems were significantly impacted. It was also difficult to identify whether the source of the problem was the code, the database, or the infrastructure layer (the application is deployed on IBM Cloud). With the team’s engineering skills and aggressive debugging, the issue was analyzed as described below.

The journey

Pattern Discovery Phase

We analyzed the historical performance issues using an internal monitoring tool. This helped us identify that a huge number of calls were being made to fetch users with many roles or associated entitlements, causing the application to consume more resources and ultimately delaying subsequent API calls. This was a progressive effort achieved through:

  • Grouping the specific APIs in the monitoring tool that caused additional load to the application.
  • Taking a snapshot of historic data, enabling us to find the pattern that caused the performance degradation.
  • Creating similar API sets to run in an SSM preproduction environment.

Problem Reproduction Phase

Performance load tests were run on an SSM preproduction environment over a few weeks, at different times of the day, and heap dumps were collected for every run. Collecting the heap dumps for analysis was a bottleneck; the solution was to signal the main Java process to produce a dump and then copy the dump to a local machine for debugging. Steps to collect the heap dump from the IBM Cloud environment:

  • ibmcloud target --cf -sso
  • ibmcloud cf apps
  • ibmcloud cf ssh <appname>
  • Run ps aux (to get the process ID of the running Java process)

We then sent signal 3 to that process ID with kill -3 <pid> (do not use the -9 option, which terminates the process without producing a dump). Once the above commands are run, you will notice a core dump under the following folder:

       vcap@27854948-c2e2-4bc8-7649-c266:~$ ls -ltr /home/vcap/app/
        total 5840
        drwxr-xr-x 4 vcap vcap      62 Jul  5 09:50 WEB-INF
        drwxr-xr-x 3 vcap vcap      38 Jul  5 09:50 META-INF
        drwxr-xr-x 2 vcap vcap      26 Jul  5 09:50 jsp
        -rw-r----- 1 vcap vcap 5979538 Jul  5 12:40 javacore.20210705.124041.16.0001.txt

You can generate as many core dumps as you want (depending on the investigation).

Next, we copied the remote core dump to a local laptop by redirecting the output of ibmcloud cf ssh <appname> -c "cat <path of core dump>" to a file on the local machine.

After a couple of executions, the same scenario was reproduced, which gave us some confidence that the investigation was on the right track. It was indeed a daunting task to simulate it over and over again during peak times.

Problem Analysis Phase

With a few dumps in hand, the REST calls (GET and POST) were analyzed in depth, which gave insights into the degraded application behavior. The GET calls were holding their DB connections even after fetching the result set, so other incoming requests had to wait for connections to be released. This effectively caused a deadlock-like situation, starving the connection pool; during high-traffic times the overall app went into degraded performance mode and eventually crashed. As the following screenshot shows, 75 threads at "com/mchange/v2/resourcepool/BasicResourcePool.awaitAvailable(BasicResourcePool.java:1503(Compiled Code))" were awaiting a connection from the pool.

Screenshot shows 75 threads waiting for connection
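For context, this is roughly what the problematic read path looks like. The sketch below is a simplified, hypothetical reconstruction (the Subscriber entity and query are invented), not the actual SSM code: the read is wrapped in a transaction, so the pooled connection stays checked out long after the result set has been fetched.

    // Hypothetical "before" sketch: a GET path that runs inside a transaction
    // (autoCommit = false under the hood), so the pooled connection is held
    // until commit even though the result set was fetched much earlier.
    import java.util.List;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    public class SubscriberReadBefore {
        private final SessionFactory sessionFactory;

        public SubscriberReadBefore(SessionFactory sessionFactory) {
            this.sessionFactory = sessionFactory;
        }

        public List<String> findSubscriberNames(long accountId) {
            Session session = sessionFactory.openSession();
            try {
                Transaction tx = session.beginTransaction();   // connection held until commit
                List<String> names = session
                    .createQuery("select s.name from Subscriber s where s.accountId = :id", String.class)
                    .setParameter("id", accountId)
                    .getResultList();                           // result set fetched here
                // ... response mapping, serialization, calls to other unit APIs ...
                tx.commit();                                    // connection released only now
                return names;
            } finally {
                session.close();
            }
        }
    }

Under load, enough requests stuck in this window exhaust the pool, and new requests queue up in BasicResourcePool.awaitAvailable, exactly as the dump showed.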

Solution Phase

Based on the analysis, the commit mechanism of the GET calls was changed from autoCommit = false to autoCommit = true. This releases the connection as soon as the result set is fetched, instead of holding it until the end of the transaction.

Screenshot shows releasing the connection
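The sketch below illustrates the idea using plain JDBC for clarity (table and column names are illustrative; the actual change was made inside SSM's data access layer): with autoCommit = true there is no open transaction pinning the connection, so it goes back to the pool as soon as the result set has been consumed.

    // Minimal "after" sketch: an autoCommit read path that returns the pooled
    // connection as soon as the result set has been consumed.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;
    import javax.sql.DataSource;

    public class SubscriberReadAfter {
        private final DataSource pool;   // e.g. the c3p0 pool behind the application

        public SubscriberReadAfter(DataSource pool) {
            this.pool = pool;
        }

        public List<String> findSubscriberNames(long accountId) throws SQLException {
            List<String> names = new ArrayList<>();
            try (Connection conn = pool.getConnection()) {
                conn.setAutoCommit(true);                 // each statement commits on completion
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT name FROM subscriber WHERE account_id = ?")) {
                    ps.setLong(1, accountId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            names.add(rs.getString("name"));
                        }
                    }
                }
            }                                             // connection returns to the pool here
            return names;
        }
    }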

We also fine-tuned the DB connection pool to optimize the connections between the application and the data layer. We increased hibernate.c3p0.max_size from 125 to 250 to allow additional DB connections in the pool, and we reduced hibernate.c3p0.idle_test_period from 120 to 60 (the interval, in seconds, at which idle connections in the pool are tested for validity).

Screenshot shows reduction of hibernate time
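Expressed as Hibernate properties, the tuning looks roughly like the sketch below (the connection URL is a placeholder and the actual SSM configuration lives in its own config files; only the two pool values reflect the change described above).

    // Sketch of the c3p0 pool tuning described above, applied programmatically.
    import java.util.Properties;
    import org.hibernate.SessionFactory;
    import org.hibernate.cfg.Configuration;

    public class PoolTuning {
        static SessionFactory buildSessionFactory() {
            Properties props = new Properties();
            props.setProperty("hibernate.connection.url", "jdbc:db2://db-host:50000/SSMDB"); // placeholder
            props.setProperty("hibernate.c3p0.max_size", "250");        // raised from 125
            props.setProperty("hibernate.c3p0.idle_test_period", "60"); // lowered from 120 seconds
            return new Configuration().addProperties(props).buildSessionFactory();
        }
    }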

The combined approach above resulted in ~80% improvement in the response time for all APIs.

Bar charts show improvement in response time for all APIs

The performance improvement had a positive impact on the API consumers. The journey was hard, but the discovery and learning made both the application and the team more resilient.

Acknowledgements

Thank you to Anil Sharma for the analysis on the database and Bhakta for sharing expertise on heap dumps. And special thanks to Nalini V. for guiding us in this journey.

