Our story starts with us trying to implement a feature in one of our Node.js back-end applications, Sangam Athena. During this work, we realized that the application was using a large amount of RAM and that a large number of containers were being launched to serve traffic. Athena is a comparatively lightweight processing app, so those numbers surprised us. We looked for reasons and found two: pm2 was being used inside the Athena containers, and some infra parameters looked badly tuned.
In the end, we discovered that tuning some of the deployment and auto-scaling parameters resulted in a large reduction in both the number of containers for our Node.js application and its memory usage.
Our optimizations list
We decided to undertake two deployment updates:
1. Change in ECS autoscale target CPU values:
The idea was to utilize our resources better. With a scaling target of 40%, new containers could be launched even when the existing ones were running at only about 40% of their CPU capacity. We concluded that the value had probably either been copied from the deployment pipeline of another application or been based on experiments done long ago, and had not been revisited since.
2. Removing PM2 from containers:
We realized that our Node.js application was using the pm2 process manager inside Docker containers. To make matters worse, it was configured to run multiple instances of the application inside each container. Again, we concluded that the Docker deployment configuration had been copied over from another application and had not been revisited for a long time.
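To make the effect of the CPU target in the first change concrete, here is a rough sketch in JavaScript. It assumes a simple proportional model (desired count = current count × average CPU ÷ target CPU, rounded up), which only approximates how target-tracking scaling behaves. The function name and the 70% comparison value are hypothetical and are not the values used in the Athena deployment.

```javascript
// Simplified proportional model of target-tracking scaling.
// This is an assumption for illustration, not AWS's exact algorithm.
function desiredContainers(currentCount, avgCpuPercent, targetCpuPercent) {
  if (targetCpuPercent <= 0) {
    throw new RangeError('targetCpuPercent must be positive');
  }
  const desired = Math.ceil((currentCount * avgCpuPercent) / targetCpuPercent);
  return Math.max(1, desired); // never scale below one container
}

// Same load (4 containers averaging 60% CPU) under two different targets.
console.log(desiredContainers(4, 60, 40)); // 6
console.log(desiredContainers(4, 60, 70)); // 4
```

Under this model, a low target such as 40% keeps asking for extra containers at a load that a higher target would serve with the existing ones.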
What is PM2 and why remove it from docker?
In simple terms, pm2 is process management software, originally created for use on servers. It serves two main functions.
- Process Manager:
Before the days of Docker and serverless, a common deployment for most applications looked like this:
- Create a physical or virtual Windows/Linux machine as a server.
- Install your application runtime (Node.js, Java, etc.) and application(s) on this machine.
- Start it and then let the machine and the application run forever.
The problem with this approach was that if your application crashed for some reason, you had to log in to the server and restart the application manually. A process manager solves this problem by monitoring the application process and automatically restarting it if it fails. This is the main function of PM2.
- Scaling:
PM2 also allows you to scale your application by creating multiple processes of your Node.js application and distributing traffic among them.
There are other features described on the pm2 website, but the two explained above are the most commonly used.
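Both functions can be approximated with Node's built-in `cluster` module. The sketch below is only an illustration of the idea, not PM2's actual implementation; the helper name `startSupervisedCluster` and its parameters are hypothetical.

```javascript
const cluster = require('node:cluster');
const http = require('node:http');

// Hypothetical helper: fork `workerCount` workers and replace any that die.
function startSupervisedCluster(workerCount, port) {
  if (cluster.isPrimary) {
    for (let i = 0; i < workerCount; i++) {
      cluster.fork();
    }
    // Process-manager role: start a replacement whenever a worker exits.
    cluster.on('exit', (worker, code, signal) => {
      console.error(`worker ${worker.process.pid} exited (${signal || code}), restarting`);
      cluster.fork();
    });
  } else {
    // Scaling role: every worker listens on the same port and incoming
    // connections are shared among the workers.
    http
      .createServer((req, res) => res.end(`handled by ${process.pid}\n`))
      .listen(port);
  }
}
```

Calling `startSupervisedCluster(2, 3000)` from an entry script would run two worker processes behind port 3000 and restart either one if it crashes.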
PM2 in the context of Sangam Athena
Athena uses the common AWS ECS-based deployment strategy used across our organization. Linux VMs are provided dynamically by AWS Fargate to run Docker containers. Each machine runs multiple containers for multiple applications, one of which is Athena. An AWS load balancer distributes traffic evenly across the containers. A simple illustration may look like the one below.
Sangam Athena was configured to launch pm2 in cluster mode with the process count equal to the CPU core count of the parent machine. This resulted in 6 or even 8 processes being spawned inside a single container. It looked like a configuration copied over from the days of standalone server deployments.
Should we use PM2 inside docker containers?
We need to take into account that the two problems PM2 is used for are also solved by AWS ECS.
- Process Manager: We do not need a process manager when we use ECS, because ECS automatically removes a container in which the application has crashed, creates a new container on either the same or another Linux machine in the ECS cluster, and attaches it to the load balancer. The load balancer also stops routing traffic to a crashed container once its health check detects the crash.
- Scaling: As is clear from the diagram, an ECS deployment with a load balancer lets us spawn as many containers as we need to serve traffic. Furthermore, ECS scales automatically based on traffic, creating and destroying containers as required.
In online discussions, we found that people generally advise against using pm2 inside Docker containers for similar reasons. The PM2 developers have released a version of the application for use inside containers, but we didn't feel we needed any of the features it provides.
In the end, we decided to remove PM2 from our containers and do a canary test.
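The health check mentioned above needs something in the application to answer it. A minimal sketch of such a handler follows; the `/health` path, the response body and the function name are assumptions for illustration, not Athena's actual endpoint.

```javascript
// Hypothetical request handler with a health-check route.
// A load balancer configured to probe GET /health would treat the 200
// response as "healthy"; a crashed process simply never answers.
function handleRequest(req, res) {
  if (req.method === 'GET' && req.url === '/health') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: 'ok' }));
    return;
  }
  res.writeHead(404);
  res.end();
}
```

It could be served with `require('node:http').createServer(handleRequest).listen(3000)`, where the port is again only an example.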
Please go to the end of the article to check some of the resources which we used.
A failed experiment
We did two short trial runs incorporating the above two changes. However, within hours of deployment, we saw frequent spikes in HTTP 502 errors on the load balancers for the Athena service. Clearly, we had missed something.
Learning and adopting
We found that our Athena application containers were crashing repeatedly because of unhandled exceptions thrown from third-party libraries used in our code, and that our ECS/load balancer health checks were not fast enough to detect a failure and redirect requests to other containers.
While we were using pm2, the containers themselves did not crash: only one of the processes inside the pm2 cluster died, and PM2 restarted it immediately.
After we removed PM2, the whole container crashed instead. ECS identified that the container had crashed and created a new instance, and the load balancer redirected traffic to other containers after a failed health check. But our health check was not as quick as pm2, so many requests were still routed to the failed container, resulting in the 502 spikes.
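A back-of-the-envelope estimate shows why a slow health check hurts. The sketch below assumes a check interval of 30 seconds, an unhealthy threshold of 2 consecutive failures and 4 running containers; these are illustrative values, not our actual settings. Only the 130 requests per minute figure comes from our measured traffic.

```javascript
// Rough time before a dead container is marked unhealthy: it has to fail
// `unhealthyThreshold` consecutive checks that run `intervalSeconds` apart.
function detectionWindowSeconds(intervalSeconds, unhealthyThreshold) {
  return intervalSeconds * unhealthyThreshold;
}

// Requests still routed to the dead container during that window, assuming
// the load balancer spreads traffic evenly across `containerCount` containers.
function requestsAtRisk(requestsPerMinute, containerCount, windowSeconds) {
  return (requestsPerMinute / containerCount) * (windowSeconds / 60);
}

const windowSeconds = detectionWindowSeconds(30, 2);
console.log(windowSeconds); // 60
console.log(requestsAtRisk(130, 4, windowSeconds)); // 32.5
```

With these assumed numbers, a single crash could fail around 30 requests before traffic is redirected, whereas a process restarted in place by PM2 is back almost immediately.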
After analysis, we concluded that:
- Crashing containers can be fixed by better error handling and by upgrading third-party packages to their latest versions.
- We need to fine-tune the ECS health-check timeout used for failure detection, to avoid such a scenario if any other runtime exception occurs in future.
- ECS health check timeouts can’t be set to very small values, because they need to stay reasonably above the container startup time; otherwise containers don’t start at all.
- For now, we will keep PM2 for failure detection, but set the process count in its cluster mode to 2.
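As an illustration of the "better error handling" point, calls into a third-party library can be wrapped so that a thrown exception or rejected promise is logged and replaced with a fallback instead of taking the process down. This is a minimal sketch; `callSafely` and its parameters are hypothetical names, not code from Athena.

```javascript
// Hypothetical wrapper: run a (possibly async) third-party call and turn any
// throw or rejection into a logged fallback value instead of a crash.
async function callSafely(label, fn, fallback) {
  try {
    return await fn();
  } catch (err) {
    console.error(`[${label}] failed:`, err);
    return fallback;
  }
}
```

For example, `await callSafely('parser', () => thirdParty.parse(input), null)` would return `null` when the (hypothetical) `thirdParty.parse` throws, leaving the caller to decide how to respond.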
Changes and results
After applying the changes, we re-ran our tests and, once they passed, deployed to production for Sangam Athena.
100% prod release results (with similar traffic)
|Metric|1 to 5 March 22 (before)|1 to 5 May 22 (after)|
|---|---|---|
|Avg CPU utilization %|34|27|
|Avg memory %|63|26|
|Max number of containers|13|4|
|Avg number of containers|6|2|
|Avg latency p99 per min|1.57s|1.13s|
|Avg latency p95 per min|2.95s|2.06s|
|Avg requests served per min|130|134|
After going 100% live, we saw that even during peak traffic hours 4 containers were enough to serve almost all requests, whereas the count used to reach 13 before the optimization.
Athena containers count before and after
After subsequent experiments and deployments, we concluded that both infra changes introduced by us had benefits.
- Change in ECS autoscale target CPU values: Resulted in a reduction in the number of containers needed to serve live traffic. Adding or removing pm2 did not affect this.
- Restricting PM2 processes to 2: Resulted in a reduction in % memory utilization, thus enabling us to allocate less memory per container.
Regarding PM2, we concluded that:
- PM2’s availability features can avoid the recreation of containers in case of unhandled exceptions, errors, promise rejections, etc.
- Using PM2 cluster mode inside a Docker container does not make sense on ECS: it increases the application's memory consumption and does not add any advantages.
- Removing pm2 did not add any significant advantage either, apart from a reduction in memory usage when cluster mode was disabled.
A discussion over using pm2 inside docker containers: https://stackoverflow.com/questions/51191378/what-is-the-point-of-using-pm2-and-docker-together
PM2 for docker: https://pm2.io/docs/runtime/overview/
A must-watch for understanding best practices: Docker and Node.js Best Practices from Bret Fisher at DockerCon
AWS ECS best practices: https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/application.html