How Shaadi.com optimized compute costs using Amazon EBS gp3 volumes
Shaadi.com was founded in 1996 with one simple objective – to provide a superior matchmaking experience to Indian people around the world. Shaadi.com has over 3 million active profiles globally. As of today, Shaadi has helped 3.5 crores (35 million) people find their matches.
Shaadi.com uses AWS to deploy its key products, including the matrimonial portal, video calling features, user chats, and matching campaigns that recommend compatible users to each other. During the COVID-19 pandemic, Shaadi faced unprecedented scaling challenges that increased their costs by 37%. However, with the help of AWS Enterprise Support, the Shaadi team saved costs while scaling using Amazon EBS gp3 volumes.
In this blog, we walk through the architecture of Shaadi’s matrimonial portal, explaining the significance of each tier and how they interact with each other. Then, we explain the cost and architecture challenges Shaadi faced after an unexpected 58% increase in user traffic, which Shaadi initially addressed by adding more compute nodes and increasing the size of Amazon EBS gp2 volumes. Finally, we discuss how gp3 volumes simplified Shaadi.com’s architecture and lowered total backend compute costs by 25%.
Matrimonial portal architecture
The Shaadi matrimonial portal on AWS consists of a three-tier architecture designed to have object-relational-mapper (ORM) as an abstract layer. This enables front-end web services to interact with any backend databases, and to scale the middle tier independently using a microservice architecture.
The first tier is a group of front-end web servers that receive customer requests and route them to the middle tier. The middle tier has two services, one of which is the profile metadata service named “Back 1” which updates users’ profiles and photos, as well as favourite and ignored profile details. The other is the “Profile API” service for connection requests, chat and inbox messages, and user interests. Both the front end and middle tier use Auto Scaling Groups and Application Load Balancers to automatically adapt to customer traffic. The middle tier also uses Elastic Load Balancing to distribute traffic across its servers. The backend tier is a set of three MySQL database clusters that store all customer data. Each of the databases has primary and secondary database (DB) servers, which use synchronous database-level replication. The following diagram outlines the architecture:
Both of the mid-tier services communicate with backend MySQL databases to service customer requests. Each database has a different set of customer data: Database 1 (DB1) stores users’ profile and photo details, database 2 (DB2) stores users’ favourite and ignored profiles, and database 3 (DB3) stores users’ interest, accepted requests, and chat message details.
As customers register on shaadi.com, they generate traffic through the stack to DB1 and DB2 for creating and updating profile metadata. As they begin interacting with other profiles, such as tagging another user profile as a favourite, sending an interesting request, or initiating a chat request, shaadi.com customers drive traffic to DB3. In addition to user-generated traffic, shaadi.com also runs hourly profile match campaigns that drive additional traffic to DB3. Data synchronization between primary and secondary databases is critical. If there is a delay or failure, users can miss interest requests and chat messages, or see outdated lists of favourite profiles, all of which leads to a poor matchmaking experience.
In April 2020, the COVID-19 pandemic spread across India and around the world, affecting many lives. Shaadi.com experienced an unanticipated 30% increase in new users within the first few weeks of the nationwide lockdown. The front-end services, built using Auto Scaling Groups, automatically provisioned new infrastructure to meet the demand without any impact on customer experience. As these additional users completed registration and started to interact with each other, there was an increase in overall traffic. Before the lockdown, there were 1.8 million interest requests/day between prospective candidates served by DB3, which increased to 2.5 million interest requests/day after the lockdown. With the help of Datadog, an application performance-monitoring tool, we noticed that the increase in interest requests led to an increase in daily user profile interactions and chat exchanges. This further grew daily backend DB3 traffic by 58%. Increased traffic to DB3 resulted in delayed database synchronization and longer API response times. To improve the response time, additional DB3 servers were added, which increased back-end database costs by 37%. However, in February 2021 another unexpected increase in new users challenged us to find a way to scale out the databases at a lower cost. Next, we share our initial approach to scaling and then describe how Enterprise Support suggested gp3 volumes to simplify the architecture and lower performance-scaling costs.
Initial approach: Scaling architecture with gp2
During the first few weeks of our COVID-19 lockdown in April 2020, at peak activity times during the day, end-users experienced delays of between 10 and 500 seconds for routine actions. These delays affected functionality such as viewing the list of matched candidates, sending and receiving chat messages, and seeing connection request updates. This degraded performance lasted about 2 hours each day, resulting in a poor customer experience. Additionally, during these peak hours, with the application timing out, scheduled matching campaigns (matching suitable candidates based on search criteria and preferences) failed to generate new matches.
When the Shaadi team did a deep dive, they saw that the processor and memory performance of our DB3 instances (R5.8xlarge) were at their limits. Our Database Administrators (DBAs) observed in the “diskio_iops_in_progress” metrics (representing the number of I/O requests issued to the device driver that are incomplete) that the DB3 instances were trying to drive more IOPS than we had provisioned the attached gp2 volumes for (1.5-TB volumes with 4,500 IOPS). They also observed that “diskio_io_time” metrics (representing the amount of time that the disk has had I/O requests queued) for the disks increased to a few thousand milliseconds.
To address the issue, the team at Shaadi.com added two additional R5.8xlarge nodes to the database cluster. We also increased the size of all DB3 gp2 volumes by 33% (from 1.5 TB to 2 TB), which raised the available IOPS per volume from 4,500 to 6,000, because gp2 IOPS scale with volume size. We had the option of using provisioned IOPS, but the price was 3x more than gp2, and we did not need the higher durability or higher performance. This solved Shaadi.com’s customer experience problem. However, it increased the cost of the DB3 databases by 37% and permanently increased the size of the gp2 volumes solely to obtain higher IOPS.
| Monthly cost of six R5.8xl instances each with 1.5 TB of gp2 EBS volume with 4,500 IOPS is $9,730. | Monthly cost of eight R5.8xl instances each with 2.0 TB of gp2 EBS volume with 6,000 IOPS is $13,373. | Monthly cost increased $3,643 (37.44%) |
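The coupling between gp2 size and IOPS described above can be sketched in a few lines. This is a minimal illustration, assuming the standard gp2 baseline of 3 IOPS per GiB (with a 100 IOPS floor and a 16,000 IOPS ceiling) and treating the article's 1.5 TB and 2 TB volumes as 1,500 GiB and 2,000 GiB, which matches the 4,500 and 6,000 IOPS figures quoted. The function names are hypothetical and not part of any AWS SDK.

```python
import math

# gp2 baseline performance: 3 IOPS per GiB, bounded below and above.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS a gp2 volume of the given size provides."""
    return min(max(size_gib * GP2_IOPS_PER_GIB, GP2_MIN_IOPS), GP2_MAX_IOPS)


def gp2_size_for_iops(target_iops: int) -> int:
    """Smallest gp2 size (GiB) whose baseline meets the target IOPS."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("target exceeds the gp2 baseline ceiling")
    if target_iops <= GP2_MIN_IOPS:
        return 1  # even the smallest volume gets the 100 IOPS floor
    return math.ceil(target_iops / GP2_IOPS_PER_GIB)


if __name__ == "__main__":
    print(gp2_baseline_iops(1_500))   # volume size before the change
    print(gp2_baseline_iops(2_000))   # volume size after the change
    print(gp2_size_for_iops(6_000))   # storage needed just to reach 6,000 IOPS
```

The second function makes the cost problem visible: on gp2, the only way to reach a target IOPS figure is to pay for the storage that comes with it.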
A lower-cost solution: Dynamically scale performance with gp3
In February 2021, Shaadi.com saw another unexpected 20% increase in user traffic that again caused high latency on the backend DB3 cluster. The DBA team identified that the instances were again being throttled on IOPS at the gp2 volumes. This time, we signed up for Enterprise Support, the highest-level support offered by AWS. Enterprise Support assigned a technical account manager (TAM) to help with daily business as usual, in addition to strategic work. The TAM has been assisting Shaadi.com by providing immediate support and root cause analyses (RCAs) to resolve business-impacting issues. The I/O issue was handled the same way: Shaadi.com’s TAM identified the root cause and built a long-term plan for scaling the backend databases. The TAM proposed a migration to gp3 volumes to provision IOPS independently of the volume size. This way, DBAs could increase the performance of the volumes to meet unexpected growth without over-provisioning storage.
Testing was carried out in the staging environment using Elastic Volume operations to perform a “data in place” migration of the gp2 volume to a gp3 volume. This non-disruptive migration was successful, and the volume modification was transparent, with no impact on users. Additionally, there was no degradation of application performance during the process. After the initial test, Shaadi.com migrated the volumes on the staging and production servers. The resulting benefit was a 15% lower storage cost and the agility to increase IOPS for future demand.
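An in-place gp2-to-gp3 change of this kind can also be scripted. The following is a minimal sketch, not Shaadi.com's actual tooling: it assumes a boto3 EC2 client is passed in and uses its `modify_volume` call with the `VolumeType`, `Iops`, and `Throughput` parameters. The helper names and the volume ID are hypothetical, and the only validation applied is the gp3 baseline of 3,000 IOPS and 125 MiB/s.

```python
# gp3 baseline included with every volume (lower bounds for provisioning).
GP3_BASELINE_IOPS = 3_000
GP3_BASELINE_THROUGHPUT_MIBPS = 125


def build_gp3_request(volume_id, iops=GP3_BASELINE_IOPS,
                      throughput_mibps=GP3_BASELINE_THROUGHPUT_MIBPS):
    """Build the keyword arguments for an EC2 ModifyVolume call (hypothetical helper)."""
    if iops < GP3_BASELINE_IOPS:
        raise ValueError("gp3 volumes start at 3,000 IOPS")
    if throughput_mibps < GP3_BASELINE_THROUGHPUT_MIBPS:
        raise ValueError("gp3 volumes start at 125 MiB/s")
    return {
        "VolumeId": volume_id,
        "VolumeType": "gp3",
        "Iops": iops,
        "Throughput": throughput_mibps,
    }


def migrate_to_gp3(ec2_client, volume_id, iops=GP3_BASELINE_IOPS,
                   throughput_mibps=GP3_BASELINE_THROUGHPUT_MIBPS):
    """Request the in-place change and return the reported modification state."""
    request = build_gp3_request(volume_id, iops, throughput_mibps)
    response = ec2_client.modify_volume(**request)
    return response["VolumeModification"]["ModificationState"]


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   ec2 = boto3.client("ec2")
#   migrate_to_gp3(ec2, "vol-0123456789abcdef0", iops=6000)
```

Keeping the request builder separate from the API call makes it possible to review or test the exact parameters before anything is sent to AWS, which suits a staging-first rollout like the one described here.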
The TAM said that gp3 can take advantage of higher IOPS to better use the Amazon EBS-optimized bandwidth available on R5.8xlarge. The gp3 provisioned IOPS were increased from 6,000 to 8,000 per volume, and instances averaged a “CPU Utilization” decrease of 20% on the DB3 secondary instances. The result was that by right-sizing the EBS gp3 volumes to drive more performance from R5.8xlarge instances, the number of nodes in the DB3 cluster could be reduced by 25% (from 8 to 6 instances) without causing any performance impact to the end-users. This solved the customer experience problem, enabled the right-sizing of EBS volumes, and optimized the EC2 instances.
| Monthly cost of eight R5.8xl instances each with 2.0 TB of gp2 EBS volume with 6,000 IOPS is $13,373. | Monthly cost of six R5.8xl instances each with a 2.0 TB gp3 EBS volume with 8,000 IOPS is $9,963. | Monthly cost decreased $3,410 (25%) |
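The percentages in the two cost comparisons follow directly from the monthly totals quoted in this post. A small sketch of that arithmetic, using only those figures (the function name is illustrative):

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


if __name__ == "__main__":
    # gp2 scale-out: six instances with 1.5 TB volumes to eight with 2.0 TB volumes
    print(f"{percent_change(9_730, 13_373):+.2f}%")
    # gp3 right-sizing: eight gp2-backed instances to six gp3-backed instances
    print(f"{percent_change(13_373, 9_963):+.2f}%")
    # node count in the DB3 cluster: eight down to six
    print(f"{percent_change(8, 6):+.2f}%")
```

Note that the gp3 configuration returns the monthly bill to within a few hundred dollars of the original six-node figure while providing 8,000 IOPS per volume instead of 4,500.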
Shaadi.com has duplicated the success with DB3 by migrating all gp2 volumes to gp3. This project has been so successful that Shaadi.com is considering the use of gp3 for containerized workloads.
For the front-end and middle-tiers, Shaadi.com created new AMIs with gp3 volumes and redeployed the services in phases using the existing Auto Scaling groups. For the back-end database servers, Shaadi.com made an online transition using the Elastic Volume feature, starting with secondary databases. With a few simple clicks, the DBAs modified the existing database volumes from gp2 to gp3 without any disruption to the application and database.
- Analyse compute utilization as part of Amazon EBS migration to see if there are opportunities to optimize costs.
- Migrate in place using Elastic Volumes – There is no need to stop instances or cause disruption to users.
- Leverage AWS Enterprise Support (TAM) – The TAM discovered a solution which saved 15-20% of storage costs and 25% of compute costs.
- A recent example in May 2021 underscored the benefits of the migration to gp3. The launch of new features resulted in increased traffic. This change in the business resulted in an increased load on the database servers. To reduce the load on database instances, Shaadi.com DBAs increased the IOPS of gp3 volumes from 8,000 to 10,000, which improved the disk performance with only a marginal increase in cost.
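The "marginal increase in cost" in the last point comes from how gp3 bills performance: each volume includes a 3,000 IOPS baseline, and only IOPS provisioned above that are charged separately. The sketch below shows the shape of that calculation. The per-IOPS price is a hypothetical input rather than a quoted AWS rate, since rates vary by Region, and the function names are illustrative.

```python
GP3_INCLUDED_IOPS = 3_000  # baseline IOPS included with every gp3 volume


def billable_iops(provisioned_iops: int) -> int:
    """IOPS charged on top of the included gp3 baseline."""
    return max(0, provisioned_iops - GP3_INCLUDED_IOPS)


def monthly_iops_cost_delta(old_iops: int, new_iops: int,
                            price_per_iops_month: float,
                            volumes: int = 1) -> float:
    """Extra monthly cost of raising provisioned IOPS on `volumes` volumes.

    `price_per_iops_month` is an assumed input; look up the current
    rate for your Region before relying on the result.
    """
    extra = billable_iops(new_iops) - billable_iops(old_iops)
    return extra * price_per_iops_month * volumes


if __name__ == "__main__":
    # Raising 8,000 to 10,000 IOPS adds 2,000 billable IOPS per volume,
    # with no change to the provisioned storage size.
    print(billable_iops(10_000) - billable_iops(8_000))
```

On gp2, the same 2,000 additional IOPS would have required growing each volume by roughly 667 GiB, because gp2 delivers 3 IOPS per GiB.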
Shaadi.com faced unexpected scaling challenges on MySQL databases due to surges in demand during the COVID-19 pandemic. A sudden 58% rise in user traffic caused the site’s latency to go up to 500 seconds for database synchronization. This led to a poor user experience and risked damaging Shaadi’s reputation as a leader in the matrimonial segment.
The first approach to solving this issue involved adding compute nodes to the database cluster to scale performance and increasing the size of the Amazon EBS gp2 volumes. While scaling IOPS improved performance, it also increased storage costs significantly because gp2 IOPS and volume size are not independent. When Shaadi faced another similar surge in demand in February 2021, scaling IOPS and storage together was cost-prohibitive. Shaadi needed to be able to scale performance without over-provisioning (and over-paying for) storage that they didn’t need. AWS Enterprise Support helped Shaadi adopt gp3 volumes to scale performance of the storage separately from the compute nodes. By right-sizing compute and storage, Shaadi was able to decrease overall backend scaling cost by 25% without impacting database synchronization or the users’ experience.
Thanks for reading this blog post on using Amazon EBS gp3 volumes to save on compute costs. If you have any comments or questions, leave them in the comments section.