Tech Shaadi

Sangam Search 2.0

tl;dr 

I know these benefits are super enticing and you must be wondering how we pulled it off. So here goes.


A little background

In the past 2 years, we have seen more than 6 downtimes because Solr couldn’t handle the scale (sometimes it ran out of CPU or memory, and sometimes we couldn’t even figure out what had happened). We had no proper monitoring for this, and the outages caused severe business impact, with as much as 2 hours of downtime on multiple occasions.

Apart from this, the Solr schema we were using was originally created to serve Shaadi business use-cases, and Sangam piggy-backed on it. After a point, that schema couldn’t satisfy many of our use-cases, and we had to resort to several workarounds. Because of this, the performance of the search service was poor.

For example, when a user logged in to Sangam and landed on the ‘MATCHES FOR YOU’ section, they would see the loader for at least ~8 secs on a fast data network, and ~10-12 secs otherwise, before any profile showed up on the screen. This was giving a few of us sleepless nights, and we wanted to fix it.

We set out to understand why these outages were happening, what we could do to serve more business use-cases, and how we could make the search service performant. The cool thing is that our initial aim was only to serve upcoming business use-cases; the performance gain was a byproduct.

We were battling some basic issues:

There was a lot more.
So, we started working on a solution and set ourselves a tight deadline (45 days with 1 engineer) to complete this project.

But why did we do it?

A very simple answer to this is – the business needed it. However, the search in Sangam is a bit more complicated than it looks from the outside –

What did we do about it?

We did away with almost all the inefficiencies of the existing search –

The upgraded, more optimized Solr, the newly designed schema, and the changes in the service helped us save those multiple round-trips, reducing the response time drastically from ~8-12 secs to ~3-4 secs.

New Search Flow
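The post does not show the actual queries, so the following is only a rough sketch of what a single round-trip can look like with Solr: every criterion is passed as its own `fq` (filter query) parameter on one `/select` request, instead of being resolved in separate calls. The host, the collection name `profiles`, and the field names are assumptions made up for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_search_url(base_url, collection, filters, rows=20, fields=("id",)):
    """Build one Solr /select URL that applies every filter in a single request."""
    params = [
        ("q", "*:*"),              # match everything, then narrow down with filters
        ("rows", str(rows)),       # page size
        ("fl", ",".join(fields)),  # only return the fields the caller needs
    ]
    # Each criterion becomes its own fq parameter; Solr intersects them server-side.
    # Note: values are passed through as-is, so callers must escape Solr special
    # characters themselves.
    for field, value in sorted(filters.items()):
        params.append(("fq", f"{field}:{value}"))
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and fields.
url = build_search_url(
    "http://localhost:8983",
    "profiles",
    {"gender": "female", "age": "[25 TO 30]", "city": "Pune"},
)
print(url)
```

The point of the sketch is only that the service issues one HTTP request per search; how the real schema makes that possible is not covered here.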

The other search engines 

There were essentially two other search engines that we evaluated: AWS CloudSearch and Elasticsearch (ES). Both options were good and sufficient for all our needs, but we still chose Solr because:

Cost

The cost covers both time and money. This project saved a lot of development time because we already had some expertise, and it has improved, and will keep improving, dev velocity for future projects. It will also save us some $$$ as the load on the DB and the search service goes down.

API

Solr provides APIs for indexing, schema changes, and even configuration changes. We use that very API to index from our consumers. We have deployed a total of three consumers (written in Go):

All consumers are event-based and can be triggered based on specific events,

New Indexing Flow
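As a minimal sketch of the event-to-index step, the snippet below turns a batch of events into one JSON request against Solr's `/update` endpoint. The real consumers are written in Go and their code is not shown in this post; Python is used here purely for illustration, and the event shape, field names, host, and collection name are all assumptions. The request is only built, not sent, so the sketch runs without a Solr instance.

```python
import json
from urllib.request import Request


def event_to_doc(event):
    """Map a (hypothetical) profile event to a flat Solr document."""
    return {
        "id": event["profile_id"],
        "gender": event["gender"],
        "age": event["age"],
    }


def build_update_request(base_url, collection, events, commit_within_ms=5000):
    """Build a POST to Solr's JSON update handler for a batch of events."""
    docs = [event_to_doc(e) for e in events]
    # commitWithin asks Solr to make the documents searchable within the given
    # time, which avoids forcing a hard commit on every small batch.
    url = f"{base_url}/solr/{collection}/update?commitWithin={commit_within_ms}"
    return Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical events, as a consumer might receive them from a queue.
events = [
    {"profile_id": "p1", "gender": "female", "age": 27},
    {"profile_id": "p2", "gender": "male", "age": 31},
]
req = build_update_request("http://localhost:8983", "profiles", events)
print(req.get_method(), req.full_url)
# A real consumer would now send it, e.g. urllib.request.urlopen(req).
```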

Testing

We ran both a stress test and a load test before shipping this to production, using K6 for both.

Load test

We first disabled the Solr cache, then used 1000 K6 virtual users to make many concurrent requests. The peak load on the existing Solr was 15k req/min, so we decided to test the new Solr with 50k req/min. It ran smoothly without straining the infra.

CPU

Solr’s CPU went up to only 35% and could have handled more load.
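To put the load-test numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes the 50k req/min target was spread evenly across the 1000 virtual users; the post does not say how the users were actually paced, so treat the per-user figure as illustrative.

```python
def per_second(req_per_min):
    """Convert a requests-per-minute figure to requests per second."""
    return req_per_min / 60


def per_vu_rate(req_per_min, virtual_users):
    """Requests per second each virtual user must issue, assuming an even split."""
    return per_second(req_per_min) / virtual_users


target_rpm = 50_000    # load-test target from the post
old_peak_rpm = 15_000  # peak load seen on the existing Solr
virtual_users = 1000   # K6 virtual users used in the test

total_rps = per_second(target_rpm)
vu_rps = per_vu_rate(target_rpm, virtual_users)
seconds_between_requests = 1 / vu_rps
ratio_to_old_peak = target_rpm / old_peak_rpm

print(f"{total_rps:.0f} req/s in total")
print(f"one request every {seconds_between_requests:.1f} s per virtual user")
print(f"{ratio_to_old_peak:.2f}x the old peak load")
```

In other words, the test target was a little over three times the old peak, at roughly 833 requests per second overall.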

Stress test

We decided to stress-test the infra with 3x the number of requests the current production setup handles. We scaled the new staging infra (DB and back-end services included) to match the production infra. Production handles ~4K req/min day to day, so we decided to hit 12K req/min to see if the new Solr could handle it, and it did.

CPU 

As you can see, Solr’s CPU went up to only 17% during this test, while other components such as the DB went up to almost 80-85% CPU.

DB

And the search service maxed out at 95% CPU.

Search Service 

To conclude, the new infra is much more capable: it can handle stress and load of almost three times the number of requests the existing infra can handle.
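One crude way to read the stress-test CPU figures is to ask which component would saturate first. The sketch below assumes CPU usage grows linearly with request rate and ignores idle baseline, memory, and I/O, none of which the post confirms, so the resulting ceilings are only ballpark estimates. The DB figure uses the upper end of the reported 80-85% range.

```python
def linear_ceiling(req_per_min, cpu_percent):
    """Request rate at which CPU would reach 100%, under a naive linear model."""
    return req_per_min * 100 / cpu_percent


stress_rpm = 12_000  # request rate used in the stress test

# CPU utilisation reported during the stress test.
observed_cpu = {"solr": 17, "db": 85, "search_service": 95}

ceilings = {
    name: linear_ceiling(stress_rpm, cpu) for name, cpu in observed_cpu.items()
}
# The component with the lowest ceiling is the likely bottleneck.
bottleneck = min(ceilings, key=ceilings.get)

for name, ceiling in sorted(ceilings.items(), key=lambda item: item[1]):
    print(f"{name}: ~{ceiling:,.0f} req/min")
print("likely bottleneck:", bottleneck)
```

Under these assumptions Solr has by far the most headroom, and the search service and DB would run out of CPU long before Solr does.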

Performance

The performance has significantly improved, as you can see in the graphs below. They show 5 days of data, by which point the experiment had moved to a 40% A/B split.

The p90-p99 latency of the old search on the proxy was ~3-4 secs; in the new service, it has reduced to ~1 sec.

p50

p75

p90

p99
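For readers less familiar with the p50/p75/p90/p99 notation: pN is the latency that N% of requests stayed at or below. A minimal sketch using the nearest-rank method is shown here; monitoring tools may interpolate differently, and the sample data is synthetic, not taken from our dashboards.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of the
    samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))

for p in (50, 75, 90, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

High percentiles such as p99 are the ones to watch, because they capture the slowest requests that an average would hide.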

The average number of requests also went down because, with the new schema, the search happens in a single round-trip.

Request Count

The average latency of Solr has also decreased, from ~1 sec to ~340 ms.

Solr latency (male)

Solr latency (female)

As you must have noticed in the graphs above, even though the requests have increased on the new service, the response time still holds at around ~1 sec. We are confident that this will help our users see their matches much faster than ever before and will have a great impact on their journey in Sangam.

YAY!
