Sangam Search 2.0

Shivam Rastogi/ November 4, 2020/ Backend, Web

tl;dr 

  • Upgraded the Solr search engine from v4.5 to v8.5
  • Created more powerful schema with API control
  • Identified and fixed a few problems in the existing search service, and extended it to support more complex business use-cases
  • And the cherry on top – significantly reduced the average search response time at peak from ~8-12 secs to ~3-4 secs. That’s a real win for us and our users, and nothing makes us happier than a happy user!

I know these benefits are super enticing and you must be wondering how we pulled it off. So here it goes.


A little background

In the past 2 years, we have seen more than 6 downtimes because Solr couldn’t handle the scale: sometimes it ran out of CPU or memory, and sometimes we couldn’t even figure out what had happened. We had no proper monitoring for it, and the outages caused severe business impact, with as much as 2 hours of downtime on multiple occasions.

Apart from this, the Solr schema we were using had been created for Shaadi business use-cases, and Sangam piggy-backed on it. After a point, that schema couldn’t satisfy many of our use-cases and we had to resort to several workarounds, which hurt the performance of the search service.

For example, when a user logged in to Sangam and landed on the ‘MATCHES FOR YOU’ section, they would see the loader for at least ~8 secs on a fast data network, and ~10-12 secs otherwise, before any profile showed up on the screen. This was giving a few of us sleepless nights, and we wanted to fix it.

We set out to understand why these outages were happening, what we could do to support more business use-cases, and how we could make the search service performant. The cool thing is that our initial aim was only to support upcoming business use-cases; the performance gain was a byproduct.

We were battling some basic issues:

  • Solr was running on a very old version, which came with its own issues
  • Solr required manual deployment and scaling, which took at least an hour
  • Indexing was a pain
  • As I said, the schema didn’t fit our business needs
  • The underlying search service bore the brunt of the workarounds needed for recent business cases

There was a lot more.
So, we started working on a solution and set ourselves a tight deadline (45 days with 1 engineer) to complete this project.

But why did we do it?

A very simple answer to this is – the business needed it. However, the search in Sangam is a bit more complicated than it looks from the outside –

  • Multiple round-trips of the search were happening internally to show a set of profiles to a user. For example, to show the ‘MATCHES FOR YOU’ section we needed to make two round-trips, and for the ‘MORE’ section the number went up to seven
  • Due to these multiple round-trips, the response time of the search service was pretty high, and they put unnecessary load on the underlying services
  • The existing schema wasn’t designed to save these round-trips, nor was it capable of that. Also, there were many unwanted fields (~60) in the schema which were making it inefficient
  • Solr indexing was done using DIH (Data Import Handler) and there was no API support for indexing
  • There was no way to add complex fields (fields that are derived from multiple tables on different databases) to the schema, as DIH wasn’t capable of handling this
  • Two Solr instances were running in parallel, and each time a user changed their profile, preferences, or filters, indexing happened independently on both instances, causing unnecessary load on the DB
  • The existing Solr couldn’t handle scale, as someone had to manually add an instance and index it

What did we do about it?

We did away with almost all the inefficiencies of the existing search –

  • Upgraded the Solr engine from v4.5 to v8.5, which made many optimizations, bug fixes, improvements, and new features available to us
  • Moved to a container-based deployment (scaling is still manual but takes only 5-10 mins)
  • Recreated the schema to support all existing, new, and upcoming business use-cases
  • Added APIs for indexing, schema changes, and even configuration changes
  • Moved to real-time event-based indexing instead of cron (a cron used to run every 5 mins for indexing the recently updated profiles)
  • Instead of indexing all the fields every night (cron), we decided to index only those fields which require a daily update (derived fields <10) using consumers that do partial indexing
  • Adding a new field to the schema now takes a day at most, and the full indexing consumer can be run to index it
  • Changed the search service to handle the new schema
  • Introduced an alerting and monitoring system for Solr

This upgraded, more optimized Solr, the newly designed schema, and the changes in the service helped us save those multiple round-trips, thus reducing the response time drastically, from ~8-12 secs earlier to ~3-4 secs.
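To make the schema-over-API point concrete, here is a minimal sketch in Go of what adding a field through Solr's Schema API can look like. The `add-field` command and the `/solr/<collection>/schema` endpoint are part of Solr itself. The collection name `profiles`, the field name `preferred_city_ids`, and the helper names are made up for illustration and are not our actual schema. The `pint` type is assumed to be defined in the schema, as it is in Solr's default configset.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// FieldDef holds a few of the attributes accepted by the Schema API
// "add-field" command. The selection here is illustrative, not exhaustive.
type FieldDef struct {
	Name        string `json:"name"`
	Type        string `json:"type"`
	Indexed     bool   `json:"indexed"`
	Stored      bool   `json:"stored"`
	MultiValued bool   `json:"multiValued"`
}

// addFieldPayload builds the JSON body for a Schema API add-field request.
func addFieldPayload(f FieldDef) ([]byte, error) {
	return json.Marshal(map[string]FieldDef{"add-field": f})
}

// schemaURL returns the Schema API endpoint for a collection.
func schemaURL(baseURL, collection string) string {
	return strings.TrimRight(baseURL, "/") + "/solr/" + collection + "/schema"
}

func main() {
	// Hypothetical field, used only to show the shape of the request.
	body, err := addFieldPayload(FieldDef{
		Name:        "preferred_city_ids",
		Type:        "pint",
		Indexed:     true,
		Stored:      true,
		MultiValued: true,
	})
	if err != nil {
		panic(err)
	}
	// POST the body to this URL with Content-Type: application/json.
	fmt.Println(schemaURL("http://localhost:8983", "profiles"))
	fmt.Println(string(body))
}
```

Documents indexed before the field existed have no value for it, which is why a full re-index follows a schema change.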

New Search Flow

The other search engines 

There were essentially two other search engines that we evaluated – AWS CloudSearch and Elasticsearch (ES). Both options were good and sufficient for all our needs, but we still chose Solr because –

  • Solr! powering billions of searches every day 🚀
  • We did a feasibility check with the product team and found Solr is pretty much capable of handling all our current and future business needs
  • Solr seemed a more cost-effective solution
  • Moving to either of these options would have required more dev time, as we didn’t have the expertise and the underlying service had no existing implementation for either of them

Cost

Cost here covers both time and money. This project saved a lot of development time because we already had some Solr expertise, and it has improved, and will keep improving, dev velocity for future projects. It will also save us some $$$ as the load on the DB and the search service goes down.

API

Solr provides APIs for indexing, schema changes, and even configuration changes. We use that very API to index from our consumers. We have deployed a total of three consumers (written in Go):

  • Full 
  • Partial
  • Delta

All consumers are event-based and are triggered by specific events:

  • Delta consumers listen to profile, preferences, and filters update events, and index in almost real-time.
  • The partial consumer runs every night (maxwell event) and indexes only the derived fields, which takes ~15-20 mins (the old cron ran for 3-4 hours and indexed all the fields unnecessarily).
  • The full consumer runs when there is a change in either the schema or the config, and indexes the entire active DB in less than an hour.
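
One way a consumer can update only a handful of fields is Solr's atomic update syntax, where each changed field carries a `set` modifier and the rest of the document is left untouched. The Go sketch below only builds the request body for `/solr/<collection>/update`. Treat it as an illustration rather than our production code: the field names (`id`, `last_active_days`) and the helper names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// atomicSetDoc builds one atomic-update document. The unique key is sent as
// a plain value; every other field is wrapped as {"set": value} so that Solr
// replaces just that field.
func atomicSetDoc(idField, id string, fields map[string]interface{}) map[string]interface{} {
	doc := map[string]interface{}{idField: id}
	for name, value := range fields {
		doc[name] = map[string]interface{}{"set": value}
	}
	return doc
}

// updateBody serializes a batch of documents into the JSON array that the
// update handler accepts.
func updateBody(docs ...map[string]interface{}) ([]byte, error) {
	return json.Marshal(docs)
}

func main() {
	// Hypothetical derived field refreshed by a nightly run.
	doc := atomicSetDoc("id", "42", map[string]interface{}{"last_active_days": 3})
	body, err := updateBody(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // [{"id":"42","last_active_days":{"set":3}}]
}
```

Atomic updates need the other fields of the document to be stored or to have docValues, because Solr rebuilds the full document internally before re-indexing it.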

New Indexing Flow

Testing

We ran both a load test and a stress test before shipping this to production, using K6 for both.

Load test

We first disabled the Solr cache, then used 1000 K6 virtual users to make many concurrent requests. The peak load on the existing Solr was 15k req/min, so we decided to test the new Solr with 50k req/min, and it ran smoothly without straining the infra.

Stress-test
CPU

Solr’s CPU went up to only 35% and could have handled more load.
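
For a sense of scale, the load-test numbers can be broken down with some quick arithmetic. The Go sketch below uses only the figures quoted above (50k req/min target, 15k req/min old peak, 1000 virtual users) and assumes the load is spread evenly across virtual users, which a real K6 run will only approximate.

```go
package main

import "fmt"

// perSecond converts a requests-per-minute figure to requests per second.
func perSecond(reqPerMin float64) float64 {
	return reqPerMin / 60
}

// perUser splits a request rate evenly across a number of virtual users.
func perUser(reqPerSec float64, users int) float64 {
	return reqPerSec / float64(users)
}

func main() {
	const (
		targetPerMin  = 50000.0 // load applied to the new Solr
		oldPeakPerMin = 15000.0 // peak load seen on the existing Solr
		virtualUsers  = 1000
	)

	rps := perSecond(targetPerMin)
	fmt.Printf("target rate: %.0f req/s\n", rps)                              // target rate: 833 req/s
	fmt.Printf("per virtual user: %.2f req/s\n", perUser(rps, virtualUsers)) // per virtual user: 0.83 req/s
	fmt.Printf("multiple of old peak: %.1fx\n", targetPerMin/oldPeakPerMin)   // multiple of old peak: 3.3x
}
```

So each virtual user only has to send a little under one request per second to hit the target, which is a bit over three times the old peak.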

Stress test

We decided to stress test the infra with 3X the number of requests the current production setup handles. We scaled the new staging infra (DB and back-end services) to match the production infra. Production handles ~4K req/min every day, so we decided to hit 12k req/min to see if the new Solr could handle it, and it did.

CPU 

As you can see, Solr’s CPU went up to only 17% during this test, while other services were far busier: the DB CPU went up to almost 80-85%.

DB

And the search service maxed out at 95% CPU.

Search Service 

To conclude, the new infra is much more capable of handling stress and load: almost three times the number of requests the existing infra can handle.

Performance

The performance has improved significantly, as you can see in the graphs below. They show 5 days of data, with the experiment at 40% A/B.

The p90-p99 latency of the old search on the proxy was ~3-4 secs; in the new service, it has dropped to ~1 sec.

p50

p75

p90

p99

The average number of requests also went down because, with the new schema, the search happens in a single round-trip.

Request Count

The average Solr latency has also decreased, from ~1 sec to ~340 ms.

Solr latency (male)

Solr latency (female)

As you must have noticed in the graphs above, even though the requests have increased on the new service, the response time still holds at around ~1 sec. We are sure that this will help our users see their matches much faster than ever before and will have a great impact on their journey in Sangam.

YAY!