I would like to share a debugging session we had for an issue related to Kafka consumer rebalancing. A few days ago we got an alert that the Shaadi Metric Beat (SMB) consumer lag was consistently increasing.
The APM of the service showed it repeatedly going up and down. The service's container was thrashing, and we observed latency shooting up as well, reaching 40-50 seconds in some cases.
And so we thought we had our culprit. SMB makes three API calls to enrich the data for an event:
a) Profile API for profile attributes
b) AB-EXP service for getting AB groups for all the experiments of the user
c) Session-API for getting session-related info of the user.
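For context, the enrichment step can be sketched roughly like this (the function and field names are our assumptions, not SMB's actual code; the three callables stand in for the Profile, AB-EXP, and Session clients):

```python
# Hypothetical sketch of SMB's per-event enrichment. Each fetcher is a
# stand-in for an HTTP client call to the corresponding downstream service.
def enrich(event, fetch_profile, fetch_ab_groups, fetch_session):
    """Return a copy of the event enriched by three downstream calls."""
    user_id = event["user_id"]
    enriched = dict(event)
    enriched["profile"] = fetch_profile(user_id)        # a) profile attributes
    enriched["experiments"] = fetch_ab_groups(user_id)  # b) AB groups per experiment
    enriched["session"] = fetch_session(user_id)        # c) session info
    return enriched
```

The important point for what follows: all three calls sit on the consumer's message-processing path, so any one of them being slow stalls the whole poll loop.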
We then went ahead and checked the latency graphs of all the downstream services. And to our surprise, we found no anomalies in any of the three.
We then assumed there must be an increase in the indexing latency of ES. But we could not find any load or increase in indexing latency there either. This drove us insane: our service's latency was going up while every underlying service showed no latency increase at all. After some digging into the logs, we discovered that the containers were thrashing because the consumer kept leaving the group, and that this was happening because of long-running message processing.
The constant rebalancing, and the resulting lack of APM data, made it even harder to debug what was causing message processing to time out, especially since the latency graphs of all the services looked normal.
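The mechanism behind the thrashing is worth spelling out: if a consumer does not call poll() again within Kafka's `max.poll.interval.ms` (default 5 minutes), the broker considers it dead, evicts it from the group, and triggers a rebalance. A back-of-the-envelope check (the numbers are illustrative, using Kafka's defaults):

```python
# Kafka evicts a consumer that fails to call poll() again within
# max.poll.interval.ms (default 300000 ms). With the default
# max.poll.records of 500, per-message latency spikes multiply fast.
MAX_POLL_INTERVAL_MS = 300_000
MAX_POLL_RECORDS = 500

def triggers_rebalance(per_message_ms, batch_size=MAX_POLL_RECORDS):
    """True if processing one polled batch overruns the poll interval."""
    return batch_size * per_message_ms > MAX_POLL_INTERVAL_MS

# A healthy 100 ms per message: 500 * 100 ms = 50 s, within the window.
# Our worst case of 40 s per message blows far past the 5-minute limit.
```

This is why a tiny fraction of slow messages was enough to knock the consumer out of the group again and again.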
After much mutual head-scratching and adding more logs, we got to know that the 'poll timeout' error showed up whenever the AB service was called. So we went back to our AB service APM dashboard, and observed that there was a change in the latency distribution graph of the AB service.
From the latency distribution graph it became clear that a handful of requests were taking seconds or even minutes. The average latency graph had not indicated the actual picture, because these requests made up less than 0.1% of the total requests served by the AB service. But those long-tail requests were giving the consumer a hard time.
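To see why the average hid the problem, consider a toy distribution where just 0.1% of requests take a minute (the numbers here are illustrative, not our production data):

```python
# 10,000 requests: 99.9% finish in 50 ms, 0.1% take a full minute.
latencies_ms = [50] * 9_990 + [60_000] * 10

average = sum(latencies_ms) / len(latencies_ms)
p999 = sorted(latencies_ms)[int(0.999 * len(latencies_ms))]

print(average)  # ~110 ms: the average still looks perfectly healthy
print(p999)     # 60000 ms: the tail the consumer actually runs into
```

The average barely moves, but every one of those tail requests is long enough to stall a consumer's poll loop, which is exactly what the distribution graph revealed and the average graph did not.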
To solve this, we reduced the timeout of the API call from 10 seconds to 1 second. We knew this would have an obvious caveat: events on which the AB service times out will not carry experiment data. But at least it would no longer push the consumer into a rebalancing loop.
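Generically, the fix looks like wrapping the AB-EXP call in a hard 1-second deadline and degrading gracefully. A minimal sketch, assuming a thread-pool wrapper (the helper name and approach are our illustration, not SMB's actual client code):

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def fetch_ab_groups_with_deadline(call, timeout_s=1.0):
    """Run the AB-EXP call with a hard deadline.

    Returns the call's result, or None on timeout/error so the event is
    emitted without experiment data instead of stalling the poll loop.
    """
    future = _pool.submit(call)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        return None
```

One design note: `future.result(timeout=...)` abandons the slow call rather than cancelling it, so the worker thread still finishes in the background; the point is only that the consumer's poll loop stops waiting on it.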
Another solution could be to increase the poll timeout for the consumer. But in our case, we could afford to let some events go without AB-EXP data when the request takes more than 1 second to respond.
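For reference, that alternative knob looks like this with kafka-python-style parameter names (an assumption about the client library; the values are illustrative, not our production settings):

```python
# Sketch: give the poll loop more headroom instead of (or alongside)
# shortening the downstream API timeout. Parameter names follow
# kafka-python's KafkaConsumer.
consumer_kwargs = dict(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    group_id="smb-consumer",             # hypothetical group id
    max_poll_interval_ms=600_000,        # allow up to 10 min between polls
    max_poll_records=100,                # smaller batches shrink worst-case batch time
)
```

Lowering `max_poll_records` is often the safer companion change, since it bounds how much a single slow message can inflate the time between polls.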
The purpose of writing up this debugging journey is to share the insights we gained while fixing the issue: latency distribution graphs are very helpful, and so is understanding Kafka's poll timeout settings and consumer rebalancing behavior.
Although it took us a wee bit longer to investigate this mystery, we hope it helps in saving some time in your future debugging adventures!
Happy APMing, y'all!