Root Cause Analysis | AlayaCare Cloud ANZ Production Service Impairment | August 30, 2024

Sijan Shrestha

Incident Date

Aug 30, 2024 at 07:21 to Aug 30, 2024 at 08:34 (1 hour and 13 minutes). *All times listed in this report are in AEST using a 24-hour clock.

Customer Impact

Two Australian customers experienced slow application response time for all operations in the system.

Technical Summary

One Australian customer sent a sustained, high volume of calls to an integration endpoint. The volume exceeded the capacity of that customer's database to process the requests, and their application performance degraded as all queries to the affected database took longer to process. This also affected one other customer sharing the same database. Other customers experienced a brief period of slightly degraded performance while the system automatically adjusted to recover.

Root Causes

The primary root cause of the degraded database performance was the large increase in call volume to the integration endpoint. Performance could be fully restored only after the IP address originating those calls was blocked. A mechanism to limit the volume of integration calls that can reach the system would have prevented the incident; the absence of such a rate-limiting mechanism is a contributing root cause.

Timeline

Time (AEST) Event
07:21 Monitoring alerts triggered for performance degradation in the Australia region.
07:25 Responders identified that the performance issue was caused by a very high call volume on a single integration endpoint.
07:27 Responders reached out to AlayaCare CS to ask the customer to stop the problematic integration.
07:34 Responders started terminating slow queries via an automated job to keep the database stable.
08:15 The query-terminating job was unable to keep up with the volume of integration API calls. Responders began preparing a maintenance page to block the integrations.
08:24 The customer joined the incident response call to understand what they could do.
08:34 An ingress rule was put in place to block traffic from the integration source IP. The incident resolved almost immediately.

Incident Resolution

The incident was resolved by blocking the source IP of the integration calls that were overloading the system.
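
For illustration only, the snippet below is a minimal sketch of how an ingress block of this kind might be applied, assuming an AWS-hosted environment where inbound traffic can be denied at the network ACL level (AlayaCare's actual tooling may differ). The ACL ID, rule number, region, and source CIDR are hypothetical placeholders, not values from this incident.

    # Minimal sketch: deny inbound traffic from a single source IP at the
    # network ACL level. Assumes an AWS-hosted environment and configured
    # boto3 credentials; all identifiers below are hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="ap-southeast-2")

    ec2.create_network_acl_entry(
        NetworkAclId="acl-0123456789abcdef0",  # ACL in front of the application tier
        RuleNumber=90,                         # evaluated before the broader allow rules
        Protocol="-1",                         # all protocols
        RuleAction="deny",
        Egress=False,                          # inbound (ingress) rule
        CidrBlock="203.0.113.10/32",           # source IP of the integration calls
    )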

Addressing Root Causes – This Quarter

• Update the incident response workflow to block the offending source IP earlier in the response, so that similar incidents are resolved faster

• Improve the performance of the automated job that terminates long-running queries so that it can handle a larger volume (a simplified sketch of such a job follows this list)

• Enhance database tenant isolation to reduce the effect on other customers when a single customer has a large spike in API call volume
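
For context, the sketch below shows one way an automated query-terminating job can work, assuming a MySQL-compatible database and the PyMySQL client. The connection details, time threshold, and output format are illustrative assumptions, not AlayaCare's actual implementation.

    # Simplified sketch of a job that terminates long-running queries.
    # Assumes a MySQL-compatible database and the PyMySQL client; the
    # connection details and threshold are hypothetical placeholders.
    import pymysql

    MAX_QUERY_SECONDS = 30  # illustrative threshold for a "slow" query

    conn = pymysql.connect(host="db.example.internal", user="ops_killer",
                           password="***")
    try:
        with conn.cursor() as cur:
            # Find statements that have been running longer than the threshold.
            cur.execute(
                "SELECT id, time, info FROM information_schema.processlist "
                "WHERE command = 'Query' AND time > %s",
                (MAX_QUERY_SECONDS,),
            )
            for query_id, elapsed, sql_text in cur.fetchall():
                # KILL QUERY stops the statement without dropping the connection.
                cur.execute("KILL QUERY %s", (query_id,))
                print(f"Terminated query {query_id} after {elapsed}s: {(sql_text or '')[:80]}")
    finally:
        conn.close()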

Addressing Root Causes – Future Steps

• Implement a rate limiting service to prevent external integrations from having a negative impact on overall application performance (a minimal illustration of the approach follows below)
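
As a rough illustration of the approach, the sketch below applies a per-source token-bucket check before integration calls reach the application and database. The limits and in-memory storage are assumptions made for the example; a production rate limiting service would need shared state across nodes and per-customer configuration.

    # Minimal token-bucket rate limiter sketch for integration calls.
    # The limits and in-memory storage are illustrative assumptions only.
    import time

    class TokenBucket:
        """Allows `rate` calls per second with bursts of up to `capacity`."""

        def __init__(self, rate: float, capacity: float) -> None:
            self.rate = rate                    # tokens added per second
            self.capacity = capacity            # maximum burst size
            self.tokens = capacity              # start full so normal traffic is unaffected
            self.last_refill = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    # One bucket per calling source (hypothetical limits: 50 calls/s, burst of 100).
    buckets: dict[str, TokenBucket] = {}

    def allow_request(source_ip: str) -> bool:
        bucket = buckets.setdefault(source_ip, TokenBucket(rate=50.0, capacity=100.0))
        return bucket.allow()

    # Calls over the limit would be rejected (e.g. HTTP 429) instead of
    # reaching the database.
    if not allow_request("203.0.113.10"):
        print("429 Too Many Requests")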
