Root Cause Analysis | AlayaCare Cloud ANZ Production Service Impairment | May 6, 2025

Sijan Shrestha

Incident Summary – May 6, 2025

Incident Type: Service Degradation
Severity: SEV0
Duration: 57 minutes
Region Impacted: Australia (ANZ tenants)
Services Affected: Scheduling, Visit Verification (VV), Invoice Generation, and other areas dependent on database access.


What Happened

On May 6, 2025 (AEST), we experienced a significant service degradation affecting approximately 100 tenants in our Australian production environment. The issue was caused by a sudden and sustained spike in database load, which impacted the performance of core services, including scheduling and visit verification.

The root cause was traced to a combination of:

  • A change in tenant activity patterns, combined with the queries used to retrieve context fields, overloaded a database-internal cache of table information (the open_table_cache); the sketch below illustrates how this kind of cache pressure can be observed.

  • A complex, long-running query pattern tied to the Visit Verification (VV) feature was also observed during the incident. It did not directly contribute to the degradation, but queries of this duration are poor practice and we are working to eliminate them.

  • The cache overload drove sustained high CPU usage on the multi-tenant database server.

The cache overload and the resulting CPU saturation left the database server unable to keep up, causing slow or failed page loads for impacted users.
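To make the cache behaviour more concrete, the following is a minimal Python sketch of how this kind of table-cache pressure can be observed. It assumes a MySQL-compatible server where the cache described above corresponds to the table_open_cache variable and the Opened_tables status counter; the host, credentials, and threshold are placeholders, not details of our environment.

    # Minimal sketch: measure table-cache pressure on a MySQL-compatible server.
    # Assumes the cache described above maps to the table_open_cache variable and
    # the Opened_tables status counter; connection details are placeholders.
    import time
    import pymysql

    def fetch_value(cursor, sql):
        cursor.execute(sql)
        return int(cursor.fetchone()[1])  # rows look like (Variable_name, Value)

    conn = pymysql.connect(host="db.example.internal", user="monitor", password="secret")
    with conn.cursor() as cur:
        cache_size = fetch_value(cur, "SHOW GLOBAL VARIABLES LIKE 'table_open_cache'")
        before = fetch_value(cur, "SHOW GLOBAL STATUS LIKE 'Opened_tables'")
        time.sleep(60)  # sample over one minute
        after = fetch_value(cur, "SHOW GLOBAL STATUS LIKE 'Opened_tables'")
    conn.close()

    reopened = after - before  # tables re-opened because the cache evicted them
    print(f"table_open_cache={cache_size}, tables re-opened in the last minute={reopened}")
    if reopened > 100:  # illustrative threshold only
        print("Cache is likely undersized for the current activity pattern.")

Sustained growth of Opened_tables relative to the configured cache size is a common sign that the cache is too small for the current activity pattern.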


Resolution and Recovery

During the incident, we:

  • Terminated a set of problematic long-running Visit Verification queries (see the sketch below for one way such queries can be identified and stopped). This was a secondary issue observed while investigating the incident, not its cause.

  • Temporarily reduced system workload to stabilize the environment.

  • Adjusted the database configuration for the open_table_cache value to account for the new traffic pattern.

These actions led to full recovery within 57 minutes. Customers were notified once the system was confirmed to be fully operational.
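For illustration only, here is a minimal Python sketch of one way long-running queries can be identified and terminated on a MySQL-compatible server. It is not the exact procedure used during the incident; the 300-second cutoff and connection details are assumptions.

    # Minimal sketch: terminate queries running longer than a cutoff on a
    # MySQL-compatible server. Cutoff and credentials are illustrative.
    import pymysql

    LONG_RUNNING_SECONDS = 300  # illustrative cutoff

    conn = pymysql.connect(host="db.example.internal", user="ops", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, time, LEFT(info, 80)
            FROM information_schema.processlist
            WHERE command = 'Query' AND time > %s
            """,
            (LONG_RUNNING_SECONDS,),
        )
        for thread_id, runtime, snippet in cur.fetchall():
            print(f"Terminating thread {thread_id} after {runtime}s: {snippet}")
            cur.execute(f"KILL QUERY {int(thread_id)}")  # ends the statement, keeps the connection
    conn.close()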


Preventive Measures and Next Steps

We take this incident seriously and have already implemented several improvements, including:

  • Increased the open_table_cache parameter to a size appropriate for the new activity patterns.

  • Throttling of high-impact background jobs (a brief illustration of this kind of throttling follows this list).
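As an illustration of the throttling idea (not our actual job system), the Python sketch below caps how many database-heavy background jobs may run at the same time. The job body and the limit of two concurrent jobs are placeholders.

    # Minimal sketch: throttle database-heavy background jobs with a concurrency cap.
    import threading
    import time

    MAX_CONCURRENT_HEAVY_JOBS = 2
    heavy_job_slots = threading.BoundedSemaphore(MAX_CONCURRENT_HEAVY_JOBS)

    def run_heavy_job(job_id: int) -> None:
        with heavy_job_slots:          # blocks until one of the slots is free
            print(f"job {job_id} started")
            time.sleep(1)              # stand-in for real database work
            print(f"job {job_id} finished")

    threads = [threading.Thread(target=run_heavy_job, args=(i,)) for i in range(6)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()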

Further initiatives are in progress to improve the scalability and reliability of our platform, including:

  • Migration of some tenants to separate database instances to better distribute load (ETA: early June).

  • Improvements in how we query for context fields to reduce impacts on the open_table_cache.

  • Ongoing optimization of Visit Verification queries to reduce their resource usage, addressing the secondary issue noted above and preventing future problems.

  • Enhanced performance monitoring and alerting.

  • A long-term redesign of the key query logic that powers the Visit Verification and scheduling pages.

  • Broader infrastructure tuning across regions.


Our Commitment

We deeply regret the disruption caused by this incident. Reliability is a top priority at AlayaCare, and we are committed to ongoing investments in performance, scalability, and rapid incident response. Thank you for your continued trust.
