Application is back online
Incident Report for Totango
Postmortem

Event Description 

Users attempting to log in to Totango were unable to do so. Login attempts would hang and eventually fail with a timeout.

Findings & Timeline

Background

At 16:26 UTC, Totango engineering received an alert for a crash loop in one of the DATA API pods in our US infrastructure.

Initial Investigation: February 7th, 2022 at 16:30 UTC

At 16:30 UTC, the Totango Operations team restarted the affected DATA API pod in an attempt to resolve the slow performance. Restarting the single pod did not resolve it. Since there was no clear root cause, a full restart of the DATA API was initiated at 16:45 UTC. At this stage, all four pods began crash looping, making the system unavailable.

Incident Mitigation: February 7th, 2022 at 17:05 UTC

At 17:05 UTC, the engineering team added two additional pods to the infrastructure to alleviate the load. Following the addition of these two pods, the system stabilized and returned to normal operation.

Issue identified: February 8th, 2022 at 08:00 UTC

Totango engineering determined that the issue was caused by a recent change to the system, deployed a day before the incident. The change introduced a call to the configuration database related to password expiration.

Resolved: February 8th, 2022 at 11:20 UTC

Once the call was cached, the load on the system dropped dramatically.

Root Cause

The root cause was a new JDBC call to the configuration database, added as part of password expiration. This call was not cached and placed undue load on the system.
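To illustrate the kind of fix described here, the following is a minimal sketch of a time-to-live cache placed in front of a configuration lookup. It is written in Python for brevity and is not Totango's implementation. The names `TTLCache` and `load_password_policy`, the 300-second TTL, and the returned policy values are all hypothetical, and the loader only simulates a database round trip.

```python
import threading
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

V = TypeVar("V")


class TTLCache(Generic[V]):
    """Caches the result of an expensive lookup for a fixed time-to-live."""

    def __init__(
        self,
        loader: Callable[[Hashable], V],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, V]] = {}
        self._lock = threading.Lock()

    def get(self, key: Hashable) -> V:
        # Holding the lock during the load keeps concurrent requests for a
        # cold key from all hitting the database at once.
        with self._lock:
            now = self._clock()
            entry = self._entries.get(key)
            if entry is not None and now < entry[0]:
                return entry[1]  # still fresh: no database call
            value = self._loader(key)
            self._entries[key] = (now + self._ttl, value)
            return value


# Hypothetical stand-in for the uncached configuration database call.
db_calls = 0


def load_password_policy(account_id: Hashable) -> dict:
    global db_calls
    db_calls += 1  # each invocation represents one database round trip
    return {"account_id": account_id, "max_password_age_days": 90}


policy_cache = TTLCache(load_password_policy, ttl_seconds=300)

# Simulate 1000 requests that all need the same account's policy.
for _ in range(1000):
    policy_cache.get("acct-1")

print(db_calls)  # 1
```

The trade-off in a sketch like this is staleness: a policy change in the database is not seen until the cached entry expires, so the TTL should match how quickly such changes need to take effect.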

Corrective Action

  1. Add caching to the new, excessive call to the configuration database. DONE
  2. Improve our visibility into DATA API metrics and HTTP filters. ONGOING
  3. Improve the DATA API readiness implementation. ONGOING

Long term

  1. Add auto-scaling to the DATA API. NOT STARTED
  2. Mimic the same load on the system after deployments. NOT STARTED
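As a rough illustration of what the auto-scaling item could involve: the report mentions pods but does not name the orchestrator, so this sketch assumes Kubernetes, whose Horizontal Pod Autoscaler documents its core rule as desired replicas = ceil(current replicas * current metric / target metric). The function name, the replica bounds, and the metric values below are hypothetical and are not measurements from this incident.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,  # assumed to mirror the documented Kubernetes default
) -> int:
    """Apply the documented HPA scaling rule, clamped to the allowed range."""
    if current_replicas <= 0 or target_metric <= 0:
        raise ValueError("current_replicas and target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: do not scale
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical example: 4 pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, 90.0, 60.0, min_replicas=4, max_replicas=10))  # 6
```

Scaling on a resource metric only helps when the extra load shows up in that metric, so the choice of metric matters as much as the rule itself.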

Posted Feb 16, 2022 - 14:54 UTC

Resolved
The incident has been resolved.
Posted Feb 16, 2022 - 14:54 UTC
Update
A root cause was identified and a hotfix was pushed into production to resolve this incident. We are continuing to monitor to ensure the stability of the system.
Posted Feb 09, 2022 - 14:22 UTC
Update
The system continues to operate as expected. We are continuing to monitor to ensure high availability.
Posted Feb 07, 2022 - 19:54 UTC
Update
The team is continuing to monitor while the system remains operational.
Posted Feb 07, 2022 - 17:57 UTC
Monitoring
The system has been brought back online and we are continuing to monitor the situation.
Posted Feb 07, 2022 - 17:05 UTC
Update
We are continuing to investigate this issue.
Posted Feb 07, 2022 - 16:56 UTC
Investigating
We have noticed page load delays and our engineering team is currently investigating.
Posted Feb 07, 2022 - 16:47 UTC
This incident affected: Totango Web Application.