Application is back online

Incident Report for Totango

Postmortem

Event Description

Users attempting to log in to Totango were unable to do so. Login requests would hang and eventually fail with a timeout.

Findings & Timeline

Background

At 16:26 UTC, Totango engineering received an alert for a crash loop in one of the DATA API pods in our US infrastructure.

Initial Investigation: February 7th, 2022 at 16:30 UTC

At 16:30 UTC, the Totango Operations team restarted the DATA API pod in question, hoping this would resolve the slow performance. As the investigation continued, we found that restarting the single pod had not resolved it. Since there was no clear root cause, a full restart of the DATA API was initiated at 14:45 UTC. At this stage, all 4 pods began crash looping, causing the system to become unavailable.

Incident Mitigation: February 7th, 2022 at 17:05 UTC

At 17:05 UTC, the engineering team added two additional pods to the infrastructure to alleviate the load. Following this addition, the system stabilized and returned to normal operation.

Issue identified: February 8th, 2022 at 08:00 UTC

Totango engineering determined that the issue was caused by a recent change to the system, which had been deployed the day before. The change introduced a call to the configuration database related to password expiration.

Resolved: February 8th, 2022 at 11:20 UTC

Caching this call reduced the load on the system dramatically.

Root Cause

The root cause was identified as a new JDBC call to the configuration database, made as part of password expiration. This new call was not cached and put undue stress on the system.
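
To illustrate the kind of fix described here, the following is a minimal sketch of a time-based cache placed in front of an expensive lookup. It is not Totango's actual implementation: the class name ExpiringCache, the loader function (which stands in for the JDBC query), the key "account-42", the value 90, and the 60-second time-to-live are all hypothetical and chosen only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch: caches per-key lookups for a fixed time-to-live. */
final class ExpiringCache<K, V> {
    /** A cached value together with the time at which it was loaded. */
    private static final class Entry<T> {
        final T value;
        final long loadedAtMillis;

        Entry(T value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stands in for the database query
    private final long ttlMillis;
    private final LongSupplier clock;    // injectable so expiry can be tested

    ExpiringCache(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, invoking the loader only if the entry is missing or stale. */
    V get(K key) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent callers for the same key
        // trigger at most one load while the entry is fresh.
        Entry<V> entry = entries.compute(key, (k, old) ->
                (old != null && now - old.loadedAtMillis < ttlMillis)
                        ? old
                        : new Entry<>(loader.apply(k), now));
        return entry.value;
    }

    public static void main(String[] args) {
        int[] databaseCalls = {0};
        // Assumed loader: pretends to read a password-expiry setting (in days).
        ExpiringCache<String, Integer> cache = new ExpiringCache<>(
                accountId -> { databaseCalls[0]++; return 90; },
                60_000L,
                System::currentTimeMillis);
        cache.get("account-42");
        cache.get("account-42"); // served from the cache within the TTL
        System.out.println("database calls: " + databaseCalls[0]);
    }
}
```

With a cache like this, repeated requests for the same key within the time-to-live are answered from memory, so the database sees one query per key per interval rather than one per request. The trade-off is that a configuration change can take up to one time-to-live to become visible.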

Corrective Action

Add caching to the new, excessive call to the configuration database. DONE
Improve our visibility into DATA API metrics and HTTP filters. ONGOING
Improve the DATA API readiness implementation. ONGOING

Long term

Add auto-scaling to the DATA API. NOT STARTED

Simulate the same load on the system after deployments. NOT STARTED

Posted Feb 16, 2022 - 14:54 UTC

Resolved

The incident has been resolved.

Posted Feb 16, 2022 - 14:54 UTC

Update

A root cause was identified and a hotfix was pushed into production to resolve this incident. We are continuing to monitor to ensure the stability of the system.

Posted Feb 09, 2022 - 14:22 UTC

Update

The system continues to operate as expected. We are continuing to monitor to ensure high availability.

Posted Feb 07, 2022 - 19:54 UTC

Update

The team is continuing to monitor while the system remains operational.

Posted Feb 07, 2022 - 17:57 UTC

Monitoring

The system has been brought back online and we are continuing to monitor the situation.

Posted Feb 07, 2022 - 17:05 UTC

Update

We are continuing to investigate this issue.

Posted Feb 07, 2022 - 16:56 UTC

Investigating

We have noticed page load delays and our engineering team is currently investigating.

Posted Feb 07, 2022 - 16:47 UTC

This incident affected: Totango Web Application.