Application is down
Incident Report for Totango
Postmortem

Event Description 

Users attempting to log in, work with segments, or view account profiles experienced latency, system timeouts, and an inability to log in.

Findings & Timeline

Reported: August 1st, 2022 17:52 GMT

At 17:52 GMT, the Totango Engineering team reported latency, and Totango users began to experience issues viewing segments and account profiles, along with occasional issues logging into the application.

Identified: August 1st, 2022 18:48 GMT

At 18:48 GMT, the Totango Engineering team implemented a short-term fix that allowed normal operations to resume temporarily. However, the root cause had not yet been fully resolved.

At 19:42 GMT, the system issues returned and users were no longer able to access the application.

At 20:54 GMT, the issue was isolated, allowing us to restore normal operations to the application.

Resolved: August 1st, 2022 21:24 GMT

The root cause was fully resolved.

Root Cause

A sudden, large volume of account data was introduced into the shared index in Elasticsearch. This data load caused search requests from all services on the shared index to time out, which led the Data API thread pool to fill and eventually shut down, leaving the system unresponsive.
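One common guard against this failure mode is a bulkhead: cap the number of in-flight search calls and reject extra calls immediately, so slow searches cannot occupy every worker thread. The sketch below is illustrative only and is not Totango's implementation. The names `Bulkhead`, `BulkheadFull`, and `run_search` are hypothetical, and `search_fn` stands in for whatever call the service makes to the search backend.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when no slot is free; the caller should fail fast."""


class Bulkhead:
    """Caps in-flight calls so slow searches cannot occupy every worker thread."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject at once instead of queueing behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("search capacity exhausted")
        try:
            yield
        finally:
            self._slots.release()


def run_search(bulkhead, search_fn):
    """Run search_fn inside the bulkhead; return None if capacity is exhausted."""
    try:
        with bulkhead.slot():
            return search_fn()
    except BulkheadFull:
        # Hypothetical fallback: the caller could show a degraded view or retry later.
        return None
```

With a limit of two, a third concurrent call returns `None` right away rather than waiting on a timed-out search. Under this assumption, threads outside the bulkhead remain free for requests that do not touch the shared index, such as logins.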

Corrective Action

  • Re-shard the problematic index out of the shared index - Done
  • Increase monitoring and alerting for when a service approaches a high volume of documents on the shared index - Ongoing
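As a rough illustration of the kind of check the monitoring item describes, the sketch below flags services whose document count on a shared index is approaching a configured ceiling. It assumes the per-service counts have already been collected elsewhere. The function name `services_near_limit`, the `limit` value, and the 80% warning ratio are hypothetical and not taken from Totango's setup.

```python
def services_near_limit(doc_counts, limit, warn_ratio=0.8):
    """Return (service, usage_ratio) pairs at or above warn_ratio, highest first.

    doc_counts: mapping of service name -> document count on the shared index.
    limit:      assumed per-service document ceiling before re-sharding is needed.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    flagged = [
        (name, count / limit)
        for name, count in doc_counts.items()
        if count / limit >= warn_ratio
    ]
    # Sort so the service closest to (or past) the ceiling is reported first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counts for three services sharing one index.
    counts = {"accounts": 900, "segments": 100, "profiles": 800}
    for service, ratio in services_near_limit(counts, limit=1000):
        print(f"{service}: {ratio:.0%} of document limit")
```

A scheduled job could run a check like this and raise an alert for each flagged service, giving time to move a fast-growing service onto its own index before it affects others.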
Posted Aug 02, 2022 - 20:28 UTC

Resolved
This incident has been resolved.
Posted Aug 02, 2022 - 18:42 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 01, 2022 - 21:24 UTC
Update
We have taken steps to mitigate the issues and the system should be returned to normal operation. We are continuing to review and make sure this does not change. Please accept our apologies for the incident.
Posted Aug 01, 2022 - 20:54 UTC
Investigating
We have identified performance issues on the system again and have resumed investigating the issue as a high priority.
Posted Aug 01, 2022 - 19:49 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 01, 2022 - 18:48 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Aug 01, 2022 - 18:20 UTC
Investigating
We are currently investigating problems logging into the Totango application and are working to restore service.
Thank you for your patience while we’re looking into it.
Posted Aug 01, 2022 - 17:52 UTC
This incident affected: Totango Web Application.