Degraded Performance
Incident Report for Totango
Postmortem

Event Description

Users were unable to log in, and users who were already logged in experienced slow system response times.

Findings & Timeline

First Noticed: February 16th, 2022 at 14:46 UTC

Our monitoring systems alerted us to degraded application performance and login failures. These issues lasted for 10 minutes, after which the system stabilized and returned to normal operation.

Issue Identified: February 16th, 2022 at 18:30 UTC

Following the incident, our team was able to identify the root cause of the temporary performance degradation.

Root Cause

  • The Totango engineering team found that an automated bot engaged with an NPS campaign sent by one of our customers, clicking several hundred emails at the same time. This caused an instant spike in load on the system (specifically on the data API) and prevented other processes, such as logins, from responding in a timely manner.
  • When users click on an NPS campaign email, they are redirected to a destination page where they confirm and submit their score. Loading this page several hundred times at once caused the load on the system.
  • The exceptional system load was resolved after all campaigns were processed.

Preventive Action

  • The Totango engineering team reviewed the NPS campaign processing flow and identified several areas to optimize:

    • Moving unnecessary data requests from the page load to after the user submits the form - In progress
    • Using a cache to improve the loading time of the destination page and reduce the load on the APIs - In progress
    • Increasing database resources to handle this kind of system load - In research
    • Optimizing the API rate limit to prevent this kind of load on the system - In research
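
As an illustration of the rate-limiting item above, a token bucket is one common way to absorb a short burst and then hold traffic to a steady rate. The sketch below is not Totango's implementation. The class name, the helper function, and the rate and capacity values are hypothetical and chosen only to show the idea.

```python
import time


class TokenBucket:
    """Hypothetical limiter: allow bursts up to `capacity`, then `rate` requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second (assumed value)
        self.capacity = capacity      # maximum burst size (assumed value)
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True and consume one token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def simulate_burst(bucket, requests):
    """Count how many of `requests` simultaneous calls are admitted."""
    return sum(1 for _ in range(requests) if bucket.allow())
```

With a capacity of 50, a simultaneous burst of 300 requests would admit 50 and reject the rest, so a sudden wave of page loads could not consume all of the capacity that other requests, such as logins, depend on. Rejected requests could be retried or served a lightweight response, depending on the endpoint.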

Posted Feb 22, 2022 - 18:43 UTC

Resolved
This incident has been resolved.
Posted Feb 22, 2022 - 18:42 UTC
Monitoring
There is currently no impact to system performance. We are continuing to work to optimize the system following the events of this past Wednesday and we also continue to monitor for any system performance degradation.
Posted Feb 18, 2022 - 13:39 UTC
Identified
We have identified the cause of the recent performance degradation. We have taken steps to mitigate this issue in the short term and we are working on a permanent solution.
Posted Feb 16, 2022 - 18:29 UTC
Update
While we continue to investigate the issue, the system is fully functional at this time.
Posted Feb 16, 2022 - 17:43 UTC
Investigating
We are currently investigating reports of degraded performance using Totango.
Posted Feb 16, 2022 - 16:40 UTC
This incident affected: Totango Web Application.