Issue in plays
Incident Report for Totango
Postmortem

Event Description 

SuccessPlays fired unexpectedly in the US region on April 19 at 14:15 UTC because an old Redis cluster was in use.

Findings & Timeline

Reported: April 19, 14:15 UTC. Totango system monitoring detected the issue.

Identified: April 19, 15:20 UTC. The root cause was identified as a DNS change that pointed the application at the old Redis cluster instead of the new one.

The application team immediately stopped all SuccessPlay triggering. By that time, the SuccessPlay engine had already triggered a few thousand tasks and events, along with the task creation emails that were sent to users.

The application team then created a cleanup plan, which included deleting duplicated tasks and events and reverting attribute changes.

Resolved: April 21, 19:58 UTC. Deletion of duplicated tasks was completed.

April 21, 20:51 UTC. Execution of the cleanup plan was completed in full, including deleting events and reverting attribute changes.

Root Cause

An old Redis cluster was being used actively instead of the new one due to a DNS change, causing SuccessPlays to fire unexpectedly.
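
To illustrate the failure mode described above, here is a minimal sketch of a startup guard that refuses to proceed when a Redis hostname resolves to an address outside an expected set. It is an assumption-laden example, not Totango's implementation: the function names, the hostname `redis.example.internal`, and the IP addresses are all hypothetical.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ips(hostname: str, port: int = 6379) -> Set[str]:
    """Return the set of IP addresses that `hostname` currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def check_endpoint(
    hostname: str,
    expected_ips: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_ips,
) -> None:
    """Raise RuntimeError if `hostname` resolves to nothing or to an unexpected address.

    `expected_ips` is a hypothetical allowlist for the current (new) cluster.
    """
    resolved = resolver(hostname)
    unexpected = resolved - set(expected_ips)
    if not resolved or unexpected:
        raise RuntimeError(
            f"{hostname} resolved to {sorted(resolved)}, "
            f"unexpected addresses: {sorted(unexpected)}"
        )
```

A service could call `check_endpoint("redis.example.internal", {"10.0.0.5"})` before opening its Redis connection, so that a DNS record pointing back at a retired cluster fails loudly instead of silently serving stale state.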

Corrective Action

  • Deprecate and isolate old components when a replacement is created - Completed
  • Implement a monitoring system to alert on any manual changes performed - In Progress (ETA: June 30th)
  • Enforce a single config file for all application DNS entries - Planned (ETA: August 2023)
  • Create a dedicated runbook to tackle similar cases in the future - In Progress (ETA: May 5th)
  • Implement anomaly detection flows to identify changes in automation triggering behavior, pause it, and alert about it - In Progress (ETA: June 1st)
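
The last corrective action (detect a change in triggering behavior, pause, and alert) could be sketched roughly as follows. This is a simplified sliding-window rate check under stated assumptions, not the actual detection flow: the class name, the thresholds, and the `on_alert` callback are hypothetical.

```python
from collections import deque
from typing import Callable, Deque


class TriggerRateGuard:
    """Pause automation when triggers in a sliding window exceed a multiple of a baseline."""

    def __init__(
        self,
        baseline_per_window: float,
        max_ratio: float = 3.0,
        window_seconds: float = 300.0,
        on_alert: Callable[[str], None] = print,
    ) -> None:
        self.baseline_per_window = baseline_per_window
        self.max_ratio = max_ratio
        self.window_seconds = window_seconds
        self.on_alert = on_alert
        self.paused = False
        self._timestamps: Deque[float] = deque()

    def record_trigger(self, now: float) -> bool:
        """Record one trigger at time `now` (seconds); return True if it may proceed."""
        if self.paused:
            return False
        # Drop triggers that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        self._timestamps.append(now)
        if len(self._timestamps) > self.baseline_per_window * self.max_ratio:
            # Anomalous volume: pause further triggering and alert once.
            self.paused = True
            self.on_alert(
                f"{len(self._timestamps)} triggers in {self.window_seconds}s "
                f"exceeds {self.max_ratio}x the baseline; triggering paused"
            )
            return False
        return True
```

Once paused, the guard stays paused until an operator resets it, which reflects the choice to require a human decision before automation resumes.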
Posted Apr 27, 2023 - 16:42 UTC

Resolved
All tasks and events have been deleted and all attributes have been reverted to their previous state unless otherwise requested.
Posted Apr 21, 2023 - 20:51 UTC
Update
Task deletions have completed successfully and any attributes should be reverted to their previous state unless otherwise requested.
Posted Apr 21, 2023 - 19:58 UTC
Update
We are continuing our work to remove tasks and have started work on reverting any attribute changes that were made.
Posted Apr 21, 2023 - 13:57 UTC
Update
We are working to delete any duplicated tasks created as a result of this issue.
Posted Apr 20, 2023 - 11:17 UTC
Monitoring
SuccessPlays are now active and should begin to process again. We are working on making sure systems return to normal operations.
Posted Apr 19, 2023 - 20:16 UTC
Update
We are planning to re-enable SuccessPlays in the next 15-30 minutes. Another update will be posted following this action. As always we appreciate your patience.
Posted Apr 19, 2023 - 19:17 UTC
Identified
We have identified the cause for SuccessPlays firing unexpectedly. We have currently stopped SuccessPlay processing on all services while we work to remediate the issues for those services which have been impacted. We will provide further updates as our fixes are implemented. Thank you for your patience.
Posted Apr 19, 2023 - 16:55 UTC
Investigating
We are currently investigating an issue regarding Plays.
Posted Apr 19, 2023 - 15:20 UTC
This incident affected: Tasks and SuccessPlays.