Issue in plays

Incident Report for Totango

Postmortem

Event Description

SuccessPlays unexpectedly fired in the US region on April 19, 14:15 UTC, because the application was using an old Redis cluster.

Findings & Timeline

Reported: April 19, 14:15 UTC. Totango system monitoring detected the issue.

Identified: April 19, 15:20 UTC. The root cause was identified as a DNS change that switched the application from the new Redis cluster back to the old one.

The application team immediately stopped all SuccessPlay triggering. By that time, the SuccessPlay engine had already triggered a few thousand tasks and events, and the associated task-creation emails had been sent to users.

The application team then created a cleanup plan, which included deleting duplicated tasks and events and reverting attribute changes.

Resolved: April 21, 19:58 UTC. Duplicated task deletion was completed.

April 21, 20:51 UTC. Cleanup plan execution was completed in full, including event deletion and reverting attribute changes.

Root Cause

A DNS change caused the application to actively use an old Redis cluster instead of the new one, which caused SuccessPlays to fire unexpectedly.

Corrective Action

Deprecate and isolate old components when a replacement is created - Completed
Implement a monitoring system to alert on any manual changes performed - In Progress (ETA: June 30th)
Enforce a single config file for all application DNS entries - Planned (ETA: August 2023)
Create a dedicated runbook to tackle similar cases in the future - In Progress (ETA: May 5th)
Implement anomaly detection flows to identify changes in automation triggering behavior, pause it, and alert about it - In Progress (ETA: June 1st)
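
As an illustration of the anomaly detection item above, the following is a minimal sketch of a rate-based guard that detects an unusual burst of automation triggers, pauses further triggering, and raises an alert. It is not Totango's implementation: the class name `TriggerGuard`, its parameters, and the threshold rule (pause once the count in a sliding window exceeds a multiple of an assumed known baseline) are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque


@dataclass
class TriggerGuard:
    """Hypothetical guard: pauses automation when triggers spike in a sliding window."""

    baseline_per_window: float            # assumed: expected triggers per window
    window_seconds: float = 300.0         # assumed window length
    max_ratio: float = 5.0                # pause above baseline * max_ratio
    alert: Callable[[str], None] = print  # assumed alerting hook
    paused: bool = False
    _timestamps: Deque[float] = field(default_factory=deque)

    def allow(self, now: float) -> bool:
        """Record a trigger attempt at time `now`; return False if it is blocked."""
        if self.paused:
            return False
        # Drop timestamps that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        self._timestamps.append(now)
        count = len(self._timestamps)
        if count > self.baseline_per_window * self.max_ratio:
            # Anomaly: stop all further triggering and notify operators.
            self.paused = True
            self.alert(
                f"Trigger rate anomaly: {count} triggers in "
                f"{self.window_seconds:.0f}s; automation paused"
            )
            return False
        return True

    def resume(self) -> None:
        """Manually re-enable triggering after an operator has investigated."""
        self._timestamps.clear()
        self.paused = False
```

In this sketch, the caller would check `guard.allow(time.time())` before each automation fires and skip the action when it returns `False`. Resuming is deliberately manual, so that triggering restarts only after someone has confirmed the cause.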

Posted Apr 27, 2023 - 16:42 UTC

Resolved

All tasks and events have been deleted and all attributes have been reverted to their previous state unless otherwise requested.

Posted Apr 21, 2023 - 20:51 UTC

Update

Task deletions have completed successfully and any attributes should be reverted to their previous state unless otherwise requested.

Posted Apr 21, 2023 - 19:58 UTC

Update

We are continuing our work to remove tasks and have started work on reverting any attribute changes that were made.

Posted Apr 21, 2023 - 13:57 UTC

Update

We are working to delete any duplicated tasks created as a result of this issue.

Posted Apr 20, 2023 - 11:17 UTC

Monitoring

SuccessPlays are now active and should begin to process again. We are working on making sure systems return to normal operations.

Posted Apr 19, 2023 - 20:16 UTC

Update

We are planning to re-enable SuccessPlays in the next 15-30 minutes. Another update will be posted following this action. As always, we appreciate your patience.

Posted Apr 19, 2023 - 19:17 UTC

Identified

We have identified the cause for SuccessPlays firing unexpectedly. We have currently stopped SuccessPlay processing on all services while we work to remediate the issues for those services which have been impacted. We will provide further updates as our fixes are implemented. Thank you for your patience.

Posted Apr 19, 2023 - 16:55 UTC

Investigating

We are currently investigating an issue regarding Plays.

Posted Apr 19, 2023 - 15:20 UTC

This incident affected: Tasks and SuccessPlays.