Users were unable to log in, and users who were already logged in experienced slow system response times.
Findings & Timeline
First Noticed: July 6th, 2022 at 13:50 UTC
Our monitoring systems alerted us to degraded application performance and login failures. Engineering teams began investigating.
Problem Area Identified: July 6th, 2022 at 14:00 UTC
At this point, our team had identified the problem area and began working to mitigate the issue and trace the root cause. A handful of users appeared to be overloading our systems with numerous API calls.
July 6th, 2022 at 15:30 UTC
Efforts to mitigate the issue continued, though the root cause had not yet been identified.
July 6th, 2022 at 17:00 UTC
The root cause analysis located the problematic code. A decision was made to revert a recent code deployment.
July 6th, 2022 at 17:30 UTC
Reverting the problematic code showed good results and resolved the issue for all but two users.
July 6th, 2022 at 18:30 UTC
The two users were contacted and guided through logging out of the system to stop the erroneous API calls.
Fix the original bug that caused this issue by making sure the call to fetch contacts from the DB is made only once - Done
Enforce rate limits across all components - Not Started
Isolate user tokens to better control rate limits - Not Started
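
As an illustration of the two rate-limiting items above, here is a minimal sketch of a token-bucket limiter that keeps a separate bucket for each user token, so one user issuing excessive API calls is throttled without affecting others. It is not the implementation used in our systems. The class name `PerTokenRateLimiter`, the `capacity` and `refill_per_second` parameters, and their default values are all assumptions chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _Bucket:
    tokens: float       # requests this user token may still make right now
    last_refill: float  # clock reading at the last refill


class PerTokenRateLimiter:
    """Hypothetical in-memory limiter: one token bucket per user token."""

    def __init__(
        self,
        capacity: float = 10.0,              # assumed burst size
        refill_per_second: float = 5.0,      # assumed sustained request rate
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.clock = clock
        self.buckets: Dict[str, _Bucket] = {}

    def allow(self, user_token: str) -> bool:
        """Return True if this user token may make one more API call now."""
        now = self.clock()
        bucket = self.buckets.get(user_token)
        if bucket is None:
            # First request from this user token: start with a full bucket.
            bucket = _Bucket(tokens=self.capacity, last_refill=now)
            self.buckets[user_token] = bucket

        # Refill in proportion to the elapsed time, capped at capacity.
        elapsed = now - bucket.last_refill
        bucket.tokens = min(
            self.capacity, bucket.tokens + elapsed * self.refill_per_second
        )
        bucket.last_refill = now

        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False  # the caller would typically respond with HTTP 429
```

An API handler would call `limiter.allow(user_token)` before doing any work and reject the request when it returns `False`. Because each user token has its own bucket, a single misbehaving client exhausts only its own allowance. A deployment spanning multiple components would likely need a shared store for the buckets rather than a per-process dictionary.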