Users attempting to log in to Totango were unable to do so. Login attempts timed out and eventually failed.
Background
At 16:26 UTC, Totango engineering got an alert for a crash loop in one of the DATA API pods in our US infrastructure.
Initial Investigation: February 7th, 2022 at 16:30 UTC
At 16:30 UTC the Totango Operations team restarted the DATA API pod in question, hoping this would resolve the slow performance. As the investigation continued, we noted that restarting the single pod had not resolved the slow performance. Since there was no clear root cause, a full restart of the DATA API was initiated at 16:45 UTC. At this stage all 4 pods began crash looping, causing the system to become unavailable.
Incident Mitigation: February 7th, 2022 at 17:05 UTC
At 17:05 UTC the engineering team added two additional pods to the infrastructure, hoping to alleviate the load. Following this addition, the system stabilized and returned to normal operation.
Issue identified: February 8th, 2022 at 08:00 UTC
Totango engineering determined that the issue was caused by a recent change to the system, which had been deployed a day before. The change introduced a call to the configuration database relating to password expiration.
Resolved: February 8th, 2022 at 11:20 UTC
Once the call was cached, the load on the system dropped dramatically.
Root Cause
The root cause was identified as a new JDBC call to the configuration database, introduced as part of password expiration. This new call was not cached and placed undue stress on the system.
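To illustrate the kind of fix described above, here is a minimal sketch of a time-to-live cache placed in front of an expensive lookup. It is not Totango's actual implementation: the class name `CachedConfigLookup`, the 60-second TTL, the account key, and the returned value of 90 days are all hypothetical, and the `loader` function merely stands in for the real JDBC query.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: a small TTL cache in front of an expensive lookup.
class CachedConfigLookup<K, V> {
    // A cached value together with the time it was loaded.
    private static final class Entry<V> {
        final V value;
        final long loadedAtNanos;
        Entry(V value, long loadedAtNanos) {
            this.value = value;
            this.loadedAtNanos = loadedAtNanos;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stands in for the JDBC query (assumption)
    private final long ttlNanos;

    CachedConfigLookup(Function<K, V> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlNanos = ttlMillis * 1_000_000L;
    }

    // Returns the cached value while it is fresh; otherwise reloads it once per key.
    V get(K key) {
        long now = System.nanoTime();
        Entry<V> entry = cache.compute(key, (k, old) -> {
            if (old != null && now - old.loadedAtNanos < ttlNanos) {
                return old; // still fresh, no database round trip
            }
            return new Entry<>(loader.apply(k), now);
        });
        return entry.value;
    }

    public static void main(String[] args) {
        AtomicInteger dbCalls = new AtomicInteger();
        // The lambda below is a placeholder for the password-expiration query.
        CachedConfigLookup<String, Integer> lookup = new CachedConfigLookup<>(
            account -> { dbCalls.incrementAndGet(); return 90; }, 60_000);
        for (int i = 0; i < 1000; i++) {
            lookup.get("account-1");
        }
        System.out.println("lookups=1000 dbCalls=" + dbCalls.get());
    }
}
```

In this sketch, repeated lookups for the same key within the TTL reuse the cached entry, so the loader runs once per key per TTL window instead of once per request. The trade-off is that a configuration change may take up to one TTL to become visible.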
Long term
Mimic the same load on the system after deployments. Status: NOT STARTED
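One possible shape for such a post-deployment check is sketched below. It is only an illustration under stated assumptions: the class `LoginLoadCheck`, the concurrency of 8, the 200 requests, and the sleeping placeholder request are hypothetical, and a real check would issue login requests against a staging environment at a load level comparable to production.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: fire concurrent requests and report failures and worst latency.
class LoginLoadCheck {
    // Outcome of one run: number of failed calls and the slowest successful call.
    static final class Result {
        final int failures;
        final long maxMillis;
        Result(int failures, long maxMillis) {
            this.failures = failures;
            this.maxMillis = maxMillis;
        }
    }

    static Result run(Callable<Boolean> request, int concurrency, int totalRequests) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < totalRequests; i++) {
                tasks.add(() -> {
                    long start = System.nanoTime();
                    boolean ok = request.call();
                    long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                    if (!ok) {
                        throw new IllegalStateException("request failed");
                    }
                    return elapsed;
                });
            }
            int failures = 0;
            long maxMillis = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Long> future : pool.invokeAll(tasks)) {
                try {
                    maxMillis = Math.max(maxMillis, future.get());
                } catch (ExecutionException e) {
                    failures++; // the request threw or reported failure
                }
            }
            return new Result(failures, maxMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load check interrupted", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder for a real login request (assumption): sleeps 5 ms and succeeds.
        Result result = run(() -> { Thread.sleep(5); return true; }, 8, 200);
        System.out.println("failures=" + result.failures + " maxMillis=" + result.maxMillis);
    }
}
```

A deployment pipeline could fail the release when `failures` is non-zero or `maxMillis` exceeds an agreed threshold; the threshold itself would need to be derived from normal production behavior.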