Application is not available

Incident Report for Totango

Postmortem

What Happened?
Starting June 8th, at 17:46 PST, our application APIs timed out for a portion of our customer base. The issue lasted for several minutes and corrected itself. A few more episodes of 15-20 minutes each occurred over the next few days. During these episodes, users were unable to log in or experienced unstable application performance. Note, however, that data processing and collection were not impacted and no data was lost.

Our team investigated and traced the problem to the network subnet configuration. The team reconfigured the network, which resolved the issue on June 14, at 1:46 PST.

Lessons Learned
* We are making changes to our infrastructure to improve application resiliency and redundancy, so that Totango can react to a problem in one of its components and still provide the best possible service.

We are also improving our network failover mechanism, so that in any future network failure, redundant components will be able to pick up the work without degradation for end users.

Posted Jun 19, 2018 - 15:41 UTC

Resolved

This incident has been resolved.

Posted Jun 15, 2018 - 08:36 UTC

Update

The application has been stable for a few hours now. We continue to monitor our systems.

Posted Jun 13, 2018 - 19:20 UTC

Monitoring

We are currently experiencing a temporary system disruption. Our engineering teams are working to resolve it with the highest priority. We will post any updates here, so please check back periodically. We will share the root cause and the actions taken to prevent this from occurring in the future.

Posted Jun 13, 2018 - 16:09 UTC

Investigating

We are currently investigating problems logging into the Totango application and are working to restore service.
Thank you for your patience while we look into it.

Posted Jun 13, 2018 - 15:52 UTC

This incident affected: Totango Web Application.