What Happened?
Starting June 8 at 17:46 PST, our application APIs timed out for a portion of our customer base. The issue lasted several minutes and then corrected itself. Over the next few days, a few more episodes occurred, each lasting 15-20 minutes.
During each episode, affected users were either unable to log in or experienced unstable application performance. Data processing and collection were not impacted, however, and no data was lost.
Our team investigated and traced the problem to the network subnet configuration. The team reconfigured the network, which resolved the issue on June 14 at 1:46 PST.
Lesson Learned
* Infrastructure changes to improve application resiliency and redundancy, so that Totango can react to a problem in one of its components and still provide the best possible service.