Yesterday we experienced an unexpected service interruption. In the spirit of openness and transparency, we want to explain exactly what happened.
At approximately 6:00 PM (PDT) on May 28, 2019, our network probes and monitoring systems reported a problem with network availability. Our system administration team immediately started an investigation and isolated the problem to a configuration issue with the load balancer technology we utilize for high availability. While we were working to resolve the issue, connectivity was unavailable for customers. At approximately 12:50 AM (PDT) on May 29, a solution was implemented. Consistent uptime is our number one priority with Intervals. You can monitor the uptime report on our status page at any time. We sincerely apologize for any inconvenience this may have caused.
How was the problem resolved?
The source of the problem was emergency maintenance performed by our hosting company, IBM. This maintenance inadvertently caused our load balancers to malfunction. We run redundant load balancers, but the issue affected both the primary and the secondary. While IBM investigated and tried to troubleshoot the existing load balancers, we decided to commission new ones and bring them into production. Deploying new load balancers took additional time and required updating DNS, but we believe it was the best long-term solution.
What did we learn?
During this interruption we were focused on two things: fixing the problem and responding to customer support inquiries as quickly as possible. We notified all administrator-level users of the problem, posted updates on Twitter during the incident, and followed up with administrators with a summary email similar to this blog post. We tend to err on the side of caution with notifications since Intervals is quasi white-labeled, but based on the feedback we received via email, in hindsight we should have notified additional user levels.
We are stable and working as expected, but because we had to update DNS, some customers may see residual issues until the change has fully propagated. The solution involved updating DNS to route traffic to different servers. DNS updates can be tricky because DNS records are cached by local DNS servers around the globe, and it can take some time for all of them to see the update. If you are continuing to experience problems, we recommend flushing your DNS cache or rebooting your computer, which may help.
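For readers comfortable with a little scripting, one way to see what your own computer currently resolves a hostname to is sketched below in Python. This is only an illustration: the function name resolved_addresses is made up for this example, and you would pass in the hostname you normally use to reach your account. It asks the operating system's resolver, so the answer reflects whatever your machine and its upstream DNS servers have cached.

```python
import socket


def resolved_addresses(hostname):
    """Return the sorted, de-duplicated IP addresses the local resolver
    reports for `hostname`, or an empty list if the lookup fails."""
    try:
        # getaddrinfo goes through the operating system's resolver.
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})
```

Running resolved_addresses with your hostname before and after flushing your DNS cache, and comparing the two results, can show whether your machine has picked up a change.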
For flushing cache, here are some instructions on how to do this on Windows:
And here are instructions for OS X:
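As a rough summary of what such instructions usually boil down to, the Python sketch below returns the commands commonly used to flush the DNS cache on each platform, without running them. The helper name dns_flush_commands is hypothetical, and the macOS pair is an assumption that holds for many recent releases; the exact command has differed between OS X versions, so check the instructions for your version before relying on it.

```python
import platform


def dns_flush_commands(system=None):
    """Return the commands commonly used to flush the local DNS cache
    on the given platform. Nothing is executed here."""
    system = system or platform.system()
    if system == "Windows":
        # Run from a Command Prompt.
        return [["ipconfig", "/flushdns"]]
    if system == "Darwin":
        # Assumption: works on many recent macOS releases; older
        # OS X versions used different commands.
        return [
            ["sudo", "dscacheutil", "-flushcache"],
            ["sudo", "killall", "-HUP", "mDNSResponder"],
        ]
    # Other platforms vary too much to list a single command.
    return []


if __name__ == "__main__":
    for command in dns_flush_commands():
        print(" ".join(command))
```

The function only prints the commands, so you can review them before typing them into a terminal yourself.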
We will thoroughly analyze the situation and our redundancy policies and make any necessary adjustments to prevent this type of problem in the future. If you have any questions or concerns, please contact our support team. We’d be more than happy to provide any additional information you might need. We appreciate the patience and understanding extended to us during this interruption in service.