Kalix Platform Outage
Incident Report for Kalix EMR
Resolved
[This is a historical notification, as the Kalix status page was not available at the time of the incident]

A full-day outage caused by a problem with the deployment of our backend servers.

Detailed explanation:
Kalix's servers were hosted on a Kubernetes cluster with a mix of Windows and Linux servers. On Thursday night, we received alerts that a machine was running out of hard drive space. We attempted to carry out routine maintenance on that specific server. Unfortunately, this maintenance caused all of Kalix's servers to experience networking issues, meaning that they could no longer communicate with the outside world (i.e., the internet).

We decided that the quickest way to get Kalix back up and running was to restart the servers. However, recent Windows Azure updates had left Kalix in a state where the servers could no longer work reliably with the cluster software. We spent a significant amount of time trying various ways to update our software so that it would become compatible again.

Eventually, after a full night of working on the problem, it became apparent that the compatibility issue was not fixable. We decided to move our Windows servers to their own hosting solution. We had wanted to avoid this if possible because of the risk of introducing other errors, but at this point we felt there was no other choice.

Moving Kalix to the new servers took an additional couple of hours to complete. A relatively stable Kalix was then made live. The outage was resolved after 22 hours of downtime.

Over the following day, we continued to complete background updates to automate our deployment process (making Kalix easier to upgrade later), and to fix the remaining DNS issues (which mostly affected the short-codes used for links).

In the end, this outage was not caused by any coding issues. Kalix experienced such an extended outage because we needed to change how the platform was structured. The Windows servers are now hosted in their own environment and should be just as stable as they were before the compatibility issues started to happen.

Our most significant takeaway from this event was that we need a way to streamline our communication pathways with our customers. This is why we have launched our status page at status.kalixhealth.com. There you can sign up for notifications, and you will be automatically alerted if any incidents affect Kalix.
Posted Mar 30, 2018 - 12:00 PDT