Indexing of records delayed (again)

Incident Report for Kalix EMR

Resolved

We have added internal emergency notifications for the team, so that if this issue happens again we can detect it quickly and deal with it straight away.

We are closing the immediate issue here and will continue to monitor the system closely over the next few days.

Posted Aug 02, 2018 - 20:33 PDT

Update

Mitigation:
To reduce the chances of this happening again, we have added a second back-end server. If one back-end server has issues, the other should take over within 30 seconds.

This may not solve the problem completely, since the issue could also happen on the second server, but it should give us significantly more time to respond and add some redundancy to the process.
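For readers curious how this kind of takeover can work, here is a minimal sketch of a heartbeat-based standby. It is an illustration only and not a description of Kalix's actual setup: the `Standby` class, its method names, and the rule that the standby steps back when the primary is heard from again are all assumptions. The only detail taken from this report is the 30-second window.

```python
from dataclasses import dataclass

# From the report: the other server should take over within 30 seconds.
FAILOVER_TIMEOUT_SECONDS = 30.0


@dataclass
class Standby:
    """Hypothetical standby worker that watches the primary's heartbeat."""

    last_heartbeat: float  # time (in seconds) the primary was last heard from
    timeout: float = FAILOVER_TIMEOUT_SECONDS
    active: bool = False

    def record_heartbeat(self, now: float) -> None:
        # The primary is alive: remember when we heard from it and stand down.
        # (Standing down automatically is an assumption made for this sketch.)
        self.last_heartbeat = now
        self.active = False

    def check(self, now: float) -> bool:
        # Take over once the primary has been silent for the full timeout.
        if now - self.last_heartbeat >= self.timeout:
            self.active = True
        return self.active


standby = Standby(last_heartbeat=0.0)
idle = standby.check(now=10.0)        # primary heard 10 s ago: standby stays idle
took_over = standby.check(now=31.0)   # silent for over 30 s: standby takes over
```

Timestamps are passed in explicitly so the logic is easy to follow; a real deployment would read a clock and would also need to guard against both servers processing the queue at once.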

Posted Aug 02, 2018 - 14:44 PDT

Monitoring

The server is now running again and the queue has been completely processed - Kalix should now be functioning normally.

We will be actively monitoring this for the next few days and will not close this issue until we are confident that the root cause of the issue has been dealt with.

Posted Aug 02, 2018 - 14:34 PDT

Identified

The cause of the issue could not be determined, since the server was not accessible to us at all - this is something we will be looking into with our provider. For now we have re-created the server, and Kalix is already catching up on the queued-up work (over 50% processed so far).

Posted Aug 02, 2018 - 14:26 PDT

Update

Here is a copy of the previous issue that describes the problem:

One of our servers that indexes records and does additional processing was down due to issues with our underlying cloud provider. Some of the problems that would have been seen during this downtime were the following:

1. Reminder messages, including opt-in messages etc., would have been delayed.
2. New records, such as clients and contacts, would not have been searchable.
3. Insurance batches and other background processes would have been delayed.

Posted Aug 02, 2018 - 14:21 PDT

Investigating

Unfortunately, our server ended up in a bad state again today. We understand this is very frustrating for all our users (and certainly for us). We have not released any code in the last two weeks or so, so this does not appear to have been caused by anything we changed.

We are still working out the root cause with our cloud provider (for the incident yesterday as well), but given that the same issue has happened again, we will be taking more aggressive measures and stepping up our monitoring.

More to come...

Posted Aug 02, 2018 - 14:20 PDT

This incident affected: Kalix Platform, Messaging, and Notifications.