
Post-Mortem - Back-end incident 21st of June 2022

Resolved
Assessed

Situation:
On June 21st at 08:58 CEST, all nodes (servers) of the Zivver Back-end went down.
This was caused by the restart of one faulty node (one server): on restart of this node, an error from another node caused all nodes to go down together.
On June 21st at 08:27 CEST, Cloudflare had an issue that made many websites completely unavailable for a while. The chance of two major issues happening this close together is close to nil, which briefly threw off our engineers. Although the issues looked related, they were not; it was a coincidence.

Impact:
All Zivver services were unavailable for 32 minutes. This led to users being unable to send any Zivver messages in that time frame.

Solution:
Zivver disabled the second faulty node, after which all other nodes recovered immediately. After the faulty node had been investigated and was deemed safe to restart, all nodes were up and running again at 09:30 CEST.

Root-cause:

  1. At 01:58 CEST AWS reported an issue on their side with one of our nodes (server hosted by AWS).
  2. At 08:47 CEST another node was observed behaving unexpectedly, with lower CPU usage than normal.
  3. At 08:52 CEST an alert was received about slow response times of the back-end. This was linked to the fact that only 16 of the 18 nodes were functioning correctly.
  4. At 08:54 CEST our engineers decided to restart the node with the low CPU usage to reduce the response times.
  5. At 08:58 CEST it became clear that restarting this node caused all other nodes to go down.
    This was caused by the node that AWS had reported issues with. That node was no longer connected to the load balancer (used to divide incoming traffic over the different nodes), but it was still connected to Hazelcast (used to cache certain data). The moment the restarted node tried to reconnect to the load balancer, all nodes failed because they tried to update their Hazelcast cache but could not reach the node with the AWS issues. A sketch of this embedded cache setup follows this list.
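
To make the failure mode concrete, the following is a minimal sketch of an embedded Hazelcast setup of the kind described above, where every back-end node is also a cache cluster member. The class name, map name and member addresses are illustrative assumptions, not taken from the Zivver code base; the point is that cache operations depend on cluster members that the load balancer knows nothing about.

    import com.hazelcast.config.Config;
    import com.hazelcast.config.JoinConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;

    public class EmbeddedCacheNode {
        public static void main(String[] args) {
            // Each back-end node is also a Hazelcast cluster member (embedded mode).
            Config config = new Config();
            JoinConfig join = config.getNetworkConfig().getJoin();
            join.getMulticastConfig().setEnabled(false);
            // Illustrative member list: every application node joins the same cache
            // cluster, regardless of whether the load balancer still routes to it.
            join.getTcpIpConfig().setEnabled(true)
                .addMember("10.0.0.1").addMember("10.0.0.2").addMember("10.0.0.3");

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

            // A distributed map update may need to reach the member that owns the
            // key's partition. If that member is the half-dead node (out of the load
            // balancer, still in the Hazelcast cluster), this call stalls instead of
            // failing over, which blocks request handling on every node.
            IMap<String, String> cache = hz.getMap("sessionCache"); // name is illustrative
            cache.put("user-123", "cached-value");
        }
    }
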

Mitigating Actions
These actions will prevent this issue from recurring:

  1. We are moving Hazelcast to a separate service. This ensures that an issue between a single node and Hazelcast can no longer affect the other nodes (see the first sketch below this list).
  2. We have set up new alerts for both CPU thresholds and data transfer thresholds that could indicate a similar situation (see the second sketch below this list).
  3. We are increasing the number of nodes so that we have more fallback capacity in the future.
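
For action 1, a minimal sketch of what the client-server setup could look like is shown below, assuming Hazelcast runs as a dedicated service and the back-end nodes connect as clients instead of forming the cluster themselves. The cluster name, host names and map name are illustrative, not the actual Zivver configuration; the design point is that a broken back-end node no longer owns cache partitions and so cannot stall the other nodes.

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.client.config.ClientConfig;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;

    public class BackendCacheClient {
        public static void main(String[] args) {
            // Back-end nodes only connect as clients to a dedicated Hazelcast
            // service; they are no longer members of the cache cluster.
            ClientConfig config = new ClientConfig();
            config.setClusterName("zivver-cache");                  // name is illustrative
            config.getNetworkConfig()
                  .addAddress("cache-1.internal:5701", "cache-2.internal:5701"); // illustrative hosts

            HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
            IMap<String, String> cache = client.getMap("sessionCache"); // name is illustrative
            cache.put("user-123", "cached-value");
        }
    }
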
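For action 2, assuming the nodes run on EC2 and alerting goes through AWS CloudWatch (the post-mortem does not name the alerting stack), a low-CPU alarm could be registered as sketched below. The instance id, alarm name and thresholds are illustrative.

    import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
    import software.amazon.awssdk.services.cloudwatch.model.ComparisonOperator;
    import software.amazon.awssdk.services.cloudwatch.model.Dimension;
    import software.amazon.awssdk.services.cloudwatch.model.PutMetricAlarmRequest;
    import software.amazon.awssdk.services.cloudwatch.model.Statistic;

    public class NodeHealthAlarms {
        public static void main(String[] args) {
            try (CloudWatchClient cw = CloudWatchClient.create()) {
                // Alarm when a node's average CPU stays below a floor for 10 minutes,
                // the "unexpectedly low CPU" pattern seen before the outage.
                cw.putMetricAlarm(PutMetricAlarmRequest.builder()
                        .alarmName("backend-node-low-cpu")            // illustrative name
                        .namespace("AWS/EC2")
                        .metricName("CPUUtilization")
                        .dimensions(Dimension.builder()
                                .name("InstanceId")
                                .value("i-0123456789abcdef0")          // illustrative instance
                                .build())
                        .statistic(Statistic.AVERAGE)
                        .period(300)                                   // 5-minute datapoints
                        .evaluationPeriods(2)                          // 2 consecutive datapoints
                        .threshold(5.0)                                // percent CPU, illustrative floor
                        .comparisonOperator(ComparisonOperator.LESS_THAN_THRESHOLD)
                        .build());
                // A similar alarm on the NetworkOut / NetworkIn metrics covers the
                // data-transfer threshold mentioned in action 2.
            }
        }
    }
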