Backend API

Post Mortem - Zivver email services degraded 03-11-2022


Situation:

  1. On the morning of Thursday 3 November 2022, at 07:58 CET, we noticed a significant increase in our platform's response times.
  2. Around 09:45 CET we started throttling logins to give the platform room to recover while we searched for the root cause.
  3. Around 15:00 CET we found the root cause and began gradually removing the login throttling. By around 16:30 CET the platform was back to normal.

Impact:

  1. At the start of the incident the platform was mostly unresponsive, with most login requests failing or timing out.
  2. Once login throttling was in place, users needed multiple attempts to log in, because only 10-30% of login traffic was let through. After a successful login, the platform was reachable and worked properly for that user for the rest of their session.
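To illustrate what letting only a fraction of login traffic through can look like, here is a minimal sketch in Python. It is not the mechanism we used during the incident; the class name LoginThrottle and the admit_fraction parameter are hypothetical and chosen for illustration only.

```python
import random
from typing import Optional


class LoginThrottle:
    """Admits roughly `admit_fraction` of login attempts (hypothetical helper)."""

    def __init__(self, admit_fraction: float, rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= admit_fraction <= 1.0:
            raise ValueError("admit_fraction must be between 0 and 1")
        self.admit_fraction = admit_fraction
        # A dedicated generator keeps the throttle independent of global state.
        self._rng = rng if rng is not None else random.Random()

    def admit(self) -> bool:
        # random() returns a float in [0.0, 1.0), so a fraction of 0.0 admits
        # nothing and a fraction of 1.0 admits everything.
        return self._rng.random() < self.admit_fraction
```

A rejected attempt would be answered with a "try again" response, which is why users needed several tries. Raising admit_fraction step by step corresponds to gradually removing the throttle.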

Root cause:

During SSO login requests, the platform checks the data of the user's Identity Provider. This check was performed while holding a database connection open. On Thursday morning the check against one specific Identity Provider timed out and was retried continuously. Because this affected many users trying to log in, the automatically retried checks ended up holding every available connection to our platform database, saturating it.
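The difference between holding and not holding a database connection during the Identity Provider check can be sketched as follows. This is a simplified illustration, not our platform code: ConnectionPool is a toy stand-in for a real pool, and the function names and the fetch_idp_metadata callable are hypothetical.

```python
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Toy stand-in for a database connection pool (hypothetical)."""

    def __init__(self, size: int) -> None:
        self._slots = threading.BoundedSemaphore(size)
        self._lock = threading.Lock()
        self.in_use = 0

    @contextmanager
    def connection(self, timeout: float = 1.0):
        # Fail instead of waiting forever when every connection is taken.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("connection pool exhausted")
        with self._lock:
            self.in_use += 1
        try:
            yield object()  # placeholder for a real connection object
        finally:
            with self._lock:
                self.in_use -= 1
            self._slots.release()


def login_holding_connection(pool, fetch_idp_metadata) -> bool:
    # Problematic pattern: the slow external call runs while a connection
    # is checked out, so a hanging Identity Provider pins that connection.
    with pool.connection():
        metadata = fetch_idp_metadata()
        return metadata is not None


def login_releasing_connection(pool, fetch_idp_metadata) -> bool:
    # Safer pattern: fetch first with no connection held, then use the
    # database only for the short step that actually needs it.
    metadata = fetch_idp_metadata()
    with pool.connection():
        return metadata is not None
```

With the first pattern, every login waiting on an unresponsive Identity Provider occupies one pool slot for as long as the check keeps retrying. With the second, a hanging check slows down only that login attempt.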

Solution:

Disabling SSO for the users whose Identity Provider was broken resolved the problem immediately.

Mitigating Actions:

  1. Reducing the timeout for the SAML Identity Provider check ensures that a slow check cannot block a connection for other users in the future.
  2. An updated version of the platform has been deployed in which the bug is resolved: a database connection can no longer be blocked while the SAML Identity Provider data is fetched.
  3. The maximum number of connections to the database has been increased to reduce the risk of saturation.
  4. We will run a dedicated bug hunt for cases where a database connection is held longer than necessary.
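A short timeout combined with a bounded number of retries could look like the sketch below. It is an illustration under assumptions, not the deployed fix: the function name, the default values, and the shape of the fetch callable are all hypothetical.

```python
from typing import Callable, Optional


def fetch_idp_metadata_bounded(
    fetch: Callable[[float], bytes],
    timeout_s: float = 2.0,
    max_attempts: int = 2,
) -> Optional[bytes]:
    """Call fetch(timeout_s) at most max_attempts times.

    Returns the fetched data, or None if every attempt failed, so the caller
    can reject the login instead of retrying without limit.
    """
    for _ in range(max_attempts):
        try:
            return fetch(timeout_s)
        except OSError:
            # TimeoutError and urllib.error.URLError are both subclasses of
            # OSError, so timeouts and connection failures end up here.
            continue
    return None
```

The fetch callable could, for example, wrap urllib.request.urlopen(url, timeout=timeout_s).read() for a given metadata URL. The worst-case duration of one check is then roughly timeout_s multiplied by max_attempts, rather than unbounded.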

This Post-Mortem was opened retrospectively.