Postmortem for the April 9th 2016 outage (and related ones)

Postmortem for the April 9th 2016 outage

What was the cause for the outage?

The outage was caused by an issue with one of the storage arrays that the servers use to store their operating data, including the daily database backups.

Who was affected?

Clients that have websites and emails hosted with us, as well as our control panels and front end systems.

What was the timeline of the outage?

April 9 19:00 - We were performing regular system maintenance on our servers when we noticed input/output errors.
April 9 19:05 - Our remote monitoring systems started showing alerts for the servers.
April 9 19:09 - Outage notifications were posted on our Twitter and Facebook pages.
April 9 19:10 - Working on identifying the cause for the input/output errors.
April 9 19:20 - Datacenter engineers notified about a potential issue with the storage array used by the servers.
April 9 20:20 - Datacenter engineers identified the issue and started working on a fix.
April 10 02:30 - The storage array was returned to a normal operating status.
April 10 03:00 - Attempted to restart the affected servers.
April 10 03:30 - After detailed troubleshooting, we determined that the storage array had suffered catastrophic data corruption.
April 10 03:45 - Restored the servers to the weekly backup snapshots taken on April 2nd.
April 10 04:00 - Servers have been restored to a normal operating status. Declaring the outage as resolved.
April 9th to April 13th - Sporadic outages.

What was the total time of the outage?

The total time for the outage was sixteen (16) hours.

Why are you using a single array for storing server data?

We are not. The storage array was a redundant storage array. In addition, a separate storage array is used for the website files and emails, and an additional storage array is used for the backups (see more below).

What exactly was affected?

Databases were affected as a result of the corrupted data. Emails and website files are stored separately and were not affected.

Why did you have to restore to a week-old backup?

The daily database backups were stored on the same storage array that stored the server data. Weekly backups (including database backups) are stored on a separate array to protect against exactly this scenario, a failure of the primary storage array. The primary storage array irreversibly corrupted both the database data and the daily backups, so we had to restore from the weekly backups stored on the other array.

What steps have you taken so that this particular issue will not happen again?

We have changed our backup strategy to include daily backups in addition to weekly backups. The daily backups will be stored separately, so that in the future we can restore from a more recent backup if an issue occurs. We will also be changing our daily database backup strategy to take multiple backups per day, each kept for a week.
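
To illustrate why more frequent, separately stored backups matter, here is a minimal Python sketch that compares how much data would be at risk under a weekly-only schedule versus several backups per day, and that prunes backups older than a week. It is not our actual backup tooling. The function names, the six-hour interval, and the exact time of the April 2nd snapshot are hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta


def latest_restore_point(backups, failure_time):
    """Return the most recent backup taken at or before the failure, or None."""
    candidates = [b for b in backups if b <= failure_time]
    return max(candidates) if candidates else None


def data_loss_window(backups, failure_time):
    """Return how much time lies between the restore point and the failure."""
    restore_point = latest_restore_point(backups, failure_time)
    if restore_point is None:
        return None
    return failure_time - restore_point


def prune(backups, now, retention=timedelta(days=7)):
    """Keep only the backups that are still inside the retention window."""
    cutoff = now - retention
    return [b for b in backups if b >= cutoff]


# Failure time taken from the timeline above.
failure = datetime(2016, 4, 9, 19, 0)

# Assumption: the weekly snapshot was taken at midnight on April 2nd.
weekly_only = [datetime(2016, 4, 2, 0, 0)]

# Assumption: a hypothetical schedule with one backup every six hours.
six_hourly = [datetime(2016, 4, 2, 0, 0) + timedelta(hours=6 * i) for i in range(32)]

print("Weekly only:", data_loss_window(weekly_only, failure))
print("Every six hours:", data_loss_window(six_hourly, failure))
print("Backups kept after pruning:", len(prune(six_hourly, failure)))
```

With several backups per day, the window of data at risk shrinks from days to hours, provided those backups do not live on the same array as the data they protect.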

How does your 99.9% uptime guarantee come into play?

In accordance with our 99.9% uptime guarantee, we will be issuing credits to our customers. All affected customers will be issued a credit equal to 50% of their monthly fee. For yearly contracts, the yearly rate will be divided by 12, and the 50% credit will be applied to the resulting monthly amount.
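
As a worked example of the credit calculation described above, the following Python sketch computes the credit for monthly and yearly contracts. The function name, the billing labels, and the rounding to two decimal places are assumptions made for illustration. The credits actually applied to accounts are the ones that count.

```python
from decimal import Decimal, ROUND_HALF_UP

# 50% of the monthly fee, as stated above.
CREDIT_RATE = Decimal("0.50")


def outage_credit(fee, billing="monthly"):
    """Return the credit for a contract billed monthly or yearly.

    Assumption: the result is rounded to two decimal places.
    """
    fee = Decimal(str(fee))
    if billing == "yearly":
        # Yearly contracts: divide the yearly rate by 12 first.
        monthly_fee = fee / 12
    elif billing == "monthly":
        monthly_fee = fee
    else:
        raise ValueError("billing must be 'monthly' or 'yearly'")
    credit = monthly_fee * CREDIT_RATE
    return credit.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Hypothetical fees, for illustration only.
print(outage_credit("10.00"))             # monthly contract
print(outage_credit("120.00", "yearly"))  # yearly contract
```

Both hypothetical contracts above work out to the same credit, since a yearly rate of 120.00 corresponds to a monthly amount of 10.00.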

We would like to apologize for any inconvenience these outages may have caused. Issues of this kind are extremely rare and could not have been predicted in advance.