Outage for "sterre-lon" in London data center
Incident Report for Servebolt
Postmortem

Recent Service Disruption — Our Apologies

On Wednesday, March 9th at approximately 08:15 CET, we experienced an outage on a single server at our London data center. This resulted in service disruption for all websites hosted on this server.

With any significant event that affects our customers, we conduct an extensive examination to understand the root cause and develop a course of action to improve our systems and procedures. To that end, we want to provide a synopsis of what occurred and our reassurance that we are working diligently to prevent future outages.

Here's what happened

Earlier that day, at 04:10 CET, we performed urgent security updates to the Linux kernel. We perform these kinds of kernel security updates quite frequently, and they usually take no longer than 10 to 30 seconds. We were actively monitoring the server after the update, and it was performing as expected.

We had already performed the same update on about 40 servers without any problems, but about four hours later, at approximately 08:15 CET, MariaDB started crashing, escalating to a full outage at 09:10 CET.

Our Operations team started working on the problem, and it quickly became evident that the MariaDB logs had been corrupted.

In the time that followed, we initiated a full restore from the backup server to a spare server in case the data turned out to be permanently damaged. In the meantime, Operations worked continuously on recovering the corrupted databases. At 11:20 CET we were able to confirm the full recovery of 97% of the affected databases. The remaining 3% unfortunately had to be restored from the backup server.

At 13:00 CET all databases had been recovered or restored, and the incident was closed.

Here's what we're doing

Our investigation identified the root cause as incompatible firmware versions. Going forward, we will add steps to ensure that such incompatibilities are identified and taken care of separately from emergency security updates.

Outages disrupt your life and your business. We understand this, and we take our responsibility to you very seriously. We sincerely apologize for the disruption and the inconvenience this has likely caused you.

Please allow me to take this opportunity to thank you for your business and provide my personal assurance that we are dedicated to meeting our commitment to you.

Sincerely,

Erlend Eide
CEO

Servebolt.com

Posted Mar 11, 2022 - 16:03 CET

Resolved
This incident has been resolved. The majority of databases were recovered without data loss. Only a few needed to be restored from backup.
Posted Mar 09, 2022 - 14:30 CET
Monitoring
We’ve been able to restore most of the databases and are currently monitoring the server.
Posted Mar 09, 2022 - 13:07 CET
Update
We are continuing to work on fixing the issue. Please reach out to support if you have any urgent questions.
Posted Mar 09, 2022 - 11:56 CET
Update
We are continuing to work on a fix for this issue.
Posted Mar 09, 2022 - 10:57 CET
Identified
We are currently experiencing a database crash on "sterre-lon". Unfortunately, the databases seem to have been corrupted beyond repair and are now being restored from backups.
Posted Mar 09, 2022 - 10:17 CET
This incident affected: Servebolt Cloud LON.