"arnt-nyc" unavailable in data center New york
Incident Report for Servebolt
Postmortem

Recent Service Disruption — Our Apologies

On Thursday, March 10th at approximately 05:10 EST, one of the servers in our New York data center suffered an outage. This disrupted service for all websites hosted on that server.

With any significant event that affects our customers, we conduct an extensive examination to understand the root cause and develop a course of action to improve our systems and procedures. To that end, we want to provide a synopsis of what occurred and reassure you that we are working diligently to prevent future outages.

Here's what happened

At 05:10 EST we performed urgent security updates to the Linux kernel. We perform these kinds of updates quite frequently, and they usually take no more than 10 to 30 seconds.

To prevent a repeat of a related outage we had yesterday, we had prepared and tested a firmware update separately. Unfortunately, the BIOS chip didn't process the update as expected. We then attempted a new firmware upgrade through the management controller, but it failed with unexpected errors such as a size mismatch. The management controller had either lost contact with the BIOS chip or ended up in a confused state.

After several attempts to bring the server back online by power cycling it through the management controller, we unfortunately had to resort to asking our data center provider to power cycle the server physically, resetting all hardware state so that it could reboot.

At 06:20 EST the server started coming back online, and at 06:37 EST it was fully operational again.

Here's what we're doing

Going forward, we will add steps to ensure that firmware updates are handled separately from emergency security updates.

We'll be adding new spare servers to our New York data center so that we have more spare capacity when we need it in cases like this. We'll also consider using remotely controlled power distribution units (PDUs) where possible.

Outages disrupt your life and your business. We understand that, and we take our responsibility to you very seriously. We sincerely apologize for the disruption and the inconvenience this has likely caused you.

Please allow me to take this opportunity to thank you for your business and provide my personal assurance that we are dedicated to meeting our commitment to you.

Sincerely,

Erlend Eide
CEO

Servebolt.com

Posted Mar 11, 2022 - 17:16 CET

Resolved
This incident has been resolved.
Posted Mar 10, 2022 - 17:15 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 10, 2022 - 12:40 CET
Investigating
We are currently investigating this issue.
Posted Mar 10, 2022 - 11:20 CET
This incident affected: Servebolt Cloud NYC.