freistilbox Blog


Incident review: Outage vm3

On Thursday, 15 May, one of our VM hosts, vm3, did not return to operation after a standard maintenance procedure, resulting in an outage of more than 14 hours. While we were able to restore all affected DrupalCONCEPT POWER servers, the only backups we had available were more than 24 hours old. And in the case of a custom-built managed server, we even lost most of its files completely.

We regard reliability and effective IT processes as essential for our business. An outage of this duration and with these results is not acceptable. We are embarrassed and deeply sorry about this incident, and I apologize on behalf of freistil IT to all customers we disappointed.

In this review, I’d like to give you detailed insight into what happened and what we’re going to do to prevent incidents like this in the future.

What happened

On Monday, 12 May, the VM host vm3 signaled one of two disks of its RAID–1 array as failed. It kept running on the second disk without any problems. We scheduled a maintenance window to have the failed disk replaced for Thursday, 15 May at 19:00 UTC, and announced the scheduled maintenance on the freistilbox Status Page.

Data center staff shut down the server at 18:55 UTC (a few minutes early) and replaced the broken disk. After the restart, we found that the server would not boot into a working system again. It turned out that there was no longer a bootable operating system on the remaining disk, which suggested that this disk had failed, too. When we realised that there was nothing we could do about the second failed disk, we decided to take the only viable, albeit laborious, route: rebuilding the server from scratch. After getting the second disk replaced, we started reinstalling the server OS, then the host environment and finally, the guest servers.

When we started the restore process, we realised that even the first phase, building a directory tree of the data to restore, would take several hours. We hoped that it would finish overnight, but on Friday morning, after 7 hours, the backup database was still collecting data for the restore directory tree. Fortunately, we found by experimenting that aborting the slow query on the database server forced the backup system to fall back to a full restore of all files in the backup set.

After the restore jobs had finished on all affected servers, we started reimporting the database dumps that were included in the backups. That’s when we found that we had timed the creation of these dumps badly: the job for the daily database dumps actually ran later than the file backup that was supposed to pick them up. Restoring from the Wednesday night backup already meant losing almost a whole day of data, and on top of that, the database dumps contained in that backup dated from Tuesday night.
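The ordering problem can be illustrated with a small sketch. The job times used here are invented for illustration (this review does not state the actual schedule), and the function name is hypothetical. It only shows that a file backup starting before the same night's dump job picks up the dump from the night before.

```python
from datetime import datetime, timedelta


def newest_dump_in_backup(backup_start, dump_hour):
    """Return the time of the newest daily dump that exists when the file backup starts.

    Assumption: the dump job runs once a day at dump_hour:00 and
    finishes immediately (real dumps take time, which makes things worse).
    """
    candidate = backup_start.replace(hour=dump_hour, minute=0, second=0, microsecond=0)
    if candidate > backup_start:
        # Today's dump has not run yet, so only yesterday's dump is on disk.
        candidate -= timedelta(days=1)
    return candidate


# Hypothetical "Wednesday night" file backup starting at 22:00.
backup_start = datetime(2014, 5, 14, 22, 0)

# Dump job scheduled after the file backup: the backup contains a stale dump.
old_schedule = newest_dump_in_backup(backup_start, dump_hour=23)
print(old_schedule)  # 2014-05-13 23:00:00 (a Tuesday)

# Dump job scheduled before the file backup: the backup contains a fresh dump.
new_schedule = newest_dump_in_backup(backup_start, dump_hour=21)
print(new_schedule)  # 2014-05-14 21:00:00 (a Wednesday)
```

With the dump job running first, the dump in the backup is at most as old as the gap between the two jobs instead of a full day older.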

And as if this wasn’t bad enough news for our customers already, it turned out that one of the affected servers didn’t have any of its websites backed up at all. The server in question is a custom-built managed server. While with DrupalCONCEPT and freistilbox servers, everything (including the backup) is configured automatically, this server would have needed a manual backup configuration, and we had evidently forgotten this part during setup.
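A gap like this can be caught by an automated coverage check. The following is a minimal sketch, not our actual tooling: the function name, the directory names and the idea of a flat list of backup include paths are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def uncovered_paths(site_dirs, backup_includes):
    """Return the site directories that no backup include path covers.

    A site counts as covered if it equals an include path or lies
    somewhere below one. Both arguments are lists of absolute paths.
    """
    includes = [PurePosixPath(p) for p in backup_includes]
    missing = []
    for site in site_dirs:
        site_path = PurePosixPath(site)
        covered = any(
            site_path == inc or inc in site_path.parents for inc in includes
        )
        if not covered:
            missing.append(site)
    return missing


# Hypothetical example: one site lives outside the backed-up tree.
sites = ["/var/www/site-a", "/srv/custom/site-b"]
includes = ["/var/www"]
print(uncovered_paths(sites, includes))  # ['/srv/custom/site-b']
```

Run regularly against every server, a check of this kind would flag a server whose include list is empty or incomplete long before a restore is needed.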

Some customers had newer backups available that we were able to copy back to their server but in the end, most of them still suffered a catastrophic loss of data.

On Friday at about 11:30 UTC, all servers were online again. We then spent the rest of the day assisting our affected customers in solving some minor remaining issues.

What we are going to do about it

In a post mortem meeting on Monday, 19 May, we discussed the incident and decided on remediation measures to prevent it from repeating.

The root cause of the incident, the loss of both disks of a RAID–1 array, is a rare event but we need to be prepared for it to occur. We especially need to minimise the amount of data lost due to such a failure.

While the affected customers had consciously chosen a one-server setup that has many single points of failure (SPOF), neither they nor we had expected that an outage would take this long and would result in such catastrophic data loss. We need to make sure that all our backups have complete coverage and that they can be restored within a reasonable amount of time (a few hours max).

As a result of our post mortem, we decided on the following remedial measures:

  • We checked to make sure that all customer data, especially on custom-built servers, will be fully backed up from now on.
  • We rescheduled our file backup in order to include the latest database dumps.
  • Planned maintenance must be done right after a backup run. We will either schedule it after the regular daily backup job or we’ll trigger an extra backup in advance of the maintenance.
  • We’ll schedule regular disaster recovery exercises where we take production backups and restore them to a spare server.
  • We’ll research how we can speed up the restore process. This could mean improvements to specific components or even switching to a different backup system altogether.
  • If customers need backup intervals shorter than 24 hours, we’ll support them in setting up custom backup jobs directly from their content management system.
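The rule that planned maintenance must directly follow a backup run can be expressed as a simple pre-flight check. This is a sketch under stated assumptions: the two-hour threshold, the function name and the backup timestamp are hypothetical and not taken from our actual procedures.

```python
from datetime import datetime, timedelta

# Assumed threshold: how old the latest backup may be when maintenance starts.
MAX_BACKUP_AGE = timedelta(hours=2)


def needs_extra_backup(last_backup_finished, maintenance_start, max_age=MAX_BACKUP_AGE):
    """Return True if an extra backup must be triggered before maintenance.

    last_backup_finished may be None if the server has never been backed up.
    """
    if last_backup_finished is None:
        return True
    return maintenance_start - last_backup_finished > max_age


# The maintenance window from this incident, with a hypothetical backup time.
maintenance_start = datetime(2014, 5, 15, 19, 0)
last_backup = datetime(2014, 5, 14, 23, 30)
print(needs_extra_backup(last_backup, maintenance_start))  # True
```

Treating a missing backup as "extra backup required" means a server without any backup configuration blocks the maintenance instead of passing silently.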

In conclusion, I’d like to state that this incident showed an embarrassing lack of preparation on our side for the failure of a whole disk array. I apologize to all affected customers that we were not able to restore normal operation more quickly and to the full extent. I assure you that we are working hard to prevent an incident like this from ever happening again.

Nagyon szépen köszönöm, Szeged! (Thank you very much, Szeged!)

Three findings on my flight home:

  • Compared with Hungarian, German with its puny three umlauts can pack up and go home.
  • Compared with Ireland, Hungary has much nicer weather.
  • Compared with the pubs at home, in Hungary you can treat four or five times as many people to a beer for the same amount of money.

As you can see, Hungary has the advantage in many regards and I had a great time here at the Drupal Developer Days this past week.

When I arrived in Szeged on Wednesday evening, Drupal 8 coding sprints had already been running for a few days and they’d continue all week. During this time, up to 150 Drupal developers were working to make progress on code and documentation issues. There were 115 commits to Drupal core and after removing 19 blocker issues, we’re now 40% closer to Drupal 8 Beta. Our Git repository traffic was so high that it even triggered drupal.org’s DDoS defenses!

At the event location alone, we consumed 5500 sandwiches, 1000 servings of coffee, 240 l of beer, 300 kg of sweet snacks and 120 kg of bananas. The venue was a really good choice. It had all the space we needed, good catering, and the WiFi worked well throughout the event. And having the Novotel (with its amazing value for money) right next door is unbeatable convenience.

From Thursday to Saturday, there were a lot of interesting presentations as well as a reprise of the multi-hour Caching Deep Dive workshop that had been a big success at DrupalCon Prague. Almost every talk referenced the upcoming Drupal version. And although many things are still in flux, it feels to me like Drupal 8 is taking shape.

What drove engagement most was the great community spirit. Everyone was welcome, from the Drupal novice to the long-time core contributor. There were smiling faces all around and you could simply walk up to anyone to have a chat or ask a question. People got together spontaneously, be it to code or to go have dinner. These personal experiences are what I love most about the Drupal community. If you haven’t been to a DrupalCamp yet, go to DrupiCal now and see what’s happening near your place! Go on, I’ll wait here.

My personal highlight was the #AberdeenFreistilCloudBox party on Friday night, which I had the pleasure to co-organize. I had contacted Aaron Porter before the event, suggesting we join forces and do something about the community’s lack of awareness of our European Drupal hosting companies, which we had both noticed at DrupalCamp London a few weeks earlier. When I met Aaron on Thursday morning, he invited me to join him in scouting for a party venue for Friday night. We looked at two bars in the center of Szeged and decided on hosting (that’s what we’re good at, after all!) the party at the CoolTour Cafe. We were able to make a deal that secured our guests 100 free beers as well as free admission to the concert room where one of Hungary’s best-known jazz singers was going to be on stage. And when on Friday afternoon the number of sign-ups for the party crossed the 100 mark, Aaron and I decided to throw in another 100 beer vouchers. The party was a success and fun was had by everyone. The title of drink distributor extraordinaire goes to Dave Hall, who’s apparently related to a huge Dutch family since he returned every few minutes to grab a voucher for another member of the Needabeer clan.

Finally, I’d like to congratulate the organization team around Kristof van Tomme on a great Drupal community event! I’m thankful for a lot of good conversations, and happy that I overcame my initial reluctance to register and got to be a part of Drupal Developer Days Szeged 2014.

Pushing Drupal 8 forward in Hungary

2014 is the year of Drupal 8 and we’re getting our freistilbox hosting platform ready to host its first Drupal 8 websites. Of course, Drupal 8 will be the central topic of the Drupal Developer Days in Szeged from 24 to 30 March. We can’t miss such an important event, so Jochen will be following DrupalMarvin’s invitation from DrupalCon Prague last year.

Jochen is going to be in Szeged from Wednesday the 26th to Sunday the 30th, so if you’re there, make sure to say Hi! He’ll be more than happy to get you a beer or coffee, and you can ask him whatever you’d like to know about freistilbox.

And don’t forget to pack your towel!

Heading to DrupalCamp London

Tomorrow, I’ll fly to our neighbor island for DrupalCamp London and Markus is going to join me on Friday. Together, we’re going to breathe some community air again and get a feel for what British Drupal shops need in terms of hosting.

With 600 attendees, DrupalCamp London is going to be an impressive event! There will be 30 community sessions as well as BoFs and sprints across the weekend. I’ll certainly try to at least attend the “Next Generation DevOps” and “Concurrent Programming” talks. I’m also looking forward to the Drupal CxO meet-up on Friday before the actual conference.

For Markus and me, it’s a valuable opportunity to be in the same place at the same time for a change. That’s why we’ll stay a few more days after the weekend to do important strategy work for 2014 together.

We’re very excited to meet a lot of enthusiastic Drupal developers in London! So, if you’d like to join us for a pint and talk about your Drupal hosting needs, simply drop us a line via email or on Twitter!
