freistilbox Blog

Newer articles « Page 12 of 17 » Older articles

Let us take care of... security updates.

freistilbox is a fully managed hosting platform. That means that we do everything that’s necessary to run a reliable hosting service.

Last week, a new software security threat with a catchy name reared its ugly head: Shellshock is a security flaw in the widely used command-line shell “bash”. It can be exploited to make a server execute arbitrary commands. Troy Hunt has the technical details.

After this vulnerability became widely known on Wednesday and security fixes were made available soon after, we immediately tested and installed them. Since then, we have received two follow-up bash updates with additional fixes, which we rolled out in the same swift fashion.

If you prefer to sleep peacefully, knowing that we take care of hosting security, why don’t you check out all the other advantages of freistilbox?

Incident review: Datacenter outage on 2014-08-03

On Sunday, 2014-08-03, freistilbox operation was severely disrupted due to a power failure at a datacenter.

We apologise for this outage. We take reliability seriously, and an interruption of this magnitude, as well as its impact on our customers, is unacceptable.

What happened

On Sunday, 2014-08-03, at 12:34 UTC, our on-call engineer was alerted by the monitoring system that a long list of servers had suddenly gone offline. This indicated a network outage, and we posted a short notice to our status page. We then immediately contacted datacenter support. While we didn’t get a direct answer at first, the datacenter posted a first public status update at 12:54, explaining that server room RZ19 had suffered an outage.

Since one of our server racks is located in this server room, the impact of this outage was severe. The affected rack hosts all kinds of servers including database and file storage nodes. Without these services, even application servers outside of RZ19 weren’t able to deliver content any more.

Since we run the nodes of our database clusters in different server rooms, we executed a failover procedure to the standby nodes of the affected databases. This restored operation for a part of our hosting infrastructure.
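The decision behind such a failover can be sketched in a few lines of Python. This is a simplified illustration, not our actual tooling: the `DbNode` fields, the node names and the second room name are made up for the example, and a real failover also has to deal with replication state and client reconnects.

```python
from dataclasses import dataclass


@dataclass
class DbNode:
    name: str      # hypothetical node name
    room: str      # server room the node is located in
    role: str      # "primary" or "standby"
    healthy: bool  # result of the latest health check


def plan_failover(nodes):
    """Return the node that should serve traffic, or None if none is usable.

    A healthy primary stays in place; otherwise the first healthy
    standby is selected for promotion.
    """
    primary = next((n for n in nodes if n.role == "primary"), None)
    if primary is not None and primary.healthy:
        return primary
    return next((n for n in nodes if n.role == "standby" and n.healthy), None)


# The primary sits in the failed room, the standby in a different one.
cluster = [
    DbNode("db1", room="RZ19", role="primary", healthy=False),
    DbNode("db2", room="room-b", role="standby", healthy=True),
]
active = plan_failover(cluster)
print(active.name if active else "no healthy node")
```

The sketch only works because the standby lives in a different server room than the primary. If both nodes were in RZ19, there would be nothing left to promote.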

At about 13:00, our servers started to come back online. When we checked their uptime, we realised that they must have just started up, so we suspected a power outage. This was confirmed when the datacenter announced that RZ19 had suffered a “brownout” that caused its servers to reboot. Later, the ISP added that a whole datacenter location had suffered a power outage. The UPS systems of all server rooms had been able to compensate until the power generators had started up, with the exception of RZ19.

At about 14:00, most of our servers were running smoothly again. A few of our database servers had suffered data corruption and since we had already switched to their standby nodes, we decided to repair them later. At that time, it was more urgent to replace application boxes that still had not come back. Some of our customers choose to run single-node freistilbox clusters and the websites running on these boxes were still down. We launched new boxes on servers with spare capacity and at about 15:00, our infrastructure was fully functional again.

What we’re doing about it

Since we don’t run our own datacenters, we depend on our hosting partners when it comes to hardware infrastructure (servers, network, power, cooling etc.). We can’t prevent power outages, only trust that our infrastructure providers take all the necessary measures to prevent them.

What we can do ourselves is build our hosting architecture to be as resilient as possible in order to minimise the impact of a power outage. We have already built a lot of redundancy into freistilbox. This enabled us, for example, to quickly switch to non-affected database servers as we did at the beginning of this incident. We have identified a few points, though, where an outage can cause bigger parts of our infrastructure to fail.

The most critical one of these points is our current storage technology. While it comes with data replication features (of which we make use, of course), it is hard to distribute data over server rooms or even distant datacenters without running into network latency issues. That’s why we’re currently testing alternative solutions that don’t have this weakness. As a beta test, we’re already running our own company freistilbox cluster (the one that’s hosting this website) on one of these alternatives. This means we’ll be able to further improve our storage resiliency very soon.

Another point is the private cloud infrastructure on which we run the application boxes of our customers’ freistilbox clusters. By adding more system automation, we’re going to minimise the time it takes us to spin up replacement boxes when that becomes necessary, especially during an outage.

Again, we sincerely apologise to all our customers affected by this outage and thank them for their continued trust.

freistilbox comes to DrupalCamp North East

In terms of Drupal events, there is no summer break; the best example being the DrupalCamping going on in Wolfsburg at the moment. I’m so sad that my schedule doesn’t allow me to be there and camp with my German Drupal friends!

Fortunately, I get to attend DrupalCamp North East in Sunderland next weekend. I’m very much looking forward to flying over to the UK for the third time this year because I enjoy the Drupal community there as much as the ones in Germany and Ireland.

Since community is one of our core values at freistil IT, we try to participate in these events as actively as possible. I’m proud to announce that my session proposal about “DevOps with Drupal” has been accepted, and I’ll do my very best to explain how embedding development in operations and vice versa can greatly improve working with Drupal.

If you’re also going to be at DrupalCamp NE next weekend, give me a shout via Twitter! I’ll happily arrange sharing a few drinks and great news about our new Partner Programme!

Why you need an ops team and how you can get it for free

If you’re the type of customer we love the most, you’re a Drupal or WordPress shop that builds amazing websites. This requires great developers and these developers tend to know a thing or two about web infrastructure. So, why not have them also run the hosting of the websites they know best?

Let me tell you why not. Why I think that that’s a really bad idea that can quickly lead you to lose track of your main business goal, which is — remember — building amazing websites.

The world of web operations

Running a website that serves a lot of users is far from trivial. There are a lot of IT topics that need to be covered in order to build and operate an application that…

  • …reliably and quickly delivers the information the user needs (= performance),
  • …can cope with a steadily (or even exponentially!) growing user base (= scalability),
  • …and is robust enough that smaller incidents (e.g. disk failure, network partitions) will not cause it to be inaccessible (= availability).

I found a detailed overview of all the important issues that an operations engineer needs to address in Mathias Meyer’s blog post [Web Operations 101 For Developers](http://www.paperplanes.de/2011/7/25/web_operations_101_for_developers.html). It’s a long post and I highly recommend reading it in full (after you’ve finished this article).

Managing infrastructure

Every business relies on some kind of infrastructure. If you were a transport business, you’d rely on infrastructure like highways, gas stations and warehouses. Your business is based on web applications, so you rely on IT infrastructure like networks and server racks, operating systems and software applications.

Getting some kind of hosting infrastructure is easy. It’s just a few clicks over at Amazon Web Services or DigitalOcean. But in his article, Mathias points out the catch:

“Every little piece of it can break at any time, can stall at any time. The more pieces you have in your application puzzle, the more breaking points you have. And everything that can break, will break.”

Someone needs to manage this IT infrastructure. This could be you or someone from your team, or it could be someone you hire specifically for that task. And keeping stuff running requires know-how and experience:

“You don’t need to know everything about every piece of hardware out there, but you should be able to investigate strengths and weaknesses, when an SSD is an appropriate tool to use, and when SAS drives will kick butt. Learn to distinguish the different levels of RAID, why having an additional file system buffer on top of a RAID that doesn’t have a backup battery for its own internal write buffer is a bad idea. That’s a pretty good start, and will make decisions much easier.”

I’d say that’s quite a laundry list of insight that doesn’t come by just reading some manuals. And that’s only the hardware aspect – Mathias also details a separate list for the operating system level.

Is this how you want to spend valuable engineering time?

Managing incidents

There will come the time when stuff hits the fan.

You should be willing to dig into whatever data you have posthumous to find whatever went wrong, whatever caused a strange latency spike in database queries, or caused an unusually high amount of errors in your application.

Troubleshooting and incident response are a special area of expertise that requires both deep knowledge and experience to find and eliminate the problem’s root causes.

Is this how you want to spend valuable engineering time?

Managing automation

Deploying your application to a single server is easy, and it’s actually not that much more demanding to use version control software like Git or even a deployment tool like Capistrano. But how about deploying a new app version to 5 or 15 servers? What if that new version alters the database schema, making it incompatible with older versions, so all servers need to be updated at the same time instead of sequentially?
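To make the difference between these two rollout styles concrete, here is a minimal Python sketch. The `plan_rollout` function and the server names are hypothetical and only plan the order of updates; they don’t represent any particular deployment tool.

```python
def plan_rollout(servers, schema_change, batch_size=1):
    """Group servers into deployment batches.

    Without a schema change, servers are updated in small batches so that
    the remaining ones keep serving traffic. With an incompatible schema
    change, all servers end up in one batch and are updated together.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    servers = list(servers)
    if not servers:
        return []
    if schema_change:
        return [servers]
    return [servers[i:i + batch_size] for i in range(0, len(servers), batch_size)]


servers = [f"app{n}" for n in range(1, 6)]  # five hypothetical app servers

# Compatible release: roll out two servers at a time.
print(plan_rollout(servers, schema_change=False, batch_size=2))

# Incompatible schema change: everything has to switch at once.
print(plan_rollout(servers, schema_change=True))
```

The planning is the easy part. Actually updating every server in a batch at the same moment, and rolling back cleanly if one of them fails, is where the real effort goes.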

As Mathias points out in his post, you need automation:

“There’s an abundance of tools available to automate infrastructure, hand-written script are only the simplest part of it. Once you go beyond managing just one or two servers, tools like Chef, Puppet and MCollective come in very handy to automate everything from setting up bare servers to pushing out configuration changes from a single point, to deploying code.”

But before you will be able to benefit from the high efficiency these tools offer, you need to learn how they work and how you describe to them the infrastructure you want them to build.

Is this how you want to spend valuable engineering time?

Managing growth

Over its lifetime, your web application will probably become more complex and with it the IT infrastructure required to support it. You’ll add a caching service here, a key-value database there – want a PHP extension with that? All these add-ons need to be installed, configured and fine-tuned.

Whenever you add a new component, a new feature to an application, you add a new point of failure.

Complex systems tend to break in very interesting ways, so troubleshooting will also become more difficult as your application grows.

Is this how you want to spend valuable engineering time?

Managing health

Only by monitoring the current status of your hosting components and recording metrics about their performance over time can you make decisions when things start to behave strangely or, better yet, before they do so.

I can’t say it enough how important having a proper monitoring and metrics gathering system in place is. It should be by your side from day one of any testing deployment.

So you’ll soon decide to get some monitoring software and a metrics collection service in place. But that’s just the start:

“You’ll never get alerting and thresholds right the first time, you’ll adapt over time, identifying false negatives and false positives, but if you don’t have a system in place at all, you’ll never know what hit your application or your servers.”
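One common way to reduce false positives is to alert only when a metric stays above its threshold for several samples in a row. The following Python sketch illustrates the idea; the function name, the latency values and the threshold are assumptions made up for this example, not settings from a real monitoring system.

```python
def should_alert(samples, threshold, consecutive=3):
    """Return True if the last `consecutive` samples all exceed `threshold`.

    Requiring several breaches in a row filters out short spikes, at the
    price of reacting a little later to a real problem.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < consecutive:
        return False
    return all(value > threshold for value in samples[-consecutive:])


# Hypothetical database query latencies in milliseconds.
latency_ms = [120, 130, 900, 125, 870, 910, 950]

# A single spike in the first four samples does not trigger an alert.
print(should_alert(latency_ms[:4], threshold=500))

# Three breaches in a row at the end of the series do.
print(should_alert(latency_ms, threshold=500))
```

Both `threshold` and `consecutive` are exactly the kind of values you will have to adjust over time: set them too low and you get woken up for nothing, too high and you miss the real incident.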

Is this how you want to spend valuable engineering time?

Managing logs

Probably every service in your hosting infrastructure writes some kind of log where it saves details about the things it does and events that happen. That’s very useful:

“In case of an emergency, a good set of log files will mean the world to you. This doesn’t just include the standard set of log files available on a Unix system. It includes your application and all services involved too.”

But each service will log its own kind of details in its individual format, sometimes as a text file, sometimes in a database. It takes a lot of time to learn how to find and understand the relevant stories buried in thousands of lines of text scattered over different sources.
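As a small illustration of what that means in practice, the following Python sketch brings two differently formatted logs into one timeline. Both log formats, the field names and the sample lines are invented for this example; real services each need their own parsing rules.

```python
import re
from datetime import datetime

# Two made-up log formats, for illustration only:
#   web: "2014-08-03T12:34:05 500 /node/1"
#   app: "[2014-08-03 12:34:01] ERROR: database connection lost"
WEB_LINE = re.compile(r"^(?P<ts>\S+) (?P<status>\d{3}) (?P<path>\S+)$")
APP_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+): (?P<msg>.*)$")


def parse_line(line):
    """Turn one log line into a common record, or None if it is unknown."""
    line = line.strip()
    m = WEB_LINE.match(line)
    if m:
        level = "ERROR" if m["status"].startswith("5") else "INFO"
        return {
            "time": datetime.fromisoformat(m["ts"]),
            "source": "web",
            "level": level,
            "message": f'{m["status"]} {m["path"]}',
        }
    m = APP_LINE.match(line)
    if m:
        return {
            "time": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S"),
            "source": "app",
            "level": m["level"],
            "message": m["msg"],
        }
    return None


def merge_logs(*sources):
    """Parse all lines from all sources and sort the records by time."""
    events = [e for lines in sources for e in map(parse_line, lines) if e]
    return sorted(events, key=lambda e: e["time"])


web_log = ["2014-08-03T12:34:05 500 /node/1", "2014-08-03T12:33:59 200 /"]
app_log = ["[2014-08-03 12:34:01] ERROR: database connection lost", "garbage"]

for event in merge_logs(web_log, app_log):
    print(event["time"], event["source"], event["level"], event["message"])
```

Even in this toy version, every new service means another pattern to write and maintain, and unparseable lines are silently dropped, which is something a real setup would want to count and report.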

Is this how you want to spend valuable engineering time?

Managing failure

Failure will happen. All the time.

The bottom line of everything is, stuff breaks, everything breaks at different scale. Embrace breakage and failure, it will help you learn and improve your knowledge and skill set over time.

In our experience, failures almost always lead to better insight, improved skills and a more robust hosting infrastructure. But:

Is this how you want to spend valuable engineering time?

Stay on course

The answer is No. No, you most certainly don’t want to spend valuable engineering time on doing all these daily IT operations tasks. They tend to get more and more expensive over time, and, more importantly, they distract you from your core business.

Behind freistilbox, there’s a team of IT experts that know how to manage a growing business-critical infrastructure. We take care of all daily (and nightly) operations tasks, handle incidents and make sure that your website runs with optimal performance.

By fully managing your hosting platform, we enable you to keep a laser-like focus on your mission: building amazing websites.

That’s how you should spend every second of valuable engineering time.

How you can do DevOps without an ops team

Better yet, we’re available to you like an in-house ops team, via phone, email and chat; with our Premium Support, you can even reach us 24/7.

  • Got a question about HTTP caching headers? We’ll explain them to you over the phone.
  • You need help in optimising a database query? Send us a support request and we’ll work out a solution.
  • You’d like us to keep an eye on our servers while you launch your new website? We’ll set up a chat room where you get instant answers and live updates on how your hosting platform is keeping up.

This is much more than just technical support, it’s decades of IT know-how at your fingertips during the whole life cycle of your web application. And it’s included for free in all our hosting packages.

freistilbox is not only high-performance web hosting, it’s DevOps done right.

Changelog: Fully writeable shell user homes

The “Changelog” is a new category in our blog where we publish important changes to freistilbox infrastructure and functionality.

Each freistilbox cluster comes with its own “shell node” that customers access via SSH to run maintenance tasks like mysqldump or drush. In order to make it easy to access the right website instance, each one has its own user account.

So far, the interactive use of these user accounts was severely limited by tight write restrictions on the user home directory.

In a change we’ve rolled out this week, we’ve replaced the old instance directories with homes to which the shell user has full write access. This solves the problems that many customers experienced when they tried to store configuration files or to create arbitrary files and subdirectories.

Together with all the symlinks to important website directories, the work subdirectory that we used to create as a workaround for the previous write restrictions has been automatically moved to the new shell user home directory. Apart from the full write permissions, everything should look and function exactly as it used to.

Enjoy!
