freistilbox Blog


New feature: SSL offloading

On many hosting platforms, including our own DrupalCONCEPT, secure traffic that is encrypted via SSL has to be handled directly by the web server. This not only puts additional computing load on those servers, it also prevents HTTP caching, which reduces responsiveness. To speed up the delivery of static page assets, some customers choose mixed mode, i.e. they deliver these assets via plain HTTP even if the page is requested via SSL. But because this workaround can cause sensitive data to be transferred insecurely, it is not a practice we recommend.

For freistilbox, we eliminated this shortcoming! If you want to add SSL encryption to a website hosted on freistilbox, we have a great feature for you: SSL offloading. This means that SSL packets are decrypted the moment they reach our freistilbox infrastructure. The content of these SSL packets is then passed on to the next system layers as plain HTTP requests. This has several advantages.

First, content caching works both for plain HTTP and for SSL traffic. Since the Varnish cache proxy is located between the SSL offloading layer and your freistilboxes, it can store static assets and even pages regardless of encryption. You really don’t need to unsettle your visitors with those mixed content browser warnings.

The second benefit of SSL offloading is made obvious by its name: Your web application servers don’t have to use precious computing resources for decrypting requests and encrypting responses. Our hosting platform takes complete care of that. (As usual with freistilbox, I can’t resist adding.)
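Here’s a rough sketch of the pattern (not our actual configuration, and all names and paths are illustrative): an SSL offloader, nginx in this example, terminates the encrypted connection and passes plain HTTP on to the caching layer.

    # Terminate SSL at the edge, forward plain HTTP to the Varnish layer
    upstream varnish_backend {
        server 10.0.0.10:80;
        server 10.0.0.11:80;
    }

    server {
        listen 443 ssl;
        server_name www.example.com;

        ssl_certificate     /etc/ssl/example.com.crt;
        ssl_certificate_key /etc/ssl/example.com.key;

        location / {
            proxy_pass http://varnish_backend;
            # Let the application know the original request was encrypted
            proxy_set_header X-Forwarded-Proto https;
            proxy_set_header Host $host;
        }
    }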

So go on, make your website more secure and enable SSL! You’ll find everything you need to set up SSL in our online documentation.

How we're reducing the impact of network issues

Our freistilbox hosting platform is built from the ground up with high availability in mind. In order to minimize the impact of failures, every backend service (each MySQL database, each Apache Solr core, etc.) runs on at least two servers. And if you run your website on more than a single freistilbox, you’re in good shape on the web application level, too.

Redundancy alone doesn’t guarantee maximum uptime, though. Recently, we had to deal with various kinds of network problems ranging from minor packet loss to a full loss of external connectivity. While we can’t prevent datacenter staff from mistakenly shutting down our IP addresses on the routing level, we realized that we needed to make our infrastructure more resilient against other, more common, network issues.

We found that even minor network congestion, often caused by high traffic to or from a neighboring server belonging to another datacenter customer, could seriously impact requests from our web boxes to backend services. The reason is that on a box making hundreds or even thousands of database requests per second, increases of only a few milliseconds in network latency add up quickly. This can impact operation to the point where the box becomes incapable of serving new incoming requests because it fills up with web server processes waiting for their data.
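For a back-of-the-envelope illustration (with made-up numbers): a page that triggers 100 database queries picks up half a second of extra wait time when per-query latency rises by just 5 ms. At 100 requests per second, that alone keeps roughly 50 additional web server processes occupied with nothing but waiting, so a fixed-size worker pool saturates quickly.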

This problem would be even more severe if, instead of leasing bare-metal servers, we were using cloud-based infrastructure where we can’t even influence with whom we’re sharing a VM host. The Drupal experts at 2bits even make this recommendation to VPS users: “When you encounter variable performance or poor performance, before wasting time on troubleshooting that may not lead anywhere, it is worthwhile to contact your host, and ask for your VPS to be moved to a different physical server. Doing so most likely will solve the issue, since you effectively have a different set of housemates.”

With IaaS vendors like Amazon, that would mean replacing your server instances with others on a trial-and-error basis. What a pain.

To minimize the impact of network performance degradation on our hosting infrastructure, we’ve started three improvement projects:

  • Optimize request distribution at the loadbalancer level.
  • Build our own CDN.
  • Move our servers into dedicated racks.

We have already finished project 1. A loadbalancer needs to distribute HTTP requests to those backend boxes that have the necessary resources and are responsive. Boxes that are maxed out or don’t respond for other reasons become ineligible. We recently optimized the health checks that our loadbalancers use to determine which boxes are ready to receive requests. Now, a box only gets passed HTTP requests after it has proven itself stable by successfully responding to a continuous series of health checks.
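In HAProxy terms, this “rise only after a series of successful checks” behavior looks roughly like the following sketch (the endpoint, addresses and check parameters are illustrative, not our production settings):

    backend web_boxes
        option httpchk GET /health
        # A box re-enters rotation only after 5 consecutive successful
        # checks ("rise 5") and leaves it after 2 failed ones ("fall 2").
        server box1 10.0.0.21:80 check inter 2s rise 5 fall 2
        server box2 10.0.0.22:80 check inter 2s rise 5 fall 2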

One cause of boxes becoming unresponsive is that their backend requests get stuck on the network. And since we don’t control the network layer, we instead chose to minimize our dependency on it. That’s why, in project 2, we’re building our own Content Delivery Network. We’re going to cover this topic in another blog post, so stay tuned!

Where we still need to rely on the communication with backend services (for example, with database clusters), we need to make this communication more robust. That’s the goal of project 3. We are going to move our servers into our own racks where they share a direct network connection only with each other, not with other datacenter customers. This dedicated network connection makes data transfers between our servers faster, more reliable and more secure.

These are only the most prominent of the many changes we’re making day in, day out to improve the performance and availability of our freistilbox hosting platform. And although the quality of our services keeps growing, our prices don’t. So, if you know someone who’s looking for a hosting service that reduces their IT headaches without breaking the bank, please tell them about us!

And if you’d like to help us improve our next-generation managed hosting, join the team!

Our recent loadbalancer outage

On Wednesday night, we experienced a massive loadbalancer outage that affected a large portion of the websites we host. I’d like to take the time to explain what went wrong and what consequences this incident will have on how we build our IT infrastructure with our partners.

Context

We use loadbalancers to distribute incoming requests from website visitors to the right web application servers. In our case, these loadbalancers are Linux servers running HTTP proxy software like HAProxy and nginx. Of course, we have redundancy for machines of this importance, so every loadbalancer configuration always runs on a pair of machines. In the case of an outage, caused for example by a hardware failure, we can switch the routing of the loadbalancer’s IP addresses to the spare machine, which immediately starts distributing incoming requests. While we can switch these IP addresses between servers, from a billing perspective they are permanently associated with one single server.
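The exact failover mechanism doesn’t matter for this story, but as a generic illustration of the pattern, a keepalived VRRP configuration like this minimal sketch (all values made up) moves a shared service IP to the standby machine as soon as the primary stops announcing it:

    vrrp_instance LB_VIP {
        state MASTER          # the standby runs the same block with "state BACKUP"
        interface eth0
        virtual_router_id 51
        priority 101          # the standby gets a lower priority, e.g. 100
        advert_int 1
        virtual_ipaddress {
            203.0.113.10/32   # the loadbalancer's public service address
        }
    }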

Because of our rapidly growing freistilbox infrastructure, we recently decided to replace the oldest loadbalancer pair with much more powerful hardware after three years of operation. This loadbalancer is responsible for routing a big part of the incoming traffic to our DrupalCONCEPT and freistilbox clusters at our datacenter partner Hetzner AG.

In preparation for the hardware upgrade, we first built the first node of the new loadbalancer pair and, a few days in advance, switched the routing of all of the old loadbalancer’s IP addresses to this new machine. This switch happened overnight, and there was no service interruption. We were pleased to see that the new server managed all incoming requests with a mere 2% of its CPU power.

Next, we had to upgrade the old LB server with which all the loadbalancer IP addresses were associated. For network architecture reasons, the new machine needed to physically replace the old one, and on Tuesday, 2013-03-26, at about 14:30 UTC, Hetzner datacenter staff swapped the servers. Since web traffic was already being handled by the other new loadbalancer node, the replacement procedure had no impact on website operation.

We found only one seemingly small issue after the upgrade: the IP addresses now associated with the new server were not yet displayed in the datacenter management web interface. Their routing was obviously working and all websites were reachable, so no emergency measures seemed necessary. We sent a support request to the datacenter, though, asking why the address list had vanished.

To make sure that loadbalancer operation was not in danger, we followed up with a call to Hetzner support at 16:07 UTC. The support agent told us that the subnets were still associated with the server and our customer account and that we’d get feedback from backoffice support the following day.

The outage

In the night, at 00:16 UTC on 2013-03-27, our monitoring system suddenly started sending “IP address down” alerts. A lot of alerts, actually. It quickly became clear that all IP addresses associated with the new loadbalancer had gone down, which meant that many websites had become unreachable. Our on-call engineer immediately sent a support request to the datacenter. He also tried to get direct information from Hetzner support via phone but was asked to wait for an email response. Another inquiry attempt about 15 minutes later was cut short, too.

When we still didn’t have any feedback at 01:30, we called Hetzner again to emphasize the severity of this outage. We were told that their network team did not have a night shift presence at the datacenter and that the network engineer on call had not responded yet. We demanded to have the issue escalated to highest priority and to be kept in the loop about any progress. The support agent confirmed that he’d make sure that we’d get feedback within a few minutes.

Still waiting for feedback at 01:59 UTC, we were relieved to see the first recovery notifications from our monitoring system. One of the missing subnets was even displayed again in the datacenter web UI.

But there were a lot of addresses that were still down, so we called Hetzner support again at 02:18. The agent, sounding clearly annoyed, stated that he had already sent an email response that all addresses were active again and that if there were problems remaining, they were probably caused by our system configuration. Not accepting this simplistic explanation, we told the agent that we’d prepare a list of the addresses that were still down so Hetzner could actually check them.

While collecting this information, we realized that only the first quarter of the biggest IP subnet on the loadbalancer was online again. We contacted Hetzner again, indicating that they had probably used a wrong prefix length or subnet mask while reconfiguring the routing; a /24 mistakenly routed as a /26, for example, would bring back only the first 64 of its 256 addresses. A few minutes later, at 02:54, our monitoring sent us recovery notifications for all remaining addresses.

Root cause analysis

First thing in the morning, we contacted our Hetzner sales contact, gave them our timeline of the outage and asked for an explanation of what had happened. It turned out that we were right with our concerns about the vanished address list: when the contract for the old server was terminated after it got replaced, its IP addresses got canceled with it. Then, in the night, an automatic deprovisioning process removed them from the routing tables.

Where we go from here

Our sales contact at Hetzner apologized sincerely for this clerical error and notified us a day later that they had added a safeguard to their cancelation process. Now, the person making the contract change gets a warning message asking them, when in doubt, to confirm with sales whether an upgraded server’s address list should really be canceled along with it.

This outage could have been prevented completely if either our support request about the IP addresses missing from the web UI had been handled earlier, or if the support agent we spoke to on Tuesday afternoon had realized that the addresses had actually been canceled with the old server.

The loadbalancer downtime would also have been much shorter if the on-call network engineer at Hetzner had acted more quickly and had then taken more care in reconfiguring the routing and making sure that all IP addresses were reachable again. We find it especially unacceptable that the support agent we spoke to tried to pass the buck to us and that we had to prove that service restoration had indeed not been executed properly.

That’s why we chose to escalate this incident to Hetzner’s CEO. We also asked for a personal meeting with the managers responsible for datacenter and support operations to discuss how we can cooperate more effectively. We haven’t yet heard back from Hetzner on this request and will check back with them in a few days.

Even though we had executed every step of our loadbalancer upgrade with diligence and tried to make sure that there was no impact on website operation at any time, we suffered a significant outage. This shows how dependent we are on our IT partners, their processes and their staff, and we’re going to put more effort into making sure that the companies we partner with align with our values and goals regarding service quality. Additionally, on a technological level, we’re discussing how we can further increase the availability of our customers’ websites by spreading our infrastructure over multiple IT infrastructure providers.

In closing, I apologize sincerely for this outage. We were lucky that it happened at a time when its impact on website visitors was low, but it was 2.5 hours of downtime nonetheless. This is unacceptable for a company that promises its customers that they won’t have to worry about their hosting in any way. We are making every effort to prevent such an outage from ever happening again.

Jochen Lillich, founder and IT architect, freistil IT

Starting freistilbox delivery

As we mentioned in our review of 2012, we had to delay the delivery of our new freistilbox infrastructure because we encountered architectural problems. Today, we are happy to announce that after finding some good, long-term solutions, we’ve finally started the rollout of freistilbox clusters.

In this post, we’d like to explain what it was that threw sand in our gears and how we solved the problem.

Idea

On DrupalCONCEPT, we had many services sharing the resources of a server; in the case of DrupalCONCEPT POWER, we even had Git, Varnish, Apache, Solr and MySQL running on a single server. With time, we found that this put too many limitations on performance and scalability optimization. So we decided to run almost every freistilbox service on its own servers, resulting in a completely distributed architecture.

Implementation

From an operations point of view, freistilbox needs a lot more servers than DrupalCONCEPT: In the backend, there are clusters for Git, MySQL, Solr and file storage. Incoming requests are received by load balancers and SSL offloaders which route them to the customer’s freistilbox cluster. Each of these freistilbox clusters has two servers running Varnish and Memcached, a maintenance server for SSH/SFTP logins and cron jobs, and finally the actual boxes, i.e. the application servers running the web applications (Drupal, for example).
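In rough outline, a request travels through the platform like this:

    visitors
        │
    load balancers / SSL offloaders
        │
    freistilbox cluster:
        2 servers running Varnish and Memcached
        1 maintenance server (SSH/SFTP logins, cron jobs)
        n boxes (application servers running e.g. Drupal)
        │
    backend clusters: Git, MySQL, Solr, file storage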

First, these servers need to be provisioned. To make this easy, we’ve built a private cloud infrastructure that we operate on bare metal servers leased from our datacenter partners. Thanks to many years of experience with Chef and virtualization, we were able to implement this quite efficiently.

But what caused us a lot of headaches – and the embarrassing delay in delivery – was that these servers needed to be interconnected on the business process level. On a single DrupalCONCEPT server, it was easy for us to synchronize local processes, for example triggering a code deployment after receiving a Git repository update. On freistilbox, however, this synchronization needs to happen between servers. Let’s take the deployment process as an example again:

  • The customer pushes an update onto the Git server.
  • The Git server then needs to notify the application servers affected by the update.
  • Only these application servers then deploy the changes in parallel, which brings the update online.

At my previous jobs, I had experienced how quickly distributed technologies like CORBA can become complicated and costly, so we tried to find a simpler approach. To make a long story short: try as we might, it turned out that our simple approaches didn’t work as effectively or as reliably as we needed them to. Finally, we bit the bullet and solved the problem with a full-grown orchestration infrastructure based on MCollective.
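To give a flavor of the result (the agent name and fact filter are hypothetical, not our actual setup): with MCollective, the Git server can trigger the deployment action on exactly the affected application servers, in parallel, with a single RPC call.

    # Run the "update" action of a hypothetical "deploy" agent on all
    # boxes whose "cluster" fact matches the customer's cluster:
    mco rpc deploy update site=example.com --with-fact cluster=c0042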

Conclusion

We’re sad that this conceptual odyssey has cost us a lot of unplanned effort, time and, worst of all, customer trust. Apparently, we had to be reminded the hard way that it isn’t ideas that count but their execution. We won’t make this mistake again.

On the other hand, we’re very happy that we now have all the components in place that we need to build an awesome managed hosting platform.

To all customers who have been waiting for their freistilboxes: The wait is over. We appreciate your patience more than we can put in words, and we promise to make it worth your while.

Faster database restores

Until now, our backup strategy didn’t allow for an easy restore of single MySQL databases. Customers who needed their database restored sometimes had to endure one or two hours until their website was reset to its previous state. The reason was that we had chosen the backup method of taking a consistent snapshot of the complete MySQL server. So, in order to restore one or more single databases, we first had to restore this snapshot to a spare server, where we were then able to dump the single databases we actually needed.

We’re happy to announce that this weakness has been resolved! Our database backup now works in two phases:

  1. Write a database dump of every single database on the server. These dumps are stored locally in generations (daily/weekly/monthly); see the sketch after this list.
  2. Copy the dump files to our well-proven enterprise backup system.
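As a minimal sketch of phase 1 (paths, retention period and the daily-only rotation are illustrative, not our actual implementation):

    #!/bin/bash
    # Dump every database individually and keep daily generations.
    # Assumes MySQL credentials are provided e.g. via ~/.my.cnf.
    BACKUP_DIR=/var/backups/mysql
    TODAY=$(date +%F)
    mkdir -p "$BACKUP_DIR/daily/$TODAY"

    for db in $(mysql -N -e 'SHOW DATABASES' \
                | grep -Ev '^(information_schema|performance_schema|mysql)$'); do
        mysqldump --single-transaction "$db" \
            | gzip > "$BACKUP_DIR/daily/$TODAY/$db.sql.gz"
    done

    # Expire daily generations after a week; weekly and monthly
    # generations would be rotated the same way.
    find "$BACKUP_DIR/daily" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -r {} +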

With this new strategy, we store all database content at several ages and in multiple places, and we’re still able to quickly restore a single database right from its most recent dump file.

We’re optimistic that this change will significantly shorten the time-to-resolve of our customers’ database restore requests.
