freistilbox Blog

Newer articles « Page 16 of 17 » Older articles

Beer and coding on the island

Much too quickly, the Drupal Dev Days in Dublin were over. About 150 people had come to do code sprints, pub tours and participate in talks about project management, Drupal internals and — looking at myself — building virtual development environments. (More on that later.)

We were lucky that the Dublin Institute of Technology provided the conference venue for free. Its building in Aungier Street was easy to reach and had more than enough space for us. Thanks, DIT!

freistil IT supported the Dev Days as Gold Sponsor and we’re very happy with our engagement. freistilbox was featured prominently both at the conference venue and at the Odeon where we had the freistilbox Party with drinks and barbecue on Saturday night. The organization team managed to answer all our questions quickly and always kept us up-to-date on what was relevant for us as a sponsor.

We participated in the Job Speed Dating on Friday afternoon, but unfortunately, there were only a handful of interested applicants. Actually, there were quite obviously more companies looking for people than people looking for jobs. But that doesn’t make the idea of quickly connecting job seekers with relevant businesses a bad one. On the contrary, we hope that more conferences are going to add such an event to their programme!

The session schedule on Saturday and Sunday was diverse and interesting. (I have to say that it made me sad to see attendance dropping heavily on Sunday, probably because of the freistilbox Party the night before.) Together with Steven Jones and Marcus Deglos, I gave an introduction to easy VM management with Vagrant. In my part, I highlighted how automation tools like Chef simplify setting up an individual system configuration from scratch that can then be replicated exactly and within minutes.

Since I’m going to move to Ireland next month, I was doubly happy to visit the green island for Drupal Dev Days. I enjoyed my stay very much and would like to thank everyone for making it a great Drupal community event! Special thanks go to the organization team for their tremendous efforts. Togha oibre!

Let's share and party at the Drupal Dev Days!

We're sponsoring DevDays Dublin

We’re very busy preparing for the Drupal Dev Days in Dublin in a few weeks! Aside from supporting the event as GOLD sponsor, we’ll be giving a talk about automating your development environment. And with **freistilbox Solo**, we’ll also be presenting the first release of our standalone development environment for freistilbox users!

When, a few days ago, the event organizers asked us if we were interested in also sponsoring the social event, we happily said yes! So, be our guests and join the freistilbox Party on Saturday night!

Will you be at the Drupal Dev Days and would like to learn more about freistilbox? Please drop us a line and we’ll make sure to talk to you!

New feature: SSL offloading

On many hosting platforms, including our own DrupalCONCEPT, secure traffic that is encrypted via SSL has to be handled directly by the web server. This not only puts additional computing load on those servers, it also prevents HTTP caching which means less responsiveness. To speed up the delivery of static page assets, some customers choose to use mixed mode, i.e. deliver these assets via HTTP even if the page is requested via SSL. But because this workaround can cause sensitive data to be transferred in an insecure way, it is not a practice we recommend.

For freistilbox, we eliminated this shortcoming! If you want to add SSL encryption to a website hosted on freistilbox, we have a great feature for you: SSL offloading. This means that SSL packets are decrypted the moment they reach our freistilbox infrastructure. The content of these SSL packets is then passed on to the next system layers as plain HTTP requests. This has several advantages.

First, content caching works both for plain HTTP and for SSL traffic. Since the Varnish cache proxy is located between the SSL offloading layer and your freistilboxes, it can store static assets and even pages regardless of encryption. You really don’t need to unsettle your visitors with those mixed content browser warnings.

The second benefit of SSL offloading is made obvious by its name: Your web application servers don’t have to use precious computing resources for decrypting requests and encrypting responses. Our hosting platform takes complete care of that. (As usual with freistilbox, I can’t resist adding.)
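One practical consequence of offloading is that the web application only ever sees plain HTTP, so it can't tell from the connection itself whether the visitor came in via SSL. The details depend on the setup. The following Python sketch merely assumes the widely used convention that the offloading layer adds an `X-Forwarded-Proto` header; it is an illustration, not a description of the freistilbox configuration, and `original_scheme` is a hypothetical helper name.

```python
def original_scheme(headers, default="http"):
    """Return the scheme the visitor used, as reported by the offloading proxy.

    Assumption: the proxy sets X-Forwarded-Proto (a common convention,
    not something confirmed for any particular platform).
    """
    # Header names are case-insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in headers.items()}
    value = normalised.get("x-forwarded-proto", default)
    # With chained proxies the header may hold a comma-separated list;
    # the first entry describes the connection the visitor made.
    return value.split(",")[0].strip().lower()


print(original_scheme({"X-Forwarded-Proto": "https"}))
print(original_scheme({"Host": "example.com"}))
```

An application should only trust such a header when requests can reach it exclusively through the offloading layer, because a client could otherwise set the header itself.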

So go on, make your website more secure and enable SSL! You’ll find everything you need to set up SSL in our online documentation.

How we're reducing the impact of network issues

Our freistilbox hosting platform is built from the ground up with high availability in mind. In order to minimize the impact of failures, every backend service (i.e. each MySQL database, each Apache Solr core etc.) is running on at least two servers. And if you run your website on more than a single freistilbox, you’re in good shape on the web application level, too.

Redundancy alone doesn’t guarantee maximum uptime, though. Recently, we had to deal with various kinds of network problems ranging from minor packet loss to a full loss of external connectivity. While we can’t prevent datacenter staff from mistakenly shutting down our IP addresses on the routing level, we realized that we needed to make our infrastructure more resilient against other, more common, network issues.

We found that even minor network congestion, often caused by high traffic to or from a neighboring server of another datacenter customer, could seriously impact requests from our web boxes to backend services. The reason is that, on a box doing hundreds or even thousands of database requests per second, increases of only a few milliseconds in network latency add up quickly. This can impact operation to the extent that the box becomes incapable of serving new incoming requests because all of its web server processes are busy waiting for their data.
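To get a feeling for how quickly those milliseconds add up, here is a rough back-of-the-envelope sketch in Python. All numbers are made up for illustration and are not measurements from our platform. It applies Little's law (average number of busy processes = request rate × time each request spends waiting) and only counts time spent waiting for the backend, ignoring CPU time.

```python
def busy_workers(requests_per_second, backend_calls_per_request, latency_ms):
    """Average number of web server processes that are waiting on the backend.

    Little's law: items in the system = arrival rate * time in the system.
    Assumes the backend calls of one request happen one after another.
    """
    wait_seconds = backend_calls_per_request * latency_ms / 1000.0
    return requests_per_second * wait_seconds


# Hypothetical box: 40 page requests per second, 50 database queries per page,
# i.e. 2000 database requests per second.
for latency_ms in (1, 3, 6):
    workers = busy_workers(40, 50, latency_ms)
    print(f"{latency_ms} ms per query -> about {workers:.0f} processes waiting")
```

In this model the number of waiting processes grows in direct proportion to the latency, so a box with a fixed process limit can go from comfortable to saturated with only a few extra milliseconds per query.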

This problem would be even more severe if, instead of leasing bare-metal servers, we were using cloud-based infrastructure where we can’t even influence with whom we’re sharing a VM host. The Drupal experts at 2bits even make this recommendation to VPS users:

> When you encounter variable performance or poor performance, before wasting time on troubleshooting that may not lead anywhere, it is worthwhile to contact your host, and ask for your VPS to be moved to a different physical server. Doing so most likely will solve the issue, since you effectively have a different set of housemates.

With IaaS vendors like Amazon, that would mean replacing your server instances with others on a trial-and-error basis. What a pain.

To minimize the impact of network performance degradation on our hosting infrastructure, we’ve started three improvement projects:

  • Optimize request distribution at the loadbalancer level.
  • Build our own CDN.
  • Move our servers into dedicated racks.

We have already finished project 1. A loadbalancer needs to distribute HTTP requests to those backend boxes that have the necessary resources and are responsive. Boxes that are maxed out or do not respond for other reasons become ineligible. We recently optimized the health checks that our loadbalancers use to determine which boxes are ready to receive requests. Now, a box only gets passed HTTP requests if it has proved itself to be stable by successfully responding to a continuous series of health checks.
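The idea behind this eligibility rule can be sketched in a few lines of Python. This is a simplified illustration, not our actual loadbalancer configuration: the class name `BoxHealth` and the threshold of three checks are hypothetical, and a single failed check resets the streak, which is stricter than many real setups.

```python
class BoxHealth:
    """Tracks one box; it is eligible only after `rise` consecutive passed checks."""

    def __init__(self, rise=3):
        self.rise = rise  # hypothetical threshold, chosen for illustration
        self.consecutive_passes = 0

    def record(self, check_passed):
        # A passed check extends the streak, a failed check resets it.
        if check_passed:
            self.consecutive_passes += 1
        else:
            self.consecutive_passes = 0

    @property
    def eligible(self):
        return self.consecutive_passes >= self.rise


box = BoxHealth(rise=3)
for result in (True, True, False, True, True, True):
    box.record(result)
    print("check passed:", result, "-> eligible:", box.eligible)
```

Proxy software typically exposes comparable thresholds as configuration settings (HAProxy, for example, has `rise` and `fall` parameters), so in practice this logic is configured rather than written by hand.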

One reason boxes become unresponsive is that their backend requests get stuck on the network. And since we don’t control the network layer, we instead chose to minimize our dependency on it. That’s why, in project 2, we’re building our own Content Delivery Network. We’re going to cover this topic in another blog post, so stay tuned!

Where we still need to rely on the communication with backend services (for example, with database clusters), we need to make this communication more robust. That’s the goal of project 3. We are going to move our servers into our own racks where they share a direct network connection only with each other, not with other datacenter customers. This dedicated network connection makes data transfers between our servers faster, more reliable and more secure.

These are only the most prominent of all the changes that we’re making day in, day out to improve the performance and availability of our freistilbox hosting platform. And although the quality of our services is growing steadily, our prices aren’t. So, if you know someone who’s looking for a hosting service that reduces their IT headaches without breaking the bank, please tell them about us!

And if you’d like to help us improve our next-generation managed hosting, join the team!

Our recent loadbalancer outage

On Wednesday night, we experienced a massive loadbalancer outage that affected a huge part of the websites that we are hosting. I’d like to take the time to explain what went wrong, and what consequences this incident will have on how we build our IT infrastructure with our partners.

Context

We use loadbalancers to distribute incoming requests from website visitors to the right web application servers. In our case, these loadbalancers are Linux servers running HTTP proxy software like HAProxy and nginx. Of course, we have redundancy for machines of this importance, so every loadbalancer configuration always runs on a pair of machines. In the case of an outage, caused for example by a hardware failure, we can switch the routing of the loadbalancer’s IP addresses to the spare machine which immediately starts distributing incoming requests. While we can switch these IP addresses between servers, from a billing perspective they are permanently associated with one single server.

Because of our rapidly growing freistilbox infrastructure, we recently decided to replace the oldest loadbalancer pair with much more powerful hardware after three years of operation. This loadbalancer is responsible for routing a big part of the incoming traffic to our DrupalCONCEPT and freistilbox clusters at our datacenter partner Hetzner AG.

In preparation for the hardware upgrade, we built the first node of the new loadbalancer pair and switched the routing of all of the old loadbalancer’s IP addresses to this new machine a few days in advance. This switch happened overnight and there was no service interruption. We were pleased to see that the new server managed all incoming requests with a mere 2% of its CPU power.

Now we had to upgrade the old LB server with which all the loadbalancer IP addresses were associated. For network architecture reasons, the new machine needed to physically replace the old one and on Tuesday, 2013-03-26, at about 14:30 UTC, Hetzner datacenter staff swapped the servers. Since web traffic was already handled by the other new loadbalancer node, the replacement procedure had no impact on website operation.

We only found a seemingly small issue after the upgrade. The IP addresses now associated with the new server were not yet displayed on the datacenter management web interface. Their routing was obviously working and all websites were reachable, so no emergency measures seemed necessary. We sent a support request to the datacenter, though, asking why the address list had vanished.

To make sure that loadbalancer operation was not in danger, we followed up with a call to Hetzner support at 16:07 UTC. The support agent told us that the subnets were still associated with the server and our customer account and that we’d get feedback from backoffice support the following day.

The outage

In the night, at 00:16 UTC on 2013-03-27, our monitoring system suddenly started sending IP Address down alerts. A lot of alerts, actually. It quickly became clear that all IP addresses associated with the new loadbalancer had gone down. Which meant that many websites had become unreachable. Our on-call engineer immediately sent a support request to the datacenter. He also tried to get direct information from Hetzner support via phone but was asked to wait for an email response. Another inquiry attempt about 15 minutes later was cut short, too.

When we still didn’t have any feedback at 01:30, we called Hetzner again to emphasize the severity of this outage. We were told that their network team did not have a night shift presence at the datacenter and that the network engineer on call had not responded yet. We demanded to have the issue escalated to highest priority and to be kept in the loop about any progress. The support agent confirmed that he’d make sure that we’d get feedback within a few minutes.

Still waiting for feedback at 01:59 UTC, we were relieved to see the first recovery notifications from our monitoring system. One of the missing subnets was even displayed again in the datacenter web UI.

But there were a lot of addresses that were still down, so we called Hetzner support again at 02:18. The agent, sounding clearly annoyed, stated that he had already sent an email response that all addresses were active again and that if there were problems remaining, they were probably caused by our system configuration. Not accepting this simplistic explanation, we told the agent that we’d prepare a list of the addresses that were still down so Hetzner could actually check them.

While collecting this information, we realized that only the first quarter of the biggest IP subnet on the loadbalancer was online again. We contacted Hetzner again, indicating that they had probably used a wrong prefix or subnet mask while reconfiguring the routing. A few minutes later, at 02:54, our monitoring sent us recovery notifications for all remaining addresses.
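A wrong prefix length produces exactly this symptom. The following Python sketch uses a made-up subnet from the documentation address range (not one of our real subnets, and the actual prefix lengths involved are not given here) to show how two extra prefix bits leave only the first quarter of a network routed.

```python
import ipaddress

# Hypothetical subnet from the documentation range, for illustration only.
intended = ipaddress.ip_network("192.0.2.0/24")
# Two extra prefix bits shrink the routed range to a quarter of its size.
misconfigured = ipaddress.ip_network("192.0.2.0/26")


def is_routed(address, network):
    """Return True if `address` falls inside `network`."""
    return ipaddress.ip_address(address) in network


print(intended.num_addresses, "addresses intended,",
      misconfigured.num_addresses, "actually covered")
for address in ("192.0.2.10", "192.0.2.200"):
    print(address,
          "| in intended range:", is_routed(address, intended),
          "| in misconfigured range:", is_routed(address, misconfigured))
```

Addresses in the lower quarter are covered by both ranges, which is why part of the subnet recovered while the rest stayed down.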

Root cause analysis

First thing in the morning, we contacted our Hetzner sales contact, gave them our timeline of the outage and asked for an explanation of what had happened. It turned out that we were right with our concerns about the vanished address list: When the contract for the old server was terminated after it got replaced, its IP addresses got canceled with it. Then, in the night, an automatic deprovisioning process removed them from the routing tables.

Where we go from here

Our sales contact at Hetzner apologized sincerely for this clerical error and a day later notified us that they had added a safety step to their cancelation process. Now, the person doing the contract change gets a warning message that asks them, when in doubt, to confirm with sales whether an upgraded server’s address list should be canceled with it.

This outage could have been prevented completely if either our support request about the IP addresses missing in the web UI had been handled earlier or if the support agent that we spoke to on Tuesday afternoon had realized that the addresses had actually been canceled with the old server.

The loadbalancer downtime would also have been much shorter if the on-call network engineer at Hetzner had acted more quickly and had then taken more care in reconfiguring the routing and making sure that all IP addresses were reachable again. We find it especially unacceptable that the support agent we spoke to tried to pass the buck to us and that we had to prove that service restoration had indeed not been executed properly.

That’s why we chose to escalate this incident to Hetzner’s CEO. We also asked for a personal meeting with the managers responsible for datacenter and support operations to discuss how we can cooperate more effectively. We haven’t yet heard back from Hetzner on this request and will check back with them in a few days.

Even though we had executed every step of our loadbalancer upgrade with diligence and tried to make sure that there was no impact on website operation at any time, we suffered a significant outage. This shows how dependent we are on our IT partners, their processes and staff and we’re going to put more effort into making sure that the companies with which we partner align with our values and goals towards service quality. Additionally, on a technological level, we’re discussing how we can increase the availability of our customers’ websites further by spreading our infrastructure out over multiple IT infrastructure providers.

In closing, I apologize sincerely for this outage. We were lucky that it happened at a time when its impact on website visitors was low but it was 2.5 hours of downtime nonetheless. This is unacceptable for a company that promises its customers that they won’t have to worry about their hosting in any way. We are making every effort to prevent such an outage from ever happening again.

Jochen Lillich, founder and IT architect, freistil IT
