Our recent loadbalancer outage

On Wednesday night, we experienced a massive loadbalancer outage that affected a huge part of the websites that we are hosting. I’d like to take the time to explain what went wrong, and what consequences this incident will have on how we build our IT infrastructure with our partners.

Context

We use loadbalancers to distribute incoming requests from website visitors to the right web application servers. In our case, these loadbalancers are Linux servers running HTTP proxy software like HAProxy and nginx. Of course, we have redundancy for machines of this importance, so every loadbalancer configuration always runs on a pair of machines. In the case of an outage, caused for example by a hardware failure, we can switch the routing of the loadbalancer’s IP addresses to the spare machine which immediately starts distributing incoming requests. While we can switch these IP addresses between servers, from a billing perspective they are permanently associated with one single server.

Because of our rapidly growing freistilbox infrastructure, we recently decided to replace the oldest loadbalancer pair with much more powerful hardware after three years of operation. This loadbalancer is responsible for routing a big part of the incoming traffic to our DrupalCONCEPT and freistilbox clusters at our datacenter partner Hetzner AG.

In preparation of the hardware upgrade, we first built the first node of the new loadbalancer pair and switched the routing of all of the old loadbalancer’s IP addresses to this new machine a few days in advance. This switch happened over night and there was no service interruption. We were pleased to see that the new server managed all incoming requests with a mere 2% of its CPU power.

Now we had to upgrade the old LB server with which all the loadbalancer IP addresses were associated. For network architecture reasons, the new machine needed to physically replace the old one and on Tuesday, 2013-03-26, at about 14:30 UTC, Hetzner datacenter staff swapped the servers. Since web traffic was already handled by the other new loadbalancer node, the replacement procedure had no impact on website operation.

We only found a seemingly small issue after the upgrade. The IP addresses now associated with the new server were not yet displayed on the datacenter management web interface. Their routing was obviously working and all websites were reachable, so no emergency measures seemed necessary. We sent a support request to the datacenter, though, asking why the address list had vanished.

To make sure that loadbalancer operation was not in danger, we followed up with a call to Hetzner support at 16:07 UTC. The support agent told us that the subnets were still associated with the server and our customer account and that we’d get feedback from backoffice support the following day.

The outage

In the night, at 00:16 UTC on 2013-03-27, our monitoring system suddenly started sending IP Address down alerts. A lot of alerts, actually. It quickly became clear that all IP addresses associated with the new loadbalancer had gone down. Which meant that many websites had become unreachable. Our on-call engineer immediately sent a support request to the datacenter. He also tried to get direct information from Hetzner support via phone but was asked to wait for an email response. Another inquiry attempt about 15 minutes later was cut short, too.

When we still didn’t have any feedback at 01:30, we called Hetzner again to emphasize the severity of this outage. We were told that their network team did not have a night shift presence at the datacenter and that the network engineer on call had not responded yet. We demanded to have the issue escalated to highest priority and to be kept in the loop about any progress. The support agent confirmed that he’d make sure that we’d get feedback within a few minutes.

Still waiting for feedback at 01:59 UTC, we were relieved to see first recovery notifications from our monitoring system. One of the missing subnets even was displayed again in the datacenter web UI.

But there were a lot of addresses that were still down, so we called Hetzner support again at 02:18. The agent, sounding clearly annoyed, stated that he had already sent an email response that all addresses were active again and that if there were problems remaining, they were probably caused by our system configuration. Not accepting this simplistic explanation, we told the agent that we’d prepare a list of the addresses that were still down so Hetzner could actually check them.

While collecting this information, we realized that only the first quarter of the biggest IP subnet on the loadbalancer was online again. We contacted Hetzner again, indicating that they had probably used a wrong prefix or subnet mask while reconfiguring the routing. A few minutes later, at 02:54, our monitoring sent us recovery notifications for all remaining addresses.

Root cause analysis

First thing In the morning, we contacted our Hetzner sales contact, gave them our timeline of the outage and asked for an explanation for what had happened. It turns out that we were right with our concerns about the vanished address list: When the contract for the old server was terminated after it got replaced, its IP addresses got canceled with it. Then, in the night, an automatic deprovisioning process removed them from the routing tables.

Where we go from here

Our sales contact at Hetzner apologized sincerely for this clerical error and a day later notified us that they added a security step to their cancelation process. Now, the person doing the contract change gets a warning message that asks them to in doubt confirm with sales if an upgraded server’s address list should be canceled with it.

This outage could have been prevented completely if either our support request about the IP addresses missing in the web UI would have been handled earlier or if the support agent that we spoke to on Tuesday afternoon would have realized that the addresses had actually been canceled with the old server.

The loadbalancer downtime would also have been much shorter if the on-call network engineer at Hetzner had acted more quickly and then also had taken more care in reconfiguring the routing and making sure that all IP addresses were reachable again. We especially find it unacceptable that the support agent we spoke to tried to pass the buck to us and that we had to prove that service restoration had indeed not been executed properly.

That’s why we chose to escalate this incident to Hetzner’s CEO. We also asked for a personal meeting with the managers responsible for datacenter and support operations to discuss how we can cooperate more effectively. We haven’t yet heard back from Hetzner on this request and will check back with them in a few days.

Even though we had executed every step of our loadbalancer upgrade with diligence and tried to make sure that there was no impact on website operation at any time, we suffered a significant outage. This shows how dependent we are on our IT partners, their processes and staff and we’re going to put more effort into making sure that the companies with which we partner align with our values and goals towards service quality. Additionally, on a technological level, we’re discussing how we can increase the availability of our customers’ websites further by spreading our infrastructure out over multiple IT infrastructure providers.

In closing, I apologize sincerely for this outage. We were lucky that it happened at a time where its impact on website visitors was low but it was 2,5 hours of downtime nonetheless. This is unacceptable for a company that promises its customers that they won’t have to worry about their hosting in any way. We are making every effort to prevent such an outage from happening ever again.

Jochen Lillich, founder and IT architect, freistil IT