Fighting the 503 Server Error

We’re happy to move another entry on our new product roadmap to the Finished column: We’ve greatly improved the error handling on our load balancers.

Handling of application errors

Before this change, our load balancers delivered a terse 503 Server Error page for each and every condition that prevented the requested content from being delivered. Unfortunately, this included situations in which it wasn’t a part of the hosting platform that was failing but the web application itself. For example, if Drupal is put into maintenance mode or has issues connecting to its database, it delivers an error page with an HTTP error code 500 and an error message in the page body. But instead of delivering this page, our load balancers replaced it with their plain Server Error page. In other words, they made the issue worse by concealing its cause.

We’ve improved the load balancer configuration so that a 503 Server Error is now only displayed when there is no way of delivering useful content. If it’s just the application sending an error page, that page will be passed through to the visitor unchanged.
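To illustrate the new rule, here is a minimal sketch in Python. It is not our actual load balancer configuration; the names `Response`, `FALLBACK_503` and `deliver` are made up for this example. It only models the decision: pass the application's answer through if there is one, and fall back to the generic page otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    status: int
    body: str


# Hypothetical generic page the load balancer serves as a last resort.
FALLBACK_503 = Response(503, "503 Server Error")


def deliver(backend_response: Optional[Response]) -> Response:
    """Return what the visitor sees for one request.

    backend_response is None when no application box produced an answer.
    """
    if backend_response is None:
        # Nothing useful to deliver: only now show the generic error page.
        return FALLBACK_503
    # The application answered, so pass its page through unchanged,
    # even if its status code signals an error.
    return backend_response
```

With this rule, an application error page such as `Response(500, "Site under maintenance")` reaches the visitor as is, while `deliver(None)` still yields the 503 page.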

Trying everything to deliver

The most frequent cause of the dreaded 503 Server Error is that a load balancer has run out of healthy application boxes to which it can pass on incoming requests. Customers with only a single box were hit especially hard: they ran into this problem whenever that box got overloaded, even if only for a few seconds.

We’ve found a way to prevent ugly error messages even in this situation: A Varnish feature named grace mode allows us to keep content in the cache for a defined period of time after it has expired. If a request can neither be answered with fresh cache content nor be forwarded to any box, Varnish will now try to deliver recently expired cache content (at most 1 hour past its expiry time). Only if there is nothing left that can reasonably be delivered to the visitor will an error message come up.
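The order of these fallbacks can be sketched in a few lines of Python. This is a simplified model and not Varnish's real implementation or our VCL: `CachedObject`, `answer` and the `fetch` callback are hypothetical names, and `fetch` is assumed to return `None` when no healthy box is available. Only the one-hour grace period is taken from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

GRACE_SECONDS = 3600  # at most 1 hour past the expiry time


@dataclass
class CachedObject:
    body: str
    expires_at: float  # point in time (seconds) after which the object is stale


def answer(cached: Optional[CachedObject], now: float,
           fetch: Callable[[], Optional[str]]) -> Tuple[str, str]:
    """Return (source, body) for one request."""
    # 1. Fresh cache content wins.
    if cached is not None and now <= cached.expires_at:
        return "fresh", cached.body
    # 2. Otherwise try to forward the request to an application box.
    body = fetch()
    if body is not None:
        return "backend", body
    # 3. No box available: fall back to recently expired content.
    if cached is not None and now <= cached.expires_at + GRACE_SECONDS:
        return "stale", cached.body
    # 4. Nothing left to deliver.
    return "error", "503 Server Error"
```

For a page that expired 100 seconds ago while every box is down, `answer` returns the stale copy; once the page is more than an hour past its expiry time, the visitor gets the error page.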

Minimizing box downtime

We’ve also optimized the intervals at which our load balancers check whether the application boxes in a freistilbox cluster are healthy. An unresponsive box is now detected and taken out of the load balancing pool within only 5 seconds. Previously, the delay was about 15 seconds, so we’ve greatly reduced the number of failed load balancer requests. And boxes that have recovered are also taken back into the pool far more quickly, giving us a more stable load distribution.
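A quick back-of-the-envelope calculation in Python shows why the shorter detection delay matters. The request rate of 20 requests per second is a made-up figure for illustration, not a measurement, and the model simply assumes that every request routed to the unresponsive box during the detection delay fails.

```python
def requests_sent_to_dead_box(detection_delay_s: float,
                              requests_per_s: float) -> float:
    """Requests routed to an unresponsive box before it leaves the pool."""
    return detection_delay_s * requests_per_s


RATE = 20.0  # hypothetical requests per second sent to the failing box

before = requests_sent_to_dead_box(15.0, RATE)  # old delay: about 15 seconds
after = requests_sent_to_dead_box(5.0, RATE)    # new delay: 5 seconds

print(f"before: {before:.0f} failed requests, after: {after:.0f}")
```

Under these assumptions, cutting the delay from 15 to 5 seconds reduces the number of failed requests per outage to a third, whatever the actual request rate is.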

Looking at our monitoring metrics, we’re quite happy with the results of these changes. We see far fewer failing requests, fewer spikes in box usage and overall more stable website operation.

We’d love to hear from you: Are you experiencing a positive change in your application’s stability? Please let us know in the comments!