Keeping our cloud afloat during Hurricane Sandy

There’s irony in being a cloud provider when a major piece of your infrastructure lies below sea level. Of course, the premise behind the cloud is that you never have to think about the underlying physical location of your servers. If a flood were to happen in a critical network hub, say New York City, it should in theory have no impact on a cloud service (because, you know—clouds are in the sky).

As Hurricane Sandy forecasts came in, our ops team began to consider what impact it might have on our service. We had two options: A) keep our NYC servers as they were and hope that power and connectivity would remain intact, or B) take our NYC servers out of rotation and direct that traffic elsewhere.

These two options trade latency against reliability. Directing traffic best served by NYC across the continent would hurt latency and quality for some users, but would all but guarantee that the service as a whole continued to function.

We chose the conservative path and optimized for reliability. Expecting the storm to make landfall early Tuesday morning (October 30), we pulled all NYC servers from our normal configuration on Monday night (October 29) and redirected that traffic to our datacenter in Oakland. We are glad we did.
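As a rough illustration of what “pulling servers out of rotation” means in practice, here is a minimal sketch in Python. The region names, datacenter identifiers, and routing function are hypothetical and stand in for real traffic-steering configuration; they are not our actual tooling.

```python
# Hypothetical sketch: taking a datacenter out of rotation and
# redirecting its traffic to a fallback site. Illustrative only.

REGION_TO_DATACENTER = {
    "us-east": "nyc",
    "us-west": "oakland",
    "europe": "nyc",
}

# Datacenters temporarily pulled from rotation (e.g. ahead of the storm).
OUT_OF_ROTATION = {"nyc"}

FALLBACK_DATACENTER = "oakland"


def pick_datacenter(region: str) -> str:
    """Return the datacenter that should serve a request from `region`,
    falling back when the preferred site is out of rotation."""
    preferred = REGION_TO_DATACENTER.get(region, FALLBACK_DATACENTER)
    if preferred in OUT_OF_ROTATION:
        return FALLBACK_DATACENTER
    return preferred


if __name__ == "__main__":
    for region in ("us-east", "us-west", "europe"):
        print(region, "->", pick_datacenter(region))
```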

Hours after the switch, we received word from our datacenter facility that they had lost utility power, switched to generator power, and had 65 hours of fuel on hand. A half hour later, they told us that flooding had submerged the site’s diesel pumps, preventing fuel from being pumped into the generator, and that only 5 hours of fuel remained.

Even if power had remained intact, the backbone supplying network connectivity to NYC was unreliable. XO, AT&T, Telia, and Sprint all had frequent outages in the days following the storm, and each outage would have disrupted any live calls traveling over that backbone at the time.

We began a partial restoration of service on November 7, once our datacenter was back online with reliable power. The restoration was partial because we were still concerned about the reliability of the network backbone providers and only felt comfortable putting our web servers back into rotation: web traffic is less error-prone and easier to redirect elsewhere in case of failure, whereas media traffic is more adversely affected by network flaps. On November 12, confident that backbone connectivity had stabilized, we brought our media servers back online, at which point we were 100% back to our normal configuration.
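To make the phased restoration concrete, the same sketch can be extended to track rotation per service, so web traffic returns to a site before media traffic does. The service names and data structures here are illustrative assumptions, not our real configuration.

```python
# Hypothetical sketch of a phased restoration: rotation is tracked per
# service, so web traffic can return to a site while media traffic
# (more sensitive to network flaps) stays redirected elsewhere.

# Which services each datacenter is currently allowed to serve.
ROTATION = {
    "oakland": {"web", "media"},
    "nyc": set(),                 # pulled entirely ahead of the storm
}


def restore(datacenter: str, service: str) -> None:
    """Put one service at one datacenter back into rotation."""
    ROTATION[datacenter].add(service)


def serving(datacenter: str, service: str) -> bool:
    """True if `datacenter` is currently in rotation for `service`."""
    return service in ROTATION.get(datacenter, set())


if __name__ == "__main__":
    restore("nyc", "web")         # Nov 7: web servers only
    assert serving("nyc", "web") and not serving("nyc", "media")

    restore("nyc", "media")       # Nov 12: back to normal configuration
    assert serving("nyc", "media")
```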

We can’t say that the storm had no impact on our service, but we’re confident its effect was minimal. Because we were proactive and conservative in dealing with the storm, we had ample time to reconfigure our service before the power outage.