Incident Report – 11th April 2017

April 12, 2017 12:09 pm

Yesterday, on the 11th April, we experienced a significant outage that affected the majority of our systems.
This resulted in customers being unable to log in or access their systems.

The cause of this outage was not related to our own systems or actions but was instead caused by a major incident at the datacenter where our servers were located.

Below is the initial status update from our hosting provider and here is their later post mortem.

Multiple redundant power sources in our SFO2 datacenter failed temporarily causing a large number of Droplets and core systems, including networking, to go down. We have restored power and networking, and are currently working to bring all affected Droplets safely online.
Following the full power recovery, we started to experience delays in event processing. Our engineering team was able to isolate the root cause. We will update the status page once all events are processing without delay and publish further details to our blog once we’ve conducted a full post mortem.

So what happened? (non technical version)

Basically a combination of power outages occurred in the data center, which simultaneously affected both the primary and backup systems.

Datacenters always have backup power systems in place which can take over automatically without the servers noticing the change.
If either the primary or backup power had failed, all would have been fine, but both together meant things switched off, cutting the connection to our servers.

What followed was a domino effect, where this outage created issues with various key components needed to keep the servers running and accessible to internet users around the world.
These additional issues meant that once the power outages were sorted, there was still a lot of work for the datacenter staff to do to get everything online.

Why did it affect us? (non technical version)

Like the datacenter, our systems were designed with backups so that if any server crashed things could continue to operate as normal.

However, this outage did more than crash a single server. It took out thousands of servers, basically an entire region's operations for our hosting provider.
So again, like the datacenter's primary and backup power systems, this outage took out both our primary and backup servers at the same time.

What did we do during the incident? (non technical version)

From the moment that the first alerts arrived we had all hands on deck, initially trying to diagnose the issue and later trying to resolve it.

Unfortunately, since the issue was outside of our systems and beyond our control, there was nothing we could do about the underlying problem.

However, we immediately reviewed our alternatives for getting customers back online as quickly as possible.

Our preferred strategy in a situation such as this would be to redirect everyone to our backup systems and instantly have everything operational. However, since those were also inaccessible, this was not an option.

The next alternative was to use our offsite backups to restore access instead. This required setting up new servers in regions that were operating normally and transferring hundreds of gigabytes of data to them.

Once this was ready we had a dilemma: the data came from a daily backup, so it was not fully up to date and could be up to 24 hours behind.
Taking this live would have caused serious complications with any bookings that had arrived since the backup was taken, which we wanted to avoid.
Rather than take these servers live straight away, we instead offered customers who were in communication with us during the incident a read-only version of their systems, which would at least allow them to view and check all of their booking data.

Fortunately, our hosting provider resolved their issues and our systems came back online, meaning our last resort of switching to the older backup was not necessary.

What have we done since?

Expanded hosting and backups across 2 more continents
We are now running 3 independent backup systems on 3 different continents, meaning that only the largest of catastrophic international disasters could affect us in the way that this regional outage did.

Getting these 2 new regional systems operational within 24 hours means that we are ready now in case disaster hits again immediately, although there is still some fine tuning and automation to implement. Over the upcoming weeks we will be carefully testing and improving these systems.

Launched a status page
Our new status page is available to help keep users informed during any future issues that may arise, big or small.

You can subscribe to status updates by clicking the Subscribe button at the bottom of the page.

This status page runs on a completely different hosting provider, in a different country from all of our other servers, so it should remain accessible no matter what other issues we experience.

In addition, our helpdesk also runs on separate infrastructure and includes a direct live chat connection to our support teams.

Revised our internal infrastructure strategies
This incident highlighted shortcomings in our previous systems' ability to handle the unexpected, and in our disaster recovery programs.

While we did have solid backup processes in place, we could have done more to avoid this and we intend to going forward.

Once our new regions have been upgraded to support automated failover, we will run regular tests to ensure that entire servers and regions can fail without any system outage or disruption to our users. We will of course do this in ways that ensure no customer data is ever at risk of loss.
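
For the technically curious, the idea behind the failover tests described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual implementation: the region names, the `Region` class and both functions are hypothetical, and a real test would run live health checks rather than use a simple flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str      # hypothetical region label
    healthy: bool  # outcome of a health check, supplied by the caller


def pick_active_region(regions: list[Region]) -> Optional[Region]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in regions:
        if region.healthy:
            return region
    return None


def survives_single_region_failure(regions: list[Region]) -> bool:
    """Simulate each region failing in turn and check that another can take over."""
    for failed in regions:
        # Copy the list with the chosen region marked as down.
        simulated = [
            Region(r.name, r.healthy and r.name != failed.name) for r in regions
        ]
        if pick_active_region(simulated) is None:
            return False
    return True


# Hypothetical setup: one primary region plus two independent backups.
regions = [
    Region("primary", True),
    Region("backup-a", True),
    Region("backup-b", True),
]
print(pick_active_region(regions).name)
print(survives_single_region_failure(regions))
```

With a single region the second check fails, which is the situation a multi-region setup is meant to avoid.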