Apr 12, 2017

Incident report — 11th April 2017

A datacenter outage took us offline. Here is what happened and what we did about it.

Yesterday, 11th April, Transporters.io experienced a significant outage that affected the majority of our systems. As a result, customers were unable to log in or access their Transporters.io system.

This outage was not caused by our own systems or actions, but by a major incident at the datacenter where our servers were located.

Below is the initial status update from our hosting provider:

“Multiple redundant power sources in our SFO2 datacenter failed temporarily, causing a large number of Droplets and core systems, including networking, to go down.”

So what happened? (non-technical version)

Basically, a combination of power outages occurred in the datacenter, simultaneously affecting both the primary and backup power systems.

Datacenters always have backup power systems in place which can take over automatically without the servers noticing the change. If either the primary or backup power had failed, all would have been fine — but both together meant things switched off, cutting the connection to our servers.

What followed was a domino effect: the outage created issues with various key components that keep the servers running and accessible to internet users around the world. These additional issues meant that even once the power was restored, the datacenter staff still had a lot of work to do to get everything back online.

Why did it affect Transporters.io?

Like the datacenter, our systems were designed with backups so that if any server crashed things could continue to operate as normal.

However this outage did more than crash a single server — it took out thousands of servers, basically an entire region's operations for our hosting provider. So again, like the datacenter's primary and backup power systems, this outage took out both our primary and backup servers at the same time.

What did we do during the incident?

From the moment that the first alerts arrived we had all hands on deck, initially trying to diagnose the issue and later trying to resolve it.

Unfortunately, since the issue was outside our systems and beyond our control, there was nothing we could do about the underlying problem.

However, we immediately reviewed what alternatives we had to get customers back online as quickly as possible.

Our preferred strategy in a situation like this would be to redirect everyone to our backup systems and have everything operational again almost instantly. However, since those systems were also inaccessible, this was not an option.

The next alternative was to restore access from our offsite backups instead. This required setting up new servers in regions that were operating normally and transferring hundreds of gigabytes of data to them.
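To give a feel for what that restore involved, here is a minimal sketch of the process. The host names, file paths, and database commands are illustrative assumptions; the post does not describe our actual tooling or stack.

```python
#!/usr/bin/env python3
"""Rough sketch of restoring an offsite backup onto a fresh server.

Hosts, paths, and the database command are hypothetical examples only.
"""
import subprocess

OFFSITE_HOST = "backups.example.com"             # hypothetical offsite backup host
NEW_SERVER = "recovery-01.example.com"           # fresh server in an unaffected region
LATEST_DUMP = "/backups/nightly/latest.sql.gz"   # assumed nightly database dump

def run(cmd):
    """Run a shell command and stop immediately if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Copy the most recent offsite dump onto the new server.
run(["scp", f"{OFFSITE_HOST}:{LATEST_DUMP}", "/tmp/latest.sql.gz"])
run(["scp", "/tmp/latest.sql.gz", f"{NEW_SERVER}:/tmp/latest.sql.gz"])

# 2. Load the dump into the new server's database (command is an assumption).
run(["ssh", NEW_SERVER, "gunzip -c /tmp/latest.sql.gz | mysql transporters"])
```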

Once this was ready we faced a dilemma: the data came from a daily backup, so it was potentially up to 24 hours out of date. Taking it live would cause serious complications with any bookings that had arrived since the backup was taken, which we wanted to avoid. Rather than take these servers live straight away, we offered customers we were in communication with during the incident a read-only version of their systems, which at least allowed them to view and check all of their booking data.
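The idea behind the read-only version is simple: serve pages as normal but refuse any request that would change booking data while the restored copy is stale. Our application stack isn't described here, so the sketch below assumes a Python WSGI app and a deploy-time flag; it is an illustration of the approach, not our implementation.

```python
"""Minimal read-only switch, assuming a Python WSGI application."""
import os

READ_ONLY = os.environ.get("READ_ONLY_MODE") == "1"  # assumed deploy-time flag

def read_only_middleware(app):
    """Wrap a WSGI app; block writes (POST/PUT/PATCH/DELETE) in read-only mode."""
    def wrapper(environ, start_response):
        if READ_ONLY and environ["REQUEST_METHOD"] not in ("GET", "HEAD"):
            body = b"System is temporarily read-only during incident recovery."
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
            return [body]
        return app(environ, start_response)
    return wrapper
```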

Fortunately, our hosting provider resolved the issues and our systems came back online, meaning our last resort of switching to the older backup was not necessary.

What have we done since?

Expanded hosting and backups across 2 more continents. We are now running 3 independent backup systems on 3 different continents, meaning that only the most catastrophic of international disasters could affect us in the way this regional outage did.
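To illustrate what keeping three independent copies looks like in practice, here is a sketch of pushing one nightly backup to three geographically separate storage endpoints. The endpoints, bucket names, and the use of S3-compatible storage are assumptions for the example; only the "three continents" part comes from this post.

```python
"""Copy a nightly backup to three independent regions (illustrative only)."""
import os
import boto3

BACKUP_FILE = "/backups/nightly/latest.sql.gz"   # assumed nightly dump
KEY = os.path.basename(BACKUP_FILE)

# Three independent, geographically separate storage endpoints (hypothetical).
TARGETS = [
    {"endpoint": "https://storage.eu.example.com",   "bucket": "transporters-backups"},
    {"endpoint": "https://storage.us.example.com",   "bucket": "transporters-backups"},
    {"endpoint": "https://storage.apac.example.com", "bucket": "transporters-backups"},
]

for target in TARGETS:
    s3 = boto3.client(
        "s3",
        endpoint_url=target["endpoint"],
        aws_access_key_id=os.environ["BACKUP_ACCESS_KEY"],
        aws_secret_access_key=os.environ["BACKUP_SECRET_KEY"],
    )
    s3.upload_file(BACKUP_FILE, target["bucket"], KEY)
    print(f"Copied {KEY} to {target['endpoint']}")
```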

Launched a status page to keep users informed during any future issues. Our new status page runs on a completely different hosting provider, in a different country from all of our other servers, so it should remain accessible no matter what other issues we experience.

Revised our internal infrastructure strategies. This incident highlighted shortcomings in our previous systems' ability to handle the unexpected and in our disaster recovery procedures. Once our new regions have been upgraded to support automated failover, we will run regular tests to ensure that entire servers and regions can fail without our users experiencing any outage or disruption.
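For the curious, the kind of check that sits behind automated failover can be sketched very simply: probe each region's health endpoint and, if the primary stops responding, promote a healthy backup. The endpoints and the promote_region step below are hypothetical placeholders; the post only commits to regular failover testing once the automation is in place.

```python
"""Sketch of a periodic region health check driving failover (illustrative)."""
import requests

REGIONS = {
    "primary":  "https://primary.example.com/health",
    "backup-1": "https://backup1.example.com/health",
    "backup-2": "https://backup2.example.com/health",
}

def is_healthy(url: str) -> bool:
    """A region counts as healthy if its health endpoint answers 200 quickly."""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def promote_region(name: str) -> None:
    """Placeholder: point DNS / load balancing at the named region."""
    print(f"Would promote {name} to primary (DNS/LB update goes here).")

status = {name: is_healthy(url) for name, url in REGIONS.items()}
print(status)

if not status["primary"]:
    # Fail over to the first healthy backup region.
    for name in ("backup-1", "backup-2"):
        if status[name]:
            promote_region(name)
            break
```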
