Monday 5 June 2017

Why BA should care about IT

20:42 Posted by G 1 comment


Having been in the eye of the storm of the BA IT systems failure last weekend, and only getting away on holiday two days after we should have, I think there are lots of things to learn.


I think what most struck me about the outage was the sheer size of it.  Upon arriving at Heathrow Terminal 5 on Saturday morning with the extended family, all excited about a week’s holiday in Greece, we were met with huge queues outside T5, and at that stage it looked like a baggage or check-in problem. But over the course of the next hour it quickly became clear how severe the outage was.  Not only were check-in systems not working, but the departures information boards had been stuck since 9.30 am.  Even when we got to the gate, which turned out to be the wrong one, there were planes on stand waiting to push back, more aircraft waiting for a gate, and flight crew equally confused.  When we did finally get on board an aircraft, the pilot informed us that the flight planning systems weren’t working, so he couldn’t create a flight plan and therefore couldn’t work out the correct amount of fuel to put on board, and without that he was unwilling to push back off the stand.  Even when we got the news (first via the BBC) that all flights were cancelled, the pilot told us that even the system to cancel flights wasn’t working.  This meant that getting buses to take us back to the terminal took a long time, followed by the ignominy of having to go back through passport control having not left the airport, let alone the country.

From an IT perspective there are a few interesting aspects.  Firstly, BA have claimed this to be a power-related incident, which is an interesting cause.  As far as I’m aware no other companies were impacted by this outage, which strongly suggests it was not in a shared (co-located) data centre, as otherwise we’d have seen other outages.  It also implies that BA aren’t running in the cloud, as we saw no cloud outages over the weekend.

Secondly, assuming this was a dedicated BA data centre, there has been a major failure of resiliency.  I would normally expect any decent-quality data centre to have battery backup providing power in the immediate aftermath of a power failure.  As soon as a power failure is detected, diesel generators should kick in to provide longer-term power.  Normally the batteries sit in-line with the external power to smooth the supply and provide instant protection if the external feed fails.  At this level of criticality it would be normal to have two diverse, and ideally separate, power suppliers.  The diesel generators are some of the most loved engines in the world: they are often encased in permanently warmed enclosures to keep them at the correct operating temperature, and quite often the diesel they consume is pre-warmed as well.  The fuel is also often stored in two different locations, so that if one supply gets contaminated there is a secondary supply that can still be used. These engines are often over a million pounds each, and at some sites I’ve seen them run with N+N redundancy (if four generators are needed there are eight on site) to cope with 100% failure.  Clearly, as a customer you pay more for this level of redundancy, but as we saw over the weekend, you never want to be on the wrong end of an incident like this.

In addition to having all this redundancy built into a data centre, it’s vital that all these components are regularly tested.  It’s normal for data centres to test the battery backup and run up the generators at least once a month to ensure all the hardware and processes work as they should in an emergency.

Once you’re inside the data centre, all the racks (where servers are housed) are typically dual-powered from different backup batteries and power supplies, and each server is then dual-powered to further protect against individual failures.  In total there are six layers of redundancy between the power coming into the data centre and the actual server (redundant power suppliers, redundant battery backup, redundant generators, dual power feeds to the rack, dual power supplies to the server, and redundant power supplies in the server itself).
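To get a feel for why all those layers matter, here is a minimal back-of-the-envelope sketch in Python.  The failure probabilities are made-up illustrative numbers, not real data centre figures, and the calculation assumes each component fails independently of the others:

```python
# Back-of-the-envelope sketch with made-up, illustrative failure probabilities
# (not real data centre figures). Each layer in the power chain is duplicated,
# so a layer only fails if both of its copies fail, assuming independent failures.

layer_failure_prob = {          # chance a single component fails in a year
    "external power supplier": 0.05,
    "battery backup (UPS)": 0.02,
    "diesel generator": 0.03,
    "power feed to the rack": 0.01,
    "power supply to the server": 0.01,
    "power supply in the server": 0.01,
}

# With two independent copies, a whole layer fails only if both copies fail.
layer_outage_prob = {name: p * p for name, p in layer_failure_prob.items()}

# The server loses power if any one layer fails outright.
p_no_outage = 1.0
for p in layer_outage_prob.values():
    p_no_outage *= 1.0 - p

print(f"Chance of the server losing power in a year: {1.0 - p_no_outage:.2%}")
```

The numbers aren’t the point; the independence assumption is.  Redundancy like this protects against individual components failing on their own, which is why, as described below, it normally takes something catastrophic affecting a whole site to cause a widespread outage.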

As you can see, in theory it’s pretty difficult to have a serious power failure.  While it’s possible to have a serious failure in part of a power supply system, it would be highly unusual for it to be service-impacting.

However, as we saw at the weekend, something catastrophic must have happened to produce such a widespread outage, and one that seems to have affected BA globally.

Even outside of pure power redundancy, most large corporations will have redundancy built into individual systems, whether within the same data centre or in a secondary site (ideally both).  At the more sophisticated sites, these are often what’s known as active-active: the service runs in both sites at the same time, so if there’s a failure in one server or site the service keeps running with degraded capacity (the application may appear slower to users), but it is still available.
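To illustrate what active-active means from a client’s point of view, here is a minimal sketch in Python.  The site URLs and the status endpoint are hypothetical, and in a real deployment this switching is normally handled by load balancers or DNS rather than hard-coded in application code:

```python
# Minimal sketch of surviving a single-site failure from the client's point of
# view. The URLs below are hypothetical; real setups usually put a load
# balancer or global DNS in front rather than hard-coding sites like this.
import urllib.request

SITES = [
    "https://site-a.example.com/status",  # hypothetical primary data centre
    "https://site-b.example.com/status",  # hypothetical secondary data centre
]

def fetch_from_any_site(sites=SITES, timeout=2):
    """Try each site in turn; return the first successful response body."""
    for url in sites:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:
            # This site is down or unreachable: degrade to the next one
            # instead of failing the whole request.
            continue
    raise RuntimeError("all sites are unavailable")
```

The key property is that losing a site costs capacity, not availability, which is the ‘degraded but still available’ behaviour described above.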

Most companies will spend at least seven-figure sums annually to run with this level of redundancy and will test it regularly (most regulators insist on at least every two years).  For an outage of this scale, and with this many systems failing, it would appear that either the appropriate level of redundancy wasn’t in place or it hadn’t been tested regularly enough.

It’s worth pointing out that all the measures mentioned above are expensive, painful to test, and do little to add to the company’s bottom line.  They are exactly the sort of ‘insurance’ you hope never to rely on, but having thorough, well-tested plans makes all the difference when this sort of event happens.

There have been lots of reports in the UK press, and comments from unions, saying this event reflects BA outsourcing its IT services to a third party.  I’m not sure whether outsourcing had any impact on the outage, but the mere fact that BA outsource their IT suggests they do not perceive IT to be a core function, as they’ve asked someone else to do it on their behalf.

You may have read many IT articles about Uber being the biggest taxi company while owning no taxis, and Airbnb being the biggest hotel chain while owning no hotels.  Both are clearly technology companies rather than traditional taxi or hotel vendors, and with such a reliance on technology they would be expected to have highly resilient systems that are regularly tested.

BA, however, doesn’t fit that model.  Their biggest expense wouldn’t be IT; they probably spend significantly more on aircraft, fuel, staff and so on.  Yet when I thought about it, their main systemic risk probably is IT.  If any one model of aircraft were grounded for some reason, they fly a range of planes in their fleet, so this would be impactful but not catastrophic.  Similarly, if one of the unions that some of their staff belong to goes on strike (as we’ve seen in the past), it’s annoying but not critical.  The same could probably be said for their food or fuel vendors, who probably vary around the world, so BA can most likely work around any individual failure.

Not so with IT: it appears that one power failure in one data centre had the ability to completely cripple one of the biggest airlines in the world.  I cannot believe that BA would have knowingly accepted this risk and chosen to run with it.

In the ever more digital world we live in, every company is slowly turning into a technology company.  Maybe not front-facing, but even in a traditional industry such as aviation, where aircraft hardware will always be key, this weekend proved that you can have all the planes in the world, but if the tech isn’t there to support them, you’ve got no business.



1 comment:

  1. Thanks for the article, I think you hit the nail on the head: ‘main systemic risk probably is IT.’ Yes, yes it is; for so many companies it is taken for granted. And outsourcing is the problem when it is done to pass the buck. It takes serious resources to manage contracted work; that is something we used to know, but some seem to imagine tech is different. It isn’t.

