Wednesday 21 June 2017

Prodrive, Le Mans 2017 and Social Media done right

10:56 Posted by G No comments
For the first time in a few years, I managed to watch a reasonable amount of the Le Mans coverage this year.  I love a bit of endurance racing, and clearly to us Europeans Le Mans is the biggest race of the year.  I've still never been, something I need to rectify in the very near future!

I missed the actual finish, but via a few friends on social media I was aware of how exciting the finish was in the GTE Pro class (spoiler alert), with the Corvette and the Aston battling it out right to the very end.  I find it astonishing that after 23 hours 55 minutes and 2,800 miles of hard racing the cars were less than a second apart.

I saw this morning on Prodrive's Facebook page that they'd posted the last 5 minutes of the race, plus some post-race celebrations.  It makes for great watching (especially if you're an Aston fan!).

What I really liked though is the interaction between the Prodrive team and the punters on Facebook.  Clearly it must have been gut-wrenching for the unfortunate Corvette team, but I like the comment from the Prodrive team saying that they went and spoke to the Corvette team afterwards.  I like that it's a two-way conversation with the Facebook followers, and also that even in this highly competitive, big-budget world the team still have the humility to go and talk to their competitors.

That's the spirit that all racing should aim to emulate; F1 has much to learn, both as a racing spectacle and in terms of fan interaction.

Here's the video :

Monday 5 June 2017

Why BA should care about IT

20:42 Posted by G 1 comment

Having been in the eye of the storm of the BA IT systems failure last weekend, and only getting away on holiday two days after we should have, I think there are lots of things to learn.

I think what most struck me about the outage was the sheer size of it.  Upon arriving at Heathrow Terminal 5 on Saturday morning with the extended family, all excited about a week's holiday in Greece, we were met with huge queues outside T5, and at that stage it looked like a baggage or check-in problem.  But over the course of the next hour it quickly became clear how severe the outage was.  Not only were check-in systems not working, but the departures information boards had been stuck since 9.30 am.  Even when we got to the gate, which turned out to be the wrong one, there were planes on stand waiting to push back, more aircraft waiting for a gate, and flight crew equally confused.  When we did finally get on board an aircraft, the pilot informed us that the flight planning systems weren't working, so he couldn't create a flight plan, and therefore was unable to work out the correct amount of fuel to put on board, and without that he was unwilling to push back off the stand.  Even when we got the news (first via the BBC) that all flights were cancelled, the pilot told us even the system to cancel flights wasn't working.  This meant that getting buses to take us back to the terminal took a long time, followed by the ignominy of having to go back through passport control having not left the airport, let alone the country.

From an IT perspective there are a few interesting aspects.  Firstly, BA have claimed this to be a power-related incident.  This is an interesting cause.  As far as I'm aware no other companies were impacted by this outage, which strongly suggests that this was not in a shared (co-located) data centre, as otherwise we'd have seen other outages.  This also implies that BA aren't running in the cloud, as we saw no cloud outages over the weekend.  Secondly, assuming this was a dedicated BA data centre, then there's been a major failure of resiliency.  I would normally expect any decent quality data centre to have a battery backup to provide power in the immediate aftermath of a power failure.  As soon as a power failure is detected, diesel generators should kick in to provide longer-term power.  Normally batteries would sit in-line with the external power to smooth the supply and provide instant protection if the external power fails.  At this level of criticality it would be normal to have two diverse and ideally separate power suppliers.  The diesel generators are some of the most loved engines in the world; they are often encased in permanently warmed enclosures to keep them at the correct operating temperature.  Quite often the diesel they consume is pre-warmed as well, and it is often stored in two different locations to ensure that if one supply gets contaminated there's a secondary supply that can still be used.  These engines are often over a million pounds each, and in some sites I've seen them have n+n redundancy (if 4 generators are needed there are 8 on site) to deal with 100% failure.  Clearly as a customer you pay more to have this level of redundancy, but as we've seen over the weekend, you never want to suffer an incident like this.

In addition to having all this redundancy built into a data centre, it's vital that all these components are regularly tested.  It's normal for data centres to test battery back-up and run up the generators at least once a month to ensure all the hardware and processes work as they should in an emergency.

Once you’re inside the data centre, all the racks (where servers are housed) are typically dual powered from different backup batteries and power supplies, and then each server is dual powered to further protect against individual failures.  In total there are 6 layers of redundancy between power coming into the data centre and the actual server (redundant power suppliers, redundant battery back-up, redundant power generators, dual power to the rack, dual power supplies to the server, redundant power supplies in the server itself).
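To get a feel for why those layers matter, here's a back-of-the-envelope sketch.  The failure probabilities are purely illustrative assumptions (real components are also rarely fully independent), but it shows how pairing up each layer drives the overall chance of total power loss down dramatically:

```python
# Back-of-the-envelope: chance of total power loss to a server, assuming
# each layer is a redundant pair that only fails if BOTH halves fail.
# All numbers are made-up, illustrative annual failure probabilities.

def pair_failure(p: float) -> float:
    """Probability that both halves of an independent redundant pair fail."""
    return p * p

layers = {
    "external power supplier": 0.05,
    "battery back-up":         0.02,
    "diesel generator":        0.02,
    "power feed to rack":      0.01,
    "power feed to server":    0.01,
    "PSU in the server":       0.03,
}

# The layers sit in series: power is lost if ANY layer loses both halves.
p_all_layers_ok = 1.0
for p in layers.values():
    p_all_layers_ok *= (1 - pair_failure(p))
p_outage = 1 - p_all_layers_ok

print(f"Chance of total power loss to a server: {p_outage:.3%}")
```

Even with a pessimistic 5% chance of losing a single external supplier, the paired layers leave well under a 1% chance of the server losing power entirely, which is why a total outage points to something more systemic than one component failing.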

As you can see in theory it’s pretty difficult to have a serious power failure.  While it’s possible to have a serious failure in parts of a power supply system, it would be highly unusual for this to be service impacting.

However as we saw in the outage at the weekend something catastrophic must have happened to produce such a widespread outage, and one that seems to have affected BA globally.

Even outside of pure power redundancy most large corporations will have redundancy built into individual systems, be that within the same data centre or in a secondary site (ideally both).  For the more sophisticated sites, these are often what’s known as active-active, i.e. the service is running in both sites at the same time, so if there’s a failure in one server or site the service keeps running but with degraded capacity (the application may appear slower to users), however it is still available.
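The active-active idea can be sketched in a few lines of toy code (the site names and capacities here are entirely hypothetical, not anything BA actually runs): traffic is spread across every healthy site, so losing one site degrades capacity rather than taking the service down.

```python
# Toy sketch of active-active redundancy across two sites.
# Names and capacities are hypothetical, for illustration only.

class Site:
    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity  # requests/sec this site can absorb
        self.healthy = True

def total_capacity(sites) -> int:
    """Capacity the service can still offer across all healthy sites."""
    return sum(s.capacity for s in sites if s.healthy)

sites = [Site("primary-dc", 1000), Site("secondary-dc", 1000)]

assert total_capacity(sites) == 2000   # both sites serving traffic

sites[0].healthy = False               # primary data centre loses power
assert total_capacity(sites) == 1000   # degraded (slower), but still up
```

The key property is in the last two lines: a whole-site failure halves capacity, but the service stays available, which is exactly the behaviour that appeared to be missing over the weekend.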

Most companies will spend at least seven-figure sums annually running with this level of redundancy and will test it regularly (most regulators insist this is at least every two years).  It would appear, given the scale of this outage and the number of systems that failed, that either there wasn't the appropriate level of redundancy or it hadn't been tested regularly enough.

It’s worth pointing out that all the points mentioned above are expensive, painful to test, and do little to add to the bottom line of the company.  But it is just this sort of ‘insurance’ that you hope never to rely on, and having thorough and well-tested plans makes all the difference when this sort of event happens.

There have been lots of reports in the UK press, and comments from unions, saying this event is reflective of BA outsourcing its IT services to a third party.  I’m not sure whether outsourcing had any impact on the outage, but if BA do outsource their IT, the mere fact of doing so is an indication that they do not perceive IT to be a core function, as they’ve asked someone else to do it on their behalf.

You may have read many IT articles about Uber being the biggest taxi company while owning no taxis, and Airbnb being the biggest hotel chain while owning no hotels.  It’s clear that both of these are technology companies, not traditional taxi or hotel vendors, and with such a reliance on technology they would be expected to have very highly resilient systems that are regularly tested.

BA however doesn’t fit that model; their biggest expense wouldn’t be IT, as they probably spend significantly more on aircraft, fuel, staff etc.  However, when I thought about it, their main systemic risk probably is IT.  Because they use a range of planes in their fleet, if any one model of aircraft was grounded for any reason it would be impactful, but not catastrophic.  Similarly, a strike by one of the unions that some of their staff belong to (as we’ve seen in the past) is annoying but not critical.  The same could probably be said for their food or fuel vendors, who probably vary around the world, so if any one of them fails, BA can most likely work around an individual failure.

Not so with IT: it appears that one power failure in one data centre had the ability to completely cripple one of the biggest airlines in the world.  I cannot believe that BA would have actively known this risk and chosen to run with it.

In the ever more digital world we live in, every company is slowly turning into a technology company.  Maybe not front-facing, but even in a traditional industry such as aviation, where aircraft hardware will always be key, this weekend proved you can have all the planes in the world, but if the tech isn’t there to support them, you’ve got no business.