British Airways (BA) passengers around the world faced cancelled flights and delays due to computer problems over the bank holiday weekend. Although flights have now returned to normal, the IT failure has caused BA's share price to fall, wiping half a billion pounds off the company's value. So, what can we learn from the IT failure?
1. Power problems happen.
We rely on mains power so much, we forget it’s not perfect. Mains power is designed for running light bulbs and motors, not computers. Power problems are surprisingly common and computer equipment doesn’t like them. It will cope with minor problems, but a big one can overwhelm the power supply and cause equipment failure or data corruption.
Protect your systems with a good-quality UPS that includes overvoltage protection. That means protecting all power supplies to all equipment, not just some of them.
A power surge through an unprotected supply can easily fry the equipment, even when other supplies are protected. Replacement equipment might take days or weeks to arrive, and will then take time to configure.
If you’re relying on the UPS to keep your equipment running during a power outage, make sure it has sufficient capacity to cover the time it will take for power to come back. That might be 30 minutes or several hours, depending on what went wrong. If your equipment must stay online, you should have a generator and fuel on site. Test the generator at least once a month to ensure it will work when you need it.
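As a rough guide to sizing that capacity, runtime is roughly usable battery energy divided by the connected load, discounted for inverter losses. The sketch below illustrates the arithmetic with purely illustrative figures (not from BA); for real planning, use your UPS vendor's runtime charts.

```python
# Rough UPS runtime estimate -- a minimal sketch with illustrative numbers.
# Real UPS runtime is non-linear at high loads; treat this as a first pass.

def ups_runtime_minutes(battery_wh, load_w, efficiency=0.9):
    """Estimate runtime in minutes for a given load.

    battery_wh -- usable battery capacity in watt-hours
    load_w     -- total connected load in watts
    efficiency -- assumed inverter efficiency (typically 0.85-0.95)
    """
    if load_w <= 0:
        raise ValueError("load must be positive")
    return battery_wh * efficiency / load_w * 60

# Example: a 1500 Wh UPS carrying a 500 W rack
print(round(ups_runtime_minutes(1500, 500)))  # -> 162 (minutes)
```

If the answer comes out at minutes rather than hours, that is your cue that a generator, not a bigger battery, is the right investment.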
2. Your DR system should be completely separated from your main system.
It should not be in the same data centre, and it certainly should not share the same networking hardware, as appears to have been the case at BA. This includes separate communication links and equipment.
3. You need an effective business continuity/disaster recovery (BC/DR) plan.
Always have an option C. A BC/DR plan should be tested regularly, and key staff (and possibly suppliers) should be familiar with it. Regular testing is what ensures both the plan and the people actually function effectively under pressure. It seems clear BA thought they had a BC/DR system and a plan – yet it failed completely.
Most BC/DR plans that fail do so due to human, not technical, factors. It’s possible that the problem at BA was compounded by third-party staff in India who were not familiar with the plan and did not follow it correctly.
4. Beware of over-converging systems.
For example, try to keep your email and telephone systems on different servers. BA had a bad situation made much worse because they lost a phone system. Converge by all means – but ensure you can get your phones back fast even if everything else has died. GMA have some great solutions to this problem which offer a choice of on-premise, hosted or cloud servers and the ability to recover your phone system rapidly.
5. Never put your website on your main system.
Have someone else run it on an entirely separate system, in a separate location. This is one thing BA did get right. Ensure you can edit the website easily even if your main system is down.
While you’re busy recovering your business, being able to publish a statement and advice on your website gives you an easy way of communicating with your customers. Communication is key to managing an incident like this, and it was the main thing BA customers complained about.
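One quick way to sanity-check that your website really is hosted separately is to confirm it does not resolve to the same addresses as your main system. The sketch below is a minimal illustration; the hostnames in the usage comment are hypothetical, and a shared load balancer or CDN can mask true hosting, so treat a passing result as a hint, not proof.

```python
# Check that two hostnames resolve to disjoint sets of IPv4 addresses --
# a minimal sketch of a "separate hosting" sanity check.
import socket

def resolves_to(host):
    """Return the set of IPv4 addresses a hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, 80, socket.AF_INET)}

def separately_hosted(website, main_system):
    """True if the two hostnames share no IPv4 addresses."""
    return resolves_to(website).isdisjoint(resolves_to(main_system))

# Usage (hypothetical hostnames):
# separately_hosted("www.example.com", "ops.example.com")
```

Running a check like this as part of your regular DR test catches the slow drift where "separate" systems quietly end up back on shared infrastructure.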
After the recent outages, outside experts have said the major vulnerability appears to be heavy reliance on a single system. This has left them questioning whether businesses like BA have a good enough strategy for their complex IT systems, and whether they test them frequently enough.
GMA can help you plan and deliver a BC/DR strategy for both your IT and your whole business. www.gmal.co.uk / 020 8778 7759