To develop our data centre services and help our clients achieve better continuity of service in their data centres, we decided a good place to start would be to analyse why data centres go wrong.
Information on this has been published before, but usually by manufacturers with a specific interest in justifying demand for their own products, or sometimes by users such as Google who are in no hurry to give much away about their own shortcomings. As a result, the available information is inconsistent and has no common reporting terminology.
Our approach was to round up all press articles published over the last thirty months in online trade journals around the world. Their sources, in turn, were very often the ‘status dashboards’ published by the data centre operators for their own customers.
Over that thirty-month period, 32 major failures were identified. We define a ‘major failure’ here as one that took down the entire data centre or at least made its main services unusable for one or more major customers.
With 32 failures in 30 months, this is just over one a month. But this covers only those incidents that were publicly announced. If we assume that this is only half of all incidents, then it means [quote]There is a major operational incident at a data centre about every two weeks.[/quote]
What does this represent in terms of failure rates? Taking a likely figure of about 1000 major data centres in the world, failures at a rate of about 24 per year give a chance of major failure of about 2.4%, or roughly one in forty, per year.
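The arithmetic behind these estimates can be sketched in a few lines of Python. This is only an illustration: the function name and parameters are hypothetical, and the inputs are the figures and assumptions stated above (32 reported failures in 30 months, half of all incidents reported, about 1000 major data centres). Without rounding, the result is about 25.6 failures per year and an annual risk of about 2.6%, which is in line with the rounded figures of 24 per year and 2.4% used in the text.

```python
def failure_stats(reported, months, reporting_fraction, population):
    """Estimate failure frequency from publicly reported incidents.

    reported           -- number of publicly reported major failures
    months             -- length of the survey period in months
    reporting_fraction -- assumed share of all incidents that get reported
    population         -- assumed number of major data centres worldwide
    """
    # Scale up to account for incidents that were never announced.
    estimated_total = reported / reporting_fraction
    # Convert to an annual rate.
    per_year = estimated_total * 12 / months
    # Average gap between incidents, in weeks.
    weeks_between = 52 / per_year
    # Chance that any one data centre has a major failure in a year.
    annual_risk = per_year / population
    return per_year, weeks_between, annual_risk


per_year, weeks_between, annual_risk = failure_stats(32, 30, 0.5, 1000)
print(f"Estimated failures per year: {per_year:.1f}")
print(f"Weeks between incidents:     {weeks_between:.1f}")
print(f"Annual risk per data centre: {annual_risk:.1%}")
```

Changing `reporting_fraction` or `population` shows how sensitive the one-in-forty figure is to those two assumptions.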
Another interesting figure to be gleaned from our survey is the average length of downtime after a major incident. Of the 32 failures we investigated, 24 reported actual outage times. [quote]The outages ranged from 30 minutes to 48 hours, with an average downtime of 14.7 hours per major incident.[/quote]
Not surprisingly, major incidents such as fire, flooding and total power loss take some time to put right.
The final area we looked at was the cause of the outage.
From our analysis [quote]Power problems are the main cause of failures at 31%.[/quote] This includes power failures that started with the utility supply but where the data centre's internal back-up system failed to respond correctly, for example generators failing to start for numerous reasons and multiple generators failing to synchronise.
Storm and flood damage came second at 22%, which should make data centre operators think carefully about location, building design and lightning protection.
Fire was the next most common cause. This includes fires within the data centre itself and centres taken out by fires in nearby buildings.
At least three quarters of the problems could have been avoided through better design and operational practices, not least proper testing, under load, of all the back-up power supply equipment.
Capitoline introduced one of the first data centre design training programmes (DCD) four years ago and has now trained over 300 companies worldwide. In response to the demand for better operational and management practices, Capitoline has just launched its unique Data Centre Operational Management Course, DCOM, to help users avoid the common pitfalls of data centre management.