Determining Your System’s High Availability Percentage

13 11 2009

Customers' statements of requirements often specify a system availability of 99.9%, 99.99%, or higher. What does that really mean? What determines these figures?

When customers say they want 99% availability, they mean the system should be available 99% of the time, with an allowance of 1% downtime. To put a value on that time, take 100 days as a simple example: the customer wants the system to be available for user access for 99 days, with an allowance of 1 day of downtime. Over a year of 365 days, the downtime allowance is 1% of 365, or 3.65 days.

The following table shows the translation of percentage availability to downtime in a year, month, and week.

| Availability % | Downtime % | Downtime per year (365 days) | Downtime per month (30-day calculation) | Downtime per week |
|---|---|---|---|---|
| 90% | 10% | 36.5 days | 72 hours | 16.8 hours |
| 95% | 5% | 18.25 days | 36 hours | 8.4 hours |
| 98% | 2% | 7.3 days | 14.4 hours | 3.36 hours |
| 99% | 1% | 3.65 days | 7.2 hours | 1.68 hours |
| 99.5% | 0.5% | 1.83 days | 3.6 hours | 50.4 minutes |
| 99.8% | 0.2% | 17.52 hours | 86.4 minutes | 20.16 minutes |
| 99.9% (three nines) | 0.1% | 8.76 hours | 43.2 minutes | 10.08 minutes |
| 99.95% | 0.05% | 4.38 hours | 21.6 minutes | 5.04 minutes |
| 99.99% (four nines) | 0.01% | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% (five nines) | 0.001% | 5.26 minutes | 25.92 seconds | 6.05 seconds |
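
The figures in the table come from one multiplication: the downtime fraction times the length of the period. Here is a minimal sketch in Python, assuming the same 365-day year, 30-day month, and 7-day week used above. The function name `allowed_downtime_hours` is only an illustration.

```python
HOURS_PER_DAY = 24


def allowed_downtime_hours(availability_percent: float, period_days: float) -> float:
    """Return the downtime allowance, in hours, for a period of the given length."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The downtime fraction is whatever is left after the availability target.
    downtime_fraction = (100 - availability_percent) / 100
    return downtime_fraction * period_days * HOURS_PER_DAY


if __name__ == "__main__":
    # Period lengths follow the table: 365-day year, 30-day month, 7-day week.
    for pct in (99, 99.9, 99.99, 99.999):
        per_year = allowed_downtime_hours(pct, 365)
        per_month = allowed_downtime_hours(pct, 30)
        per_week = allowed_downtime_hours(pct, 7)
        print(
            f"{pct}%: {per_year:.2f} hours/year, "
            f"{per_month * 60:.2f} minutes/month, "
            f"{per_week * 60:.2f} minutes/week"
        )
```

Working in hours and converting only when printing keeps the units consistent across rows, which is where hand-built tables of this kind tend to slip.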

We also need to be clear about what downtime means. Downtime is the period of time during which the server or service is unavailable. It can be the result of:

a. planned downtime: caused by maintenance, such as a server reboot after applying patches

b. unplanned downtime: caused by a service or hardware failure, such as a power outage or a network connection problem
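
Going the other way, the availability a system actually achieved can be worked out from its recorded downtime. The sketch below is a rough illustration in Python that counts both planned and unplanned downtime against the figure, as the definition above does; some agreements exclude planned maintenance, so treat that as an assumption. The function name and the sample hours are hypothetical.

```python
def measured_availability(
    period_hours: float, planned_hours: float, unplanned_hours: float
) -> float:
    """Return the achieved availability, in percent, over a period."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    # Assumption: planned and unplanned downtime both count against the figure.
    downtime_hours = planned_hours + unplanned_hours
    if downtime_hours < 0 or downtime_hours > period_hours:
        raise ValueError("downtime must be between 0 and period_hours")
    return (period_hours - downtime_hours) / period_hours * 100


if __name__ == "__main__":
    # Hypothetical 30-day month (720 hours): 2 hours of patching plus
    # 1.6 hours lost to a power outage.
    achieved = measured_availability(720, planned_hours=2, unplanned_hours=1.6)
    print(f"Achieved availability: {achieved:.2f}%")
```

With these sample numbers the total downtime is 3.6 hours, which is exactly the monthly allowance for 99.5% in the table.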

How do we determine which figure suits our system? The more critical the system, the higher the availability percentage it needs.

How do we determine how critical a system is? We have to look at the system, understand what it does, and work out the consequences when it is down: does an outage merely slow down operations, or can it create a life-and-death situation? A system whose failure can put lives at risk is highly critical and requires a high availability percentage.

| # | System | Description | Consequences of downtime | Critical level | Assigned % availability | Why? |
|---|---|---|---|---|---|---|
| 1 | Email System | A corporation runs its own email system, and corporate office hours are 9am to 5pm | Slows down operations | Low | 90% | 72 hours of downtime a month, which could fall outside office hours, is acceptable |
| 2 | Human Resource Management System | An international company has HQs in Brunei and the USA, with a 12-hour time zone difference | The human resources section cannot do its tasks | Moderate | 95% | The system is in use round the clock, but 36 hours of downtime a month is still acceptable |
| 3 | Hospital Patient Record System | The patient record is used by doctors, nurses, and pharmacists for medical checkups, follow-ups, surgery, prescriptions, etc. | Life-and-death situations | High | 99.99% | When the system is down, doctors could misdiagnose a patient, surgeons could operate wrongly, or pharmacists could dispense the wrong drugs. Any of these can become a life-and-death situation, so the system requires a high availability percentage |

How do system engineers design a system for a high availability percentage? Mostly, they eliminate every single point of failure by introducing fault tolerance, redundant equipment, clustering, and load balancing.
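
To see why redundancy raises availability, consider the standard textbook estimate for identical components running in parallel: the service is down only when every copy is down at the same time. The Python sketch below applies that estimate under strong assumptions that real systems rarely meet in full: failures are independent, any single copy can carry the whole load, and failover is instantaneous. The function name `combined_availability` is hypothetical.

```python
def combined_availability(component_percent: float, copies: int) -> float:
    """Estimate the availability, in percent, of redundant identical components.

    Assumes independent failures and that one working copy is enough.
    """
    if not 0 <= component_percent <= 100:
        raise ValueError("component_percent must be between 0 and 100")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    unavailability = 1 - component_percent / 100
    # The service is unavailable only if all copies fail together.
    return (1 - unavailability ** copies) * 100


if __name__ == "__main__":
    for copies in (1, 2, 3):
        estimate = combined_availability(99, copies)
        print(f"{copies} x 99% server(s): about {estimate:.4f}% available")
```

Under these assumptions, two 99% servers give roughly 99.99%. Shared dependencies such as the same power feed or network switch break the independence assumption, which is why engineers look for every single point of failure and not only the servers.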

Finally, consider the cost of a system with a high availability percentage. Keep in mind that the higher the availability target, the more redundancy engineers will propose, and every extra layer of redundancy pushes the price up sharply.







