Customers often state their system availability requirement as 99.9%, 99.99%, or higher. What does that really mean? What determines these figures?
When customers say they want 99% availability, they mean the system should be available 99% of the time, with an allowance of 1% downtime. To put a value on that time, take 100 days for the sake of simple calculation: the customer wants the system to be available for user access for 99 days, with an allowance of 1 day of downtime. Over a 365-day year, that allowance comes to 3.65 days.
The following table shows the translation of percentage availability to downtime in a year, month, and week.
| Availability % | Downtime % | Downtime per year (365 days) | Downtime per month (30 days) | Downtime per week (7 days) |
| 90% | 10% | 36.5 days | 72 hours | 16.8 hours |
| 95% | 5% | 18.25 days | 36 hours | 8.4 hours |
| 98% | 2% | 7.3 days | 14.4 hours | 3.36 hours |
| 99% | 1% | 3.65 days | 7.2 hours | 1.68 hours |
| 99.5% | 0.5% | 1.83 days | 3.6 hours | 50.4 minutes |
| 99.8% | 0.2% | 17.52 hours | 86.4 minutes | 20.16 minutes |
| 99.9% (three nines) | 0.1% | 8.76 hours | 43.2 minutes | 10.08 minutes |
| 99.95% | 0.05% | 4.38 hours | 21.6 minutes | 5.04 minutes |
| 99.99% (four nines) | 0.01% | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% (five nines) | 0.001% | 5.26 minutes | 25.9 seconds | 6.05 seconds |
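The figures in the table come from simple arithmetic: multiply the downtime fraction by the length of the period. Here is a minimal Python sketch of that calculation, assuming the same period lengths as the table (365-day year, 30-day month, 7-day week). The function name `allowed_downtime` and the `PERIODS` mapping are made up for this illustration.

```python
from datetime import timedelta

# Period lengths assumed to match the table: 365-day year, 30-day month, 7-day week.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_percent: float) -> dict:
    """Return the downtime allowance per period for an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # Downtime fraction is whatever is left over after the availability share.
    downtime_fraction = (100 - availability_percent) / 100
    return {name: length * downtime_fraction for name, length in PERIODS.items()}


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        allowance = allowed_downtime(pct)
        print(pct, {name: str(value) for name, value in allowance.items()})
```

Because the percentages are floating-point numbers, the results can differ from the rounded table values by a tiny fraction of a second.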
We also need to know what is meant by downtime. Downtime is the period of time during which the server or service is unavailable. It could be the result of:
a. planned downtime – occurs due to maintenance, such as a server reboot after patch updates
b. unplanned downtime – occurs due to service or hardware failure, such as a power outage or a network connection problem
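Going the other way, the availability actually achieved over a period can be worked out from a record of outages, counting planned and unplanned downtime alike. The sketch below is only an illustration: the function name `measured_availability` and the outage data are hypothetical, and it assumes the outages do not overlap and fall entirely inside the period.

```python
from datetime import datetime, timedelta


def measured_availability(period_start, period_end, outages):
    """Return the percentage of the period during which the service was up.

    outages is a list of (start, end) datetime pairs, assumed not to overlap
    and to lie inside the period.
    """
    total = period_end - period_start
    if total <= timedelta(0):
        raise ValueError("period_end must be after period_start")
    # Add up planned and unplanned downtime alike.
    down = sum((end - start for start, end in outages), timedelta(0))
    return 100 * ((total - down) / total)


if __name__ == "__main__":
    # Hypothetical 30-day month with one planned and one unplanned outage.
    outages = [
        (datetime(2024, 6, 8, 22, 0), datetime(2024, 6, 9, 0, 0)),     # planned: 2 hours
        (datetime(2024, 6, 20, 10, 0), datetime(2024, 6, 20, 15, 12)),  # unplanned: 5.2 hours
    ]
    pct = measured_availability(datetime(2024, 6, 1), datetime(2024, 7, 1), outages)
    print(f"{pct:.2f}% available")
```

In this made-up month the two outages add up to 7.2 hours, which is exactly the monthly allowance for 99% availability.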
How do we determine which figure is suited to our system? The more critical the system is, the higher the percentage availability should be.
How do we determine the system's critical level? We have to look at the system, understand what it does, and consider the consequences when it is down: whether an outage merely slows down operations or can cause a life-and-death situation. When an outage can cause a life-and-death situation, the system has a high critical level, which requires a high availability percentage.
| # | System | Description | Consequences of downtime | Critical level | Assigned % availability | Why? |
| 1. | Email System | a corporation has its own email system, and the corporate office hours are 9am – 5pm | slows down operations | low | 90% | 72 hours of downtime a month, which could fall outside office hours, is acceptable |
| 2. | Human Resource Management System | an international company has HQs in Brunei and the USA, which are 12 hours apart in time zone | the human resource section will not be able to do its tasks | moderate | 95% | the system is in use round the clock, but 36 hours of downtime a month is still quite acceptable |
| 3. | Hospital Patient Record System | the patient records are used by doctors, nurses, and pharmacists for medical checkups, follow-ups, surgery, prescriptions, etc. | life-and-death situations | high | 99.99% | when the system is down, doctors could diagnose a patient wrongly, surgeons could operate wrongly, or pharmacists could dispense the wrong drugs. All of these can cause a life-and-death situation, hence the system requires a high percentage availability |
How do system engineers design a system for high percentage availability? Mostly, engineers eliminate every single point of failure by introducing fault tolerance, redundant equipment, clustering, and load balancing.
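To see why redundancy raises availability, here is a rough sketch using the standard series and parallel availability formulas. It assumes components fail independently of each other, which real systems only approximate, and the component figures (99% servers, a 99.9% load balancer) are made up for the example. The function names `series` and `parallel` are also just illustrative.

```python
from math import prod


def series(*availabilities):
    """Availability of a chain where every component must be up (fractions 0..1)."""
    return prod(availabilities)


def parallel(*availabilities):
    """Availability of redundant components where at least one must be up.

    Assumes the components fail independently of each other.
    """
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    one_server = 0.99                                # hypothetical single server
    two_servers = parallel(one_server, one_server)   # redundant pair
    with_balancer = series(0.999, two_servers)       # pair behind one load balancer
    print(f"one server:         {one_server:.4%}")
    print(f"two servers:        {two_servers:.4%}")
    print(f"plus load balancer: {with_balancer:.4%}")
```

Under these assumptions the redundant pair moves from two nines to four nines, but a single load balancer in front of it caps the whole chain just below its own 99.9%. That is why the load balancer itself also has to be made redundant.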
High percentage availability also comes at a cost. We have to keep in mind that the higher the availability target, the more redundancy engineers will propose, which drives the price up sharply.