Wednesday, October 21, 2020

DR in Real World Incidents

When it comes to computer systems, Disaster Recovery is an important concept, but most of the time we neglect its most important principles, and that neglect can lead to a lot of damage. The idea of this blog post is to look at the mistakes made in real-world incidents and draw lessons from them that can be applied to computer systems.

Titanic (1912), what a romantic story. However, it is not that romantic when you consider Disaster Recovery (DR) concepts. When the Titanic was designed and built, it was boasted to be practically unsinkable. With that mindset, the ship was not given enough lifeboats, and this caused many deaths in the incident. It carried only 20 lifeboats, with room for about 1,178 people, while roughly 2,200 people were on board. This is a basic mistake when it comes to disaster recovery. When you are designing a DR system, the primary consideration should be the scale of the impact, NOT the probability of the event. Even though the probability of the ship sinking was small, DR has to be planned around the impact if the event does occur. For IT systems, we need to follow the same line of thought.
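To make the impact-versus-probability point concrete for an IT system, here is a minimal Python sketch. The scenario names, probabilities and downtime figures are all made-up assumptions for illustration, not measurements. It only shows that ranking by impact puts the rare but catastrophic event at the top of the DR list, while ranking by probability pushes it to the bottom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    annual_probability: float  # hypothetical estimate between 0 and 1
    impact_hours_down: float   # hypothetical downtime if the event happens


def rank_by_impact(scenarios):
    """Order scenarios by impact alone, ignoring how likely they are."""
    return sorted(scenarios, key=lambda s: s.impact_hours_down, reverse=True)


def rank_by_probability(scenarios):
    """Order scenarios by likelihood alone (the 'unsinkable ship' mindset)."""
    return sorted(scenarios, key=lambda s: s.annual_probability, reverse=True)


# Illustrative values only; replace with your own estimates.
SCENARIOS = [
    Scenario("single disk failure", 0.30, 2),
    Scenario("application bug rollback", 0.50, 1),
    Scenario("whole data centre lost", 0.001, 720),
]

if __name__ == "__main__":
    print("By impact:     ", [s.name for s in rank_by_impact(SCENARIOS)])
    print("By probability:", [s.name for s in rank_by_probability(SCENARIOS)])
```

In this toy data the loss of the whole data centre is the least likely event, yet it is the one the DR plan has to be sized for.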


Next, let us look at a fairly recent incident: the Fukushima Daiichi nuclear disaster of March 2011. The Fukushima Daiichi Nuclear Power Plant comprised six separate boiling water reactors, of which three were operating on the day the earthquake occurred. As soon as the earthquake struck, those reactors shut down automatically. Because the reactors could then no longer generate power to run their own coolant pumps, emergency diesel generators came online as a DR mechanism to power the electronics and coolant systems. However, the tsunami that followed the earthquake flooded the diesel generators and shut them down. The design included one more DR option, DC power from batteries. DC power was lost on Units 1 and 2 due to flooding, while some DC power from batteries remained available on Unit 3. Once all the DR options had failed, the inevitable disaster occurred. A tsunami wall has since been built in the area as yet another DR option. The lesson from this incident is that DR work is never finished; there is always room for improvement.

The third incident comes from Bhopal, India. In 1969, Union Carbide, a US company, started its factory in Bhopal. At the time of opening, the factory was equipped with the latest technology and included many DR mechanisms, such as extra storage for methyl isocyanate (MIC), a vent gas scrubber to detoxify the gas, and a flare tower to burn the gas. However, over time and because of cost-cutting, these DR mechanisms were not maintained, and eventually many of the monitoring systems stopped working, to the point that operators no longer paid attention to the meter readings. All of this cost the lives of thousands of people in December 1984 and afterwards (more than 25,000 by some estimates). So the Bhopal tragedy gives us a few lessons. We should not compromise DR in the name of cost-cutting, we need to maintain DR systems, and, importantly, we need to rehearse our DR procedures.
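For computer systems, one simple way to keep DR mechanisms from quietly decaying is to track when each one was last rehearsed and flag the ones that are overdue. The sketch below is only an illustration: the mechanism names, the dates and the 180-day threshold are assumptions, and a real setup would pull this data from wherever drills are actually recorded.

```python
from datetime import date, timedelta


def overdue_drills(last_drill_dates, today, max_age_days=180):
    """Return the names of DR mechanisms not rehearsed within max_age_days.

    A value of None means the mechanism has never been rehearsed at all.
    """
    limit = timedelta(days=max_age_days)
    return sorted(
        name
        for name, last in last_drill_dates.items()
        if last is None or today - last > limit
    )


# Hypothetical drill log, for illustration only.
DRILL_LOG = {
    "restore from backup": date(2020, 9, 1),
    "failover to standby site": date(2019, 1, 15),
    "monitoring alert check": None,
}

if __name__ == "__main__":
    for name in overdue_drills(DRILL_LOG, today=date(2020, 10, 21)):
        print("Rehearsal overdue:", name)
```

A mechanism that has never been rehearsed is treated as overdue, since an untested DR option is the kind that fails when it is finally needed.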

In November 2000, in Kaprun, Austria, 155 people died in a fire on a funicular train. During the incident, passengers had no way to alert anyone outside. Furthermore, there were no smoke detectors or similar safeguards to detect and fight the fire. When the authorities were questioned after the incident, they gave an interesting reason: they had not taken any DR precautions because there had been no previous incidents. For DR, history does not matter. A new incident can always occur.

Another incident that ended in disaster is the collapse of the New World Hotel in Singapore. After the building was completed, many changes were made to it: new air-conditioning plants and new machinery were introduced. Despite these changes, nobody went back and reviewed the design. In the case of computer systems, we all know that systems evolve. As the system's functionality changes, your DR systems should be modified accordingly.
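One way to notice that a system has outgrown its DR design is to compare the components running today against the components the DR plan actually covers. The following Python sketch does that with plain sets. The component names are invented for illustration, and in practice both lists would come from your own inventory and DR documentation.

```python
def dr_plan_gaps(current_components, dr_plan_components):
    """Compare the live system against the DR plan.

    Returns a pair of sorted lists:
      - components in the system that the DR plan does not cover
      - DR plan entries for components that no longer exist
    """
    current = set(current_components)
    planned = set(dr_plan_components)
    return sorted(current - planned), sorted(planned - current)


# Hypothetical inventory and plan, for illustration only.
CURRENT_SYSTEM = ["web server", "database", "message queue", "search index"]
DR_PLAN = ["web server", "database", "ftp server"]

if __name__ == "__main__":
    uncovered, stale = dr_plan_gaps(CURRENT_SYSTEM, DR_PLAN)
    print("Not covered by DR plan:", uncovered)
    print("Stale DR plan entries: ", stale)
```

Running a check like this after every significant change is a small step towards keeping the DR design in line with the system as it really is.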

These real-world incidents resulted in great loss of life. Though failures in computer systems may not cost that many lives, it is essential to put more focus on DR in order to maintain better systems.
