During the mid- and late 1990s, I managed the main IT infrastructure for a major motion picture studio in Beverly Hills, California. An event just prior to my hiring drastically changed the corporation's thinking about disaster recovery, which led the company to ask me to develop a disaster-recovery program of major proportions.
Two of this studio's most critical applications were just coming online and were being run on IBM AS/400 midrange processors. One of the applications involved the scheduling of broadcast times for programs and commercials for the company's new premier cable television channel. The other application managed the production, distribution, and accounting of domestic entertainment videos, laser discs, and interactive games. The company had recently migrated the development and production versions of these applications onto two more advanced models of the IBM AS/400: 9406-level machines utilizing reduced instruction set computing (RISC) technology.
During the development of these applications, initial discussions began about developing a disaster-recovery plan for these AS/400s and their critical applications. In February 1995, the effort was given a major jumpstart from an unlikely source. A distribution transformer that powered the AS/400 computer room from outside the building short-circuited and exploded. The damage was so extensive that repairs were estimated to take up to five days. With no formal recovery plan yet in place, IT personnel, suppliers, and customers all scurried to minimize the impact of the outage.
A makeshift disaster-recovery site located 40 miles away was quickly identified and activated with the help of one of the company's key vendors. Within 24 hours, the studio's AS/400 operating systems, application software, and databases were all restored and operational. Most of the critical needs of the AS/400 customers were met during the six days that it eventually took to replace the failed transformer.
Three Important Lessons Learned
This incident accelerated the development of a formal disaster-recovery plan and underscored three important points about recovering from a disaster:
There are noteworthy differences between the concept of disaster recovery and that of business resumption. Business resumption is defined here to mean that critical department processes can be performed as soon as possible after the initial outage. The full recovery from the disaster usually occurs many days after the business resumption process has been activated.
In this case, the majority of company operations affected by the outage were restored in less than a day after the transformer exploded. It took nearly four days to replace all the damaged electrical equipment and another two days to restore operations to their normal state. Distinguishing between these two concepts helped during the planning process for the formal disaster-recovery program: it enabled a focus on business resumption in meetings with key customers, while the focus with key suppliers could be on disaster recovery.
The majority of disasters most likely to cause lengthy outages to computer centers are relatively small, localized incidents such as broken water mains, fires, smoke damage, or electrical equipment failures. They typically are not the flash floods, powerful hurricanes, or devastating earthquakes frequently highlighted in the media.
This is not to say that we should not be prepared for such a major disaster. Infrastructures that plan and test recovery strategies for smaller incidents are usually well on their way to having a program to handle a calamity of any size. While major calamities do occur, they are far less likely, and the resulting computer center outage is often overshadowed by the more widespread effects of the disaster on the community. What usually makes a localized computer center disaster so challenging is that the rest of the company is normally operational and desperately in need of the computer center services that have been disrupted.
A firm commitment is needed from executive management to proceed with a formal disaster-recovery plan. In many ways, disaster recovery is like an insurance policy: You don't really need it until you really need it. This commitment became the first important step toward developing an effective disaster-recovery process. A comprehensive program requires hardware, software, budget, and the time and effort of knowledgeable personnel. The support of executive management is necessary to make these resources available.