Table of Contents
- Microsoft SQL Server Defined
- Microsoft SQL Server Features
- Microsoft SQL Server Administration
- Microsoft SQL Server Programming
- Performance Tuning
- Practical Applications
- Becoming a DBA
- DBA Levels
- Becoming a Data Professional
- SQL Server Professional Development Plan, Part 1
- SQL Server Professional Development Plan, Part 2
- SQL Server Professional Development Plan, Part 3
- Evaluating Technical Options
- System Sizing
- Creating a Disaster Recovery Plan
- Anatomy of a Disaster (Response Plan)
- Database Troubleshooting
- Conducting an Effective Code Review
- Developing an Exit Strategy
- Data Retention Strategy
- Keeping Your DBA/Developer Job in Troubled Times
- The SQL Server Runbook
- Creating and Maintaining a SQL Server Configuration History, Part 1
- Creating and Maintaining a SQL Server Configuration History, Part 2
- Creating an Application Profile, Part 1
- Creating an Application Profile, Part 2
- How to Attend a Technical Conference
- Tips for Maximizing Your IT Budget This Year
- The Importance of Blue-Sky Planning
- Application Architecture Assessments
- Business Intelligence
- Tips and Troubleshooting
- Additional Resources
Anatomy of a Disaster (Response Plan)
Last updated Mar 28, 2003.
- “No plan survives contact with the enemy.” Helmuth von Moltke the Elder
I’ve written here on InformIT in multiple locations about having a plan to help you recover from a disaster. Like many data professionals, I’m a bit of a stickler about planning everything out; I think that without a good plan, you’re inviting even more trouble during a catastrophe. I was also in the military, and had the value of plans drilled into me constantly.
Having those plans, and probably more importantly going through a planning process, has saved my backside more times than I can remember. Plans make sure that I thought everything through, that I had spare parts on hand, that I knew who to call about a particular part of the recovery process, and much more. They simply remove most of the uncertainty during the panic that follows a crisis, just when that uncertainty tends to build the most.
When the plan works, of course.
Because they don’t always work. Sometimes the plan leaves out one little variable, and that variable might be the most important one. The plan might not be valid for the situation, or, more likely, the situation changes and the plan doesn’t. At that point, the plan doesn’t make sense any more.
So does that mean that you shouldn’t plan? No, it doesn’t. It means that if you’re going to put your trust in the plan, it has to work. And the only way you can know that it will work is to try it out. And that’s where the difficulty comes in. You might have a great plan, and it might cover everything you can think of, but if you don’t take the time to exercise it a few times, you’re begging for trouble.
Don’t believe me? Think you’ve covered the bases? Allow me to tell you a true story about a team that learned about plans the hard way.
I was the IT manager of a small group of about eight professionals. We had a project to deploy several hundred very small networked systems across a large territory to remote offices, and those offices would not have an IT presence. We had to create a single server with mail, Office, and SQL Server on it. Up to five workstations connected to the network, and a nightly job loaded each office’s daily transactions to a mainframe computer.
The creation and distribution of this network was a thing of beauty. The network engineer on the team figured out how to get the buildings wired with a specific color of network cable for each use. The local hubs and routers had those same colors on the ports. The system administrator on the team had the small server’s ports coded with the same color of sticker so that plug-in was a snap.
We all created our parts of the server: I had the mail administrator set up mail and the system administrator create file and printer shares, and we even imaged the server drives and had the hardware vendor create an automated process that let the systems configure themselves on site by location name. It created a domain, handled the networking, and even had an image of the workstations to automatically build each one with everything the user needed.
The system would then contact the mainframe, send us an e-mail that everything went well, and set up all of its own maintenance. If a workstation ever had a problem, the user simply pressed a certain key, and the whole thing was rebuilt. Their data was saved on the server, and they continued working less than an hour later.
The team thought of everything. We had tapes and tape drives in each server. The receptionists at each site were taught to change the tapes each morning, and the servers backed up everything onto tape every night. The “recovery disk” was taped on the side, and everything was all set.
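The nightly tape routine above depended on a human noticing when a backup had not actually run. As a hedged, modern-day sketch of that idea (the team's actual NT-era tooling looked nothing like this; the site names and the 26-hour window below are invented purely for illustration), a backup-freshness check might look like:

```python
from datetime import datetime, timedelta

# Hypothetical freshness check: flag any site whose last successful
# backup is older than the allowed window. The 26-hour default gives
# a nightly job a little slack before it counts as missed.
def stale_backups(last_backup_times, now, max_age=timedelta(hours=26)):
    """Return site names whose most recent backup is too old."""
    return sorted(
        site for site, when in last_backup_times.items()
        if now - when > max_age
    )

if __name__ == "__main__":
    now = datetime(2003, 3, 28, 9, 0)
    sites = {
        "spokane": datetime(2003, 3, 28, 2, 0),  # backed up overnight
        "yakima":  datetime(2003, 3, 26, 2, 0),  # missed a night
    }
    print(stale_backups(sites, now))  # ['yakima']
```

The point of the sketch is the design choice the story leads up to: a backup you haven't verified recently is, for planning purposes, not a backup.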
One morning I announced that we needed to run a test. How confident did the team feel about a “real” disaster? They looked at each other and said “bring it on.”
I had the team “ship” a network (server, router, and workstation) to a cubicle in the building. I made them wait a week for the “shipping” to work. I asked accounting if I could borrow someone for a few days for the test, and got cooperation from that team. I asked for someone who would represent the field, and they got me a lady who had actually worked in a remote office at one point. She was very smart, but had no computer experience. It was perfect.
I asked her to follow the instructions to set up the network, the server and then the workstations. The process the team developed worked like a champ. She had the whole thing up and running in record time, without making a single call to the helpdesk, which was standing by with instructions should they need them.
Once the system was up and running, I asked the “remote office” if she would key in some information for me. I then simulated the end of the day by asking her to put the tape in the tape drive, and let her get back to her job.
We debriefed, and agreed that this part of the test was a success. I told them that at some point I would simulate a disaster. They agreed, and they went home.
After the team went home, I walked over to the server. Keep in mind, this was quite some time back, and Windows Servers were kind of a new thing. In fact, it was Windows NT 4.0, and the latest service packs hadn’t even come out yet. So you could get away with a lot of things that you can’t today, and the recovery tools weren’t nearly as robust.
I erased a critical operating system file, logged off of the server and went home.
The Plan in Meltdown
The next day the team came to work, and set about checking their e-mail and chatting. I sat in my office waiting for the call. The “remote site” came on line and she started working on the terminal.
After an hour or so, the terminal made the server hit the file I had erased, and the server bluescreened. She came to my office and said, “There’s something weird happening on the server.” I nodded. “Should I call someone?” I said, “I don’t know. Did they tell you to call someone?” She shook her head. I said, “What do you think you should do?” She said, “I’ll wait and see if it fixes itself.” I said, “OK.” It was about 10:00 by this time.
After about a half hour, she came back to my office. “It’s still not working. I think I should call someone.” I said, “OK.” She asked who she should call, and I asked her who they had told her to call. She said she wasn’t sure; the helpdesk, maybe? I told her that I thought that might work. It was now lunchtime.
The helpdesk person didn’t know we were running a test. They dispatched a local IT tech to the cubicle, and when he got there he asked what was up. She stated she was working on a test of some kind, so the tech left.
When the team came back from lunch, they wondered aloud how the remote system might be performing. I stayed silent. They decided to call and ask how things were going, so I took a seat nearby to hear the conversation. It was now afternoon. Chaos ensued.
The caller yelled for the rest of the team to join in, and they all glared at me when she told them she had asked me about the problem at 10:00. I asked them, “Did you tell her to call me? Did you tell her to call anyone?” They looked down at their feet. I said, “Well, we’ve been down a day. That’s about 5,000 dollars’ worth of billing information lost so far.”
I said “You better get started fixing this thing.”
After trying to guide the remote worker through understanding what was going on (they couldn’t see the screen, and she couldn’t describe it well) and asking dozens of questions, they realized that the system was down hard. By then it was quitting time, and I told her to go home. They protested, but I said, “Do you think someone in the field will really stay in the office to help you with this problem?” So they had to wait another day. So far: almost 24 hours of data lost, and they still didn’t realize how bad it was.
The next day she came to work, and they were ready. They asked her to put in the tape and the recovery disk, and they tried to get the system up to where they could run the tape restore. But then they ran into a problem: the rescue disk (which at the time restored only a few system files) was Windows NT RTM, while the system had SP2 applied as part of the build process. So now the hard drive (with the data she had keyed in the first day) was totally trashed.
That’s when I broke the really bad news to them. I told them I trashed the drive before the first backup had been taken. Even if they managed to get the recovery disk in place, the restore would have had nothing on it.
Finally, after three full calendar days, they “shipped” a new server to her from the factory. She had to do the entire installation again. I even made them wait a day to let the “shipping” happen. Then I called a meeting.
Believe me, the confidence level was not high. People realized they had put their faith in a plan that had not been tested. But I reminded them it was a test environment and that this was the time to make the needed changes.
What did they do based on what we learned? They had a second server ready and standing by on a shelf at HQ. The spare had the same information, process, tape drive, and everything else ready to go.
At the first sign of trouble (we practiced again), they configured the domain name and shipped the server. They sent a recovery disk (with the right files) along as well. If the problem turned out not to need the server, they simply shipped it back.
Each site got a training video about the tape backups and how to check them. The server was configured to send out a “distress call” based on a periodic ping from HQ. Health monitoring was put into place. Each server got a slipstreamed Windows NT installation set of CDs (no DVDs back then!) taped to the side, and stickers were placed on the front, back, and sides with the helpdesk number and the code to use to indicate the type of problem. We also invested in a bare-metal recovery software package when one came out for NT.
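The “distress call” and periodic ping from HQ amount to a simple liveness check, with some tolerance so a single dropped packet doesn't page anyone. As a hedged sketch only (the probe function, the three-failure threshold, and the names below are illustrative assumptions, not the team's actual NT-era implementation):

```python
# Hypothetical monitoring loop: HQ probes a remote server repeatedly
# and raises an alert only after several consecutive failures.
def watch(probe, host, failures_needed=3, max_probes=10):
    """Return the probe count at which the alert fires, or None if healthy.

    `probe(host)` is any callable returning True when the host answers;
    in real use it might wrap a ping, but that is an assumption here.
    """
    consecutive = 0
    for attempt in range(1, max_probes + 1):
        if probe(host):
            consecutive = 0  # one good answer resets the counter
        else:
            consecutive += 1
            if consecutive >= failures_needed:
                return attempt  # declare the site down
    return None

if __name__ == "__main__":
    # Simulated site: answers once, then goes silent.
    responses = iter([True, False, False, False])
    print(watch(lambda host: next(responses), "site-01"))  # 4
```

Requiring consecutive failures before alerting is the usual trade-off between catching outages quickly and not crying wolf over a flaky WAN link, which mattered even more over the dial-up-era connections the story describes.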
We had to use the new process more than once, and we learned more each time. That’s the key: learning from mistakes, and realizing that you don’t have a disaster plan, you have a disaster recovery plan.
InformIT Articles and Sample Chapters
Rate Your "Relationship" with Your Disaster Recovery Plan is a great article on putting your eggs in one basket...
Books and eBooks
Disaster Recovery Planning: Preparing for the Unthinkable, 3rd Edition should be required reading for any administrator.
A more recent option is Learning from Catastrophes: Strategies for Reaction and Response by Howard Kunreuther and Michael Useem.