1.11 Surviving a Major Outage or Work Stoppage
- Consider modeling your outage response on the Incident Command System (ICS). This ad hoc emergency response system has been
refined over many years by public safety departments to create a flexible response to adverse situations. Defining escalation
procedures before an issue arises is the best strategy.
- Notify customers that you are aware of the problem on the communication channels they would use to contact you: intranet help
desk “outages” section, outgoing message for SA phone, and so on.
- Form a “tiger team” of SAs, management, and key stakeholders; have a brief 15- to 30-minute meeting to establish the specific
goals of a solution, such as “get developers working again,” “restore customer access to support site,” and so on. Make sure
that you are working toward a goal, not simply replicating functionality whose value is non-specific.
- Establish the costs of a workaround or fallback position versus downtime owing to the problem, and let the businesspeople
and stakeholders determine how much time is worth spending on attempting a fix. If information is insufficient to estimate
this, do not end the meeting without setting the time for the next attempt.
- Spend no more than an hour gathering information. Then hold a team meeting to present management and key stakeholders with
options. The team should update the passive notification message with the current status every hour.
- If the team chooses fix or workaround attempts, specify an order in which fixes are to be applied, and get assistance from
stakeholders on verifying that each procedure did or did not work. Document this, even in brief, to prevent duplication
of effort if you are still working on the issue hours or days from now.
- Implement fix or workaround attempts in small blocks of two or three that take no more than an hour in total to implement.
Collect any error messages or log data that may be relevant, and report on them in the next meeting.
- Don’t allow a team member, even a highly skilled one, to go off to try to pull a rabbit out of his or her hat. Since you can’t
predict the length of the outage, you must apply a strict process in order to keep everyone in the loop.
- Appoint a team member who will ensure that meals are brought in, notes taken, and people gently but firmly disengaged from
the problem if they become too tired or upset to work.
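
The comparison of workaround cost versus downtime cost described above can be put in front of stakeholders as simple arithmetic. The following is only a minimal sketch in Python: the `Option` class, its field names, and every figure in it are hypothetical illustrations rather than anything prescribed by the procedure, and it assumes the team can estimate a single hourly cost of downtime.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One response option the team is weighing. All figures are estimates."""
    name: str
    fixed_cost: float          # one-time cost of putting the option in place
    hours_to_implement: float  # hours the outage continues while it is applied


def total_cost(option: Option, downtime_cost_per_hour: float) -> float:
    """Fixed cost plus the downtime incurred while the option is implemented."""
    return option.fixed_cost + option.hours_to_implement * downtime_cost_per_hour


def rank_options(options, downtime_cost_per_hour):
    """Return the options sorted from lowest to highest estimated total cost."""
    return sorted(options, key=lambda o: total_cost(o, downtime_cost_per_hour))


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    hourly_downtime_cost = 1500.0
    options = [
        Option("keep attempting a fix", fixed_cost=0.0, hours_to_implement=6.0),
        Option("fall back to old server", fixed_cost=2000.0, hours_to_implement=1.0),
    ]
    for opt in rank_options(options, hourly_downtime_cost):
        print(f"{opt.name}: {total_cost(opt, hourly_downtime_cost):.0f}")
```

A ranking like this does not make the decision. It gives the businesspeople and stakeholders a common set of numbers to argue about, and the estimates should be revised at each meeting as more information comes in.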