
1.11 Surviving a Major Outage or Work Stoppage

  • Consider modeling your outage response on the Incident Command System (ICS). This ad hoc emergency response system has been refined over many years by public safety departments to create a flexible response to adverse situations. Defining escalation procedures before an issue arises is the best strategy.
  • Notify customers that you are aware of the problem on the communication channels they would use to contact you: intranet help desk “outages” section, outgoing message for SA phone, and so on.
  • Form a “tiger team” of SAs, management, and key stakeholders; hold a brief 15- to 30-minute meeting to establish the specific goals of a solution, such as “get developers working again,” “restore customer access to the support site,” and so on. Make sure that you are working toward a goal, not simply replicating functionality whose value is nonspecific.
  • Establish the costs of a workaround or fallback position versus downtime owing to the problem, and let the businesspeople and stakeholders determine how much time is worth spending on attempting a fix. If information is insufficient to estimate this, do not end the meeting without setting the time for the next attempt.
  • Spend no more than an hour gathering information. Then hold a team meeting to present management and key stakeholders with options. The team should update the passive notification message with the current status every hour.
  • If the team chooses fix or workaround attempts, specify the order in which fixes are to be applied, and get assistance from stakeholders in verifying whether each procedure worked. Document the results, even briefly, to prevent duplication of effort if you are still working on the issue hours or days from now.
  • Implement fix or workaround attempts in small blocks of two or three, with each block taking no more than an hour in total to implement. Collect any error messages or log data that may be relevant, and report on them in the next meeting.
  • Don’t allow a team member, even a highly skilled one, to go off to try to pull a rabbit out of his or her hat. Since you can’t predict the length of the outage, you must apply a strict process in order to keep everyone in the loop.
  • Appoint a team member who will ensure that meals are brought in, notes taken, and people gently but firmly disengaged from the problem if they become too tired or upset to work.
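The bookkeeping behind the fix-attempt steps above (an agreed order, small blocks of two or three, about an hour per block, and a brief record of what did or did not work) could be tracked with something as simple as the following sketch. It is only an illustration: the `FixAttempt` record, the `plan_blocks` and `status_report` helpers, and the per-attempt time estimates are all hypothetical and are not part of the original checklist.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass
class FixAttempt:
    """One planned fix or workaround (hypothetical record for illustration)."""
    description: str
    estimate: timedelta             # assumed: the team can roughly estimate each attempt
    worked: Optional[bool] = None   # None until a stakeholder has verified the result
    notes: str = ""                 # error messages or log excerpts to report


def plan_blocks(attempts: List[FixAttempt],
                max_per_block: int = 3,
                budget: timedelta = timedelta(hours=1)) -> List[List[FixAttempt]]:
    """Group attempts, in their agreed order, into blocks of at most
    max_per_block items whose estimates total no more than budget.
    An attempt that alone exceeds the budget still gets a block of its own."""
    blocks: List[List[FixAttempt]] = []
    current: List[FixAttempt] = []
    used = timedelta()
    for attempt in attempts:
        too_many = len(current) >= max_per_block
        too_long = used + attempt.estimate > budget
        if current and (too_many or too_long):
            blocks.append(current)
            current, used = [], timedelta()
        current.append(attempt)
        used += attempt.estimate
    if current:
        blocks.append(current)
    return blocks


def status_report(attempts: List[FixAttempt]) -> List[str]:
    """Summarize what has been tried, for the next team meeting."""
    outcomes = {None: "not yet verified", True: "worked", False: "did not work"}
    lines = []
    for a in attempts:
        line = f"{a.description}: {outcomes[a.worked]}"
        if a.notes:
            line += f" ({a.notes})"
        lines.append(line)
    return lines
```

In use, the team would list the attempts in the order agreed at the meeting, call `plan_blocks` to see which ones fit into the next hour, set `worked` and `notes` as stakeholders verify each result, and read `status_report` aloud at the next meeting. A shared text file or ticket serves the same purpose; the point is that the record exists.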