
The Practice of Cloud System Administration: Operations in a Distributed World

By Thomas A. Limoncelli, Strata R. Chalup, Christina J. Hogan
Sep 22, 2014


This chapter is from the book The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2.

Learn More Buy

7.2 Service Life Cycle

Operations is responsible for the entire service life cycle: launch, maintenance (both regular and emergency), upgrades, and decommissioning. Each phase has unique requirements, so you’ll need a strategy for managing each phase differently.

The stages of the life cycle are:

Service Launch: Launching a service the first time. The service is brought to life, initial customers use it, and problems that were not discovered prior to the launch are discovered and remedied. (Section 7.2.1)
Emergency Tasks: Handling exceptional or unexpected events. This includes handling outages and, more importantly, detecting and fixing conditions that precipitate outages. (Chapter 14)
Nonemergency Tasks: Performing all manual work required as part of the normally functioning system. This may include periodic (weekly or monthly) maintenance tasks (for example, preparation for monthly billing events) as well as processing requests from users (for example, requests to enable the service for use by another internal service or team). (Section 7.3)
Upgrades: Deploying new software releases and hardware platforms. The better we can do this, the more aggressively the company can try new things and innovate. Each new software release is built and tested before deployment. Tests include system tests, done by developers, as well as user acceptance tests (UAT), done by operations. UAT might include tests to verify there are no performance regressions (unexpected declines in performance). Vulnerability assessments are done to detect security issues. New hardware must go through a hardware qualification to test for compatibility, performance regressions, and any changes in operational processes. (Section 10.2)
Decommissioning: Turning off a service. It is the opposite of a service launch: removing the remaining users, turning off the service, removing references to the service from any related service configurations, giving back any resources, archiving old data, and erasing or scrubbing data from any hardware before it is repurposed, sold, or disposed of. (Section 7.2.2)
Project Work: Performing tasks large enough to require the allocation of dedicated resources and planning. While not directly part of the service life cycle, along the way tasks will arise that are larger than others. Examples include fixing a repeating but intermittent failure, working with stakeholders on roadmaps and plans for the product’s future, moving the service to a new datacenter, and scaling the service in new ways. (Section 7.3)

Most of the life-cycle stages listed here are covered in detail elsewhere in this book. Service launches and decommissioning are covered in detail next.

7.2.1 Service Launches

Nothing is more embarrassing than the failed public launch of a new service. Often we see a new service launch that is so successful that it receives too much traffic, becomes overloaded, and goes down. This is ironic but not funny.

Each time we launch a new service, we learn something new. If we launch new services rarely, then remembering those lessons until the next launch is difficult. Therefore, if launches are rare, we should maintain a checklist of things to do and record the things we should remember to do next time. As the checklist grows with each launch, we become better at launching services.

If we launch new services frequently, then there are probably many people doing the launches. Some will be less experienced than others. In this case we should maintain a checklist to share our experience. Every addition increases our organizational memory, the collection of knowledge within our organization, thereby making the organization smarter.

A common problem is that other teams may not realize that planning a launch requires effort. They may not allocate time for this effort and surprise operations teams at or near the launch date. These teams are unaware of all the potential pitfalls and problems that the checklist is intended to prevent. For this reason the launch checklist should be something mentioned frequently in documentation, socialized among product managers, and made easy to access. The best-case scenario occurs when a service team comes to operations wishing to launch something and has been using the checklist as a guide throughout development. Such a team has “done their homework”; they have been working on the items in the checklist in parallel as the product was being developed. This does not happen by accident; the checklist must be available, be advertised, and become part of the company culture.

A simple strategy is to create a checklist of actions that need to be completed prior to launch. A more sophisticated strategy is for the checklist to be a series of questions that are audited by a Launch Readiness Engineer (LRE) or a Launch Committee.

Here is a sample launch readiness review checklist:

Sample Launch Readiness Review Survey

The purpose of this document is to gather information to be evaluated by a Launch Readiness Engineer (LRE) when approving the launch of a new service. Please complete the survey prior to meeting with your LRE.

General Launch Information:

– What is the service name?

– When is the launch date/time?

– Is this a soft or hard launch?

Architecture:

– Describe the system architecture. Link to architecture documents if possible.

– How does the failover work in the event of single-machine, rack, and datacenter failure?

– How is the system designed to scale under normal conditions?

Capacity:

– What is the expected initial volume of users and QPS?

– How was this number arrived at? (Link to load tests and reports.)

– What is expected to happen if the initial volume is 2× expected? 5×? (Link to emergency capacity documents.)

– What is the expected external (internet) bandwidth usage?

– What are the requirements for network and storage after 1, 3, and 12 months? (Link to confirmation documents from the network and storage teams’ capacity planners.)

Dependencies:

– Which systems does this depend on? (Link to dependency/data flow diagram.)

– Which RPC limits are in place with these dependencies? (Link to limits and confirmation from external groups they can handle the traffic.)

– What will happen if these RPC limits are exceeded?

– For each dependency, list the ticket number where this new service’s use of the dependency (and QPS rate) was requested and positively acknowledged.

Monitoring:

– Are all subsystems monitored? Describe the monitoring strategy and document what is monitored.

– Does a dashboard exist for all major subsystems?

– Do metrics dashboards exist? Are they in business, not technical, terms?

– Was the number of “false alarm” alerts in the last month less than x?

– Is the number of alerts received in a typical week less than x?

Documentation:

– Does a playbook exist and include entries for all operational tasks and alerts?

– Has an LRE reviewed each entry for accuracy and completeness?

– Is the number of open documentation-related bugs less than x?

Oncall:

– Is the oncall schedule complete for the next n months?

– Is the oncall schedule arranged such that each shift is likely to get fewer than x alerts?

Disaster Preparedness:

– What is the plan in case first-day usage is 10 times greater than expected?

– Do backups work and have restores been tested?

Operational Hygiene:

– Are “spammy alerts” adjusted or corrected in a timely manner?

– Are bugs filed to raise visibility of issues—even minor annoyances or issues with commonly known workarounds?

– Do stability-related bugs take priority over new features?

– Is a system in place to ensure that the number of open bugs is kept low?

Approvals:

– Has marketing approved all logos, verbiage, and URL formats?

– Has the security team audited and approved the service?

– Has a privacy audit been completed and all issues remediated?
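
One way to make a survey like this auditable is to store each question as data and have a script report which items still block approval. The following is a minimal Python sketch of that idea, not a tool described in the book; the names `ChecklistItem` and `blocking_items`, the field names, and the sample answers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChecklistItem:
    """One survey question; all field names are illustrative assumptions."""
    category: str
    question: str
    answer: Optional[str] = None   # the team's response, or None if unanswered
    approved: bool = False         # set by the reviewer (for example, the LRE)


def blocking_items(items):
    """Return items that are unanswered or not yet approved by the reviewer."""
    return [item for item in items if item.answer is None or not item.approved]


# Hypothetical sample data: one approved item, one unanswered, one unapproved.
survey = [
    ChecklistItem("General", "What is the service name?", "example-service", True),
    ChecklistItem("Capacity", "What is the expected initial volume of users and QPS?"),
    ChecklistItem("Oncall", "Is the oncall schedule complete for the next n months?",
                  "Yes", False),
]

for item in blocking_items(survey):
    status = "unanswered" if item.answer is None else "awaiting approval"
    print(f"[{item.category}] {item.question} ({status})")
```

Keeping the answer and the approval as separate fields mirrors the two roles in the review: the service team fills in the survey, and the reviewer decides whether each answer is acceptable.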

Because a launch is complex, with many moving parts, we recommend that a single person (the launch lead) take a leadership or coordinator role. If the developer and operations teams are very separate, one person from each might be selected to represent each team.

The launch lead then works through the checklist, delegating work, filing bugs for any omissions, and tracking all issues until launch is approved and executed. The launch lead may also be responsible for coordinating post-launch problem resolution.

Case Study: Self-Service Launches at Google

Google launches so many services that it needed a way to make the launch process streamlined and able to be initiated independently by a team. In addition to providing APIs and portals for the technical parts, the Launch Readiness Review (LRR) made the launch process itself self-service.

The LRR included a checklist and instructions on how to achieve each item. An SRE was assigned to shepherd the team through the process and hold them to some very high standards.

Some checklist items were technical, such as making sure that the Google load balancing system was used properly. Other items were cautionary, to prevent a launch team from repeating other teams’ past mistakes. For example, one team had a failed launch because it received 10 times more users than expected. There was no plan for how to handle this situation. The LRR checklist required teams to create a plan to handle this situation and demonstrate that it had been tested ahead of time.
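
A plan for this kind of surge usually starts with simple arithmetic: how many machines would each multiple of the expected load require? The sketch below is an illustration under stated assumptions, not the method Google used; the function `machines_needed`, its parameters, and all of the figures are hypothetical, and a real plan would take the per-machine capacity from load tests.

```python
import math


def machines_needed(expected_qps, multiplier, qps_per_machine, spare=1):
    """Machines required to serve expected_qps * multiplier, plus spares.

    All parameters are illustrative assumptions: qps_per_machine would come
    from load tests, and spare is extra machines kept for failover.
    """
    peak_qps = expected_qps * multiplier
    return math.ceil(peak_qps / qps_per_machine) + spare


expected_qps = 2000      # hypothetical launch-day estimate
qps_per_machine = 450    # hypothetical load-test result

for multiplier in (1, 2, 5, 10):
    count = machines_needed(expected_qps, multiplier, qps_per_machine)
    print(f"{multiplier:>2}x expected load: {count} machines")
```

Comparing each result with the capacity that is actually provisioned, or that can be obtained on short notice, shows at which multiple the plan stops working.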

Other checklist items were business related. Marketing, legal, and other departments were required to sign off on the launch. Each department had its own checklist. The SRE team made the service visible externally only after verifying that all of those sign-offs were complete.

7.2.2 Service Decommissioning

Decommissioning (or just “decomm”), or turning off a service, involves three major phases: removal of users, deallocation of resources, and disposal of resources.

Removing users is often a product management task. Usually it involves making the users aware that they must move. Sometimes it is a technical issue of moving them to another service. User data may need to be moved or archived.

Resource deallocation can cover many aspects. There may be DNS entries to be removed, machines to power off, database connections to be disabled, and so on. Usually there are complex dependencies involved. Often nothing can begin until the last user is off the service; certain resources cannot be deallocated before others, and so on. For example, typically a DNS entry is not removed until the machine is no longer in use. Network connections must remain in place if deallocating other services depends on network connectivity.
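
Ordering constraints like these form a dependency graph, and a topological sort produces a sequence in which no task runs before its prerequisites. The following is a minimal sketch using `graphlib` from the Python standard library (Python 3.9 or later); the task names and the dependencies between them are hypothetical examples, not a prescribed procedure.

```python
from graphlib import TopologicalSorter

# Each key is a decommissioning task; its value is the set of tasks that must
# finish first. The task names and dependencies are hypothetical examples.
must_finish_first = {
    "migrate_last_user": set(),
    "disable_database_connections": {"migrate_last_user"},
    "power_off_machines": {"disable_database_connections"},
    "remove_dns_entries": {"power_off_machines"},
    "remove_network_connections": {"power_off_machines", "remove_dns_entries"},
}

# static_order() yields each task only after all of its prerequisites and
# raises graphlib.CycleError if the dependencies are circular.
order = list(TopologicalSorter(must_finish_first).static_order())

for step, task in enumerate(order, start=1):
    print(f"{step}. {task}")
```

Recording the dependencies as data rather than as a fixed sequence means that a new task can be added with only its own prerequisites, and the order is recomputed.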

Resource disposal includes securely erasing disks and other media and disposing of all hardware. The hardware may be repurposed, sold, or scrapped.

If decommissioning is done incorrectly or items are missed, resources will remain allocated. A checklist that is added to over time will help ensure that decommissioning is done completely and that the tasks are done in the right order.
