The Practice of Cloud System Administration: Operations in a Distributed World

This chapter from The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2 starts with some operations management background, then discusses the operations service life cycle, and ends with a discussion of typical operations work strategies.
"The rate at which organizations learn may soon become the only sustainable source of competitive advantage."
Peter Senge

Part I of this book discussed how to build distributed systems. Now we discuss how to run such systems.

The work done to keep a system running is called operations. More specifically, operations is the work done to keep a system running in a way that meets or exceeds operating parameters specified by a service level agreement (SLA). Operations includes all aspects of a service’s life cycle: from initial launch to the final decommissioning and everything in between.

Operational work tends to focus on availability, speed and performance, security, capacity planning, and software/hardware upgrades. The failure to do any of these well results in a system that is unreliable. If a service is slow, users will assume it is broken. If a system is insecure, outsiders can take it down. Without proper capacity planning, it will become overloaded and fail. Upgrades, done badly, result in downtime. If upgrades aren’t done at all, bugs will go unfixed. Because all of these activities ultimately affect the reliability of the system, Google calls its operations team Site Reliability Engineering (SRE). Many companies have followed suit.

Operations is a team sport. Operations is not done by a single person but rather by a team of people working together. For that reason much of what we describe will be processes and policies that help you work as a team, not as a group of individuals. In some companies, processes seem to be bureaucratic mazes that slow things down. As we describe here—and more important, in our professional experience—good processes are exactly what makes it possible to run very large computing systems. In other words, process is what makes it possible for teams to do the right thing, again and again.

This chapter starts with some operations management background, then discusses the operations service life cycle, and ends with a discussion of typical operations work strategies. All of these topics will be expanded upon in the chapters that follow.

7.1 Distributed Systems Operations

To understand distributed systems operations, one must first understand how it is different from typical enterprise IT. One must also understand the source of tension between operations and developers, and basic techniques for scaling operations.

7.1.1 SRE versus Traditional Enterprise IT

System administration is a continuum. On one end is a typical IT department, responsible for traditional desktop and client–server computing infrastructure, often called enterprise IT. On the other end is an SRE or similar team responsible for a distributed computing environment, often associated with web sites and other services. While this may be a broad generalization, it serves to illustrate some important differences.

SRE is different from an enterprise IT department because SREs tend to be focused on providing a single service or a well-defined set of services. A traditional enterprise IT department tends to have broad responsibility for desktop services, back-office services, and everything in between (“everything with a power plug”). SRE’s customers tend to be the product management of the service while IT customers are the end users themselves. This means SRE efforts are focused on a few select business metrics rather than being pulled in many directions by users, each of whom has his or her own priorities.

Another difference is in the attitude toward uptime. SREs maintain services that have demanding, 24 × 7 uptime requirements. This creates a focus on preventing problems rather than reacting to outages, and on performing complex but non-intrusive maintenance procedures. IT tends to be granted flexibility with respect to scheduling downtime and has SLAs that focus on how quickly service can be restored in the event of an outage. In the SRE view, downtime is something to be avoided and service should not stop while services are undergoing maintenance.

SREs tend to manage services that are constantly changing due to new software releases and additions to capacity. IT tends to run services that are upgraded rarely. Often IT services are built by external contractors who go away once the system is stable.

SREs maintain systems that are constantly being scaled to handle more traffic and larger workloads. Latency, or how long a particular request takes to process, is managed as well as overall throughput. Efficiency becomes a concern because a little waste per machine becomes a big waste when there are hundreds or thousands of machines. In IT, systems are often built for environments that expect a modest increase in workload per year. In this case a workable strategy is to build the system large enough to handle the projected workload for the next few years, when the system is expected to be replaced.

As a result of these requirements, systems in SRE tend to be bespoke systems, built on platforms that are home-grown or integrated from open source or other third-party components. They are not “off the shelf” or turnkey systems. They are actively managed, while IT systems may be unchanged from their initial delivery state. Because of these differences, distributed computing services are best managed by a separate team, with separate management, with bespoke operational and management practices.

While there are many such differences, recently IT departments have begun to see a demand for uptime and scalability similar to that seen in SRE environments. Therefore the management techniques from distributed computing are rapidly being adopted in the enterprise.

7.1.2 Change versus Stability

There is a tension between the desire for stability and the desire for change. Operations teams tend to favor stability; developers desire change. Consider how each group is evaluated during end-of-the-year performance reviews. A developer is praised for writing code that makes it into production. Changes that result in a tangible difference to the service are rewarded above any other accomplishment. Therefore, developers want new releases pushed into production often. Operations, in contrast, is rewarded for achieving compliance with SLAs, most of which relate to uptime. Therefore stability is the priority.

A system starts at a baseline of stability. A change is then made. All changes have some kind of a destabilizing effect. Eventually the system becomes stable again, usually through some kind of intervention. This is called the change-instability cycle.

All software roll-outs affect stability. A change may introduce bugs, which are fixed through workarounds and new software releases. A release that introduces no new bugs still creates a destabilizing effect due to the process of shifting workloads away from machines about to be upgraded. Non-software changes also have a destabilizing effect. A network change may make the local network less stable while the change propagates throughout the network.

Because of the tension between the operational desire for stability and the developer desire for change, there must be mechanisms to reach a balance.

One strategy is to prioritize work that improves stability over work that adds new features. For example, bug fixes would have a higher priority than feature requests. With this approach, a major release introduces many new features, the next few releases focus on fixing bugs, and then a new major release starts the cycle over again. If engineering management is pressured to focus on new features and neglect bug fixes, the result is a system that slowly destabilizes until it spins out of control.

Another strategy is to align the goals of developers and operational staff. Both parties become responsible for SLA compliance as well as the velocity (rate of change) of the system. Both have a component of their annual review that is tied to SLA compliance and both have a portion tied to the on-time delivery of new features.

Organizations that have been the most successful at aligning goals like this have restructured themselves so that developers and operations work as one team. This is the premise of the DevOps movement, which will be described in Chapter 8.

Another strategy is to budget time for stability improvements and time for new features. Software engineering organizations usually have a way to estimate the size of a software request or the amount of time it is expected to take to complete. Each new release has a certain size or time budget; within that budget a certain amount of stability-improvement work is allocated. The case study at the end of Section 2.2.2 is an example of this approach. Similarly, this allocation can be achieved by assigning dedicated people to stability-related code changes.

The budget can also be based on an SLA. A certain amount of instability is expected each month, which is considered a budget. Each roll-out uses some of the budget, as do instability-related bugs. Developers can maximize the number of roll-outs that can be done each month by dedicating effort to improve the code that causes this instability. This creates a positive feedback loop. An example of this is Google’s Error Budgets, which are more fully explained in Section 19.4.
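The arithmetic behind an SLA-based budget can be illustrated with a minimal sketch. It assumes the budget is measured in minutes of downtime per period and that each roll-out has an estimated cost in the same unit; the class name, the fields, and the 99.9 percent target are hypothetical choices for illustration, not the mechanism Google uses.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Instability allowance derived from an availability target (hypothetical model)."""
    slo_target: float      # e.g. 0.999 means 99.9 percent availability
    period_minutes: float  # length of the budget period, in minutes

    def total_minutes(self) -> float:
        # The budget is whatever downtime the target leaves over.
        return (1.0 - self.slo_target) * self.period_minutes

    def remaining_minutes(self, spent_minutes: float) -> float:
        # Roll-outs and instability-related bugs both draw on the same budget.
        return self.total_minutes() - spent_minutes

    def may_launch(self, spent_minutes: float, expected_cost_minutes: float) -> bool:
        # Gate a roll-out: allow it only if its expected cost fits in what is left.
        return expected_cost_minutes <= self.remaining_minutes(spent_minutes)


# A 30-day month with an assumed 99.9 percent target.
budget = ErrorBudget(slo_target=0.999, period_minutes=30 * 24 * 60)
print(round(budget.total_minutes(), 1))
print(budget.may_launch(spent_minutes=30.0, expected_cost_minutes=10.0))
```

Under this model, reducing the instability each roll-out causes lowers its expected cost, so more roll-outs fit in the same monthly budget.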

7.1.3 Defining SRE

The core practices of SRE were refined for more than 10 years at Google before being enumerated in public. In his keynote address at the first USENIX SREcon, Benjamin Treynor Sloss (2014), Vice President of Site Reliability Engineering at Google, listed them as follows:

Site Reliability Practices

  1. Hire only coders.
  2. Have an SLA for your service.
  3. Measure and report performance against the SLA.
  4. Use Error Budgets and gate launches on them.
  5. Have a common staffing pool for SRE and Developers.
  6. Have excess Ops work overflow to the Dev team.
  7. Cap SRE operational load at 50 percent.
  8. Share 5 percent of Ops work with the Dev team.
  9. Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
  10. Aim for a maximum of two events per oncall shift.
  11. Do a postmortem for every event.
  12. Postmortems are blameless and focus on process and technology, not people.

The first principle for site reliability engineering is that SREs must be able to code. An SRE might not be a full-time software developer, but he or she should be able to solve nontrivial problems by writing code. When asked to do 30 iterations of a task, an SRE should do the first two, get bored, and automate the rest. An SRE must have enough software development experience to be able to communicate with developers on their level and have an appreciation for what developers do, and for what computers can and can’t do.

When SREs and developers come from a common staffing pool, that means that projects are allocated a certain number of engineers; these engineers may be developers or SREs. The end result is that each SRE needed means one fewer developer in the team. Contrast this to the case at most companies where system administrators and developers are allocated from teams with separate budgets. Rationally a project wants to maximize the number of developers, since they write new features. The common staffing pool encourages the developers to create systems that can be operated efficiently so as to minimize the number of SREs needed.

Another way to encourage developers to write code that minimizes operational load is to require that excess operational work overflows to the developers. This practice discourages developers from taking shortcuts that create undue operational load. The developers would share any such burden. Likewise, by requiring developers to perform 5 percent of operational work, developers stay in tune with operational realities.

Within the SRE team, capping the operational load at 50 percent limits the amount of manual labor done. Manual labor has a lower return on investment than, for example, writing code to replace the need for such labor. This is discussed in Section 12.4.2, “Reducing Toil.”
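One way to see how the 5 percent share, the 50 percent cap, and the overflow rule interact is a small calculation. The sketch below is one possible reading of those three practices, assuming all work is measured in hours per period; the function name and parameters are hypothetical.

```python
from typing import Tuple


def split_ops_work(ops_hours: float,
                   sre_hours_available: float,
                   dev_share: float = 0.05,
                   sre_cap: float = 0.50) -> Tuple[float, float]:
    """Return (sre_hours, dev_hours) of operational work for one period.

    ops_hours: total operational work that must be done.
    sre_hours_available: total working hours of the SRE team.
    dev_share: fraction of ops work always handled by developers.
    sre_cap: maximum fraction of SRE time spent on operational work.
    """
    dev_hours = ops_hours * dev_share           # the fixed developer share
    sre_limit = sre_hours_available * sre_cap   # the cap on SRE operational load
    sre_hours = min(ops_hours - dev_hours, sre_limit)
    overflow = ops_hours - dev_hours - sre_hours  # excess goes to the Dev team
    return sre_hours, dev_hours + overflow


# A light period and a heavy period for a team with 400 hours available.
print(split_ops_work(ops_hours=100, sre_hours_available=400))
print(split_ops_work(ops_hours=300, sre_hours_available=400))
```

In the heavy period the SRE team stops at its cap and the remainder lands on the developers, which is the incentive the overflow practice is meant to create.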

Many SRE practices relate to finding balance between the desire for change and the need for stability. The most important of these is the Google SRE practice called Error Budgets, explained in detail in Section 19.4.

Central to the Error Budget is the SLA. All services must have an SLA, which specifies how reliable the system is going to be. The SLA becomes the standard by which all work is ultimately measured. SLAs are discussed in Chapter 16.

Any outage or other major SLA-related event should be followed by the creation of a written postmortem that includes details of what happened, along with analysis and suggestions for how to prevent such a situation in the future. This report is shared within the company so that the entire organization can learn from the experience. Postmortems focus on the process and the technology, not on finding someone to blame. Postmortems are the topic of Section 14.3.2. The person who is oncall is responsible for responding to any SLA-related events and producing the postmortem report.

Oncall is not just a way to react to problems, but rather a way to reduce future problems. It must be done in a way that is not unsustainably stressful for those oncall and that drives behaviors that encourage long-term fixes and problem prevention. Oncall teams are made up of at least eight members at one location, or six members at each of multiple locations. Teams of this size will be oncall often enough that their skills do not get stale, and their shifts can be short enough that each catches no more than two outage events. As a result, each member has enough time to follow through on each event, performing the required long-term solution. Managing oncall this way is the topic of Chapter 14.

Other companies have adopted the SRE job title for their system administrators who maintain live production services. Each company applies a different set of practices to the role. These are the practices that define SRE at Google and are core to its success.

7.1.4 Operations at Scale

Operations in distributed computing is operations at a large scale. Distributed computing involves hundreds and often thousands of computers working together. As a result, operations is different from traditional computing administration.

Manual processes do not scale. When tasks are manual, if there are twice as many tasks, there is twice as much human effort required. A system that is scaling to thousands of machines, servers, or processes, therefore, becomes untenable if a process involves manually manipulating things. In contrast, automation does scale. Code written once can be used thousands of times. Processes that involve many machines, processes, servers, or services should be automated. This idea applies to allocating machines, configuring operating systems, installing software, and watching for trouble. Automation is not a “nice to have” but a “must have.” (Automation is the subject of Chapter 12.)

When operations is automated, system administration is more like an assembly line than a craft. The job of the system administrator changes from being the person who does the work to the person who maintains the robotics of an assembly line. Mass production techniques become viable and we can borrow operational practices from manufacturing. For example, by collecting measurements from every stage of production, we can apply statistical analysis that helps us improve system throughput. Manufacturing techniques such as continuous improvement are the basis for the Three Ways of DevOps. (See Section 8.2.)

Three categories of things are not automated: things that should be automated but have not been yet, things that are not worth automating, and human processes that can’t be automated.

Tasks That Are Not Yet Automated

It takes time to create, test, and deploy automation, so there will always be things that are waiting to be automated. There is never enough time to automate everything, so we must prioritize and choose our methods wisely. (See Section 2.2.2 and Section 12.1.1.)

For processes that are not, or have not yet been, automated, creating procedural documentation, called a playbook, helps make the process repeatable and consistent. A good playbook makes it easier to automate the process in the future. Often the most difficult part of automating something is simply describing the process accurately. If a playbook does that, the actual coding is relatively easy.

Tasks That Are Not Worth Automating

Some things are not worth automating because they happen infrequently, they are too difficult to automate, or the process changes so often that automation is not possible. Automation is an investment in time and effort and the return on investment (ROI) does not always make automation viable.
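The ROI reasoning can be made concrete with a break-even estimate. This is a minimal sketch that assumes the only costs are the hours spent building the automation and the hours spent per run; it ignores maintenance of the automation itself, and all names and figures are hypothetical.

```python
def break_even_runs(build_hours: float,
                    manual_hours_per_run: float,
                    automated_hours_per_run: float = 0.0) -> float:
    """Return how many runs it takes for the automation to pay for itself."""
    saved_per_run = manual_hours_per_run - automated_hours_per_run
    if saved_per_run <= 0:
        return float("inf")  # nothing is saved, so it never pays off
    return build_hours / saved_per_run


def worth_automating(build_hours: float,
                     manual_hours_per_run: float,
                     runs_per_year: float,
                     horizon_years: float,
                     automated_hours_per_run: float = 0.0) -> bool:
    """Compare expected runs over the planning horizon with the break-even point."""
    expected_runs = runs_per_year * horizon_years
    needed = break_even_runs(build_hours, manual_hours_per_run,
                             automated_hours_per_run)
    return expected_runs > needed


# A 30-minute task that would take 40 hours to automate.
print(break_even_runs(build_hours=40, manual_hours_per_run=0.5))
print(worth_automating(40, 0.5, runs_per_year=4, horizon_years=2))
print(worth_automating(40, 0.5, runs_per_year=200, horizon_years=1))
```

An infrequent task never reaches the break-even point, and a process that changes often effectively shortens the horizon, because the automation must be rebuilt before it has paid for itself.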

Nevertheless, there are some common cases that are worth automating. Often when those are automated, the rarer cases (edge cases) can be consolidated or eliminated. In many situations, the newly automated common case provides such superior service that the edge-case customers will suddenly lose their need to be so unique.

Tasks That Cannot Be Automated

Some tasks cannot be automated because they are human processes: maintaining your relationship with a stakeholder, managing the bidding process to make a large purchase, evaluating new technology, or negotiating within a team to assemble an oncall schedule. While they cannot be eliminated through automation, they can be streamlined:

  • Many interactions with stakeholders can be eliminated through better documentation. Stakeholders can be more self-sufficient if provided with introductory documentation, user documentation, best practices recommendations, a style guide, and so on. If your service will be used by many other services or service teams, it becomes more important to have good documentation. Video instruction is also useful and does not require much effort if you simply make a video recording of presentations you already give.
  • Some interactions with stakeholders can be eliminated by making common requests self-service. Rather than meeting individually with customers to understand future capacity requirements, their forecasts can be collected via a web user interface or an API. For example, if you provide a service to hundreds of other teams, forecasting can become a full-time job for a project manager; alternatively, it can be very little work with proper automation that integrates with the company’s supply-chain management system.
  • Evaluating new technology can be labor intensive, but if a common case is identified, the end-to-end process can be turned into an assembly-line process and optimized. For example, if hard drives are purchased by the thousand, it is wise to add a new model to the mix only periodically and only after a thorough evaluation. The evaluation process should be standardized and automated, and results stored automatically for analysis.
  • Automation can replace or accelerate team processes. Creating the oncall schedule can evolve into a chaotic mess of negotiations between team members battling to take time off during an important holiday. Automation turns this into a self-service system that permits people to list their availability and that churns out an optimal schedule for the next few months. Thus, it solves the problem better and reduces stress.
  • Meta-processes such as communication, status, and process tracking can be facilitated through online systems. As teams grow, just tracking the interaction and communication among all parties can become a burden. Automating that can eliminate hours of manual work for each person. For example, a web-based system that lets people see the status of their order as it works its way through approval processes eliminates the need for status reports, leaving people to deal with just exceptions and problems. If a process has many complex handoffs between teams, a system that provides a status dashboard and automatically notifies teams when handoffs happen can reduce the need for legions of project managers.
  • The best process optimization is elimination. A task that is eliminated does not need to be performed or maintained, nor will it have bugs or security flaws. For example, if production machines run three different operating systems, narrowing that number down to two eliminates a lot of work. If you provide a service to other service teams and require a lengthy approval process for each new team, it may be better to streamline the approval process by automatically approving certain kinds of users.
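The self-service oncall schedule mentioned in the list above can be sketched as a small program. This is only an illustration: it assumes each person lists the shifts they cannot cover, and it uses a simple greedy rule (fewest shifts so far, avoiding back-to-back shifts where possible) rather than a truly optimal solver. The function name, data layout, and people are hypothetical.

```python
from typing import Dict, List, Optional, Set


def build_schedule(shifts: List[str],
                   unavailable: Dict[str, Set[str]]) -> Dict[str, Optional[str]]:
    """Assign one person to each shift, balancing load greedily.

    shifts: shift labels in calendar order.
    unavailable: maps each person to the set of shifts they cannot cover.
    A shift nobody can cover maps to None so that a human can resolve it.
    """
    load = {person: 0 for person in unavailable}
    schedule: Dict[str, Optional[str]] = {}
    previous: Optional[str] = None
    for shift in shifts:
        candidates = [p for p in load if shift not in unavailable[p]]
        # Prefer people who did not cover the previous shift.
        rested = [p for p in candidates if p != previous]
        pool = rested or candidates
        if not pool:
            schedule[shift] = None
            previous = None
            continue
        # Pick the least-loaded person; break ties by name for determinism.
        chosen = min(pool, key=lambda p: (load[p], p))
        load[chosen] += 1
        schedule[shift] = chosen
        previous = chosen
    return schedule


requests = {"ana": {"w2"}, "bo": set(), "cy": {"w1", "w3"}}
print(build_schedule(["w1", "w2", "w3", "w4"], requests))
```

Because the inputs are collected from each person directly, the negotiation is replaced by data entry, and unresolvable shifts are surfaced explicitly instead of being discovered late.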
