Operations is different from typical enterprise IT because it is focused on a particular service or group of services and because it has more demanding uptime requirements.
There is a tension between the operations team’s desire for stability and the developers’ desire to get new code into production. There are many ways to reach a balance. Most ways involve aligning goals by sharing responsibility for both uptime and velocity of new features.
Operations in distributed computing is done at a large scale. Processes that have to be done manually do not scale. Constant process improvement and automation are essential.
Operations is responsible for the life cycle of a service: launch, maintenance, upgrades, and decommissioning. Maintenance tasks include emergency and nonemergency response. In addition, related projects maintain and evolve the service.
Launches, decommissioning of services, and other infrequent tasks require an attention to detail that is best ensured by using checklists. Checklists ensure that lessons learned in the past are carried forward.
The most productive use of time for operational staff is time spent automating and optimizing processes. This should be their primary responsibility. In addition, two other kinds of work require attention. Emergency tasks need fast response. Nonemergency requests need to be managed such that they are prioritized and worked in a timely manner. To make sure all these things happen, at any given time one person on the operations team should be focused on responding to emergencies; another should be assigned to prioritizing and working on nonemergency requests. When team members take turns in these roles, each responsibility receives the dedicated attention required to ensure it is handled correctly, and the load is shared across the team. People also avoid burning out.
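The rotation described above can be sketched in a few lines of code. This is only a minimal illustration, assuming a weekly rotation period and a simple round-robin order; the function name `build_rotation`, the role names, and the example team are hypothetical and do not come from the text.

```python
def build_rotation(team, weeks):
    """Return a list of (week, emergency_owner, request_owner, project_staff).

    Assumption: roles rotate weekly in round-robin order. The person who
    handles nonemergency requests one week handles emergencies the next.
    """
    if len(team) < 2:
        raise ValueError("need at least two people to fill both roles")
    schedule = []
    for week in range(weeks):
        emergency_owner = team[week % len(team)]
        request_owner = team[(week + 1) % len(team)]
        # Everyone else stays focused on automation and project work.
        project_staff = [
            person for person in team
            if person not in (emergency_owner, request_owner)
        ]
        schedule.append((week + 1, emergency_owner, request_owner, project_staff))
    return schedule


if __name__ == "__main__":
    for week, emergency, requests, projects in build_rotation(
        ["ana", "ben", "chi", "dev"], weeks=4
    ):
        print(f"week {week}: emergencies={emergency}, "
              f"requests={requests}, projects={projects}")
```

With a team of four, each person spends two weeks out of four on interrupt-driven work and the other two on projects. A real schedule would also have to account for vacations, time zones, and swaps, which this sketch ignores.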
Operations teams generally work far from the actual machines that run their services. Since they operate the service remotely, they can work from anywhere there is a network connection. Therefore, teams often work from different places, collaborating and communicating in a chat room or other virtual office. Many tools are available to enable this type of organizational structure. In such an environment, it becomes important to choose the communication medium based on the type of communication required. Chat rooms are sufficient for general communication, but voice and video are better suited to more intense discussions. Email is more appropriate when a record of the communication is required, or when it is important to reach people who are not currently online.