
An Interview with the authors of “The Practice of Cloud System Administration” on DevOps and Data Security

Win Treese interviews Thomas A. Limoncelli, Strata R. Chalup, and Christina J. Hogan, the authors of The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2, about DevOps, tearing down silos, the challenge of data security, and why we’re all doomed.

Win Treese: A focal point for your new book is “DevOps”. For those not familiar with the term, how would you describe DevOps and its philosophy?

Thomas Limoncelli: DevOps is breaking down the silos of Dev(eloper)s and Op(eration)s so that they work together. In the old days, Devs wrote code and threw it over the wall (or into a shrink-wrap machine) and Ops picked it up and were stuck trying to figure out how to make it operational in an efficient way. Requests for code changes that would improve operability went largely ignored.

Christina Hogan: The result was that Ops resisted change, introducing change management processes and sign-off tasks that made it harder to push new releases into production. This was frustrating for Devs, as it delayed new features reaching customers and kept the Devs from seeing the benefit of their hard work. Debugging was hard because of the long time between developing the code and getting bug reports. Releases were larger and less frequent, and therefore riskier and more error-prone, making Ops even more change-averse. It was a vicious circle that benefited no one.

Limoncelli: In a web environment, Devs and Ops work at the same company and can work together. If management makes both Devs and Ops responsible for uptime and performance, it becomes natural to tear down the silos and work together. Automation becomes the rule, not the exception. Devs and Ops develop empathy for each other’s roles and can focus on global optimizations, rather than local optimizations that are globally ineffective.

Hogan: Devs benefit from getting new features more quickly into production, and having less overhead and resistance for deploying new releases. Ops benefit from more reliable releases and better operational features.

Strata Chalup: These lessons are not just for web environments. While the techniques and cultural changes came from the web, they are now being applied to all environments because every organization has silos and can benefit from more cooperation.

Treese: Your book is quite comprehensive, possibly with an intimidating amount of detail. How would you advise existing groups about starting to work as a DevOps team?

Limoncelli: It usually happens by a grassroots initiative. Two people from different teams sit down, usually outside of the office, and have a heart-to-heart conversation about ways they can improve the entire system by working closer together. When I worked at Google I would often have team leads meet and discuss their processes and the handoffs between the teams. Soon people would be listing “pain points” and both teams would commit to changes that would make things better.

Alternatively it can start from the top. However, if an executive puts out a mandate that everyone should tear down silos, that’s not going to change anything. They should, instead, set goals for end-to-end improvements in latency (how long it takes a feature to get to production), and quality (how many failed deployments) and then make sure the middle management understands that these improvements can best be made by making it possible for their teams to cooperate directly across silos. In other words, management needs to get out of the way.

Here’s a quick test an executive can use: If one team needs to talk to another to make an improvement, are they permitted to talk directly to the person that they need to talk with to fix the problem, or do they have to go up one management chain, to the common manager, then down another chain, getting approval at each manager until a couple weeks later they’re finally talking to the right person? Or, worse yet, is the message conveyed on the engineer's behalf by management who act as gatekeepers so that critical technical details are lost? If the norm is anything other than direct person-to-person contact, the executives have built a company with bad culture and they need to change that.

Hogan: Not only should employees be encouraged to talk directly to the person they need to talk to, in a highly productive culture people are scolded for trying to go up the management chain rather than talking directly. At most companies it is the other way around.
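The "latency and quality" goals described above are easy to track from deployment records. As a rough sketch in Python, with invented numbers standing in for a team's real deployment history:

```python
# Hypothetical deployment history: each record is the number of days a
# feature took to go from commit to production, plus whether the
# deployment succeeded. A real team would pull this from its own
# tracker or deploy logs; these values are made up for illustration.
deployments = [
    (14.0, True),
    (9.5, True),
    (21.0, False),  # a failed deployment
    (7.0, True),
]

# End-to-end latency: average time for a feature to reach production.
avg_latency = sum(days for days, _ in deployments) / len(deployments)

# Quality: fraction of deployments that failed.
failure_rate = sum(1 for _, ok in deployments if not ok) / len(deployments)

print(f"average commit-to-production latency: {avg_latency:.1f} days")
print(f"failed deployments: {failure_rate:.0%}")
```

Tracking these two numbers over time gives executives a concrete measure of whether cross-silo cooperation is actually improving delivery, without prescribing how teams talk to each other.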

Treese: The Practice of Cloud System Administration is described as Volume 2, following your previous book, The Practice of System and Network Administration. How have things changed since volume 1?

Hogan: Volume 1 was written in 2000, when outsourcing often meant laying off an IT department and replacing it with a company that did IT services, often using the people that had just been laid off. In retrospect, these deals rarely achieved the desired improvements in cost or quality. Since then using IaaS, PaaS and SaaS offerings has become an alternative way to outsource.

Volume 2 covers designing and operating these services. It looks at the DevOps cultural and organizational approach that is an essential part of successful cloud services. It describes the benefits of continuous integration, continuous delivery and continuous deployment. And it introduces an assessment framework for driving and measuring continuous improvement.

Limoncelli: Volume 1 has a lot of information about how to run a helpdesk and how to maintain a fleet of desktops and laptops. That used to be a big part of system administration. Now helpdesks generally don’t exist, and when they do they’re not staffed by system administrators because the skill set required is not as technical. When Volume 1 was written we were radicals for saying that OS installation should be automated so that each machine started out the same; and that even machines that come from a vendor preloaded with an OS should be wiped and re-installed with the standard configuration. Now that kind of thing is conventional wisdom.

Hogan: Volume 1 is also getting a make-over in the coming year to bring it up-to-date with the current state of the art, but it will still be quite distinct from Volume 2. We refer to Volume 1 as “the enterprise book” and Volume 2 as “the cloud book”. For better or worse, the enterprise is inherently more concerned with helping the end-users to be able to perform their jobs effectively and “ship product” that is not an Internet-based service. Enterprise IT teams need to provide all the services that their end-users need—not just provide a select few really well at massive scale. On the other hand, scaling of most services in the enterprise is much less of an issue than it is “in the cloud”. Techniques in Volume 2 can be applied in the enterprise, of course, particularly in large enterprises that are building private clouds. Also, most enterprises are moving in the direction of becoming customers for the various cloud offerings, rather than building those services themselves. Enterprise SA teams also need to prepare for moving more services into the cloud, and to know when to recommend and how to manage IaaS or PaaS solutions.

Treese: How does the emergence of cloud computing affect enterprise IT? How is it different from outsourcing?

Hogan: Cloud providers offer a clearly defined service, akin to something you buy from a hardware or software vendor. A cloud service does not try to replicate the existing complex mess at the enterprise and support it for less money. The cloud provider leaves the enterprise to decide if its solution meets the enterprise’s needs or not, and to adapt its processes to the cloud solution. The cloud provider takes requests for features and customization, but only implements the ones that make sound business and technical sense, again akin to a hardware or software vendor. When an enterprise moves something into the cloud, the service and the costs are clearly defined up-front and there are fewer surprises. I say “fewer” rather than “no” because enterprises that do not understand what their usage patterns will really be may still end up surprised at the costs. Because the service is clearly defined, economies of scale can be (and are) realized. The provider’s operational costs are amortized over many customers. The customer trades off the ability to customize the solution for a cheaper and better service. The benefits of the standardized cloud service usually outweigh the benefits of the customization.

For most small businesses, using cloud services should be a clear win. They typically just need some standard infrastructure services that should be available as cloud services. That approach should reduce the IT expertise that the company needs, obviate the need for expensive computing infrastructure and datacenter space, and provide them with better reliability than they could otherwise afford.

Larger enterprises tend to have more customized environments. However, there is a lot of commonality between the needs of different companies. We expect that over time companies will move towards using more SaaS solutions, and that SaaS solutions for more services that enterprises need will be developed. We also expect that the challenges posed by data privacy and integrity needs, local regulatory requirements and other blockers will be addressed by cloud providers, and solutions for particular industry sectors such as banking and healthcare will emerge.

Treese: A startup company doesn’t begin with a large-scale operation. How is this book relevant to them in the early days?

Chalup: Some people feel that DevOps is only needed by companies like Google, Facebook and Yahoo!... the unicorns of this industry: amazingly unique and special entities. The truth is that there are no unicorns. The problems that big companies face are the same problems that small companies face. If anything, most big companies are just conglomerates of many, many small teams with all the same problems that small companies have.

Limoncelli: Every startup wants to eventually have millions, or billions, of users. The architectural decisions they make at the start will enable or inhibit that kind of growth. The book has a lot of advice about platform selection, tools, and methodologies that work at small, medium, and large scale.

Treese: Sometimes it can seem impossible to keep up with changing tools, best practices, new products, and new open source projects that all promise to make life better for system administrators and developers. How do you stay on top of those things for your job?

Limoncelli: This is a book of fundamental principles and practices that are timeless. Therefore we don’t make recommendations about which specific products or technologies to use. We could provide a comparison of the top five most popular web servers or NoSQL databases or continuous build systems. If we did, then the book would be out of date the moment it was published. Instead, we discuss the qualities one should look for when selecting such things. We provide a model to work from. This approach is intended to prepare the reader for a long career where technology changes over time but they are always prepared. We do, of course, illustrate our points with specific technologies and products, but not as an endorsement of those products and services.

Hogan: The best way to stay on top of all the developments in our field is to go to conferences and talk to people. Find out what they are doing and share your experiences with others. For people outside of the US, it can be a lot harder to get companies to fund attendance of the mostly US-based conferences. However, if you are willing to bear some of the costs yourself (e.g. flights), usually your company will agree to pay the rest (e.g. accommodation, meals and conference fees). If you care about your career and staying current, it’s worth it.

Treese: Many components of the systems you describe in the book are available as commercial or open source products, but they often require substantial work to integrate them. How do you think about making the build-or-buy decision?

Chalup: In a perfect world the exact modules and subsystems we need would already exist, and our role would simply be to buy and integrate them. Sadly the world isn’t like that. Instead, we need to build systems ourselves when we can do a better job, or when there is value in a more tightly integrated solution.

Limoncelli: Google has achieved an unparalleled level of efficiency because they created their own stack; not just software but hardware too. Their software stack is years ahead of anyone else’s. Their hardware stack is too. They have rethought everything from how the concrete floors of the datacenter should be engineered to how the roof should be made, and everything in between. As a result they can’t include off-the-shelf products (software or hardware) without a lot of pain. However, that occasional sacrifice is minor compared to the huge advantage from their tight integration.

Treese: Recent security attacks are forcing everyone to up their game when it comes to cybersecurity. What are the best short-term steps for a DevOps team to take?

Hogan: Recent events like Heartbleed, Shellshock and POODLE provided some golden opportunities for DevOps teams to identify bottlenecks. How quickly were their systems patched, and how long did each step in the process take? Which steps would most need to improve for the team to be able to upgrade more quickly next time? Those are the steps that should be improved in the near term.
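This bottleneck exercise can be sketched in a few lines. The step names and timings below are invented for illustration; a real post-mortem would use the team's actual patch timeline:

```python
# Hypothetical elapsed hours for each step of an emergency patch cycle
# (e.g. responding to Heartbleed). Step names and numbers are made up.
patch_steps = {
    "learn of vulnerability, obtain vendor patch": 4,
    "test patch in staging": 40,
    "obtain change approval": 72,
    "roll out to production fleet": 8,
}

# Total time exposed, and the single slowest step to target first.
total_hours = sum(patch_steps.values())
bottleneck = max(patch_steps, key=patch_steps.get)

print(f"total time to patch: {total_hours} hours")
print(f"slowest step: {bottleneck} ({patch_steps[bottleneck]} hours)")
```

In this made-up example, approval overhead dwarfs the technical work, which suggests where near-term improvement effort should go.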

Limoncelli: There’s a new layer in “multi-layer security” called change-ability. If it is uncomfortable to upgrade software, you’re doomed to be stuck with insecure software. When Heartbleed struck, nearly every device and software package in every server and system needed to be upgraded. This made organizations painfully aware of which products could be upgraded easily and which were virtually impossible (and everything in between). Organizations that had one or two servers that never got upgraded because “the last software upgrade went badly, let’s never do it again” were in for a world of hurt. Every box and software system needs to be change-able: you need to be able to roll out changes rapidly, frequently, and confidently. If a system is change-averse for technical, political, or operational reasons, you need to re-engineer it so that it is change-able. Continuous Integration (CI) isn’t about speed; it is about confidence that new releases will be of high quality. Speed is a happy side effect.

If a system’s upgrade process is risky, we avoid upgrades, thinking we are reducing risk. In fact we are creating a bigger risk by becoming “out of practice”. If you upgrade frequently, you force yourself to get better at it, which reduces the risk of failure and improves your ability to react quickly to software vulnerabilities. The first few upgrades will be painful, but over time the process gets smoother and the risk of negative outcomes drops. Every grade-school child knows that “practice makes perfect”; IT should know it too.

Being change-able doesn’t only affect security. An organization that has grown accustomed to never changing internal processes can’t improve itself. When we accept software that is difficult or impossible to upgrade, we become accustomed to its deficiencies. Soon our time is consumed with working around that calcified organization or buggy software. This becomes hugely inefficient and ultimately a competitive disadvantage.

Treese: What are the best long-term security practices to put in place?

Limoncelli: Stop focusing on compliance and focus on being better at protecting your data. Much of the security world is focused on compliance which, basically, prepares you for last year’s problems. Don’t be shocked that a minimum password length and desktops that lock their screens after 2 minutes don’t protect you from Heartbleed. Those were problems in the 1990s. Focus on new architectures that are secure from the start: develop a good certificate management system and then use it consistently for everything; think hard about every type of data you have and whether it should be in the cloud, on premises, or on computers that never connect to networks at all; stop buying software from vendors that have never taken security seriously. If a company has a history of making vulnerable products with no remorse, we must stop giving them money so that they’ll stop hurting society.

I think we need new business models. The current ones aren’t working. If you buy a product from someone else, their goal is to make money. Any security fix or patch they issue is just a painful subtraction from their highly profitable maintenance contract. There is a perverse incentive that leads to insecure products. Building software securely from the start is too expensive (see “A Security Market for Lemons”) or delays your product from reaching market; either puts you out of business, leaving only the snake-oil and sloppy, insecure vendors. It is cheaper to hide problems than to fix them. The alternative is to write your own software stack so that the entire value chain is controlled by you and therefore aligned with your best interests. That is impossibly expensive and requires every company to have technical prowess that is unrealistic to achieve. So basically, we’re doomed.

Hogan: On that cheerful note... I agree with Tom that the current financial incentives for information security are flawed. Finding a solution to this conundrum is non-trivial, and not something that technologists with little real understanding of economics can do alone. However, there is an annual workshop on the economics of information security (WEIS), which brings together economists and technologists who are interested in this problem. Some of the things that they have analyzed include the effectiveness of using civil liability as a financial incentive for improving security, the effect of cloud-based SaaS offerings on security, and the effectiveness and incentives of bi-directional ISP-based filtering. A recent ACM article by WEIS attendee Cormac Herley indicates that increasing the cost of an attack can significantly reduce the likelihood of successful break-ins by financially motivated attackers. In other words, encryption and strong authentication do help. Also, the IETF is actively looking into standards and protocol revisions to enable, and in some cases enforce, better security practices.

So, in a nutshell, our recommendations are: integrate security from the ground up; know your data; understand your data security requirements; use encryption and strong authentication; manage your encryption keys well; stay current; be change-able; and avoid vendors who don’t take security seriously.
