The head of R&D for the Siemens Healthineers teamplay digital health platform and reliability lead for all Siemens Healthineers Digital Health products outlines what companies are looking for when hiring SRE engineers.
An SRE engineer, or site reliability engineer in full, is someone who engineers reliability for a website. Engineering reliability spans a wide range of activities throughout the lifetime of a service.
According to The Site Reliability Workbook by Google, these activities span the architecture, design, development and operations of the service. However, there is currently a fairly common misconception that the SRE engineer concentrates only on operations. Indeed, many job adverts for SRE engineers focus on operational skills.
To engineer reliability properly and in good time, SRE engineers need to pay equal attention to requirement engineering, software architecture, design and development.
In requirement engineering, the SRE engineer can discuss the most critical user workflows, the most critical reliability aspects of those workflows, the frequency of workflow execution and general reliability assumptions. Following this, concrete requirements can be defined that need to be fulfilled by the architecture, design and development of the service. In fact, the definition of SLOs and SLAs is best started before these technical activities begin.
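To make the idea of an early SLO definition concrete, here is a minimal sketch in Python. The class name, the workflow name "upload_scan" and the 99.9% target over 30 days are hypothetical values chosen for illustration, not figures from any real service. It shows how an availability target for a critical workflow translates into an error budget that architecture and design then have to respect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical availability SLO for one critical user workflow."""
    workflow: str        # name of the user workflow (illustrative)
    target: float        # fraction of time the workflow must work, e.g. 0.999
    window_days: int     # rolling window the target applies to

    def error_budget_minutes(self) -> float:
        """Minutes of unavailability the target tolerates within the window."""
        window_minutes = self.window_days * 24 * 60
        return (1.0 - self.target) * window_minutes


# Assumed example: 99.9% availability over a 30-day window.
slo = AvailabilitySLO(workflow="upload_scan", target=0.999, window_days=30)
print(f"{slo.workflow}: {slo.error_budget_minutes():.1f} minutes of budget")
# roughly 43.2 minutes in 30 days
```

A number like this gives the requirement discussion something tangible: a design that needs an hour of planned downtime per month would not fit within this assumed budget.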
In software architecture and design, the SRE engineer can introduce resilience to single points of failure, suggest appropriate use of stability patterns for distributed systems, foresee enough adaptive capacity between services, weigh design choices from the resilience point of view and validate some assumptions by implementing spikes.
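One example of a stability pattern for distributed systems is the circuit breaker, which stops calling a failing dependency for a while instead of letting failures pile up. The following is a minimal, single-threaded sketch in Python; the class names, the threshold of three failures and the 30-second reset timeout are assumptions made for illustration, and a production implementation would also need thread safety and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker: fail fast after repeated failures."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout_s = reset_timeout_s      # how long to stay open
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at = None                      # None means closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()     # open (or re-open)
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to the dependency, for example `breaker.call(fetch_report, patient_id)`, where `fetch_report` stands for any hypothetical remote call, and treat `CircuitOpenError` as a signal to fall back or degrade gracefully.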
In development, the SRE engineer can actively take part in service implementation. Previously discussed architecture and design choices can be implemented, and thereby validated. Additionally, automation and tools necessary to operate the service with reasonable effort can be implemented as well.
In operations, the SRE engineer can react to SLO and SLA breaches. In case of severe breaches, they can take part in the incident response process involving many teams.
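How severely an SLO is being breached can be expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below, in Python, is one possible way to turn that into a reaction. The function names and the two thresholds are illustrative assumptions and would need tuning per service.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate relative to the error rate the SLO allows.

    A value of 1.0 means the error budget is being consumed exactly as
    fast as the SLO permits; higher values mean faster consumption.
    """
    if total == 0:
        return 0.0                      # no traffic, nothing burned
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def classify(rate: float, page_at: float = 14.4, ticket_at: float = 1.0) -> str:
    """Map a burn rate to a reaction; thresholds are illustrative defaults."""
    if rate >= page_at:
        return "page"                   # severe: start incident response
    if rate >= ticket_at:
        return "ticket"                 # budget is shrinking: investigate
    return "ok"


# Assumed example: 20 failed requests out of 1,000 against a 99.9% SLO.
print(classify(burn_rate(failed=20, total=1000, slo_target=0.999)))
```

With these assumed thresholds, only fast budget consumption triggers the multi-team incident response process, while slower burns lead to a ticket for the owning team.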
With the activities above, the SRE engineer should be in a position to establish a service that is sufficiently reliable for the load and usage profile at hand. It is an exciting multi-disciplinary job and a rewarding career that enjoys growing popularity in the industry.
The growth in job ads for SRE engineers is impressive, with thousands of companies around the world trying to attract the best talent available.
Refer to Chapter 12.6.2 in the book Establishing SRE Foundations to find out more!