
Service Level Agreement in the Solaris OE Data Center

Building on the definitions, processes, and best practices supporting the Service Level Management (SLM) process presented in the first article in this two-part series, this article explores Service Level Agreements (SLAs). Best practices for keeping SLAs simple, measurable, and realistic—thus avoiding the most common pitfalls of overpromising and underdelivering on agreements—are detailed, and templates are provided that illustrate the translation of SLA principles to real-world examples.

As we discussed in the Service Level Management in the Data Center article, the ability to deliver according to predefined agreements is increasingly becoming a competitive requirement. Aside from being able to deliver highly available, reliably performing systems, just being able to deliver on your promise is key to success. This is why an effective and efficient Service Level Management (SLM) system is important. Key to the success of such a system is a sound Service Level Agreement (SLA).

This article describes what role an SLA plays in the Internet Data Center (IDC) and how it helps assure that one's reputation stays intact. It also includes sample agreements that can be used as templates.

Because of time constraints, the legal aspects of an SLA, which require localization and customization, are not covered in this article.

Service Level Agreements

A good SLA helps the IDC promise what is possible to deliver and deliver what is promised.

In this article, we establish what an SLA is and provide two sample agreements and one example of how an agreement can map to key performance indicators. These sections are followed by a discussion of why we believe SLAs are so important and what we see as the essential benefits of managing against one.

What a Service Level Agreement Is

An SLA sets the expectations between the consumer and provider. It helps define the relationship between the two parties. It is the cornerstone of how the service provider sets and maintains commitments to the service consumer.

A good SLA addresses five key aspects:

  • What the provider is promising.
  • How the provider will deliver on those promises.
  • Who will measure delivery, and how.
  • What happens if the provider fails to deliver as promised.
  • How the SLA will change over time.

In the definition of an SLA, realistic and measurable commitments are important. Performing as promised is important, but swift and well communicated resolution of issues is even more important.

The challenge for a new service and its associated SLA is that the architecture directly determines the maximum level of availability that can be achieved. Thus, an SLA cannot be created in a vacuum. An SLA must be defined with the infrastructure in mind.

An exponential relationship exists between the levels of availability and the related cost. Some customers need higher levels of availability and are willing to pay more. Therefore, having different SLAs with different associated costs is a common approach.
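To make the availability levels concrete, the following minimal sketch converts an availability percentage into the downtime it allows per month. The 30-day, round-the-clock month and the function name are assumptions chosen for illustration; a real SLA would use its own normal hours of operation. The sketch says nothing about cost itself, but it shows that each additional "nine" cuts the downtime budget by a factor of ten.

```python
# Allowed downtime implied by an availability commitment.
# Assumption: a 30-day month of round-the-clock service (720 hours).
HOURS_PER_MONTH = 30 * 24


def allowed_downtime_minutes(availability_pct, service_hours=HOURS_PER_MONTH):
    """Return the downtime budget, in minutes, for the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = (100 - availability_pct) / 100
    return service_hours * 60 * unavailable_fraction


for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} minutes per month")
```

Under these assumptions, 99 percent availability leaves about 432 minutes of downtime per month, while 99.9 percent leaves about 43 minutes.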

The following section contains an example template of an SLA to show all of the important components one wants to address in such a document.

The SLA example in TABLE 1 is based on the short form template that is available from nextslm.org (http://www.nextslm.org/). An online learning community, nextslm.org is dedicated to providing clear, concise answers about SLAs. It is sponsored by BMC, PriceWaterhouseCoopers, and Sun Microsystems, Inc.

This template is customized and augmented to reflect a service provider model. By replacing the italicized text with specific service aspects, the template can be customized to reflect a specific service offering.

TABLE 1 SLA Template

The insert service name is used by insert customer name to insert description of the service capability. The Internet Service Provider (ISP) guarantees that:

The service name will be available insert percentage of the time from insert normal hours of operation including hours and days of the week. Any individual outage in excess of insert time period or sum of outages exceeding insert time period per month will constitute a violation.

Insert percentage of service name transactions will exhibit insert value seconds or less response time, defined as the interval from the time the user sends a transaction to the time a visual confirmation of transaction completion is received. Missing the metric for business transactions measured over any business week will constitute a violation.

The IDC Customer Care team will respond to service incidents that affect multiple users (typically more than 10) within insert time period, resolve the problem within insert time period, and update status every insert time period. Missing any of these metrics on an incident will constitute a violation.

The IDC Customer Care team will respond to service incidents that affect individual users within insert time period, resolve the problem within insert time period, and update status every insert time period. Missing any of these metrics on an incident will constitute a violation.

The IDC Customer Care team will respond to non-critical inquiries within insert time period, deliver an answer within insert time period, and update status every insert time period. Missing any of these metrics on an incident will constitute a violation. A non-critical inquiry is defined as a request for information that has no impact on the service quality if not answered or acted upon promptly.

The external availability measurements are done by insert test company name and reported on a monthly basis to insert customer name. The internal processes are measured and reported by the ISP to the insert customer name on a monthly basis. This service includes incident reporting.


TABLE 2 shows the number of violations and associated penalty on a monthly basis.

TABLE 2 Monthly Violations and Associated Penalties

Number of violations | Penalty
1>5 | Insert penalty. Typically a reduction in fees.
5>10 | Insert penalty. Typically a reduction in fees plus some additional compensation and a corrective action plan.
10> | Insert penalty. Typically a reduction in fees plus some additional compensation and a corrective action plan.


As services and technologies change, the SLA may change to reflect the improvements and/or changes. This SLA will be reviewed every six months and updated as necessary. When updates are deemed necessary, the customer will be asked to review and approve the changes.

Other areas that must be defined in an SLA are details on how the measurements are done, what usage limitation the service has with regard to number of concurrent users and so forth, and details on how and who receives reports and how conflicts are arbitrated. Because these topics are unique in each contract, they are not included in the preceding example.
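As a rough illustration of how the availability clause and the monthly penalty table in the template could be evaluated, consider the sketch below. Every threshold and name in it is hypothetical: the two time limits stand in for the template's "insert time period" fields, and the tier boundaries (1 to 4, 5 to 9, 10 or more) are one possible reading of the violation ranges, which an actual SLA must state unambiguously. The sketch also assumes that exceeding the monthly outage total counts as one additional violation.

```python
from datetime import timedelta

# Hypothetical thresholds standing in for the template's
# "insert time period" fields.
MAX_SINGLE_OUTAGE = timedelta(minutes=30)
MAX_MONTHLY_OUTAGE = timedelta(hours=2)


def availability_violations(outages):
    """Count violations of the availability clause for one month of outages."""
    # Each individual outage longer than the limit is one violation.
    violations = sum(1 for outage in outages if outage > MAX_SINGLE_OUTAGE)
    # Assumption: exceeding the monthly total adds one more violation.
    if sum(outages, timedelta()) > MAX_MONTHLY_OUTAGE:
        violations += 1
    return violations


def penalty_tier(violation_count):
    """Map a monthly violation count to a penalty tier.

    Assumed boundaries: 1-4, 5-9, 10 or more.
    """
    if violation_count <= 0:
        return "none"
    if violation_count < 5:
        return "tier 1"
    if violation_count < 10:
        return "tier 2"
    return "tier 3"


month = [timedelta(minutes=45), timedelta(minutes=50), timedelta(minutes=40)]
count = availability_violations(month)
print(count, penalty_tier(count))
```

With these sample outages, three individual outages exceed the 30-minute limit and their sum exceeds two hours, so the month yields four violations and falls in the first tier.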

This SLA is a "short form" SLA that illustrates the essential aspects of the relationship between a consumer and a provider in an application service provider (ASP) context. Internal SLAs, between operations support groups in the IDC, for example, are different and often contain more details and specifications.

The main reason for this difference is that internal SLAs are driven by budget constraints and the business management's view of IT, while external SLAs are driven by revenue, cost and earnings.

The following template provides a general description of an internal SLA, its owners, its approval and review process, and a definition of the terms used in the document. It is another example from nextslm.org, this time of an internal SLA.

TABLE 3 Internal SLA Template

1.0 Statement of Intent

This section states the objectives of the document.

1.1 Approvals

All parties must agree on the SLA. This section contains a list of who approved the SLA.

1.2 Review Dates

This section contains the track record of the SLA reviews.

1.3 Time and Percent Conventions

This section contains the descriptions of what time conventions and metrics are being used.

2.0 About the Service

This section introduces the service addressed by this SLA.

2.1 Description

This section describes the service in detail.

2.2 User Environment

This section describes the architecture and technologies that are used by the consumers of the service.

3.0 About Service Availability

This section introduces the availability concepts used in this SLA.

3.1 Normal Service Availability Schedule

This section describes what is considered normal service availability.

3.2 Scheduled Events That Impact Service Availability

This section describes what scheduled outages are to be expected.

3.3 Non-emergency Enhancements

This section describes the process that inserts enhancements into the infrastructure.

3.4 Change Process

This section describes the complete process of how changes are introduced in the service, including the associated availability impact.

3.5 Requests for New Users

This section describes the provisioning process of new users/customers.

4.0 About Service Measures

This section contains a detailed description of how the service availability is measured and reported.


How an SLA Maps to Key Performance Indicators

TABLE 4 uses the first example SLA and shows, in the right column, what key performance indicators result from the stated commitment. These indicators, in turn, drive what performance data and metrics are collected by the SLM process. A separate article regarding SLM describes how this information interacts with the SLM system. The performance indicators in TABLE 4 drive the internal SLAs and their associated metrics.

FIGURE 1 shows how different internal groups (networking, systems and applications) must commit to certain transaction response times to achieve the promised levels of service to the consumer.

FIGURE 1 Transaction Response Times for Promised Service Levels
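The idea behind FIGURE 1 can be sketched as a simple budget check: the response times that the internal groups commit to must fit within the response time promised to the consumer. All numbers below are hypothetical and are not taken from the figure, and the sketch assumes that the three contributions add up serially.

```python
# Hypothetical numbers: the promised end-to-end response time and the share
# each internal group commits to. None of these values come from FIGURE 1.
PROMISED_RESPONSE_SECONDS = 3.0

internal_commitments = {
    "networking": 0.5,
    "systems": 1.0,
    "applications": 1.2,
}


def remaining_headroom(commitments, promised_seconds):
    """Return the slack left after summing the internal commitments.

    A negative result means the internal SLAs cannot support the external one.
    Assumption: the contributions add up serially.
    """
    return promised_seconds - sum(commitments.values())


headroom = remaining_headroom(internal_commitments, PROMISED_RESPONSE_SECONDS)
print(f"Headroom: {headroom:.1f} seconds")
```

Leaving some headroom between the sum of the internal commitments and the external promise gives the provider a margin before a slow component turns into an SLA violation.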

TABLE 4 SLA Key Performance Indicators

Commitment

Key Performance Indicator

The service name will be available insert percentage of the time from insert normal hours of operation including hours and days of the week. Any individual outage in excess of insert time period or sum of outages exceeding insert time period per month constitutes a violation.

Service Availability as a Percentage of Normal Business hours.

Note–We must measure overall service availability. The maximum threshold is the maximum outage per incident and/or total sum of outage per month.

Insert percentage of service name transactions will exhibit insert value seconds or less response time, defined as the interval from the time the user sends a transaction to the time a visual confirmation of transaction completion is received. Missing the metric for business transactions measured over any business week constitutes a violation.

Percentage of transaction response times more than x seconds.

Note–We must measure transaction times against a threshold of x seconds and measure the number of slower transactions as a percentage of the total.

The IDC Customer Care team will respond to service incidents that affect multiple users within insert time period, resolve the problem within insert time period, and update status every insert time period. Missing any of these metrics on an incident constitutes a violation.

Service Incident (affecting multiple users) Response times, resolution times and status updates.

Note–We must be able to create incident reports that track actions with timestamps. These are measured against the time thresholds.

The IDC Customer Care team will respond to service incidents that affect individual users within insert time period, resolve the problem within insert time period, and update status every insert time period. Missing any of these metrics on an incident constitutes a violation.

Service Incident (affecting single users) Response times, resolution times and status updates.

Note–We must be able to create incident reports that track actions with timestamps. These are measured against the time thresholds.

The IDC Customer Care team will respond to non-critical inquiries within insert time period, deliver an answer within insert time period, and update status every insert time period. Missing any of these metrics on an incident constitutes a violation.

Inquiry response times, answer times and status updates.

Note–We must be able to create incident reports that track actions with timestamps. These are measured against the time thresholds.

The external availability measurements are done by insert test company name and reported on a monthly basis to customer name. The internal processes are measured and reported by the ISP to the customer name on a monthly basis.

This is an example of an external management service that is managed by its own SLA between the IDC provider and the external ISP that supports this commitment to the service consumer.
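The transaction response time indicator in TABLE 4 can be computed along the following lines. This is a minimal sketch: the function names, the sample measurements, and the 2-second threshold are assumptions for illustration, and a real SLM system would draw the response times from its own measurement tools.

```python
def slow_transaction_percentage(response_times, threshold_seconds):
    """Return the percentage of transactions slower than the threshold."""
    if not response_times:
        raise ValueError("no transactions were measured")
    slow = sum(1 for t in response_times if t > threshold_seconds)
    return 100 * slow / len(response_times)


def meets_commitment(response_times, threshold_seconds, required_pct):
    """True when at least required_pct of transactions meet the threshold."""
    fast_pct = 100 - slow_transaction_percentage(response_times, threshold_seconds)
    return fast_pct >= required_pct


# Hypothetical response times, in seconds, measured over one business week.
week = [0.8, 1.2, 2.5, 0.9, 3.4, 1.1, 0.7, 1.0, 1.3, 0.6]
print(slow_transaction_percentage(week, 2.0))
print(meets_commitment(week, 2.0, 95))
```

In this sample, two of the ten transactions exceed the threshold, so 20 percent are slow and a commitment that 95 percent of transactions complete within 2 seconds would be missed for the week.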

