Aside from providing the proper information to compare performance against the SLOs, the SLM solution must also include information and/or tools that facilitate troubleshooting, capacity planning, and modeling. Modeling is particularly important in the fast-changing Internet age because time to market is paramount: the implementation of new technologies and services leaves no time to wait for trending data to accumulate. Troubleshooting and trending solutions are commonly available in the more traditional system and network management tools; a good modeling tool is much harder to find.
To understand some of the technology challenges better, the following paragraphs provide a more technical and detailed description of managed components in a data center infrastructure.
The technology components in a service provider's data center architecture consist of five platform layers as defined by the SunPS 3DM program, somewhat expanded to reflect a data center environment rather than a development environment:
Business application or service
Application infrastructure (application servers, Lightweight Directory Access Protocol (LDAP), relational database management system (RDBMS), and so forth)
Computing and storage platform (Sun Fire 15K server, Solaris operating environment, disks, storage area networks (SANs), and so forth)
Network infrastructure (network devices such as routers and switches)
Facilities infrastructure (air conditioning, power, access, and so forth)
To instrument this infrastructure, small programs must be deployed that monitor specific components, collect information, and sometimes even act on certain events. These agents or element managers run at each platform layer. On network devices, we often see remote monitoring (RMON) probes and Simple Network Management Protocol (SNMP) MIBs. On systems, we see vendor-specific agents like the Sun Management Center software that not only monitor, but often also help manage, the system. At the application infrastructure layer, we see application agents like BMC's Knowledge Modules and sometimes custom agents (for custom applications). Finally, at the services layer, we find aggregator tools that can test the health of the service and correlate events from the other layers; an example of such a tool is MicroMuse's Netcool, but custom agents are common here too.
All of this information is typically aggregated in stages, often first locally, then network wide, and ultimately service wide. A network operations center (NOC) monitors all of the agents and supports help desk and customer care activities in addition to its normal operational tasks.
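To illustrate what such an element manager does at any one layer, the following minimal Python sketch polls measured values, compares each against an SLO-derived threshold, and queues an event for an aggregator when a threshold is crossed. All names, metrics, and values here are invented for illustration and do not correspond to any of the products named above.

```python
# A minimal element-manager sketch: compare measured values against
# SLO-derived thresholds and queue events for a higher-level aggregator.
# Metrics, layers, and numbers are fabricated for illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    layer: str       # "network", "system", "application", or "service"
    metric: str
    value: float
    threshold: float

def poll_and_check(layer: str, metric: str, value: float,
                   threshold: float, events: List[Event]) -> None:
    """Queue an event when a measured value exceeds its threshold."""
    if value > threshold:
        events.append(Event(layer, metric, value, threshold))

# One illustrative poll cycle with fabricated readings.
events: List[Event] = []
poll_and_check("network",     "round_trip_ms",  120.0,  100.0, events)
poll_and_check("system",      "db_io_ms",        35.0,   50.0, events)
poll_and_check("application", "txn_ms",        9500.0, 8000.0, events)

# The NOC-facing aggregator would consume these events; here we print them.
for e in events:
    print(f"ALERT {e.layer}/{e.metric}: {e.value} > {e.threshold}")
```

In a real deployment the readings would come from SNMP polls or vendor agent APIs, and the events would be forwarded over the network rather than printed, but the threshold-comparison core is the same.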
FIGURE 3 is a graphical representation of such an SLM infrastructure.
FIGURE 3 SLM Infrastructure
The SLA definitions play an integral part in the process. Every monitoring agent (at the network, system, and application layers) measures against metrics derived from the SLOs to determine whether the components it watches are operating within the correct parameters.
An example is an SLA requirement stating that a certain transaction can take no longer than 10 seconds before its result is presented to the consumer; the complete path, therefore, has a 10-second budget. This requirement drives the service level tester to test for this behavior, the application monitor to ensure it gets a return from the system within time limits that allow the SLO to be met, and the system monitor to verify that all related I/O responses from the database happen in time and that the network can transport all data from the back-end system to the consumer at sufficient speed.
FIGURE 4 shows an example of how these elements determine how much time can be spent in each area, which in turn drives what thresholds are set and what metrics are collected.
FIGURE 4 Time Scale of Thresholds and Metrics of a Service Level Objective
With this picture of what is involved in managing service levels, we can now consider the benefits of making the investment to set up this infrastructure.