Implementation Requirements
This section describes the additional requirements that shaped the actual management and operations (M&O) architecture. These requirements are in addition to the inherent requirements described in the previous sections.
Management At All Layers
TABLE 3 describes the aspects the M&O architecture manages at each layer. Note that the layers follow the execution architecture cube described in "IT Management Framework" on page 6. The execution architecture also implies that this requirement must be considered at all tiers (client through resource). The developers did not make this distinction explicit because the complete managed architecture required the same visibility at all implemented tiers.
Facilities management is beyond the scope of this project. It is, however, an essential component of IT management.
TABLE 3 Managed Aspects By Layer

| Layer | iForce Implementation | Fault | Configuration | Accounting | Performance | Security |
|---|---|---|---|---|---|---|
| Business application | iPlanet Message Server 5.10 | Yes | Next phase | Next phase | Yes | Next phase |
| | Mail MultiPlexer (MMP) | Yes | Next phase | Yes | Next phase | Next phase |
| | Mail Transfer Agent (MTA) | Yes | Next phase | Yes | Next phase | Next phase |
| Application infrastructure | iPlanet Directory Server 4.13 | Yes | Next phase | Yes | Next phase | Next phase |
| | DNS | Yes | Next phase | Yes | Next phase | Next phase |
| | Firewall | Yes | Next phase | Yes | Next phase | Next phase |
| | NTP | Yes | Next phase | Next phase | Next phase | Next phase |
| Computing and storage platform | Netra T1 server | Yes | Yes | Yes | Next phase | Next phase |
| | Netra 1405 server | Yes | Yes | Yes | Next phase | Next phase |
| | Sun Fire 6800 server | Yes | Yes | Yes | Next phase | Next phase |
| | Netra X1 server | Yes | Yes | Yes | Next phase | Next phase |
| | Sun StorEdge T3 array | Yes | Yes | Yes | Next phase | Next phase |
| | Sun Enterprise A1000 server | Yes | Yes | Yes | Next phase | Next phase |
| | Ancor Switches | Yes | Yes | Yes | Next phase | Next phase |
| | Sun Cluster 3.0 software | Yes | Yes | Yes | Next phase | Next phase |
| | Solaris OE | Yes | Yes | Yes | Next phase | Next phase |
| Facilities infrastructure | iForce lab | N/A | N/A | N/A | N/A | N/A |
Yes: Management capability is included in current architecture.
Next phase: Management capability is to be included in a subsequent architecture.
N/A: Management capability is considered beyond the scope of this project.
Security management is planned for a subsequent phase. At this time, the IDC Mail and Messaging Architecture provides security through a firewall complex at the entrance. A security assessment will be scheduled to determine gaps and next steps. However, the tools currently deployed can facilitate security event management.
Accounting is currently under consideration because of its importance in an ISP/ASP environment. However, its scope is well beyond the current effort and is still being defined.
The capacity planning aspect is an extension of performance management and affects multiple layers. It requires a complete process and the inclusion of variables that help anticipate future needs. The iFRC has done extensive sizing tests using the tools deployed in the M&O architecture. The results are published in the Server Sizing Guide related to the IDC Mail and Messaging Architecture project.
Performance Data Collection, Metrics and Thresholds
This section contains more details on the selected measurements and thresholds for performance tuning and capacity planning that are the main objectives of the IDC Mail and Messaging Reference Architecture project.
This section describes the baseline performance monitoring metrics as defined in the SunPS performance tuning and capacity planning methodology. Based on this information, the developers can identify and locate potential problems. In addition, it provides the basic data needed to start the capacity planning process.
Every system's behavior depends on the application it supports. Therefore, you should do a detailed requirements analysis for each system and application pair. The information that follows, however, is a good baseline set of requirements.
The actual implementation of the concepts described in the preceding section was influenced by pragmatic constraints. The following were the major constraints considered:
- Cost
- Ease of deployment and availability
- Installation and configuration time
At all times, the important consideration was the ability of the tools to perform the required tasks. The following section lists those requirements.
Performance information is summarized in the following categories:
- CPU
- Disk
- Memory
- Network
- NFS (if applicable)
- Workloads (if possible)
The format and organization of the information is identical for all six categories listed and is presented in the subsequent sections. Information for each category is tabulated under three columns:
Parameter
Lists the parameters considered in the monitoring requirements for each of the six categories listed above.
Description
Describes the parameter.
Expected Value
Lists the acceptable value for each of the above parameters. If there is no threshold of acceptance for the parameter, this column will indicate "Relative or Informational".
NOTE
Some of the expected values listed may need adjustment based on individual system characteristics. (For example, CPU utilization must be normalized for the number of CPUs, and the percentage of disk space used is relative to the total available space.)
The following sections list the various requirements for performance monitoring.
CPU Metrics
To determine system performance health, monitor the CPU parameters listed in TABLE 4. These are the CPU-related metrics the monitoring tool must be able to collect.
TABLE 4 CPU Metrics

| Parameter | Description | Expected Value |
|---|---|---|
| Percent CPU utilization | Total for all CPUs, with any utilization imbalance among CPUs identified. | <80 percent per CPU |
| User CPU | Percent of CPU power spent running user programs, libraries, and so forth; should account for most of the CPU usage. | <90 percent |
| System CPU | Percent of CPU power spent executing system, kernel, and administrative code (for example, device drivers and I/O handling). | < User CPU |
| Run queue | Number of processes waiting to run on the CPU. UNIX uses the run queue to determine which process gets the CPU next. If the run queue exceeds two processes per processor, it may indicate a bottleneck. | <2 × number of CPUs |
| Wait for I/O | Percent of time the CPU has to wait for the disk to respond. High values could indicate a disk bottleneck (if disk busy and service times are high) or, otherwise, a controller bottleneck. | <30 percent |
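As a concrete illustration of how a monitoring tool might apply the TABLE 4 thresholds, the following Python sketch evaluates one sample of collected metrics. The `check_cpu_metrics` helper, its field names, and the sample values are hypothetical, not part of any tool described in this document.

```python
# Sketch: apply the TABLE 4 CPU thresholds to one sample of metrics.
# The helper and its field names are illustrative assumptions.

def check_cpu_metrics(sample, ncpus):
    """Return the names of TABLE 4 CPU metrics that exceed their thresholds."""
    alerts = []
    if sample["util_pct"] / ncpus >= 80:          # <80 percent per CPU
        alerts.append("cpu_utilization")
    if sample["usr_pct"] >= 90:                   # <90 percent user CPU
        alerts.append("user_cpu")
    if sample["sys_pct"] >= sample["usr_pct"]:    # system CPU < user CPU
        alerts.append("system_cpu")
    if sample["runq_len"] >= 2 * ncpus:           # run queue < 2 x number of CPUs
        alerts.append("run_queue")
    if sample["wio_pct"] >= 30:                   # wait for I/O <30 percent
        alerts.append("wait_io")
    return alerts

# A 4-CPU server with a long run queue trips only the run-queue check.
sample = {"util_pct": 240, "usr_pct": 45, "sys_pct": 12,
          "runq_len": 9, "wio_pct": 5}
print(check_cpu_metrics(sample, ncpus=4))  # ['run_queue']
```

Note that the per-CPU normalization in the first check follows the NOTE above: the total utilization is divided by the number of CPUs before comparison.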
I/O Metrics
To determine system performance health, monitor the I/O-related parameters listed in TABLE 5. These are the I/O-related metrics the monitoring tool must be able to collect.
TABLE 5 I/O Metrics

| Parameter | Description | Expected Value |
|---|---|---|
| Low activity disks | It is important to balance the load on disks. This list identifies disks with low or no activity that may be available for load balancing. | |
| Disk space used by file system | Indicates when file systems are running short of disk space. | <85 percent |
| Inode usage | Shows when the space allocated for i-node entries is running short. | <20 percent |
| Percent busy (top 10) | Indicates the percent of time the disk is actually doing work. High values may indicate a disk or controller bottleneck. | <35 percent |
| Average service time | Indicates the time it takes the disk to complete a request. High values may indicate a disk or controller bottleneck. Note: Some lightly used disks may exhibit long service times. This is a well-known anomaly and should be taken into consideration during performance analysis. | <30 ms |
| Queue length | Number of jobs waiting to be processed by the I/O system. | <1 |
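A monitoring tool could flag disks against the TABLE 5 busy and service-time thresholds roughly as follows. The `flag_disks` helper and the device records are invented for illustration; real records would come from `iostat -x` style output.

```python
# Sketch: flag disks whose percent-busy or average service time breaks
# the TABLE 5 thresholds. The helper and sample records are illustrative.

def flag_disks(records):
    """records: iterable of (device, pct_busy, svc_time_ms) tuples."""
    hot = []
    for device, pct_busy, svc_ms in records:
        if pct_busy >= 35 or svc_ms >= 30:   # TABLE 5: <35 percent, <30 ms
            hot.append(device)
    return hot

records = [("sd0", 12.0, 8.5), ("sd3", 61.0, 44.2), ("sd5", 2.0, 55.0)]
print(flag_disks(records))  # ['sd3', 'sd5']
```

Note that `sd5` is flagged despite being nearly idle: this is exactly the lightly-used-disk anomaly called out in the table note, so a flagged disk still needs human interpretation.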
Memory Metrics
The memory metrics are divided into four subcategories:
- Paging
- Buffers
- Swap
- Kernel
Paging
Paging moves data or individual pages of a process between disk and memory. A high page-out rate (move to disk) could be due to heavy writing of data to disk and does not necessarily indicate a memory shortage. However, it is an important metric to collect. TABLE 6 lists the memory-paging metrics the monitoring tool must be able to collect.
TABLE 6 Memory Paging Metrics

| Parameter | Description | Expected Value |
|---|---|---|
| Scan rate | This parameter is a clear indication of memory shortage. A value higher than 320 per second may mean that processes do not have enough memory in which to run. | <200 pages per second |
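The scan rate is reported in the `sr` column of `vmstat` output. The following sketch shows one way a collector might pull that column out of the text; the sample output is invented and heavily abbreviated, and a real parser would handle the full header.

```python
# Illustrative parsing of abbreviated vmstat-style output to extract the
# scan-rate ("sr") column used in the TABLE 6 check. The sample is made up.

VMSTAT_SAMPLE = """\
 kthr      memory            page
 r b w   swap  free  re  mf pi po fr de sr
 0 0 0 123456 78901   3  10  0  0  0  0 350
"""

def scan_rate(vmstat_text):
    """Return the value under the 'sr' header of the last sample line."""
    header, data = vmstat_text.strip().splitlines()[-2:]
    cols = header.split()
    return int(data.split()[cols.index("sr")])

sr = scan_rate(VMSTAT_SAMPLE)
print(sr, sr >= 200)  # 350 True -> sustained scanning above the threshold
```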
Buffers
In Solaris OE version 2 and above, cache buffers are used to cache inode, indirect block, and cylinder group information. The default value (a percentage of physical memory) for buffers is generally considered too high for systems with large memory and can be reduced if the hit rates warrant it. TABLE 7 lists the buffer-related metrics the monitoring tool must be able to collect.
TABLE 7 Buffer Metrics

| Parameter | Description | Expected Value |
|---|---|---|
| Percent write cache | System write percentage that is cached in buffers (instead of written directly to disk). | >50 percent |
| Percent read cache | System read percentage that comes from cache buffers (instead of from disk). | 100 percent |
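The two cache percentages in TABLE 7 can be derived from logical versus physical transfer counters, in the style of `sar -b`. The helper below and its counter values are invented for illustration.

```python
# Sketch: compute the TABLE 7 read/write cache percentages from
# logical vs. physical transfer counters. The numbers are invented.

def cache_hit_pct(logical, physical):
    """Percent of logical requests satisfied from the buffer cache."""
    if logical == 0:
        return 100.0
    return 100.0 * (logical - physical) / logical

read_hit = cache_hit_pct(logical=5000, physical=250)   # read cache percent
write_hit = cache_hit_pct(logical=800, physical=300)   # write cache percent
print(read_hit, write_hit)  # 95.0 62.5
```

In this invented sample the write cache (62.5 percent) meets the >50 percent target, while the read cache (95 percent) falls short of the 100 percent expectation and might justify keeping the buffer allocation at its current size.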
Swap Areas
Swap areas should be distributed across many fast disks. Avoid placing them on disks used for OLTP databases. TABLE 8 lists the swap-related metrics the monitoring tool must be able to collect.
TABLE 8 Swap Area Metrics

| Parameter | Description | Expected Value |
|---|---|---|
| Swap rate | Lack of memory can cause a whole process to move from memory to disk, called a swap-out. This should be very infrequent. Swap-ins indicate the recall of a swapped-out process, that is, disk thrashing. | 1 per day |
| Available swap | Low numbers indicate a memory shortage and can cause processes to thrash to disk rather than perform their tasks. | 32 Mbytes |
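Available swap can be read from a `swap -s` style summary. The sketch below parses a simplified, illustrative summary line (the real command prints additional fields) and compares the result to the 32-Mbyte floor in TABLE 8.

```python
# Sketch: parse a simplified swap-summary line and compare available
# swap to the TABLE 8 floor. The sample line is illustrative only.

SWAP_S = "total: 114232k bytes allocated + 13528k reserved, 28148k available"

def available_swap_mb(swap_text):
    """Extract the '<n>k available' figure and convert it to Mbytes."""
    avail_kb = int(swap_text.rsplit(None, 2)[-2].rstrip("k"))
    return avail_kb / 1024

avail = available_swap_mb(SWAP_S)
print(round(avail, 1), avail < 32)  # 27.5 True -> below the 32-Mbyte floor
```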
Kernel
This section lists the essential metrics to collect on the performance of processes in the kernel, as they may indicate memory-related issues. TABLE 9 lists the metrics the monitoring tool must be able to collect.
TABLE 9 Kernel Metrics

| Parameter | Description | Expected Value |
|---|---|---|
| Memory failures | Memory failures indicate that a permanent, large kernel memory allocation failed. This metric is highly critical. | 0 |
| File access | This category reflects the amount of activity spent locating files through directory block reads, i-node searches, and file system path searches. It is good for establishing baselines. | Relative |
Network Metrics
While these requirements do not focus on the network, Sun servers do provide some general statistics, derived from the network cards, which can indicate performance issues.
TABLE 10 lists the network metrics the monitoring tool must be able to collect.
TABLE 10 Network Metrics

| Parameter | Description | Expected Value |
|---|---|---|
| Collisions | Used as a measure of network congestion. Not relevant in switched segments; high values usually indicate misconfigured interfaces. | <15 percent |
| Errors in/out | This statistic usually indicates hardware or driver problems. | <0.02 percent |
| Bytes in/out | Used as a baseline. | Relative |
| No. of connections | (For example, ftp, rlogin, and telnet.) Used as a baseline. | Relative |
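The collision percentage in TABLE 10 is conventionally computed from interface counters as collisions relative to output packets, in the style of `netstat -i`. The helper and the counter values below are invented for illustration.

```python
# Sketch: TABLE 10 collision percentage from netstat-style interface
# counters (collisions relative to output packets). Numbers are invented.

def collision_pct(opkts, collisions):
    """Collisions as a percentage of output packets."""
    return 100.0 * collisions / opkts if opkts else 0.0

pct = collision_pct(opkts=20000, collisions=4200)
print(pct, pct >= 15)  # 21.0 True -> congested (non-switched) segment
```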
NFS Metrics
NFS is often the cause of performance issues. On systems that run NFS, the parameters in this category should be monitored because they indicate potential performance issues. TABLE 11 lists the NFS metrics the monitoring tool must be able to collect.
TABLE 11 NFS Metrics

| Parameter | Description | Expected Value |
|---|---|---|
| Calls | Used as a baseline. | Relative |
| Bad calls | Used as a baseline. | Relative |
Workloads
This category of metrics characterizes workloads by the extent to which they use the general resources of the server. Mainly, this category is used for capacity planning, but it can also be used to see where, from a business perspective, issues might originate. Defining a workload is a way of grouping resource usage to create a logical unit of work. For example, in one company the number of users in the sales department may be increasing threefold, while marketing and finance, which also access the same server, may be expected to grow only twofold. In this case, you can define three workloads, each including users from a different department. In this way, resource usage by each department can be tracked and the appropriate growth factor can be applied.
Or, it could be that a company that is running two applications on a single server is planning to increase the total number of users accessing application A by a factor of two and the total number of users accessing application B by a factor of three. To assess resource usage by each application, the processes associated with each application are included in a separate workload, and the appropriate growth factor can then be applied to each.
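The two scenarios above reduce to the same mechanism: map per-process resource samples to named workloads, then apply each workload's growth factor. The workload definitions, process names, and numbers in this sketch are invented for illustration.

```python
# Sketch of the workload idea: group per-process CPU samples into named
# workloads, then apply a per-workload growth factor for capacity planning.
# Workload definitions, process names, and factors are invented.

WORKLOADS = {"sales": {"ora_sales"}, "marketing": {"ora_mktg"},
             "finance": {"ora_fin"}}
GROWTH = {"sales": 3.0, "marketing": 2.0, "finance": 2.0}  # expected growth

def project_cpu(samples):
    """samples: (process_name, cpu_pct) pairs -> projected CPU per workload."""
    usage = {name: 0.0 for name in WORKLOADS}
    for proc, cpu in samples:
        for name, procs in WORKLOADS.items():
            if proc in procs:
                usage[name] += cpu
    return {name: usage[name] * GROWTH[name] for name in usage}

samples = [("ora_sales", 10.0), ("ora_sales", 5.0),
           ("ora_mktg", 8.0), ("ora_fin", 4.0)]
print(project_cpu(samples))  # {'sales': 45.0, 'marketing': 16.0, 'finance': 8.0}
```

The projection makes the planning question concrete: sales at threefold growth would dominate future CPU demand even though it is not the largest consumer today.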
TABLE 12 lists the workload-related metrics the monitoring tool must be able to define and collect.
TABLE 12 Workload Metrics

| Parameter | Description | Expected Value |
|---|---|---|
| Percent CPU per workload | Used as a baseline. | Relative |
| Physical I/O per workload | Used as a baseline. | Relative |
| No. of processes per workload | Used as a baseline. | Relative |
| Resident set size per workload | Memory occupied by each workload. Used as a baseline. | Relative |
Tool Selection
This document has described the SLM management concepts, implementation requirements, and management metrics. TABLE 13 shows the tools that were selected as a result. Their implementation will be discussed in the next article.
TABLE 13 Tools Distribution by Layer

| Layer | iForce Implementation | Fault | Config | Accounting | Performance | Security |
|---|---|---|---|---|---|---|
| Business application | iPlanet Message Server 5.1 | SunMC/Micromuse | TBD | TBD | TeamQuest/Micromuse | TBD |
| | MMP | SunMC/Micromuse | TBD | TBD | TeamQuest/Micromuse | TBD |
| | MTA | SunMC/Micromuse | TBD | TBD | TeamQuest/Micromuse | TBD |
| Application infrastructure | iPlanet Directory Server 4.13 | Micromuse | TBD | TBD | TeamQuest/Micromuse | TBD |
| | DNS | Micromuse | TBD | TBD | TeamQuest/Micromuse | TBD |
| | Firewall | Micromuse | TBD | TBD | TeamQuest | TBD |
| | NTP | Micromuse | TBD | TBD | TeamQuest/Micromuse | TBD |
| Computing and storage platform | Netra T1 server | SunMC | SunMC | TBD | TeamQuest | TBD |
| | Netra 1405 server | SunMC | SunMC | TBD | TeamQuest | TBD |
| | Sun Fire 6800 server | SunMC | SunMC | TBD | TeamQuest | TBD |
| | Netra X1 server | SunMC | SunMC | TBD | TeamQuest | TBD |
| | Sun StorEdge T3 array | SunMC | SunMC | TBD | TeamQuest | TBD |
| | Sun Enterprise A1000 | SunMC | N/A | TBD | TeamQuest | TBD |
| | Sun SAN Switches | SunMC | TBD | TBD | TeamQuest | TBD |
| | Sun Cluster 3.0 software | SunMC | SunMC | TBD | TeamQuest | TBD |
| | Solaris OE | SunMC | SunMC | TBD | TeamQuest | TBD |
| Network infrastructure | Foundry NetIron | Foundry | Foundry | TBD | TeamQuest | TBD |
| | Foundry ServerIron | Foundry | Foundry | TBD | TeamQuest | TBD |
| | Foundry BigIron | Foundry | Foundry | TBD | TeamQuest | TBD |
| Facilities infrastructure | iForce Lab | N/A | N/A | N/A | N/A | N/A |