Enterprise Server Tools
Enterprise server tools available on the Sun platform and supported by the Solaris™ Operating Environment (Solaris OE) provide easier system administration and better use of system resources. In this section, we introduce the Sun™ Management Center (Sun MC) software and the Solaris™ Resource Manager (Solaris RM) software tools, focusing on their use in HPC environments. This section contains the following:
"Sun™ Management Center (Sun MC)"
"Solaris™ Resource Manager (Solaris RM)"
Sun™ Management Center (Sun MC)
The Sun Management Center (Sun MC) software, formerly Sun Enterprise SyMON™, is an enterprise system management tool that supports the entire line of Sun servers and desktops. It provides a single point of management for all networked Sun systems at a site.
Sun MC Features
The following list summarizes the most relevant features that apply to HPC environments:
Manages thousands of Sun systems
Provides a common GUI look and feel with the Java™ console
Integrates with tools from leading third-party vendors to address heterogeneous environments
Provides alarm management and predictive failure analysis
Identifies faults before a system is affected, via comprehensive online hardware diagnostics testing
Provides a powerful, easy-to-use interface for developing custom modules using the GUI module builder
Provides filtering capabilities that help pinpoint problems quickly, even in systems with thousands of objects or nodes
Allows dynamic reconfiguration and domain management through secure management controls
Supports new UltraSPARC III processor-based systems
Sun MC Benefits and Recommendations
Typically, HPC systems require clustering of several nodes to achieve levels of performance required by scientific and engineering communities. The need for system monitoring and management increases with the number of nodes that form a compute cluster. This complexity makes a typical HPC site an excellent candidate for deploying tools such as Sun MC software.
As the system administrator, you would have a physical and logical view of the entire cluster or clusters on a web-based interface, accessible from a console desktop that is geographically located anywhere on the network.
We recommend that you integrate Sun MC software with a job management system (JMS). This combination allows you to monitor and control queues and jobs from within a Sun MC console. For example, a site system administrator can easily:
check the status of jobs and resources
receive alarms in case of overload situations or hard limit excesses
suspend and resume queues as well as jobs
Another product that is useful to integrate with Sun MC software is the Load Sharing Facility (LSF) from Platform Computing Inc. This product supports SNMP and is easily integrated with Sun MC software.
The Sun Grid Engine (SGE) job management system does not currently support SNMP. It does provide a partial integration with Sun MC software, using the Tcl scripting language interface at the agent level. This capability can make the administration of compute-intensive tasks easier to perform, particularly if Sun MC software is already deployed at the site.
The following figure shows a high-level diagram of how job management systems (JMS) integrate with Sun MC software.
FIGURE 1 Integrating Job Management Systems With Sun MC Software
Solaris™ Resource Manager (Solaris RM)
The Solaris Resource Manager (Solaris RM) is a software product that is an extension to the Solaris OE. Solaris RM software enhances resource availability for users, groups, and applications. It provides the ability to reserve and control major system resources such as CPU, virtual memory, and number of processes. Solaris RM software controls resource usage based solely on user ID.
Capabilities provided by Solaris RM software are regulated by a resource policy that is established according to a site's requirements. Users and applications receive a more consistent level of service on a single server, resulting in significant cost savings and greater administrative flexibility.
The Solaris RM software does not notify system administrators about usage limits; it provides resource usage reports and guarantees resources to key applications and users. It makes the performance of an application more predictable, and it ensures that system response times are not adversely affected by other tasks on a system.
Solaris RM Features
The following is a list of features that apply to HPC environments:
Reserves and controls major system resources such as CPU, virtual memory, and number of processes.
Provides users and applications with a more consistent level of service on a single compute server.
Guarantees resources to key applications and users.
Provides more predictability of application performance and ensures that system response times are not adversely affected by other tasks on a system.
Provides resource usage reports.
Solaris RM Benefits and Recommendations
Most HPC sites use a job management system (JMS) product to monitor and schedule jobs submitted by their user community, according to resource usage and site configuration and policies.
Most of the functionality provided by the Solaris Resource Manager tool is already included in a job management system. For example, Solaris RM software supports hard limits on resources, where a process fails if it exceeds its resource limits. Likewise, most popular JMS products such as LSF and SGE also support hard limits.
In some cases, combining Solaris RM software with a JMS product provides additional functionality and efficiency. One example is assigning shares to workloads and users. Distributing shares allows for fairer scheduling, and it prevents jobs from using more than their allotted shares. The following figure illustrates a simple example that includes a hierarchy with two layers. The first layer assigns CPU shares to three workloads as follows:
Compute-intensive workload: 60 shares
Application development workload: 30 shares
Administration tasks workload: 10 shares
FIGURE 2 Solaris RM Software With a Single Node
The second hierarchical layer assigns CPU shares to users within a workload. The example in the figure illustrates Solaris RM software in an HPC environment with a single node. The same concept applies to a cluster of nodes when Solaris RM software is deployed on every node, which ensures that applications or jobs running with a specific user ID receive only the resources they are allowed, according to a predetermined policy.
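To make the two-layer share hierarchy concrete, the following Python sketch computes the fraction of CPU each user would be entitled to when every workload and user is active. It is an illustration of the arithmetic only, not Solaris RM code or its configuration syntax. The workload shares (60, 30, 10) come from the example above; the user names, the per-user shares, and the cpu_fractions function are hypothetical. The sketch also ignores usage history and idle workloads, which a real share-based scheduler would likely take into account.

```python
def cpu_fractions(children, parent_fraction=1.0, prefix=""):
    """Return {path: CPU fraction} for every leaf in a share hierarchy.

    children maps a name to (shares, sub_children). A node's fraction is
    its parent's fraction multiplied by its shares divided by the total
    shares of its siblings.
    """
    total = sum(shares for shares, _ in children.values())
    result = {}
    for name, (shares, sub_children) in children.items():
        fraction = parent_fraction * shares / total
        path = prefix + name
        if sub_children:
            # Second layer: divide this workload's fraction among its users.
            result.update(cpu_fractions(sub_children, fraction, path + "/"))
        else:
            result[path] = fraction
    return result


# First layer: workload shares from the example above.
# Second layer: user names and shares are hypothetical.
hierarchy = {
    "compute": (60, {"user_a": (3, {}), "user_b": (1, {})}),
    "development": (30, {"user_c": (1, {}), "user_d": (1, {})}),
    "administration": (10, {}),
}

for path, fraction in cpu_fractions(hierarchy).items():
    print(f"{path}: {fraction:.0%} of CPU")
```

Under these assumed user shares, user_a is entitled to three quarters of the compute-intensive workload's 60 percent, or 45 percent of the CPU. Because shares are relative weights rather than absolute percentages, adding a user to a workload changes the entitlement of that workload's other users but leaves the other workloads unaffected.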