Platform and Domain Configuration
By design, the Sun Fire 15K/12K servers are well suited to running applications that scale horizontally as well as vertically. However, because of their very high capacity and efficient footprint, these servers are best suited to vertically scalable applications with dense consolidation. This type of scalability allows the platform and the application to expand by adding resources in the same domain or across multiple domains. Additionally, the application can take advantage of load balancing through the internal multitasking and threading attributes of the operating system and application. A single instance or a single domain can be made larger, or multiple instances of the application can be installed on a single domain to accommodate additional users and functionality. Server resource management tools (such as Solaris Resource Manager) can be added to control and allocate system resources such as CPUs, processes, and memory.
Configuring Domains for Redundancy
Internally, most of the Sun Fire 15K/12K server components are designed with built-in redundancy and online recoverability. The main reasons for configuring separate domains are fault containment, security isolation, and workload separation. The Sun Fire 15K/12K servers are designed so that the physical location or proximity of domain hardware components is not relevant. This means that CPU/Memory boards (slot 0 boards), or I/O assemblies (slot 1 boards) can be located anywhere on the frame in their respective slots and will still be part of the same domain.
The minimum configuration for any Sun Fire 15K/12K domain is one CPU/Memory board, one gigabyte of memory, an hsPCI assembly, access to the backplane through an expander board, one network-capable PCI card, and a local boot disk subsystem. However, in most production environments, these minimum requirements would produce an unacceptable configuration due to the lack of redundant components. Therefore, we recommend that you configure all mission-critical production domains with redundant expander boards, CPU/Memory boards, and I/O boards.
All mission-critical domains should have a minimum of two CPU/Memory boards, two expander boards, and two I/O boards. Additionally, each mission-critical domain should have a minimum of two network cards, two boot I/O paths, and two data storage HBAs, with the members of each pair installed on separate expanders. Depending on the chosen cluster product, you should also implement dynamic multipathing and failover technologies on both the network and data storage. For example, you might use the Sun StorEdge Traffic Manager software (STMS), VERITAS DMP, or the Hitachi DLM for storage path failover. You might use Sun IP Multipathing (IPMP) or VERITAS Cluster for network interface failover. If the domain is configured with multipathing technologies, such as DMP or STMS, all primary paths should be located on one I/O assembly tray, and the secondary paths should be located on another I/O assembly tray. Having multiple paths for all I/O devices on separate I/O assemblies makes dynamic reconfiguration operations easier to perform.
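The redundancy minimums above lend themselves to a simple automated check. The following Python sketch is only an illustration: the `IoPath` record, the `MINIMUMS` table, the `redundancy_problems` function, and the I/O assembly labels are hypothetical names introduced here, not part of any Sun tool, and the inventory data is assumed to be collected by hand or by a site-specific script.

```python
from dataclasses import dataclass
from typing import Dict, List

# Minimum board counts for a mission-critical domain, as recommended above.
MINIMUMS = {"cpu_memory_boards": 2, "expander_boards": 2, "io_boards": 2}


@dataclass(frozen=True)
class IoPath:
    kind: str         # "network", "boot", or "storage"
    role: str         # "primary" or "secondary"
    io_assembly: str  # hypothetical label of the I/O assembly holding the card


def redundancy_problems(counts: Dict[str, int], paths: List[IoPath]) -> List[str]:
    """Return a list of violations; an empty list means every check passed."""
    problems = []
    # Board counts: at least two of each board type.
    for item, minimum in MINIMUMS.items():
        have = counts.get(item, 0)
        if have < minimum:
            problems.append(f"{item}: have {have}, need at least {minimum}")
    # At least two network cards, two boot paths, and two storage HBAs.
    for kind in ("network", "boot", "storage"):
        have = sum(1 for p in paths if p.kind == kind)
        if have < 2:
            problems.append(f"{kind} paths: have {have}, need at least 2")
    # All primary paths on one I/O assembly, all secondary paths on another.
    primary = {p.io_assembly for p in paths if p.role == "primary"}
    secondary = {p.io_assembly for p in paths if p.role == "secondary"}
    if len(primary) > 1:
        problems.append("primary paths span more than one I/O assembly")
    if len(secondary) > 1:
        problems.append("secondary paths span more than one I/O assembly")
    if primary & secondary:
        problems.append("primary and secondary paths share an I/O assembly")
    return problems
```

Returning a list of messages rather than a single pass/fail value makes it easier to report every gap in a domain's configuration at once.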
Applying Naming Standards
Naming platforms and domains is a very important task, though often overlooked. Domain and platform names should be designed to minimize complexity so that system administrators are not easily confused about which domain they are working on. Also, a naming scheme should be developed so that possible intruders cannot easily identify mission-critical production domains and platforms. Because making name changes after the domains are placed into production can be a major headache, complete this task and establish a naming standard early in the planning phase of the project.
Unlike system controllers, which must be kept up to date with the latest recommended patches, domains are generally managed differently with regard to patches. A well-thought-out patch management strategy is important for all domains, and the strategy you choose might depend on factors such as application requirements, outage windows, and testing. When patching domains, validate that the proposed patch actually fixes the problem you are having, that it does not cause other problems while fixing the first one, and that it does not degrade system performance. Many patch management tools and utilities from Sun Microsystems make this task easier, including patchdiag, Solaris Patch Manager, PatchCheck, PatchPro Expert, signed patches, Live Upgrade, and SunMC Change Manager. Live Upgrade can be used to test new patches on a domain and, if necessary, to fall back quickly to the operating system state that existed before the patch was applied.
Splitting Expander Boards
Inside the Sun Fire 15K/12K servers, expander boards connect the CPU/Memory boards and I/O boards to the Sun Fire interconnect ports on the backplane. The Sun Fire 15K/12K server configurations have the option to "split" expander boards. This means that the two boards connected to the same expander can be assigned to separate domains, so an I/O board can belong to a different domain than the CPU/Memory board installed on that same expander. Because some domains require more I/O than CPU, the CPU/Memory board and its resources could be assigned to Domain A, and the I/O board and its resources could be assigned to Domain B. This gives the Sun Fire 15K/12K servers added flexibility, but that flexibility brings some availability and performance drawbacks. Because a split expander is a physical resource shared between two domains, a failure of this board set affects both domains. Additionally, there is a performance penalty for using a split expander configuration. Fully evaluate the advantages and disadvantages of the added flexibility before using this feature.
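Because a split expander couples the availability of two domains, it can help to list every expander that is currently shared. The sketch below assumes the board-to-domain assignments have already been gathered into a dictionary keyed by (expander number, slot), where slot 0 holds the CPU/Memory board and slot 1 holds the I/O board. The function name and the data layout are hypothetical and do not correspond to any system controller command.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_expanders(assignments: Dict[Tuple[int, int], str]) -> Dict[int, List[str]]:
    """Map each split expander to the sorted list of domains that share it.

    `assignments` maps (expander number, slot) to a domain name,
    with slot 0 = CPU/Memory board and slot 1 = I/O board.
    """
    domains_by_expander = defaultdict(set)
    for (expander, _slot), domain in assignments.items():
        domains_by_expander[expander].add(domain)
    # An expander is split when its two boards belong to different domains.
    return {
        expander: sorted(domains)
        for expander, domains in sorted(domains_by_expander.items())
        if len(domains) > 1
    }


# Illustrative data only: expander 0 is split between domains A and B.
example = {(0, 0): "A", (0, 1): "B", (1, 0): "A", (1, 1): "A"}
shared = split_expanders(example)
```

Any expander that appears in the result is a single point of failure for every domain listed against it.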
Domain Boot Devices
Although it is possible to boot domains over the network or from the DVD drive, we recommend that you do not do so on live production domains. Booting over the network is acceptable for initial installations and troubleshooting, but for mission-critical production domains, you should configure separate mirrored boot devices (RAID 1). Mirrored boot devices should be configured on either the Sun StorEdge S1 or the Sun StorEdge T3 arrays, which are the only currently supported Sun Fire 15K/12K server boot devices. The arrays should be connected with redundant paths on separate I/O boards and I/O assemblies, and should be configured with either Solaris Volume Manager or VERITAS Volume Manager software. The boot arrays should also be installed in separate data center racks with separate power sources for added availability. When mirroring the boot devices, create new devalias names that clearly identify the primary and secondary boot drives at the OBP prompt, for example, "bootdisk" for the primary and "mirrordisk" for the secondary. Be sure to document all procedures for boot disk recovery in an online runbook.
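The placement rules for mirrored boot devices can be checked in the same spirit. This is a minimal sketch under stated assumptions: the `BootDevice` record and its fields are hypothetical, and the I/O board, rack, and power source values are labels that a site would fill in from its own inventory.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BootDevice:
    alias: str         # OBP devalias name, for example "bootdisk" or "mirrordisk"
    io_board: str      # hypothetical label of the I/O board carrying the path
    rack: str          # hypothetical data center rack label
    power_source: str  # hypothetical power feed label


def mirror_placement_problems(primary: BootDevice, secondary: BootDevice) -> List[str]:
    """Return every attribute that the two halves of the boot mirror share."""
    checks = (
        ("alias", "devalias name"),
        ("io_board", "I/O board"),
        ("rack", "rack"),
        ("power_source", "power source"),
    )
    problems = []
    for attr, label in checks:
        value = getattr(primary, attr)
        if value == getattr(secondary, attr):
            problems.append(f"both halves of the mirror use the same {label}: {value}")
    return problems
```

An empty result means the two halves of the mirror have distinct aliases and share no I/O board, rack, or power source.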
We recommend that you install the S1 boot array in the first PCI card slot that is accessed by the OpenBoot PROM (OBP) probe list, which is generally the lowest-numbered I/O board slot in the domain. This helps ensure that the device paths and the configuration in the /etc/path_to_inst file do not change if a subsequent boot with the reconfigure option (boot -r or reboot -- -r) is used after additional components are added to the domain.
The configuration of domain operating system disks varies from site to site. The key thing to remember is to build the operating system configuration so that it can be easily upgraded, uses a minimum number of file systems, and is consistent across domains. Also use technologies such as Solaris Volume Manager or VERITAS Volume Manager, as well as Flash Archive, JumpStart, and Live Upgrade. In general, the simpler the OS disk configuration, the better.
Monitoring the Server
We recommend that you monitor the Sun Fire 15K/12K servers with Sun Management Center (SunMC) software agents. Although SunMC is the recommended tool for first-level monitoring of the Sun Fire 15K/12K server platforms, it should not be considered the only tool needed to meet the requirements of a large-scale enterprise monitoring effort. You could also consider using the Sun SRS Net Connect tool. This tool is a collection of system management services designed to enable you to securely perform hardware monitoring, alarming, system performance monitoring, and trend reporting through Internet access. You can easily integrate SunMC with other enterprise-wide management and monitoring platforms, such as Tivoli and BMC, through the supplied management information base (MIB) files for SNMP. Although it is not a monitoring tool, SunMC Change Manager can also be integrated into the system to track system changes and patches, and to provide resource provisioning capabilities.
The SunMC server component should be installed on a separate Sun Enterprise or Sun Fire class two-processor server with a minimum of two network interfaces. Additionally, this server should be security hardened. For each agent, there is a set of statuses and rules that define the alarm conditions that can be generated and tuned by that agent. This information is documented in the Sun Management Center Supplement for the Sun Fire 15K Systems (Part # 816-2701-10), which is available at http://docs.sun.com.
The following list summarizes the Sun Fire 15K/12K server categories that should be monitored by SunMC:
- Platform configuration reader (environmental platform agent)
- Domain configuration reader (domain configuration agent)
- System controller configuration reader (system controller environment and configuration agent)
- Platform/domain state management module (manage and manually monitor the platform and domain states)
- Dynamic reconfiguration module (manage and manually monitor the device and dynamic attach points and availability)
- System controller monitoring module
Performing Online Maintenance Testing
Regularly test the following technologies and procedures to make sure they are functioning properly. Some of these tests might not be needed, depending on the particular environment. Additionally, some tests might be required only after changes are made to specific system components or to the operating system configuration. Each site should develop and execute a test plan specific to its needs.
- Replacing CPUs online
- Adding CPUs online
- Adding memory online
- Replacing memory online
- Replacing I/O cards online
- Adding I/O cards online (reference supported configurations)
- Performing a live upgrade of the OS (using Live Upgrade)
- Performing hot patching
- Performing system controller failover
- Testing cluster failover
- Deleting and adding boards to domains (if these functions are used on a regular basis by the environment)
- Booting domains over the network from the system controller or from a JumpStart server
- Booting domains and system controllers from secondary mirrored boot devices
- Testing IPMP, DMP, DLM, or STMS failover (whichever applies)