SMS 1.4 new features improve the availability, serviceability, diagnosability, and recovery characteristics of Sun Fire 15K/12K systems.
The new features reduce the time required to diagnose and recover from faults. They:
Automatically diagnose causes of domain faults
Enhance system restoration capability by removing faulty resources from the system configuration
Provide actionable repair information
Enable more efficient remote support capabilities
The following paragraphs describe the new features in the SMS 1.4 software and how they relate to improving availability for these systems.
Error Event Reporting
Enhancements were made to the Solaris OE to improve the availability of the domain. The SMS 1.4 software error event feature reports events in compliance with the changes to Solaris OE. For more information about Solaris OE availability features, refer to the Sun BluePrints OnLine article "Solaris Operating System Availability Features."
The SMS error handling is processed by the error and fault handling daemon (EFHD). This daemon collects all relevant error information and creates the fault and list events in addition to the error events. A fault event represents a diagnosed fault that caused one or more error events. All fault events are encapsulated into one list event.
If a single fault event is present, the diagnosis is unambiguous. However, if more than one fault event exists, any of the faults could be the cause of the errors.
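The relationship among error, fault, and list events can be sketched in Python as follows. This is a minimal illustrative model, not the SMS data structures. The class names, the fields, the is_unambiguous helper, and the error description string are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ErrorEvent:
    """A raw hardware error observation (hypothetical field)."""
    description: str


@dataclass
class FaultEvent:
    """A diagnosed fault that explains one or more error events."""
    component: str
    errors: List[ErrorEvent] = field(default_factory=list)


@dataclass
class ListEvent:
    """Container for every fault event produced by one diagnosis."""
    faults: List[FaultEvent] = field(default_factory=list)

    def is_unambiguous(self) -> bool:
        # Exactly one fault event means the diagnosis names a single cause.
        return len(self.faults) == 1


# Hypothetical error text; the component names are placeholders.
errors = [ErrorEvent("hardware error captured at domain stop")]
single = ListEvent([FaultEvent("sb11", errors)])
multiple = ListEvent([FaultEvent("sb11", errors), FaultEvent("ex11", errors)])
```

In this sketch, single.is_unambiguous() is true, while multiple holds two candidate faults and therefore calls for further analysis.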
The SMS event reporting daemon (ERD) is responsible for sending the events to the message log and other possible reporting channels, such as email, Sun Management Center software, and Sun Remote Services (SRS) Net Connect.
The SMS event log access daemon (ELAD) records the events and provides an interface that is used by the SMS showlogs command to view the event log.
When certain hardware errors occur in a domain, the system controller (SC) performs the diagnosis and domain recovery steps. Automatic diagnosis (AD) consists of three diagnostic engines (DEs):
The SMS DE diagnoses hardware errors associated with the domain stop (DStop).
The Solaris DE identifies nonfatal domain hardware errors and reports them to the system controller.
The POST DE identifies any hardware test failures that occur when the power-on self-test (POST) is run to bring up the domain.
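The division of work among the three diagnostic engines can be summarized as a simple lookup. The following Python sketch only restates the list above. The ErrorKind enumeration and the engine_for function are hypothetical names, not part of SMS.

```python
from enum import Enum


class ErrorKind(Enum):
    """Categories of hardware error described above (names are illustrative)."""
    DOMAIN_STOP = "domain stop (DStop)"
    NONFATAL_DOMAIN_ERROR = "nonfatal domain hardware error"
    POST_TEST_FAILURE = "POST test failure"


# Which diagnostic engine handles which category of error.
DIAGNOSTIC_ENGINE = {
    ErrorKind.DOMAIN_STOP: "SMS DE",
    ErrorKind.NONFATAL_DOMAIN_ERROR: "Solaris DE",
    ErrorKind.POST_TEST_FAILURE: "POST DE",
}


def engine_for(kind: ErrorKind) -> str:
    """Return the name of the diagnostic engine for a given error category."""
    return DIAGNOSTIC_ENGINE[kind]
```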
By default, the AD process is enabled. The SMS environment flag DISABLE_AUTO_DIAGNOSIS can be used to disable the AD process.
The following sections describe the diagnosis and recovery steps that occur for the hardware errors identified by each diagnostic engine.
SMS Diagnostic Engine
FIGURE 1 describes the flow of the automatic diagnosis and automatic recovery process.
FIGURE 1 Automatic Diagnosis and Recovery Process for Hardware Errors With Domain DStop
Hardware errors involving CPU boards, processors, I/O controllers, and memory banks are detected, and the domain is stopped (DStop) by the SC. A dump file is generated whenever DStop occurs.
SMS DE determines the failure based on the errors captured in the DStop dump file. The DE identifies one or more components that are responsible for the errors.
Auto-diagnosis list events are reported by the ERD to the configured reporting channels, such as the message log and email. They are also recorded in the event log by the ELAD.
The SMS DE records the diagnosed fault in each of the components by updating the component health system (CHS) on that component.
As a part of domain restoration, the POST reviews the updated CHS information to determine which component to remove from the domain configuration. The appropriate components are then deconfigured and the domain is restarted.
Solaris Diagnostic Engine
FIGURE 2 shows the automatic diagnosis process for nonfatal domain hardware errors.
FIGURE 2 Automatic Diagnosis for Nonfatal Domain Hardware Errors
The Solaris OE determines when a nonfatal hardware error has occurred and reports it to the SC. The domain is not stopped. The Solaris OE identifies the failure and the components that caused the failure. If appropriate, the Solaris OE might also deconfigure the component. For example, a CPU might be taken offline because of nonfatal errors that occur within the module, or a virtual memory page might be retired because of errors contained in the page.
The diagnostic information is then handled through the same channel as the SMS DE, and event messages are generated. These list events are then reported by ERD and recorded by ELAD.
The SMS DE records the diagnosed fault in each of the components by updating the CHS on that component.
In this case, the domain is not stopped, and resources are removed by POST from the domain configuration at the next domain reboot.
POST Diagnostic Engine
FIGURE 3 shows the POST diagnosis process.
FIGURE 3 POST Diagnosis Process
Whenever POST is run to test and configure the domain, any components that fail during the self-test are reported to SMS.
SMS records the diagnosed fault in each of the components by updating the CHS on that component. The appropriate components are then removed from the domain configuration and the domain is booted.
If AD determines that a single component is at fault, the CHS for that component is marked as faulty. If it indicates that more than one component could be at fault, all possible components are marked as suspect.
It is possible that not all the components listed are faulty. The hardware error could be caused by a smaller subset of the identified components. Further analysis might be required to determine which field replaceable units (FRUs) are faulty.
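The marking rule described above can be expressed in a few lines of Python. This is a sketch of the rule only. The mark_chs function and the string values "faulty" and "suspect" are illustrative assumptions and do not represent how SMS encodes CHS.

```python
from typing import Dict, List


def mark_chs(indicted: List[str]) -> Dict[str, str]:
    """Return a CHS value for each indicted component.

    A single indicted component is marked "faulty". When several components
    are indicted, every one of them is marked "suspect".
    """
    if not indicted:
        return {}
    status = "faulty" if len(indicted) == 1 else "suspect"
    return {component: status for component in indicted}
```

For example, mark_chs(["sb11"]) marks sb11 as faulty, whereas mark_chs(["sb11", "ex11"]) marks both components as suspect, leaving the final determination to further analysis.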
Component Health System (CHS)
This feature records the CHS of each component in the system.
The enable and disable component commands in SMS manage the component blacklist. Blacklisting is location based. That is, if a blacklisted system board in expander 1 is moved to expander 2, the system board slot of expander 1 remains blacklisted, and the system board now in expander 2 can be integrated into a domain.
The new functionality uses CHS to mark the component itself as faulty. In the previous example, the faulty status is stored with the system board that was moved from expander 1, so POST does not integrate it into a domain in its new location.
CHS is stored in the FRU's SEEPROM. FRUs with a faulty CHS can be removed from the resource pool without the use of blacklisting.
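The difference between location-based blacklisting and component-based CHS can be illustrated with a small Python sketch. Everything in it is hypothetical: the Board class, the serial value, the slot names, and the can_configure function are assumptions made for illustration and are not SMS interfaces.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Board:
    serial: str       # hypothetical identifier for the FRU
    chs: str = "ok"   # travels with the board, as if stored in its SEEPROM


def can_configure(slot: str, slots: Dict[str, Board], blacklist: Set[str]) -> bool:
    """A slot is usable only if it is not blacklisted and its board is not faulty."""
    board = slots.get(slot)
    if board is None:
        return False
    return slot not in blacklist and board.chs != "faulty"


board = Board(serial="board-A")
blacklist = {"expander1"}   # location based: names the slot, not the board

# The board has been moved from expander 1 to expander 2.
slots = {"expander2": board}
after_move = can_configure("expander2", slots, blacklist)

# Mark the board itself as faulty; the status moves with the board.
board.chs = "faulty"
after_chs = can_configure("expander2", slots, blacklist)
```

Here after_move is True because the blacklist entry stays with expander 1, and after_chs is False because the faulty status follows the board to its new slot.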
Automatic restoration occurs on the domain after the fault is isolated.
The SMS software has automatic system recovery (ASR) features. If the reboot_on_error flag is set, the domain is restarted with a minimum level of POST, which might not deconfigure the faulty component.
The new functionality allows POST, during the domain initialization, to query if a resource should be excluded from a domain configuration, due to CHS. If the component is faulty, POST does not configure it in the domain configuration.
Also, as mentioned earlier, if POST can determine that a single component is at fault, the CHS for that component is marked as faulty.
Event reporting uses four different channels to report events:
Platform and domain message logs
Email
Sun Management Center software
Remote services using Sun Remote Services (SRS) Net Connect
Events are logged in the platform message log and the appropriate domain message log. These text messages are in a single-line standard format, with enough information to help service personnel troubleshoot the problem.
The following example shows the text message template.
<initiator> Event: <> CSN: <> DomainID: <> ADInfo: <> Time: <> Recommended-Action: Service action required
The following shows a text message example for DStop.
[AD] Event: SF15000-8001-0W CSN: 053A2003 DomainID: A ADInfo: 1.SMS-DE.1.4 Time: Fri Jul 11 14:26:36 PDT 2003 Recommended-Action: Service action required
The following shows a text message example for POST test failure.
[AD] Event: SF15000-8001-DE CSN: 053A2003 DomainID: A ADInfo: 1.POST-DE.1.4 Time: Fri Jul 11 14:30:36 PDT 2003 Recommended-Action: Service action required
The following shows a text message example for a nonfatal domain error diagnosed by the Solaris DE.
[DOM] Event: SF15000-8000-FF CSN: 053A2003 DomainID: B ADInfo: 1.SF-SOLARIS-DE.1 Time: Thu Jul 31 08:37:54 PDT 2003 Recommended-Action: Service action required
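Because the messages use a single-line format, they can be split into fields with a regular expression. The following Python sketch is based only on the three sample messages shown above and assumes that real log entries follow the same field order. The parse_event_line function and the group names are hypothetical. The pattern accepts either a hyphen or a space in the Recommended-Action label.

```python
import re
from typing import Dict, Optional

# Field order is taken from the sample messages; this is an assumption
# about the format, not a documented grammar.
EVENT_LINE = re.compile(
    r"\[(?P<initiator>[^\]]+)\]\s+"
    r"Event:\s+(?P<event>\S+)\s+"
    r"CSN:\s+(?P<csn>\S+)\s+"
    r"DomainID:\s+(?P<domain>\S+)\s+"
    r"ADInfo:\s+(?P<adinfo>\S+)\s+"
    r"Time:\s+(?P<time>.+?)\s+"
    r"Recommended[- ]Action:\s+(?P<action>.+)$"
)


def parse_event_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of an event message, or None if the line does not match."""
    match = EVENT_LINE.search(line.strip())
    return match.groupdict() if match else None


sample = (
    "[AD] Event: SF15000-8001-0W CSN: 053A2003 DomainID: A "
    "ADInfo: 1.SMS-DE.1.4 Time: Fri Jul 11 14:26:36 PDT 2003 "
    "Recommended-Action: Service action required"
)
parsed = parse_event_line(sample)
```

The pattern uses search rather than a full-line match so that any prefix a logging facility might add before the initiator is ignored.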
Sun Management Center Software
The event reporting daemon in the SMS software generates SMS events. These SMS events are handled by the Event Front-End (EFE) daemon of the Sun Management Center software.
These SMS events contain the event class, the event code, and the Sun Fire chassis serial number (CSN). The Sun Management Center platform agent then issues a Sun Management Center text message for display on the Sun Management Center console.
By default, SMS does not generate email messages. You need to configure the email list by fault classes, domains, and recipients. The sample template of the email message form is included with SMS software in $SMSETC/config/templates/sample_email.
Customize the sample template by substituting tags with fault information. A standard shell script is included to send email. You can replace this script with a customized shell script.
You might need to customize scripts for the correct recipients and for the desired faults and domains. The email control file, event_email.cf, contains the email notification parameters. These parameters identify the email recipient based on the event class and domain in which the event occurred and whether the event message structure is sent as an attachment with the event email.
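The selection of recipients by event class and domain can be pictured with the following Python sketch. It does not reproduce the syntax of event_email.cf. The EmailRule class, its fields, the class label "fault", and the example.com addresses are all hypothetical and serve only to illustrate the matching logic described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmailRule:
    event_class: str     # hypothetical class label, for example "fault"
    domains: str         # domain IDs covered by the rule, for example "AB"
    recipient: str       # address that receives the notification
    attach_event: bool   # whether the event message structure is attached


def recipients_for(rules: List[EmailRule], event_class: str, domain: str) -> List[str]:
    """Return the recipients whose rule matches the event class and domain."""
    return [
        rule.recipient
        for rule in rules
        if rule.event_class == event_class and domain in rule.domains
    ]


rules = [
    EmailRule("fault", "AB", "oncall@example.com", True),
    EmailRule("fault", "C", "lab@example.com", False),
]
```

With these illustrative rules, a fault event in domain A is routed to oncall@example.com, and a fault event in domain D matches no rule.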
Use the testemail command to verify that the email event notification works properly. This command is at /opt/SUNWSMS/SMS/lib/smsadmin/testemail.
The following is an example of email received.
Date: Tue, 19 Aug 2003 10:45:28 -0600 (MDT)
Subject: FAULT: SF15000, serial# 352A0007, code SF15000-8000-GK
From: FMA@xyz.com
To: undisclosed-recipients:;

FAULT: SF15000, serial# 352A0007, code SF15000-8000-GK
Fault event in domain(s) A at Tue Aug 19 10:45:18 MDT 2003.
Fault severity = SMIEVENT_SEV_FATAL <7>
Indictment Count: 2
Indictment list: sb11 ex11
For complete details about event tags described in the email template file, refer to the SMS 1.4 Administrator Guide.
The support utilities provide two commands: showlogs and testemail.
The SMS showlogs command is updated to view the error event reports.
The parameter -E in the showlogs command formats and condenses the event log information displayed.
The option -p e displays the event log according to the arguments passed to the option.
The showlogs event output supplements the diagnosis information presented in the platform and domain message logs or event emails. The showlogs event output can be used for additional troubleshooting purposes.
Use the testemail command to test the email setup and verify the generated email reports. This command ensures that the reports contain the proper domain information, faults, and recipients.