Solaris OE Enhancements
Kernel updates for the Solaris 8 OE and Solaris 9 OE on UltraSPARC III systems enhance the handling of correctable errors (CEs) in L2_SRAM modules. Multiple CEs when accessing an L2_SRAM module indicate a higher probability of a subsequent uncorrectable error (UE). To prevent a fatal UE, the Solaris OE attempts to off-line the affected CPU. Domain availability increases because the Solaris OE no longer accesses L2_SRAM modules that have an increased failure probability.
The enhanced Solaris OE kernels have the ability to communicate hardware failures to the SC. If the system is using the appropriate kernel update for the Solaris 8 OE (KU-108528-24) or the Solaris 9 OE (KU-112233-09) with patch 116009-01, a message is sent to the SC when the Solaris OE identifies and isolates a faulty L2_SRAM module. The failed L2_SRAM module is not reconfigured into a domain on future domain reboots or setkeyswitch off and on operations because the system controller has recorded the component as faulty in its CHS.
Similar to memory page retirement, the Solaris OE keeps track of the number of ECC errors over time on an L2_SRAM module (FIGURE 7). Two types of ECC errors are considered here: nonfatal multibit errors (UCU, CPU, WDU, EDU) and nonfatal single-bit correctable errors (UCC, CPC, WDC, EDC). If an L2_SRAM module experiences one nonfatal multibit error or three single-bit correctable errors in a 24-hour window, the L2_SRAM module is diagnosed as having an increased probability of suffering a fatal failure in the future. In this scenario, the Solaris OE automatically attempts to off-line the affected CPU module. The CPU off-line might not succeed if processes are bound to that CPU.
FIGURE 7 Solaris OE L2_SRAM Error Handling
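The threshold rule described above (one nonfatal multibit error, or three single-bit correctable errors within 24 hours) can be illustrated with a small sliding-window counter. This is only a sketch of the decision rule, not the kernel's implementation: the class name `L2SramTracker`, the use of timestamps in seconds, and the treatment of an error exactly 24 hours old as expired are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW_SECONDS = 24 * 60 * 60              # 24-hour window from the text
CE_THRESHOLD = 3                           # three single-bit correctable errors
MULTIBIT = {"UCU", "CPU", "WDU", "EDU"}    # nonfatal multibit error types
SINGLE_BIT = {"UCC", "CPC", "WDC", "EDC"}  # nonfatal single-bit correctable types


@dataclass
class L2SramTracker:
    """Hypothetical per-module error tracker (not the kernel's data structure)."""
    ce_times: deque = field(default_factory=deque)

    def record(self, error_type: str, now: float) -> bool:
        """Record one error; return True if a CPU off-line should be attempted."""
        if error_type in MULTIBIT:
            return True  # a single nonfatal multibit error is enough
        if error_type not in SINGLE_BIT:
            raise ValueError(f"unknown error type: {error_type}")
        self.ce_times.append(now)
        # Assumption: errors that are 24 hours old or older no longer count.
        while self.ce_times and now - self.ce_times[0] >= WINDOW_SECONDS:
            self.ce_times.popleft()
        return len(self.ce_times) >= CE_THRESHOLD
```

Three correctable errors recorded at 0, 3600, and 7200 seconds would trigger an off-line attempt on the third call, whereas a third error arriving after the first has aged out of the window would not.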
TABLE 6, "Example 6," shows the message logged when a CPU that experienced more than two CE events in a 24-hour window is successfully off-lined.
TABLE 6 Example 6
|Feb 3 06:38:40 doma SUNW,UltraSPARC-III: NOTICE: [AFT1] CPU6 offlined due to more than 2 xxC Events in 24:00:00 (hh:mm:ss)|
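For log monitoring, a notice of this form can be matched with a regular expression. The following sketch is built only from the single sample message in TABLE 6; the pattern and the function name `parse_offline_notice` are assumptions, and other kernel updates may word the message differently.

```python
import re

# Pattern derived from the one sample line in TABLE 6 (an assumption, not a spec).
AFT1_OFFLINE = re.compile(
    r"\[AFT1\] CPU(?P<cpu>\d+) offlined due to more than "
    r"(?P<limit>\d+) xxC Events in (?P<window>\d+:\d{2}:\d{2})"
)


def parse_offline_notice(line: str):
    """Return (cpu_id, limit, window) for an off-line notice, else None."""
    match = AFT1_OFFLINE.search(line)
    if match is None:
        return None
    return int(match["cpu"]), int(match["limit"]), match["window"]


sample = (
    "Feb 3 06:38:40 doma SUNW,UltraSPARC-III: NOTICE: [AFT1] CPU6 offlined "
    "due to more than 2 xxC Events in 24:00:00 (hh:mm:ss)"
)
print(parse_offline_notice(sample))  # prints (6, 2, '24:00:00')
```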
Once a CPU is off-lined, the Solaris OE sends a message to the system controller. The system controller updates the CHS of the affected FRU so that the faulty CPU is not configured into a domain on future reboots or setkeyswitch off and on events.
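The hand-off described above can be pictured as a persistent status table that is consulted whenever a domain is configured. The sketch below is purely conceptual: the class `SystemController`, its methods, and the component names are hypothetical and do not correspond to any actual SC interface or message format.

```python
class SystemController:
    """Hypothetical stand-in for the SC's persistent CHS records."""

    def __init__(self):
        self.chs = {}  # component name -> recorded status

    def report_fault(self, component: str) -> None:
        """Record a fault reported by the Solaris OE for a component."""
        self.chs[component] = "faulty"

    def configurable(self, components):
        """Return the components eligible for configuration into a domain."""
        return [c for c in components if self.chs.get(c) != "faulty"]


sc = SystemController()
sc.report_fault("CPU6")  # sent by the OS after a successful off-line
print(sc.configurable(["CPU4", "CPU5", "CPU6", "CPU7"]))
```

Because the status lives with the system controller rather than in the domain, the exclusion survives a domain reboot or a setkeyswitch off and on cycle.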
Off-lining a CPU whose L2_SRAM module has a higher probability of experiencing a fatal error increases the availability of the Solaris OE. Because the Solaris OE communicates with the SC to store the CHS persistently, availability increases and the system is easier to diagnose and service. Dynamically reconfigured CPU/Memory boards can be replaced with minimal impact to the Solaris OE and user applications.