Error Analysis and Diagnosis

This section addresses error analysis and diagnosis practices.

ECC Replacement Guidelines

The Solaris OS can report three categories of ECC memory errors: intermittent, persistent, and sticky. These errors are reported to the console as well as to the system messages file. In addition, the Sun Fire 15K server detects ECC errors in the hardware and takes a snapshot of the hardware state. These hardware state dumps are known as record stops and are stored in the SC directory structure for the affected domain. The ECC DIMM errors discussed here relate to main-memory modules, not to level-two cache (L2CACHE) errors. For more information, refer to the Sun BluePrints OnLine article "Solaris Operating System Availability Features."

The ECC DIMM memory errors are categorized as follows:

  • Intermittent—The error was not detected on a reread of the affected memory location.

  • Persistent—The error was detected again on a reread of the affected memory location, but the scrub operation corrected it.

  • Sticky—The error still exists in memory, even after the scrub operation.

TIP

If the ECC error is intermittent, check the error reports.

If the ECC error is persistent, replace the DIMM if three or more errors occur in a 24-hour period on the same DIMM.

If the ECC error is sticky, replace the DIMM on the first occurrence.
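
The replacement guidelines above can be read as a small decision rule. The following Python sketch is illustrative only: the function name, the event tuple layout, and the DIMM identifiers are hypothetical, and it assumes the ECC events have already been extracted from the console or system messages file by some other means. It treats "a 24-hour period" as a sliding window, which is one reasonable reading of the guideline.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)   # "24-hour period" from the guideline
PERSISTENT_THRESHOLD = 3       # "three or more errors" on the same DIMM


def dimms_to_replace(events):
    """Return the set of DIMM ids the guidelines would flag for replacement.

    events: iterable of (timestamp, dimm_id, category) tuples, where
    category is "intermittent", "persistent", or "sticky".
    This tuple layout is an assumption made for the example.
    """
    replace = set()
    recent_persistent = defaultdict(list)  # dimm_id -> timestamps in window

    for ts, dimm, category in sorted(events):
        if category == "sticky":
            # Sticky: replace on first occurrence.
            replace.add(dimm)
        elif category == "persistent":
            times = recent_persistent[dimm]
            times.append(ts)
            # Keep only errors within 24 hours of the current one.
            while ts - times[0] > WINDOW:
                times.pop(0)
            if len(times) >= PERSISTENT_THRESHOLD:
                replace.add(dimm)
        # Intermittent: no replacement; the reports should be reviewed.

    return replace


if __name__ == "__main__":
    t0 = datetime(2004, 1, 1)
    sample = [
        (t0, "dimm-a", "persistent"),
        (t0 + timedelta(hours=1), "dimm-a", "persistent"),
        (t0 + timedelta(hours=2), "dimm-a", "persistent"),
        (t0 + timedelta(hours=3), "dimm-c", "sticky"),
        (t0 + timedelta(hours=4), "dimm-d", "intermittent"),
    ]
    print(sorted(dimms_to_replace(sample)))
```

In this sketch, intermittent errors never trigger a replacement on their own; they are left for manual review, in line with the first tip.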

Maintain Current Explorer Data

Sun Fire server systems are designed with significant diagnostic capabilities. In the event of a system fault, the system should provide information for both hardware and software failures, which can be used to help determine the source of the fault. Errors can be reported and logged to several places depending on the type of error. Explorer software is the utility of choice for gathering the state of SCs and domains at the time a failure occurs. Be sure to use the most current release of Explorer software to capture all of the appropriate data.

For more information about Explorer, refer to:

http://sunsolve.Sun.COM/pub-cgi/show.pl?target=explorer/explorer

TIP

Use the latest release of Explorer software.

Fault Isolation

One of the most interesting features of the Sun Fire 15K server is the ability to reconfigure the platform and the domain through the software command-line interface (CLI). In fact, the centerplane can be placed into a degraded mode of operation without shutting down the running domains. In most cases, this provides an opportunity to isolate a fault to a field-replaceable unit (FRU) before handling any hardware and risking damage to single points of failure (SPOFs) such as the centerplane. For any hardware-detected error, first try to isolate the failing components using the information provided by the software (hpost, redx, dsmd dumps, log files). Before attempting hardware replacement, use the SMS command disablecomponent to deconfigure and isolate the failing hardware.

TIP

Before hardware replacement, use the SMS command disablecomponent to deconfigure and isolate the faulty hardware.

SMS 1.4 software adds Auto Diagnosis and Recovery capabilities that help detect system failures when they occur, deconfigure faulty components out of a system, and automatically restore a system. These functions help customers minimize both planned and unplanned downtime. The new features are as follows:

  • Automatically deconfigure the faulty components out of a system, reducing unplanned downtime (component health status).

  • Provide detailed error messages for faster problem resolution and faster time to service (auto diagnosis).

  • Detect potential CPU cache failures and offline the affected CPUs, keeping the system up and applications available (CPU taken offline).

  • Automatically restore domains, reducing the impact of faulty components on system availability (auto recovery).

  • Automatically generate email notices that inform designated recipients of domain events when they occur (email event notification).

For more information, refer to the Sun BluePrints OnLine article titled "Sun Fire 15K/12K Auto Diagnosis and Recovery."

TIP

Also check the event notifications when deconfiguring and isolating the faulty hardware.
