Page Retirement

Solaris maintains a list of pages not being used. This list of pages is known as the freelist. When a new page is needed, one is taken off the freelist. To remove a page from use permanently, it is sufficient to ensure that when it is no longer being used by the system, it is not returned to the freelist. Essentially, it will remain unused until the system is rebooted and the OS starts over with a new freelist. When a page is no longer being used, the OS calls the function page_free() on that page.

This section discusses some of the reasons for page retirement, and how faulty pages are handled.

Page Retirement for Correctable DIMM Errors

Page Retirement is implemented via bug IDs 4484338, 4504686, and 4880360. A memory DIMM which is experiencing repeated correctable (single-bit) errors has an increased probability of experiencing an uncorrectable (multi-bit) error. Likewise, the probability of a memory error condition which could result in system downtime also increases.

To help address this, new features have been implemented for both UltraSPARC II-based and UltraSPARC III-based systems. These features attempt to proactively predict which system memory components (DIMMs) have an increased probability of experiencing an uncorrectable error, and subsequently remove this memory from future use when it is no longer used by the kernel or any processes. Limits are placed on the number of memory pages which can be retired from use.

A CPU receives notification that a correctable memory error has occurred via the trap mechanism. This initiates the following sequence of events:

  1. The correctable error (CE) is scrubbed from the system using a sequence of address space identifier (ASI) memory accesses, cache line flushes, and so on.

    NOTE

    The operations to clear an error are CPU type specific.

    The CE can fall into one of three categories:

    • A CE is considered intermittent if the error is not detected upon a reread of the affected memory word. An intermittent CE is often referred to as a transient soft error.

    • A CE is considered persistent if the error is detected upon reread, but the scrubbing operation corrected it. A persistent CE is often referred to as a temporary soft error.

    • A CE is considered sticky if after scrubbing, the error is still present. A sticky CE is often referred to as a stuck-at hard error.

  2. The physical address that caused the CE is read from the asynchronous fault address register (AFAR). From this, the DIMM that contains the affected memory cell is determined.

  3. The "Leaky Bucket" SERD algorithm is invoked on this DIMM. This monitors the frequency and interval of CE occurrences on a system component. If these exceed a certain threshold, the system decides this component is "deteriorating," and pages mapped to it should be removed from use. The algorithm ignores intermittent CE occurrences but counts persistent CE occurrences. Sticky CE occurrences cause the DIMM on which the CE occurs to be immediately marked as deteriorating. Until the threshold of acceptable CE occurrences on a DIMM has been exceeded, no pages on that DIMM are retired. After the threshold is exceeded, every subsequent CE on that DIMM causes a page retire operation.

  4. When the DIMM is marked as deteriorating, Solaris also marks the physical page containing the physical address as deteriorating.

    NOTE

    At this point both the physical DIMM (identified by its UNUM string) and the physical page frame are marked deteriorating.

  5. The physical page is now retired from use, if possible, and the system continues normal operations.
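The CE categories and the per-DIMM threshold check described above can be illustrated with a short Python sketch. This is not the Solaris implementation: the names `CEKind`, `classify_ce`, and `DimmTracker` are hypothetical, the threshold is approximated with a simple sliding time window rather than the actual SERD engine, and the defaults (2 CEs in 1440 minutes) are borrowed from the tunables listed later in this article.

```python
from collections import defaultdict, deque
from enum import Enum


class CEKind(Enum):
    INTERMITTENT = "intermittent"  # not detected on reread
    PERSISTENT = "persistent"      # detected on reread, cleared by scrubbing
    STICKY = "sticky"              # still present after scrubbing


def classify_ce(seen_on_reread: bool, present_after_scrub: bool) -> CEKind:
    """Map the two observations made while scrubbing onto the three categories."""
    if not seen_on_reread:
        return CEKind.INTERMITTENT
    return CEKind.STICKY if present_after_scrub else CEKind.PERSISTENT


class DimmTracker:
    """Toy per-DIMM CE bookkeeping (hypothetical, sliding-window approximation)."""

    def __init__(self, limit: int = 2, interval_minutes: int = 1440):
        self.limit = limit
        self.interval = interval_minutes
        self.events = defaultdict(deque)  # DIMM name -> CE timestamps in minutes
        self.deteriorating = set()

    def record(self, dimm: str, kind: CEKind, now_minutes: float) -> bool:
        """Record one CE and return True if a page retire should be attempted."""
        if kind is CEKind.INTERMITTENT:
            return False                  # assumption: always ignored
        if kind is CEKind.STICKY:
            self.deteriorating.add(dimm)  # marked immediately
            return True
        if dimm in self.deteriorating:
            return True                   # every later CE triggers a retire
        window = self.events[dimm]
        window.append(now_minutes)
        # Drop events that have aged out of the interval.
        while window and now_minutes - window[0] >= self.interval:
            window.popleft()
        if len(window) > self.limit:
            self.deteriorating.add(dimm)
        return dimm in self.deteriorating
```

With the default values, the first two persistent CEs on a DIMM within 24 hours are tolerated, and the third marks the DIMM as deteriorating and requests a page retire.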

Because of memory interleaving, every DIMM in the system has multiple pages associated with it. Conversely, a single physical page consists of portions of physical memory from multiple DIMMs.

Toxic vs. Failing Pages

The OS distinguishes between pages with correctable errors and pages with uncorrectable errors. A page with an uncorrectable error that might be clearable is marked as toxic. Pages that are mapped to a DIMM that has experienced multiple correctable errors are marked as failing. Recall that DIMMs can be marked as deteriorating. A deteriorating DIMM contains pages that are either toxic or failing.

If a page is marked toxic, the OS attempts to clean any errors from the page using a scrubbing algorithm when page_free() is invoked on that page. If it can verify that there are no errors on the page after it does its scrubbing, it allows that page to be returned to the freelist. This ensures that a single error does not cause a page to be removed from the system. If the scrubbing is unsuccessful, the page is marked as failing. Further, if it is no longer in use by other threads, the page is immediately retired.

If a page is marked failing, no attempt is made to clean the page via scrubbing. It is immediately retired if it is no longer in use by other threads.

The sequence of operations is as follows:

  1. When the page is deemed to be failing, a flag is set on that page, PAGE_IS_FAILING.

  2. When the page is no longer in use, page_free() is invoked on that page.

  3. If the PAGE_IS_FAILING flag is set, page_free() moves the page to a special retired pages vnode, and the amount of available free memory in the system is decremented. The page is not returned to the freelist, and so will not be used again until reboot.
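The handling of toxic and failing pages at free time can be modeled in a few lines of Python. This is only a sketch of the behavior described above, not kernel code: `Page`, `PagePool`, and the `scrub` callback are hypothetical stand-ins, the `failing` field plays the role of the PAGE_IS_FAILING flag, and a plain list stands in for the retired pages vnode.

```python
from dataclasses import dataclass


@dataclass
class Page:
    addr: int
    toxic: bool = False    # uncorrectable error that might be clearable
    failing: bool = False  # stands in for the PAGE_IS_FAILING flag


class PagePool:
    def __init__(self, total_pages: int, scrub):
        self.freelist = []            # pages available for reuse
        self.retired = []             # stands in for the retired pages vnode
        self.available = total_pages  # usable memory, counted in pages
        self.scrub = scrub            # callable(Page) -> True if clean afterwards

    def page_free(self, page: Page) -> None:
        """Called when a page is no longer in use."""
        if page.toxic and not page.failing:
            if self.scrub(page):
                page.toxic = False    # a single cleared error does not retire the page
            else:
                page.failing = True   # scrubbing failed: treat as failing
        if page.failing:
            # Retire: keep the page off the freelist until reboot.
            self.retired.append(page)
            self.available -= 1
            return
        self.freelist.append(page)
```

A failing page is never scrubbed in this model. It goes straight to the retired list, which matches the rule that no cleaning attempt is made for failing pages.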

Page Retirement and DR

Page Retirement does not utilize DR. Automatic DR of an entire system board or its subcomponents is not attempted when one or more pages are retired on a board. If a system administrator manually uses DR to move a system board containing retired pages out of a domain and into the same or another domain, the pages become active again if the POST process is successful.

There is currently no mechanism to inform POST/OBP of deteriorating DIMMs. For this reason, there exists no method to remove a page from use permanently across reboots. The end result is that pages retired during one OS run become available again at the next boot.

NOTE

Solaris 8 Kernel Update patches prior to 108528-24 and Solaris 9 Kernel Update patches 112233-08 and earlier provide a rudimentary implementation of page retirement, but it is disabled by default. You can enable the feature via the /etc/system file. However, doing so is not recommended because of bug IDs 4401262 and 4854496, which are present in those releases. The following two paragraphs describe these bug IDs in more detail.

Bug ID 4401262 describes a hang condition during dynamic reconfiguration. If a system board with DIMMs containing retired pages is subsequently dynamically reconfigured from the system, the DR operation hangs during the DR unconfigure stage. A memory error that occurs during DR might also invoke the page retirement algorithms during the DR operation. This is expected to result in the same hang condition; however, test cases have not confirmed this. The key point regarding Solaris 8 Kernel Update patches prior to 108528-24 and Solaris 9 Kernel Update patches 112233-08 and earlier is that DR should not be used on a system board containing known retired pages. Service to the DIMM should be postponed until the OS can be taken down.

Bug ID 4854496 describes a panic condition that results from dereferencing a pointer which resides on a page already zeroed by page retirement. While no customers have experienced this panic condition outside of artificially created conditions, it is theoretically possible for it to occur.

Although it is not recommended, once you understand the DR interaction and the potential panic condition, you can enable page retirement on systems that need this feature but cannot run Solaris 8 Kernel Update patch 108528-24.

Tunables

The following /etc/system variables and their possible values are listed here for reference only. Changing the values to other than their defaults should only be done under the guidance of an authorized Sun Microsystems service provider.

TABLE 2 Page Retirement Variables and Values

Variables:

set ce_verbose_memory=[0/1/2]

set ce_verbose_other=[0/1/2]

Values: A value of 0 indicates no logging. A value of 1 indicates that the messages are sent to the log file, but not the console. A value of 2 indicates that the messages are sent to the console and the log file. The default value is 1.

Variables:

set automatic_page_removal=[0/1]

Values: A value of 0 disables the page retirement feature. A value of 1 enables the page retirement feature. The default value depends upon the kernel patch release and is discussed in a prior section.

Variables:

set ecc_softerr_interval=1440

set ecc_softerr_limit=2

Values: The interval, measured in minutes, and the number of acceptable CEs within that interval. The Leaky Bucket algorithm uses these values to determine when to begin page retirement: up to ecc_softerr_limit CEs within ecc_softerr_interval minutes are acceptable, and page retirement begins once this threshold is exceeded. The values shown are the defaults.

Variables:

set max_pages_retired_bps=10

Values: Limits the number of physical memory pages which can be retired. This number is a percentage of physical memory stored as basis points, where 100 basis points is 1%. The default is 10, or 0.1% of physical memory.
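To make the basis-point arithmetic concrete, the following Python sketch converts a max_pages_retired_bps value into a page count. The helper name `max_retired_pages` is hypothetical, the 8 GB memory size and 8 KB page size are illustrative assumptions, and rounding down is also an assumption, because the exact rounding used by the kernel is not stated here.

```python
def max_retired_pages(total_pages: int, bps: int = 10) -> int:
    """Return the page retirement cap for a limit given in basis points.

    One basis point is 1/100 of a percent, so 10,000 basis points is 100%.
    Rounding down is an assumption made for this sketch.
    """
    return total_pages * bps // 10_000


# Illustrative system: 8 GB of memory divided into 8 KB pages (assumed values).
total_pages = (8 * 1024**3) // (8 * 1024)   # 1,048,576 pages
print(max_retired_pages(total_pages))       # 1048 pages at the default of 10
```

At the default of 10 basis points, this assumed system could retire at most 1048 pages, or roughly 8 MB of its 8 GB.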


Example Messaging

The following is a sequential extract of messages from the system log of a system which experienced multiple errors on two DIMMs.

  1. CE on Memory Module Board 4 J3401 (First DIMM, First Error)

    Jan 7 04:13:29 pyre SUNW,UltraSPARC-II: [ID 194692 kern.notice] [AFT0] Corrected Memory Error detected by CPU12, errID 0x0000003b.24cd6aea
    Jan 7 04:13:29 pyre  AFSR 0x00000000.00100000<CE> AFAR 0x00000001.2db18000
    Jan 7 04:13:29 pyre  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78330400
    Jan 7 04:13:29 pyre  UDBH Syndrome 0x64 Memory Module Board 4 J3401
    Jan 7 04:13:29 pyre SUNW,UltraSPARC-II: [ID 898376 kern.notice] [AFT0] errID 0x0000003b.24cd6aea Corrected Memory Error on Board 4 J3401 is Persistent
    Jan 7 04:13:29 pyre SUNW,UltraSPARC-II: [ID 906141 kern.notice] [AFT0] errID 0x0000003b.24cd6aea ECC Data Bit 7 was in error and corrected

  2. CE on Memory Module Board 4 J3801 (Second DIMM, First Error)

    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 338670 kern.notice] [AFT0] Corrected Memory Error detected by CPU5, errID 0x00000043.f7fb26ef
    Jan 7 04:14:07 pyre  AFSR 0x00000000.00100000<CE> AFAR 0x00000001.2bf6c000
    Jan 7 04:14:07 pyre  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78330be0
    Jan 7 04:14:07 pyre  UDBH Syndrome 0xf2 Memory Module Board 4 J3801
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 282000 kern.notice] [AFT0] errID 0x00000043.f7fb26ef Corrected Memory Error on Board 4 J3801 is Persistent
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 339990 kern.notice] [AFT0] errID 0x00000043.f7fb26ef ECC Data Bit 9 was in error and corrected
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c000, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c008, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c010, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c018, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c020, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c028, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c030, Data 0x0eccf00d.ff250000, ECC 0xd5

  3. CE on Memory Module Board 4 J3801 (Second DIMM, Second Error)

    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 492729 kern.notice] [AFT0] Corrected Memory Error detected by CPU5, errID 0x00000043.f9761088
    Jan 7 04:14:07 pyre  AFSR 0x00000000.00100000<CE> AFAR 0x00000001.2bf6c040
    Jan 7 04:14:07 pyre  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78330be0
    Jan 7 04:14:07 pyre  UDBH Syndrome 0xf2 Memory Module Board 4 J3801
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 203173 kern.notice] [AFT0] errID 0x00000043.f9761088 Corrected Memory Error on Board 4 J3801 is Persistent
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 429558 kern.notice] [AFT0] errID 0x00000043.f9761088 ECC Data Bit 9 was in error and corrected
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c040, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c048, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c050, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c058, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c060, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c068, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c070, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c078, Data 0x0eccf00d.ff250040, ECC 0x5c

  4. CE on Memory Module Board 4 J3801 (Second DIMM, Third Error)

    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 313390 kern.notice] [AFT0] Corrected Memory Error detected by CPU5, errID 0x00000043.faed89ef
    Jan 7 04:14:07 pyre  AFSR 0x00000000.00100000<CE> AFAR 0x00000001.2bf6c080
    Jan 7 04:14:07 pyre  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78330be0
    Jan 7 04:14:07 pyre  UDBH Syndrome 0xf2 Memory Module Board 4 J3801
    Jan 7 04:14:07 pyre unix: [ID 596940 kern.warning] WARNING: [AFT0] 3 soft errors in less than 24:00 (hh:mm) detected from Memory Module Board 4 J3801

  5. CE count on J3801 has exceeded maximum acceptable; page removal is attempted

    Jan 7 04:14:07 pyre unix: [ID 618185 kern.notice] NOTICE: Scheduling removal of page 0x00000001.2bf6c000
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 873457 kern.notice] [AFT0] errID 0x00000043.faed89ef Corrected Memory Error on Board 4 J3801 is Persistent
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 185728 kern.notice] [AFT0] errID 0x00000043.faed89ef ECC Data Bit 9 was in error and corrected
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c080, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c088, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c090, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c098, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0a0, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0a8, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0b0, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0b8, Data 0x0eccf00d.ff250080, ECC 0xb1

  6. CE on Memory Module Board 4 J3801 (Second DIMM, Fourth Error)

    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 958426 kern.notice] [AFT0] Corrected Memory Error detected by CPU5, errID 0x00000043.fc6c8d68
    Jan 7 04:14:07 pyre  AFSR 0x00000000.00100000<CE> AFAR 0x00000001.2bf6c0c0
    Jan 7 04:14:07 pyre  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78330be0
    Jan 7 04:14:07 pyre  UDBH Syndrome 0xf2 Memory Module Board 4 J3801
    Jan 7 04:14:07 pyre unix: [ID 596940 kern.warning] WARNING: [AFT0] 4 soft errors in less than 24:00 (hh:mm) detected from Memory Module Board 4 J3801

  7. CE count on J3801 still in excess of maximum acceptable; page removal is attempted (Note: this is the same page as before, which could not be removed because it was still in use. This is the second attempt)

    Jan 7 04:14:07 pyre unix: [ID 618185 kern.notice] NOTICE: Scheduling removal of page 0x00000001.2bf6c000
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 716816 kern.notice] [AFT0] errID 0x00000043.fc6c8d68 Corrected Memory Error on Board 4 J3801 is Persistent
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 236389 kern.notice] [AFT0] errID 0x00000043.fc6c8d68 ECC Data Bit 9 was in error and corrected
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0c0, Data 0x0eccf00d.ff2500c0, ECC 0x38
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0c8, Data 0x0eccf00d.ff2500c0, ECC 0x38
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0d0, Data 0x0eccf00d.ff2500c0, ECC 0x38
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0d8, Data 0x0eccf00d.ff2500c0, ECC 0x38
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0e0, Data 0x0eccf00d.ff2500c0, ECC 0x38
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0e8, Data 0x0eccf00d.ff2500c0, ECC 0x38
    Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x12bf6c0f0, Data 0x0eccf00d.ff2500c0, ECC 0x38

  8. Page on Memory Module Board 4 J3801 is removed

    Jan 7 04:14:12 pyre unix: [ID 693633 kern.notice] NOTICE: Page 0x00000001.2bf6c000 removed from service

  9. CE on Memory Module Board 6 J3801 (Third DIMM, First Error)

    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 298492 kern.notice] [AFT0] Corrected Memory Error detected by CPU15, errID 0x00000046.fd74b4e8
    Jan 7 04:14:20 pyre  AFSR 0x00000000.00100000<CE> AFAR 0x00000001.35bf4000
    Jan 7 04:14:20 pyre  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78330be0
    Jan 7 04:14:20 pyre  UDBH Syndrome 0xf2 Memory Module Board 6 J3801
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 686754 kern.notice] [AFT0] errID 0x00000046.fd74b4e8 Corrected Memory Error on Board 6 J3801 is Persistent
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 542367 kern.notice] [AFT0] errID 0x00000046.fd74b4e8 ECC Data Bit 9 was in error and corrected
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4000, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4008, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4010, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4018, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4020, Data 0x0eccf00d.ff250000, ECC 0xd5
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4028, Data 0x0eccf00d.ff250000, ECC 0xd5

  10. CE on Memory Module Board 6 J3801 (Third DIMM, Second Error)

    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 565072 kern.notice] [AFT0] Corrected Memory Error detected by CPU15, errID 0x00000046.fef0581a
    Jan 7 04:14:20 pyre  AFSR 0x00000000.00100000<CE> AFAR 0x00000001.35bf4040
    Jan 7 04:14:20 pyre  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78330be0
    Jan 7 04:14:20 pyre  UDBH Syndrome 0xf2 Memory Module Board 6 J3801
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 190951 kern.notice] [AFT0] errID 0x00000046.fef0581a Corrected Memory Error on Board 6 J3801 is Persistent
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 744456 kern.notice] [AFT0] errID 0x00000046.fef0581a ECC Data Bit 9 was in error and corrected
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4040, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4048, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4050, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4058, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4060, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4068, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4070, Data 0x0eccf00d.ff250040, ECC 0x5c
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4078, Data 0x0eccf00d.ff250040, ECC 0x5c

  11. CE on Memory Module Board 6 J3801 (Third DIMM, Third Error)

    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 372229 kern.notice] [AFT0] Corrected Memory Error detected by CPU15, errID 0x00000047.005d5687
    Jan 7 04:14:20 pyre  AFSR 0x00000000.00100000<CE> AFAR 0x00000001.35bf4080
    Jan 7 04:14:20 pyre  AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x78330be0
    Jan 7 04:14:20 pyre  UDBH Syndrome 0xf2 Memory Module Board 6 J3801
    Jan 7 04:14:20 pyre unix: [ID 596940 kern.warning] WARNING: [AFT0] 3 soft errors in less than 24:00 (hh:mm) detected from Memory Module Board 6 J3801

  12. CE count on J3801 has exceeded maximum acceptable; page removal is attempted

    Jan 7 04:14:20 pyre unix: [ID 618185 kern.notice] NOTICE: Scheduling removal of page 0x00000001.35bf4000
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 834227 kern.notice] [AFT0] errID 0x00000047.005d5687 Corrected Memory Error on Board 6 J3801 is Persistent
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 745085 kern.notice] [AFT0] errID 0x00000047.005d5687 ECC Data Bit 9 was in error and corrected
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4080, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4088, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4090, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf4098, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf40a0, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf40a8, Data 0x0eccf00d.ff250080, ECC 0xb1
    Jan 7 04:14:20 pyre SUNW,UltraSPARC-II: [ID 832828 kern.notice] [AFT0] Paddr 0x135bf40b0, Data 0x0eccf00d.ff250080, ECC 0xb1

  13. Page for Memory Module Board 6 J3801 is removed

    Jan 7 04:14:25 pyre unix: [ID 693633 kern.notice] NOTICE: Page 0x00000001.35bf4000 removed from service
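Log extracts like the one above can be summarized with a small script. The following Python sketch tallies classified CEs per DIMM and collects the pages reported as removed. The function name `summarize` and the regular expressions are assumptions based only on the message wording shown in this extract; other kernel releases may format these messages differently.

```python
import re
from collections import Counter

# Patterns inferred from the sample messages above (assumed, not an official format).
CE_RE = re.compile(r"Corrected Memory Error on (Board \d+ J\d+) is (\w+)")
REMOVED_RE = re.compile(r"Page (0x[0-9a-f]+\.[0-9a-f]+) removed from service")


def summarize(lines):
    """Return (CE counts keyed by (DIMM, category), list of removed pages)."""
    ce_counts = Counter()
    removed = []
    for line in lines:
        match = CE_RE.search(line)
        if match:
            ce_counts[(match.group(1), match.group(2))] += 1
            continue
        match = REMOVED_RE.search(line)
        if match:
            removed.append(match.group(1))
    return ce_counts, removed


# A few lines copied from the extract above.
sample = [
    "Jan 7 04:13:29 pyre SUNW,UltraSPARC-II: [ID 898376 kern.notice] [AFT0] errID 0x0000003b.24cd6aea Corrected Memory Error on Board 4 J3401 is Persistent",
    "Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 282000 kern.notice] [AFT0] errID 0x00000043.f7fb26ef Corrected Memory Error on Board 4 J3801 is Persistent",
    "Jan 7 04:14:07 pyre SUNW,UltraSPARC-II: [ID 203173 kern.notice] [AFT0] errID 0x00000043.f9761088 Corrected Memory Error on Board 4 J3801 is Persistent",
    "Jan 7 04:14:12 pyre unix: [ID 693633 kern.notice] NOTICE: Page 0x00000001.2bf6c000 removed from service",
]

counts, removed_pages = summarize(sample)
for (dimm, category), n in sorted(counts.items()):
    print(f"{dimm}: {n} {category} CE(s)")
print("Removed pages:", removed_pages)
```

In practice the same function could be fed the lines of the system log file instead of the inline sample.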