Recovery and Performance Measurements

To better understand the recovery capabilities on SPARC processor-based Solaris systems running ORACLE, we designed and conducted several lab experiments that involved building an ORACLE database running an online transaction processing (OLTP) workload. We then injected a fatal fault that caused the system to crash. From this experiment, we measured recovery time. We ran two types of tests:

  • Fault injection tests. We introduced a fatal fault while the system was under load. We measured the recovery times when the system and database were restarted.

  • Performance tests. We completed OLTP runs to assess the performance impact of recovery tuning.

We used the following two systems in the experiment:

  • System A: Sun Enterprise Ultra 4000 with 8 CPUs, 4 gigabytes of RAM, and 112 disks; running Solaris 8 OE and ORACLE 8.1.7. The Solaris Volume Manager software was used to configure RAID devices.

  • System B: Sun Fire 6800 with 24 CPUs, 196 gigabytes of RAM, and 540 disks; running Solaris 9 OE and ORACLE 9.2. The Veritas Volume Manager software was used to configure RAID devices.

To cause a fatal fault, we executed a simple script as root to force a kernel panic. The system A results are summarized in TABLE 5.
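
The panic script itself is not reproduced here. One illustrative way to force a panic and crash dump on Solaris (not necessarily the exact method used in these tests) is a small wrapper around reboot -d, run as root:

    #!/bin/sh
    # Force a kernel panic and crash dump, then reboot (run as root).
    # The -d flag asks the kernel to take a forced crash dump, which
    # panics the system; savecore saves the dump during the next boot.
    /usr/sbin/reboot -d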

TABLE 5 System A Test Results

Throughput    Chkpnt    % perf    Recovery Time    Recovery Speedup
7965.90       None      -         1624             -
7681.70       1800      3.57      1824             0.89
6154.70       900       22.74     1011             1.61
6054.87       600       23.99     981              1.66
5974.20       300       25.00     340              4.78
5690.70       150       28.56     249              6.52

The table heading definitions are as follows:

  • Throughput is the OLTP workload performance, in transactions per minute.

  • Chkpnt is the checkpoint interval in seconds.

  • % perf is the performance regression, in percent, from the base run (no checkpointing).

  • Recovery Time is the time, in seconds, that ORACLE recovery required.

  • Recovery Speedup is the factor by which recovery was faster than in the no-checkpoint case.

NOTE

This set of runs was done using hard checkpointing only; checkpoints were issued at the desired intervals through a script.
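
The checkpoint driver is not reproduced in this article. A minimal sketch of such a script, assuming SQL*Plus is installed on the database host and a local "/ as sysdba" connection is allowed, might look like the following:

    #!/bin/sh
    # Issue a hard ORACLE checkpoint every INTERVAL seconds until the
    # script is interrupted.  Assumes ORACLE_HOME and ORACLE_SID are set.
    INTERVAL=${1:-300}

    while true
    do
        sqlplus -s "/ as sysdba" <<EOF
    alter system checkpoint;
    exit
    EOF
        sleep $INTERVAL
    done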

Our data shows an extreme case: extreme because we used hard checkpointing, and the system was I/O constrained. It does, however, give a clear indication that recovery time can be improved substantially (by a factor of 6), albeit in this case with a substantial trade-off in performance (a 28% regression). The primary lesson from this run is that recovery tuning ultimately comes down to IOPS: tuning an I/O-constrained system to meet aggressive recovery goals can significantly degrade performance when hard checkpointing is used.

We then moved the test to a much larger system (System B), where we built a larger database and used ORACLE 9.2.

The primary testing involved measuring improvements using recovery_parallelism and increasing the message buffer size via parallel_execution_message_size. TABLE 6 shows the results.

TABLE 6 System B Test Results

No RP/2k    24/2k     24/4k     24/8k     24/16k    24/32k    24/64k
43m59s      33m04s    26m16s    23m18s    22m40s    22m25s    22m31s


The column headings indicate the parallelism level and the message size. The times are the duration of the ORACLE recovery phase, from the beginning of crash recovery through completion of the database open. The first column has no recovery parallelism and a 2-kilobyte message size. Subsequent runs used 24 recovery processes, and only the message size was varied.

Our data shows significant improvement in recovery through parallelism: a 25% reduction in recovery time. Further improvements were seen as the message size increased. Increasing to an 8-kilobyte message size (with parallelism) cut recovery time roughly in half. Note that these tests were done without turning on the fast-start mechanism, so there was no application performance trade-off in making these changes.

NOTE

Message size adjustments also affect parallel query slaves. We did not test a workload running parallel queries, so the performance impact there is not known.

In our case, the optimal message size for tuning was 8 kilobytes. Beyond that, improvements in recovery time were negligible.
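
For reference, the combination that produced the best System B recovery times (24 recovery processes and 8-kilobyte messages) corresponds to initialization parameter settings along these lines; this is a sketch, not the actual parameter file from our runs:

    # init.ora fragment (illustrative): recovery parallelism and
    # parallel execution message size in bytes (8 KB = 8192)
    recovery_parallelism = 24
    parallel_execution_message_size = 8192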

We ran another set of tests after tuning fast_start_mttr_target to measure the impact on performance. TABLE 7 summarizes the results.

TABLE 7 System B Alternate Tuning Test Results

MQTh         fast_start_mttr_target (sec)    % of base
125927.13    0                               -
115580.67    300                             91.78
116199.47    600                             92.28
117947.8     900                             93.66
117574.8     1200                            93.37

Our data shows a very reasonable performance trade-off for aggressive recovery settings: just over 8% performance regression in the worst case.

On the recovery side, ORACLE 9.2 consistently meets or exceeds the MTTR target settings, with the exception of the 300-second setting, and that was only off by about 6%: at a target of 300, ORACLE startup took 318 seconds. We suspect the 300 target was not met because this was the first test run, so the ORACLE MTTR statistics were estimates rather than measured values. Subsequent settings (600 and up) resulted in ORACLE startup times consistently under the target value. Note that all of these measurements were done with 24 recovery processes and an 8-kilobyte message size.
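
For readers who want to reproduce this kind of check, ORACLE 9.2 exposes its running MTTR estimate in the V$INSTANCE_RECOVERY view. A brief SQL*Plus sketch follows; the exact queries used in our testing are not shown in this article:

    -- Set an aggressive recovery target of 300 seconds; the parameter
    -- is dynamic in ORACLE 9.2, so no instance restart is required.
    alter system set fast_start_mttr_target = 300;

    -- Compare the target with the instance's current MTTR estimate.
    select target_mttr, estimated_mttr
      from v$instance_recovery;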
