The RAScad model introduced in the previous section identifies the main parameters that affect the availability of a scalable service running on a two-node Sun Cluster stack when node failures occur. These parameters are:
Mean Time Between Failures (MTBF)
Probability of successful reconfiguration (p)
Recovery_Time: time taken for reconfiguration to complete
Mean Time To Repair a node (MTTR_1)
Node_Rejoin_Time: time for a node to rejoin the cluster
Percent increase in failure rate due to increased load (a)
Mean Time To Repair two nodes (MTTR_2)
Modeling the availability of this stack is tantamount to modeling these parameters and any others that come into play in other outage scenarios. Some of these parameters, such as Recovery_Time, directly relate to outages and can be measured by fault injection experiments in a laboratory setting. Others, such as MTBF, do not lend themselves to straightforward laboratory measurement. Yet other parameters have components that are not within the Sun Cluster system's control.
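To make the roles of these parameters concrete, the following sketch combines them into a simplified steady-state availability estimate. This is not the actual RAScad model from the previous section; the formula and all numeric values are illustrative assumptions. It treats the expected outage per failure as a mixture of the successful-reconfiguration case (outage of Recovery_Time) and the failed-reconfiguration case (outage bounded by MTTR_2), weighted by p.

```python
# Illustrative only: a simplified availability estimate built from the
# RAScad parameters. The formula is a sketch, not the actual RAScad
# model, and all numeric values are hypothetical.

def expected_outage_per_failure(p, recovery_time, mttr_2):
    """Expected service outage per node failure (hours).

    With probability p the cluster reconfigures successfully and the
    outage is just Recovery_Time; with probability (1 - p) the
    reconfiguration fails and the outage is bounded by MTTR_2.
    """
    return p * recovery_time + (1.0 - p) * mttr_2

def steady_state_availability(mtbf, outage):
    """Availability = uptime / (uptime + downtime)."""
    return mtbf / (mtbf + outage)

# Hypothetical parameter values (hours, except p).
mtbf = 8760.0                    # one node failure per year on average
p = 0.99                         # probability of successful reconfiguration
recovery_time = 120.0 / 3600.0   # 120-second reconfiguration
mttr_2 = 4.0                     # repair time when both nodes are down

outage = expected_outage_per_failure(p, recovery_time, mttr_2)
print(f"Expected outage per failure: {outage * 3600:.0f} seconds")
print(f"Steady-state availability:   {steady_state_availability(mtbf, outage):.6f}")
```

Note how strongly the result depends on p: even a small probability of failed reconfiguration lets the much larger MTTR_2 term dominate the expected outage.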
This section describes the methodology proposed for modeling Sun Cluster availability with a reasonable degree of accuracy, using a hybrid measurement/analysis approach. A few points must be noted first. The Sun Cluster system supports a large number of data services and hardware configurations, yielding a combinatorially large number of supported stacks. Each stack has its own intercomponent characteristics, and thus should be considered a separate target for availability evaluation. The modeling effort discussed in this article covers a subset of these stacks, although the methodology is general enough to be applicable to any Sun Cluster stack.
The proposed methodology addresses the inherent difficulty of evaluating the availability of software, by adopting a hybrid modeling approach that combines the techniques of black-box measurement, white-box analysis, and availability budgeting, as illustrated in FIGURE 2.
FIGURE 2 Hybrid Modeling Approach
Black-box measurement refers to the technique of treating the system as a black box and examining its outputs for a given set of inputs. This technique can be further subdivided into two approaches, depending on the ease of measurement in a laboratory setting. For parameters that can be measured in a laboratory, fault injection can be used to simulate real-world faults: the inputs are the faults injected, and the outputs are the occurrence and duration of any service outage. For parameters that cannot be measured through fault injection, one must rely on data collected from the field and from customers. While black-box measurement is technically easier than other modeling techniques, it has the disadvantage of not encapsulating any explanation of the system's behavior. This prevents extrapolation of the data obtained to any stack that differs even slightly from the stack at hand. The limited understanding of the system's behavior can also lead to an incorrect model, because blind assumptions must be made about various factors. Hence, results obtained for a given stack and environment cannot, in general, be applied to others, which limits the usefulness of this technique.
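The fault-injection side of black-box measurement can be pictured as a simple harness loop. The sketch below is a hypothetical illustration, not part of the Sun Cluster test infrastructure: `inject_fault` and `service_is_up` stand in for real harness hooks (for example, powering off a node and probing the data service from a client).

```python
# Hypothetical black-box fault-injection measurement loop. The two
# callables are assumed harness hooks, not Sun Cluster APIs: the system
# is treated purely as a black box whose client-visible availability is
# polled after a fault is injected.

import time

def measure_outage(inject_fault, service_is_up, poll_interval=0.1, timeout=600.0):
    """Inject one fault and return the observed outage duration in seconds.

    Returns 0.0 if no client-visible outage occurred, and None if the
    service went down but did not recover within the timeout.
    """
    inject_fault()
    start = time.monotonic()
    outage_start = None
    while time.monotonic() - start < timeout:
        if not service_is_up():
            if outage_start is None:
                outage_start = time.monotonic()   # outage begins
        elif outage_start is not None:
            return time.monotonic() - outage_start  # service recovered
        time.sleep(poll_interval)
    return 0.0 if outage_start is None else None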
White-box analysis, also called behavioral analysis, refers to the technique of analyzing the behavior of each individual component as well as its interactions. The advantage of this approach is obvious: it yields an accurate availability model. However, a complete white-box analysis requires a complete examination of all the components. Each component can be complicated, and there can be a large number of them. Furthermore, because software changes relatively rapidly, by the time the white-box analysis is complete the stack has often changed enough to require re-analysis. Thus, this is an expensive approach.
Clearly, an approach that combines aspects of both black-box measurement and white-box analysis is needed to model the availability of a complex software system in a reasonable time, with good accuracy, and with an ability to perform extrapolations.
An additional relevant aspect of a Sun Cluster software stack is the presence of third-party vendor layers in the stack. Typically, very little is known about the behavior of these layers or how they interact with other layers, so they essentially assume the role of black boxes. One pragmatic way to handle this is to first measure the availability characteristics of the third-party component under representative scenarios, determine the typical outage contributed by that layer, and then assign it an availability budget. An availability budget is essentially an upper bound on the outage time contributed by a layer, and on the number of times the corresponding component can fail. These upper bounds can be expressed in a variety of ways, ranging from a single number to a probability density function. The budget must be set for the current as well as future revisions of the component, and serves as a proxy for a white-box analysis of it.
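In its simplest form (a single number per bound, rather than a probability density function), an availability budget can be checked mechanically against observed incidents. The sketch below is illustrative; the thresholds and field names are assumptions, not values from any actual Sun Cluster budget.

```python
# Illustrative availability-budget check for a third-party (black-box)
# layer. The budget caps both the total outage time the layer may
# contribute per year and the number of failures; all thresholds here
# are hypothetical.

from dataclasses import dataclass

@dataclass
class AvailabilityBudget:
    max_outage_seconds_per_year: float
    max_failures_per_year: int

    def is_met(self, observed_outages):
        """observed_outages: per-incident outage durations (seconds) over a year."""
        return (sum(observed_outages) <= self.max_outage_seconds_per_year
                and len(observed_outages) <= self.max_failures_per_year)

budget = AvailabilityBudget(max_outage_seconds_per_year=300.0,
                            max_failures_per_year=4)
print(budget.is_met([60.0, 45.0]))    # within budget: True
print(budget.is_met([200.0, 150.0]))  # total outage exceeds the cap: False
```

Bounding the failure count separately from the total outage time matters: many short outages can be as disruptive to clients as one long one, even when the cumulative downtime is within the time cap.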
The proposed methodology thus adopts a hybrid approach of using black-box measurement with white-box analysis and availability budgeting to model the availability of a specified Sun Cluster software stack. The methodology dictates the following basic steps towards building an availability model.
Define the stack, the application used to simulate load on the data service in question, and the method/client used to measure the outage.
Select a set of representative faults to inject into the system, and measure the outage observed from the client.
Analyze the outage in terms of the components contributing to each outage.
Investigate how the component times can affect the outage.
This information can be used to extrapolate the change in outage with respect to changing the stack in some well-defined ways.
Determine the components needing to be assigned budgets.
Collect data from deployed clusters representing that stack and use this data to measure parameters that cannot be measured through fault injection.
Derive models for each of the parameters that contribute to system availability, based on the data collection and analysis efforts.
Plug these values into the RAScad models.
There will be one value per primary failure type.
Derive an overall availability model from these RAScad models.
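The final roll-up step can be sketched as follows. This is a hypothetical illustration of combining one result per primary failure type into an overall availability figure; the failure types, MTBF values, and outage figures are all made up, and the real derivation would come from the RAScad models themselves.

```python
# Hypothetical roll-up of per-failure-type results into an overall
# availability estimate. Each entry pairs a primary failure type with
# its MTBF and expected outage per failure (both in hours); all numbers
# are invented for illustration.

failure_types = {
    "node panic":        {"mtbf": 8760.0,  "outage": 0.02},
    "network partition": {"mtbf": 17520.0, "outage": 0.05},
    "storage failure":   {"mtbf": 26280.0, "outage": 0.10},
}

HOURS_PER_YEAR = 8766.0

# Annual downtime contributed by each failure type: expected failures
# per year (hours per year / MTBF) times expected outage per failure.
downtime = sum(HOURS_PER_YEAR / ft["mtbf"] * ft["outage"]
               for ft in failure_types.values())

availability = (HOURS_PER_YEAR - downtime) / HOURS_PER_YEAR
print(f"Estimated annual downtime: {downtime * 60:.1f} minutes")
print(f"Overall availability:      {availability:.6f}")
```

The additive form assumes outages from different failure types do not overlap, which is reasonable when individual outages are short relative to the intervals between failures.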
Once the stack is finalized, the scenarios for which measurement and analysis need to be conducted must be determined. This entails finding a representative subset of the total set of faults that can occur for the stack, since the total number is too large to consider exhaustively. Over the last few decades, there has been a significant amount of research on the representativeness of fault injection experiments for assessing various system metrics [6]. While there is considerable controversy around this issue, given the lack of alternatives, fault injection is still regarded as the most viable approach for both measurement and modeling activities. The methodology proposed here chooses fault injection scenarios based on customer and field input, as well as on faults seen and simulated in the development laboratories as part of Sun Cluster software testing. Sun Sigma tools such as QFD analysis [7] are then used to prioritize the tests.