Failures in Clustered Systems
A number of faults are specific to clustered systems. These faults affect synchronization between nodes, arbitration of conflicts over shared components, or the presentation of a single view of data or services to external users and systems. Cluster Failures describes how Sun Cluster 3.0 avoids and handles these faults. The sections that follow describe the split-brain, multiple instance, and amnesia faults.
A split-brain failure occurs when a failure causes a single cluster to reconfigure into multiple partitions, each of which forms without knowledge of the existence of any other. Because each partition behaves as a complete cluster and does not know that the others exist, the partitions can collide over shared resources. Network addresses can be duplicated, because each partition believes that it should own the shared network address. Shared storage can also be affected, because each partition believes that it owns the shared storage. The result can be severe data corruption.
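The collision described above can be illustrated with a small simulation. This is only a sketch of the failure itself, not of how Sun Cluster 3.0 behaves: the Partition class, the split helper, the node names, and the address 192.0.2.10 are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field

SHARED_ADDRESS = "192.0.2.10"  # hypothetical shared address (documentation range)


@dataclass
class Partition:
    """A group of nodes that formed a cluster without knowing about any other group."""
    members: frozenset
    owned: set = field(default_factory=set)

    def reconfigure(self):
        # Each partition sees only its own members, so it concludes that
        # it is the whole cluster and takes over every shared resource.
        self.owned.add(SHARED_ADDRESS)


def split(nodes, groups):
    """Split one cluster into independent partitions (hypothetical helper)."""
    assert set().union(*groups) == set(nodes)
    return [Partition(frozenset(g)) for g in groups]


def conflicting_resources(partitions):
    """Return the resources claimed by more than one partition."""
    seen, conflicts = set(), set()
    for p in partitions:
        conflicts |= seen & p.owned
        seen |= p.owned
    return conflicts


partitions = split(["node1", "node2", "node3", "node4"],
                   [{"node1", "node2"}, {"node3", "node4"}])
for p in partitions:
    p.reconfigure()

print(conflicting_resources(partitions))
```

Both partitions end up claiming the same address, which is the duplicated-ownership condition that a cluster must prevent.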
Amnesia and Temporally Split Configurations explains how Sun Cluster 3.0 avoids the split-brain condition.
A multiple instance failure occurs when an application that is designed to operate on data with exclusive access is started more than once, so that several instances operate on the same data. On a single system there are many ways to prevent this failure, such as semaphores, checking the process list, and creating lock files. In a clustered environment, prevention becomes more difficult. Semaphores and process list checks apply only to the local node. Lock files work, but only if they reside on shared storage. Even then, a simple lock file does not suffice: the application may have failed without releasing the lock, so it needs a mechanism that queries each node in the cluster to determine whether another instance is actually running.
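The lock file approach with a liveness check might look like the following minimal sketch. Several things here are assumptions rather than anything the text specifies: a temporary directory stands in for shared storage, exclusive file creation is assumed to be atomic on that storage, and is_alive is a hypothetical callback that represents querying the owning node. The names try_acquire, app.lock, and the node and process identifiers are likewise illustrative.

```python
import json
import os
import tempfile


def try_acquire(lock_path, node, pid, is_alive):
    """Try to take the application lock; return True on success.

    is_alive(node, pid) is a hypothetical callback that asks the named
    node whether the given process is still running there.
    """
    for _ in range(2):  # second pass runs only after removing a stale lock
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            with open(lock_path) as f:
                owner = json.load(f)
            if is_alive(owner["node"], owner["pid"]):
                return False  # another instance really is running
            # The owner failed without releasing the lock. Removing it here
            # is racy if two nodes do so at once; this sketch ignores that.
            os.remove(lock_path)
            continue
        with os.fdopen(fd, "w") as f:
            json.dump({"node": node, "pid": pid}, f)
        return True
    return False


running = {("node1", 100)}  # simulated set of live (node, pid) pairs


def is_alive(node, pid):
    return (node, pid) in running


lock_dir = tempfile.mkdtemp()  # stand-in for shared storage
lock = os.path.join(lock_dir, "app.lock")

first = try_acquire(lock, "node1", 100, is_alive)   # lock is free
second = try_acquire(lock, "node2", 200, is_alive)  # node1 holds it and is alive
running.clear()                                     # node1's instance crashes
running.add(("node2", 200))
third = try_acquire(lock, "node2", 200, is_alive)   # stale lock is replaced
```

The lock file alone would have blocked the third attempt forever. Recording the owner in the file and checking whether that owner is still running is what allows the stale lock to be recovered.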
Amnesia is a failure mode in which a node starts with stale cluster configuration information. This is a synchronization error, because the current cluster configuration information was not propagated to all of the nodes. For example, if a node fails and the cluster is then reconfigured, the failed node's copy of the cluster configuration information is stale. If the node tries to rejoin the cluster, it must first resynchronize its cluster configuration information.
A more difficult situation occurs when the node fails, the cluster is reconfigured, the cluster is brought down, and then the failed node is brought back up. In this case no running cluster exists to contradict the node, so the stale cluster configuration information is presumed to be correct, and a new cluster is built with the stale configuration information.
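The two amnesia scenarios above can be contrasted in a short sketch. It assumes, purely for illustration, that each copy of the configuration carries a generation counter that is incremented on every reconfiguration; the Config class, the boot function, and the node names are hypothetical, and the text does not say that Sun Cluster 3.0 works this way.

```python
from dataclasses import dataclass


@dataclass
class Config:
    generation: int  # hypothetical counter, bumped on every reconfiguration
    members: tuple


def boot(local, running_cluster):
    """Return the configuration a booting node ends up using.

    running_cluster is the live cluster's Config, or None if no
    cluster is up when the node starts.
    """
    if running_cluster is None:
        # Nothing to compare against: the local copy is presumed correct,
        # even if it is stale. This is the amnesia case.
        return local
    if local.generation < running_cluster.generation:
        return running_cluster  # resynchronize from the live cluster
    return local


old = Config(1, ("node1", "node2", "node3"))
new = Config(2, ("node1", "node2"))  # written after node3 failed

rejoin = boot(old, new)     # node3 rejoins a running cluster
amnesia = boot(old, None)   # node3 boots after the cluster was shut down
```

In the first call the stale copy is detected and replaced. In the second call the node has nothing to compare its copy against, so the stale configuration is used to form the new cluster.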