25.4 The "Split-Brain" Syndrome

The "split-brain" syndrome is easily described if we consider a simple two-node cluster like the one in Figure 25-1. If these nodes were to lose all LAN communications, how would they decide which of them was "best" to reform the cluster? Even with a serial heartbeat, we could still reach a situation where both nodes were individually healthy but for some reason could not communicate with each other. Serviceguard requires a cluster quorum of more than 50 percent of the previously running nodes. In the two-node situation described above, two equal-sized clusters could both try to reform, and if allowed to do so, we would have two instances of our applications running simultaneously, which is not a good idea. In this situation, we need a "tiebreaker." For Serviceguard, the tiebreaker is known as a cluster lock. Serviceguard now offers two forms of tiebreaker:
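The quorum rule above can be sketched in a few lines of Python. This is a hypothetical illustration, not Serviceguard code: `may_reform` and its parameters are invented names, but the logic follows the rule as stated, namely that a strict majority re-forms the cluster on its own, an exact 50/50 split is settled by the cluster lock, and a minority must halt.

```python
# Hypothetical sketch (not Serviceguard code) of the quorum rule:
# a surviving group may re-form the cluster only if it holds MORE than
# 50 percent of the previously running nodes; at exactly 50 percent the
# tiebreaker (cluster lock) decides; below 50 percent the group must halt.

def may_reform(group_size: int, previous_cluster_size: int,
               holds_cluster_lock: bool = False) -> bool:
    """Return True if this group of nodes is allowed to re-form the cluster."""
    if 2 * group_size > previous_cluster_size:    # strict majority: quorum held
        return True
    if 2 * group_size == previous_cluster_size:   # exact 50/50 split
        return holds_cluster_lock                 # tiebreaker decides
    return False                                  # minority: must shut down

# Two-node cluster splits 1/1: only the node that obtains the lock survives.
assert may_reform(1, 2, holds_cluster_lock=True) is True
assert may_reform(1, 2, holds_cluster_lock=False) is False
# Four-node cluster splits 3/1: the majority wins without needing the lock.
assert may_reform(3, 4) is True
```

Note that the minority group returning `False` corresponds to those nodes instigating a TOC rather than continuing to run.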

  • Cluster Lock Disk: This is a shared disk that both nodes can see and that is controlled by LVM. A cluster lock disk is not supported in a cluster of more than four nodes; a quorum server is.

  • Quorum Server: This is a separate node, not part of the cluster but contactable over a network interface (preferably on the same subnet to avoid delays over routers, and so on). This machine could be something as simple as a workstation running HP-UX 11.0 or 11i (either PA-RISC or IPF), or it could even be a machine running Linux. The Quorum Server listens for connection requests from the Serviceguard nodes on port 1238. The server maintains a special area in memory for each cluster, and when a node obtains the cluster lock, this area is marked so that other nodes will recognize the lock as "taken." A single Quorum Server may provide quorum services for more than one cluster.
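The "first claimant takes the lock" behaviour of the Quorum Server can be sketched as follows. This is a toy model, not the real quorum server protocol: the class and method names are invented, and a simple in-process dictionary stands in for the per-cluster memory area the server maintains.

```python
# Hypothetical sketch (not the real quorum server protocol): the server keeps
# one lock record per cluster; the first partition to claim it "takes" the
# lock, and later claimants see it as already taken and must shut down.
import threading

class QuorumServer:
    def __init__(self) -> None:
        self._guard = threading.Lock()        # serialize concurrent requests
        self._locks: dict[str, str] = {}      # cluster name -> winning node

    def try_acquire(self, cluster: str, node: str) -> bool:
        """Grant the cluster lock for `cluster` to the first node that asks."""
        with self._guard:
            if cluster not in self._locks:
                self._locks[cluster] = node   # mark the lock as "taken"
                return True
            return self._locks[cluster] == node

    def release(self, cluster: str) -> None:
        """Clear the lock record once the cluster has safely re-formed."""
        with self._guard:
            self._locks.pop(cluster, None)

qs = QuorumServer()                           # one server, multiple clusters
assert qs.try_acquire("clusterA", "node1") is True    # node1 wins the tiebreak
assert qs.try_acquire("clusterA", "node2") is False   # node2's group must TOC
assert qs.try_acquire("clusterB", "node3") is True    # independent cluster
```

The essential design point the sketch captures is that the grant must be atomic: two partitions racing for the lock can never both see it as free.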

The idea of a cluster lock extends to a cluster of any size; we do not want two groups, each containing 50 percent of the nodes previously in the cluster, trying to form two clusters of their own. Again, we would have two sets of applications trying to start up simultaneously, which is not a good idea. A cluster lock is a must in a two-node cluster because of the "split-brain" syndrome. In a three-node cluster, it is advisable because one node may be down for maintenance, leaving us back at a two-node cluster. For more than three nodes, a cluster lock is optional because it is unlikely that two groups of exactly equal size will form. Whichever group of nodes wins the tiebreaker will form the cluster; the other group will shut down by instigating a TOC (Transfer of Control). We look at a crashdump later, which tells us that Serviceguard caused the system to initiate a TOC.

Before we get started on actually configuring a cluster, I want to use just one last paragraph to remind you of some other hardware considerations.
