This chapter is from the book

Clusters

Even with RAID and Ethernet bonding on a host there are plenty of other components that can fail, from the CPU to the software on the host. If you need a service to stay up even when a host fails, then you need a cluster. There are a number of different ways to set up Linux clusters, and there are many different kinds of clusters as well. In this section I will discuss one of the most common tools used in basic Linux clusters, Heartbeat, and how to use it to create a basic fault-tolerant service across two servers. Afterward I will discuss how to use a tool called DRBD to replicate data between two servers over the network. These two tools provide a solid foundation you can use for any number of fault-tolerant services.

As you work with clusters, you will find that most clustering technologies share the same concepts for cluster management. Below are some of the basic concepts and terms you will run into when you build a cluster:

  • Floating IPs
  • In a standard active/passive Heartbeat cluster, each node (server) has its main IP and there is an additional floating IP that is shared between the nodes. Only the node that is considered active will use and answer to the floating IP address. Services are hosted off of the floating IP address so that when a particular host goes down and the fail-over node assumes the floating IP, it can take over the service.

  • Active/active versus active/passive
  • In an active/active cluster all nodes are running and accepting load at all times. In an active/passive cluster one node is considered the master and accepts all of the load while any other nodes take load only when the master goes down. My examples will be based on an active/passive cluster.

  • Split-brain syndrome
  • Split-brain syndrome occurs in an active/passive cluster when both nodes believe they are the master and try to assume the load. This can be disastrous for a cluster, especially in the case of shared storage and floating IPs, as both nodes will try to write to storage (that may not accept writes from multiple sources) as well as try to grab the floating IP for themselves. As you will see, one of the big challenges in clustering is identifying when a host is truly down and avoiding split-brain syndrome.

  • Quorum
  • Clusters often use the concept of a quorum to determine when a host is down. The idea behind a quorum is to have a consensus among the members of your cluster about who can talk to whom. This typically works only when you have at least three hosts; in a two-node cluster if node A can’t talk to node B, it can be difficult for node A to know whether its network is down or node B’s network is down. With a third node you can set a quorum of two, so that at least two nodes must be unable to reach another node before that node is considered down.

  • Fencing
  • Fencing is one of the methods used to avoid split-brain syndrome. The name is derived from the idea of building a fence around a downed node so that it can’t become the active node again until its problems have been resolved. There are a number of ways to fence a machine, from turning off its network ports to rebooting it or triggering a kernel panic (aka shooting the other node in the head).

  • Shooting the other node in the head
  • This term is used to describe a particularly direct response to fence a server. When a cluster determines a host is unavailable, it will often forcibly kill the server by either a reboot, a kernel panic, or even remotely power cycling the machine. The idea is that once the system reboots, it should be back to some sort of consistent state and should be able to rejoin the cluster safely.

  • Separate connection for node monitoring
  • A common practice for clusters is to have a separate connection that the nodes use to monitor each other. The idea here is to prevent normal traffic from slowing down or interfering with communications between nodes. Some administrators solve this by monitoring over each node’s serial port or connecting a crossover network cable between a second set of Ethernet ports.
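The quorum rule described above comes down to simple majority arithmetic. Here is a minimal sketch of that arithmetic, assuming a hypothetical helper called has_quorum (it is not part of Heartbeat or any other cluster tool) that takes the number of nodes a host can currently see, counting itself, and the total size of the cluster:

```shell
# has_quorum REACHABLE TOTAL
# Hypothetical helper, not part of Heartbeat: succeeds when REACHABLE
# (the nodes this host can see, including itself) is a strict majority
# of TOTAL.
has_quorum() {
    [ $(( $1 * 2 )) -gt "$2" ]
}

# In a three-node cluster, two nodes that can see each other have quorum,
# while an isolated node does not.
has_quorum 2 3 && echo "2 of 3: quorum"
has_quorum 1 3 || echo "1 of 3: no quorum"

# In a two-node cluster a lone node can never form a majority.
has_quorum 1 2 || echo "1 of 2: no quorum"
```

The last case is why a two-node cluster has to lean on other tricks, such as ping nodes and fencing, to decide which side is really down.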

Heartbeat

Heartbeat is one of the core Linux clustering tools. The idea behind Heartbeat is to provide a system to describe a cluster and then monitor each node to see if it is available. When a node is unavailable, Heartbeat can then take some sort of action in response. Responses might include moving a floating IP to a fail-over host, starting or stopping particular services, mounting particular file systems, and fencing the unavailable host.

There are two main methods of Heartbeat configuration these days. The classic Heartbeat configuration is based on a few basic text configuration files that are relatively easy to configure by hand. The traditional Heartbeat cluster works with only two nodes. The newer Heartbeat 2 configuration model relies on XML files and can support clusters of more than two nodes. These new features and file formats can introduce some complexity, especially when you are just getting started with clustering, so for the purposes of this chapter, since my example features only a two-node cluster, I am going to stick with the traditional Heartbeat configuration. The traditional Heartbeat configuration relies on three main configuration files: ha.cf, haresources, and authkeys.

  • ha.cf
  • This file defines all of the settings for a particular cluster, including the nodes in the cluster, what methods Heartbeat uses to communicate with each node, and the time-outs to use for any fail-over.

  • haresources
  • Here you will configure the responses a node will take as a result of a fail-over. This might be assuming a floating IP, mounting or unmounting file systems, or starting or stopping services.

  • authkeys
  • This file contains a secret key that all of the nodes have in common. This key is used as a method of authentication so that the nodes know they are speaking to valid members of the cluster.

Example Cluster

In my example I will set up a two-node active/passive Apache cluster. I will assume that Apache hosts static files (i.e., I don’t need replicated storage at this point). Here is the information about node1 and node2, the two different servers in my cluster:

node1
eth0: 172.16.245.220/255.255.255.0
eth1: 192.168.10.1/255.255.255.0

node2
eth0: 172.16.245.221/255.255.255.0
eth1: 192.168.10.2/255.255.255.0

I will use eth0 to host the service. I use eth1 as a private network between each host. You could set this up either with a crossover cable between the eth1 ports on both servers or through a separate switch. In addition, I will set up a floating IP at 172.16.245.222 that any host that wants to access this Web site will use. I have already installed and configured Apache on each of these hosts.

Install and Configure Heartbeat

The Heartbeat software is packaged and available for Ubuntu Server as the package heartbeat, so you can use your preferred package manager to install it:

$ sudo apt-get install heartbeat

The package will automatically create the /etc/ha.d directory and the init scripts you need for the service, but it won’t set up any of the main three configuration files, ha.cf, haresources, or authkeys, so I will go into how to configure those here.

ha.cf

The Heartbeat package provides an annotated sample ha.cf file under /usr/share/doc/heartbeat, so be sure to use that as a resource if you want examples or further information. Here is the /etc/ha.d/ha.cf file I used in my cluster:

autojoin none 
bcast eth1
warntime 5
deadtime 10 
initdead 30 
keepalive 2 
logfacility local0
node node1.example.org
node node2.example.org
respawn hacluster /usr/lib/heartbeat/ipfail
ping 172.16.245.5 172.16.245.1
auto_failback off

A copy of this file will go on both node1 and node2. Each of these options is important, so I will describe them below:

  • autojoin
  • You can choose to have nodes automatically join a cluster using the shared secret in the authkeys file as authentication. For large clusters that constantly add or delete nodes this might be a useful option to enable so that you aren’t constantly rolling out and updating your ha.cf file to list new nodes. In my case, since I have only two nodes, I have disabled this option.

  • bcast
  • There are a number of different ways each node can communicate to the others for Heartbeat and other communications. Each of these options is documented in /usr/share/doc/heartbeat/ha.cf.gz, but here is a summary. If you use the serial option, you can check Heartbeat over a null modem cable connected to each node’s serial port. If you use mcast, you can define a multicast interface and IP address to use. In my case I used bcast, which broadcasts over my eth1 interface (the private network I set up). You can also specify ucast, which allows you to simply communicate via a unicast IP. That could be useful if you want to limit broadcast traffic or if you have only one interface (or a pair of bonded ports) and want to use your standard IP addresses to communicate. By default Heartbeat uses UDP port 694 for communication, so if you have a firewall enabled, be sure to add a rule to allow that access.

  • warntime
  • This is the time in seconds during which a communication can time out before Heartbeat will add a note in the logs that a node has a delayed heartbeat. No action will be taken yet.

  • deadtime
  • This is the time in seconds during which a communication can time out before Heartbeat will consider it dead and start the fail-over process.

  • initdead
  • On some machines it might take some time after Heartbeat starts for the rest of the services on the host to load. This option allows you to configure a special time-out setting that takes effect when the system boots before the node is considered dead.

  • keepalive
  • This is the number of seconds between each heartbeat.

  • logfacility
  • Here I can configure what syslog log facility to use. The local0 value is a safe one to pick and causes Heartbeat to log in /var/log/syslog.

  • node
  • The node lines are where you manually define each node that is in your cluster. The syntax is node nodename, where nodename is the hostname a particular node gives when you run uname -n on that host. Add node lines for each host in the cluster.

  • respawn
  • Since I have a separate interface for Heartbeat communication and for the regular service, I want to enable the ipfail script. This script will perform various network checks on each host so it can determine whether a host has been isolated from the network but can still communicate with other nodes. This respawn line tells Heartbeat to start the ipfail script as the hacluster user, and if it exits out to respawn it.

  • ping
  • This option goes along with the ipfail script. Here I can define extra hosts that a node can use to gauge its network connectivity. You want to choose stable hosts that aren’t nodes in the cluster, so the network gateway or other network infrastructure hosts are good to add here.

  • auto_failback
  • This is an optional setting. Heartbeat defines what it considers a master node. This node is the default node from which the service runs. By default, when the master node goes down and then recovers, the cluster will fail back to that host. Just to avoid IPs and services flapping back and forth you may choose to disable the automatic failback.
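Two mistakes are easy to make in ha.cf: time-outs that are out of order and node lines that don't match what uname -n reports. The following is a rough sketch of a sanity check, not something shipped with Heartbeat. The function name check_hacf is made up, the ordering rule (keepalive < warntime < deadtime <= initdead) is my own rule of thumb, and the script assumes all time-outs are written as whole seconds:

```shell
# check_hacf FILE [NODENAME]
# Hypothetical sanity check (not shipped with Heartbeat). Verifies that
#   keepalive < warntime < deadtime <= initdead
# and that NODENAME (default: the output of uname -n) has a node line.
check_hacf() (
    file=$1
    name=${2:-$(uname -n)}
    # Print the value of a single-valued directive such as "deadtime 10".
    get() { awk -v k="$1" '$1 == k { print $2; exit }' "$file"; }
    keepalive=$(get keepalive); warntime=$(get warntime)
    deadtime=$(get deadtime);   initdead=$(get initdead)
    status=0
    if ! [ "$keepalive" -lt "$warntime" ] || ! [ "$warntime" -lt "$deadtime" ] \
        || ! [ "$deadtime" -le "$initdead" ]; then
        echo "timing values are missing or out of order" >&2
        status=1
    fi
    # Look for NODENAME among the names listed on any "node" line.
    if ! awk -v n="$name" '
            $1 == "node" { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file"; then
        echo "no node line for $name" >&2
        status=1
    fi
    exit $status
)

# Usage on a cluster node:
#   check_hacf /etc/ha.d/ha.cf && echo "ha.cf looks sane"
```

Because the function body runs in a subshell, its variables and its exit call don't leak into the shell that calls it.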

Once the ha.cf file is saved and deployed to both node1 and node2, I can move on to haresources.

haresources

The /etc/ha.d/haresources file defines what resources the cluster is managing so it can determine what to do when it needs to fail over. In this file you can define the floating IP to use and can also list services to start or stop, file systems to mount and unmount (which I will discuss in the DRBD section), e-mails to send, and a number of other scripts that are located under /etc/ha.d/resource.d. In my case the haresources file is pretty simple:

node1 172.16.245.222 apache2

The first column defines which node is considered the default, in my case node1. Next I define the floating IP to use for this cluster, 172.16.245.222. Finally, I can define a list of services to start or stop when this node is active. Since Apache is started by /etc/init.d/apache2, I choose apache2 here. The example above used some configuration shorthand since I had some pretty basic needs. The longer form of the same line is

node1 IPaddr::172.16.245.222 apache2

The IPaddr section tells Heartbeat to use the /etc/ha.d/resource.d/IPaddr script and pass the 172.16.245.222 argument to it. With the default settings of the IPaddr script, the floating IP address will take on the same subnet and broadcast settings as the other IP address on the same interface. If I wanted the subnet mask to be /26, for instance, I could say

node1 IPaddr::172.16.245.222/26 apache2

The apache2 section at the end is also shorthand. By default Heartbeat will run the service with the start argument when a node becomes active and with the stop argument when a node is disabled. If you created a special script in /etc/ha.d/resource.d/ or /etc/init.d/ and wanted to pass special arguments, you would just list the service name, two colons, then the argument. For instance, if I created a special script called pageme that sent an SMS to a phone number, my haresources line might read

node1 172.16.245.222 apache2 pageme::650-555-1212

Once you have created your haresources file, copy it to /etc/ha.d/ on both nodes, and make sure that it stays identical.
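To make the shorthand rules above concrete, here is a small illustrative function that prints the script name and arguments a single haresources entry stands for. It is only a sketch of the naming rules described in this section: expand_resource is a made-up name, the IP detection is a crude pattern match, and the real parsing and script lookup happen inside Heartbeat itself.

```shell
# expand_resource TOKEN ACTION
# Illustrative only: prints the script name and arguments that one
# haresources entry stands for, followed by ACTION (start or stop).
expand_resource() (
    token=$1
    action=$2
    case $token in
        # A bare IP address is shorthand for IPaddr::address.
        [0-9]*.[0-9]*.[0-9]*.[0-9]*) token="IPaddr::$token" ;;
    esac
    # "script::arg1::arg2" becomes "script arg1 arg2".
    echo "$(printf '%s\n' "$token" | sed 's/::/ /g') $action"
)

expand_resource 172.16.245.222 start            # IPaddr 172.16.245.222 start
expand_resource IPaddr::172.16.245.222/26 start # IPaddr 172.16.245.222/26 start
expand_resource apache2 stop                    # apache2 stop
expand_resource pageme::650-555-1212 start      # pageme 650-555-1212 start
```

The first line of output lines up with the "Running /etc/ha.d/resource.d/IPaddr 172.16.245.222 start" entry that Heartbeat writes to syslog during a fail-over.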

authkeys

The final step in the process is the creation of the /etc/ha.d/authkeys file. This file contains some method Heartbeat can use to authenticate a node with the rest of the cluster. The configuration file contains one line starting with auth, then a number that defines which line below it to use. The next line begins with a number and then a type of authentication method. If you use a secure private network like a crossover cable, your authkeys might just look like this:

auth 1
1 crc

This option doesn’t require heavy CPU resources since the communications aren’t signed with any particular key. If you are going to communicate over an open network, you will likely want to use either MD5 or SHA1 keys. In either case the syntax is similar:

auth 2
1 crc
2 sha1 thisisasecretsha1key
3 md5 thisisasecretmd5key

Here you can see I have defined all three potential options. The secret key you pass after sha1 or md5 is basically any secret you want to make up. Notice in the example above I set auth 2 at the top line so it will choose to authenticate with SHA1. If I had wanted to use MD5 in this example, I would set auth to 3 since the MD5 configuration is on the line that begins with 3. Once you create this file and deploy it on all nodes, be sure to set it so that only root can read it, since anyone who can read this file can potentially pretend to be a member of the cluster:

$ sudo chmod 600 /etc/ha.d/authkeys
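If you would rather not invent the secret yourself, you can generate one. The sketch below is one possible approach, not an official Heartbeat tool: write_authkeys is a hypothetical helper that pulls 20 random bytes from /dev/urandom and writes a two-line sha1 authkeys file. It sets a restrictive umask first so that a newly created file is never readable by other users, even briefly.

```shell
# write_authkeys FILE
# Sketch: writes a sha1 authkeys file with a random 40-hex-digit secret.
# A new file is created with mode 600 because of the umask; an existing
# file keeps whatever mode it already had.
write_authkeys() (
    umask 077
    key=$(od -An -N20 -tx1 /dev/urandom | tr -d ' \n')
    printf 'auth 1\n1 sha1 %s\n' "$key" > "$1"
)

# Try it against a scratch directory. On a real node you would run it as
# root against /etc/ha.d/authkeys and copy the result to the other node.
scratch=$(mktemp -d)
write_authkeys "$scratch/authkeys"
ls -l "$scratch/authkeys"
```

Remember that every node needs the same key, so generate the file once and copy it rather than running the function on each node.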

Once these files are in place, you are ready to start the cluster. Start with the default node you chose in haresources (in my case node1) and type

$ sudo /etc/init.d/heartbeat start

Once it starts, move to the other node in the cluster and run the same command. You should be able to see Heartbeat start to ping nodes and confirm the health of the cluster in the /var/log/syslog file. At this point you are ready to test fail-over. Open a Web browser on a third host and try to access the Web server on the floating IP (in my case 172.16.245.222) and make sure it works. Then disconnect the main network interface on the active host. Depending on the time-outs you configured in /etc/ha.d/ha.cf, it will take a few seconds, but your fail-over host should start talking about the failure in the logs and will assume the floating IP and start any services it needs. Here’s some sample output from a syslog file during a fail-over from node1 to node2:

Feb 16 17:37:56 node2 ipfail: [4340]: debug: Got asked for num_ping.
Feb 16 17:37:57 node2 ipfail: [4340]: debug: Found ping node 172.16.245.1!
Feb 16 17:37:57 node2 ipfail: [4340]: debug: Found ping node 172.16.245.5!
Feb 16 17:37:58 node2 ipfail: [4340]: info: Telling other node that we have more visible ping nodes.
Feb 16 17:37:58 node2 ipfail: [4340]: debug: Sending you_are_dead.
Feb 16 17:37:58 node2 ipfail: [4340]: debug: Message [you_are_dead] sent.
Feb 16 17:37:58 node2 ipfail: [4340]: debug: Got asked for num_ping.
Feb 16 17:37:58 node2 ipfail: [4340]: debug: Found ping node 172.16.245.1!
Feb 16 17:37:59 node2 ipfail: [4340]: debug: Found ping node 172.16.245.5!
Feb 16 17:37:59 node2 ipfail: [4340]: info: Telling other node that we have more visible ping nodes.
Feb 16 17:37:59 node2 ipfail: [4340]: debug: Sending you_are_dead.
Feb 16 17:37:59 node2 ipfail: [4340]: debug: Message [you_are_dead] sent.
Feb 16 17:38:05 node2 heartbeat: [4255]: info: node1 wants to go standby [all]
Feb 16 17:38:06 node2 ipfail: [4340]: debug: Other side is unstable.
Feb 16 17:38:06 node2 heartbeat: [4255]: info: standby: acquire [all] resources from node1
Feb 16 17:38:06 node2 heartbeat: [4443]: info: acquire all HA resources (standby).
Feb 16 17:38:06 node2 ResourceManager[4457]: info: Acquiring resource group: node1 172.16.245.222 apache2
Feb 16 17:38:06 node2 IPaddr[4483]: INFO:  Resource is stopped
Feb 16 17:38:06 node2 ResourceManager[4457]: info: Running /etc/ha.d/resource.d/IPaddr 172.16.245.222 start
Feb 16 17:38:06 node2 ResourceManager[4457]: debug: Starting /etc/ha.d/resource.d/IPaddr 172.16.245.222 start
Feb 16 17:38:07 node2 IPaddr[4554]: INFO: Using calculated nic for 172.16.245.222: eth0
Feb 16 17:38:07 node2 IPaddr[4554]: INFO: Using calculated netmask for 172.16.245.222: 255.255.255.0
Feb 16 17:38:07 node2 IPaddr[4554]: DEBUG: Using calculated broadcast for 172.16.245.222: 172.16.245.255
Feb 16 17:38:07 node2 IPaddr[4554]: INFO: eval ifconfig eth0:0 172.16.245.222 netmask 255.255.255.0 broadcast 172.16.245.255
Feb 16 17:38:07 node2 IPaddr[4554]: DEBUG: Sending Gratuitous Arp for 172.16.245.222 on eth0:0 [eth0]
Feb 16 17:38:07 node2 kernel: [ 7391.316832] NET: Registered protocol family 17
Feb 16 17:38:07 node2 IPaddr[4539]: INFO:  Success
Feb 16 17:38:07 node2 ResourceManager[4457]: debug: /etc/ha.d/resource.d/IPaddr 172.16.245.222 start done. RC=0
Feb 16 17:38:07 node2 ResourceManager[4457]: info: Running /etc/init.d/apache2  start
Feb 16 17:38:07 node2 ResourceManager[4457]: debug: Starting /etc/init.d/apache2  start
Feb 16 17:38:07 node2 ResourceManager[4457]: debug: /etc/init.d/apache2  start done. RC=0
Feb 16 17:38:07 node2 heartbeat: [4443]: info: all HA resource acquisition completed (standby).
Feb 16 17:38:07 node2 heartbeat: [4255]: info: Standby resource acquisition done [all].
Feb 16 17:38:08 node2 heartbeat: [4255]: info: remote resource transition completed.
Feb 16 17:38:08 node2 ipfail: [4340]: debug: Other side is now stable.
Feb 16 17:38:08 node2 ipfail: [4340]: debug: Other side is now stable.

Now you can plug the interface back in. If you disabled automatic failback, the other node will still hold the floating IP. Otherwise the cluster will fail back. At this point your cluster should be ready for any remaining tests to tune the time-outs appropriately and then, finally, active use.
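During these tests it helps to confirm which node currently holds the floating IP. One way to do that, sketched below, is to filter the output of ip -o -4 addr show. The holds_ip function is a hypothetical helper of my own; it reads that output on standard input so you can also try it against saved sample text, as the demonstration does (the two sample lines are made up to resemble node2 after a fail-over).

```shell
# holds_ip ADDRESS
# Reads "ip -o -4 addr show" output on stdin and succeeds if ADDRESS is
# configured on any interface. The address is the fourth field, written
# as address/prefix.
holds_ip() {
    awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) found = 1 }
                    END { exit !found }'
}

# Sample text standing in for the output of "ip -o -4 addr show" on node2.
sample='2: eth0    inet 172.16.245.221/24 brd 172.16.245.255 scope global eth0
2: eth0    inet 172.16.245.222/24 brd 172.16.245.255 scope global secondary eth0:0'
printf '%s\n' "$sample" | holds_ip 172.16.245.222 && echo "this node is active"

# Usage on a real node:
#   ip -o -4 addr show | holds_ip 172.16.245.222 && echo "this node is active"
```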

The previous example is a good starting place for your own clustered service but certainly doesn’t cover everything that Heartbeat can do. For more information about Heartbeat along with more details on configuration options and additional guides, check out http://www.linux-ha.org.

DRBD

A common need in a cluster is replicated storage. When a host goes down, the fail-over host needs access to the same data. On a static Web server, or a Web server with a separate database server, this requirement is easily met since the data can be deployed to both members of the cluster. In many cases, though, such as more complex Web sites that allow file uploads, or with clustered NFS or Samba servers, you need a more sophisticated method to keep files synchronized across the cluster.

When faced with the need for synchronized storage, many administrators start with some basic replication method like an rsync command that runs periodically via cron. When you have a cluster, however, you want something more sophisticated. With DRBD you can set up a file system so that every write is replicated over the network to another host. Here I will describe how to add DRBD to our Heartbeat cluster example from above. I have added a second drive to each node at /dev/sdb and created a partition that fills up the drive at /dev/sdb1. The goal is to have a replicated disk available at /mnt/shared on the active node.

The first step is to install the DRBD utilities. These are available in the drbd8-utils package, so install it with your preferred package manager:

$ sudo apt-get install drbd8-utils

The next step is to create a configuration file for DRBD to use. The package will automatically install a sample /etc/drbd.conf file that documents all of the major options. Definitely use this sample as a reference, but I recommend you move it out of the way and create a clean /etc/drbd.conf file for your cluster since, as you will see, the drbd.conf is relatively simple. Here’s the /etc/drbd.conf I will use for my cluster:

global {
    usage-count no;
}

common {
    protocol C;
}

resource r0 {
    on node1 {
     device   /dev/drbd1;
     disk     /dev/sdb1;
     address  192.168.10.1:7789;
     meta-disk internal;
    }
    on node2 {
     device   /dev/drbd1;
     disk     /dev/sdb1;
     address  192.168.10.2:7789;
     meta-disk internal;
    }
    net {
     after-sb-0pri   discard-younger-primary;
     after-sb-1pri   consensus;
     after-sb-2pri   disconnect;
    }
}

To simplify things I will break up this configuration file into sections and describe the options:

global {
    usage-count no;
}

common {
    protocol C;
}

The global section allows you to define certain options that apply outside of any individual resource. The usage-count option defines whether your cluster will participate in DRBD’s usage counter. If you want to participate, set this to yes. Set it to no if you want your DRBD usage to be more private.

The common section allows you to define options that apply to every resource definition. For instance, the protocol option lets you define which transfer protocol to use. The different transfer protocols are defined in the sample drbd.conf included with the package. For protocol, choose C unless you have a specific reason not to. Since I have a number of options in my resource section that are the same for each node (like device, disk, and meta-disk), I could actually put all of these options in the common section. You just need to be aware that anything you place in the common section will apply to all resources you define.

Each replicated file system you set up is known as a resource and has its own resource definition. The resource definition is where you define which nodes will be in your cluster, what DRBD disk to create, what actual partition to use on each host, and what network IP and port to use for the replication. Here is the resource section of my config for a resource called r0:

resource r0 {
    on node1 {
     device   /dev/drbd1;
     disk     /dev/sdb1;
     address  192.168.10.1:7789;
     meta-disk internal;
    }
    on node2 {
     device   /dev/drbd1;
     disk     /dev/sdb1;
     address  192.168.10.2:7789;
     meta-disk internal;
    }
    net {
     after-sb-0pri   discard-younger-primary;
     after-sb-1pri   consensus;
     after-sb-2pri   disconnect;
    }
}

As you can see, I have defined two nodes here, node1 and node2, and within the node definitions are specific options for that host. I have decided to use /dev/drbd1 as the DRBD virtual device each host will actually mount and access and to use /dev/sdb1 as the physical partition DRBD will use on each host. DRBD standardizes on port 7788 on up for each resource, so I have chosen port 7789 here. If you have enabled a firewall on your hosts, you will need to make sure that this port is unblocked. Note also that I have specified the IP addresses for the private network I was using before for Heartbeat and not the public IP addresses. Since I know this network is pretty stable (it’s over a crossover cable), I want to use it to replicate the data. Otherwise you do want to make sure that any network you use for DRBD is fault-tolerant.

The final option for each node is meta-disk set to internal. DRBD needs some area to store its metadata. The ideal, simplest way to set this up is to use an internal metadisk. With an internal metadisk, DRBD will set aside the last portion of the partition (in this case /dev/sdb1) for its metadata. If you are setting up DRBD with a new, empty partition, I recommend you use an internal metadisk as it is much easier to maintain and you are guaranteed that the metadisk and the rest of the data are consistent when a disk fails. If you want to replicate a partition that you are already using, you will have to use an external metadisk on a separate partition and define it here in drbd.conf, or you risk having DRBD overwrite some of your data at the end of the partition. If you do need an external metadisk, visit http://www.drbd.org/docs/about/ and check out their formulas and examples of how to properly set up external metadisks.

The final part of my r0 resource is the following:

    net {
     after-sb-0pri   discard-younger-primary;
     after-sb-1pri   consensus;
     after-sb-2pri   disconnect;
    }

I listed these settings explicitly to show how you can change DRBD’s split-brain recovery policy. If you leave them out, DRBD 8 falls back to its built-in default, which is disconnect for all three options. Remember that during a split brain the nodes cannot communicate with each other, so neither one can necessarily determine which node should be active. With DRBD, normally only one node is the primary and the other is the secondary. In this section you can define behavior after different split-brain scenarios. The after-sb-0pri option defines what to do when both nodes are secondary after a split brain. With discard-younger-primary, DRBD uses the data from the node that was the primary before the split brain occurred. The after-sb-1pri option sets what to do if one of the nodes is the primary after the split brain. With consensus, the secondary’s data will be discarded if the after-sb-0pri setting would also destroy it. Otherwise the nodes will disconnect from each other so you can decide which node will overwrite the other. The final after-sb-2pri option defines what to do if both nodes think they are the primary after a split brain. With disconnect, DRBD will disconnect the two nodes from each other so you can decide how to proceed. Check out the sample drbd.conf file for a full list of other options you can use for this section.

Initialize the DRBD Resource

Now that the /etc/drbd.conf file is set up, I make sure it exists on both nodes and then run the same set of commands on both nodes to initialize it. First I load the kernel DRBD module, then I create the metadata on my resource (r0), and then I bring the device up for the first time:

$ sudo modprobe drbd
$ sudo drbdadm create-md r0
$ sudo drbdadm up r0

Notice that I reference the r0 resource I have defined in drbd.conf. If you set up more than one resource in the file, you would need to perform the drbdadm commands for each of the resources the first time you set them up. After you run these commands on each node, you can check the /proc/drbd file on each node for the current status of the disk:

$ cat /proc/drbd
version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by phil@mescal, 2008-02-12 11:56:43

 1: cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

In this output you can see that the connection state (cs) is WFConnection, which means the node is Waiting for a Connection. Currently no node has been assigned as the primary, so each will think that its state (st) is Secondary/Unknown. Finally, the disk state (ds) will be Inconsistent since the DRBD resources have not been synced yet. It will also show Inconsistent here if you suffer a split brain on your cluster and DRBD can’t recover from it automatically.
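If you want to use these fields in your own scripts, you can pull them out of /proc/drbd with a little sed. The drbd_field function below is a hypothetical helper of my own, not part of the DRBD utilities. It assumes the key:value layout shown above and reports only the first resource it finds, so treat it as a starting point rather than a complete parser.

```shell
# drbd_field KEY [FILE]
# Prints the value of a "KEY:value" field (cs, st, or ds) from the first
# matching line of FILE, which defaults to /proc/drbd.
drbd_field() {
    sed -n "s/.*[[:space:]]$1:\([^[:space:]]*\).*/\1/p" "${2:-/proc/drbd}" | head -n 1
}

# Demonstrate against a saved copy of the status lines shown above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
 1: cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown C r---
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
EOF
echo "connection: $(drbd_field cs "$sample")"
echo "state:      $(drbd_field st "$sample")"
echo "disk:       $(drbd_field ds "$sample")"
```

On a live node you can leave off the file argument, for example drbd_field cs, to read /proc/drbd directly.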

Next you are ready to perform the initial synchronization from one node to the other. This means you have to choose one node to act as the primary, and both nodes need to be able to communicate with each other over the network. Now if your disk already had data on it, it is crucial that you choose it as the primary. Otherwise the blank disk on the second node will overwrite all of your data. If both disks are currently empty, it doesn’t matter as much. In my case I will choose node1 as the primary and run the following command on it:

$ sudo drbdadm -- --overwrite-data-of-peer primary r0

At this point data will start to synchronize from node1 to node2. If I check the output of /proc/drbd, I can see its progress much as with /proc/mdstat and software RAID:

$ cat /proc/drbd
version: 8.0.11 (api:86/proto:86)
GIT-hash: b3fe2bdfd3b9f7c2f923186883eb9e2a0d3a5b1b build by phil@mescal, 2008-02-12 11:56:43

 1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:9568 nr:0 dw:0 dr:9568 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
     [>....................] sync'ed:  0.2% (8171/8181)M
     finish: 7:15:50 speed: 316 (316) K/sec
     resync: used:0/31 hits:597 misses:1 starving:0 dirty:0 changed:1
     act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

Now you can see that the current status, state, and disk state have all changed. Once the synchronization starts, you can go ahead and start using /dev/drbd1 like a regular partition and put a file system on it and mount it. In my case the disk was empty, so I needed to do both on node1:

$ sudo mkfs -t ext3 /dev/drbd1
$ sudo mkdir /mnt/shared
$ sudo mount /dev/drbd1 /mnt/shared

Now go to node2 and make sure that the /mnt/shared directory exists there as well, but don’t mount /dev/drbd1! Since I am using ext3 and not a clustering file system, I can mount /dev/drbd1 on only one node at a time. Once the disk finishes syncing, the connection state will change to Connected and the disk state will change to UpToDate. You are then ready to set up Heartbeat so that it can fail over DRBD properly.
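The initial synchronization can take hours, so you may want a script to tell you when it is done instead of checking /proc/drbd by hand. Here is a minimal sketch built on the status line format shown earlier. Both function names are my own invention, and with more than one resource defined the check succeeds as soon as any one of them is in sync, so adapt it if you manage several.

```shell
# drbd_in_sync [FILE]
# Succeeds when FILE (default /proc/drbd) shows a connected resource whose
# local and peer disks are both UpToDate.
drbd_in_sync() {
    grep -q 'cs:Connected .*ds:UpToDate/UpToDate' "${1:-/proc/drbd}"
}

# wait_for_sync [FILE] [INTERVAL]
# Polls every INTERVAL seconds (default 10) until drbd_in_sync succeeds.
wait_for_sync() {
    until drbd_in_sync "$1"; do
        sleep "${2:-10}"
    done
}

# Usage on either node:
#   wait_for_sync && echo "r0 is fully synchronized"
```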

Configure Heartbeat

Heartbeat includes DRBD support specifically in the form of a script under /etc/ha.d/resource.d/ called drbddisk. You can use this script to tell Heartbeat which resources to start or stop. Unless you use a clustering file system, only one node can mount and write to a DRBD device at a time, so you need to set up Heartbeat so that it will mount or unmount the file system based on which node is active. Previously the line in my /etc/ha.d/haresources file was

node1 172.16.245.222 apache2

Now I will change it to

node1 172.16.245.222 drbddisk::r0 Filesystem::/dev/drbd1::/mnt/shared::ext3 apache2

Make the same change to the haresources file on node2, since the file must be identical on both nodes. Once you change the /etc/ha.d/haresources file on both hosts, run

$ sudo /etc/init.d/heartbeat reload

Now your cluster is ready to go. You can simulate a failure by, for instance, rebooting the primary host. If you go to the second node, you should notice Heartbeat kick in almost immediately and mount /dev/drbd1, start Apache, and take over the floating IP. The /proc/drbd file will list the status as WFConnection since it is waiting for the other host to come back up and should show that the node is now the primary. Because we set up the Heartbeat cluster previously to not fail back, even when node1 comes back, node2 will be the active member of the cluster. To test failback just reboot node2 and watch the disk shift over to node1.

drbdadm Disk Management

Once you have a functioning DRBD disk, drbdadm is the primary tool you will use to manage your disk resources. The DRBD init script should take care of initializing your resources, but you can also bring up a resource manually with

$ sudo drbdadm up r0

Replace r0 with the name of the resource you want to start. Likewise, you can take down an inactive resource with

$ sudo drbdadm down r0

You can also manually change whether a node is in primary or secondary mode, although in a Heartbeat cluster I recommend you let Heartbeat take care of this. If you do decide to change over from primary to secondary mode, be sure to unmount the disk first. Also, if any node is currently primary, DRBD won’t let you change it to secondary while the nodes are connected (the cs: value in /proc/drbd), so you will have to disconnect them from each other first. To set the primary or secondary mode manually for a particular resource, run

$ sudo drbdadm primary r0
$ sudo drbdadm secondary r0

Change drbd.conf

At some point you might want to make changes in your /etc/drbd.conf file such as changing split-brain recovery modes. Whenever you make changes, make sure that the same change is added to /etc/drbd.conf on all of your nodes, and then run

$ sudo drbdadm adjust r0

on both nodes. Replace r0 with the name of the resource you want to change.
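Since a drbd.conf that differs between nodes is an easy mistake to make, you might compare the two copies before you run the adjust command. The following is only a sketch of that idea: same_config is a hypothetical helper, and node2 and /tmp/drbd.conf.node2 in the usage comment are placeholders for your own peer and scratch file.

```shell
# same_config LOCAL_FILE PEER_COPY
# Succeeds only when the two files are byte-for-byte identical.
same_config() {
    cmp -s "$1" "$2"
}

# Example workflow on node1:
#   ssh node2 cat /etc/drbd.conf > /tmp/drbd.conf.node2
#   same_config /etc/drbd.conf /tmp/drbd.conf.node2 \
#       && sudo drbdadm adjust r0
```

The same check works for haresources, which also has to stay identical across the cluster.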

Replace a Failed Disk

Ideally you will have any disks you use with DRBD set up in some sort of RAID so that a disk can fail without taking out the node. If you do have a DRBD disk set up on a single drive, as I do in this example, and the drive fails, you will need to run a few commands to add the fresh drive. In this example I will assume that /dev/sdb failed on my node1 server. By default DRBD should automatically detach and remove a disk when it has a failure, so unless you knowingly tweaked that default, DRBD should do that work for you. Once you add and partition the replacement drive (in my case it will be /dev/sdb1), re-create the internal metadata on /dev/sdb1 and then attach the resource:

$ sudo drbdadm create-md r0
$ sudo drbdadm attach r0

Manually Solve Split Brain

DRBD will attempt to resolve split-brain problems automatically, but sometimes it is unable to determine which node should overwrite the other. In this case you might have two nodes that have both mounted their DRBD disk and are writing to it. If this happens you will have to make a decision as to which node has the version of data you want to preserve. Let’s say in my case that node1 and node2 have a split brain and have disconnected from each other. I decide that node2 has the most up-to-date data and should become the primary and overwrite node1. In this case I have to tell node1 to discard its data, so I go to node1, make sure that any DRBD disk is unmounted, and type

$ sudo drbdadm secondary r0
$ sudo drbdadm -- --discard-my-data connect r0

If node2 is already in WFConnection state it will automatically reconnect to node1 at this point. Otherwise I need to go to node2, the node that has the good data, and type

$ sudo drbdadm connect r0

Now node2 will synchronize its data over to node1.

These steps should get you up and running with a solid replicated disk for your cluster. For more detailed information about DRBD, including more advanced clustering options than I list here, visit the official site at http://www.drbd.org.
