Management Aspects of Campus Clusters

Processes

Well-defined processes play a vital role in ensuring timely disaster recovery with minimal data loss. Disasters by nature cannot be predicted, making it absolutely critical to establish tested procedures for recovery. Staff training and expertise, with clear lines of communication and decision-making, are essential. The procedures must be reviewed and audited regularly and updated or refined as necessary. Changes in technology, organizational structure, and other fundamentals must be accommodated as soon as possible. Finally, when a disaster occurs, a post-recovery analysis helps determine what went well, what went wrong, and what needs to be improved.

Administrator Skills

Because clusters are expected to provide very high service levels, most enterprises assign dedicated, specialized staff to administer cluster configurations. Intensive training in Sun Cluster software and related technologies, supported by detailed procedural run books, is required to help these administrators maintain the cluster environment at the highest possible service level. In addition to training and defined processes, the administrative staff needs some measure of creativity and flexibility to cope with the unexpected and unprecedented complexities of a disaster.

Administrators should thoroughly understand the basic concepts of clustering, especially the algorithms the cluster uses to decide which nodes become part of a new cluster and which do not. It is equally important to understand how mirroring works in a remote environment and how complex storage and volume manager configurations can be reconfigured. Most importantly, administrators must be able to investigate complex error scenarios methodically so that they can rapidly determine the impact of a disaster and take appropriate action.

Monitoring and Stabilizing the Campus Cluster

Management infrastructure tools such as Sun™ Management Center 3.0 software can be used to help monitor the health of the campus cluster. Used either as a standalone solution or linked into the enterprise management framework, management tools enable administrators to quickly detect potential problems with individual nodes or interconnects.

In the event of a failure, administrators need to act quickly to determine which failure scenario applies. Any of the following failures may interrupt service availability:

  • One site is totally unavailable

  • Network connections, including the cluster interconnect between sites, are broken, but storage connections are still available

  • Storage connections are broken, but network connections are up

  • Network and storage connections are both unavailable
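As a rough illustration, the four scenarios above can be expressed as a small decision function. This is a minimal sketch, not part of any Sun Cluster tool: the names `Scenario` and `classify` and the `remote_site_confirmed_down` flag are hypothetical. The flag stands for an out-of-band confirmation (for example, a phone call to the other site), on the assumption that a node at the surviving site cannot otherwise tell a dead site apart from a loss of both network and storage links.

```python
from enum import Enum


class Scenario(Enum):
    NO_FAILURE = "no inter-site failure observed"
    SITE_DOWN = "one site is totally unavailable"
    NETWORK_ONLY = "network connections broken, storage connections available"
    STORAGE_ONLY = "storage connections broken, network connections up"
    NETWORK_AND_STORAGE = "network and storage connections both unavailable"


def classify(network_up: bool, storage_up: bool,
             remote_site_confirmed_down: bool = False) -> Scenario:
    """Map what the surviving site observes to one of the listed scenarios.

    remote_site_confirmed_down is a hypothetical out-of-band confirmation;
    it is only consulted when both kinds of connection are down.
    """
    if network_up and storage_up:
        return Scenario.NO_FAILURE
    if not network_up and not storage_up:
        if remote_site_confirmed_down:
            return Scenario.SITE_DOWN
        return Scenario.NETWORK_AND_STORAGE
    if not network_up:
        return Scenario.NETWORK_ONLY
    return Scenario.STORAGE_ONLY


if __name__ == "__main__":
    # Interconnect is broken, but the remote storage is still reachable.
    print(classify(network_up=False, storage_up=True).value)
```

A run book could use a table like this to point the administrator at the matching recovery procedure once the observations have been collected.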

If the original cluster, or a newly formed cluster, is still operational, stabilizing it mainly requires reconfiguring nodes and storage. Refer to the Sun Cluster 3.0 product documentation for specific reconfiguration procedures.

If a new cluster cannot be formed (for example, due to loss of quorum), administrators must intervene. Care must be taken to avoid jeopardizing data integrity during manual procedures in which cluster mechanisms might be temporarily disabled. Manually stabilizing the cluster prevents the formation of more than one cluster when nodes return to operation. Possible ways to stabilize a cluster include powering off a failed node and removing all of its network and storage connections. If the failed node is accessible, its cluster configuration should be changed so that the node does not try to rejoin the cluster automatically upon reboot.

Changing the Quorum Device

If the surviving nodes cannot form a new cluster, manual intervention is necessary. Manual intervention may temporarily remove the quorum and failure fencing mechanisms of Sun Cluster 3.0 software. Therefore, it is essential to prevent more than one cluster from running at the same time. Otherwise, more than one cluster could access the shared data independently and cause data corruption.

In the case of a slowly approaching disaster, proactive measures should be applied, if possible. Administrators should be thoroughly trained in procedures for evacuating high-availability services and configuration information from the production site. If the quorum device is in the affected site, the administrator's first priority is to change the quorum device to one in the unaffected site. This task can be done using the SunPlex Manager tool, the scconf command at the command-line interface, or the scsetup menu interface.

Because the quorum reconfiguration must occur quickly, it is advisable to prepare a run book and special scripts for this situation. When carrying out the reconfiguration procedure, administrators must choose a device that is known with certainty to be in the unaffected cluster site or data center. Note that in a two-node cluster, the last quorum device cannot be deleted. Administrators must first add a second quorum device and then delete the old one.
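The add-before-delete ordering can be captured in a run-book helper. The following is a minimal sketch under stated assumptions: `plan_quorum_change` is a hypothetical function, the device names `d12` and `d20` are invented for illustration, and the output is only an ordered list of abstract steps. Translating each step into the corresponding SunPlex Manager, scconf, or scsetup action is left to the site's run book.

```python
def plan_quorum_change(current_devices, new_device, affected_devices):
    """Return ordered (action, device) steps for moving the quorum device.

    The new device is added first, so the cluster never ends up
    without a configured quorum device while the old one is removed.
    """
    if new_device in affected_devices:
        raise ValueError("new quorum device must be in the unaffected site")

    steps = []
    if new_device not in current_devices:
        steps.append(("add", new_device))
    # Only after the replacement is in place are the affected devices removed.
    for device in current_devices:
        if device in affected_devices:
            steps.append(("remove", device))
    return steps


if __name__ == "__main__":
    # Hypothetical device names: d12 is in the affected site, d20 is not.
    for action, device in plan_quorum_change(["d12"], "d20", {"d12"}):
        print(action, device)
```

Rejecting a replacement device that sits in the affected site mirrors the requirement above that the chosen device be in the unaffected site.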

Furthermore, this procedure works only while the cluster still has quorum. If both the quorum device and the other node are lost, then certified personnel must change the internal cluster configuration database and define a new quorum device.
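The reason a lone node cannot proceed without the quorum device follows from majority voting. The sketch below assumes the commonly documented defaults for a two-node cluster (one vote per node and one vote for the shared quorum device, three configured votes in total) and the rule that a partition needs more than half of the configured votes. The function names are hypothetical, and the actual vote counts for a given cluster should be checked against its configuration.

```python
def partition_votes(nodes_up: int, owns_quorum_device: bool,
                    votes_per_node: int = 1, device_votes: int = 1) -> int:
    """Votes held by one partition (the set of nodes that can still see each other)."""
    return nodes_up * votes_per_node + (device_votes if owns_quorum_device else 0)


def has_quorum(votes_present: int, votes_configured: int) -> bool:
    """A partition may form a cluster only with a strict majority of configured votes."""
    return 2 * votes_present > votes_configured


if __name__ == "__main__":
    configured = 3  # assumed: two nodes with one vote each, plus one quorum device vote

    # Surviving node can still reserve the quorum device: 2 of 3 votes.
    print(has_quorum(partition_votes(1, True), configured))   # True

    # Surviving node has lost the other node and the quorum device: 1 of 3 votes.
    print(has_quorum(partition_votes(1, False), configured))  # False
```

In the second case the surviving node holds only a minority of votes, which is why the configuration database has to be changed manually before a cluster can form again.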

Reconfiguring the Volume Manager

If access to storage in all sites (for example, all mirrors) is still available, no special procedures are necessary to protect data integrity. However, if an administrator determines that storage at the remote site is lost and must be replaced as part of the recovery, it is advisable to detach the mirrors located on these storage devices and remove the disks from the volume manager configuration. It is also important to remove failed nodes from the cluster configuration.
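Working out which mirror halves to detach is easier with an inventory of where each one lives. The following is a minimal sketch that assumes such an inventory exists as a plain dictionary. The function `plan_detach`, the site labels, and the volume names are all hypothetical, and the actual detach commands depend on the volume manager in use.

```python
def plan_detach(mirrors, lost_site):
    """Decide which submirrors to detach after storage at lost_site is gone.

    mirrors maps a mirror name to a list of (submirror, site) pairs.
    Returns (detach_plan, unprotected): detach_plan lists (mirror, submirror)
    pairs to detach; unprotected lists mirrors with no copy outside lost_site,
    which need restoring from backup rather than a detach.
    """
    detach_plan, unprotected = [], []
    for mirror, submirrors in mirrors.items():
        lost = [name for name, site in submirrors if site == lost_site]
        surviving = [name for name, site in submirrors if site != lost_site]
        if not surviving:
            unprotected.append(mirror)
            continue
        detach_plan.extend((mirror, name) for name in lost)
    return detach_plan, unprotected


if __name__ == "__main__":
    # Hypothetical inventory: each mirror has one half per site.
    inventory = {
        "d10": [("d11", "site-a"), ("d12", "site-b")],
        "d20": [("d21", "site-a"), ("d22", "site-b")],
    }
    plan, unprotected = plan_detach(inventory, lost_site="site-b")
    print(plan)         # [('d10', 'd12'), ('d20', 'd22')]
    print(unprotected)  # []
```

Keeping such an inventory current in the run book lets the detach list be produced in seconds during a recovery.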

Back to Normal Operations

Once the cluster is stabilized and data services are available again, the real recovery process can start. If there is no redundancy in the surviving data center, it is essential to decide as soon as possible how to re-establish it, especially at the data level. This can be achieved either by re-establishing a second site or by adding storage and cluster nodes to the remaining site. Ideally, the steps required to re-establish redundancy are included in the preparatory process and documented in a run book.
