Gathering Application Information
Once you have determined that an application suits the cluster environment, and have determined exactly what the scope of the project will be, the next step is to gather information about the application so that you can start to build the associated resource type. At the most fundamental level, this means you will need to determine exactly how to start, stop, and monitor the application automatically.
Determining how to start the application is the most important part of developing an agent, since it is the only part of the process that the cluster framework cannot handle by itself. You should keep in mind that starting an application doesn't just mean knowing the path to the program, but often includes a number of other components, for example:
- Command-line arguments
- Configuration files
- Environment variables
If there is a chance of multiple instances of an application running on the same cluster, it is particularly important to pay attention to the command-line arguments and configuration files that must be used to differentiate between these instances. These variable items might then be turned into extension properties of the final resource type (see Chapter 6), so that an administrator can control these aspects of the application when configuring the cluster.
In some cases it may be necessary to write a wrapper script to start an application with the correct set of configuration variables, and this is especially true if an application must be started as a particular user other than "root." For Solaris applications that normally start at boot time, there may already be an appropriate script in the /etc/init.d directory that you can use directly or modify slightly to have the required effect. Remember that if you add an application to the cluster that is normally started at boot time, you should remove the original start and stop script(s) from the /etc/rc?.d directories.
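The pieces described above (program path, command-line arguments, configuration file, environment variables, and the user to run as) can be gathered into one record per instance. The following is a minimal sketch under stated assumptions, not part of any Sun Cluster tool: `StartSpec`, `build_command`, the example paths, and the `-c` configuration flag are all hypothetical, and the use of su(1) with env(1) is just one way a wrapper might start a process as a non-root user.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class StartSpec:
    """Everything needed to start one instance (hypothetical structure)."""
    program: str
    args: list = field(default_factory=list)
    config_file: str = ""
    env: dict = field(default_factory=dict)
    run_as: str = "root"


def build_command(spec):
    """Return the argv list a wrapper script could execute for this instance."""
    argv = []
    if spec.env:
        # Pass per-instance variables explicitly through env(1).
        argv = ["env"] + [f"{k}={v}" for k, v in sorted(spec.env.items())]
    argv.append(spec.program)
    argv.extend(spec.args)
    if spec.config_file:
        # "-c" is an assumed flag; a real application may use a different one.
        argv += ["-c", spec.config_file]
    if spec.run_as != "root":
        # Wrap the whole command so that it runs as the requested user.
        argv = ["su", "-", spec.run_as, "-c", shlex.join(argv)]
    return argv


if __name__ == "__main__":
    instance = StartSpec(
        program="/opt/app/bin/server",        # hypothetical path
        args=["--port", "8080"],
        config_file="/etc/app/inst1.conf",    # hypothetical path
        env={"APP_HOME": "/opt/app"},
        run_as="appuser",
    )
    print(build_command(instance))
```

Fields that differ between instances (here the port and the configuration file) are natural candidates for extension properties of the resource type.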
The amount of time taken for your application to start is also important. By default, the cluster framework will wait 300 seconds (five minutes) for the application to start up before sending the first probe to check that everything is okay. If your application takes longer than this to start, then you will have to change this delay when creating the agent.
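One way to find out whether the default is long enough is to time a few manual starts. The sketch below assumes a network-aware application that is "started" once it accepts TCP connections; `measure_startup` and its parameters are hypothetical names, and an application that is not network-aware would need a different readiness test.

```python
import socket
import time


def measure_startup(host, port, limit=300.0, interval=1.0):
    """Return the seconds until host:port accepts a connection.

    Returns None if no connection succeeds within `limit` seconds.
    """
    started = time.monotonic()
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - started
        except OSError:
            pass  # not listening yet; try again after a short pause
        if time.monotonic() - started >= limit:
            return None
        time.sleep(interval)
```

Call the function immediately after launching the application. A result of None, or one close to the limit, suggests that the start delay should be raised when the agent is created.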
Finally, you should make a note of anything that your application depends upon, including the presence of particular data or filesystems, networks, or other applications. It may be that in order to make one application highly available, others will need to be made highly available as well. If your application depends on another, you can add a check for the availability of that service to the startup code for your agent. You should also document this dependency so that administrators can configure the cluster framework to start applications in the correct order.
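A dependency check in the startup code can be very small. The sketch below only tests that required paths (for example, a data directory on a shared filesystem) are present before calling the start routine. The function names and the 0/1 return convention are assumptions made for illustration; checking another service would need an application-specific test.

```python
import os


def missing_dependencies(required_paths):
    """Return the required paths that are not currently present."""
    return [path for path in required_paths if not os.path.exists(path)]


def start_if_ready(required_paths, start):
    """Call start() only when every required path exists.

    Returns 0 if the start routine was called, 1 if a dependency was missing.
    """
    if missing_dependencies(required_paths):
        return 1
    start()
    return 0
```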
Most applications start after the network has started, but in some cases it may be important to start (or partially start) applications before the network has been configured. As we will see later in the book, the cluster framework allows us to run actions before and after the network has been started, so it is important to make a note of what your application expects to happen and when.
Stopping your application correctly is important to ensure data integrity when the service is failed over to another node. Not every failover occurs as a result of a system crash, so knowing how to stop gracefully is a vital part of the agent software.
In many cases, it is possible to safely stop an application simply by sending a TERM signal to the running process. In fact, if you do not specify a particular program to stop your application, the cluster framework will do this by default. However, if a program supplied with your application can stop it gracefully, then that program should be used.
As with starting an application, there are other factors to consider than just the command name, including:
- Arguments, environment, and user
- Effects of kill(1)
- When to stop
The arguments, environment, and user factors are the same as for starting an application (see "Start" earlier in this chapter), and the timeout is similar. As with starting the application, stopping the application is by default given 300 seconds (five minutes) to be successful.
The effects of kill(1) on the application are important because if, when creating an agent using the SunPlex Agent Builder or Generic Data Service (GDS) tools, you don't supply a specific command to stop your application, the cluster framework will use kill(1) by default. About 80 percent of the timeout is allowed for the TERM signal to take effect, and about 15 percent for the KILL signal. Even if you do provide a stop command (using one of the tools or not), the timeout value you supply will be used to decide how long the cluster framework will wait for an application to stop before deciding that something has gone wrong. As with the start timeout value, the default stop timeout is 300 seconds.
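To see how your application reacts to this sequence before you hand it to the cluster, you can imitate it by hand. The following is a sketch of the TERM-then-KILL pattern using the 80 and 15 percent split described above. It is not the framework's own implementation, and `stop_process` and its return values are hypothetical.

```python
import subprocess


def stop_process(proc, stop_timeout=300.0):
    """Stop a subprocess.Popen child with TERM first, then KILL.

    Returns "TERM" or "KILL" to show which signal ended the process,
    or "FAILED" if it was still running after both waits.
    """
    proc.terminate()  # sends SIGTERM on POSIX systems
    try:
        proc.wait(timeout=stop_timeout * 0.80)
        return "TERM"
    except subprocess.TimeoutExpired:
        pass  # the process ignored or outlived TERM
    proc.kill()  # sends SIGKILL on POSIX systems
    try:
        proc.wait(timeout=stop_timeout * 0.15)
        return "KILL"
    except subprocess.TimeoutExpired:
        return "FAILED"
```

If a test run regularly ends with "KILL", the application is probably not shutting down cleanly on TERM alone, which is a good reason to supply a dedicated stop command.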
You should also consider whether the application should be stopped before or after the public network connection is removed. In most cases the application will be stopped before the network connection, but there may be times when you want to be sure that the network has been removed before stopping the application. However, this would normally be the case only when your application doesn't use the network directly.
Understanding how an application works is key to integrating it into the cluster environment. Once you have worked out how an application will behave under various conditions, you can create a monitor program that checks for these conditions and returns appropriate values to the cluster framework, which in turn uses them to decide what actions, if any, the cluster should take.
By default, no monitoring is done by the cluster framework. However, if you use one of the provided tools to create a network-aware resource type, a simple service probe is automatically provided. This probe just attempts to connect to the service's IP address and port, and if successful assumes that the application is operating correctly. Obviously, most applications require more complex monitoring than this, and Sun Cluster allows you to create very sophisticated monitoring applications to return different fault and failure modes to the cluster framework.
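The idea behind such a connect-only probe can be sketched in a few lines. This is an illustration of the technique, not the code that the tools generate; `simple_probe` and its timeout parameter are hypothetical.

```python
import socket


def simple_probe(address, port, timeout=5.0):
    """Return 0 if a TCP connection to address:port succeeds, otherwise 1."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return 0
    except OSError:
        # Refused, unreachable, or timed out: report a problem.
        return 1
```

A successful connection shows only that something is listening on the port. It says nothing about whether the application is answering requests correctly, which is why most agents need a richer probe.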
When you create a monitor program, you are in effect creating a type of expert system, in which a piece of software performs the tasks that might otherwise need to be done by a human operator. For this reason, detailed analysis of the application flow and possible failure modes will aid in the development of your resource type.
One way of tracking what your monitor program must do to check that your application is running correctly is to use a flowchart like that shown in Figure 4.3. In this example, we start by having the monitor program request a known piece of information from the application. If the request times out without any response, then the monitor program will check to see if the cluster framework is controlling the application properly, and take appropriate actions. If the request for data is successful, then the monitor program will compare the response from the application to what it expects the response to be: If they are the same, then the monitor will assume that everything is okay, and will go to sleep for a while before starting the whole process again. If the data received from the application is not what the monitor program expected, then it will check to see if any data was received and take appropriate action, which may be to send a warning to human operators or to restart or even fail over the application.
Figure 4.3 Application monitoring flowchart
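The decisions in such a flowchart translate naturally into a small function. The sketch below follows the flow described in the text. Which action is taken on each branch is an assumption made for illustration, since the right response depends on the application, and the names `Action`, `decide`, and `probe_once` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"            # response matched; sleep and probe again later
    WARN = "warn"        # alert a human operator
    RESTART = "restart"  # ask the cluster to restart the application


def decide(response, expected, under_cluster_control=True):
    """Map one probe result to an action.

    `response` is None when the request timed out without any reply.
    """
    if response is None:
        # No reply at all: act only if the cluster controls the application.
        return Action.RESTART if under_cluster_control else Action.WARN
    if response == expected:
        return Action.OK
    if response == "":
        # Connected, but no data came back (assumed to justify a restart).
        return Action.RESTART
    # Some data arrived, but not what was expected (assumed: warn only).
    return Action.WARN


def probe_once(request, expected, under_cluster_control=True):
    """Run one probe; request() returns a string or raises TimeoutError."""
    try:
        response = request()
    except TimeoutError:
        response = None
    return decide(response, expected, under_cluster_control)
```

Keeping the decision logic separate from the code that talks to the application makes each branch of the flowchart easy to test without a running service.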
As you can see, it is very important to understand how your application behaves before you can write a successful monitoring program. In some cases, the application you are trying to make highly available may already have a program you can use to check the status. In such situations, you simply need to ensure that the program will return zero if the application is okay, and 1 (or another nonzero value) if there is a problem. In other cases, you may require quite complex programming to assess the state of the application and decide what action to take, particularly if you want to take very specific actions when certain events (such as network failures) occur. The Sun Cluster APIs provide the tools to retrieve a lot of information about the cluster environment itself, but it is your understanding of the application that will determine how successful your monitor program will be.
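When a vendor-supplied status program exists but its exit codes or run time are unpredictable, a thin wrapper can normalize the result. The following is a minimal sketch: the function name and the 30-second default are assumptions, and the command passed in would be whatever status program your application provides.

```python
import subprocess


def probe_with_status_program(argv, timeout=30.0):
    """Run an existing status command and reduce its result to 0 or 1.

    Returns 0 only if the command exits with status zero within `timeout`
    seconds. Any other exit status, a timeout, or a failure to run the
    command is reported as 1.
    """
    try:
        result = subprocess.run(
            argv,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.TimeoutExpired):
        return 1
    return 0 if result.returncode == 0 else 1
```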
As with starting and stopping an application, running the monitor program may require specific command-line arguments and environment variables, need a particular user ID, and require a certain amount of time to run. These factors need to be documented so that you can integrate them into your resource type.