Cisco PIX: Failover Demystified
Of all the features of the PIX, I think I get more student questions about Failover than about any other. The purpose of this article is to give you an understanding of the mechanics of Failover. My aim here is not to teach you how to configure the PIX for failover, but to help you understand the failover process. Specifically, this article covers the following Failover topics:
- Failover Operation
- Configuration Replication
- Failover Monitoring
- Fail Back Rules
- Interface Testing
The failover function for the Cisco PIX Firewall provides a safeguard in case a PIX Firewall fails. Specifically, when one PIX Firewall fails, another immediately takes its place.
In the failover process, there are two PIX Firewalls: the primary PIX Firewall and the secondary PIX Firewall. Under normal operation, the primary PIX Firewall functions as the active PIX Firewall, performing normal network functions. The secondary PIX Firewall functions as the standby PIX Firewall, ready to take control should the active PIX Firewall fail to perform. When the primary PIX Firewall fails, the secondary PIX Firewall becomes active while the primary PIX Firewall goes on standby. This entire process is called failover.
To use failover at all, you need two PIX Firewalls that match in hardware and software. They must meet the following requirements:
The same version of the PIX OS
The same number/type of interfaces in the same slots
The primary must be running the unrestricted license of the PIX OS.
The secondary PIX must run either the unrestricted license or the failover license.
If the primary has a DES/3DES license, the secondary must have one.
The primary PIX Firewall is connected to the secondary PIX Firewall through a failover connection: the failover cable. The failover cable has one end labeled primary, which plugs into the primary PIX Firewall, and the other end labeled secondary, which plugs into the secondary PIX Firewall. The role of Primary or Secondary PIX is established by the Failover cable. Even though a PIX may switch between Active or Standby, once Primary and Secondary roles are established by the placement of the Failover cable, they never change.
A failover occurs when one of the following situations takes place:
The no failover active command is issued on the active PIX.
The failover active command is issued on the standby PIX.
Block memory exhaustion occurs for 15 consecutive seconds or more on the active PIX Firewall
Network Interface Card (NIC) status. If the Link Status of a NIC is down, the unit will fail. "Down" means that the NIC is not plugged into an operational port. If a NIC has been configured as "down," it does not fail this test.
Failover Network communications. The two units send "hello" packets to each other over all network interfaces. If no "hello" messages are received for two failover poll intervals, the non-responding interface is put in testing mode to determine who is at fault.
Failover cable communication. The two units send "hello" messages to each other over the failover cable. If the standby doesn't hear from the active within two failover poll intervals, and the cable status is OK, the standby takes over as active.
Cable errors. The failover cable is wired so that each unit can distinguish between:
- A power failure on the other unit.
- A cable unplugged from this unit.
- A cable unplugged from the other unit.
If the standby detects that the active is powered off (or reload/reset), it takes active control. If the failover cable is unplugged, a syslog is generated but no switching will occur.
At boot up, if both units are powered up without the failover cable installed, they both become active, creating a duplicate IP address with different MAC addresses, causing conflict on your network. The failover cable must be installed for failover to work correctly.
When actively functioning, the primary PIX Firewall uses system IP addresses and MAC addresses. The secondary PIX Firewall, when on standby, uses failover IP addresses and MAC addresses.
When the primary PIX Firewall fails and the secondary PIX Firewall becomes active, the secondary PIX Firewall assumes the system IP addresses and MAC addresses of the primary PIX Firewall. Then the primary PIX Firewall, functioning in standby, assumes the failover IP addresses and MAC addresses of the secondary PIX Firewall. This works very much like Hot Standby Router Protocol (HSRP) in Cisco IOS. The main difference is that the PIX does not require configuration of a virtual IP address for each interface.
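The address swap described above can be modeled in a few lines of Python. This is only an illustrative sketch, not PIX code: the `Unit` class and `fail_over` function are hypothetical names, the MAC addresses are made up, and the IP addresses are the Outside-interface addresses used in the examples later in this article.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    role: str   # "primary" or "secondary"; fixed by the failover cable end
    state: str  # "active" or "standby"; changes on failover
    ip: str     # address the unit currently answers on one interface
    mac: str    # MAC address the unit currently uses on that interface


def fail_over(active: Unit, standby: Unit) -> None:
    """Swap addresses and states; the primary/secondary roles never change."""
    active.ip, standby.ip = standby.ip, active.ip
    active.mac, standby.mac = standby.mac, active.mac
    active.state, standby.state = "standby", "active"


# Hypothetical MAC addresses, for illustration only.
primary = Unit("primary", "active", "192.168.10.1", "00:00:00:00:00:01")
secondary = Unit("secondary", "standby", "192.168.10.2", "00:00:00:00:00:02")

fail_over(primary, secondary)
print(secondary)  # secondary is now active and holds the system IP and MAC
print(primary)    # primary is now standby and holds the failover IP and MAC
```

Because the active unit always answers on the same IP and MAC addresses, hosts on the network do not need to learn anything new after a switchover.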
In this section, you will learn the functional components of Failover and the internal processes that govern its operation.
The failover cable is the only additional hardware required to support failover. The failover cable is a modified RS-232 serial link cable operating at a speed of 9600 baud. A failover cable is shipped with every PIX Firewall.
In PIX Software Release 5.2, the speed was increased to 115.2K baud.
Basic failover communication is performed through the failover cable. Communication through the failover cable is message-based and reliable. Every message sent requires an acknowledgement (an ACK). If a message is not ACK'd by the other PIX within 3 seconds, the message is retransmitted. After 5 retransmissions without an ACK (for a total of 15 seconds), a failover condition is triggered and the standby PIX fails the Primary and becomes the Active PIX.
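The timing of this acknowledgement scheme can be sketched as a small Python loop. This is a rough model rather than the PIX implementation: `send_reliably` and its `ack_received` callback are hypothetical names, and the text does not spell out how the first transmission counts toward the five, so the sketch simply counts five unacknowledged 3-second timeouts to match the 15-second total.

```python
def send_reliably(ack_received, ack_timeout_s=3, max_unacked=5):
    """Return (delivered, elapsed_seconds) for one message.

    ack_received is a hypothetical callable: it takes the attempt number
    (starting at 1) and returns True if the peer ACK'd within the timeout.
    """
    elapsed = 0
    for attempt in range(1, max_unacked + 1):
        if ack_received(attempt):
            return True, elapsed
        elapsed += ack_timeout_s  # waited a full timeout with no ACK
    return False, elapsed         # caller treats this as a failover condition


# Peer never answers: the message is given up on after 5 x 3 = 15 seconds.
print(send_reliably(lambda attempt: False))         # (False, 15)

# Peer answers on the third attempt, after two timeouts.
print(send_reliably(lambda attempt: attempt == 3))  # (True, 6)
```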
The orientation of the failover cable is crucial to correct failover operation. The end of the failover cable labeled Primary must be connected to the failover port of the Primary-Active PIX.
Failover communicates the following messages through the failover cable:
- MAC addresses exchange
- Hello (keep-alive)
- State (Active/Standby)
- Network Link Status
- Configuration Replication
Configuration replication is the function of synchronizing the configuration of the primary PIX Firewall to the secondary PIX Firewall. For configuration replication to succeed, both the primary and secondary PIX Firewalls must be exact matches of each other in both hardware and software (as previously stated). Configuration replication occurs over the failover cable from the active PIX Firewall to the standby PIX Firewall when any of these three events occurs:
When the standby PIX Firewall completes its initial bootup, the active PIX Firewall replicates its entire configuration to the standby PIX Firewall.
As commands are entered on the active PIX Firewall, they are sent across the failover cable to the standby PIX Firewall.
By entering the write standby command on the active PIX Firewall, which forces the entire configuration in memory to be sent to the standby PIX Firewall.
Configuration replication only occurs from the running config of the Primary to the running config of the Secondary. Because this is not a permanent place to store configurations, you must use the write memory command to write the configuration into NVRAM on both units. If failover occurs during replication, the new active PIX Firewall will have only a partial configuration. To recover from a configuration synchronization failure, you will need to force the Primary back to active and use the write standby command to update the Secondary.
When replication starts, the PIX Firewall console displays the message Sync Started, and when complete, displays the message Sync Completed. During replication, information cannot be entered on the PIX Firewall console. Replication can take a long time to complete for a large configuration because it runs over the serial failover cable. This is especially true on units running PIX OS 5.1 or earlier, when the baud rate of the cable was only 9600.
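To get a feel for why the cable speed matters, here is a back-of-the-envelope estimate in Python. It rests on several assumptions that do not come from Cisco: the 50,000-byte configuration size is hypothetical, each byte is assumed to cost 10 bits on the wire (8 data bits plus start and stop bits, a common serial framing), baud is treated as bits per second, and the ACK and protocol overhead is ignored. Real replication will take longer than this lower bound.

```python
def replication_seconds(config_bytes, baud, bits_per_byte=10):
    """Lower-bound transfer time for a configuration over the serial cable."""
    return config_bytes * bits_per_byte / baud


CONFIG_BYTES = 50_000  # hypothetical configuration size

for baud in (9600, 115200):
    seconds = replication_seconds(CONFIG_BYTES, baud)
    print(f"{baud} baud: {seconds:.1f} s")
# 9600 baud: 52.1 s
# 115200 baud: 4.3 s
```

Under these assumptions, the faster cable speed cuts the raw transfer time by a factor of 12.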
There is a failover poll interval of 15 seconds to monitor network activity, failover communications, and the power status. A failure of any of these parameters on the active unit will cause the standby unit to take active control. Whenever a unit is determined to have failed, it shuts down its network interfaces.
The two units send special failover hello packets to each other over the failover cable and all interfaces (excluding those that are administratively shut down) every 15 seconds. If either unit does not hear the hello on an interface for 3 consecutive poll checks, the PIX puts that LAN interface into testing mode to determine where the fault lies. If a standby PIX does not receive a hello from the failover cable for 3 consecutive poll checks, the standby PIX initiates a switchover and declares the other PIX failed. If the active PIX does not hear the hello messages, it stays active and sets the other PIX as failed.
A network interface is placed in testing mode if a hello packet is not received. Testing of a network interface is non-intrusive, meaning that, while it is in testing mode, it still attempts to pass normal traffic. The testing process consists of 4 individual tests geared toward stimulating network traffic:
NIC status test: The PIX performs link up/link down tests for up to 5 seconds.
Network activity test: If all interfaces on both PIX units pass the link test, the PIX will listen for up to 5 seconds for network activity on all interfaces. If no activity is received on an interface, the offending PIX is failed.
Address Resolution Protocol (ARP) test: If the preceding two tests pass, the PIX reads the 10 most recent ARP entries and attempts to ping each of them.
PING test: As a final arbiter should the previous three tests all pass, the PIX will send directed broadcasts out on each interface and listen for responses.
If an interface that is in testing mode is capable of receiving traffic, it is considered operational. If it can hear other network traffic, it is assumed the error must be with the other unit not being able to send the hello packet. This results in failing the other unit. If it is determined that the testing unit cannot receive network traffic while the other can, the testing unit fails itself.
In addition to monitoring all network interfaces, failover also monitors the power status of the other unit, as well as the status of the failover cable itself. The failover cable provides the ability to detect if the other unit is plugged in and powered on. If the cable is unplugged from either unit, switching is disabled. If an active unit loses power, the standby unit takes over within 15 seconds. A unit in the failed state waits 15 seconds, and then tries to transition to the standby state. If the transition triggers a failure, the unit fails again. You can also issue the failover reset command to manually reset the PIX from the failed to the standby state; here too, if the transition triggers a failure, the unit fails again. A PIX in the failed state cannot switch into active state.
If the failure is due to a link down condition on an interface, a link up condition clears the failed state (for example, if an interface is unplugged and then later plugged in).
Failover Monitoring Using the show failover Command
The following examples assume the failover cable is installed and operational. They also assume that the units have been configured with a System IP address of 192.168.10.1 and a Failover IP address of 192.168.10.2 for the Outside interface and a System IP address of 10.10.10.1 and Failover IP address of 10.10.10.2 for the Inside interface.
Configuring a firewall for failover and not setting the "failover ip address" can lead to the two PIX's flip-flopping between active and standby.
Example 1 shows the normal output of the show failover command. Note that the IP address of each unit is displayed. If no failover IP address has been entered, it displays 0.0.0.0 and monitoring of the interfaces remains in the waiting state. See Example 2 for an explanation of the waiting state.
Example 1 Normal Failover
pixfirewall(config)# show failover
Failover On
Cable status: Normal
Reconnect timeout 0:00:00
This host: Primary - Active
    Active time: 6885 (sec)
    Interface Outside (192.168.10.1): Normal
    Interface Inside (10.10.10.1): Normal
Other host: Secondary - Standby
    Active time: 0 (sec)
    Interface Outside (192.168.10.2): Normal
    Interface Inside (10.10.10.2): Normal
Failover does not start monitoring the network interfaces until it has heard the second hello packet from the other unit on that interface. Using the default failover poll 15 setting, this should take 30 seconds. If the PIX units are attached to a Layer 2 switch running Spanning Tree Protocol (STP), this takes twice the forward delay time configured in the switch (typically configured as 15 seconds), plus this 30-second delay, for a total of one minute. At PIX bootup and immediately following a failover event, the Layer 2 switch detects a temporary bridge loop. Upon detection of the loop, it stops forwarding packets on these interfaces for the forward delay time. It then enters the listen mode for an additional forward delay time, during which time the switch is listening for bridge loops but not forwarding traffic (and thus not forwarding failover hello packets). After twice the forward delay time (30 seconds), traffic should resume flowing. Each PIX remains in waiting mode until it hears 30 seconds' worth of hello packets from the other unit. During this time, the PIX passes traffic and does not fail the other unit based on not hearing the hello packets. All other failover monitoring is still occurring (that is, Power, Interface Loss of Link, and Failover Cable hello). Example 2 shows the failover interfaces in the waiting state, indicating that two failover hellos have yet to be exchanged.
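The arithmetic behind these delays is easy to check. The Python sketch below uses only the numbers given above (a 15-second poll interval, two hellos, and a 15-second STP forward delay); the function name and its parameters are hypothetical.

```python
def monitoring_start_delay(poll_interval_s=15, hellos_needed=2,
                           stp_forward_delay_s=0):
    """Seconds before failover starts monitoring a network interface."""
    stp_blocked_s = 2 * stp_forward_delay_s  # switch holds traffic for two forward delays
    hello_wait_s = hellos_needed * poll_interval_s
    return stp_blocked_s + hello_wait_s


print(monitoring_start_delay())                        # 30 (no STP in the path)
print(monitoring_start_delay(stp_forward_delay_s=15))  # 60 (switch running STP)
```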
Example 2 Failover in the Waiting State (Uninitialized)
pixfirewall(config)# show failover
Failover On
Cable status: Normal
Reconnect timeout 0:00:00
This host: Primary - Active
    Active time: 6930 (sec)
    Interface Outside (192.168.10.1): Normal (Waiting)
    Interface Inside (10.10.10.1): Normal (Waiting)
Other host: Secondary - Standby
    Active time: 15 (sec)
    Interface Outside (192.168.10.2): Normal (Waiting)
    Interface Inside (10.10.10.2): Normal (Waiting)
In Example 3, the failover process has detected an interface failure. Note that Interface Inside on the primary unit is the source of the failure. The units are back in waiting mode because of the failure. During this process, the primary PIX Firewall swaps its system IP addresses with the secondary PIX Firewall's failover IP addresses.
The failed unit has removed itself from the network (interfaces are down) and is no longer sending hello packets on the network. The active unit remains in a waiting state until the failed unit is replaced and failover communications starts again.
Example 3 The Failover Process Detects an Interface Failure
pixfirewall(config)# show failover
Failover On
Cable status: Normal
Reconnect timeout 0:00:00
This host: Primary - Standby (Failed)
    Active time: 7140 (sec)
    Interface Outside (192.168.10.2): Normal (Waiting)
    Interface Inside (10.10.10.2): Failed (Waiting)
Other host: Secondary - Active
    Active time: 30 (sec)
    Interface Outside (192.168.10.1): Normal (Waiting)
    Interface Inside (10.10.10.1): Normal (Waiting)
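If you collect show failover output from several firewalls, a small script can pick out the failed interfaces for you. The Python sketch below parses the text of Example 3. It is an illustration only: the function names are hypothetical, and the regular expressions assume the output layout shown in this article, which may differ between PIX OS versions.

```python
import re

SAMPLE = """\
Failover On
Cable status: Normal
Reconnect timeout 0:00:00
This host: Primary - Standby (Failed)
    Active time: 7140 (sec)
    Interface Outside (192.168.10.2): Normal (Waiting)
    Interface Inside (10.10.10.2): Failed (Waiting)
Other host: Secondary - Active
    Active time: 30 (sec)
    Interface Outside (192.168.10.1): Normal (Waiting)
    Interface Inside (10.10.10.1): Normal (Waiting)
"""

HOST_RE = re.compile(r"^(This|Other) host:\s*(\w+)\s*-\s*(.+)$")
IFACE_RE = re.compile(r"^Interface (\S+) \(([\d.]+)\):\s*(.+)$")


def parse_show_failover(text):
    """Return {"This": {...}, "Other": {...}} with role, state and interfaces."""
    hosts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        host = HOST_RE.match(line)
        if host:
            which, role, state = host.groups()
            current = {"role": role, "state": state, "interfaces": {}}
            hosts[which] = current
            continue
        iface = IFACE_RE.match(line)
        if iface and current is not None:
            name, ip, status = iface.groups()
            current["interfaces"][name] = (ip, status)
    return hosts


def failed_interfaces(hosts):
    """List (host, interface) pairs whose status starts with 'Failed'."""
    return [(which, name)
            for which, host in hosts.items()
            for name, (_ip, status) in host["interfaces"].items()
            if status.startswith("Failed")]


hosts = parse_show_failover(SAMPLE)
print(hosts["This"]["role"], "/", hosts["This"]["state"])  # Primary / Standby (Failed)
print(failed_interfaces(hosts))                            # [('This', 'Inside')]
```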
Fail back is the term used to describe the action of restoring PIX operation from the Secondary-Active back to the Primary-Failed PIX. Fail back to the primary unit is not automatically forced, as there is no reason to switch active and standby roles. When a failed primary unit is repaired and brought back on line, it does not automatically resume as the active unit. To force a unit to be the active unit, use the failover active command on the Primary-Standby unit or the no failover active command on the Secondary-Active unit.
The results of issuing the failover active command vary depending on whether Failover or Stateful Failover is configured.
If Stateful Failover is used, connection state information is passed from the active unit to the standby unit.
In Failover mode, state information is not tracked and sessions must be reestablished by applications. This means all active connections are dropped after a switchover.
This section discusses the differences between failover and stateful failover modes.
As stated earlier, failover enables the standby PIX Firewall to take over the duties of the active PIX Firewall when the active PIX Firewall fails. There are two types of failover:
Failover: When the active PIX Firewall fails and the standby PIX Firewall becomes active, all connections are lost and client applications must initiate a new connection to restart communication through the PIX Firewall. The disconnection occurs because the standby PIX Firewall has no facility to receive connection information from the active PIX Firewall. The channel provided by the failover cable lacks the bandwidth necessary to maintain state synchronization between the two PIX units.
Stateful failover: When the active PIX Firewall fails and the standby PIX Firewall becomes active, the same connection information is available at the new active PIX Firewall, and end-user applications are not required to reconnect to keep the same communication session. The connections remain because the stateful failover feature passes per-connection stateful information to the standby PIX Firewall. The TCP connection table (except HTTP) is synchronized with the Secondary PIX over the interface chosen for Stateful Failover.
Stateful failover requires a 100 Mbps Ethernet interface on each PIX to be used exclusively for passing state information between the two PIX Firewalls. These interfaces can be connected by any of the following:
Category 5 crossover cable directly connecting the primary PIX Firewall to the secondary PIX Firewall (100Mb half or full duplex)
100BaseTX half-duplex hub using straight Category 5 cables
100BaseTX full duplex on a dedicated switch or dedicated virtual LAN (VLAN) of a switch using straight Category 5 cables.
I hope that this article has improved your understanding of the failover mechanism of the Cisco PIX Firewall. For information on configuration as well as many helpful tips, please refer to the Failover chapter of Cisco® Secure PIX® Firewalls. The book also provides basic and advanced configuration aspects of the Cisco PIX.