- Simplicity versus Flexibility versus Optimality
- Knowing the Problem You're Trying to Solve
- Overhead and Scaling
- Operation Above Capacity
- Compact IDs versus Object Identifiers
- Optimizing for the Most Common or Important Case
- Forward Compatibility
- Migration: Routing Algorithms and Addressing
- Making Multiprotocol Operation Possible
- Running over Layer 3 versus Layer 2
- Determinism versus Stability
- Performance for Correctness
- In Closing
One type of robustness is simple robustness, in which the protocol adapts to node and link fail-stop failures.
Another type is self-stabilization: operation may be disrupted by an extraordinary event, such as a malfunctioning node injecting incorrect messages, but after the malfunctioning node is disconnected from the network, the network should return to normal operation on its own. The ARPANET link state distribution protocol was not self-stabilizing. After a sick router injected a few bad LSPs, the network would have stayed down forever were it not for hours of difficult manual intervention, even though the sick router had failed completely hours before and only "correctly functioning" routers were participating in the protocol.
"We'll let you know if there's a problem" protocol. This drives me crazy because it's so popular. It doesn't work because the problem notification message is invariably a datagram and can get lost. For example, at one conference, the person organizing all the hotel rooms for the speakers said, "Don't worry if you haven't gotten a confirmation. We'd let you know if there was a problem getting you a room." Sure enough, at least one of us showed up and had no hotel room.
Another type is Byzantine robustness. The network can continue to work properly even in the face of malfunctioning nodes, whether the malfunctions are caused by hardware problems or by malice.
As society becomes increasingly dependent on networks, it is desirable to attempt to achieve Byzantine robustness in any distributed algorithm such as clock synchronization, directory system synchronization, or routing. This is difficult, but it is important if the protocol is to be used in a hostile environment (such as when the nodes cooperating in the protocol are remotely manageable from across the Internet or when a disgruntled employee might be able to physically access one of the nodes).
Following are some interesting points to consider when your goal is to make a system robust.
Every line of code should be exercised frequently. If there is code that gets invoked only when the nuclear power plant is about to explode, it is possible that the code will no longer work when it is needed. Modifications may have been made to the system since the special case code was last checked, or seemingly unrelated events such as increasing link bandwidth may cause code to stop working properly.
Sometimes it is better to crash than to degrade gradually in the presence of problems, so that the problems can be fixed, or at least diagnosed. For example, it might be preferable to bring down a link that has a high error rate rather than leave it up.
It is sometimes possible to partition the network with containment points so that a problem on one side does not spread to the other. An example is attaching two LANs with a router rather than a bridge: a broadcast storm (which uses data link multicast) will spread across a bridge to both LANs, whereas it will not spread through a router.
Connectivity can be weird. A link might be one-way, either because that is how the technology works or because the hardware is broken (say, one side has a broken transmitter, or the other a broken receiver). A link might work but be sensitive to certain bit patterns. Or a node might appear to be a neighbor when in fact there are bridges in between, and somewhere on the bridged path is a link with a smaller MTU; the two nodes look like neighbors, but packets beyond a certain size will not get through. It is a good idea to have your protocol verify that the link is actually functioning properly, perhaps by padding hellos to maximum length to determine whether large packets really get through, by testing that connectivity is two-way, and so on.
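As a sketch of both checks at once (names are hypothetical; real protocols such as IS-IS do something similar with padded hellos and reported-neighbor lists), a node can pad every hello to the maximum packet size and declare a link usable only when the neighbor reports having heard it:

```python
from dataclasses import dataclass

MAX_MTU = 1500  # assumed maximum packet size to probe

@dataclass
class Hello:
    sender: str
    neighbors_heard: list        # IDs this node has recently heard from
    padding: bytes = b""         # pad to full size so large frames are tested

def make_hello(me, heard):
    """Build a hello padded to MAX_MTU; if a smaller-MTU hop hides on the
    path, the padded hello will not arrive and the link stays unusable."""
    h = Hello(me, sorted(heard))
    body = (me + ",".join(h.neighbors_heard)).encode()
    h.padding = b"\x00" * (MAX_MTU - len(body))
    return h

def link_usable(me, hello_from_neighbor):
    """Two-way check: the link counts only if the neighbor lists *us* in
    its hello, proving traffic flows in both directions."""
    return me in hello_from_neighbor.neighbors_heard
```

For example, `link_usable("A", make_hello("B", {"A"}))` is true, but if B never hears A (a one-way link), B's hello omits A and A refuses to use the link.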
Certain checksums detect certain error conditions better than others. For example, if bytes get swapped, Fletcher's checksum catches the problem, whereas the IPv4 checksum does not.
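The difference comes down to position-dependence, which a simplified sketch can illustrate (here using two reordered 16-bit words, a closely related undetected case, rather than swapped bytes): the IPv4-style checksum is a pure one's-complement sum, so reordering its inputs changes nothing, while Fletcher's running second sum weights each byte by its position.

```python
def internet_checksum(data: bytes) -> int:
    """IPv4-style checksum: one's-complement sum of 16-bit words
    (data assumed to have even length)."""
    s = 0
    for i in range(0, len(data), 2):
        s += (data[i] << 8) | data[i + 1]
        s = (s & 0xFFFF) + (s >> 16)      # end-around carry
    return ~s & 0xFFFF

def fletcher16(data: bytes) -> int:
    """Fletcher's checksum: c1 accumulates running sums of c0, making
    the result depend on where each byte appears."""
    c0 = c1 = 0
    for b in data:
        c0 = (c0 + b) % 255
        c1 = (c1 + c0) % 255
    return (c1 << 8) | c0

original = bytes([0x12, 0x34, 0x56, 0x78])
reordered = bytes([0x56, 0x78, 0x12, 0x34])   # the two 16-bit words exchanged

print(internet_checksum(original) == internet_checksum(reordered))  # True: undetected
print(fletcher16(original) == fletcher16(reordered))                # False: detected
```

Because addition is commutative, the one's-complement sum is blind to any reordering of its 16-bit words; Fletcher pays two multiplications' worth of extra work per byte to close that hole.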
Ideally, every packet should be processable at wire speed. An unauthorized person might be unable to generate a correct signature on a packet, but if it takes a node longer to perform the computation needed to recognize that a packet is invalid and should be discarded than it takes to receive the packet, an attacker can mount a denial-of-service attack merely by sending lots of invalid packets.
The "unclear on the concept" protocol (contributed by Joshua Simons).
Joshua connected to a Web site with his secure browser and made a purchase (so the credit card information was sent across the network encrypted). The Web site then sent him a confirmation email (in the clear) with all the details of the transaction, including his credit card information.