Which Routing Protocol Should My Network Use?
- Is One Protocol "Better" Than the Others?
- Which Designs Play to the Strength of Each Protocol?
- What Are the Tradeoffs?
Of all the thorny questions that network engineers are asked on a regular basis, one of the hardest is probably this one:
- My network currently runs Enhanced Interior Gateway Routing Protocol (EIGRP). Would I be better off if I switched to Open Shortest Path First (OSPF)?
You can replace the two protocols mentioned in this sentence with any pair of protocols among the advanced interior gateway protocols (OSPF, Intermediate System-to-Intermediate System [IS-IS] and EIGRP), and you have described a question that routing protocol engineers are asked probably thousands of times a year. Of course, convergence is always faster on the other side of the autonomous system boundary, so to speak, so it is always tempting to jump to another protocol as soon as a problem crops up with the one you are running.
How do you answer this question in real life? You could try the standard, "It depends," but does this really answer the question? The tactic in the Routing Protocols Escalation Team was to ask the questioners questions until they went away. Neither response helps the network operator or designer answer the underlying question, "How do you decide which protocol is the best?"
Three questions are embedded within this question, really, and it is easier to think about them independently:
- Is one protocol, in absolute terms, "better" than all the other protocols, in all situations?
- If the answer to this first question is "No," does each routing protocol exhibit some set of characteristics that indicate it would fit some situations (specifically, network topologies) better than others?
- After you have laid out the basics, what is the tradeoff in living with what you currently have versus switching to another routing protocol? What factors do you need to consider when doing the cost/benefit analysis involved in switching from one routing protocol to another?
This appendix takes you through each of these three questions. This might be the first and last time that you hear a network engineer actually answer the question, "Which routing protocol should I use?" so get ready for a whirlwind tour through the world of routing.
Is One Protocol "Better" Than the Others?
The first thing you need to do with this sort of question is to qualify it: "What do you mean by better?" Some protocols are easier to configure and manage, others are easier to troubleshoot, some are more flexible, and so on. Which one are you going to look at?
This appendix examines ease of troubleshooting and convergence time. You could choose any number of other measures, including these:
- Ease of management—What do the Management Information Bases (MIBs) of the protocol cover? What sorts of show commands are available for taking a network baseline?
- Ease of configuration—How many commands will the average router configuration in your network require? Is it possible to configure several routers in your network with the same configuration?
- On-the-wire efficiency—How much bandwidth does the routing protocol take up while in steady state, and how much could it take up, at most, when converging in response to a major network event?
Ease of Troubleshooting
The average uptime (or reliability) of a network is affected by two elements:
- How often does the network fail?
- How long does it take to recover from a failure?
The network design and your choice of equipment (not just the vendor and operating system, but also putting the right piece of equipment into each role and making certain that each device has enough memory, and so on) play heavily into the first element. The design of the network also plays into the second element. The piece often forgotten about when considering the reliability of a network is how long it takes to find and fix, or troubleshoot, the network when it fails.
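The two elements above are commonly expressed as mean time between failures (MTBF) and mean time to repair (MTTR). The text does not use these terms, but they map onto its two questions. The following is a minimal sketch, with made-up numbers, of how the time spent troubleshooting feeds into availability:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the network is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: the network fails once every 1,000 hours on average.
slow_fix = availability(1000.0, 4.0)  # four hours to find and fix the problem
fast_fix = availability(1000.0, 0.5)  # thirty minutes to find and fix it

print(f"slow troubleshooting: {slow_fix:.4%}")
print(f"fast troubleshooting: {fast_fix:.4%}")
```

The failure rate is the same in both cases; only the time to find and fix the problem changes, which is why the troubleshooting tools matter.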
Ease of management plays a role in the ease of troubleshooting, of course; if it is hard to take a baseline of what the network is supposed to look like, you will not do so on a regular basis, and you will have a dated picture to troubleshoot from. The tools available for troubleshooting are also important. Of course, this is going to vary between the implementations of the protocols; here, implementations in Cisco IOS Software illustrate the concepts. Table G-1 outlines some of the troubleshooting tools that are available in EIGRP, OSPF, and IS-IS, in Cisco IOS Software.
Table G-1 Cisco IOS Software Troubleshooting Tools for EIGRP, OSPF, and IS-IS
| | EIGRP | OSPF | IS-IS |
|---|---|---|---|
| Debug Neighbors | Neighbor formation state; hello packets. | Neighbor formation state; hello packets. | Packets exchanged during neighbor formation. |
| Log Neighbor State | Yes. | Yes. | No. |
| Debug Database Exchange and Packets | Packets exchanged (updates, replies, and so on), with filters per neighbor or for a specific route. | Packets flooded, with filters for specific routing information. Packets retransmitted. | Packets flooded. |
| Debug Interactions with the Routing Table | Yes. | No. | No. |
| Debug Route Selection Process | Yes (DUAL¹ FSM² events). | Yes (SPF³ events). | Yes (SPF events). |
| Show Database | Yes, by specific route and route state. | Yes, by LSA⁴ type and advertising router. | Yes, by LSP⁵ ID or type of route. |
| Event Log | Yes; understandable if you comprehend DUAL and its associated terminology. | Yes; only understandable if you have access to the source code. | No. |
1 DUAL = Diffusing Update Algorithm
2 FSM = finite state machine
3 SPF = shortest path first
4 LSA = link-state advertisement
5 LSP = link-state packet
From this table, you can see that EIGRP generally provides the most tools for finding a problem in the network quickly, with OSPF running a close second.
Which Protocol Converges Faster?
I was once challenged with the statement, "There is no way that a distance vector protocol can ever converge faster than a link-state protocol!" An hour and a half later, I think the conversation tapered off into, "Well, in some situations, I suppose a distance vector protocol could converge as fast as a link-state protocol," said without a lot of conviction.
In fact, just about every network engineer can point to reasons why a specific routing protocol should always converge faster than some other protocol. The reality is that all routing protocols can converge quickly or slowly, depending on many factors that are strictly related to network design, even before you consider the hardware, types of links, and other random factors that play into convergence speed in different ways with each protocol. As a specific example, look at the small network illustrated in Figure G-1 and consider the various options and factors that might play into convergence speed in this network.
Figure G-1 Simple Network
This figure purposefully has no labels showing anything about routing protocol configuration or design. Instead, this section covers several possible routing configurations and examines how the same protocol could converge more or less quickly, even on a network this small, through just minor configuration changes.
Start with EIGRP as an example:
- The Router A to C link has a bandwidth of 64 kbps.
- The Router A to B link has a bandwidth of 10 Mbps.
- The Router B to D and Router C to D links have equal bandwidths.
With this information in hand, you can determine that Router D is going to mark the path to 10.1.1.0/24 through Router B as the best path (the successor in EIGRP terms). The path through Router C will not be marked as a feasible successor, because the differential in the metrics is too great between the two paths. To the EIGRP process running on Router D, the path through Router C cannot be proven loop free based on the metrics advertised by Routers B and C, so the path through Router C will not be installed as a possible backup route.
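To make the feasibility check concrete, here is a minimal sketch that applies the classic EIGRP composite metric with default K values to the first set of assumptions. The delay values, the 10-Mbps bandwidth chosen for the two equal links, and the decision to ignore the interface attached to 10.1.1.0/24 are all assumptions made for illustration; the text does not give them.

```python
def eigrp_metric(bandwidths_kbps, delays_usec):
    """Classic EIGRP composite metric with default K values (K1 = K3 = 1).

    Uses the slowest link bandwidth along the path and the sum of the delays.
    """
    slowest = min(bandwidths_kbps)
    return 256 * (10_000_000 // slowest + sum(delays_usec) // 10)


# Assumed link properties (hypothetical): 10-Mbps links with 1,000 usec of
# delay, and the 64-kbps link with 20,000 usec of delay.
FAST = (10_000, 1_000)
SLOW = (64, 20_000)

# Reported distances: what Routers B and C advertise to Router D.
rd_b = eigrp_metric([FAST[0]], [FAST[1]])  # Router B via the A-B link
rd_c = eigrp_metric([SLOW[0]], [SLOW[1]])  # Router C via the A-C link

# Router D's own distance through each neighbor adds its local link.
via_b = eigrp_metric([FAST[0], FAST[0]], [FAST[1], FAST[1]])
via_c = eigrp_metric([FAST[0], SLOW[0]], [FAST[1], SLOW[1]])

feasible_distance = min(via_b, via_c)

# Feasibility condition: a neighbor's reported distance must be strictly
# lower than the feasible distance for its path to be provably loop free.
c_is_feasible_successor = rd_c < feasible_distance

print("best path through B:", via_b < via_c)
print("C is a feasible successor:", c_is_feasible_successor)
```

With these assumed values, the distance that Router C reports is far higher than Router D's best metric through Router B, so the check fails and Router C is not kept as a backup.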
This means that if the Router B to D link fails, Router D is forced to mark 10.1.1.0/24 as active and send a query to Router C. The convergence time is bounded by the amount of time it takes for the following tasks:
- Router D to examine its local topology table and determine that no other known loop-free paths exist.
- Router D to build and transmit a query toward Router C.
- Router C to receive and process the query, including examining its local EIGRP topology table, and find it still has an alternate path.
- Router C to build a reply to the query and transmit it.
- Router D to receive the reply and process it, including route installation time and the time required to change the information in the forwarding tables on the router.
Many factors are contained in these steps; any one of them could take a long time. In the real world, the total time to complete the steps in this network is less than two or three seconds.
Now change the assumptions just slightly and see what the impact is:
- The Router A to C link and A to B links have equal bandwidth.
- The Router B to D link has a bandwidth of 64 kbps.
- The Router B to C link has a bandwidth of 10 Mbps.
As you can tell, the network conditions have been changed only slightly, but the results are altered dramatically. In this case, the path to 10.1.1.0/24 through Router C is chosen as the best path. EIGRP then examines the path through Router B and finds that it is a loop-free path, based on the information embedded in EIGRP metrics. What happens if the Router B to C link fails?
The process has exactly one step: Router D examines its local EIGRP topology table and finds that an alternate loop-free path is available. Router D installs this alternate route in the local routing table and alters the forwarding information as needed. This processing takes on the order of 150 milliseconds or less.
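The difference between the two EIGRP cases comes down to one local decision on Router D. The following is a simplified sketch of that decision, not the DUAL finite state machine itself; the `Path` class, the function name, and the metric values are hypothetical and chosen only to mirror the two sets of assumptions.

```python
from dataclasses import dataclass


@dataclass
class Path:
    neighbor: str
    reported_distance: int  # metric the neighbor advertises
    total_distance: int     # metric through that neighbor from this router


def on_successor_lost(paths, failed_neighbor, feasible_distance):
    """Decide what to do when the path through the successor is lost.

    Returns ("switch", neighbor) when a feasible successor exists, or
    ("query", [neighbors]) when the route must go active.
    """
    remaining = [p for p in paths if p.neighbor != failed_neighbor]
    feasible = [p for p in remaining if p.reported_distance < feasible_distance]
    if feasible:
        best = min(feasible, key=lambda p: p.total_distance)
        return ("switch", best.neighbor)
    return ("query", [p.neighbor for p in remaining])


# First set of assumptions: best path through B, C not provably loop free.
scenario_one = [Path("B", 281_600, 307_200), Path("C", 40_512_000, 40_537_600)]
first = on_successor_lost(scenario_one, "B", feasible_distance=307_200)

# Second set of assumptions: best path through C, B is a feasible successor.
scenario_two = [Path("C", 281_600, 307_200), Path("B", 281_600, 40_537_600)]
second = on_successor_lost(scenario_two, "C", feasible_distance=307_200)

print(first)
print(second)
```

In the first case the function falls through to a query, which is the multistep exchange with Router C described earlier. In the second case it returns immediately with the backup neighbor, which is why the switch takes so little time.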
Using the same network, examine the various reactions of OSPF to link failures. Begin with these:
- The Router B to D link has a cost of 20.
- All other links in the network have a cost of 10.
- All routes are internal OSPF routes.
What happens if the Router B to C link fails?
- Router B and C detect the link failure and wait some period of time, called the link-state advertisement (LSA) generation time. Then they flood modified router LSAs with this information.
- The remaining routers in the network receive this new LSA and place it in their local link-state databases. The routers wait some period of time, called the shortest path first (SPF) wait time, and then run SPF.
- In the process of running SPF, or after SPF has finished running (depending on the implementation), OSPF will install new routing information in the routing table.
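Figure G-1 is not reproduced in this text, so the link list in the following sketch is an assumption assembled from the links the text names; treat it as hypothetical. With that caveat, this minimal Dijkstra implementation shows the kind of calculation each router performs when it runs SPF, here from Router C's point of view before and after the Router B to C link is removed from the database.

```python
import heapq


def spf(links, root):
    """Dijkstra's algorithm over undirected links given as {(a, b): cost}."""
    adjacency = {}
    for (a, b), cost in links.items():
        adjacency.setdefault(a, []).append((b, cost))
        adjacency.setdefault(b, []).append((a, cost))

    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, cost in adjacency.get(node, []):
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Assumed topology (hypothetical): B-D costs 20, every other link costs 10.
links = {
    ("A", "B"): 10, ("A", "C"): 10, ("B", "C"): 10,
    ("B", "D"): 20, ("C", "D"): 10, ("D", "E"): 10,
}

before = spf(links, "C")
after = spf({k: v for k, v in links.items() if k != ("B", "C")}, "C")

print("before failure:", before)
print("after failure: ", after)
```

The calculation itself is fast on a network this small. The waiting periods before LSA generation and before SPF runs account for most of the convergence time.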
With the default timers, it could take up to one second (or longer, in some situations) to detect the link failure and then about three and a half seconds to flood the new information. Finally, it could take up to two and a half seconds before the receiving routers will run SPF and install the new routing information. With faster timers and various sorts of tuning, you can decrease these numbers to about one second, or even into the 300-millisecond range in some specific deployments.
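Adding up the figures quoted above gives a rough worst-case budget for the default timers. This sketch only does the arithmetic; the dictionary labels are informal descriptions, not timer names from any implementation.

```python
# Worst-case figures quoted in the text for default timers, in seconds.
default_timers = {
    "detect link failure": 1.0,
    "generate and flood LSA": 3.5,
    "wait, run SPF, install routes": 2.5,
}

worst_case = sum(default_timers.values())

# Targets the text mentions for tuned deployments, in seconds.
tuned_targets = (1.0, 0.3)

print(f"default timers, worst case: {worst_case:.1f} s")
for target in tuned_targets:
    print(f"tuned to {target:.1f} s: about {worst_case / target:.0f}x faster")
```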
Making Router D an area border router (ABR) dramatically impacts the convergence time from the Router E perspective because Router D has to perform all the preceding steps to start convergence. After Router D has calculated the new correct routing information, it must generate and flood a new summary LSA to Router E, and Router E has to recalculate SPF and install new routes.
Redistributing 10.1.1.0/24 into the network and making the area that contains Routers A, B, C, and D into a not-so-stubby area (NSSA) throws another set of timers into the problem. Router D now has to translate the Type 7 external LSA into an external Type 5 LSA before it can flood the new routing information to Router E.
These conditions do not even include the impact of multiple routes on the convergence process. EIGRP, for instance, can switch from its best path to a known loop-free path for 10,000 routes just about as fast as it can switch 1 route under similar conditions. OSPF performance is adversely impacted by the addition of 10,000 routes into the network, possibly doubling convergence time.
You can see, then, that it is not so simple to say, "EIGRP will always converge faster than OSPF," "IS-IS will always converge faster than EIGRP," or any other combination you can find. Some people say that OSPF always converges faster than EIGRP, for instance, but they are generally considering only intra-area convergence and not the impact of interarea operations, the impact of various timers, the complexity of the SPF tree, and other factors. Some people say that EIGRP always converges faster than any link-state protocol, but that depends on the number of routers involved in the convergence event. The shorter the query path, the faster the network converges.
If you align all the protocol convergence times based on the preceding examination, you generally find the convergence times in this order, from shortest to longest:
- EIGRP with feasible successors.
- Intra-area OSPF or IS-IS with fast or tuned timers.
- EIGRP without feasible successors.
- Intra-area OSPF or IS-IS with standard timers.
- Interarea OSPF or IS-IS.
The last three are highly variable, in reality. In any particular network, OSPF, IS-IS, and EIGRP without feasible successors might swap positions on the list. The network design, configuration, and a multitude of other factors impact the convergence time more than the routing protocol does. You get the best convergence time out of a routing protocol if you play the network design to the strengths of the protocol.