Fiber Cable Failure Impacts, Survivability Principles, and Measures of Survivability
- Transport Network Failures and Their Impacts
- Survivability Principles from the Ground Up
- Physical Layer Survivability Measures
- Survivability at the Transmission System Layer
- Logical Layer Survivability Schemes
- Service Layer Survivability Schemes
- Comparative Advantages of Different Layers for Survivability
- Measures of Outage and Survivability Performance
- Measures of Network Survivability
- Restorability
- Reliability
- Availability
- Network Reliability
- Expected Loss of Traffic and of Connectivity
In this chapter we look at the causes of fiber cable failures, identify the impacts of outage, and relate these to the goals for restoration speed. We then give an overview of the basic principles and techniques for network survivability. This provides a first appreciation of the basic approaches of span, path and p-cycle based survivability, which we treat in depth in later chapters. The survey of basic mesh-oriented schemes in this chapter also lets the reader see these schemes in contrast to ring-based schemes that are 100% or more redundant, and which we do not consider further in the book. The chapter concludes with a look at the quantitative measures of network survivability, and the relationships between availability, reliability and survivability.
3.1 Transport Network Failures and Their Impacts
3.1.1 Causes of Failure
It is reasonable to ask why fiber optic cables get cut at all, given the widespread appreciation of how important it is to physically protect such cables. Isn't it enough to just bury the cables suitably deep or put them in conduits and stress that everyone should be careful when digging? In practice what seems so simple is actually not. Despite best efforts at physical protection, it seems to be one of those large-scale statistical certainties that a fairly high rate of cable cuts is inevitable. This is not unique to our industry. Philosophically, the problem of fiber cable cuts is similar to the problems of operating many other large-scale systems. To a lay person it may seem baffling when planes crash, or nuclear reactors fail, or water sources are contaminated, and so on, while experts in the respective technical communities are sometimes amazed it doesn't happen more often! The insider knows of so many things that can go wrong [Vau96]. Indeed some have gone as far as to say that the most fundamental engineering activity is the study of why things fail [Ada91] [Petr85].
And so it is with today's widespread fiber networks: it doesn't matter how advanced the optical technology is, it is in a cable. When you deploy 100,000 miles of any kind of cable, even with the best physical protection measures, it will be damaged, and with surprising frequency. One estimate is that any given mile of cable will operate about 228 years before it is damaged (4.39 cuts/year/1000 sheath-miles) [ToNe94]. At first that sounds reassuring, but on 100,000 installed route miles it implies more than one cut per day on average. To the extent that construction activities correlate with the working week, such failures may also tend to cluster, producing some single days over the course of a year in which perhaps two or three cuts occur. In 2002 the FCC also published findings that metro networks annually experience 13 cuts for every 1000 miles of fiber, and long haul networks experience 3 cuts for every 1000 miles of fiber [VePo02]. Even the lower rate for long haul implies a cable cut every four days on average in a not atypical network with 30,000 route-miles of fiber. These frequencies of cable cut events are hundreds to thousands of times higher than corresponding reports of transport layer node failures, which helps explain why network survivability design is primarily focused on recovery from span or link failures arising from cable cuts.
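The arithmetic behind these figures can be checked with a short sketch. It assumes a uniform cut rate per mile and, as the text does, treats sheath-miles and route-miles interchangeably; the function names are illustrative only and do not come from the cited reports.

```python
# Back-of-envelope check of the cable-cut figures quoted in the text.
# Assumption: cuts occur at a uniform rate per mile of cable.

def years_between_cuts_per_mile(cuts_per_year_per_1000_miles: float) -> float:
    """Mean number of years before any one given mile of cable is cut."""
    return 1000.0 / cuts_per_year_per_1000_miles


def cuts_per_year(cuts_per_year_per_1000_miles: float, route_miles: float) -> float:
    """Expected number of cuts per year over a whole network."""
    return cuts_per_year_per_1000_miles * route_miles / 1000.0


def days_between_cuts(cuts_per_year_per_1000_miles: float, route_miles: float) -> float:
    """Mean number of days between successive cuts somewhere in the network."""
    return 365.0 / cuts_per_year(cuts_per_year_per_1000_miles, route_miles)


if __name__ == "__main__":
    # 4.39 cuts/year/1000 sheath-miles [ToNe94]
    print(f"{years_between_cuts_per_mile(4.39):.0f} years per mile between cuts")
    print(f"{cuts_per_year(4.39, 100_000) / 365.0:.2f} cuts per day on 100,000 miles")
    # 3 cuts/year/1000 miles for long haul [VePo02], 30,000 route-miles
    print(f"{days_between_cuts(3.0, 30_000):.1f} days between long-haul cuts")
```

The same functions can be applied to the metro rate of 13 cuts per 1000 miles to see how quickly the expected cut frequency grows with network size.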
3.1.2 Crawford's Study
After several serious cable-related network outages in the 1990s, a comprehensive survey on the frequency and causes of fiber optic cable failures was commissioned by regulatory bodies in the United States [Craw93]. Figure 3-1 presents data from that report on the causes of fiber failure. As the euphemism of a "backhoe fade" suggests, almost 60% of all cuts were caused by cable dig-ups. Two-thirds of those occurred even though the contractor had notified the facility owner before digging. Vehicle damage was most often suffered by aerial cables from collision with poles, but also from tall vehicles snagging the cables directly or colliding with highway overpasses where cable ducts are present. Human error is typified by a craftsperson cutting the wrong cables during maintenance or during copper cable salvage activities ("copper mining") in a manhole. Power line damage refers to metallic contact of the strain-bearing "messenger cable" in aerial installations with power lines. The resulting I²R (heat dissipation) melts the fiber cable. Rodents (mice, rats, gophers, beavers) seem to be fond of the taste and texture of the cable jackets and gnaw on them in both aerial and underground installations. The resulting cable failures are usually partial (not all fibers are severed). It seems reasonable that by partial gnawing at cable sheaths, rodents must also compromise a number of cables which then ultimately fail at a later time. Sabotage failures were typically the result of deliberate actions by disgruntled employees, or vandalism when facility huts or enclosures are broken into. Today, terrorist attacks on fiber optic cables must also be considered.
Figure 3-1. Immediate cause breakdown for 160 fiber optic cable cuts ([Craw93]).
Floods caused failures by taking out bridge crossings or by water permeation of cables resulting in optical loss increases in the fiber from hydrogen infiltration. Excavation damage reports are distinct from dig-ups in that these were cases of failure due to rockfalls and heavy vehicle bearing loads associated with excavation activities. Treefalls were not a large contributor in this U.S. survey but in some areas where ice storms are more seasonal, tree falls and ice loads can be a major hazard to aerial cables. Conduits are expensive to install, and in much of the country cable burial is also a major capital expense. In parts of Canada (notably the Canadian shield), trenching can be almost infeasible as bedrock lies right at the surface. Consequently, much fiber cable mileage remains on aerial pole-lines and is subject to weather-related hazards such as ice, tree falls, and lightning strikes.
Figure 3-2 shows the statistics of the related service outage and physical cable repair times. Physical repair took a mean time of 14 hours but had a high variance, with some individual repair times reaching 100 hours. The average service outage time over the 160 reported cable cuts was 5.2 hours. As far as can be determined from the report, all 160 of the cable failures reported were single-failure events. This is quite relevant to the applicability and economic feasibility of later methods in the book for optimal spare capacity design.
Figure 3-2. Histogram of service restoration and cable repair times (data from [Craw93]).
In 1997 another interesting report came out on the causes of failure in the overall public switched telephone network (PSTN) [Kuhn97]. Its data on cable-related outages due to component flaws, acts of nature, cable cutting, cable maintenance errors and power supply failures affecting transmission again add up to form the single largest source of outages. Interestingly, Kuhn concludes that human intervention and automatic rerouting in the call-handling switches were the key factors in the system's overall reliability. This is quite relevant, as we aim in this book to reduce the dependence on real-time human intervention wherever possible and, in effect, to achieve the adaptive routing benefits of the PSTN down in the transport layer itself. Also of interest to readers is [Zorp89], which includes details of the famous Hinsdale central-office fire from which many lessons were learned and subsequently applied to physical node protection.
3.1.3 Effects of Outage Duration
There are a variety of user impacts from fiber optic cable failures. Revenue loss and business disruption are often first in mind. As mentioned in the introduction, the Gartner research group attributes up to $500 million in business losses to network failures by the year 2004. Direct voice-calling revenue loss from failure of major trunk groups is frequently quoted at $100,000/minute or more. But other revenue losses may arise from default on service level agreements (SLAs) for private line or virtual network services, or even bankruptcies of businesses that are critically dependent on 1-800 or web-page services. Many businesses are completely dependent on web-based transaction systems or 1-800 service for their order intakes and there are reports of bankruptcies from an hour or more of outage. (Such businesses run with a very finely balanced cash-flow.) Growing web-based e-commerce transactions only increase this exposure. Protection of 1-800 services was one of the first economically warranted applications for centralized automated mesh restoration with AT&T's FASTAR system [ChDo91]. It was the first time 1-800 services could be assured of five minute restoration times. More recently one can easily imagine the direct revenue loss and impact on the reputation of "dot-com" businesses if there is any outage of more than a few minutes.
When the outage times are in the region of a few seconds or below, it is not revenue and business disruptions that are of primary concern, but harmful complications from a number of network dynamic effects that have to be considered. A study by Sosnosky provides the most often cited summary of effects, based on a detailed technical analysis of various services and signal types [Sosn94]. Table 3-1 is a summary of these effects, based on Sosnosky, with some updating to include effects on Internet protocols.
The first and most desirable goal is to keep any interruption of carrier signal flows to 50 ms or less. 50 ms is the characteristic specification for dedicated 1+1 automatic protection switching (APS) systems. An interruption of 50 ms or less in a transmission signal causes only a "hit" that is perceived by higher layers as a transmission error. At most one or two error-seconds are logged on performance monitoring equipment and data packet units for most over-riding TCP/IP sessions will not be affected at all. No alarms are activated in higher layers. The effect is a "click" on voice, a streak on a fax machine, possibly several lost frames in video, and on data services it may cause a packet retransmission but is well within the capabilities of data protocols including TCP/IP to handle. An important debate exists in the industry surrounding 50 ms as a requirement for automated restoration schemes. One view holds that the target for any restoration scheme must be 50 ms. Section 3.1.4 is devoted to a further discussion of this particular issue.
As one moves up from 50 ms outage time the chance that a given TCP/IP session loses a packet increases but remains well within the capability for ACK/NACK retransmission to recover without a backoff in the transmission rate and window size. Between 150 and 200 ms, when a DS-1 level reframe time is added, there is a possibility (<5% at 200 ms) of exceeding the "carrier group alarm" (CGA) times of some older channel bank equipment, at which time the associated switching machine will busy out the affected trunks, disconnecting any calls in progress.
Table 3-1. Classification of Outage Time Impacts
Target Range | Duration | Main Effects / Characteristics |
---|---|---|
Protection Switching | < 50 ms | No outage logged: system reframes, service "hit", 1 or 2 error-seconds (traditional performance spec for APS systems), TCP recovers after one errored frame, no TCP fallback. Most TCP sessions see no impact at all. |
1 | 50 ms - 200 ms | < 5% voiceband disconnects, signaling system (SS7) switch-overs, SMDS (frame-relay) and ATM cell-rerouting may start. |
2 | 200 ms - 2 s | Switched connections on older channel banks dropped (CGA alarms) (traditional max time for distributed mesh restoration), TCP/IP protocol backoff. |
3 | 2 s - 10 s | All switched circuit services disconnected. Private line disconnects, potential data session / X.25 disconnects, TCP session time-outs start, web page not available errors. Hello protocol between routers begins to be affected. |
4 | 10 s - 5 min | All calls and data sessions terminated. TCP/IP application layer programs time out. Users begin attempting mass redials / reconnects. Routers issuing LSAs on all failed links, topology update and resynchronization beginning network-wide. |
"Undesirable" | 5 min - 30 min | Digital switches under heavy reattempts load, "minor" societal / business effects, noticeable Internet "brownout." |
"Unacceptable" | > 30 min | Regulatory reporting may be required. Major societal impacts. Headline news. Service Level Agreement clauses triggered, lawsuits, societal risks: 911, travel booking, educational services, financial services, stock market all impacted. |
With DS1 interfaces on modern digital switches, however, this trunk busy-out does not occur until 2.5 +/- 0.5 seconds. Some other minor network dynamics begin in the range from 150 to 200 ms. In Switched Multi-megabit Data Service (SMDS), cell rerouting processes would usually be beginning by 200 milliseconds. The recovery of any lost data is, however, still handled through higher layer data protocols. The SS7 common channel signaling (CCS) network (which controls circuit-switched connection establishment) may also react to an outage of 100 ms at the SONET level (~150 ms after reframing at the DS-1 level). The CCS network uses DS-0 circuits for its signaling links and will initiate a switchover to its designated backup links if no DS-0 level synch flags are seen for 146 ms. Calls in the process of being set up at the time may be abandoned. Some video codecs using high compression techniques can also require a reframing process in response to a 100 ms outage that can be quite noticeable to users.
In the time frame from 200 ms to two seconds no new effects on switched voiceband services emerge other than those due to the extension of the actual signal lapse period itself. By two seconds the roughly 12% of DS0 circuits that are carried on older analog channel banks (at the time of Sosnosky's study) will definitely be disconnected. In the range from two to 10 seconds the effects become far more serious and visible to users. A quantum change arises in terms of the service-level impact in that virtually all voice connections and data sessions are disconnected. This is the first abrupt perception by users and service level applications of outage as opposed to a momentary hit or retransmission-related throughput drop. At 2.5 +/- 0.5 seconds, digital switches react to the failure states on their transmission interfaces and begin "trunk conditioning"; DS-0, (n)xDS-0 (i.e., "fractional T1"), DS-1 and private line disconnects ("call-dropping") occur. Voiceband data modems typically also time out two to three seconds after detecting a loss of carrier. Session dependent applications such as file transfer using IBM SNA or TCP/IP may begin timing out in this region, although time-outs are user programmable up to higher values (up to 255 seconds for SNA). X.25 packet network time-outs are typically from one to 30 seconds with a suggested time of 5 seconds. When these timers expire, disconnection of all virtual calls on those links occurs. B-ISDN ATM connections typically have alarm thresholds of about five seconds.
In contrast to the 50 ms view for restoration requirements, this region of 1 to 2 second restoration is the main objective that is accepted by many as the most reasonable target, based largely on the cost associated with 1+1 capacity duplication to meet 50 ms, and in recognition that up until about 1 or 2 seconds, there really is very little effect on services. However, two seconds is really the "last chance" to stop serious network and service implications from arising. It is interesting that some simple experiments can dramatically illustrate the network dynamics involved in comparing restoration above and below a 2 second target (whereas there really are no such abrupt or quantum changes in effects anywhere from zero up to the 2 second call-dropping threshold).
Figure 3-3 shows results from a simple teletraffic simulation of a group of 50 servers. The servers can be considered circuits in a trunk group or processors serving web pages. The result shown is based on telephony traffic with a 3 minute holding time. The 50 servers are initially in statistical equilibrium with their offered load at 1% connection blocking. If a call request is blocked, the offering source reattempts according to a uniform random distribution of delay over the 30 seconds following the blocked attempt. Figure 3-3(a) shows the instantaneous connection attempts rate, if the 50 trunk group is severed and all calls are dropped, then followed by an 80% restoration level. Figure 3-3(b) shows the corresponding dynamics of the same total failure, also followed by only 80% restoral, but before the onset of call dropping. Figure 3-3(c) shows how the overall transient effect is yet further mitigated by adaptive routing in the circuit-switched service layer to further reduce ongoing congestion. This dramatically illustrates how beneficial it is in general to achieve a restoration response before connection or session dropping, even if the final restoral level is not 100%.
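The starting point of such a simulation, a 50-server group in equilibrium at 1% blocking, can be sketched as follows. This is only an illustration under stated assumptions: it assumes Poisson arrivals and blocked-calls-cleared behavior so that the Erlang B formula applies, which the text does not state explicitly, and the function names are hypothetical. It does not reproduce the re-attempt dynamics of Figure 3-3.

```python
# Offered load that puts a trunk group at a target blocking level.
# Assumption: Erlang B model (Poisson arrivals, blocked calls cleared).

def erlang_b(servers: int, offered_erlangs: float) -> float:
    """Erlang B blocking probability via the standard stable recurrence."""
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_erlangs * blocking / (n + offered_erlangs * blocking)
    return blocking


def offered_load_for_blocking(servers: int, target_blocking: float) -> float:
    """Find by bisection the offered load (erlangs) giving the target blocking."""
    if not 0.0 < target_blocking < 1.0:
        raise ValueError("target_blocking must be strictly between 0 and 1")
    low, high = 0.0, float(servers)
    # Widen the bracket until blocking at `high` reaches the target.
    while erlang_b(servers, high) < target_blocking:
        high *= 2.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if erlang_b(servers, mid) < target_blocking:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    load = offered_load_for_blocking(50, 0.01)
    holding_time_min = 3.0  # holding time quoted in the text
    print(f"offered load: {load:.2f} erlangs")
    print(f"mean arrival rate: {load / holding_time_min:.2f} attempts per minute")
```

Under these assumptions the equilibrium offered load comes out close to 38 erlangs, so the normal attempt rate is on the order of a dozen calls per minute. That gives a baseline against which the mass re-attempt peaks of Figure 3-3 can be judged.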
Figure 3-3. Traffic dynamic effects (semi-synchronized mass re-attempts) of restoration beyond the call-dropping limit of ~2 seconds (collaboration with M. MacGregor).
The seriousness of an outage that extends beyond several seconds, into the tens of seconds, grows progressively worse: IP networks begin discovering "Hello" protocol failures and attempt to reconverge their routing tables via LSA flooding. In circuit-switched service layers, massive connection and session dropping starts occurring and goes on for the next several minutes. Even if restoration occurred at, say, 10 seconds, there would by then be millions of users and applications that begin a semi-synchronized process of attempting to re-establish their connections. There are numerous reports of large digital switching systems suffering software crashes and cold reboots in the time frame of 10 seconds to a few minutes following a cable cut, due to such effects. The cut itself might not have affected the basic switch stability, but the mass re-attempt overwhelms and crashes the switch. Similar dynamics apply for large IP routers forwarding packets for millions of TCP/IP sessions that similarly undergo an unwittingly synchronized TCP/IP backoff and restart. (TCP/IP involves a rate backoff algorithm called "slow start" for response to congestion. Once it senses rising throughput, the transmit rate and window size are multiplied in a run up to the maximum throughput. Self-synchronized dynamics among disparate groups of TCP/IP sessions can therefore occur following the failure or during the time routing tables are being updated.) The same kind of dynamic hazards can be expected in MPLS-based networks as label edge routers (LERs) get busy (following OSPF-TE resynchronization) with CR-LDP signaling for re-establishment of possibly thousands of LSPs simultaneously through the core network of LSRs. Protocols such as CR-LDP for MPLS (or GMPLS) path establishment were not intended for, nor have they ever been tested in, an environment of mass simultaneous signaling attempts for new path establishment.
The overall result is highly unpredictable transient signaling congestion and capacity seizure and contention dynamics. If failure effects are allowed to even enter this domain we are ripe for "no dial tone" and Internet "brown outs" as switch or router O/S software succumbs to overwhelming real-time processing loads. Such congestion effects are also known to propagate widely in both the telephone network and Internet. Neighboring switches cannot complete calls to the affected destination, blocking calls coming into themselves, and so on. If anything, however, the Internet is even more vulnerable than the circuit switched layer to virtual collapse in these circumstances.
Beyond 30 minutes the outage effects are generally considered so severe that they are reportable to regulatory agencies, and the general societal and business impacts are considered to be of major significance. If communications to or between police, ambulance, medical, flight traffic control, industrial process control or many other such crucial services break down for this long it becomes a matter of health and safety, not just business impact. In the United States any outage affecting 30,000 or more users for over 30 minutes is reportable to the FCC.
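The duration ranges of Table 3-1 can be captured in a small lookup, sketched below. The function and constant names are hypothetical. The table does not say which class a duration exactly on a boundary belongs to (apart from the "< 50 ms" bound), so this sketch assumes each boundary value falls into the longer-duration class.

```python
# Map an outage duration onto the target ranges of Table 3-1.
# Assumption: a duration exactly on a boundary is placed in the upper range.
from bisect import bisect_right

# Upper bounds of each range, in seconds, in ascending order.
RANGE_UPPER_BOUNDS_S = [0.050, 0.200, 2.0, 10.0, 5 * 60.0, 30 * 60.0]

# One more label than bounds: the last class is open-ended (> 30 min).
RANGE_LABELS = [
    "Protection Switching",
    "1",
    "2",
    "3",
    "4",
    "Undesirable",
    "Unacceptable",
]


def classify_outage(duration_s: float) -> str:
    """Return the Table 3-1 target range label for an outage duration in seconds."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    return RANGE_LABELS[bisect_right(RANGE_UPPER_BOUNDS_S, duration_s)]


if __name__ == "__main__":
    for seconds in (0.040, 0.146, 1.0, 2.5, 60.0, 600.0, 3600.0):
        print(f"{seconds:>8.3f} s -> {classify_outage(seconds)}")
```

For example, the 146 ms SS7 switchover timer falls in range 1, while the 2.5 second trunk conditioning threshold falls in range 3, where call dropping begins.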
3.1.4 Is 50 ms Restoration Necessary?
Any newcomer to the field of network survivability will inevitably encounter the "50 ms debate." It is well to be aware that this is a topic that has been already argued without resolution for over a decade and will probably continue. The debate persists because it is not entirely based on technical considerations which could resolve it, but has roots in historical practices and past capabilities and has been a tool of certain marketing strategies.
History of the 50 ms Figure
The 50 ms figure historically originated from the specifications of APS subsystems in early digital transmission systems and was not actually based on any particular service requirement. Early digital transmission systems embodied 1:N APS that required typically about 20 ms for fault detection, 10 ms for signaling, and 10 ms for operation of the tail-end transfer relay, so the specification for APS switching times was reasonably set at 50 ms, allowing a 10 ms margin. Early generations of DS1 channel banks (1970s era) also had a Carrier Group Alarm (CGA) threshold of about 230 ms. The CGA is a time threshold for persistence of any alarm state on the transmission line side (such as loss of signal or frame synch loss) after which all trunk channels would be busied out. The 230 ms CGA threshold reinforced the need for 50 ms APS switches at the DS3 transmission level to allow for worst-case reframe times all the way down the DS3, DS2, DS1 hierarchy with suitable margin against the 230 ms CGA deadline. It has long since been realized, however, that a 230 ms CGA time was far too short. Many minor line interruptions would trigger an associated switching machine into mass call-dropping because of spurious CGA activations. The persistence time before call dropping was raised to 2.5 +/- 0.5 s by ITU recommendations in the 1980s as a result. But the requirement for 50 ms APS switching stayed in place, mainly because this was still technically quite feasible at no extra cost in the design of APS subsystems. The apparent sanctity of 50 ms was further entrenched in the 1990s by vendors who promoted only ring-based transport solutions and found it advantageous to insist on 50 ms as the requirement, effectively precluding distributed mesh restoration alternatives which were under equal consideration at the start of the SONET era. As a marketing strategy the 50 ms issue thus served as the "mesh killer" for the 1990s as more and more traditional telcos bought into this as dogma.
On the other hand, there was also real urgency in the early 1990s to deploy some kind of fast automated restoration method relatively immediately. This led to the quick adoption of ring-based solutions, which had only incremental development requirements over 1+1 APS transmission systems. However, once rings were deployed, the effect was only to further reinforce the cultural assumption of 50 ms as the standard. Thus, as sometimes happens in engineering, what was initially a performance capability in one specific context (APS switching time) evolved into a perceived requirement in all other contexts.
But the "50 ms requirement" is undergoing serious challenges to its validity as a ubiquitous requirement, even being referred to as the "50 ms myth" by data-centric entrants to the field who see little actual need for such fast restoration from an IP services standpoint. Faster restoration is by itself always desirable as a goal, but restoration goals must be carefully set in light of corresponding costs that may be paid in terms of limiting the available choices of network architecture. In practice, insistence on "50 ms" means 1+1 dedicated APS or UPSR rings (to follow) are almost the only choices left for the operator to consider. But if something more like 200 ms is allowed, the entire scope of efficient shared-mesh architectures becomes available. So it is an issue of real importance as to whether there are any services that truly require 50 ms.
Sosnosky's original study found no applications that require 50 ms restoration. However, the 50 ms requirement was still being debated in 2001 when Schallenburg [Schal01], understanding the potential costs involved to his company, undertook a series of experimental trials with varying interruption times and measured various service degradations on voice circuits, SNA, ATM, X.25, SS7, DS1, 56 kb/s data, NTC digital video, SONET OC-12 access services, and OC-48. He tested with controlled-duration outages and found that 200 ms outages would not jeopardize any of these services and that, except for SS7 signaling links, all other services would in fact withstand outages of two to five seconds.
Thus, the supposed requirement for 50 ms restoration seems to be more of a techno-cultural myth than a real requirement. There are quite practical reasons to consider 2 seconds as an alternate goal for network restoration. This avoids the regime of connection and session time-outs and IP/MPLS layer reactions, and gives a green light to the full consideration of far more efficient mesh-based survivable architectures.