
## 3.13 Network Reliability

The field of "network reliability" is concerned with questions of retaining graph connectivity in networks whose edges have a non-zero probability of being in a failed state. The central issue sounds simple but is in fact quite difficult: to compute exactly the probability that a graph remains connected as a whole, or that a path remains between specific nodes or sets of nodes, in a graph with unreliable edges. Specific measures studied in this field are *{s,t}* or *two-terminal reliability*, *k-terminal reliability*, and *all-terminal reliability*. These are all purely topology-dependent measures of the probability of graph disconnection between pairs or sets of nodes. Rai and Agrawal [RaAg90] provide a complete survey of this field. Here we try only to extract those basic ideas of network reliability that form
part of a grounding for work on transport network survivability and feed into the problem of availability block diagram reduction.

Figure 3-21 illustrates the basic orientation of the network reliability problem. Four equally likely states are drawn for an assumed
*p_{link} = 0.32* (i.e., of the 28 links present, we expect about 9 of them to be down at any one time on average). A solid line is a working link; a dashed line
is a failed link. If we pick nodes 0 and 11 we see that in (a)-(c), despite the failures, there is always still a route between
them. Inspection shows in fact that none of the randomly generated states (a)-(c) contributes any two-terminal unreliability:
there is still at least one topologically feasible route between all node pairs. Equivalently, we can say that none of these
failure combinations has disconnected the graph. Case (d), however, is an equally likely state with a dramatically different
effect. Four of the nine failed links form a cut of the graph across edges (14-19), (14-9), (6-13) and (13-5). The two-terminal
reliability of all node pairs separated by the cut is thus affected by this failure state. This not only illustrates how
abrupt and discontinuous network behavior is in general, but also conveys why numerical enumeration of all link-state combinations,
followed by tests for graph connectivity, is not feasible for this type of problem on a large network.
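Although exhaustive enumeration of all link states is infeasible at scale, the quantity in question is easy to *estimate* by sampling random link states and testing connectivity. The following is a minimal Python sketch of this idea; the ring graph, failure probability, and trial count are illustrative choices, not taken from the text.

```python
import random
from collections import deque

def has_route(edges, s, t):
    """BFS over the surviving links: is there still a route from s to t?"""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False

def mc_two_terminal(edges, s, t, p_fail, trials=20000, seed=1):
    """Estimate two-terminal reliability by sampling random link states."""
    rng = random.Random(seed)
    hits = sum(
        has_route([e for e in edges if rng.random() > p_fail], s, t)
        for _ in range(trials)
    )
    return hits / trials

# 4-node ring: the two 2-hop routes between nodes 0 and 2 are edge-disjoint,
# so the exact two-terminal reliability is 1 - (1 - p^2)^2 = 0.4375 at p = 0.5.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
estimate = mc_two_terminal(ring, 0, 2, p_fail=0.5)
```

Sampling trades the exponential state space for a confidence interval, which is why Monte Carlo methods are a common practical recourse for large networks.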


**Figure 3-21. Network reliability: How likely is it that at least one route exists between nodes? In this example there are 2^{28} link-state
combinations to consider.**

Of course in a real network there may also be outage due to finite-capacity effects in Figure 3-21(a) through (c), but this is not in the scope of the basic "network reliability" problem. Basic network reliability (in the sense of [Colb87], [HaKr95], [Shei91]) presumes that there are no routing or capacity constraints on the network graph. If at least one route exists topologically
between *{s,t}*, then it is assumed the signal (or packet train) will discover and use it. With this limitation understood, however, the field's
methods and concepts provide tools that feed into more encompassing availability analysis.
The problem of most relevance to the availability of a service path through a network is that of *two-terminal* reliability.
The problem of most relevance to the availability of a service path through a network is that of *two-terminal* reliability.

#### 3.13.1 Two-Terminal Network Reliability

Two-terminal reliability is the complement of the probability that every distinct path between *{s,t}* contains at least one failed (or blocked) link. Exact computation of the two-terminal reliability problem is NP-complete
for general graphs even when the link failure probabilities are known. The computational complexity of enumerating
all network states and inspecting them for connectivity between nodes *{s,t}* has led to the use of more computationally efficient bounds. A widely known general form is called the *reliability polynomial*:

$$
R(G,\{s,t\},p) = \sum_{i=0}^{m} N_{i}(G,\{s,t\})\; p^{i}\,(1-p)^{m-i} \qquad (3.31)
$$

where *G = (V, E)* is the network graph, *m = |E|* is the number of edges in the graph, *{s,t}* is a specific terminal pair and *p* is the link *operating* probability.

This form prescribes either exact or approximate (bounding) estimates of *R(·)*, depending on how *N_{i}(·)* is obtained. In its exact form, *N_{i}(·)* is the number of subgraphs of *G* in which there are exactly *(m-i)* failed links but the remaining graph contains a route between nodes *{s,t}*. Of course this just defers the problem of calculating *R(·)* to that of counting or estimating *N_{i}(·)*. Two simple bounds are conceptually evident at this stage. One is to enumerate (for each *i* ∊ 1...*m*) only those *m-i* failed-link combinations that constitute cuts of the graph between *{s,t}*. A cut-finding program can thus enumerate a large number of cuts and their associated weights (in terms of number of edges) for insertion into Equation 3.31. Obviously for *p* ≈ 1 the smallest cuts are the most likely and hence numerically dominant contributors to *R(·)*. Assuming not all of the highest-order cuts are enumerated, the result will be an upper (i.e., optimistic) bound on the exact *R(·)*, i.e.,

$$
R(G,\{s,t\},p) \le 1 - \sum_{i=c}^{m} C_{i}(G,\{s,t\})\,(1-p)^{i}\,p^{m-i} \qquad (3.32)
$$

where *c* is the size of the minimum cut of the graph between *{s,t}* and *C_{i}(·)* is the number of *{s,t}* cutsets found comprising exactly *i* edges. The exact reliability will be lower than this bound because network states involving *i* failures but containing a cutset of fewer than *i* edges are connectivity-failure states that are not counted.
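On a graph small enough to enumerate, the coefficients *N_{i}(·)* of Equation 3.31 can be computed directly by testing every subset of operating edges for *{s,t}* connectivity. A brute-force Python sketch follows; the triangle example and the probability value are illustrative only.

```python
from itertools import combinations

def has_route(edges, s, t):
    """Grow the set of nodes reachable from s over the given edges."""
    reach = {s}
    grew = True
    while grew:
        grew = False
        for u, v in edges:
            if (u in reach) != (v in reach):
                reach |= {u, v}
                grew = True
    return t in reach

def reliability_polynomial(edges, s, t, p):
    """Exact Eq. 3.31: N_i = number of i-edge subgraphs containing
    a route between s and t; each state with i operating edges has
    probability p^i * (1-p)^(m-i)."""
    m = len(edges)
    N = [sum(has_route(sub, s, t) for sub in combinations(edges, i))
         for i in range(m + 1)]
    return sum(N[i] * p**i * (1 - p)**(m - i) for i in range(m + 1))

# Triangle on nodes {0,1,2}: R between 0 and 2 is
# p + (1-p)*p^2 = 0.625 at p = 0.5.
triangle = [(0, 1), (1, 2), (0, 2)]
r = reliability_polynomial(triangle, 0, 2, 0.5)
```

The double loop over subsets is exactly the exponential enumeration the text warns about, which is why such exact evaluation is confined to very small graphs.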

A converse viewpoint for assessing *N_{i}(·)* is from the standpoint of network states that contain at least one working route among the set of all distinct routes between *{s,t}*. (The two viewpoints are conceptually the same as the notion of "cuts and ties" in more advanced analysis of system availability block diagrams.) Here, all of the *k* successively longer distinct (non-looping) routes on the graph between *{s,t}* are generated and each is recorded with its associated length (number of edges in series en route). Then a simple upper (i.e., optimistic) bound on *{s,t}* reliability is:

$$
R(G,\{s,t\},p) \le 1 - \prod_{i=1}^{k}\left(1 - p^{L_{i}}\right) \qquad (3.33)
$$

where *L_{i}* is the length of the *i*^{th} distinct route between *{s,t}*. Figure 3-22(a) portrays the basic notion of *{s,t}* reliability being viewed in Equation 3.33 as the probability that *not every* possible route is blocked, and implicitly treats routes as independent entities. In contrast, Figure 3-22(b) shows how several distinct routes may actually share single link failures in common, illustrating why Equation 3.33 is an optimistic bound.
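Equation 3.33 needs only the lengths of the distinct routes. A small sketch makes the optimism of the independence assumption concrete; as an illustrative input we use the four *{s,t}* routes of a bridge-shaped five-link graph (lengths 2, 2, 3, 3), whose exact two-terminal reliability at *p* = 0.9 is 0.97848.

```python
def route_bound(route_lengths, p):
    """Optimistic upper bound of Eq. 3.33: the probability that not
    every route is blocked, treating routes as independent."""
    all_blocked = 1.0
    for length in route_lengths:
        all_blocked *= 1.0 - p**length  # a route survives only if all its links do
    return 1.0 - all_blocked

# Bridge-shaped graph: two 2-hop routes and two 3-hop routes that
# actually reuse the same five links, so independence overstates R.
bound = route_bound([2, 2, 3, 3], p=0.9)
```

Here the bound evaluates to about 0.9973, noticeably above the exact 0.97848, precisely because the four routes share links and their blocking events are not independent.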

**Figure 3-22. Orientations to the network reliability calculation: (a) failures that together create an (s,t) cut set, (b) failures that
defeat all routes between (s,t).**

More precisely, the route-based formulation depends on the union of the events that each route's edges are all operating, i.e.,

$$
R(G,\{s,t\},p) = \Pr\left\{\bigcup_{i=1}^{k} E_{i}\right\} \qquad (3.34)
$$

where *E_{i}* is the event that every edge of the *i*^{th} route is operating. This calls for application of the inclusion-exclusion principle for the union of non-disjoint sets [GaTa92] (p.90). Denoting as *P_{i} = Pr{E_{i}}* the probability that all links in the *i*^{th} route are operating,

$$
R(G,\{s,t\},p) = \sum_{i} P_{i} \;-\; \sum_{i<j} \Pr\{E_{i}\cap E_{j}\} \;+\; \sum_{i<j<l} \Pr\{E_{i}\cap E_{j}\cap E_{l}\} \;-\;\cdots
$$

In [Shei91] the application of the inclusion-exclusion principle for probability union is treated further, showing that there are always certain cancellation effects between terms of the inclusion-exclusion series that give further insights (the concept of irrelevant edges) and that can be exploited to simplify the expansion process.
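The expansion can be carried out mechanically from the routes' edge sets: with independent links of operating probability *p*, the probability that several routes are simultaneously operating is *p* raised to the size of the union of their edges. A sketch (bridge-graph routes as illustrative input; the term count grows as 2^k, so this is only for small route sets):

```python
from itertools import combinations

def union_reliability(routes, p):
    """Exact Pr{at least one route fully operating} by inclusion-exclusion.
    routes: list of sets of edge ids. With independent links,
    Pr{E_i1 & ... & E_ij} = p ** |union of their edge sets|."""
    total = 0.0
    for j in range(1, len(routes) + 1):
        sign = (-1) ** (j + 1)          # alternate +, -, +, ...
        for combo in combinations(routes, j):
            joint_edges = set().union(*combo)
            total += sign * p ** len(joint_edges)
    return total

# Bridge graph with edges 0..4 (edge 4 is the diagonal); the four
# (s,t) routes. The expansion reproduces the known closed form
# 2p^2 + 2p^3 - 5p^4 + 2p^5, i.e. 0.97848 at p = 0.9.
routes = [{0, 2}, {1, 3}, {0, 4, 3}, {1, 4, 2}]
r = union_reliability(routes, 0.9)
```

Because shared edges appear only once in each union, the dependence between routes is handled exactly, unlike in the independence bound of Equation 3.33.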

#### 3.13.2 Factoring the Graph and Conditional Decomposition

Let us now return to the problem of calculating system availability in cases where basic series and parallel relationships
do not completely reduce the model. This is where the link to network reliability arises. If a network is completely reducible
between nodes *{s,t}*, by repeated application of simple reductions, into a single equivalent link, the network is said to be *two-terminal series-parallel*. In such a case the resultant single reduced edge probability is *R(G, {s,t}, p)*. But many realistic cases are not two-terminal series-parallel because some edge cross-couples the remaining
relationships in a way that halts further application of the series-parallel reductions. In the approach that follows, which
is also based in network reliability, such an edge is used as a kind of pivot point on which the problem is split into two
conditional probability sub-problems: one applying when the particular edge is assumed available, the other when it is assumed
to be down.

Figure 3-23 summarizes the basic series-parallel reduction rules in a canonical form on the edge probabilities (probabilities of the
link being up, equivalent to the elemental availability). Cases (a) and (b) are the previous basic parallel and series relationships,
to which case (c), called a "two-neighbor reduction," is added. When applied to either a network graph or an availability
block diagram, these transformations are exact or, in the language of network reliability, they are "reliability preserving."
To use these reductions, element failures must be statistically independent, and in cases (b) and (c) of Figure 3-23 node *b* must have no other arcs incident upon it. Node *b* also cannot be either the source or the target. While single arcs are shown, the rules apply to any block that is similarly reducible
to a single probability expression, so that, for instance, *p_{1}* in Figure 3-23(a) may already be the result of a prior set of series-parallel reductions.

**Figure 3-23. Reliability-preserving graph reduction rules: (a) parallel, (b) series, (c) two-neighbor reduction.**
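The parallel and series rules of Figure 3-23(a) and (b) translate directly into arithmetic on edge probabilities. A minimal sketch (the numeric values are illustrative, and the two-neighbor rule of case (c) is omitted here):

```python
def parallel(p1, p2):
    """Fig. 3-23(a): two parallel links fail only if both fail."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

def series(p1, p2):
    """Fig. 3-23(b): two links in series through a degree-2 node
    work only if both work."""
    return p1 * p2

# Two 2-hop branches in parallel, each link with availability 0.9:
# reduce each branch in series, then combine in parallel:
# 1 - (1 - 0.81)^2 = 0.9639.
a = parallel(series(0.9, 0.9), series(0.9, 0.9))
```

Nesting the two functions mirrors the repeated application of the reduction rules: each call collapses a sub-block into a single equivalent edge probability.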

In general, application of the series-parallel reduction rules will be exhausted before the original network is completely
reduced. This usually manifests itself through some edge that cross-couples between remaining subgraphs, i.e., one or
more nodes will be like node *b* in Figure 3-23(b) but with more than just two incident arcs, so that another application of a series reduction is not possible. At this
stage the graph can be "factored" to continue the reductions. Graph factoring is based on Moscowitz's *pivotal decomposition formula* [Shei91] (p.10). The key idea is that:

$$
R(G,\{s,t\},p) = p \cdot R(G|e,\{s,t\},p) \;+\; (1-p)\cdot R(G-e,\{s,t\},p)
$$

where *p* is the probability that edge *e* is available, *G|e* means graph "*G* given *e*", and *G-e* is graph *G* without edge *e*. Thus the whole is considered as the conditional probability decomposition over the two states that the confounding edge *e* may be in, with probability *p* and *(1-p)* respectively. *G|e* is represented by graph *G* with edge *e* contracted or "short-circuited." The probability-weighted sum of the two conditional terms is the two-terminal graph reliability. In practice the idea is to recognize a key edge *e* that will decouple the two resulting conditional subgraphs in a way that allows another round of series-parallel reductions.
A complete graph may thus be decomposed through a series of series-parallel reductions, a split into two conditional subgraphs,
series-parallel reductions on each, further splits within those as needed, and so on. The real computational advantage of the
decomposition steps is to overcome the situations where no further series-parallel reductions are possible. Were it not for
this use of decomposition to link between subproblems that are further series-parallel reducible, factoring would be of little practical
value, because by itself it is equivalent to state-space enumeration by building a binary tree of all two-state edge combinations.
More detailed treatments can be found in [Colb87] (p.77) and [HaKr95].
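The pivotal decomposition can be applied recursively all the way down, contracting the pivot edge in one branch and deleting it in the other. Without the series-parallel shortcuts this is indeed equivalent to state enumeration, as noted above, but it shows the mechanics plainly. A sketch assuming a uniform link operating probability *p* (graph and value illustrative):

```python
def two_terminal(edges, s, t, p):
    """Exact two-terminal reliability by pivotal (factoring) decomposition:
    R(G) = p * R(G|e) + (1-p) * R(G-e)."""
    if s == t:
        return 1.0            # terminals merged by contractions: connected
    if not edges:
        return 0.0            # no links left and s != t: disconnected
    (u, v), rest = edges[0], list(edges[1:])
    if u == v:                # self-loop created by a contraction: irrelevant
        return two_terminal(rest, s, t, p)
    # Edge up (prob p): contract e by merging node v into node u everywhere.
    merged = [(u if a == v else a, u if b == v else b) for a, b in rest]
    r_up = two_terminal(merged, u if s == v else s, u if t == v else t, p)
    # Edge down (prob 1-p): simply delete e.
    r_down = two_terminal(rest, s, t, p)
    return p * r_up + (1.0 - p) * r_down

# Bridge graph with diagonal edge (1,2): the known closed form is
# 2p^2 + 2p^3 - 5p^4 + 2p^5 = 0.97848 at p = 0.9.
bridge = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
r = two_terminal(bridge, 0, 3, 0.9)
```

In a practical implementation, each recursive branch would first be squeezed with series-parallel reductions and factoring invoked only on a confounding edge, which is where the computational saving over pure enumeration comes from.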

To illustrate the application to availability problems, consider the availability block diagram in Figure 3-24(a). Because of the "diagonal" element, it is not amenable to series-parallel reduction. We do, however, obtain two subgraphs that are each easily analyzed if we presuppose the state of the diagonal element. In (b) we presume it has failed; in (c) we presume it is working. The resulting subgraphs thus give conditional probability estimates of the system availability. To get the overall availability we weight the result for each subgraph by the probability of the decomposed element state that leads to that subgraph. Therefore, for this example:

$$
A_{system} = A_{diag}\cdot A_{(c)} \;+\; (1-A_{diag})\cdot A_{(b)}
$$

where *A_{diag}* is the availability of the diagonal element and *A_{(b)}*, *A_{(c)}* are the series-parallel availabilities of the conditional subgraphs in Figure 3-24(b) and (c).

**Figure 3-24. Example illustrating conditional decomposition of an availability block diagram.**
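If the block diagram of Figure 3-24 is taken to be the classic "bridge" arrangement (two parallel branches cross-coupled by the diagonal element; an assumption here, since the figure itself is not reproduced), the weighted sum can be written out directly, with each conditional subgraph reduced by the series-parallel rules:

```python
def bridge_availability(a1, a2, a3, a4, a_diag):
    """Conditional decomposition on the diagonal element of a bridge-style
    availability block diagram (a1..a4 are branch element availabilities)."""
    par = lambda x, y: 1.0 - (1.0 - x) * (1.0 - y)
    a_given_up = par(a1, a2) * par(a3, a4)    # diagonal working: Fig. 3-24(c)
    a_given_down = par(a1 * a3, a2 * a4)      # diagonal failed: Fig. 3-24(b)
    return a_diag * a_given_up + (1.0 - a_diag) * a_given_down

# All five elements at availability 0.9:
# 0.9 * 0.9801 + 0.1 * 0.9639 = 0.97848.
a_sys = bridge_availability(0.9, 0.9, 0.9, 0.9, 0.9)
```

Note that with equal element availabilities this reproduces the bridge-network two-terminal reliability, confirming that the conditional decomposition and the factoring formula are the same calculation viewed from the availability side.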