
Case Study

We will determine reliability figures for three very basic SAN architectures. The starting point of our study is the network storage requirements.

Network Storage Requirements

We want networked storage that is accessible from one server; later, this storage will be made accessible to other servers. The server is already in place, and has been designed to sustain single-component hardware failures (with dual host bus adapters (HBAs), for example). Data on this storage must be mirrored, and storage access must also withstand hardware failures. The cost of the storage system must be reasonable, while still providing good performance.

Our first temptation might be to decide which components to use: switches, hubs, Sun StorEdge T3 arrays, Sun StorEdge A5x00 arrays, and so on. However, a more prudent approach is to determine the appropriate architecture in terms of its resistance to hardware failures, cost, and performance, leaving the selection of specific components for a later stage.

NOTE

For this case study, the focus is on storage architecture redundancy and reliability; cost and performance issues are not addressed.

Architecture 1

FIGURE 2 Architecture 1 Block Diagram

Architecture 1 provides the basic storage necessities we are looking for with the following advantages and disadvantages:

Advantages:

  • Storage is accessible if one of the links is down.

  • Storage A is mirrored onto B.

  • Other servers can be connected to the concentrator to access the storage.

Disadvantages:

  • If the concentrator fails, we no longer have access to the storage. This concentrator is a single point of failure (SPOF).

Architecture 2

FIGURE 3 Architecture 2 Block Diagram

Architecture 2 has been improved to eliminate the previous SPOF. A second concentrator has been added, and the storage configuration is now redundant. The requirements are satisfied, with the following advantages:

  • If any single link or component goes down, storage is still accessible (resilient to a single hardware failure).

  • Data is mirrored (Disk A <-> Disk B).

  • Other servers can be connected to both concentrators to access the storage space.

Architecture 3

FIGURE 4 Architecture 3 Block Diagram

Architecture 3 seems very close to Architecture 2. The main difference is that Disk A and Disk B each have only one data path. Disk A is still mirrored to Disk B, as required.

This architecture has all the advantages of the previous architectures with the following differences:

  • Disk A can only be accessed through Link C, and Disk B only through Link D.

  • There is no data multipathing software layer, which results in easier administration and easier troubleshooting.

In some sense it seems we are losing a level of redundancy in Architecture 3. To appreciate the differences between Architectures 2 and 3, we will use block diagram analysis to determine and compare their reliability values.

Determining Redundancy

We first take an inventory of the components involved in the three architectures, as shown in the first column of the following table. Next, we analyze the three architectures for redundancy.

Failing Component                      Architecture 1:     Architectures 2 and 3:
(first failure)                        Is the System OK?   Is the System OK?

HBA 1                                  Yes                 Yes
HBA 2                                  Yes                 Yes
Link A                                 Yes                 Yes
Link B                                 Yes                 Yes
Concentrator 1                         No                  Yes
Concentrator 2                         n/a                 Yes
Link C                                 Yes                 Yes
Link D                                 Yes                 Yes
Disk A                                 Yes                 Yes
Disk B                                 Yes                 Yes

Total number of redundant components   8                   10


Consequently, we see that Architectures 2 and 3 satisfy our objectives for redundancy, while Architecture 1 does not.
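The single-failure analysis in the table can also be sketched programmatically. In this hypothetical model (component and path names are illustrative), each architecture is a list of server-to-disk chains, and a component is a SPOF if it appears in every chain; Architecture 2 is omitted for brevity, but it would be modeled the same way with its cross-connected paths:

```python
# Each architecture is modeled as the set of complete server-to-disk paths.
# Because the disks are mirrored, the system is OK as long as at least one
# path survives; a component appearing in EVERY path is a SPOF.
ARCH1 = [  # single concentrator shared by both HBAs
    ("HBA1", "LinkA", "Conc1", "LinkC", "DiskA"),
    ("HBA1", "LinkA", "Conc1", "LinkD", "DiskB"),
    ("HBA2", "LinkB", "Conc1", "LinkC", "DiskA"),
    ("HBA2", "LinkB", "Conc1", "LinkD", "DiskB"),
]
ARCH3 = [  # two independent chains, one per mirror side
    ("HBA1", "LinkA", "Conc1", "LinkC", "DiskA"),
    ("HBA2", "LinkB", "Conc2", "LinkD", "DiskB"),
]

def single_points_of_failure(paths):
    """Return the components whose failure alone breaks every path."""
    components = {c for path in paths for c in path}
    return sorted(c for c in components if all(c in path for path in paths))

print(single_points_of_failure(ARCH1))  # the shared concentrator is a SPOF
print(single_points_of_failure(ARCH3))  # no SPOF
```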

It is possible to obtain an objective difference between Architectures 2 and 3 by studying their respective reliability. We will find that, although both are fully redundant, one is more reliable than the other.

Determining Reliability

Using the reliability formulas discussed earlier, we can determine which architecture has the highest reliability value. For the purpose of this article, we use the sample MTBF values (as published by the manufacturers) and the derived AFR values shown in the table below:

TABLE 1 Component Inventory

Component        AFR Variable   Sample MTBF Value (hours)   AFR
HBA 1            H              800,000                     0.011
HBA 2            H              800,000                     0.011
Link A           L              400,000                     0.022
Link B           L              400,000                     0.022
Concentrator 1   C              580,000                     0.0151
Concentrator 2   C              580,000                     0.0151
Link C           L              400,000                     0.022
Link D           L              400,000                     0.022
Disk A           D              1,000,000                   0.0088
Disk B           D              1,000,000                   0.0088


NOTE

The example MTBF values were taken from real network storage component statistics. However, such values vary greatly, and these numbers are given here purely for illustration.
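The MTBF-to-AFR conversion behind TABLE 1 can be sketched in a few lines of Python (a minimal sketch; the variable names follow the table):

```python
# Convert a component's MTBF (in hours) to its annual failure rate (AFR).
# AFR = hours-per-year / MTBF, a good approximation while AFR is well below 1.
HOURS_PER_YEAR = 8760

def afr(mtbf_hours: float) -> float:
    """Annual failure rate of a component with the given MTBF."""
    return HOURS_PER_YEAR / mtbf_hours

# Sample MTBF values from TABLE 1
H = afr(800_000)    # HBA          ~0.011
L = afr(400_000)    # link         ~0.022
C = afr(580_000)    # concentrator ~0.0151
D = afr(1_000_000)  # disk         ~0.0088

print(f"H={H:.4f}  L={L:.4f}  C={C:.4f}  D={D:.4f}")
```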

Architecture 1

FIGURE 5 Architecture 1 Reliability Block Diagram

Having the failure rate of each individual component, we can obtain the system's annual failure rate AFR1 and, consequently, the system reliability and system MTBF values. Using the block diagram (FIGURE 5), it is easy to identify which components are configured redundantly and which are not. The following formula is derived using the block diagram analysis discussed earlier: the AFR values of redundant components are multiplied together (that is, raised to the power equal to the number of redundant components), the AFR values of non-redundant components are multiplied by the number of such components in series (in this case, the concentrator C is the only non-redundant component, so C × 1 = C), and finally the resulting terms are summed.

The formula for this architecture:

AFR1 = (H + L)² + C + L² + D²

Sample values applied:

AFR1 = (0.011 + 0.022)² + 0.0151 + 0.022² + 0.0088² ≈ 0.0167

Using the AFR value, we determine the annual reliability R1 of the system:

R1 = 1 – AFR1

R1 = 1 – 0.0167 = 0.9833, or 98.33%

Using the AFR value, the following system MTBF value is derived:

System MTBF = 8760/AFR1

System MTBF = 8760 / 0.0167 = 524,551 hours
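The Architecture 1 derivation can be checked with a short Python sketch, using the sample AFR values from TABLE 1 (the results differ slightly from the article's figures because the article rounds AFR1 to 0.0167 before deriving the MTBF):

```python
# Architecture 1: the HBA+link chains are redundant (squared), the single
# concentrator is in series (added as-is), and the disk-side links and
# mirrored disks are each redundant pairs (squared).
H, L, C, D = 0.011, 0.022, 0.0151, 0.0088  # sample AFR values from TABLE 1
HOURS_PER_YEAR = 8760

afr1 = (H + L) ** 2 + C + L ** 2 + D ** 2  # series terms summed, parallel terms squared
r1 = 1 - afr1                              # annual reliability
mtbf1 = HOURS_PER_YEAR / afr1              # system MTBF in hours

print(f"AFR1 = {afr1:.4f}, R1 = {r1:.2%}, MTBF = {mtbf1:,.0f} hours")
```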

Architecture 2

FIGURE 6 Architecture 2 Reliability Block Diagram

This architecture has a different configuration, and the resulting formula is derived using the block diagram analysis.

The formula for this architecture:

AFR2 = (H + L + C + L)² + D²

Sample values applied:

AFR2 = (0.011 + 0.022 + 0.0151 + 0.022)² + 0.0088² ≈ 0.005

Using the AFR, determine the annual reliability R2 of the system:

R2 = 1 – AFR2

R2 = 1 – 0.005 = 0.995, or 99.5%

Using the AFR value, the following system MTBF value is derived:

System MTBF = 8760 / AFR2

System MTBF = 8760 / 0.005 = 1,752,000 hours

Architecture 3

FIGURE 7 Architecture 3 Reliability Block Diagram

Architecture 3 results in yet another block diagram calculation.

The formula for this architecture:

AFR3 = (H + L + C + L + D)²

Sample values applied:

AFR3 = (0.011 + 0.022 + 0.0151 + 0.022 + 0.0088)² ≈ 0.0062

Using the AFR, determine the annual reliability R3 of the system.

The formula:

R3 = 1 – AFR3

Numbers applied:

R3 = 1 – 0.0062 = 0.9938, or 99.38%

Using the AFR value, the following system MTBF value is derived:

System MTBF = 8760 / AFR3

System MTBF = 8760 / 0.0062 = 1,412,903 hours

Conclusion

When the calculations are complete, we compare the data:

Architecture 1: R = 98.33%, system MTBF = 524,551 hours

Architecture 2: R = 99.50%, system MTBF = 1,752,000 hours

Architecture 3: R = 99.38%, system MTBF = 1,412,903 hours

The MTBF figures are the most revealing: they indicate that Architecture 2 is statistically the most reliable of the three.
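The full comparison can be reproduced in a few lines of Python, using the sample AFR values from TABLE 1 (small rounding differences from the article's figures are expected, since the article rounds the AFR values before deriving the MTBFs):

```python
# Compare the three architectures' annual failure rates, reliabilities,
# and system MTBF values using the block-diagram formulas.
H, L, C, D = 0.011, 0.022, 0.0151, 0.0088  # sample AFR values from TABLE 1
HOURS_PER_YEAR = 8760

afrs = {
    "Architecture 1": (H + L) ** 2 + C + L ** 2 + D ** 2,  # concentrator is a SPOF
    "Architecture 2": (H + L + C + L) ** 2 + D ** 2,       # fully redundant, dual path
    "Architecture 3": (H + L + C + L + D) ** 2,            # fully redundant, single path per disk
}

for name, afr in afrs.items():
    reliability = 1 - afr
    mtbf = HOURS_PER_YEAR / afr
    print(f"{name}: R = {reliability:.2%}, MTBF = {mtbf:,.0f} hours")

best = min(afrs, key=afrs.get)  # lowest annual failure rate wins
print("Most reliable:", best)
```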

In conclusion, the case study calculations provide the following points:

  • Only Architectures 2 and 3 are fully redundant; hence, they satisfy the requirement of a redundant configuration that can sustain a single hardware failure.

  • The reliability value for Architecture 1 does not, by itself, reveal the non-redundant aspect of this architecture. It is therefore important to consider both characteristics: redundancy and reliability.

  • Architecture 2 has roughly one-third the annual failure rate of Architecture 1, and its estimated MTBF is 339,097 hours higher than that of Architecture 3.

Finally, weighing the advantages of one solution over another, we must also take other parameters into account, such as:

  • Storage capacity requirements

  • Performance

  • Cost

  • Maintainability (indexed by the MTTR: mean time to repair)

  • Availability (which depends on the MTBF and MTTR)

  • Serviceability

  • Ease of deployment

  • Support

The last point, support, is a critical consideration, because it is through support that a second failure is avoided, by quick troubleshooting and prompt part replacement. One factor not obvious in the calculations is that although Architecture 2 brings more in terms of redundancy, due to the dual path from server to disks, it has the drawback of requiring additional multipathing software. That software adds another layer of complexity, possibly lowering the ease of deployment and serviceability while increasing cost.

Finally, it is worth noting that any storage area network (SAN) implementation must be carefully planned and analyzed before deployment. In addition, a simple SAN design will often be preferable because it is easier to support (troubleshooting and problem resolution). But one must not favor one parameter over the others without knowing the consequences; every aspect of the architecture decision must be considered. This is the only way to increase the reliability of a storage architecture.
