Network Protocols

A network protocol is a set of agreements and specifications for sending data over a network. Many network protocols are in use today. This section gives a quick overview of the most common ones so that you can make informed infrastructure decisions.

TCP/IP

Transmission Control Protocol/Internet Protocol (TCP/IP) is the most widely used network protocol in industry:

TCP/IP protocol stacks exist for all operating systems currently in use.
It is an extremely robust and reliable protocol.
It is routable, which means that it can be sent between disparate networks.
It is available for free with all operating systems. Even NetWare, the home of the once very popular IPX/SPX protocol, offers TCP/IP in addition to its proprietary protocol.
An extremely robust suite of security functionality has been developed for communication via TCP/IP.

Internet communication by and large uses TCP/IP. Several other network protocols can be and are used over the Internet; they are discussed later in this section. However, the Internet communication protocols HTTP, HTTPS, SMTP, and FTP all use IP. Web services use TCP/IP. Messaging protocols use TCP/IP. All remote object communication, such as DCOM, Java RMI, and CORBA, can use TCP/IP.

Subnets

Subnets are segments of a network whose computers can communicate directly with each other. A computer on a subnet can communicate directly with all the other computers on that subnet. Networks are split into subnets by the subnet mask part of the TCP/IP properties. Subnets simplify network administration and can be used to provide security. Computers on different subnets require routers to communicate with each other.

TCP/IP is a suite of protocols built on a four-layer model:

The first layer is the network interface.
These are the LAN technologies, such as Ethernet, Token Ring, and FDDI, or the WAN technologies, such as Frame Relay, serial lines, or ATM. This layer puts frames onto and gets them off the wire.
The second layer is the IP protocol. This layer is tasked with encapsulating data into Internet datagrams. It also runs all the algorithms for routing packets between networks. Sub-protocols to IP are functions for Internet address mapping, communication between hosts, and multicasting support.
Above the IP layer is the transport layer. This layer provides the actual communication. Two transport protocols are in the TCP/IP suite: TCP and User Datagram Protocol (UDP). These will be explained in greater detail in the next section.
The topmost layer is the application layer. This is where any application that interacts with a network accesses the TCP/IP stack. Protocols such as HTTP and FTP reside in the application layer.

Under the covers, all applications that use TCP/IP use the sockets interface. Sockets can be thought of as the endpoints of the data pipe that connects applications. Sockets have been built into all flavors of UNIX and have been a part of Windows operating systems since Windows 3.1 (see Figure 1-2).

Figure 1-2. TCP stack.

The transport protocol is the actual mechanism of data delivery between applications. TCP is the most widely used of the transport protocols in the TCP/IP suite. It can be described as follows:

TCP sends data as packets. The data are split into packets (TCP calls them segments), a header is attached to each packet, and it is sent off. This process is repeated until all the information has been put on the wire.
TCP is a connection-oriented protocol in that a server requires the client to connect to it before it can transmit information.
TCP attempts to offer guaranteed delivery. When a packet is sent, the sender keeps a copy. The sender then waits for an acknowledgement of receipt by the recipient machine.
If that acknowledgement isn't received in a reasonable time, the packet is resent. After a certain number of attempts to transmit a packet, the sender will give up and report an error.
TCP provides a means to properly sequence the packets once they have all arrived.
TCP provides a simple checksum feature to give a basic level of validation to the packet header and the data.
TCP guarantees that packets will be delivered to the receiving application in the order in which they were sent.

UDP is used much less than TCP, yet it has an important place in IP communication because of the following:

UDP can broadcast information. Servers do not require a connection to send data over the network via UDP. A UDP server can sit and broadcast information such as the date and time without regard to whether or not anyone is listening.
UDP does not have the built-in facilities to recover from failure that TCP has. If a problem, such as a network failure, prevents a particular application from receiving the datagram, the sending application will never know that. If reliable delivery is necessary, TCP should be used or the sending application will have to provide mechanisms to overcome the inherently unreliable nature of UDP.
UDP is faster and requires less overhead than TCP.

Typically, UDP is used for small data packets and TCP for large data streams.

Other Protocols

ATM is a high-speed network technology like Ethernet. All data streamed over ATM are broken down into 53-byte cells. The constant size of the cells simplifies switching issues and can provide higher transmission capabilities than 10 or even 100 Mbps Ethernet.

TCP/IP establishes a connection between sender and recipient machines, but it doesn't establish a circuit, a fixed set of machines through which all the traffic will go for the duration of the session. TCP/IP allows network conditions to modify the actual physical route that packets will take during a session. This makes TCP/IP robust and self-healing in the face of network problems.
In the case of video or voice transmission, it is best to have a connection between the communicating parties, and ATM can provide that. TCP/IP can be sent over ATM.

Another protocol that has emerged is Multiprotocol Label Switching (MPLS). MPLS competes with ATM in that it allows the establishment of labeled paths between the sender and the receiver. This ability to establish paths has made MPLS of interest to the creators of virtual private networks. Unlike ATM, it is relatively easy with MPLS to establish paths across multiple layer 2 transports like Ethernet and FDDI. It also outperforms ATM and offers some very useful path control mechanisms. Like ATM, MPLS uses IP to send data over the Internet. For further information, see the MPLS FAQ at www.mplsrc.com/mplsfaq.shtml.

The emergence of Gigabit Ethernet implementations has provided enough raw bandwidth to allow TCP/IP over Ethernet to compete with ATM for video and other bandwidth-intensive applications.

The big question to consider about TCP/IP is whether to utilize IPv4, which dates back to the early 1980s, or to move toward IPv6. IPv6 has many features of importance to the infrastructure of an enterprise:

Virtually unlimited address space.
A different addressing scheme that allows individual addresses for every device in an enterprise everywhere in the world.
A fixed header length and improved header format that improves the efficiency of routing.
Flow labeling of packets. (Labeling packets allows routers to sort packets into streams, making TCP/IPv6 much more capable than TCP/IPv4 of handling stream-oriented traffic such as VoIP or video streams.)
Improved security and privacy. (IPv6 was specified with IPsec as part of the protocol suite, so the building blocks for secure communication are expected on every node, although using them still has to be configured. This can be compelling in some circumstances. The increased security provided by IPv6 will go a long way toward providing a Secure Cyberspace.)
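To put "virtually unlimited" in perspective, the following minimal Python sketch uses the standard library ipaddress module to compare the two address spaces. The 2001:db8:1::/64 prefix is an assumption chosen for illustration from the range reserved for documentation; it does not come from the text.

```python
import ipaddress

# Total number of addresses in each protocol's address space.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses  # 2**32
ipv6_total = ipaddress.ip_network("::/0").num_addresses       # 2**128

print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses: {ipv6_total:.3e}")

# A single /64 subnet (illustrative documentation prefix, an assumption).
site_subnet = ipaddress.ip_network("2001:db8:1::/64")

# How many complete IPv4 address spaces fit inside that one subnet.
ipv4_spaces_per_subnet = site_subnet.num_addresses // ipv4_total
print(f"IPv4-sized spaces in one /64 subnet: {ipv4_spaces_per_subnet:,}")
```

A single /64 subnet of this kind already holds 2**64 addresses, which is why per-device and even per-component addressing becomes practical.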
Secure Cyberspace

The Internet was designed with a highly distributed architecture to enable it to withstand such catastrophic events as a nuclear war. The infrastructure, the net part, is secure from attack. However, the structure of the Internet provides no security for a single node, router, or computer connected to the Internet.

Unprotected computers are the key to one of the greatest current threats to the Internet: denial-of-service attacks. The others are worms and email viruses that eat up bandwidth.

The U.S. government has outlined a national strategy to protect the Internet. For details, see the draft paper entitled "National Strategy to Secure Cyberspace" at www.whitehouse.gov/pcipb.

From a hardware and operating system point of view, converting to v6 can be very close to cost-free. This is based on the facts that all the major operating systems have v6 IP stacks built into them and that adding v6 support to devices such as routers only requires a software update.

The exhaustion of the IP address space is another issue of concern. The current IPv4 address space is facing exhaustion. Innovations such as network address translation (NAT) have postponed the day when the last address is assigned, buying time for an orderly transition from IPv4 to v6.

Network Address Translation

Network address translation is a technology that allows all computers inside a business to use one of the sets of IP addresses that are private and cannot be routed to the Internet. Since these addresses cannot be seen on the Internet, an address conflict is not possible. The machine that is acting as the network address translator is connected to the Internet with a normal, routable IP address and acts as a router to allow computers using the private IP addresses to connect to and use the Internet.

IPv6 contains numerous incremental improvements in connecting devices to the Internet.
The enormous expansion of the address space and the addressing changes that make it possible for every device on the planet to be connected to the Internet are truly revolutionary for most enterprises. For Canaxia, not only can every machine in every one of its plants be connected to a central location via the Internet, but every sub-component of that machine can have its own connection. All sections of all warehouses can be managed via the Internet. With a secure Internet connection, dealers selling Canaxia cars can connect their sales floors, their parts departments, and their service departments to a central Canaxia management facility.

The conversion of an existing enterprise to IPv6 is not going to be cost-free. Older versions of operating systems may not have an IPv6 protocol stack available, perhaps necessitating their replacement. In a large corporation, ascertaining that all existing applications will work seamlessly with v6 will probably be a substantial undertaking. In theory, IPv4 can coexist with v6. In practice, if you have to mix IPv4 with v6, you will have to prove that there are no coexistence problems.

We recommend the following be used as a template when contemplating conversion to IPv6:

If you are a small or medium-sized firm without any international presence, wait until it is absolutely mandatory. In any case, you will be able to get by with just implementing v6 on the edge of the network, where you interface with the Internet.
If you are a multinational corporation or a firm that has a large number of devices that need Internet addresses, you should formulate a v6 conversion plan. Start at the edge and always look for application incompatibilities. Of course, you should have exhausted options such as using nonroutable IP addresses along with NAT.
If you do decide to convert, start at the edge and try to practice just-in-time conversion techniques.
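Before concluding that a site has exhausted its options, it can help to inventory which hosts already sit on nonroutable addresses behind NAT. The sketch below shows one way to do that in Python. It assumes the three private IPv4 blocks defined in RFC 1918; the function name and the sample host list are hypothetical and serve only as illustration.

```python
import ipaddress

# Private (nonroutable) IPv4 blocks as defined in RFC 1918.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_nonroutable(address):
    """Return True if the IPv4 address lies in one of the private blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in PRIVATE_BLOCKS)


# Hypothetical inventory of host addresses, for illustration only.
hosts = ["10.1.2.3", "172.16.40.7", "192.168.0.15", "8.8.8.8"]

for host in hosts:
    kind = "private, reaches the Internet through NAT" if is_nonroutable(host) else "public, routable"
    print(f"{host}: {kind}")
```

Hosts reported as public each consume a routable address, so they are the ones to examine first when addresses run short.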
Because of its size and international reach, Canaxia has formulated a strategy to convert to IPv6. The effort will be spaced over a decade. It will start at the edge, where Canaxia interfaces with the Internet and with its WAN. New sites will be v6 from the very start. Existing sites will be slowly migrated, with this effort not due to begin for another five years. As applications are upgraded, replaced, or purchased, compatibility with IPv6 will be a requirement, as much as is economically feasible. Canaxia's architecture team is confident no crisis will occur anytime in the next 10 years due to lack of available Internet addresses.

Systems Architecture and Business Intelligence

If your enterprise runs on top of a distributed and heterogeneous infrastructure, the health of the infrastructure can be vital to the success of your business. In that case, the architect will need to provide visibility into the status of the enterprise systems. The ability to provide near real-time data on system performance to business customers is critical. Following is an example:

The Web server was up 99 percent of the time, which means it was down 7.3 hours last month. What happened to sales or customer satisfaction during those hours? If the customers didn't seem to care, should we spend the money to go to 99.9 percent uptime?
The Web server was up, but what was the average time to deliver a page? What did customer satisfaction look like when the page delivery time was the slowest?
How about the network? Part of the system is still on 10 Mbps wiring. How often is it afflicted with packet storms? With what other variables do those storm times correlate?
Part of the service staff is experimenting with wireless devices to gather data on problem systems. What was their productivity last week?

In terms of quality attributes, the following figures for uptime relate to availability:

99 percent uptime per year equates to 3.65 days of downtime.
99.9 percent uptime equates to 0.365 days or 8.76 hours down per year.
99.99 percent uptime means that the system is down no more than about 53 minutes in any given year.

It should be possible for a manager in marketing, after receiving an angry complaint from an important client, to drill into the data stream from systems underpinning the enterprise and see if there is a correlation between the time it took to display the Web pages the customer needed to work with and the substance of his or her complaints. Perhaps several of the times that the Web server was down coincide with times when this customer was attempting to do business. Without data from all the possible trouble spots, it will be impossible to pin down the real source of the problem.

In most organizations, the tools to monitor the health of networks, databases, and the Web infrastructure are not plugged into the enterprise's overall BI system. They are usually stand-alone applications that either provide snapshots into the health of a particular system or dump everything into a log that is looked at on an irregular basis. Often they are little homegrown scripts or applications thrown together to solve a particular problem, or they are inherited from the past when the systems were much simpler and much less interdependent.

Start with the most important performance indicators for your system. Perhaps they are page hits per second or percent bandwidth utilization. Integrate this data into the BI system first. Perhaps your segment of the enterprise infrastructure architecture aligns with one of the company's business units. Fit the performance metrics to the unit's output and present them to the decision makers for your unit.

The planning stage for a new system is an excellent time to build performance metrics measurement into its architecture. New systems often are conceived using an optimistic set of assumptions regarding such metrics as performance and system uptime.
When that new system comes online, you will be grateful for the ability to generate the data to quantify how accurate the initial assumptions were. If problems crop up when the system becomes operational, you will have at your fingertips the data necessary to identify where the problem lies.

Service Level Agreements

Service level agreements (SLAs) are a formalization of the quality attributes of availability and performance that have been externalized in a document. As infrastructure and applications become more vital to businesses, those businesses are demanding that the providers of those assets guarantee, in writing, the levels of performance and stability that the business requires. SLAs are a manifestation of how important certain IT functionality has become to modern business processes.

The crucial part of dealing with an SLA is to think carefully about the metrics required to support it. If you are required to provide a page response time under 3 seconds, you will have to measure page response times, of course. But what happens when response times deteriorate and you can no longer meet your SLA? At that point you had better have gathered the data necessary to figure out why the problem occurred. Has the overall usage of the site increased to the point where the existing Web server architecture is overloaded? What does the memory usage on the server look like? For example, Java garbage collection is a very expensive operation. To reduce garbage collection on a site running Java programs, examine the Java code in the applications. If garbage collection is bogging down your system, the number of temporary objects created should be reduced.

The bottom line is this: When planning the systems architecture, you will have to think beyond the metric in the agreement to measuring all the variables that impact that metric.
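The availability and response-time arithmetic behind an SLA can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a monitoring tool: the sample response times are invented, the choice of the 95th percentile as the measure compared against the 3-second limit is an assumption (the agreement itself would have to say which statistic applies), and all function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_hours_per_year(uptime_percent):
    """Hours of downtime per year allowed by a given uptime percentage."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def meets_response_sla(samples, limit_seconds=3.0, pct=95):
    """True if the chosen percentile of response times is within the limit."""
    return percentile(samples, pct) <= limit_seconds


# Hypothetical page response times in seconds, for illustration only.
response_times = [0.8, 1.1, 1.2, 1.4, 1.5, 1.9, 2.2, 2.4, 2.8, 3.6]

for uptime in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(uptime)
    print(f"{uptime}% uptime allows {hours:.2f} hours of downtime per year")

print("95th percentile response time:", percentile(response_times, 95))
print("Response-time SLA met:", meets_response_sla(response_times))
```

Recording the supporting variables (server memory, request volume, garbage collection time) alongside each response-time sample is what makes it possible to explain a missed target rather than merely report it.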
We suggest that the introduction of SLAs in your organization be looked upon as an opportunity rather than a threat and as an excellent tool to force the organization to update, rationalize, and above all integrate its infrastructure measurements. If a new project does not have an SLA, you might want to write up a private SLA of your own. The discipline will pay off in the future.

Systems Architecture and Storage

For the vast majority of businesses, data storage costs are increasing exponentially. The costs lie both in the physical devices and in the personnel needed to manage, upgrade, and enhance the storage devices.

One problem with storage costs is that the real cost to the enterprise is hidden in hundreds or thousands of places. Most businesses upgrade storage a machine and a hard drive at a time. This results in scores or hundreds of small, low-profile purchases that never show up as line items on the IT department's budget (see Figure 1-3).

Figure 1-3. The three storage architectures.

The important point is not that you adopt a particular architecture but that you understand the benefits and trade-offs associated with the different architectures.

The conventional approach is to attach the storage devices, usually hard drives, directly to the machine. This is direct attached storage (DAS) and is an excellent choice for situations in which tight security must be maintained over the data in the storage device. Its downside is that it is expensive to upgrade and maintain. Upgrade costs arise more from the costs related to having personnel move applications and data from older, smaller storage devices to the newer, bigger device than from the costs related to the storage device itself. In addition, this storage mechanism makes it difficult to establish exactly what the storage costs of the organization are and to manage those costs.
Storage area networks (SANs) offer basically unlimited storage that is centrally located and maintained. Using extremely high-speed data transfer methods such as Fibre Channel, it is possible to remove the storage media from direct connection to the computer's backplane without affecting performance. SANs offer economical enterprise-level storage that is easy to manage and to grow, but implementing a SAN is a large operation that requires thorough architectural analysis and substantial up-front costs. However, the ROI for most enterprise SAN installations is very compelling. SANs should not be considered unless it is possible to establish the very high-speed data connection that is required. If the machines to be connected are geographically remote, a different storage architecture is required.

Network attached storage (NAS) devices are low-cost, very-low-maintenance devices that provide storage and do nothing else. You plug a NAS into the network, turn it on, and you have storage. Setup and maintenance costs are very low; the main cost is relocating the existing, attached storage to the NAS. These devices are perfect for increasing storage to geographically distributed machines.

The major determinant for choosing between NAS and SAN should be the data storage architectural design. The following will help you choose between the two:

Disaster recovery strongly favors SAN because of the ease with which it can be distributed over large distances. As of this writing, fiber can extend up to 150 kilometers (about 93 miles) without amplification. This means that data can be mirrored transparently between devices that are 300 kilometers (about 186 miles) apart.
Distributed performance is better with SANs.
Very large database applications and application server duties favor SANs.
File storage definitely favors NASs.
Ease of administration is a tie. In their own ways, both storage solutions are easy to administer.
With a NAS, you plug it into the wall and into the network and you have storage. The work lies in partitioning the storage, moving data onto the device, and pointing users to the device. SANs require quite a bit of up-front design and setup, but once online, adding storage is very simple and more transparent than with a NAS.
High availability is pretty much a tie. The SAN can mirror data for recovery, and the NASs all have hot-swappable hard drives in a RAID 5 configuration.
Cost favors NASs. Initial costs for a NAS storage installation are about one-fifth the cost of a similar-sized SAN storage installation. However, the ongoing costs of a SAN are usually less than for a similar NAS. This is due to the following factors:

Staffing costs are less due to the centralization of the storage devices. This makes the task of adding storage much quicker and easier.
Hardware costs are less. Less is spent on storage devices. Less disk space is needed with a SAN because it uses storage more efficiently. Normally, SAN storage will average 75 to 85 percent utilization, while utilization for some NAS storage devices will be 10 to 15 percent.
LAN infrastructure spending is less. NASs are accessed over the network, and backups are usually conducted over the network. This increase in network traffic can force costly network upgrades.

Canaxia has a SAN that hosts the company's mission-critical databases and data. Portions of that data are mirrored to data centers physically removed from the operation center itself. In addition, each company campus has a central file-storage facility that consists of large numbers of rack-mounted NAS arrays. Most application servers are connected to the SAN. Storage requirements continue to grow at a steady rate, but the cost of administering this storage has actually dropped as the data have been centralized onto the SAN and various NASs.

Most of you will use a mix of the three architectures.
It is important that this mix be planned by your architects and be easy for managers to cost and to justify.

While disaster recovery planning is not about providing backup of a company's data, the storage architecture chosen can make the task of protecting and recovering the enterprise data storage materially cheaper and easier. It will take significant effort to make direct attached storage (DAS) as robust in the face of disaster as distributed storage architecture, such as a SAN or a NAS with remote mirroring. Quick (a few hours) recovery is possible with a SAN architecture. Disaster recovery is discussed in greater detail later in this chapter.

Systems Architecture Aspects of Security

Ensuring enterprise security is a wide-ranging operation that touches on almost every area of a business. As such, it has to grow out of extensive interactions between the company's management and the architecture team. One of the issues that has to be on the table first is how to build a security architecture that increases a company's competitive advantage. Since security is an enterprise-wide endeavor, the input of the entire architecture team is required. It is important to adopt a graduated approach and apply the proper security levels at the proper spots in the enterprise.

Effective enterprise security consists of the following:

Effective, well thought out, clearly communicated security policies.
Effective, consistent implementation of these policies by a company staff that is motivated to protect the business's security.
A systems architecture that has the proper security considerations built in at every level.

When discussing enterprise security, one of the first matters to discuss is what needs to be protected and what level of protection is appropriate for that particular item. A useful rule of thumb can be borrowed from inventory management. Proper inventory management divides articles into three levels: A, B, and C.
The A level items are the most valuable, and extensive security provisions are appropriate for them. Trade secrets, credit card numbers, and update and delete access to financial systems are just a few of the items that would be on an A list. B items need to be protected, but security considerations definitely need to be balanced against operational needs. C items require little or no security protection. As a rule of thumb, 5 percent of the list should be A items, 10 to 15 percent should be B items, and the rest should be C items.

A similar analysis is appropriate for the security aspects of an enterprise's systems architecture. The data assets are located on machines that are part of its systems architecture. In addition, elements of the systems architecture, such as the network, will most often be used in attacks upon the business's vital data resources. The following is a list of the major classes of security attacks, in rough order of frequency, that must be considered when building the security part of the systems architecture:

Viruses and worms
Attacks by dishonest or malicious employees
Destruction or compromise of the important data resources due to employee negligence or ignorance
Attacks from the outside by hackers
Denial-of-service attacks

Viruses and worms are the most common IT security problems. The vast majority of the viruses and worms that have appeared in the last few years do not actually damage data that are resident on the computer that has the virus. However, in the process of replicating themselves and sending a copy to everyone on a corporate mailing list, viruses and worms consume a large amount of network bandwidth and usually cause a noticeable impact on productivity. For the last couple of years, the most common viruses have been email viruses. They exploit extremely well-known psychological security vulnerabilities in a business's employees.
In addition, some person or group of persons in corporate infrastructure support will be required to monitor computer news sites daily for the appearance of new email viruses and to immediately get and apply the proper antivirus updates to, hopefully, prevent the business from becoming infected by this new email virus. Unfortunately, the current state of antivirus technology is such that defenses for viruses can only be created after the virus appears.

The majority of attacks against corporate data and resources are perpetrated by employees. To protect your organization against inside attack, utilize the following:

User groups and effective group-level permissions
Effective password policies to protect access to system resources
Thorough, regular security audits
Effective security logging

Assigning users to the proper group and designing security policies that restrict access to only the data and resources the group requires to perform its functions is the first line of defense against attacks by insiders. Don't forget to remove the accounts of users who have left the firm.

For those enterprises that still rely on passwords, another area for effective security intervention at the systems architecture level is effective password policies. In addition to mandating the use of effective passwords, make sure that no resources, such as databases, are left with default or well-known passwords in place. All passwords and user IDs that are used by applications to access systems resources should be kept in encrypted form, not as plain text in the source code. Passwords are the cheapest method of authentication and can be effective if strong passwords are assigned to users, the passwords are changed on a regular basis, and users don't write down their passwords. Users can reasonably be expected to remember one strong password if they use that password on at least a daily basis.
Since a multitude of applications are password-protected in most enterprise environments, we recommend that enterprises standardize on the Windows operating system logon user name and password as the standard for authentication and require that every other application use Windows authentication as its authentication. This can be via the application making a call to the Windows security service or by the application accepting a Kerberos ticket from the operating system.

Unfortunately, the vast majority of enterprises have a multitude of applications that require a user name and password. As a result, most users have five, ten, even fifteen passwords. Some passwords will be used several times a day; some will be used a couple of times a year. Often the corporation cannot even standardize on a single user name, so the user will have several of those, too. In a situation such as this, most users will maintain written user name and password lists. The justification for this situation is always cost: The claim is made that it would cost too much to replace all the current password-protected programs with ones that could get authentication from the Windows operating system. When faced with this argument, it is useful to document the time and money spent on resetting passwords and the productivity lost by users who are locked out of an application that they need to do their job. We expect you will discover costs in the range of $10 to $20 per employee per year. Those figures, coupled with the knowledge of the damage that could occur if one of the password lists fell into the wrong hands, might get you the resources necessary to institute a single sign-on solution.

Devices such as smart card readers and biometric scanners can authenticate users with a physical token or with fingerprints and retina scans. These devices allow extremely strong authentication methods, such as changing of the password daily or even changing the user's password right after they have logged on.
The cost of equipping an entire enterprise with such devices is considerable, and they only get you into the operating system. If the user has ten password-protected programs to deal with once he or she is logged in, the device has not bought you much. Protecting A-level assets with one of these devices makes excellent sense. For the majority of employees and the majority of businesses, a single sign-on solution with the user expected to memorize one strong password every couple of months is adequate security and the most cost-effective solution. You need to know where you are vulnerable and what attacks can exploit these vulnerabilities. If you have not run a security audit, make plans to do so. Large enterprises may want to engage outside consultants to audit the entire firm. As an alternative, some software packages can effectively probe your system for vulnerabilities. You might consider building up a security cadre of individuals in your organization. They would be tasked with knowing and understanding all the current hacker attacks that can be mounted against systems of your type and with running any security auditing software that you might purchase. Finally, effective, tamperproof audit systems will allow you to detect that an attack has occurred and will provide you with the identity of the attacker. This is where scrupulously removing expired user accounts is important. If Bob Smith has left the company but his user account still exists, that account can be compromised and used in an attack with you having no clue as to who really perpetrated it. In any case, the cost of implementing the patch to the vulnerability has to be balanced against the seriousness of the threat and the probability that someone in your organization would have the sophistication necessary to carry out such an assault. Lock the most important doors first. 
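The advice about expired accounts lends itself to a simple, regularly scheduled check. The Python sketch below compares the accounts that exist on a system against a roster of current employees and reports the leftovers. The function name, the account names, and the idea that both lists can be exported as plain user IDs are assumptions made for illustration.

```python
def find_stale_accounts(system_accounts, current_employees):
    """Return, in sorted order, accounts with no matching current employee."""
    return sorted(set(system_accounts) - set(current_employees))


# Hypothetical exports: user IDs on the system and on the HR roster.
system_accounts = ["bsmith", "jdoe", "mlee", "svc_backup"]
current_employees = ["jdoe", "mlee"]

for account in find_stale_accounts(system_accounts, current_employees):
    print(f"Review or disable account: {account}")
```

In this example the check flags both the departed user's account and a service account; the latter is a reminder that non-personal accounts need a named owner so that they can be audited as well.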
Inadvertent damage to or compromise of business data by well-meaning or ignorant employees causes substantial business losses annually. While this type of damage has no malicious component, the results are the same as for a malicious attack. Inadvertent damage can be prevented in a variety of ways. Training is first and foremost. When new corporate applications are being brought online, it is crucial that the people who will be using them are thoroughly trained. Effective training provides a positive ROI. After an application is in regular operation, it is important that experienced operators be given the time and resources to train and mentor new operators. However, even the best training programs are not 100 percent effective. It is also important to make sure that people are in the proper roles and that the security parameters of those roles are crafted so that people have the least opportunity to do damage without being restricted in their ability to carry out their assigned functions. Audit trails are useful for establishing exactly what was damaged or compromised. After that has been ascertained, the backups of that data are what will allow you to recover from the situation.

Hacker attacks are high-profile events that invariably become news items when discovered. It is difficult to accurately judge the size of business losses caused by hacker attacks from the Internet. In any case, you have to take them seriously. All companies can expect to experience hundreds of low-level "doorknob rattling" attacks, such as port scans, in the course of a year. On the positive side, the vast majority of technical hacker exploits are well-known, and the measures necessary to defeat them are standard parts of all companies' Internet defense systems. The main response that hacker attacks should prompt is to keep all your machines, both inside the firewall and in the DMZ, 100 percent up to date on security patches.
Any stratagem that you can devise to streamline the application of security patches will give your company a strategic advantage over a company that does it all by hand or ignores the problem completely.

Of the attacks that can occur from the outside, that is, from the Internet, denial-of-service (DoS) attacks are the least common but potentially the most destructive. In the distributed form described here, a DoS attack involves recruiting large numbers of outside computer systems and synchronizing them to flood your system with so many requests that it either crashes or becomes unable to respond to legitimate service requests from your customers. Fortunately, because of the large-scale effort required of the attackers, DoS attacks are rare and are normally directed against high-profile targets. DoS attacks are extremely difficult to combat, and it is beyond the scope of this book to discuss methods of dealing with them. It is important, though, that if your company is large enough and important enough to be a possible target, you begin now to research and put into place countermeasures.

All security mechanisms, such as encryption and the application of security policies, will have an impact on system performance. Providing certificate services costs money. By applying agile architecture principles to the design of the security part of your systems architecture, you can substantially mitigate these impacts. To be agile in the area of systems architecture security design means applying just the right level of security at just the right time. It means being realistic about your company's security needs and limiting the intrusion of security provisions into the applications and machines that make up your systems architecture. Just enough security should be applied to give the maximum ROI and no more.

As a final note, do not forget physical security for any machine that hosts sensitive business data.
There are programs that allow anyone with access to a computer and its boot device to grant themselves administrator rights. Once more, we recommend the A-, B-, and C-level paradigm. C-level data and resources have no special physical security. B-level data and resources have a minimal level of security applied. For A-level data and resources, we suggest you set up security containment that also audits exactly who is in the room with the resource at any time.

Systems Architecture and Disaster Recovery

Disaster recovery planning (DRP) is sometimes referred to as "business continuity planning." DRP for the enterprise system infrastructure is a major systems architect responsibility. The purpose of DRP is to produce a set of plans to deal with a severe disruption of a company's operations. DRP is independent of the modality of the disaster (flood, fire, earthquake, terrorist attack, and so on). It involves the following:

Identification of the functions that are essential for the continuity of the enterprise's business
Identification of resources that are key to the operation of those functions, such as:
    Manufacturing facilities
    Data
    Voice and data communications
    People
    Suppliers
Prioritization of those key resources
Enumeration of the assumptions that have been made during the DRP process
Creation of procedures to deal with loss of key resources
Testing of those procedures
Maintenance of the plan

DRP is normally a large project that involves input from every unit in the company. The sheer size of the effort can make it seem prohibitively expensive. DRP is largely about money. Perfect disaster protection will cost an enormous amount of it. A disaster coupled with inadequate DRP can destroy a company. However, the DRP process will be like any other company process in that the size and cost of the DRP effort will be adjusted to fit available company resources.
To provide the best DRP for the available resources, it is vital that the analysis phase correctly do the following:

Identify all the key resources
Accurately prioritize the key resources
Provide accurate costing for protection of those resources
Provide accurate estimates of the probability of the destruction of the key resources

Given an accurate resource analysis, the IT/business team can work together to build an affordable DRP. When costing the DRP program, do not neglect the maintenance of the plan.

Disaster recovery planning for most enterprises will involve the coordination of individual plans from a variety of departments. We focus on Canaxia's DRP and the process of its creation. Initially, the disaster planning team (DPT) was created and tasked with creating a DRP process. Canaxia's architecture group was a core part of that team. It quickly became obvious that Canaxia required a permanent group to maintain and administer its DRP. As a result, the Department of Business Continuity (DBC) was created.

Following are some of the critical Canaxia resources that the DRP identified:

Engine manufacturing facilities. All engines for the Canaxia line of vehicles were manufactured in one facility.
Brake assemblies. Canaxia used a unique brake system available from a single supplier. That supplier was located in a politically unstable region of the world.
Data. Of the mountain of data that Canaxia maintained, 10 percent was absolutely critical to daily functioning. Furthermore, 45 percent of the data was deemed irreplaceable.
Voice and data communications. Canaxia depended on the Internet for the communication necessary to knit together its far-flung departments.
The enterprise resource planning (ERP) system. This system totally automated all of Canaxia's manufacturing and supply activities.
Core IT employees. A core group of IT employees had done the installation of the ERP system.
Loss of the manager and two of the developers would have a severe negative impact on the functioning of the ERP system.

Once critical resources had been identified, they were divided into three categories:

Survival critical. Survival critical resources are those absolutely required for the survival of the company. Normally, survival critical activities must be restored within 24 hours. Data, applications, infrastructure, and physical facilities required for the acceptance of orders and the movement of completed vehicles from manufacturing plants are examples of survival critical resources for Canaxia.
Mission critical. Mission critical resources are resources that are absolutely required for the continued functioning of the enterprise. However, the company can live without these resources for days or weeks.
Other. This category contained all the resources that could be destroyed in a disaster but were not deemed to be survival or mission critical. If they were replaced, it would be months or years after the disaster.

Within each category, the team attached a probability that the resource would become unavailable and a cost to protect it. All the assumptions used in the analysis phase were detailed to allow the individuals tasked with funding the DRP to do a reality check on the numbers given them.

Documenting existing procedures so that others could take on the functioning of key Canaxia employees was a substantial part of the DRP. Canaxia, like many other enterprises, had neglected documentation for a long time. That neglect would have to be rectified as part of the DRP. In many cases, producing documentation decades after a system was built would be an expensive task.

Every DRP will have to enumerate a set of tests that will allow the company to have confidence in the DRP. Running some of the tests will be expensive. Once more, the test scenarios have to be rated as to cost, probability of resource loss, and criticality of the resource.
Testing is a good place to examine the assumptions from the analysis phase for accuracy. A DRP will have to be maintained. As systems and business needs change, the DRP will have to be revised. In some cases, the DRP needs can drive business decisions. Canaxia made the decision to move to a generic brake assembly that could be obtained from a variety of sources. Canaxia's brake performance was one of its technical selling points, and management knew that moving to a generic brake would not help sales. Canaxia also established a company rule that strongly discouraged getting into single-source supplier relationships in the future. We recommend an iterative approach to DRP. First do the top of the survival critical list all the way through the testing phase. Use the experience gained in that process to make the next increment more focused and efficient. Trying to do the entire job in one big effort risks cost and time overruns that can cause the process to be abandoned before anything is protected.
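One way to picture the iterative approach is as a simple planning routine that walks the resource list in order of criticality and cuts it into increments that fit a budget. The sketch below is only an illustration under stated assumptions: the three category names come from the classification described earlier, but the ordering rule (most likely loss first within a category), the per-increment budget, the resource names, and every number are hypothetical, and none of it reflects how Canaxia actually scheduled its work.

```python
from dataclasses import dataclass

# Lower number = handled earlier. Keys mirror the three categories in the text.
CATEGORY_ORDER = {"survival": 0, "mission": 1, "other": 2}


@dataclass
class Resource:
    name: str
    category: str            # "survival", "mission", or "other"
    loss_probability: float  # estimated chance the resource is lost (0..1)
    protection_cost: float   # estimated dollars to protect it


def plan_increments(resources, budget_per_increment):
    """Split resources into successive increments, most critical first.

    Within a category, resources more likely to be lost come first.
    A resource costing more than the budget still gets its own increment.
    """
    ordered = sorted(
        resources,
        key=lambda r: (CATEGORY_ORDER[r.category], -r.loss_probability),
    )
    increments, current, spent = [], [], 0.0
    for r in ordered:
        # Close the current increment when the next resource would overrun it.
        if current and spent + r.protection_cost > budget_per_increment:
            increments.append(current)
            current, spent = [], 0.0
        current.append(r)
        spent += r.protection_cost
    if current:
        increments.append(current)
    return increments


# Hypothetical resources and figures, for illustration only.
EXAMPLE_RESOURCES = [
    Resource("order-entry data", "survival", 0.05, 200_000),
    Resource("voice and data communications", "survival", 0.10, 150_000),
    Resource("ERP system", "mission", 0.03, 300_000),
    Resource("brake assembly supply", "mission", 0.20, 100_000),
    Resource("archived marketing material", "other", 0.05, 20_000),
]

if __name__ == "__main__":
    plan = plan_increments(EXAMPLE_RESOURCES, 400_000)
    for number, increment in enumerate(plan, start=1):
        print(f"Increment {number}: {[r.name for r in increment]}")
```

With these invented inputs, the first increment covers the two survival critical resources, the second covers the two mission critical ones, and the last picks up what remains. In practice each increment would be carried all the way through testing before the next one is planned, so that the lessons learned can change the estimates feeding the later increments.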