UNIX Fault Management: A Guide for System Administrators

By Brad Stone, Julie Symons
Published Dec 3, 1999 by Pearson.

Sorry, this book is no longer in print.

Description

Copyright 2000
Edition: 1st

Book
ISBN-10: 0-13-026525-X
ISBN-13: 978-0-13-026525-8

Maximize UNIX system integrity and availability in mission-critical environments!

If you're responsible for maintaining the integrity and availability of a mission-critical UNIX system, then you need UNIX Fault Management: A Guide for System Administrators, the first book that brings together all of the monitoring and fault management information. Expert UNIX system management engineers Brad Stone and Julie Symons show you exactly how to implement appropriate, cost-effective system monitoring on any UNIX server -- including systems configured as high availability clusters. You'll learn how to:

Plan for -- and establish -- cost-effective, reliable system monitoring procedures
Monitor systems, disks, networks, applications, and databases
Detect, investigate, and recover from server problems
Implement best practices for high availability in enterprise-class UNIX installations -- including clusters
Take advantage of key fault management trends, new standards, and new technologies

This book contains detailed descriptions of fault monitoring tools and monitoring frameworks to help you make better purchasing decisions. You'll also find a handy quick reference of monitoring tasks and techniques for operators -- including specific, step-by-step recovery solutions. If you can't afford one nanosecond more downtime than necessary, you can't afford to be without UNIX Fault Management.



Sample Content

Downloadable Sample Chapter

A sample chapter for this book is available: 013026525X.pdf

Table of Contents

1. Analyzing the Role of System Operators.

Trends in System Operations.

2. Enumerating Possible Events.

Defining Fault Management. Event Categories. Configuration Events. Faults. Resource and Performance Events. Security Intrusions. Environmental Changes.

3. Using Monitoring Frameworks.

Distinguishing Monitoring Frameworks. Monitored Components. Monitoring Features. Monitor Discovery and Configuration. Monitor Developer's Kits. Notification Methods. Diagnostic Capabilities. IT/Operations. Monitored Components. Monitoring Features. Monitor Discovery and Configuration. Monitor Developer's Kit. Notification Methods. Diagnostic Capabilities. Additional Information. Unicenter TNG. Monitored Components. Monitoring Features. Monitor Discovery and Configuration. Monitor Developer's Kit. Notification Methods. Diagnostic Capabilities. Additional Information. Event Monitoring Service. Monitored Components. Monitoring Features. Monitor Discovery and Configuration. Monitor Developer's Kit. Notification Methods. Diagnostic Capabilities. Additional Information. PLATINUM ProVision. Monitored Components. Monitoring Features. Monitor Discovery and Configuration. Monitor Developer's Kit. Notification Methods. Diagnostic Capabilities. Additional Information. BMC PATROL. Monitored Components. Monitoring Features. Monitor Discovery and Configuration. Monitor Developer's Kit. Notification Methods. Diagnostic Capabilities. Additional Information. MeasureWare. Monitored Components. Monitoring Features. Monitor Discovery and Configuration. Monitor Developer's Kit. Notification Methods. Diagnostic Capabilities. Additional Information.

4. Monitoring the System.

Identifying Important System Monitoring Categories. Monitoring System Configuration Changes. Monitoring System Faults. Monitoring System Resource Utilization. Monitoring System Security. Monitoring System Performance. Using Standard Commands and Tools. bdf and df. ioscan. iostat. ipcs. mailstats. ps. sar. swapinfo. sysdef. sysdef. timex. top. uname. uptime. vmstat. who. Using System Instrumentation. SNMP. DMI. Using Graphical Status Monitors. OpenView Network Node Manager. ClusterView. Unicenter TNG. Using Event Monitoring Tools. Event Monitoring Service. EMS High Availability Monitors. EMS Hardware Monitors. Enterprise SyMON. OpenView IT/Operations. GlancePlus Pak 2000. Security Monitoring. Security Overview. Security Monitoring Tools. Using Diagnostic Tools. Support Tool Manager. HP Predictive Support. HA Observatory. Monitoring System Peripherals. Disks. Tapes. Printers. Collecting System Performance Data. MeasureWare. GlancePlus. PerfView. BMC PATROL for UNIX. Candle. Using System Performance Data. Avoiding Performance Issues. Detecting CPU Contention. Checking System Resource Usage. Detecting Memory and Swap Contention. Detecting Disk and File System Bottlenecks. Avoiding System Problems. Recovering from System Problems. Comparing System Monitoring Tools. Case Study: Recovering from Memory Faults. Verifying Configuration. Setting Up Monitoring and Reconfiguration. Memory Board Failure Occurs. Fixing the Failure and Restoring Service.

5. Monitoring the Disks.

Identifying Important Disk Monitoring Categories. Using Standard Commands and Tools. bdf and df. diskinfo. fsck. ioscan. lvdisplay. pvdisplay. vgdisplay. Using System Instrumentation. Simple Network Management Protocol. Desktop Management Interface. Using Event Monitoring Tools. Event Monitoring Service Disk Volume Monitor. EMS Hardware Monitors. HARAYMON and ARRAYMOND. OpenView IT/Operations. Enterprise SyMON. Using Diagnostic Tools. Support Tool Manager. HP Predictive Support. Collecting Disk Performance Data. MeasureWare. GlancePlus. PerfView. BMC PATROL. Using Disk Performance Data. Avoiding Disk Problems. Recovering from Disk Problems. Comparing Disk Monitoring Products. Case Study: Configuring and Monitoring for Mirrored Disks. Verifying Configuration. Setting Up Monitoring. Mirror Fails. Restoring Mirrors. Verifying Configuration.

6. Monitoring the Network.

Identifying Important Network Components to Monitor. Using Graphical Network Status Monitors. Network Node Manager. IT/Operations. Unicenter TNG. Enterprise SyMON. Monitoring Network Interface Card and Cable Failures. Using SNMP Instrumentation. Using Standard Commands and Tools. Using Additional Products To Monitor Network Links. Using Link-Specific Commands. Monitoring Networking and Transport Protocols. Using SNMP Instrumentation. Using Standard Commands and Tools. Monitoring Network Services. Monitoring DHCP/BOOTP Servers. Monitoring DNS/NIS Name Servers. Monitoring FTP. Monitoring NFS. Monitoring Remote Connectivity. Monitoring Web Servers. Monitoring Network Hosts. Network Node Manager. netstat. Interconnect & Router Manager. Collecting Network Performance Data. Using RMON and RMON-II Instrumentation. NetMetrix Site Manager. MeasureWare. GlancePlus. PerfView. BMC PATROL for \. Network General Sniffer Pro. Using Network Performance Data. Avoiding Performance Issues. Detecting Overloaded Network Servers. Detecting Network Congestion. Avoiding Network Problems. Recovering from Network Problems. Isolating the Fault. Network and Lower Layers. Transport and Higher Layers.

7. Monitoring the Application.

Important Application Components to Monitor. Identifying Application Types. Using Standard Commands and Tools. ps. top. vmstat. Using System Instrumentation. SNMP. DMI. pstat. Fault Detection Tools. IT/Operations. MC/ServiceGuard. ClusterView. Event Monitoring Service. EcoSNAP. Monitoring Tools for ERP Applications. Envive. SMART Plug-Ins. BMC PATROL Knowledge Modules. EcoSYSTEMS. Resource and Performance Monitoring Tools. Application Resource Measurement. MeasureWare. GlancePlus. PerfView. Process Resource Manager. Controlling Application Performance. Recovering from Application Problems. Comparison of Application Monitoring Products.

8. Monitoring the Database.

Identifying Important Database Monitoring Categories. Configuring the Database. Watching for Database Faults. Managing Database Resources and Performance. Keeping the Database Server Secure. Ensuring Successful Database Backups. Using Standard Database Commands and Tools. UNIX Commands. SQL Commands. SNMP MIB Monitoring. Database Vendor Tools. Using Fault Detection and Recovery Tools. MC/ServiceGuard. ClusterView. EMS HA Monitors. Resource and Performance Monitoring Tools. Application Resource Measurement. Oracle Trace. Oracle V$ Tables. GlancePlus Pak 2000. Oracle Management Pak. PerfView. SMART Plug-Ins for Databases. BMC PATROL Knowledge Modules. PLATINUM DBVision. Using Database Performance Data. Avoiding Performance Issues. Checking for System Contention. Checking for Disk Bottlenecks. Checking Database Buffer and Pool Sizes. Avoiding Database Problems. Recovering from Database Problems. Comparison of Database Monitoring Products.

9. Enterprise Management.

Monitoring Across an Enterprise. Identifying Events. Using Event Correlation Tools. OpenView Event Correlation Services. Seagate NerveCenter. IT Masters MasterCell. Monitoring Multiple Systems. IT/O. ClusterView. Enterprise Management Frameworks. ClusterView. Monitoring Agents. Using Multiple Tools. IT/Operations and the Network Node Manager. IT/O and PerfView. BMC PATROL and IT/O. PLATINUM ProVision and IT/O. EMS and OpenView NNM or IT/O.

10. UNIX Futures.

Future Trends in Fault Management.

Appendix A: Standards.

Glossary.

Index.

Preface

This book is intended for system administrators and operators who are responsible for maintaining the integrity and availability of mission-critical UNIX systems. The book provides a description of the fault monitoring tools and techniques available for UNIX servers, including systems that are configured as high availability clusters.

This book can serve as a handy quick reference for an operator troubleshooting a problem in the customer environment: it points out where to find key diagnostic messages and describes how to take recovery actions.

A system administrator responsible for the initial configuration and administration of UNIX systems will also find this book useful because it describes the procedures to follow to set up the appropriate levels of system monitoring. The product descriptions can also help in making purchasing decisions as the customer determines the appropriate amount of event monitoring needed in their environment.

An overview of the tasks performed by an operator is provided, with details on how events are received and processed. The remainder of the book focuses on the types of events that can be received, how they are detected, how operators receive event notifications, and how problems can be investigated and recovery performed. The goal is to introduce the necessary tools, but not to show how every possible problem can be solved.

This book provides numerous descriptions of how fault management tools and products can be used to solve a variety of problems. Many of the chapters are focused on specific computer components, such as disks or databases, to be helpful to operators with specific roles. Here is a description of the individual chapters:

Chapter 1, "Analyzing the Role of System Operators," describes the tasks performed by a system operator and the evolution of fault management.

Chapter 2, "Enumerating Possible Events," describes the various types of events that are interesting to monitor on a UNIX system.

Chapter 3, "Using Monitoring Frameworks," describes monitoring frameworks and the administrative tasks that must be done before they can be used.

Chapter 4, "Monitoring the System," describes the tools and products used to monitor the UNIX server.

Chapter 5, "Monitoring the Disks," describes the tools and products used to monitor external disk devices.

Chapter 6, "Monitoring the Network," provides an overview of the many tools available for detecting problems and events related to the use of the network.

Chapter 7, "Monitoring the Application," describes methods for monitoring the response times and availability of critical applications.

Chapter 8, "Monitoring the Database," focuses specifically on tools to detect problems and events related to database usage.

Chapter 9, "Enterprise Management," discusses the challenges of fault management in a large-scale customer enterprise.

Chapter 10, "UNIX Futures," discusses the future plans of some of the major UNIX system vendors in the area of fault management.

Appendix A, "Standards," describes fault management standards that have emerged and how you can benefit from them.

The Glossary contains the important terms used in the book, and their definitions.

Although it is assumed that most customers concerned about fault management will implement high availability solutions, this book does not describe how to create highly available computing environments. Readers needing additional information on high availability may check Hewlett-Packard's external Web site on high availability (http://www.hp.com/go/ha) or read Clusters for High Availability by Peter Weygant.

In general, this book does not discuss the configuration and installation of the hardware and software components of your UNIX system. You should rely on your vendors' product manuals for this.

Many of the examples used in this book were created on HP-UX servers. Other UNIX platforms behave similarly, and we note when tools are supported only on certain UNIX platforms.


