25.2 Problem Determination Methodology
Often, when an error occurs in WebSphere, the user knows a problem has occurred but doesn't know what to do about it or what it means; the user just knows that something is broken. This section takes you through WebSphere problem determination methodology, a series of rules to follow when you encounter a problem, to help pinpoint its origin and take it to resolution.
Why is methodology important? The foundation of any problem determination process is good methodology. Knowing how to go about determining if you have a problem, where the problem exists, and the basics of how to solve that problem are important to any enterprise project.
You might be asking, "Why not tell us how to troubleshoot WebSphere?" Troubleshooting a product the size of WebSphere could fill a book on its own. WebSphere is a very large and complex piece of software, and addressing in detail how to debug each specific component is beyond the scope of an administration book. While this subsection does not cover detailed troubleshooting of the application server, it does lay the foundation for troubleshooting your environment if and when a problem occurs. This methodology section lays out a set of rules you can employ when doing in-depth problem determination. Following these rules when a problem occurs can help expedite locating the problem (the first step in any problem determination process) and then resolving it.
25.2.1 Locating the Error in a Complex Environment
When a problem occurs, pinpointing its origin can take some detective work, especially in a complex environment where multiple tiers with multiple products are integrated with one another. Knowing where each product's log files are located is an important step in knowing your environment, since the log files often provide very useful information. We have detailed where to locate and how to read WebSphere's log files in chapter 24, WebSphere Problem Determination Tools: Logging and Tracing. Often, when troubleshooting in a complex environment, problem determination becomes a team effort, involving administrators and application developers from each component in the environment.
This is especially true in the initial stages of determining where the problem resides (e.g., which tier, which component/product). Rarely is one person an expert in each component that makes up the environment (e.g., the back-end, the Web server, the edge component, the authentication server, the application, etc.). It is often necessary to involve people from various roles to assist in determining where the problem is or is not located. For example, if you are experiencing the problem when testing a new build of an application, make sure the necessary application developers are available to help determine if the error could be originating from the application code. Or, if the error is occurring when accessing the back-end data store, involve the database administrator to assist in determining if the error is originating from the database.
Figure 25.1 Example request path through enterprise environment.
To help pinpoint the component where the problem exists, it is useful to create a diagram of the path a request would take, assuming it did not fail, beginning with the client all the way to the back-end, depicting each component the request touches. For example, assume that a request for a simple servlet is failing. If you were to map the flow of that request in a simple environment, beginning from a browser, it might look like this: the servlet request goes through the proxy firewall to the network sprayer; from the network sprayer through the domain firewall to the Web server; and from the Web server through an additional firewall to WebSphere Application Server, which then responds back to the Web server, which in turn responds to the browser.
At any one of these points, the request might fail. However, the request can be tracked by looking at access and error logs of each of the components involved. Additionally, this might require enabling trace on the various components to track the requests. For example, the access log on WebSphere's embedded HTTP Server might be enabled so you can verify that the request from the Web server made it into WebSphere. Possibly, one of the components is throwing error messages into the log files, or there are communication issues between two or more of the components involved. Once the failing component(s) is located, a more thorough examination of the error can begin. Problem determination of products outside the WebSphere Application Server or products installed and running in WebSphere Application Server (like WebSphere Portal or a custom J2EE application) is out of the scope of this section. While this section is mainly geared toward problem determination with regard to the WebSphere Application Server run time, some of this methodology still applies to external products and applications.
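The hop-by-hop tracking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names, the log lines, and the helper `first_hop_missing_request` are hypothetical, and real access logs would be read from each product's own log files rather than from in-memory lists.

```python
from typing import Dict, List, Optional


def first_hop_missing_request(
    hops: List[str],
    logs_by_hop: Dict[str, List[str]],
    request_marker: str,
) -> Optional[str]:
    """Return the first hop whose log has no line containing request_marker.

    hops is ordered from the client side toward the back-end. None means
    the request was seen at every hop.
    """
    for hop in hops:
        lines = logs_by_hop.get(hop, [])
        if not any(request_marker in line for line in lines):
            return hop
    return None


# Hypothetical component names and log lines, for illustration only.
hops = ["network sprayer", "Web server", "WebSphere embedded HTTP server"]
logs_by_hop = {
    "network sprayer": ["GET /shop/SimpleServlet 200"],
    "Web server": ["GET /shop/SimpleServlet 500"],
    "WebSphere embedded HTTP server": ["GET /other/Page 200"],
}

suspect = first_hop_missing_request(hops, logs_by_hop, "/shop/SimpleServlet")
print(suspect)
```

In this made-up data the request reaches the Web server but never appears in the embedded HTTP server's log, which would point the investigation at the link between those two components.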
25.2.2 Could the Error Be Valid?
Once the error is isolated to a particular component(s) within the environment, one of the first things to evaluate is the error code, message, and any associated stack trace that appears. Often these error codes and/or messages provide useful information as to what went wrong. Most products and protocols also have guides that provide additional information on particular error codes. WebSphere has a Message Reference guide that is a subsection of the InfoCenter documentation. This Message Reference section has a description of each error code that WebSphere can log in its trace or log files. For information on how to read or locate WebSphere error codes and messages, please refer to chapter 24. Table 25.1 also lists where to locate additional commonly used IBM product documentation, as well as protocol error codes.
Table 25.1 IBM Product and Protocol Message Reference Guides
Product or Protocol | Message Reference Guide | Where to Locate
WebSphere Application Server | Message Reference Section of InfoCenter Documentation | Online InfoCenter Documentation:
DB2 Universal Database | Message Reference, vol. 1 and 2 Guides | Online DB2 Core Documentation:
IBM HTTP Server | Troubleshooting Section of InfoCenter Documentation | Online InfoCenter Documentation:
Apache Server | Online Documentation and FAQ | Online Apache Documentation Project:
Hypertext Transfer Protocol (HTTP) | Status Code Section of RFC 2616 | Online RFC 2616:
Sometimes, pertinent information can be located in different product log files, which can be aligned by timestamp. For example, an exception occurring in the database could provoke error messages to be logged in the application server log files, as well as the Web server logs. Once you locate one error message, you can use its associated timestamp to cross-reference the database, application server, and Web server log files, and so on. This is also a good technique to use when locating the root of the error.
TIP
It is important to have the system clocks synchronized on each machine so that timestamp cross-referencing can be used easily. If the system clocks are not synchronized, the logs can still be cross-referenced; however, the times must be adjusted for the skew between the clocks.
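As a rough sketch of timestamp cross-referencing, the Python below collects log entries from several sources that fall within a short window around a known error time, applying a per-source clock correction first. The entry format, the source names, the sample messages, and the helper `entries_near` are assumptions made for illustration; real logs would first need to be parsed into timestamps.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Entry = Tuple[str, datetime, str]  # (source, timestamp, message)


def entries_near(
    entries: List[Entry],
    reference: datetime,
    window: timedelta,
    skew_by_source: Dict[str, timedelta],
) -> List[Entry]:
    """Return entries whose skew-corrected time lies within window of reference."""
    nearby = []
    for source, stamp, message in entries:
        # Add the known clock offset so all sources share one timeline.
        adjusted = stamp + skew_by_source.get(source, timedelta(0))
        if abs(adjusted - reference) <= window:
            nearby.append((source, adjusted, message))
    return sorted(nearby, key=lambda entry: entry[1])


# Hypothetical entries; the database clock is assumed to run 2 seconds behind.
entries = [
    ("database", datetime(2003, 7, 15, 17, 8, 40), "connection refused"),
    ("application server", datetime(2003, 7, 15, 17, 8, 42), "SQLException"),
    ("Web server", datetime(2003, 7, 15, 17, 30, 0), "unrelated 404"),
]
skew = {"database": timedelta(seconds=2)}

related = entries_near(
    entries, datetime(2003, 7, 15, 17, 8, 42), timedelta(seconds=5), skew
)
for source, stamp, message in related:
    print(stamp.isoformat(), source, message)
```

The window size is a judgment call: too narrow and related messages are missed, too wide and unrelated noise creeps in.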
When tackling a problem, it is always good to first assume that the error is valid before declaring that there is a bug or a defect in the running product. The error code and associated message can often be enough for you to diagnose and fix the problem.
For example, let us take a look at a fix pack installation problem scenario. Upon installing a fix pack onto an existing WebSphere configuration, an error occurs preventing the installation. The initial reaction to such a failure could be to assume that there is a defect with the installation or the WebSphere run time. Listing 25.1 depicts a portion of a reproduced log file with the problem.
Listing 25.1 Portion of log file from failing installation of a WebSphere fix pack.
...
Results:
=================================================================
Time Stamp (End) : 2003-07-15T17:08:42-04:00
EFix Component Result : failed
EFix Component Result Message:
=================================================================
WUPD0239E: Fix removal failure: The processing of fix WAS_WSADIE_ND_01_16-2003_5.0_cumulative, component prereq.wsadie failed. See the log file C:\\WebSphere\DMgr\properties\version\log\20030715_210842_WAS_WSADIE_ND_01-16-2003_5.0_cumulative_prereq.wsadie_uninstall.log for processing details.
=================================================================
EFix Component Installation ... Done
Exception: WUPD0223E: Fix uninstall failure: The update for component {1} for fix pre-req.wsadie could not be installed.
...
As you can see from the log file excerpt, an exception occurred that prevented the installation. The WUPD0239E message in this log file also references an additional file for more information. Upon investigation of the referenced log file, 20030715_210842_WAS_WSADIE_ND_01-16-2003_5.0_cumulative_prereq.wsadie_uninstall.log, additional information about the problem is uncovered. Listing 25.2 shows a reproduced portion of that file.
Listing 25.2 Portion of log file from failing installation of a WebSphere fix pack.
...
2003-07-15T17:08:42-04:00 Applying entry 1 of 5 20% complete
2003-07-15T17:08:42-04:00 Preprocessing entry (restore):
2003-07-15T17:08:42-04:00 No EAR processing noted.
2003-07-15T17:08:42-04:00 Next entry name: lib/jdom.jar
2003-07-15T17:08:42-04:00 entry path: C:\WebSphere\DMgr\lib\jdom.jar
2003-07-15T17:08:42-04:00 Error 16--File could not be deleted: C:\WebSphere\DMgr\lib\jdom.jar
2003-07-15T17:08:42-04:00 Fetching entry ...
2003-07-15T17:08:42-04:00 Preprocessing entry (restore):
2003-07-15T17:08:42-04:00 No EAR processing noted.
2003-07-15T17:08:42-04:00 Next entry name: lib/marshall.jar
2003-07-15T17:08:42-04:00 entry path: C:\WebSphere\DMgr\lib\marshall.jar
2003-07-15T17:08:42-04:00 Error 16--File could not be deleted: C:\WebSphere\DMgr\lib\marshall.jar
2003-07-15T17:08:42-04:00 Fetching entry ...
2003-07-15T17:08:42-04:00 Preprocessing entry (restore):
...
Again, the errors stand out in the log file excerpt: the "Error 16" entries show that some of the jar files being replaced during the installation of the fix pack could not be removed.
TIP
A log file can have a tremendous amount of information in it. Sometimes searching for "rror" or "xception" can help pinpoint problems easily. Notice that in both search strings the "E" was left off so that capitalization does not limit the search.
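Where a scripting language is at hand, the same idea can be expressed as a case-insensitive search rather than by dropping the first letter. The following Python is a minimal sketch; the helper name `find_suspect_lines` is hypothetical, and the sample lines are abbreviated from Listing 25.2.

```python
import re
from typing import List, Tuple

# Case-insensitive, so "Error", "ERROR" and "exception" all match.
PATTERN = re.compile(r"error|exception", re.IGNORECASE)


def find_suspect_lines(lines: List[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line mentioning an error or exception."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


# Sample lines abbreviated from the uninstall log shown earlier.
sample = [
    r"2003-07-15T17:08:42-04:00 Next entry name: lib/jdom.jar",
    r"2003-07-15T17:08:42-04:00 Error 16--File could not be deleted: C:\WebSphere\DMgr\lib\jdom.jar",
    r"2003-07-15T17:08:42-04:00 Fetching entry ...",
]

hits = find_suspect_lines(sample)
for number, line in hits:
    print(number, line)
```

Like the manual search, this is a coarse filter: it also matches harmless words that happen to contain "error", so the hits still need to be read in context.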
Since these jar files could not be removed, the installation was failing. With this information in hand, we can begin to diagnose the problem, first assuming that the error was valid. Why couldn't the files be removed? The following options could all be valid possibilities:
- The jar files did not exist in the first place.
- The person running the installation did not have the appropriate permissions to remove files on the operating system.
- The files were locked by a running process.
After validating that the jars did exist and the installer had the appropriate permissions, the last option was investigated. A quick check of all running processes uncovered that WebSphere was still running while the fix pack was being installed. Since WebSphere was still running, it had locked the jar files to prevent run time corruption. So, in fact, the problem was not a defect or bug in WebSphere's installation of the fix pack; instead it was a valid response to an invalid operation (note that the fix pack installation directions require all WebSphere processes to be stopped before running the installation program). Once all WebSphere processes were stopped, the fix pack installation was successful.
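The process of elimination used in this scenario can be sketched as a small Python check that rules out the first two possibilities and leaves the third for manual investigation. This is an approximation under stated assumptions: the helper `why_not_deletable` is hypothetical, `os.access` only estimates permissions (particularly on Windows), and the sketch does not detect locks itself. The file names are borrowed from the listing and are created in a scratch directory purely for the demonstration.

```python
import os
import tempfile


def why_not_deletable(path: str) -> str:
    """Classify a failed delete, mirroring the three possibilities listed above."""
    if not os.path.exists(path):
        return "file does not exist"
    parent = os.path.dirname(os.path.abspath(path))
    # On POSIX systems, removing a file needs write and search
    # permission on the directory that contains it.
    if not os.access(parent, os.W_OK | os.X_OK):
        return "insufficient permissions"
    # Nothing obvious is wrong, so a lock held by a running process
    # is the remaining candidate and must be checked by hand.
    return "possibly locked by a running process"


with tempfile.TemporaryDirectory() as scratch:
    present = os.path.join(scratch, "jdom.jar")
    with open(present, "w") as handle:
        handle.write("placeholder")
    result_missing = why_not_deletable(os.path.join(scratch, "marshall.jar"))
    result_present = why_not_deletable(present)

print(result_missing)
print(result_present)
```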
25.2.3 What Has Changed?
When an error occurs, another technique to help pinpoint the root of the problem is to determine what might have changed to invoke the error. For example, did the error occur just after you ran a new test scenario, or did the error begin after you adjusted some TCP configurations on your operating system? Rarely does an error "just begin happening" if nothing was changed in the environment (network, operating system, application, server configuration, etc.). Therefore, it is an important part of problem determination to uncover what might have changed to invoke the problem at hand.
In an earlier section, Change Control Best Practices, a change control process was detailed as a best practice for problem prevention. If this system is in place and adhered to, determining if something in the environment has changed becomes much easier. Also, note that when we say "environment" we do not just mean WebSphere administration. The environment encompasses much more than this; it includes items such as the operating system settings, application configuration and code, test cases, and the configuration of supporting products either running on WebSphere or communicating with WebSphere, such as the Web server, back-end, authentication server, portal server, edge components, and so on. Determining if something has changed can be more than asking yourself if you have recently altered a configuration setting (unless you are the only one with administrator privileges to every server in the environment). Communication, therefore, is the key, especially in a complex environment. We are often surprised how, in some environments, communication between the administrators of various components (WebSphere, database, network, etc.), as well as with application development, is minimal. Often, an administrator will call a product support line before calling a coworker in a different department to see if they might have altered a configuration.
Pinpointing the change does not mean that a bug does not exist, nor does it mean that the problem is now solved. Often there is a good reason for the change that has been effected. However, knowing what the change is and how it affects the system is important in finding a solution (whether it is a product bug fix, a configuration tweak, etc.).
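If configuration snapshots are recorded as part of change control, answering "what has changed?" can be partly mechanical. The Python below is a minimal sketch that compares two snapshots held as simple key/value dictionaries; the helper `changed_settings` and the setting names are hypothetical and do not correspond to actual WebSphere or operating system parameters.

```python
from typing import Dict, Optional, Tuple


def changed_settings(
    before: Dict[str, str], after: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each added, removed, or altered key to its (old, new) values."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots taken before and after the error first appeared.
before = {"tcp.keepalive": "7200", "pool.max": "10", "security": "on"}
after = {"tcp.keepalive": "60", "pool.max": "10", "trace": "enabled"}

differences = changed_settings(before, after)
for key, (old, new) in differences.items():
    print(f"{key}: {old} -> {new}")
```

A value of `None` on one side means the setting was added or removed between the two snapshots.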
25.2.4 Simplify, Simplify, Simplify
When you are running in a complex environment and an error occurs, finding the problem can sometimes be equated to finding a "needle in a haystack." With so many different products and configurations involved, solving the problem becomes like solving a multiple variable algebra problem: the greater the number of variables involved, the more complex it is to solve.
Also, there is not always just one item that causes a problem to occur. Rather, it can be a combination of settings, coupled with a particular path through running application code, that triggers the error. To help limit some of the variables in the problem, it is wise to strip the environment back to the simplest possible running environment in which the error still occurs.
The best problem determination environment is one where the error can be reproduced with the simplest test scenario, running the simplest application code, deployed in the simplest environment. In this environment, not only is it easier to describe the problem to support (if necessary), but also it limits the number of variables involved in the problem, making it easier to determine a solution.
25.2.4.1 The Simplest Test Scenario
Evaluate the test scenario that prompts the error to occur. If the test scenario is testing multiple conditions, can the test be limited to only the condition that fails? By narrowing what is tested until you have located the simplest test scenario that still causes the failure, you can save time when rerunning the test scenario, as well as narrow the number of variables when reproducing the test. You might discover that it is the sequence in which the tests are run that causes the problem, and/or eliminate the components that appear not to be related to the failure.
Additionally, if the failure is occurring during load testing, try to find the minimum amount of load that still reproduces the problem. For example, if the test does not fail with a single user, but fails with two users, there might be a thread synchronization error. If the tests only fail under high load, it might be that your application or environment needs to be tuned for performance (please see chapters 21 through 23 of this book to learn about performance as it relates to WebSphere).
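Finding the minimum load that still reproduces a failure can be approached as a search. The sketch below uses a binary search and assumes, as a simplification, that the failure is monotonic: once it appears at some user count, it also appears at every higher count. The function `minimum_failing_load` and the `fails` callback, which stands in for actually running a load test, are hypothetical.

```python
from typing import Callable, Optional


def minimum_failing_load(
    fails: Callable[[int], bool], max_users: int
) -> Optional[int]:
    """Return the smallest user count in 1..max_users at which fails() is True.

    Returns None if the test passes even at max_users. Assumes that a
    failure at n users implies a failure at every count above n.
    """
    if not fails(max_users):
        return None
    low, high = 1, max_users
    while low < high:
        middle = (low + high) // 2
        if fails(middle):
            high = middle      # failure reproduced, try a lighter load
        else:
            low = middle + 1   # passed, the threshold is higher
    return low


# Stand-in for a real load test: pretend the failure needs two or more users.
threshold = minimum_failing_load(lambda users: users >= 2, max_users=64)
print(threshold)
```

Intermittent failures break the monotonic assumption, so in practice each load level may need to be run several times before it is treated as passing.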
25.2.4.2 The Simplest Application
When running a complex application, it is often difficult to determine whether the source of the error resides in the running application, in the WebSphere run time, or in some other area. If you can eliminate the running application in the simplified environment by reproducing the error with an alternate, much simpler application, that is very beneficial. You can eliminate the enterprise application as the source of the problem by attempting to reproduce the error with one of the IBM WebSphere sample applications that are installed with WebSphere or by creating a very simple application that forces the error to occur.
TIP
The technique of using an IBM WebSphere sample application or a simple sample application that can reproduce the problem is especially beneficial when working with IBM support. If using a created simple application, include it with any documentation that is provided to WebSphere support, with a description of what it does and the error it causes. This can help expedite support's interaction in determining the problem.
25.2.4.3 The Simplest Environment
To locate the origin of the problem, at either a product or a WebSphere component level, it is best to reproduce the problem with the simplest configuration possible. For example, if the problem is occurring with a Web application, remove the Web server from the environment by accessing the application directly via WebSphere's embedded HTTP server. If the problem is with persistent Enterprise Java Beans (EJBs), try to manually invoke some of the update or select queries on the back-end to validate that they run correctly. Some other suggestions for simplifying the problem determination environment include:
- Disabling work load management (distributed only)
- Disabling security
- Disabling the JIT compiler
- Moving the test clients onto a machine that has a direct route to WebSphere (rather than having to go through firewalls or edge components)
25.2.5 Do You Have Enough System Resources?
When failures begin to occur during load tests, sometimes they are not due to a run time failure, but rather to a potential performance issue. Please refer to Part 5, WebSphere Performance, of this book for additional information on performance monitoring and tuning. However, do remember that every machine has its limits. In some situations, additional hardware will be needed to support particular load requirements.
Performance monitoring can also uncover problems such as memory leaks that can severely impact application performance. It is highly recommended to tune your application before releasing it in a production environment.
25.2.6 What to Do If the Problem Is in Production
When a failure occurs in a production environment, it is often a critical situation. If you believe the problem to be a WebSphere run time defect, it is important to contact IBM WebSphere support (1-800-IBM SERV) immediately so they can begin investigating the problem. It is also important to make sure that no information surrounding the failure is lost. It is a best practice to back up all log files, including database and Web server logs, if applicable, so they can be referred to later, if necessary. Until a solution is found, roll back or disable any change or update that might have invoked the problem. In parallel, it is important to try to reproduce the problem in a test environment. A problem that can be reproduced in test lends itself to easier problem determination, since detailed traces and logging can be enabled without fear of affecting performance or up-time in the production environment. A test environment also lets you freely alter configuration parameters and gives you a simpler problem determination environment.
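Backing up the logs is easy to script so that it happens the same way under pressure. The Python below is a sketch only: the helper `back_up_logs`, the `incident_` directory naming, and the scratch paths are assumptions for illustration, and a real script would point at the actual log locations of each product in the environment.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Iterable, List


def back_up_logs(log_files: Iterable[Path], backup_root: Path) -> List[Path]:
    """Copy each log into a new timestamped directory under backup_root."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = backup_root / f"incident_{stamp}"
    target.mkdir(parents=True)
    copied = []
    for log_file in log_files:
        # Prefix with the parent directory name so that logs with the
        # same file name from different products do not overwrite each other.
        destination = target / f"{log_file.parent.name}_{log_file.name}"
        shutil.copy2(log_file, destination)  # copy2 keeps modification times
        copied.append(destination)
    return copied


# Demonstration against a scratch directory with one made-up log file.
scratch = Path(tempfile.mkdtemp())
source = scratch / "appserver" / "SystemOut.log"
source.parent.mkdir()
source.write_text("sample log line\n")

copied = back_up_logs([source], scratch / "backups")
print(copied[0].name)
```

Copying rather than moving leaves the running servers' log files in place, and preserving modification times keeps the copies usable for timestamp cross-referencing later.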
TIP
When the cause of the problem is determined, use it as a lesson learned. It is important that the failing scenario works its way back into the test suite that is run before any application is released in production. This way, the problem can be prevented in the future. Be sure to update procedures and test cases accordingly.
25.2.7 Where to Go for Help
IBM has an extensive WebSphere support Web site that contains self-help and problem submission information. This page should always be used before contacting IBM support. The WebSphere support Web site is accessible at http://www.ibm.com/software/webservers/appserv/was/support/.
The self-help section of the WebSphere support Web site contains links to several online resources meant to help you troubleshoot a WebSphere problem. Using this site, you can search on keywords to find Frequently Asked Questions (FAQs), Technotes, Hints and Tips, and other documents that address existing WebSphere problems. FAQs document common problems and solutions. Hints and Tips contain information about installing, configuring, and troubleshooting WebSphere. Technotes are documents containing customer-reported problems and solutions. You can also download WebSphere tools and utilities, as well as WebSphere fix packs and interim fixes. The support page also contains links to educational material such as IBM online publications, redbooks, and white papers.
The WebSphere InfoCenter is another resource for self-help. The InfoCenter is available online at http://www.ibm.com/software/webservers/appserv/infocenter.html or it can be downloaded as a PDF file. The local version of the InfoCenter is also available as an Eclipse documentation plug-in and can be downloaded from http://www.ibm.com/software/webservers/appserv/infocenter.html. To view the local documentation, you also need to install the IBM WebSphere Help System, which is a viewer for displaying product or application information developed as Eclipse documentation plug-ins. The IBM WebSphere Help System is built on open source software developed by the Eclipse Project. The InfoCenter contains a problem determination section, and you can also search the InfoCenter using keywords.
The developerWorks Web site contains very good information for WebSphere developers in the section dedicated to WebSphere, which is available at http://www.ibm.com/developerworks/websphere/. The WebSphere developerWorks is a great source for articles and best practices related to WebSphere products. The site also provides other features like code downloads, technology previews, and forums.
Other helpful WebSphere resources are the WebSphere newsgroups and WebSphere user-group forums. There are several such newsgroups and forums, and they usually contain very useful information provided by WebSphere users. Some of these newsgroups are monitored by IBM personnel, helping to ensure the integrity of information included within those newsgroups.
WebSphere Studio Application Developer and Site Developer V5.1 have a new feature that allows you to search on keywords for several products, including WebSphere Application Server. This new feature is provided in the form of several product-specific plug-ins. The search is performed on resources like the WebSphere support Web site, the WebSphere InfoCenter, and Google newsgroups. Access to these resources is provided from one central product-specific page. Besides search capabilities, the plug-ins also provide a collection of local documents for self-support. These documents are copies of FAQs, Technotes, Hints and Tips, and other resources that are frequently used by WebSphere support personnel. One advantage of having these local documents is that they are searched when you perform a search through the WebSphere Studio Application Developer or Site Developer Help menu. To access the product plug-ins, select Help > Help Contents from the main menu, and then click on Support information on the left side of the page. Please see Figure 25.2.
Figure 25.2 WebSphere Studio Application Developer Support information.