Methodology for Rule Extraction

Rule extraction from Web applications that will be accessed through the gateway can require persistence. Depending on how the application is written, its integration with the gateway will usually fall into one of three categories:

  • Integration out-of-box

  • Integration through profile configuration

  • Integration with special attention required

Integration Out-Of-Box (Category 1)

This category usually applies to web applications that are delivered to the browser as pure HTML, or as HTML with other inlined languages that do not reference any URLs. Integration is usually straightforward and requires little or no administrative intervention. This content tends to be static, to contain absolute URLs, and to be well-formed. Well-formed here means that the entire content is syntactically correct and that basic formatting practices are in place. These pages do not have Forms, Java Applets, or imported scripts that require special attention. The IMG SRC, A HREF, FORM ACTION, APPLET CODEBASE, and SCRIPT SRC tag attributes are just a few that are handled by the gateway out-of-box.

Integration Through Profile Configuration (Category 2)

This category applies to web applications, including those in Category 1, that are increasingly complex. Category 2 content would contain URLs in:

  • FORM INPUT tags

  • Applet parameters

  • JavaScript event handlers

  • A multitude of content types such as CSS, JavaScript code, and XML

  • Dynamically created content

Some of this content is handled out-of-box, while other content requires special considerations.

Integration With Special Attention Required (Category 3)

Content in this category includes URLs created dynamically on the client side, complex scripts that have URLs in function parameters, URLs built in several steps or in multiple locations in the code using string concatenation, URLs contained in fractured JavaScript, URLs hidden in nested function calls, integration with unknown third-party applications, and URLs contained in code that has passed through an obfuscator.

Category 2 content makes up the bulk of what most people would expect to see pass through the gateway. For applications that are being created specifically for use with the gateway, there are often a multitude of content-based workarounds that can be put in place if it is difficult, or not possible, to create a rule that will match a specific URL. Look for best programming practices in later sections for information on possible workarounds for different corner cases or content types.

The important thing to keep in mind is that the only thing to worry about content-wise is where URLs are referenced. URLs that are not correctly rewritten can manifest themselves in a variety of ways. Applications may misbehave when certain buttons are selected, forms are submitted, or other actions occur, such as a mouseOver. The browser may return an error message saying that a particular server is either down or inaccessible when a link has been selected. Users may be mysteriously redirected back to the Portal Server login page, even though they did not log out or have their session terminated. Images may show up broken. Applets may not download completely or run correctly. Navigation bars may not work correctly. Any of these could be signs that the gateway requires additional configuration to work with the application.

For testing purposes, you should ensure that the browser cannot make a direct connection to any server through the gateway. Otherwise, when the Portal Server is moved into production, a number of issues may arise because the browser is no longer able to talk directly to internal content. One way to determine if this is a problem is to snoop the connection between the client and the content server. There should be no direct communication between the two when accessing the content server through the gateway component. If there is, then it is likely that a URL has been overlooked and has not been rewritten.

What Rules Are Necessary?

There are a number of ways to go about investigating what rules may be necessary for the correct rewriting of URLs. Some understanding of the web application will help in figuring out how to rewrite parts of the content correctly.

You should answer the following questions before diving straight into the Web application source code:

Does the Document Have Frames?

If the answer is yes, then you must ensure that you start looking at the source code for the correct frame. You will also want to see if there is a SCRIPT tag in the parent document that initializes top-level JavaScript variables that could include URLs or hint at how URLs will be used throughout the application. Keeping this in mind, you would also want to look at the individual frames to see if they make references to parent.* or top.* that might reference URLs (like parent.location.href).
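A quick way to spot such references is a pattern search over each frame's source. The following is only a rough sketch: the helper name findFrameReferences and the sample source are made up for illustration, and a simple pattern like this will miss references that are built indirectly.

```javascript
// Hypothetical helper: list parent.* and top.* references found in frame source.
function findFrameReferences(source) {
  // Match "parent" or "top" followed by a dotted property chain.
  const pattern = /\b(?:parent|top)(?:\.[A-Za-z_$][\w$]*)+/g;
  const found = source.match(pattern) || [];
  // Remove duplicates while keeping first-seen order.
  return [...new Set(found)];
}

// Made-up frame source used only to exercise the helper.
const frameSource = `
  function openTab(id, link) {
    parent.content.location = link;
  }
  var here = top.location.href;
`;

console.log(findFrameReferences(frameSource));
```

Each reference reported this way is a candidate to check against the gateway profile, not proof that a rule is needed.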

For example, when a specific frame-enhanced page is accessed through the gateway, the page does not render properly if it is resized, and none of the tabs can be selected at the top of the page.

The following is the parent document source code:

<HTML>
<FRAMESET ROWS="75,*">
<FRAME SRC="index_top.html" NAME="tabs">
<FRAMESET COLS="134,*"
onResize="options.location.href='index_left.html';
content.location.href='index_right.html';">
<FRAME SRC="index_left.html" NAME="options">
<FRAME SRC="index_right.html" NAME="content">
</FRAMESET>
</FRAMESET>
</HTML>

The JavaScript event handler onResize executes if the browser window proportions are changed. The options.location.href and content.location.href JavaScript document objects refer to the URLs of the frames named options and content. The FRAME SRC attributes will be rewritten automatically because all SRC attributes in HTML tags are rewritten by the gateway out-of-box. Because part of the problem is that the page does not render properly when the window is resized, it could be that content.location.href and options.location.href are not in the gateway profile under the Rewrite JavaScript Variables in URLs section. onResize will also have to be listed in the Rewrite HTML Attributes Containing JavaScript list.

For details about either rewriter attribute see "Rewriting HTML Attributes" on page 28 and "Rewriting JavaScript Content" on page 39. When this page is parsed by the gateway, it will go tag by tag and compare the attribute names to those in the Rewrite HTML Attributes section of the gateway profile. If a name matches, the gateway will attempt to translate the attribute value as a raw URL. If the HTML attribute name appears in the Rewrite HTML Attributes Containing JavaScript list, the gateway will attempt to translate the JavaScript contents so that they can be resolved into a rewritten URL. In this case, the onResize attribute value contains two JavaScript variable assignments that are raw URLs. After onResize, content.location.href, and options.location.href have been added to the appropriate gateway profile sections, this entire page should be rewritten correctly.
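The attribute lookup described above can be pictured with a small sketch. This is not the gateway's implementation; the two lists below are assumed profile contents chosen to match this example, and the function name classifyAttribute is hypothetical.

```javascript
// Assumed profile contents, for illustration only.
const rewriteHtmlAttributes = ["src", "href", "action", "codebase"];
const rewriteHtmlAttributesContainingJavaScript = ["onresize"];

// Decide how an attribute value would be treated, based on its name alone.
function classifyAttribute(name) {
  const key = name.toLowerCase();
  if (rewriteHtmlAttributes.includes(key)) {
    return "raw-url"; // value is translated as a raw URL
  }
  if (rewriteHtmlAttributesContainingJavaScript.includes(key)) {
    return "javascript"; // value is scanned as JavaScript content
  }
  return "untouched"; // value passes through unchanged
}

for (const name of ["SRC", "onResize", "NAME"]) {
  console.log(name, "->", classifyAttribute(name));
}
```

In this picture, leaving onResize out of the second list means the two location.href assignments inside it are never examined, which is consistent with the resize problem described for this page.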

The other half of the problem in this example has to do with the tab links not working. Because this is a frame document, the source for the top frame containing the source code definitions for the tabs will have to be consulted. In this case, that would be index_top.html:

<HTML>
<HEAD>
<SCRIPT>
 <!--
   function openTab(id, link) {
    parent.content.location = link;
   }
 //-->
</SCRIPT>
</HEAD>
<BODY>
<A HREF="javascript:location.reload();" onClick="openTab(0,
 'http://www.iplanet.com');">
 <IMG SRC="images/LeftTab.gif">
 iPlanet Home
<IMG SRC="images/RightTab.gif">
</A>
</BODY>
</HTML>

In this particular instance, having parent.content.location in the Rewrite JavaScript Variables in URLs section will not correctly rewrite the value of link in the openTab function body. The reason for this is that link is not a raw URL, so the gateway does not know how to rewrite it. The value of link will not be defined until the Anchor link is selected in the browser and the string value http://www.iplanet.com is passed to the openTab function.

There are two ways to handle this. One is to add openTab:,y to the Rewrite JavaScript Function Parameters section of the gateway profile. This would prepend the gateway URL to the second parameter of the openTab function call within the Anchor tag. The other option is to add parent.content.location to the Rewrite JavaScript Variables Function section of the gateway profile. This will insert a function called iplanet within the SCRIPT tag and change link on the right side of the variable assignment to iplanet(link).

Starting in SP4 Hot Patch 1, the iplanet function definition occurs in the document HEAD element in its own SCRIPT element, instead of being placed multiple times throughout the document.

This will result in the link URL being rewritten by the client at runtime using the browser JavaScript engine. Because parent.content.location would have already been added to the gateway profile in the Rewrite JavaScript Variables in URLs section to correctly rewrite the parent document, the better option might be to rewrite the openTab function parameter. Otherwise, parent.content.location could be moved from the Rewrite JavaScript Variables in URLs section to the Rewrite JavaScript Variables Function, which would change the variable assignment in the parent document to: content.location.href=iplanet('index_right.html');

If the openTab function call looked like:

openTab(0, top.location.href)

then openTab:,y would have to be added to the Rewrite JavaScript Function Parameters Function section of the gateway profile. This option is only applicable starting in the SP4 Hot Patch 1 release. This avoids the problem where the iplanet function definition would be placed within the HTML tag body.

If optimization is the goal where there may be limited compute power on the client, reducing or eliminating the number of times the client has to resolve URLs using the iplanet function is a good idea. If flexibility is the goal, specifically where the same variable name is used in a variety of contexts, using the iplanet function to dynamically resolve the URLs is a good idea.
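The trade-off between the two approaches can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the gateway address gateway.example.com is hypothetical, and iplanetStandIn is only a stand-in for the function the gateway inserts, whose actual body is not shown in this article. The sketch assumes, as described earlier, that rewriting amounts to prepending the gateway URL.

```javascript
// Hypothetical gateway address, used only for illustration.
const GATEWAY_URL = "https://gateway.example.com/";

// Stand-in for the client-side function the gateway inserts.
// Assumption: rewriting an absolute URL means prepending the gateway URL.
function iplanetStandIn(url) {
  return GATEWAY_URL + url;
}

// Option 1: a function parameter rule rewrites the literal second parameter
// before the page reaches the browser, so no work is left for the client.
const rewrittenByGateway = GATEWAY_URL + "http://www.iplanet.com";

// Option 2: the assignment becomes parent.content.location = iplanet(link),
// so the value is resolved in the browser each time openTab runs.
function openTabTarget(id, link) {
  return iplanetStandIn(link);
}

console.log(rewrittenByGateway);
console.log(openTabTarget(0, "http://www.iplanet.com"));
```

Both paths end at the same URL. The difference is where the work happens: once on the gateway, or on every call in the browser.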

Does the Web Application Create Content Dynamically?

Special considerations may have to be made if content that is accessed through the gateway is generated dynamically by CGIs, servlets, or JavaServer Pages™ technology. Rule extraction is fundamentally the same for dynamic content as it is for static content, except that care needs to be taken in any direct manipulation of the HTTP headers. Also, the original application source code may have to be referred to, or even modified, to more easily determine where URLs might reside and how to best have the gateway handle them.

One thing to verify with applications that create content dynamically is that the appropriate Content-Type header is set. Otherwise, the gateway may incorrectly rewrite the content, or not rewrite it at all. In a Perl application, for example, the Content-Type is usually the first thing added to the response, and it generally looks something like this:

print "Content-Type: text/html\n\n";

The content-type HTTP header tells the gateway which environment to use to rewrite the content that will follow. Currently, the only content-types that are rewritten by the gateway are text/html, text/htm, application/x-javascript, text/css, text/xml, and text/x-component. These entries can be seen by selecting the Show Advanced Options button at the bottom of the gateway component profile using the administration console.
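As a quick check while debugging, a response's Content-Type can be compared against that list. The sketch below is illustrative only: the function name isRewrittenType is hypothetical, and stripping parameters such as a charset before comparing is an assumption, not documented gateway behavior.

```javascript
// Content types the text above lists as rewritten by the gateway.
const REWRITTEN_TYPES = [
  "text/html",
  "text/htm",
  "application/x-javascript",
  "text/css",
  "text/xml",
  "text/x-component",
];

// Hypothetical helper: would this Content-Type header value be rewritten?
function isRewrittenType(headerValue) {
  // Assumption: parameters such as "; charset=..." are ignored.
  const mimeType = headerValue.split(";")[0].trim().toLowerCase();
  return REWRITTEN_TYPES.includes(mimeType);
}

console.log(isRewrittenType("text/html"));
console.log(isRewrittenType("image/gif"));
```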

The content-types can then be seen in the MIME Type Translator Class section of the gateway profile. Adding additional content-types here will work only if the content does not contain special tagging conventions or if it is plain text. URLs outside of the tags themselves will not be rewritten, with the one exception being the first string that begins with a protocol identifier after the SPAN tag. This was done for compatibility purposes with Microsoft Exchange's Web interface.

The following is an example:

<SPAN>http://www.iplanet.com</SPAN>

There are some other things to keep in mind for integrating applications with the gateway in general. One is to avoid any explicit dependence on the content-length header. Once the gateway rewrites the source, the content length will differ from the value the web application originally set. This fact is easy to overlook. A problem with how the content-length header is being manipulated might manifest itself as a page truncation.
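The size change is easy to see with a single link. This sketch assumes a hypothetical gateway address (gateway.example.com) and assumes that rewriting prepends the gateway URL to the original URL; the byteLength helper is introduced here only for the illustration.

```javascript
// Hypothetical gateway address, used only for illustration.
const GATEWAY_URL = "https://gateway.example.com/";

// Number of bytes the text occupies when encoded as UTF-8.
function byteLength(text) {
  return new TextEncoder().encode(text).length;
}

const original = '<A HREF="http://www.iplanet.com/">iPlanet Home</A>';
// Assumption: rewriting prepends the gateway URL to the link target.
const rewritten = original.replace('HREF="', 'HREF="' + GATEWAY_URL);

console.log("before:", byteLength(original));
console.log("after: ", byteLength(rewritten));
```

Every rewritten URL grows the body by the length of the gateway prefix, so a content-length computed before rewriting no longer matches the body that is delivered.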

When integrating with a JavaServer Pages application, be sure that all of the tags are resolved by the JavaServer Pages engine prior to going through the gateway. Depending on what the tag looks like, the gateway may attempt to rewrite its attributes, making a broken tag, otherwise invisible to the end-user, become visible.

The JavaScript content that passes through the gateway should be syntactically correct. If it is not, the problem can manifest itself by misplacing the iplanet function, or incorrectly parsing the SCRIPT tag and corrupting the page output. This page corruption sometimes shows two SCRIPT closing tags with no opening tag and may even move the entire SCRIPT block to a different location in the page source. Another often overlooked issue with the JavaScript application integration is the closing comment to hide the JavaScript code from non-compliant browsers. Unlike its HTML equivalent, the closing comment should have two leading slashes in front of it.

The following is an example:

<SCRIPT>
  <!-- Hide from non-compliant browsers

  // -->
</SCRIPT>

One of the difficulties with dynamically created content is that rule extraction tends to be more difficult because URLs are also created dynamically. Also, some applications will attempt to prevent the end-user from being able to view the source of the web application. This is usually done through trickery by trapping the right mouse event or through code obfuscation. Code obfuscation may make the task of rule extraction difficult, if not impossible, and should be avoided for applications that pass content through the gateway.

If, for instance, the code was not only generated dynamically, but obfuscated dynamically as well, the variable names would never be the same, and thus, reliable string matching would not be possible. Even if the obfuscation was only run once, there is a risk of local variables being obfuscated to the same name, which might have unpredictable results. If for some reason the source code cannot be viewed using the browser, the rewritten content can also be viewed by setting the ips.debug option in the gateway /etc/opt/SUNWips/platform.conf file to Message and restarting the gateway.

The /var/opt/SUNWips/debug/iwtGateway file will contain the document source after page translation. For a busy gateway, you may have to use vi or your favorite editor to search for the browser GET request and for the translated response, or just search for log entries beginning with: HTMLTranslator:Begin:

As mentioned previously, it may be easier in some cases to extract rules from the web application's source code, if it is available. The reason for this is that applications generally contain functional blocks. For example, if the end-user is experiencing problems with a navigation button accessing a Perl application through the gateway, and the Perl program contains a subroutine called buildNavBar, that may be a better place to start than searching the document view source or debug logs. Sometimes the opposite may be true because the browser source may be a great deal less complex than the web application source. This might be the case if you have a for loop in your web application that is creating a JavaScript for loop block that is dynamically creating image URLs to be used for mouseOvers. The web application might also contain variables that only have meaning within the web application and might not ever be seen by the gateway.

Automated Extraction Techniques

Extracting rules by hand is not always a straightforward experience. Attempting to automate that process may prove to be difficult as well, depending on the level of complexity of content that will be accessed through the gateway. For a sidebar on differing levels of complexity, refer to "CASE Studies: How to Configure the Gateway to Rewrite a Web-Based JavaScript Navigation Bar" on page 69 about rewriting a JavaScript web navigation bar.

An ideal companion for the gateway administrator might be an automated recommendation engine that works as a Web crawler, mines out URLs that might need special consideration, and makes judgements as to how they might best be handled in the gateway profile. Better yet would be to have the recommendation engine automatically add the rule to the gateway profile when it is at least 90 percent sure that the rule is not only needed but also will not regress any other rules or cause other content to be incorrectly handled.

Unfortunately, such a tool is not available today, so gateway administrators must rely on their own skills (and scripting abilities) to find URLs in the content. Luckily, as the document object model has continued to catch on, more and more developers are starting to manipulate document objects directly. These object references can usually be matched by a regular expression and tend to be assignments with a raw URL on the right side of the variable assignment.

The following is an example:

document.images["IMG"+imgNum].src = "../../images/img"+imgNum+".gif";

This example uses a predefined JavaScript array and is useful for a couple of reasons. One is that any reference to an array with a SRC property will likely have to be rewritten. The same is true for HREF. Also, document.images["IMG"+imgNum].src cannot be added to the gateway profile because the brackets are not understood. In SP3 Hot Patch 3 and SP4 Hot Patch 1, functionality was added to be able to use wildcards with these kinds of rules so that not only could they work for array references, but also to reduce the total number of rules that need to be added to the gateway profile. This rule optimization is particularly beneficial when gateway logging is enabled and when there are many similar rules defined.

For the example above, a rule like document.images*.src could be added to the Rewrite JavaScript Variables in URLs section of the gateway profile. Because it is known that there are other arrays that contain SRC properties that are also document arrays, the rule can be revised to document*.src.

However, there are also window objects in the object hierarchy, such as window.frames, that also have SRC attributes. Both the document and window objects can have this as a placeholder for the actual object name. Some objects have an HREF property as well, so two rules can be added that would account for a great deal of content that uses the JavaScript object hierarchy directly and/or the JavaScript predefined arrays to access the object hierarchy. These rules would be: *.src and *.href

Because these are the most generalized rules, they should occur before other rules to improve the performance of the rewriter.
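To get a feel for how much a single wildcard rule can cover, the matching can be approximated with a regular expression. This is a sketch under an assumption: it treats * as "any run of characters", which may not be exactly how the gateway interprets wildcards. The helper names are hypothetical.

```javascript
// Convert a wildcard rule such as "document.images*.src" into a RegExp.
// Assumption: "*" stands for any run of characters.
function wildcardToRegExp(rule) {
  const escaped = rule
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + "$");
}

// Does the rule cover the given JavaScript object reference?
function matchesRule(rule, reference) {
  return wildcardToRegExp(rule).test(reference);
}

const reference = 'document.images["IMG"+imgNum].src';
for (const rule of ["document.images*.src", "document*.src", "*.src", "*.href"]) {
  console.log(rule, "->", matchesRule(rule, reference));
}
```

Trying candidate rules against references collected from the application in this way can help decide how general a rule needs to be before it is added to the profile.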

The other thing to notice from the example is that the right side of the assignment begins with ../../images/. This would be a relative URL to the images directory that contains the prepended path information.

Because the first portion of the right side is a string literal that contains prepended path information, it is considered a raw URL, meaning that it is directly resolvable by the gateway. If the assignment instead looked like the following:

document.images["IMG"+imgNum].src = imgURL + imgNum + ".gif";

then the image SRC URL would not be rewritten because the gateway does not understand what imgURL is, and imgURL could also change at runtime because it is a variable. Also, as of SP3 Hot Patch 3, JavaScript wildcarding works only for rules that are added to the Rewrite JavaScript Variables in URLs section of the gateway profile. To rewrite the second case then, imgURL would have to be added to the gateway profile in either the Rewrite JavaScript Variables in URLs section or the Rewrite JavaScript Variables Function section, depending on its usage in other areas of the application.

The last thing this example demonstrates is that if there are no rules added to the gateway profile for this example and the page is still accessed by redirecting through the gateway, the image will still be handled correctly. This behavior was alluded to earlier when comparing the rewriter to the browser where the browser may actually resolve relative URLs using the location field as though it were actually a BASE tag. In fact, if you set the cache size to a nonzero value large enough to cache the page, then view the source using the Netscape Navigator browser, the BASE tag will be included as part of the source. This is one reason why it is also important to determine rule extraction with the browser cache set to a size of zero and to check for updated pages every time. Otherwise, you may not see any JavaScript content when you view the source because it has already been rendered once and cached by the browser.

This is especially true if the web application contains numerous JavaScript document.write calls. If this example were included as a URL scraped channel on the desktop or if the JavaScript portion of the example was imported using a SCRIPT SRC attribute, then the relative URLs would not be handled correctly. The scraped channel would not work because the BASE equivalent would always have a path of DesktopServlet, and the host would likely be incorrect as well. There is a fix available for the latter in SP4 Hot Patch 1 for the Internet Explorer browser. The Netscape Navigator browser does not send a document referrer header, so it is not possible for the gateway to determine the parent document URL to use to resolve relative links in the imported JavaScript code.
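The effect of the BASE equivalent can be reproduced with the standard URL class. Both base URLs below are hypothetical: the first stands for the real parent document, the second for a desktop page whose path is DesktopServlet on a different host.

```javascript
// Resolve a relative URL the way a browser would against a base URL.
function resolveAgainst(relativeUrl, baseUrl) {
  return new URL(relativeUrl, baseUrl).href;
}

const relativeImage = "../../images/img3.gif";

// Hypothetical parent document on the content server.
const parentDocument = "http://www.iplanet.com/app/nav/index.html";
// Hypothetical desktop URL, as in the scraped-channel case.
const desktopPage = "http://portal.example.com/DesktopServlet";

console.log(resolveAgainst(relativeImage, parentDocument));
console.log(resolveAgainst(relativeImage, desktopPage));
```

Against the parent document, the image resolves to the content server's images directory. Against the desktop URL, both the host and the path come from the wrong document, which is the failure described for scraped channels.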

The term raw URL has been referred to throughout this document without much explanation other than what can be derived from the examples given. Understanding what a raw URL is will help in determining what rule to use and in what section of the gateway profile it should reside. A raw URL is any string that can be clearly identified as a URL. Raw URLs have relevance when rewriting HTML attributes, FORM INPUT tags, and APPLET and/or OBJECT parameters, but it is most useful to differentiate a raw URL in JavaScript content where a variable assignment occurs. As such, the following is a good rule of thumb for determining raw URLs in JavaScript content. A raw URL in JavaScript content must follow these conventions:

  1. Is a string literal enclosed by matching single or double quotes.

  2. Usually, but not always, contains prepended path information.

    • Prepended path information can be relative or absolute.

    • The prepended path information must all be in the first string literal after the variable assignment.

    • If no prepended path information is provided, the FQD + path to the parent document is used as the BASE equivalent.

  3. It is not built on separate lines by using a concatenation operator.

    Examples of JavaScript variable assignments that are raw URLs:

    var myURL = "http://www.sun.com/" + prodPath + "solaris";

    The above is a fully qualified prepended path without path remainder.

    img = "../../images/myimg.gif";

    The above is a relative prepended path with path remainder.

    newImg = "../../" + "images/newimg.gif";

    The above is a relative prepended path with no path remainder.

    URL = 'images/' + imgNum + '.gif';

    The above has no prepended path with no path remainder.

    The following are examples of JavaScript variable assignments that are not raw URLs:

    var offImg = "../" + "../" + "images/off.gif";

    The above is a prepended path that is split.

    var mouseOverImg = up2dir + "images/mouseover.gif";

    In the above example, up2dir is a variable.

    surfToNewPage += '?param1=val&' + param2Name + '=val2';

    The above example contains multiple assignments using the += operator.

    Typically, JavaScript variable assignments that contain raw URLs are added to the Rewrite JavaScript Variables in URLs section of the gateway profile.
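The rule of thumb above can be turned into a rough checker for single-line assignments. This is a heuristic sketch, not the gateway's parser: the function name isRawUrlAssignment is hypothetical, and the test for a split prepended path (a later string literal that starts with "../" or "./") is an assumption drawn from the examples.

```javascript
// Heuristic check of the raw URL conventions for a one-line assignment.
function isRawUrlAssignment(statement) {
  const parts = statement.match(/^\s*(?:var\s+)?(.+?)\s*(\+?=)\s*(.*?);?\s*$/);
  if (!parts) return false;
  const operator = parts[2];
  const rightSide = parts[3];

  // Built up in several steps with += : not a raw URL.
  if (operator === "+=") return false;

  // The right side must begin with a string literal in matching quotes.
  const first = rightSide.match(/^(["'])(.*?)\1\s*(.*)$/);
  if (!first) return false;

  // Assumption: a later literal starting with "../" or "./" means the
  // prepended path has been split across literals.
  for (const literal of first[3].matchAll(/(["'])(.*?)\1/g)) {
    if (literal[2].startsWith("../") || literal[2].startsWith("./")) {
      return false;
    }
  }
  return true;
}

const samples = [
  'img = "../../images/myimg.gif";',
  'var offImg = "../" + "../" + "images/off.gif";',
  "surfToNewPage += '?param1=val&' + param2Name + '=val2';",
];
for (const sample of samples) {
  console.log(isRawUrlAssignment(sample), sample);
}
```

A checker like this is only a first pass for sorting candidates. Multi-line assignments and anything the pattern does not recognize still need to be read by hand.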

    One of the best ways to automate rule extraction is to do string matching directly on the content that will be accessed through the gateway. If the content is stored locally, or if gateway logging has been set to Message and the content has already been accessed using the browser, you might be able to use the grep(1) command to find content in pages that contain URL references. This is a simple approach, but it may prove to be more powerful than you initially think.

    The following is an example of how to find URL variable assignments in imported JavaScript content:

    $ find ./htdocs -name '*.js' -exec grep '= "http' {} \; >> /tmp/jsAssignmts.txt
    $ cat /tmp/jsAssignmts.txt
    var url = "http://www.iplanet.com/bugsplat/show_bug.cgi?id=" + bug_id;
    var url = "https://http://www.iplanet.com/cgi-bin/gx.cgi/AppLogic+WebCall.CaseDetails?case_id=" + case_id;
    var theWebCallURL = "https://http://www.iplanet.com:443";
    var url = "http://www.iplanet.com/bugsplat/show_bug.cgi?id=" + bug_id;
    var url = "https://http://www.iplanet.com:443/WebCall/wait.html";

    Because url is in the gateway profile by default and all of the right-side values are raw URLs, the only rule to be added is theWebCallURL, which will go in the Rewrite JavaScript Variables in URLs section. Note that this particular search matches only assignments in files with a .js extension where the assignment operator is followed by exactly one space and a double-quoted string that begins with a protocol identifier.
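The same idea can be taken one step further by pulling out just the variable names and dropping those already covered. The sketch below is illustrative: the function name findUrlAssignments and the sample host webcall.example.com are hypothetical, and alreadyInProfile lists only the one default entry mentioned above.

```javascript
// Collect the names of variables assigned a double-quoted http/https URL.
function findUrlAssignments(source) {
  const pattern = /([A-Za-z_$][\w$.]*)\s*=\s*"https?:/g;
  const names = new Set();
  for (const match of source.matchAll(pattern)) {
    names.add(match[1]);
  }
  return [...names];
}

// Only the default entry mentioned in the text; a real profile has more.
const alreadyInProfile = ["url"];

// Made-up sample content; the second host is hypothetical.
const sample = `
  var url = "http://www.iplanet.com/bugsplat/show_bug.cgi?id=" + bug_id;
  var theWebCallURL = "https://webcall.example.com:443";
`;

const candidates = findUrlAssignments(sample).filter(
  (name) => !alreadyInProfile.includes(name)
);
console.log(candidates);
```

Unlike the grep pattern, this version tolerates any amount of whitespace around the assignment operator, but it still sees only double-quoted literals that start with a protocol identifier.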

    The following is an example of how to look at the onClick JavaScript event handlers to see if any content requires gateway profile entries:

    $ find ./htdocs \( -name '*.htm' -o -name '*.html' \) \
        -exec grep -i 'onClick=' {} \;
    document.write('ONCLICK="parent.tabSet--" ONMOUSEOVER="status=\'Back\';
    return true;">');
    document.write('ONCLICK="parent.tabSet++" ONMOUSEOVER="status=\'More\';
    return true;">');
    document.write('<TD ROWSPAN="2"><A HREF="javascript:location.reload()"
    TARGET="_self" ONCLICK="openTab('+id+', \''+link+'\');" ');
    document.write('<TD ROWSPAN="2"><A HREF="javascript:location.reload()"
    TARGET="_self" ONCLICK="openTab('+id+', \''+link+'\')" ');
    document.write('ONCLICK="openTab('+id+', \''+link+'\');
    setDirAccess('+id+');" ONMOUSEOVER="status=\''+name+'\'; return true;"><FONT
    SIZE="2" ');
    document.write('ONCLICK="openTab('+id+', \''+link+'\');
    setDirAccess('+id+');" ONMOUSEOVER="status=\''+name+'\'; return true;"><FONT
    SIZE="2" ');
    <INPUT TYPE ="button" VALUE="open Window" onclick="openWin()">
    <INPUT TYPE=BUTTON VALUE="Build BugList"
    onClick="location.href='http://www.iplanet.com.com/bugsplat/buglist.cgi?bug_id=344836...'">

    Many of the onClick event handler values do not relate to URLs at all, so they can be ignored. One entry contains location.href, which would be handled by the *.href rule suggested earlier in this section. One other thing to be concerned about is the second parameter to the openTab function. Because link is a JavaScript variable instead of a raw URL, it will have to be handled either where link is first initialized to a URL value or within the body of openTab itself. Because the variable name is now known, you can find out where it is initialized by typing:

    $ find ./htdocs \( -name '*.htm' -o -name '*.html' \) \
        -exec grep 'link =' {} \;

    In this case, nothing is returned, which indicates that link is probably used only in the context of a function parameter to the openTab function, or possibly other functions as well. So the only way to determine how to rewrite link is by looking at the source code for the openTab function definition. If openTab did nothing more than have a function call to open another window to the link URL, then the web application source code would have to be modified to allow for rule creation.

    The following is an example of the source code:

    function openTab(id, link) {
      // Open the link URL in a separate window named "displayWindow".
      window.open(link, "displayWindow", "menubar=yes,location=yes,status=yes");
    }

    Even though window.open is in the gateway profile out-of-box (see "Out-Of-Box Rule Set" on page 26), link still cannot be resolved using syntax interpretation alone. This can be resolved by moving window.open:y from the Rewrite JavaScript Function Parameters section to the Rewrite JavaScript Function Parameters Function section of the gateway profile, which wraps the first window.open parameter in the iplanet function so that it is rewritten dynamically by the client at runtime. When doing this, make sure that other pages are not regressed by the rule change.
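The effect of that rule on the page source can be pictured as a text substitution. This is a deliberately naive sketch: the helper name wrapFirstParameter is hypothetical, and a single regular expression like this would not cope with nested calls or commas inside the first argument.

```javascript
// Wrap the first argument of every window.open call in iplanet(...).
// Naive by design: assumes the first argument contains no commas or parentheses.
function wrapFirstParameter(source) {
  return source.replace(/window\.open\(\s*([^,)]+)/g, "window.open(iplanet($1)");
}

const before =
  'window.open(link, "displayWindow", "menubar=yes,location=yes,status=yes");';
console.log(wrapFirstParameter(before));
```

After the substitution, the value of link is passed through the iplanet function in the browser, which is why this approach works even though link is unknown when the gateway parses the page.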

    One example would be if the window.open method were called within a JavaScript event handler in some other content. The gateway would then attempt to insert the entire iplanet function definition inline, within the HTML tag itself. In SP3 Hot Patch 3, this problem manifests itself as JavaScript code output to the visible portion of the web page in the browser. With this in mind, it is good practice to keep variables and functions that are used inside attributes listed in Rewrite HTML Attributes Containing JavaScript out of both the Rewrite JavaScript Variables Function and Rewrite JavaScript Function Parameters Function sections, to avoid the iplanet function definition appearing in the HTML tag itself.

    The SP4 Hot Patch 1 release addresses this problem by moving the iplanet function definition to the document HEAD element.
