This chapter is from the book

SAX and I/O

SAX provides a fair amount of flexibility with respect to I/O handling of serialized XML documents. Wherever SAX expects I/O to occur, the InputSource utility class is used as an extended wrapper around the native I/O stream model (which in the case of Java is java.io.InputStream for byte-oriented I/O and java.io.Reader for character-oriented I/O). The following is the Java definition of InputSource:

package org.xml.sax;
public class InputSource {
// fields and method implementations elided for clarity
    public InputSource();
    public InputSource(String systemId);
    public InputSource(InputStream byteStream);
    public InputSource(Reader characterStream);

    public void setPublicId(String publicId);
    public String getPublicId();

    public void setSystemId(String systemId);
    public String getSystemId();

    public void setByteStream(InputStream byteStream);
    public InputStream getByteStream();

    public void setEncoding(String encoding);
    public String getEncoding();

    public void setCharacterStream(Reader characterStream);
    public Reader getCharacterStream();
}

The primary enhancement that InputSource provides over the native Java I/O types is that it allows a character encoding, a public identifier, and a system identifier to be associated with the stream. When presented with an InputSource, a SAX-based parser first attempts to acquire a character stream using getCharacterStream. If that method returns null, the parser then attempts to acquire a byte stream using getByteStream. If that method also returns null, the parser dereferences the URI returned by getSystemId.
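The lookup order just described can be sketched as a small helper. This is only an illustration of the order in which a parser consults an InputSource, not code taken from any parser; the class name SourceSelection and the method chosenSource are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import org.xml.sax.InputSource;

class SourceSelection {
    // Reports which part of the InputSource a parser would read from,
    // following the order: character stream, byte stream, system identifier.
    static String chosenSource(InputSource in) {
        if (in.getCharacterStream() != null) return "character stream";
        if (in.getByteStream() != null) return "byte stream";
        if (in.getSystemId() != null) return "system identifier";
        return "none";
    }

    public static void main(String[] args) {
        // Only a system identifier is set (example.org is a placeholder)
        InputSource src = new InputSource("http://example.org/doc.xml");
        System.out.println(chosenSource(src));

        // A byte stream takes precedence over the system identifier
        src.setByteStream(new ByteArrayInputStream(new byte[0]));
        System.out.println(chosenSource(src));

        // A character stream takes precedence over both
        src.setCharacterStream(new StringReader("<doc/>"));
        System.out.println(chosenSource(src));
    }
}
```

Note that setting a stream does not clear the system identifier; all three properties can be present at once, and only the order of consultation decides which one supplies the data.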

The getSystemId method is important even when character or byte streams are used, as it provides the [base URI] property that is used to normalize relative URIs contained in the serialized stream. For that reason, it is critical that applications set this property even when they are providing their own byte/character streams. Consider the following code:

org.xml.sax.InputSource getMyXML(String url)
                          throws java.io.IOException {
  java.net.URL u = new java.net.URL(url);
  java.net.URLConnection conn = u.openConnection();
  java.io.InputStream in = conn.getInputStream();
  // the system identifier is never set on the returned object
  return new org.xml.sax.InputSource(in);
}

Because this code fragment does not set the system identifier property of the InputSource object, any relative URIs contained in the document cannot be correctly resolved. The correct version of this code fragment is as follows:

org.xml.sax.InputSource getMyXML(String url)
                          throws java.io.IOException {
  java.net.URL u = new java.net.URL(url);
  java.net.URLConnection conn = u.openConnection();
  java.io.InputStream in = conn.getInputStream();
  org.xml.sax.InputSource source =
              new org.xml.sax.InputSource(in);
  // record where the bytes came from so relative URIs can be resolved
  source.setSystemId(url);
  return source;
}

This version provides the consumer with the [base URI] Infoset property, ensuring that any relative URIs in the document can be resolved.
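To make the role of the system identifier concrete, the following minimal sketch resolves a relative reference against it using java.net.URI. The class BaseUriDemo, the method resolve, and the example.org addresses are hypothetical, and returning null when no base is available is a simplification; an actual parser may report an error or fall back to some other base.

```java
import java.net.URI;
import org.xml.sax.InputSource;

class BaseUriDemo {
    // Resolves a relative reference against the source's system identifier.
    // Returns null when no system identifier was set (no base URI to use).
    static String resolve(InputSource source, String relative) {
        String base = source.getSystemId();
        if (base == null) return null;
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        InputSource withBase = new InputSource("http://example.org/docs/book.xml");
        // prints http://example.org/docs/chapters/ch1.xml
        System.out.println(resolve(withBase, "chapters/ch1.xml"));

        InputSource withoutBase = new InputSource();
        // prints null: nothing to resolve against
        System.out.println(resolve(withoutBase, "chapters/ch1.xml"));
    }
}
```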

There are two common locations where InputSource is used. The most common is when bootstrapping an XML parser. This usage is discussed in a subsequent section. The more interesting application of InputSource is the EntityResolver interface. EntityResolver is an extensibility interface that an application can implement alongside ContentHandler and the other handler interfaces to provide custom resolution of external entities. By default, when an external entity needs to be resolved, the system identifier is treated as a URI and dereferenced using well-known techniques. However, if an implementation of EntityResolver has been provided to complement the ContentHandler implementation, the EntityResolver's resolveEntity method is called first, giving the implementation an opportunity to provide its own InputSource for a given public/system identifier pair. The Java definition of EntityResolver is extremely simple:

package org.xml.sax;
public interface EntityResolver {
// return null to indicate systemId should be used as URI
  InputSource resolveEntity(String publicId,
                            String systemId)
        throws SAXException, java.io.IOException;
}

If the implementation of resolveEntity returns a non-null InputSource reference, that object's character/byte stream (or systemId) must be used. If a null reference is returned, the default behavior of dereferencing the systemId as a URI will be used.

Consider the following implementation of EntityResolver that prevents all FTP-based access by throwing an exception:

import org.xml.sax.*;
class Resolver1 implements EntityResolver {
  public InputSource resolveEntity(String pub, String sys)
                          throws SAXException {
    if (sys.toUpperCase().startsWith("FTP"))
      throw new SAXException("FTP not allowed");
    return null; // default processing
  }
}
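Because the check above is a plain string prefix test, it would also reject a system identifier that merely begins with the letters "ftp", such as a relative path. A slightly stricter variant, sketched here under the assumption that the system identifier is a syntactically valid URI, compares the parsed scheme instead. The class name SchemeBlockingResolver is hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SchemeBlockingResolver implements EntityResolver {
    public InputSource resolveEntity(String pub, String sys)
                            throws SAXException {
        if (sys == null)
            return null; // nothing to inspect; default processing
        String scheme;
        try {
            scheme = new URI(sys).getScheme(); // null for relative references
        } catch (URISyntaxException e) {
            return null; // not parseable as a URI; leave it to the parser
        }
        if ("ftp".equalsIgnoreCase(scheme))
            throw new SAXException("FTP not allowed: " + sys);
        return null; // default processing
    }
}
```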

The following implementation of EntityResolver redirects all requests destined for one vendor to the boilerplate XML document from another:

import org.xml.sax.*;
class Resolver2 implements EntityResolver {
  public InputSource resolveEntity(String pub, String sys)
                          throws SAXException {
    InputSource result = null;
    if (sys.toLowerCase().startsWith("http://www.sun.com"))
      result = new InputSource("http://redhat.com/bp.xml");
    return result;
  }
}

It is also possible to provide alternative character or byte streams simply by returning an InputSource that contains the appropriate Reader or InputStream.
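As a rough sketch of that last point, the resolver below answers every request with an in-memory character stream, and the surrounding code registers it on a parser through XMLReader.setEntityResolver. The class names, the entity name bp, the canned content, and the example.org address are all hypothetical, and the parser is obtained through the JAXP SAXParserFactory only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class CannedEntityDemo {
    // Supplies a Reader instead of letting the parser dereference the URI
    static class CannedResolver implements EntityResolver {
        public InputSource resolveEntity(String pub, String sys) {
            InputSource result =
                new InputSource(new StringReader("<note>canned</note>"));
            result.setSystemId(sys); // preserve the base URI
            return result;
        }
    }

    // Parses a document that references an external entity and returns
    // the character data the parser reported.
    static String parse() throws Exception {
        String doc =
            "<!DOCTYPE doc [<!ENTITY bp SYSTEM 'http://example.org/bp.xml'>]>"
          + "<doc>&bp;</doc>";
        final StringBuilder text = new StringBuilder();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new CannedResolver());
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(doc)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse());
    }
}
```

Because the resolver returns a non-null InputSource, the parser reads the replacement text from the supplied Reader and never contacts the address named in the entity declaration.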
