This chapter is from the book

17.4 Example: A Network Client That Retrieves URLs

Retrieving a document through HTTP is remarkably simple. You open a connection to the HTTP port of the machine hosting the page, send the string GET followed by the address of the document, followed by the string HTTP/1.0, followed by a blank line (at least one blank space is required between GET and address, and between address and HTTP). You then read the result one line at a time. Reading a line at a time was not safe with the mail client of Section 17.3 because the server sent an indeterminate number of lines but kept the connection open. Here, however, a readLine is safe because the server closes the connection when done, yielding null as the return value of readLine.
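The exchange described above can be sketched with a plain java.net.Socket. This is a minimal illustration, not one of the book's listings: the class name RawHttpGet and its helper methods are hypothetical, and the response is assumed to be readable as ISO-8859-1 text. The request is written with explicit CRLF line endings, and the read loop relies on the server closing the connection so that readLine eventually returns null.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the book's code. */
class RawHttpGet {

  /** Build the request line followed by the blank line that ends the request. */
  static String requestFor(String uri) {
    return "GET " + uri + " HTTP/1.0\r\n\r\n";
  }

  /** Read lines until readLine returns null, that is, until the stream ends. */
  static List<String> readLines(BufferedReader in) throws IOException {
    List<String> lines = new ArrayList<>();
    String line;
    while ((line = in.readLine()) != null) {
      lines.add(line);
    }
    return lines;
  }

  /** Open a socket, send the request, and collect the whole response. */
  static List<String> fetch(String host, int port, String uri)
      throws IOException {
    try (Socket socket = new Socket(host, port)) {
      OutputStream out = socket.getOutputStream();
      out.write(requestFor(uri).getBytes(StandardCharsets.US_ASCII));
      out.flush();
      // Assumption: the response can be decoded as ISO-8859-1 text.
      BufferedReader in = new BufferedReader(new InputStreamReader(
          socket.getInputStream(), StandardCharsets.ISO_8859_1));
      return readLines(in);
    }
  }

  /** Usage: java RawHttpGet host port uri */
  public static void main(String[] args) throws IOException {
    for (String line : fetch(args[0], Integer.parseInt(args[1]), args[2])) {
      System.out.println(line);
    }
  }
}
```

The request-building and line-reading steps are kept in separate methods so that each can be tried without a network connection, for example by wrapping a StringReader in a BufferedReader.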

Although quite simple, even this approach is slightly harder than necessary, because the Java programming language has built-in classes (URL and URLConnection) that simplify the process even further. These classes are demonstrated in Section 17.5, but connecting to an HTTP server "by hand" is a useful exercise, both to prepare yourself for protocols that don't have built-in helper methods and to gain familiarity with HTTP itself. Listing 17.8 shows a telnet connection to the www.corewebprogramming.com HTTP server running on port 80.

Listing 17.8 Retrieving an HTML document directly through telnet

Unix> telnet www.corewebprogramming.com 80
Trying 216.248.197.112...
Connected to www.corewebprogramming.com.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Sat, 10 Feb 2001 18:04:17 GMT
Server: Apache/1.3.3 (Unix) PHP/3.0.11 FrontPage/4.0.4.3
Connection: close
Content-Type: text/html 

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
...
</HTML>
Connection closed by foreign host.

In this telnet session, the document was retrieved through a GET request. In other cases, you may want to receive only the HTTP headers associated with the document. For instance, link validators are an important class of network program; they verify that the links in a specified Web page point to "live" documents. Writing such a program in the Java programming language is relatively straightforward, but to limit load on your servers, you probably want the program to use HEAD instead of GET (see Section 19.7, "The Client Request: HTTP Request Headers"). Java has no helping class for simply sending a HEAD request, but only a trivial change in the following code is needed to perform this request.

A Class to Retrieve a Given URI from a Given Host

Listing 17.9 presents a class that retrieves a file given the host, port, and URI (the filename part of the URL) as separate arguments. The application uses the NetworkClient class shown earlier in Listing 17.1 to send a single GET line to the specified host and port, then reads the result a line at a time, printing each line to the standard output.

Listing 17.9 UriRetriever.java

import java.net.*;
import java.io.*;

/** Retrieve a URL given the host, port, and file as three 
 * separate command-line arguments. A later class 
 * (UrlRetriever) supports a single URL instead.
 */

public class UriRetriever extends NetworkClient {
 private String uri;

 public static void main(String[] args) {
  UriRetriever uriClient
   = new UriRetriever(args[0], Integer.parseInt(args[1]),
             args[2]);
  uriClient.connect();
 }

 public UriRetriever(String host, int port, String uri) {
  super(host, port); 
  this.uri = uri;
 }

 /** Send one GET line, then read the results one line at a
  * time, printing each to standard output.
  */

 // It is safe to use blocking IO (readLine), since
 // HTTP servers close connection when done, resulting
 // in a null value for readLine.
 
 protected void handleConnection(Socket uriSocket)
   throws IOException {
  PrintWriter out = SocketUtil.getWriter(uriSocket);
  BufferedReader in = SocketUtil.getReader(uriSocket);
  // HTTP lines end in CRLF; the empty line marks the end of the request.
  out.print("GET " + uri + " HTTP/1.0\r\n\r\n");
  out.flush();
  String line;
  while ((line = in.readLine()) != null) {
   System.out.println("> " + line);
  }
 }
}
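The HEAD variation mentioned earlier amounts to changing the method name in the request line. The following sketch is an assumption-laden illustration rather than part of the book's code: the class HeadRequestSketch and its methods are hypothetical names, and the canned response in main is copied from typical header output instead of being read from a live server. It builds the request, reads header lines up to the first blank line, and pulls the numeric status code out of the status line, which is the piece of information a link validator would most likely check.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the book's code. */
class HeadRequestSketch {

  /** Build a request for the given method (for example GET or HEAD). */
  static String request(String method, String uri) {
    return method + " " + uri + " HTTP/1.0\r\n\r\n";
  }

  /** Collect lines up to the first blank line or the end of the stream. */
  static List<String> readHeaders(BufferedReader in) throws IOException {
    List<String> headers = new ArrayList<>();
    String line;
    while ((line = in.readLine()) != null && !line.isEmpty()) {
      headers.add(line);
    }
    return headers;
  }

  /** Extract the status code from a line such as "HTTP/1.1 404 Not found". */
  static int statusCode(String statusLine) {
    String[] parts = statusLine.split(" ", 3);
    return Integer.parseInt(parts[1]);
  }

  public static void main(String[] args) throws IOException {
    System.out.print(request("HEAD", "/index.html"));
    // Canned response text standing in for what a server might send.
    String canned = "HTTP/1.1 404 Not found\r\n"
        + "Content-type: text/html\r\n"
        + "Connection: close\r\n\r\n";
    List<String> headers =
        readHeaders(new BufferedReader(new StringReader(canned)));
    System.out.println("Status code: " + statusCode(headers.get(0)));
  }
}
```

In UriRetriever itself, the corresponding change would be limited to the string sent in handleConnection; the read loop can stay as it is, because the server still closes the connection after sending the headers.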

A Class to Retrieve a Given URL

The previous program requires the user to pass the hostname, port, and URI as three separate command-line arguments. Listing 17.10 improves on this program by building a front end that parses a whole URL, using StringTokenizer (Section 17.2), then passes the appropriate pieces to the UriRetriever.

Listing 17.10 UrlRetriever.java

import java.util.*;

/** This parses the input to get a host, port, and file, then
 * passes these three values to the UriRetriever class to
 * grab the URL from the Web.
 */

public class UrlRetriever {
 public static void main(String[] args) {
  checkUsage(args);
  StringTokenizer tok = new StringTokenizer(args[0]);
  String protocol = tok.nextToken(":");
  checkProtocol(protocol);
  String host = tok.nextToken(":/");
  String uri;
  int port = 80;
  try {
   uri = tok.nextToken("");
   if (uri.charAt(0) == ':') {
    tok = new StringTokenizer(uri);
    port = Integer.parseInt(tok.nextToken(":/"));
    uri = tok.nextToken("");
   }
  } catch(NoSuchElementException nsee) {
   uri = "/";
  }
  UriRetriever uriClient = new UriRetriever(host, port, uri);
  uriClient.connect();
 }

 /** Warn user if the URL was forgotten. */
 
 private static void checkUsage(String[] args) {
  if (args.length != 1) {
   System.out.println("Usage: UrlRetriever <URL>");
   System.exit(-1);
  }
 }

 /** Tell user that this can only handle HTTP. */
 
 private static void checkProtocol(String protocol) {
  if (!protocol.equals("http")) {
   System.out.println("Don't understand protocol " + protocol);
   System.exit(-1);
  }
 }
}
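To see how the tokenizer-based parsing treats different inputs, the same steps can be isolated in a method that returns the three pieces instead of opening a connection. This is only a sketch: the class name UrlParseSketch, its fields, and the www.example.com sample URLs are hypothetical, and an unsupported protocol is reported with an exception rather than System.exit.

```java
import java.util.NoSuchElementException;
import java.util.StringTokenizer;

/** Hypothetical standalone version of the parsing step in UrlRetriever. */
class UrlParseSketch {
  final String host;
  final int port;
  final String uri;

  private UrlParseSketch(String host, int port, String uri) {
    this.host = host;
    this.port = port;
    this.uri = uri;
  }

  static UrlParseSketch parse(String url) {
    StringTokenizer tok = new StringTokenizer(url);
    String protocol = tok.nextToken(":");
    if (!protocol.equals("http")) {
      throw new IllegalArgumentException(
          "Don't understand protocol " + protocol);
    }
    // Skips the "://" delimiters and stops at the next ':' or '/'.
    String host = tok.nextToken(":/");
    int port = 80;
    String uri;
    try {
      // An empty delimiter set returns everything that is left.
      uri = tok.nextToken("");
      if (uri.charAt(0) == ':') {
        tok = new StringTokenizer(uri);
        port = Integer.parseInt(tok.nextToken(":/"));
        uri = tok.nextToken("");
      }
    } catch (NoSuchElementException nsee) {
      // Nothing after the host (or after the port): ask for the root.
      uri = "/";
    }
    return new UrlParseSketch(host, port, uri);
  }

  public static void main(String[] args) {
    String[] samples = {
      "http://www.example.com",
      "http://www.example.com/index.html",
      "http://www.example.com:8080/a/b.html"
    };
    for (String sample : samples) {
      UrlParseSketch p = parse(sample);
      System.out.println(sample + " -> host=" + p.host
          + ", port=" + p.port + ", uri=" + p.uri);
    }
  }
}
```

With this logic, a URL that has no path falls back to "/" and port 80, while a URL with an explicit port yields that port and whatever follows it as the URI. A port that is not a number would still raise NumberFormatException, just as in Listing 17.10.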

UrlRetriever Output

No explicit port number:

Prompt> java UrlRetriever http://www.microsoft.com/netscape-beats-ie.html
> HTTP/1.1 404 Object Not Found
> Server: Microsoft-IIS/5.0
> Date: Fri, 31 Mar 2000 18:22:11 GMT
> Content-Length: 3243
> Content-Type: text/html
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <html dir=ltr>

Explicit port number:

Prompt> java UrlRetriever http://home.netscape.com:80/ie-beats-netscape.html
> HTTP/1.1 404 Not found
> Server: Netscape-Enterprise/3.6
> Date: Fri, 04 Feb 2000 21:52:29 GMT
> Content-type: text/html
> Connection: close
>
> <TITLE>Not Found</TITLE><H1>Not Found</H1> The requested 
object does not exist on this server. The link you followed is 
either outdated, inaccurate, or the server has been instructed 
not to let you have it. 

Hey! We just wrote a browser. OK, not quite, seeing as there is still the small matter of formatting the result. Still, not bad for about four pages of code. But we can do even better. In the next section, we'll reduce the code to two pages through the use of the built-in URL class. In Section 17.6 (WebClient: Talking to Web Servers Interactively) we'll add a simple user interface that lets you issue HTTP requests interactively and view the raw HTML results. Also note that in Section 14.12 (The JEditorPane Component) we showed you how to use a JEditorPane to create a real browser that formats the HTML and lets the user follow the hypertext links.
