The Web
The Internet is a network of computers using TCP/IP. What is the Web? It is the creation of Tim Berners-Lee and is based on his major insight to combine hypertext with the already existing Internet.
First, consider the idea of hypertext, a term coined by Ted Nelson. The standard way of reading a book is in a linear fashion, starting with page one. Hypertext instead allows a person to read or explore in a nonlinear fashion. The key concept is that hypertext contains "links" to other text. By following the links, the reader is not constrained to any particular order. Links need not lead to other text; they can also lead to sound or video files. Before the Web came into being there were hypertext products in the marketplace. One such commercial product was Guide, distributed by Owl Ltd. If you clicked on a link in Guide, a new document would be inserted in place of the link. Also, in keeping with its tradition of being a great innovator, Apple Computer had a product called HyperCard that implemented hypertext. However, these products did not use the Internet.
Tim Berners-Lee was working at CERN (a European particle physics laboratory located near Geneva, Switzerland) in a department charged with processing and recording the results of the scientific experimental work being done there. At CERN there were scientists from many different countries, so there were many different computer operating systems and document formats in use. It was difficult for a scientist working with one computer system to obtain information from a colleague using a different computer system. This is the same problem that faced Bob Taylor at ARPA. Berners-Lee realized that it would not be feasible to force the wide mix of researchers at CERN to reorganize their way of doing things to fit a new system. It was crucial that everyone be able to work with his or her own operating system, but still easily share information. Berners-Lee's solution was to marry hypertext with the Internet. This marriage is the World Wide Web, and it consists of three key components: HTTP, HTML, and the URL, all developed by Berners-Lee.
HTTP (hypertext transfer protocol): Recall that a protocol is a set of rules for exchanging information on a network. HTTP is a high-level protocol used to exchange information between a browser and a server. HTTP uses TCP/IP to locate the server and make a connection between the browser and the server. The messages sent between the browser and server are either request or response messages. The request message contains 1) a request line containing the name of the requested file and whether the request is a GET or POST (see Tech Talk in this section), 2) a header containing information such as the type of browser and operating system, and 3) a body containing data, for example, data entered into a form. The response from the server contains 1) a response line with a code indicating either that the requested file was found or that there was a problem (almost everyone has had to deal with the dreaded HTTP 404 error, file not found), 2) header information such as the type of server software, and 3) a body containing the HTML of the requested file. An HTTP request and response are illustrated in Figure 4-4.
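The three-part structure of the request and response messages described above can be sketched in a few lines of Python. This is only an illustration, not a full HTTP implementation: the function names `build_request` and `parse_response`, the browser and server names, and the sample response text are all made up for the example. The `Host` header is not mentioned in the text, but HTTP/1.1 requires it.

```python
def build_request(method, path, host, headers=None, body=""):
    """Assemble a request message: request line, header, blank line, body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the header from the body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_response(raw):
    """Split a response message into status code, reason, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return int(code), reason, headers, body


# Hypothetical request for the file used in this section's URL example.
request = build_request(
    "GET",
    "/htmls/tmp/foo.html",
    "gsbkip.uchicago.edu",
    {"User-Agent": "ExampleBrowser/1.0"},  # made-up browser name
)

# Made-up response text of the kind a server might send for a missing file.
sample_response = (
    "HTTP/1.1 404 Not Found\r\n"
    "Server: ExampleServer/1.0\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>File not found</body></html>"
)
code, reason, headers, body = parse_response(sample_response)

print(request)
print(code, reason, headers["Server"])
```

In practice a browser or an HTTP library assembles and parses these messages for you; the sketch only makes the layout of the request line, header, and body visible.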
Tech Talk
GET and POST: The request line sent from the browser to the server contains an HTTP command called the method. The method is usually GET or POST. The GET method is a request for a specific URL. With a GET request, the body is empty. The POST method tells the server that data will be sent in the body of the request. The POST method is used when you submit forms.
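The difference between the two methods can be seen with Python's standard `urllib` library, which, like a browser, chooses GET when a request has no body and POST when data is attached. The URL and the form fields below are hypothetical, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.example.com/form"  # hypothetical address

# No data attached: the request has an empty body and uses GET.
get_request = Request(url)

# Form fields are encoded and carried in the body, so the method is POST.
form_data = urlencode({"name": "Ada", "city": "Chicago"}).encode("ascii")
post_request = Request(url, data=form_data)

print(get_request.get_method(), get_request.data)
print(post_request.get_method(), post_request.data)
```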
HTML (hypertext markup language): This is the language used by the browser to display the text and graphics on a Web page. In the next chapter we describe what a markup language is, and how to create a Web page using HTML.
URL (uniform resource locator): This is the "address" of a Web page. When you click on a link in a Web page, you are taken to a new location. The link contains the URL for your destination and the URL must follow a very specific syntax used in naming the destination.
Figure 4-4 An HTTP Request
There are three parts to a URL. They are:
The Internet protocol used, e.g., HTTP (or FTP or Telnet as discussed later)
The address or name of the server
The location and name of the file on the server
Consider the URL in Figure 4-5. In this example, the protocol is HTTP. The name of the server or host machine is gsbkip.uchicago.edu. The target file being requested by the browser is named foo.html. It is located in the directory tmp, which is a subdirectory of htmls. Thus, the URL specifies the directory path for the requested file.
Figure 4-5 The parts of a URL
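The three parts of a URL can also be pulled apart programmatically. The sketch below uses Python's standard `urllib.parse` module. The full URL string is an assumption, pieced together from the protocol, server, directories, and file name described above.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed form of the example URL, rebuilt from the parts named in the text.
url = "http://gsbkip.uchicago.edu/htmls/tmp/foo.html"

parts = urlsplit(url)
protocol = parts.scheme    # the Internet protocol used
server = parts.hostname    # the address or name of the server
directory, filename = posixpath.split(parts.path)  # location and name of the file

print(protocol)
print(server)
print(directory)
print(filename)
```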
Suppose a user with a browser views a Web page that has a link in its hypertext to the file foo.html on the machine gsbkip.uchicago.edu. The text and graphics of the Web page are displayed according to the underlying HTML. The link contains the URL given in the example above, so the packets know what server to go to and what file to retrieve. When the user clicks on that link, an HTTP request for the file foo.html is sent over the Internet to the server machine, gsbkip.uchicago.edu. The server machine then returns the requested file. The beauty of this process is that the operating systems used by the desktop machine and by the server machine are irrelevant. They do not have to be compatible.
There are two pieces of software required for this process to take place. The desktop PC must have a browser such as Netscape Navigator or Internet Explorer. The server machine must have an HTTP server. The HTTP server software is "listening" for packets addressed to it. When we use the term server we are referring to two things: the physical machine and the software on that machine that is serving up the files. When the server software receives packets requesting a file, it sends the requested file back to the desktop PC.
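The exchange between the two pieces of software can be tried on a single machine with Python's standard library. In this minimal sketch, `HTTPServer` stands in for the server software and `HTTPConnection` for the browser. The handler class `FileHandler`, the `fetch` helper, the page contents, and the use of the local address 127.0.0.1 are all assumptions made for illustration. A production server such as Apache does far more than this.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up contents standing in for the file foo.html.
PAGE = b"<html><body>Hello from foo.html</body></html>"


class FileHandler(BaseHTTPRequestHandler):
    """Server software: answers GET requests for one known file."""

    def do_GET(self):
        if self.path == "/htmls/tmp/foo.html":
            self.send_response(200)  # file found
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)
        else:
            self.send_error(404, "File not found")

    def log_message(self, format, *args):
        pass  # suppress the default request logging


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), FileHandler)
port = server.server_address[1]

# The server "listens" in a background thread.
threading.Thread(target=server.serve_forever, daemon=True).start()


def fetch(path):
    """Play the role of the browser: send a GET and read the response."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", path)
    response = conn.getresponse()
    body = response.read()
    conn.close()
    return response.status, body


found = fetch("/htmls/tmp/foo.html")
missing = fetch("/no-such-file.html")

server.shutdown()
server.server_close()

print(found[0], missing[0])
```

The first request should come back with the page and the second with a 404 response.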
There are a number of server software packages on the market. The current leader is Apache with almost 60 percent of the market [122]. Apache is open source software. It runs in the Linux, Unix, and Windows environments. The name Apache came about because the software is "a patch" work of code from the numerous independent coders who worked on it. The Windows 2000 operating system comes bundled with Internet Information Server, which is Microsoft's HTTP server. It has about 20 percent of market share. Sun Microsystems' iPlanet is a distant third, with about 6.5 percent of the market.