Working with XML and Information Systems
- Representing data digitally
- XML and digital data
- Information systems
- XML and information systems
This book describes a method for representing data inside computers. As information flows through the processes that operate on it, its forms and representations change in subtle ways. These transformations are governed by patterns of rules usually called programs. Computers are information processing machines, and programs are essentially servants created to serve the needs of the information stored and processed in these machines. Programs exist to display data, to transform data, to move data from one location to another, and to let humans interact with data.
When creating information-centric applications, XML must be weighed as one of many methods of representing data, in relation to both the other methods and the needs of the information itself. Often, the information will be best served by flowing from one representation to another, as each representation best serves the purpose of one part of the system.
In this chapter we will consider how XML compares to other important methods of data representation, such as relational databases and object-oriented databases. This provides a basis for understanding how XML can be used profitably and at which points in a larger application data is best represented as XML. Later, we will look at how to write applications that read, process, and generate XML, and the various methods for doing this. Finally, we will consider how to use XML together with other information technologies in order to create useful applications.
1.1 Representing data digitally
Today's computers are digital machines, which means that any information that is to be processed by them must be represented as a sequence of binary digits (zeroes and ones). This is slightly problematic because such sequences do not have any obvious meaning. To take one example, it is impossible to tell what the string 010010000110100100100001 actually means without knowing what rules were used to produce it.
To represent information digitally we use rules that define how to convert the information from the human understanding of it into strings of bits. A collection of such rules is known as a notation in this book, but often called a data format in ordinary computer terminology. Knowing the notation also allows us to go the other way and interpret the string back into human terms.
A very common interpretation for binary strings is as numbers written in base 2, i.e. in the binary system. If this interpretation were applied to the binary string above it would yield the number 4745505. This might well be the correct interpretation, but it doesn't really tell us much or seem like a very useful interpretation without a context. One context might be: the number is the population of Denmark.1
Another common representation of digital information is the ASCII character encoding, where text is represented by assigning a number to each character that may occur in text, and every character is represented as its number written out in base 2 with 8 bits (or binary digits) per character. If we interpret the string above according to this ASCII notation,2 we find that it spells out characters number 72, 105, and 33, in that order. These three characters together form the string Hi!. In other words, it is a greeting.
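Both interpretations of the bit string can be checked mechanically. The following is a minimal sketch in Python; the variable names are chosen for illustration only.

```python
# The bit string from the text, read under two different notations.
bits = "010010000110100100100001"

# Interpretation 1: a single number written in base 2.
as_number = int(bits, 2)

# Interpretation 2: ASCII text, with 8 bits per character.
codes = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
as_text = "".join(chr(code) for code in codes)

print(as_number)  # 4745505
print(codes)      # [72, 105, 33]
print(as_text)    # Hi!
```

The same 24 bits yield a number or a greeting depending solely on which rules are applied to them.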
1.1.1 Notations
So far we have only considered the encoding of individual values or data items, such as strings and numbers, without any context for these to be interpreted in. In computing such values hardly ever appear in isolation, but are usually found in a larger context, a structured collection of data items. Imagine that a digital data stream is received by an application somehow, disregarding the transmission method for the moment. This means that a stream of binary digits will be pouring into the application, which must then somehow make sense of this stream of information. Doing so requires not only the ability to decode individual data items, but also to locate the boundaries of each item and put the items together into a coherent structure. The rules for how to interpret the stream in this higher-level sense are called a notation.3 Notations can be made to represent very nearly anything at all, be it documents, databases, sound, images, or any other kind of data. Note that there are two main kinds of notations: character-based and bit-based. The first consists of characters, just like text, while the structure of the second is defined in terms of bits and bytes.
One notation is the textual notation, which applies the ASCII character encoding to entire data streams. This character-based notation is simple and convenient and can be used to represent anything at all, from novels through laundry lists to payroll information. However, the conceptual structure of the information is not apparent in the text, and so it cannot be processed automatically by software for purposes other than editing and display. To be able to perform most other tasks, a less general and more application-specific notation is needed.
An example may serve to make this discussion of data encoding and data formats clearer. Shown in Example 11 are the first 200 bytes of a digital data stream, with each byte in the stream interpreted as a base 2 number and displayed as a hexadecimal number, which is a common way of displaying raw binary data.
Example 11. An example data stream
    46 72 6f 6d 3a 20 59 6f 75 72 20 66 72 69 65 6e
    64 20 3c 66 72 69 65 6e 64 40 70 75 62 6c 69 63
    2e 63 6f 6d 3e 0a 54 6f 3a 20 4c 61 72 73 20 4d
    61 72 69 75 73 20 47 61 72 73 68 6f 6c 20 3c 6c
    61 72 73 67 61 40 67 61 72 73 68 6f 6c 2e 70 72
    69 76 2e 6e 6f 3e 0a 53 75 62 6a 65 63 74 3a 20
    41 20 66 75 6e 6e 79 20 70 69 63 74 75 72 65 0a
    4d 65 73 73 61 67 65 2d 49 44 3a 20 3c 35 30 33
    32 35 42 41 32 38 42 30 39 33 34 38 32 31 41 35
    37 46 30 30 38 30 35 46 42 37 46 43 32 35 30 31
    45 36 36 42 35 45 40 6d 61 69 6c 2e 70 75 62 6c
    69 63 2e 63 6f 6d 3e 0a 44 61 74 65 3a 20 46 72
    69 2c 20 38 20 4f 63 74
This binary dump doesn't make a lot of sense in the form it is shown here, but if we are told that it is a character-based notation, things become much clearer. Interpreted as ASCII text, the first 200 bytes of the data stream look like Example 12.
Example 12. The data stream as ASCII
    From: Your friend <friend@public.com>
    To: Lars Marius Garshol <larsga@garshol.priv.no>
    Subject: A funny picture
    Message-ID: <50325BA28B0934821A57F00805FB7FC2501E66B5E@mail.public.com>
    Date: Fri, 8 Oct
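The step from the hexadecimal dump to readable text can be reproduced in a few lines. This is only a sketch: it uses the first 17 bytes of the dump in Example 11, and the variable names are illustrative.

```python
# The first 17 bytes of the dump, copied from Example 11.
hex_dump = "46 72 6f 6d 3a 20 59 6f 75 72 20 66 72 69 65 6e 64"

raw = bytes.fromhex(hex_dump)   # whitespace between the byte pairs is ignored
text = raw.decode("ascii")      # apply the ASCII character encoding

print(len(raw))  # 17
print(text)      # From: Your friend
```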
Suddenly, we see that the data stream is not just a text stream, but an email. Emails have a stricter and less general notation than plain text files. This notation is defined in Internet specifications, the relevant ones being RFCs 822 and 2045 to 2049. RFC stands for Request For Comments, and RFCs are official Internet documents that can be found at http://www.ietf.org/rfc/rfcXXXX.html and also at a huge number of mirror sites world-wide.
The email notation starts with a list of headers and continues with a body that holds the actual email contents. Example 12 shows the beginning of the headers. Each header is placed on a separate line, lines being separated by newline characters.4 On each line, the name of the header field appears first, followed by a colon and a space and then the value of the header field. This enables us to locate individual data items in the email headers, and also to put them together into a larger structure where each data item has a name. Knowing the name of each header field, together with detailed knowledge of the email notation, also tells us how to decode the value in each field. This can sometimes be rather complex, such as in the case of the date.
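The header rules just described can be sketched as a small parser. The function name `parse_headers` is made up for this illustration, and the sketch deliberately ignores complications of the real notation, such as header values that continue over several lines. Decoding the date is left to `parsedate_to_datetime` from Python's standard `email.utils` module.

```python
from email.utils import parsedate_to_datetime

def parse_headers(text):
    """Split header lines of the form 'Name: value' into (name, value) pairs."""
    headers = []
    for line in text.split("\n"):
        if not line:
            break  # an empty line ends the header section
        name, value = line.split(": ", 1)
        headers.append((name, value))
    return headers

sample = (
    "From: Your friend <friend@public.com>\n"
    "Subject: A funny picture\n"
    "Date: Fri, 8 Oct 1999 11:26:22 +0200\n"
)
headers = parse_headers(sample)
date = parsedate_to_datetime(dict(headers)["Date"])

print(headers[1])        # ('Subject', 'A funny picture')
print(date.isoformat())  # 1999-10-08T11:26:22+02:00
```

Locating the field takes one line of code, while decoding the date value is involved enough that it is best left to a library.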
Example 13 shows the entire set of headers for the email, together with an abbreviated body.
In order to be able to decode the body of the email we have to look at the Content-type header field, which tells us what data notation is used in the body. In this case, the field says multipart/mixed. This particular notation is defined by the Internet mail standard known as MIME (Multipurpose Internet Mail Extensions), defined in RFCs 2045 to 2049. It is used for emails that consist of several parts, called attachments. This means that the body consists of several data streams, each making up one attachment, separated by the boundary string also given in the Content-type field.
Example 13. The entire email
    From: Your friend <friend@public.com>
    To: Lars Marius Garshol <larsga@garshol.priv.no>
    Subject: A funny picture
    Message-ID: <50325BA28B0934821A57F00805FB7FC2501E66B5E@mail.public.com>
    Date: Fri, 8 Oct 1999 11:26:22 +0200
    MIME-Version: 1.0
    X-Mailer: Internet Mail Service (5.5.2448.0)
    Content-Type: multipart/mixed; boundary="----_=_NextPart_000_01116F"
    X-UIDL: 37ef28060000035b

    This is a MIME-encoded message. Parts or all of it may be
    unreadable if your software does not understand MIME. See RFC
    2045 for a definition of MIME.

    ----_=_NextPart_000_01116F
    Content-type: text/plain

    Hi Lars, here is a funny picture.

    ----_=_NextPart_000_01116F
    Content-type: image/gif; name="funny.gif"
    Content-transfer-encoding: base64
    Content-disposition: attachment; filename="funny.gif"

    ...

    ----_=_NextPart_000_01116F
If we look closely at the body, we will see that it first contains a message to users whose mail readers are not MIME-aware, placed outside of the first attachment. The first attachment has a form similar to the email itself, with headers and a body. In this case, the body is plain ASCII text, and requires no special treatment.
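Splitting a multipart body into its attachments does not have to be done by hand. The sketch below uses Python's standard `email` package on a shortened, made-up message. The boundary `XYZ` is hypothetical, and the sample follows the rule of the real MIME notation that each delimiter line is the boundary prefixed with two hyphens, with two more hyphens appended to the final one.

```python
import email

# A shortened stand-in for the email discussed above.
raw = (
    "From: Your friend <friend@public.com>\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "This is a MIME-encoded message.\n"
    "--XYZ\n"
    "Content-type: text/plain\n"
    "\n"
    "Hi Lars, here is a funny picture.\n"
    "--XYZ\n"
    'Content-type: image/gif; name="funny.gif"\n'
    "Content-transfer-encoding: base64\n"
    "\n"
    "R0lGODlh\n"
    "--XYZ--\n"
)

msg = email.message_from_string(raw)
parts = msg.get_payload()                       # one message object per attachment
types = [part.get_content_type() for part in parts]

print(msg.is_multipart())                # True
print(types)                             # ['text/plain', 'image/gif']
print(parts[0].get_payload().strip())    # Hi Lars, here is a funny picture.
```

Each attachment comes back as a message object of its own, with headers and a body, which mirrors the nesting described in the text.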
The second attachment, however, is a different matter. It contains a GIF image, encoded with the base64 encoding. This is a common encoding much used on the Internet for encoding binary data as text, so that it may be safely used with applications that only expect ordinary text.5 In this case, after decoding the base64 data the application will have another stream of digital information, this time in the GIF notation.
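As a small illustration of base64, the sketch below decodes eight characters of base64 text using Python's standard `base64` module. The sample is not the content of the attachment above, which is elided in the example. It merely encodes the six-byte signature with which GIF files in the 89a version of the format begin.

```python
import base64

encoded = "R0lGODlh"                  # base64 text, safe to place in an email body
decoded = base64.b64decode(encoded)   # the binary data it stands for

print(decoded)                                     # b'GIF89a'
print(base64.b64encode(decoded).decode("ascii"))   # R0lGODlh
```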
To be able to interpret and display the GIF image, the application must start from scratch again and locate the various fields inside the stream that makes up the image, decode them and use them to decode the rest of the stream. Exactly how this is done is not really relevant to this example, so we will skip this for now. Note that the GIF notation is a binary notation, which is both more efficient and harder to decode and understand than a text notation.
What we have just examined is a notation for email messages. It tells us how to decode a stream of digital information into a coherent data structure that makes sense to a human being. Inside the stream appear various data items and also new data streams, which are the contents of the two attachments. The individual data items have their own notations specified by the larger notation, as do the data streams.
1.1.2 Data representation
So far, we have only discussed the notation itself, but not what the application should do with the information represented in it. The application needs to somehow store the information in the working memory, and to do this it must choose some data representation. The working memory of a computer is nothing but a huge array of bytes, just like the data stream, which means that the notation could well be used to represent the information inside a running program by simply storing the stream as-is in memory. However, notations are generally very awkward to use as the actual data representation in a program, since they are completely flat (being sequences of binary digits) and programs generally need to be able to traverse and modify the data. It is of course possible to do this using the external notation, but it is rather awkward, as Example 14 shows.
Example 14. Using the external notation as internal representation
    class Email:
        """A class for encapsulating email messages and providing
        access to them."""

        def __init__(self, email):
            self._email = email

        def get_header(self, name):
            """Returns a list of the values of all instances of the
            header with the given name."""
            # Note: a header on the very first line has no preceding
            # newline and is missed by this search, and text in the
            # body that looks like a header is matched as well.
            values = []
            prefix = "\n" + name + ": "
            pos = self._email.find(prefix)
            while pos != -1:
                end = self._email.find("\n", pos + 1)
                values.append(self._email[pos + len(prefix) : end])
                pos = self._email.find(prefix, pos + 1)
            return values

        def add_header(self, name, value):
            "Inserts a header with the given name and value."
            # The first empty line separates the headers from the body.
            pos = self._email.find("\n\n")
            assert pos != -1
            self._email = (self._email[ : pos + 1] + name + ": " + value
                           + "\n" + self._email[pos + 1 : ])

        # ...
This implementation of the Email class uses the external email notation as the internal representation of emails inside the program. This is done by keeping the email as a string, so that values can be extracted from the string and the entire email can be modified by modifying the string. As should be obvious, this is both awkward and inefficient.
A much more natural representation would be to have a dictionary keyed on header names that maps to a list of values to represent the headers. The attachments could be represented as a list of attachment objects, where each attachment object holds a dictionary of header fields and a file-like object to represent the attachment contents. Further classes could also be defined to represent the values in the various fields (email addresses, dates, etc.). Such an implementation is shown in Example 15.
Example 15. Using a more natural representation
    class Email:
        """A class for encapsulating email messages and providing
        access to them."""

        def __init__(self):
            self._headers = {}   # header name -> list of values
            self._attach = []    # list of Attachment objects

        def get_header(self, name):
            """Returns a list of the values of all instances of the
            header with the given name."""
            # An absent header gives an empty list, not a KeyError.
            return self._headers.get(name, [])

        def add_header(self, name, value):
            "Inserts a header with the given name and value."
            try:
                self._headers[name].append(value)
            except KeyError:
                self._headers[name] = [value]

        # . . .

    class Attachment:
        """A class for encapsulating attachments in an email and
        providing access to them."""

        def __init__(self):
            self._headers = {}
            self._contents = None

        # . . .
What we have done now is to design an internal data structure that is optimized for storing the information from the email in the working memory of a program. Both the data stream and the data structure are digital, but they have very different properties. The data stream is a sequential stream of bytes6 (defined by a notation), while the data structure is not necessarily contiguous in memory, has no specific order and is highly granular rather than flat like the data stream.
One thing that is important to understand is that while the data structure represents the original email data stream, it does not do so fully. The data structure keeps only the information we consider essential (what is called the logical information), and throws away much information about what the original data stream looked like. Among the pieces of information we have lost are the boundary string used between the attachments and the warning that preceded the first attachment. We can no longer recreate the original email!
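The loss can be made concrete with a small sketch. It assumes the simplified convention of the example email, where the boundary string stands alone on a line. The function name `logical_parts` and both sample streams are made up for the illustration.

```python
def logical_parts(body, boundary):
    """Split a simplified multipart body and keep only the text of each part."""
    pieces = body.split(boundary + "\n")
    # pieces[0] is the warning before the first boundary; it is thrown away.
    return [piece.strip() for piece in pieces[1:]]

stream_a = "Warning for old mail readers\nAAA\nfirst part\nAAA\nsecond part\n"
stream_b = "A different warning\nBBB\nfirst part\nBBB\nsecond part\n"

parts_a = logical_parts(stream_a, "AAA")
parts_b = logical_parts(stream_b, "BBB")

print(parts_a)             # ['first part', 'second part']
print(parts_a == parts_b)  # True
```

Two different data streams produce the same data structure, so the structure alone cannot tell us which of the two streams it came from.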
This means that although the second representation is much more usable than the first, it carries a hidden cost: the loss of information that may at times be necessary. As we will see later, central XML specifications do the same, and this has both benefits and costs that one must be aware of. For if you do need to recreate the original data stream, you will need to solve this problem somehow, and the XML specifications and established practice will offer little or no help.
1.1.3 Serialization and deserialization
The problem with having the data in the working memory of the application is that once the application is shut down or the power to the machine is turned off, the contents of the working memory are lost. Also, the application cannot communicate its internal structures directly to other programs, since they are not allowed to access its memory.7 Programs running on other computers will not be able to access the data at all.
Using a notation solves this problem, because it gives us a well-defined way of representing our data as a data stream. It does leave us with two new problems, however: moving data from the internal data structure to the notation, and moving it back again. The technical term for the process of writing a data structure out as such a binary stream is serialization. It is so called because the structure is turned into a flat stream, or series, of bytes. Once we have this stream of bytes, we can store it into a file on disk where it will persist even if the application is shut down or the power is turned off. The file can then be read by other applications. We can also transmit the stream across the network to another machine where other applications can access it.
In the email example, the email program will receive the email from a mail server and store it in memory in its internal data structures. It will then write this internal structure out to its database of emails, which can be organized in many different ways. Some programs simply put each email (using the original notation) in a separate file, while others use more sophisticated database-like approaches.
In general, we can say that data has two states: live and suspended. Live data is in the internal structure used by a program and is being accessed and used by that program. Suspended data is serialized data in some notation that is either stored in a file or being transmitted across a network. Suspended data must be deserialized to be turned into live data so that it can actually be used by programs. The deserialization of character-based notations is usually known as parsing, and a substantial branch of computer science is dedicated to the various methods of parsing.7a The vaguer term loading is also at times used as a synonym for deserialization.
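The round trip between live and suspended data can be sketched as a pair of functions. The names `serialize` and `deserialize` and the sample headers are illustrative, and the header-only notation handled here is a simplification of the real email notation.

```python
def serialize(headers):
    """Write a dictionary of header name -> list of values out as bytes."""
    lines = []
    for name, values in headers.items():
        for value in values:
            lines.append(name + ": " + value)
    return ("\n".join(lines) + "\n").encode("ascii")

def deserialize(stream):
    """Parse the bytes back into a dictionary of header name -> list of values."""
    headers = {}
    for line in stream.decode("ascii").splitlines():
        name, value = line.split(": ", 1)
        headers.setdefault(name, []).append(value)
    return headers

live = {
    "From": ["Your friend <friend@public.com>"],
    "Subject": ["A funny picture"],
}
suspended = serialize(live)        # a flat series of bytes, ready to store or send
restored = deserialize(suspended)  # live data again

print(suspended[:6])     # b'From: '
print(restored == live)  # True
```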
It is not necessarily the case that each notation has a single data structure, and vice versa. In fact, usually each application supporting a notation will have its own data structure that is specific to it. In many cases applications will also support many notations.
Note that serialized (suspended) data need not be written to a file when it is stored. It can also be stored in a database (most database systems support storage of uninterpreted binary large objects, also known as blobs), as part of another file (as the email example showed) or in some other way. In fact, serialized data doesn't need to be stored at all, but can instead be transmitted across the network or to another process on the same machine.
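Storing a serialized stream as a blob might look like the following sketch, which uses Python's standard `sqlite3` module with an in-memory database. The table and column names are made up for the illustration.

```python
import sqlite3

# A serialized (suspended) email, kept as uninterpreted bytes.
stream = b"Subject: A funny picture\n\nHi Lars, here is a funny picture.\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (id INTEGER PRIMARY KEY, raw BLOB)")
conn.execute("INSERT INTO emails (raw) VALUES (?)", (stream,))

(stored,) = conn.execute("SELECT raw FROM emails").fetchone()
print(stored == stream)  # True
```

The database does not interpret the bytes at all. The application must still deserialize them before it can use the email.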
1.1.4 Data models
Over the years, certain methods for structuring data have established themselves as useful general approaches to building data structures. When such a method is formalized by a specification of some kind it becomes a data model. A data model is perhaps easiest explained as a set of basic building blocks for creating data structures and a set of rules for how these can be combined.
One of the most widely used and best-defined data models is the relational model, where data is organized into tables with horizontal rows, each containing a record, and vertical columns, representing fields. Each record contains information about a distinct entity, with individual values in each field. This is the data model used in comma-separated files and in relational databases. In relational databases some fields can also be references into other tables.
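A comma-separated file shows the tabular structure in its simplest form. The sketch below reads two records with Python's standard `csv` module. The column names are illustrative, and the second record is made up.

```python
import csv
import io

text = (
    "sender,subject\n"
    "friend@public.com,A funny picture\n"
    "larsga@garshol.priv.no,Re: A funny picture\n"   # made-up second record
)

# Each row becomes one record; each column name becomes a field name.
rows = list(csv.DictReader(io.StringIO(text)))

print(len(rows))           # 2
print(rows[0]["subject"])  # A funny picture
```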
Another common data model is the object-oriented one, where data consists of individual objects, each of which has a number of attributes associated with it. Attributes have a name and a value and can be primitive values or references to other objects. This model is used by object-oriented programming languages and databases.
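In an object-oriented programming language the model might be sketched like this. The classes `Person` and `Message` are hypothetical and serve only to show attributes holding primitive values and references to other objects.

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str        # attribute with a primitive value
    address: str

@dataclass
class Message:
    subject: str
    sender: Person                                    # reference to another object
    recipients: list = field(default_factory=list)    # references to Person objects

friend = Person("Your friend", "friend@public.com")
lars = Person("Lars Marius Garshol", "larsga@garshol.priv.no")
message = Message("A funny picture", friend, [lars])

print(message.sender.name)            # Your friend
print(message.recipients[0] is lars)  # True
```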
Defining a data model that states how data must be structured has several benefits. First, it gives a framework for thinking about information design that can be very helpful for developers by providing a set of stereotypes or templates which can be applied to the problem at hand to yield a solution. Secondly, it allows general data processing frameworks (that is, databases) to be created that can be used to create many different kinds of applications. The prime examples of such frameworks are relational databases.
At this point you may be wondering what the data model used by emails is, and the answer is that email specifications do not use any particular data model. Instead, they use a well-known formalism, a variant of BNF (Backus-Naur Form) that RFC 822 calls augmented BNF, to formally specify the notation of emails, and leave the conceptual structure undefined. People tend to agree on what the structure is anyway, although they can occasionally disagree on details, some of which may be important.
To be able to use a data model, the application developer must represent the information in the application in terms of that data model. Doing so lets the application use the notations and data processing frameworks that are based on the data model. For example, to be able to represent the structure of emails in relational databases, the application must express the structure of the emails using the tabular data model. Table 11 shows the result of this translation.
As you can see, it was a relatively simple translation. The only real problem was how to represent the attachments. The solution used here was a bit simplistic, since the attachment headers are just strings. This means that their structure is not represented using the data model at all, so this isn't really a very good solution. The attachments should have their own (almost identical) tables, but for simplicity I did not do that here.
Table 11 Email as table
| Field | Value | Order |
|---|---|---|
| From | Your friend <friend@public.com> | |
| To | Lars Marius Garshol <larsga@garshol.priv.no> | |
| Subject | A funny picture | |
| Message-ID | <50325BA28B0934821A57F00805FB7FC2501E66B5E@mail.public.com> | |
| Date | Fri, 8 Oct 1999 11:26:22 +0200 | |
| MIME-Version | 1.0 | |
| X-Mailer | Internet Mail Service (5.5.2448.0) | |
| Content-type | multipart/mixed | |
| X-UIDL | 37ef28060000035b | |
| Body | Content-type: text/plain Hi Lars, [...] | 1 |
| Body | Content-type: image/gif [...] | 2 |
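Some of the rows of Table 11 could be loaded into a relational database as in the sketch below, which uses Python's standard `sqlite3` module. The table name `email` and the column name `part_order` (standing in for the Order column) are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email (field TEXT, value TEXT, part_order INTEGER)")

rows = [
    ("From", "Your friend <friend@public.com>", None),
    ("Subject", "A funny picture", None),
    ("Body", "Content-type: text/plain Hi Lars, [...]", 1),
    ("Body", "Content-type: image/gif [...]", 2),
]
conn.executemany("INSERT INTO email VALUES (?, ?, ?)", rows)

# The attachments come back in order, but each one is still an opaque string.
bodies = conn.execute(
    "SELECT value FROM email WHERE field = ? ORDER BY part_order", ("Body",)
).fetchall()

print(len(bodies))  # 2
```

The query can find the attachments, yet the database knows nothing about the header fields inside them.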
Representing information in the application using the data model of the underlying framework is usually easy, but sometimes awkward or even quite difficult. The relational model is especially strict and inflexible, which made it possible to describe it very precisely mathematically and develop a powerful set of mathematical abstractions and techniques for working with relational data. Due to this work, relational databases today are well understood, extremely reliable and scalable and may perhaps in fairness be called the greatest success of computer science so far. For all their power, however, they are not suitable for all applications, and this is one of the facts that motivated the development of alternative models, such as the object-oriented one.
Restricting the possible forms of data to a specific data model has another benefit: formal languages can be defined to describe the structure of the data in terms of the underlying data model. Using such languages, the data structure of an application can be described formally and precisely. Such a description is known as a schema and the languages as schema languages.8 In the relational model, for example, a schema will define the tables used by an application, the type of each column in each table, and any cross-references between the tables.
Defining a schema for an application has the benefit that the framework can use it to automatically validate the data against the schema to ensure no invalid data is entered. With relational databases this means that you cannot put text in numeric columns, enter postal codes that are too long or too short, or insert a reference to a row in a table that does not exist (nor can you remove a row from one table if there are references to it from other tables).
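Schema validation can be tried out with Python's standard `sqlite3` module, as in the sketch below. The tables, columns and values are made up for the illustration. The check on column types is left out, because SQLite is more lenient about them than most relational databases. SQLite also enforces references between tables only after the `PRAGMA foreign_keys = ON` statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE email (id INTEGER PRIMARY KEY, subject TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE attachment ("
    " id INTEGER PRIMARY KEY,"
    " email_id INTEGER NOT NULL REFERENCES email(id),"
    " size INTEGER CHECK (size >= 0))"
)
conn.execute("INSERT INTO email (id, subject) VALUES (1, 'A funny picture')")
conn.execute("INSERT INTO attachment (email_id, size) VALUES (1, 1024)")

rejected = []
for statement in (
    "INSERT INTO attachment (email_id, size) VALUES (99, 10)",  # no such email
    "INSERT INTO attachment (email_id, size) VALUES (1, -5)",   # violates the CHECK
    "DELETE FROM email WHERE id = 1",                           # row is still referenced
):
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError:
        rejected.append(statement)

print(len(rejected))  # 3
```

All three statements are refused by the database itself, without any validation code in the application.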
1.1.5 Summary
Figure 11 Summary of data representation terms
Figure 11 shows how a live data structure inside an application can be serialized into a suspended sequential data stream which can then be sent over the network, passed to another application or written to disk. It also shows how the stream can be read back into the application to rebuild the internal data representation. Today, the representation will usually be defined as a set of classes, but programming languages that are not object-oriented have other ways of representing data. The internal data representation will be defined in terms of a data model, such as the relational or the object-oriented model. The data stream will be written according to a notation of some kind, and the notation will also be based on a data model.
Initially, we discussed the notations of individual values and data items. It is worth noting that the notation of values is often shared between the external notations and the internal data representations. These mainly differ in the way they compose larger structures from collections of values and data items, and not so much in the notation of individual values.