Modeling XML Vocabularies with UML: Part II
In Part 1 of this article series, I emphasized that models are an inevitable part of system analysis and design, even if a model is sometimes only in the developer's mind. By using UML to capture a conceptual model of the planned vocabulary, we can clarify the essential terms and relationships without getting caught up in the syntactic issues of the chosen schema language. In fact, industry standards groups may want to use UML as the primary definition for their vocabularies and leave the final choice of schema language(s) to implementing vendors.
I also want to emphasize that choosing a model-driven approach to schema design does not force you into a drawn-out waterfall development process. The approach described in these articles illustrates an evolutionary and incremental development process. The first schema produced using default mapping rules from this purchase order model may not be ideal, but it accurately captures the domain semantics that were modeled. Part 3 of this series describes how the model may be specialized to capture design characteristics that are unique to XML schema generation. This approach is compatible with the contemporary methodologies for agile programming and modeling, where the models fulfill a very pragmatic role in the development process. (See XMLmodeling.com, which is a web portal that I have created to gather case studies and modeling resources.)
To achieve these rather lofty objectives, it's essential that we have a complete, flexible mapping specification between UML and XML schemas. The following examples do not present the complete picture, but attempt to ease you into a maze of terminology from UML and the W3C XML Schema definition language (XSD).
Mapping UML Models to XML Schema
This is where the rubber meets the road when using UML in the development of XML schemas. A primary goal guiding the specification of this mapping is to allow sufficient flexibility to encompass most schema design requirements, while retaining a smooth transition from the conceptual vocabulary model to its detailed design and generation.
A related goal is to allow a valid XML schema to be automatically generated from any UML class diagram, even if the modeler has no familiarity with the XML schema syntax. Having this ability enables a rapid development process and supports reuse of the model vocabularies in several different deployment languages or environments, because the core model is not overly specialized to XML structure.
Please note that the schema examples in this article are not fully compatible with the corresponding example in the XML Schema Primer. Nonetheless, the following schema fragments are still valid interpretations of the conceptual model. The third article in this series will continue the refinement process to its logical conclusion, where the resulting schema can validate the XSD Primer example.
The conceptual model for purchase orders shown in Figure 1 is duplicated with very slight modification from the first article. We'll dissect this diagram into all of its major structures and map each part to the W3C XML Schema definition language. I'll note several situations in which other alternatives are possible and also point out where the schema differs from the XSD Primer example.
Figure 1 Conceptual model of purchase order vocabulary.
Class and Attribute
A class in UML defines a complex data structure (and associated behavior) that maps by default to a complexType in XSD. As a first step, the PurchaseOrder class and its UML attributes produce the following XML Schema definition:
<xs:complexType name="PurchaseOrder">
  <xs:all>
    <xs:element name="orderDate" type="xs:date" minOccurs="0" maxOccurs="1"/>
    <xs:element name="comment" type="xs:string" minOccurs="0" maxOccurs="1"/>
  </xs:all>
</xs:complexType>
The attributes in a UML class are not restricted to a particular order, so an XSD <xs:all> element is used to create an unordered model group. In addition, a UML class creates a distinct namespace for its attribute names (that is, two classes can contain attributes having the same name), so these are produced as local element definitions in the schema. See A New Kind of Namespace for more explanation of this topic.
Both of these UML attributes are optional, indicated by [0..1] in Figure 1. These are mapped to minOccurs and maxOccurs attributes in the XSD. The UML attributes are defined using primitive datatypes from the XSD specification, so these are written directly to the generated schema using the appropriate namespace prefix. If other datatypes are used in the UML model, an XSD type library can be created to define these types for use in a schema. For example, I have created an XSD type library for the Java primitive types and common Java classes such as Date, String, Boolean, and so on.
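As an illustration only, a type library of this kind might contain entries like the following. The type names and the XSD base types chosen for them are assumptions made for this sketch; they are not taken from the author's actual library.

```xml
<!-- Hypothetical type library: names and base types are assumed for illustration -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- java.lang.String carries arbitrary character data -->
  <xs:simpleType name="String">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
  <!-- java.lang.Boolean and the primitive boolean -->
  <xs:simpleType name="Boolean">
    <xs:restriction base="xs:boolean"/>
  </xs:simpleType>
  <!-- java.util.Date represents an instant, so xs:dateTime is assumed here -->
  <xs:simpleType name="Date">
    <xs:restriction base="xs:dateTime"/>
  </xs:simpleType>
  <!-- The Java primitive int is a 32-bit signed integer, matching xs:int -->
  <xs:simpleType name="int">
    <xs:restriction base="xs:int"/>
  </xs:simpleType>
</xs:schema>
```

A UML attribute typed as Date in the model could then be generated with type="Date" and resolved against this library, rather than requiring the modeler to pick an XSD datatype directly.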
As a useful default, a top-level element is automatically created for each complexType in the schema. The default name for this element is the same as the class name; this is allowed in the W3C XML Schema because type definitions and top-level element declarations occupy separate symbol spaces within a schema, so the same name can be used for both without conflict. For PurchaseOrder, the top-level schema element is created as follows:
<xs:element name="PurchaseOrder" type="PurchaseOrder"/>
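A minimal instance that this first-step type and its top-level element would accept might look like the following sketch. The values are invented for illustration, and the example assumes the schema declares no target namespace, which the fragments in this article do not show either way.

```xml
<!-- Children may appear in any order because the type uses xs:all -->
<PurchaseOrder>
  <comment>Hurry, my lawn is going wild!</comment>
  <orderDate>1999-10-20</orderDate>
</PurchaseOrder>
```

Because both elements are declared with minOccurs="0", an empty PurchaseOrder element would also be valid at this stage, before the associations are added to the type.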
Referring to the XSD Primer example, orderDate is modeled as an XML attribute, not a local element in PurchaseOrder. It also uses a <sequence> model group instead of <all>. And the top-level element is defined in the Primer using a lowercase first letter; that is, purchaseOrder (often called "lower camel case" format). All of these differences are addressed in the third article in this series by using a UML profile to expand the mapping to XML schemas.
Association
The PurchaseOrder type is specified not only by its UML attributes but also by its associations to other classes in the model. Figure 1 includes three associations that originate at PurchaseOrder, as indicated by the navigation arrows at their opposite ends. Each association has a role name and multiplicity to specify how the target class is related. These associations are added to the model group of the XSD complexType, along with the elements created from the UML attributes.
<xs:complexType name="PurchaseOrder">
  <xs:all>
    <xs:element name="orderDate" type="xs:date" minOccurs="0" maxOccurs="1"/>
    <xs:element name="comment" type="xs:string" minOccurs="0" maxOccurs="1"/>
    <xs:element name="shipTo">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="Address"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
    <xs:element name="billTo">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="Address"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
    <xs:element name="items" minOccurs="0" maxOccurs="1">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="Item" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:all>
</xs:complexType>
Because the UML attributes for orderDate and comment have primitive datatypes, the schema embeds these values as element content. However, the default mapping for associations creates a wrapper element in XSD corresponding to the role name in UML. This element then contains the instances of the associated class, which the schema refers to using the top-level element created for each complexType.
If you want to create a W3C XML Schema using the <all> content model, a wrapper element is necessary whenever the associated class has more than one occurrence. This is because <all> can be used only when the contained elements have either [0..1] or [1..1] multiplicity. So when generating the wrapper element for the association to Item, the element named items allows zero or one instance, which in turn holds zero or more Item elements.
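To make the restriction concrete, the following fragment sketches the kind of declaration that the wrapper element avoids. It is not part of the generated schema; an XSD 1.0 processor would be expected to reject it because a repeating element appears directly inside <xs:all>.

```xml
<!-- NOT valid in XSD 1.0: elements inside xs:all may have maxOccurs of 0 or 1 only -->
<xs:complexType name="PurchaseOrder">
  <xs:all>
    <xs:element name="comment" type="xs:string" minOccurs="0"/>
    <xs:element ref="Item" minOccurs="0" maxOccurs="unbounded"/>
  </xs:all>
</xs:complexType>
```

Wrapping the repeating Item elements in a single optional items element keeps every direct child of <xs:all> within the [0..1] or [1..1] range.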
The difference between this default schema generated from UML and the schema included in the XSD Primer is that the Primer's shipTo and billTo roles contain the address content directly, without use of an element for the associated class. In other words, child elements for name, street, city, and so on are contained directly within shipTo and billTo. This design alternative is covered in the extensions presented in the third article in this series.
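The two designs can be compared side by side in instance form. This is a sketch with illustrative address values; it assumes that the USAddress element described later in this article is substituted for the abstract Address element, and it leaves out the country attribute used in the Primer example for brevity.

```xml
<!-- Default mapping: the role element wraps an element for the associated class -->
<billTo>
  <USAddress>
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </USAddress>
</billTo>

<!-- XSD Primer style: the role element holds the address content directly -->
<billTo>
  <name>Robert Smith</name>
  <street>8 Oak Avenue</street>
  <city>Old Town</city>
  <state>PA</state>
  <zip>95819</zip>
</billTo>
```

The default form costs one extra level of nesting, but it records which concrete address class is in use through the element name itself.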
User-Defined Datatype
The default mapping to XSD would produce a complexType definition for SKU and QuantityType, but we want these to become user-defined simple datatypes in the XML Schema. This is easily achieved by adding a UML stereotype to each of these two classes, which is shown as <<XSDsimpleType>> in Figure 1. This ability to include stereotypes is an integral part of the UML standard and is used to specify additional model characteristics that are usually unique to a particular domain; in this case, unique to XML schema design.
Using the stereotype, the schema generator knows to create the following definition for SKU:
<xs:simpleType name="SKU">
  <xs:annotation>
    <xs:documentation>Stock Keeping Unit, a code for identifying products</xs:documentation>
  </xs:annotation>
  <xs:restriction base="xs:string">
    <xs:pattern value="\d{3}-[A-Z]{2}"/>
  </xs:restriction>
</xs:simpleType>
A UML model may also include documentation for any of its model elements, which is passed through to the XML schema definition as shown in this example. The UML generalization relationship indicates which existing simple datatype should be used as the base for this user-defined type. Finally, the pattern attribute on SKU is mapped to an XSD facet that constrains the SKU string value.
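The generated definition for QuantityType is not shown in this article. By analogy with SKU, it might look like the sketch below, which assumes that the class generalizes xs:positiveInteger and carries a maxExclusive value of 100, as the quantity element does in the XSD Primer; the actual values in Figure 1 may differ.

```xml
<!-- Assumed definition: base type and facet value are borrowed from the XSD Primer's quantity element -->
<xs:simpleType name="QuantityType">
  <xs:restriction base="xs:positiveInteger">
    <xs:maxExclusive value="100"/>
  </xs:restriction>
</xs:simpleType>
```

For comparison, under the SKU pattern a value such as 926-AA would be accepted, while 926-aa (lowercase letters) or 92-AA (only two digits) would be rejected.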
The second module in the purchase order schema definition represents a reusable set of specifications for addresses, as shown in Figure 2. These definitions are taken directly from section 4.1 of the XSD Primer. This model requires two schema constructs beyond those used when producing a schema from Figure 1.
Figure 2 Modularized Address schema component.
Generalization
A fundamental and pervasive concept in object-oriented analysis and design is generalization from one class to another. The specialized subclass inherits attributes and associations from all of its parent classes. This is easily represented in W3C XML Schema, although it requires more indirect mechanisms when producing other XML schema languages.
In Figure 2, the Address class is shown in italic font, which is used in UML to indicate that this is an abstract class that is only intended to be used for deriving other specialized classes. Following the same default rules used for PurchaseOrder, the complexType definitions for Address and USAddress are produced as follows:
<xs:element name="Address" type="Address" abstract="true"/>
<xs:complexType name="Address" abstract="true">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
    <xs:element name="street" type="xs:string"/>
    <xs:element name="city" type="xs:string"/>
  </xs:sequence>
</xs:complexType>
<xs:element name="USAddress" type="USAddress" substitutionGroup="Address"/>
<xs:complexType name="USAddress">
  <xs:complexContent>
    <xs:extension base="Address">
      <xs:sequence>
        <xs:element name="state" type="USState"/>
        <xs:element name="zip" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
There are three areas of difference from previous examples:
The top-level element and complexType definitions for Address include the XSD attribute abstract="true".
The USAddress element includes substitutionGroup="Address", which means that whenever the Address element is required as a content element, USAddress may be substituted in its place. Thus, we can use USAddress (or, similarly, UKAddress) as the content of shipTo and billTo in the PurchaseOrder.
The complexType definition for USAddress is extended from the base complexType named Address. But there is a significant point of difference in how this inheritance structure is interpreted in UML versus in XSD. In UML, the order of attributes and associations within a class is not specified and the features inherited from parent classes are freely intermingled with locally defined attributes and associations in a subclass. In XSD, a complexType defined using an <all> group is not allowed to be extended with additional element content in a child complexType. As a result of this limitation in XSD, the complexType for Address and each of its specialized subtypes must use a <sequence> group definition for their element content. So the three elements inherited from Address are an ordered group in USAddress, followed in sequence by another ordered group of the two elements defined in USAddress. You cannot define an unordered group of the five elements when one or more is inherited.
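A second subtype follows the same shape. The sketch below shows how a UKAddress might be generated under the same default rules; the postcode element and its xs:string type are simplifying assumptions, since the contents of that class are not described in this article.

```xml
<!-- Hypothetical subtype: the postcode element and its type are assumed -->
<xs:element name="UKAddress" type="UKAddress" substitutionGroup="Address"/>
<xs:complexType name="UKAddress">
  <xs:complexContent>
    <xs:extension base="Address">
      <xs:sequence>
        <xs:element name="postcode" type="xs:string"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
```

In an instance, the inherited name, street, and city elements would come first, in that order, followed by postcode.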
Enumerated Datatype
The state element of USAddress refers to a simple type definition for USState, which is generated from a UML enumeration. In Figure 2, USState is shown with an <<enumeration>> stereotype that notifies the schema generator to create an XSD enumeration value for each of the attributes defined for this class. An enumerated type in XSD is just a specialized kind of simpleType definition, so the enumeration class must also specify a superclass in UML to use as a base type in XSD. The schema is generated as follows:
<xs:simpleType name="USState">
  <xs:restriction base="xs:string">
    <xs:enumeration value="AK"/>
    <xs:enumeration value="AL"/>
    <xs:enumeration value="AR"/>
    <xs:enumeration value="PA"/>
  </xs:restriction>
</xs:simpleType>