This chapter is from the book

15.6 Character Classes

Atoms (quantified or otherwise) can be larger than a single character or escaped character sequence. An atom can also be a character class. This feature allows a particular atom in the pattern to be one of a number of predefined options. For example, perhaps the third character in a value is allowed to be 'a', 'b', or 'c'.

All ISBN numbers used by a particular publisher might start with '0-201-' and then be completed by a five-digit book identifier and a check digit, such as '77059-8'. In this case, the first part of the ISBN number can be represented by the six fixed characters, but it is clearly not practical to list every possible book identifier using the techniques seen so far (nor every check digit, which can be an 'x' as well as any digit). However, a character class, which is enclosed by square brackets ('[' and ']'), can assist in this situation. The expression '-[0123456789]' specifies that one of the digits '0', '1', '2', '3', '4', '5', '6', '7', '8', or '9' must appear after a hyphen character. A complete, but very verbose, pattern for the example above follows (although in a real schema it could not be broken over lines as shown here):

<pattern value="0-201-[0123456789][0123456789]
                [0123456789][0123456789][0123456789]
                -[0123456789x]" />

Negative classes

A character class becomes a negative character class, reversing its meaning, when the character '^' is placed immediately after the opening square bracket. This specifies that any character except those in the group can be matched. For example, the pattern '[^abc]' matches any single character except 'a', 'b', or 'c'.

Note that a negative class still requires a character to be present; it must not be interpreted as matching no character in the value. The pattern 'a[^b]c' would match the value 'axc' but not 'ac'.
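The behavior of a negative class can be tried out with a short sketch. It uses Python's re module as a stand-in for a schema validator, which is an assumption: Python's expression language is not the schema pattern language, although the two agree on this construct. The helper name matches is hypothetical, and re.fullmatch is used to approximate the rule that a pattern must account for the whole value.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Return True if the entire value matches the pattern.

    re.fullmatch approximates schema pattern matching, where the
    pattern must cover the whole value (an assumption of this sketch).
    """
    return re.fullmatch(pattern, value) is not None


# The negative class consumes exactly one character that is not 'b'.
print(matches("a[^b]c", "axc"))  # True
print(matches("a[^b]c", "ac"))   # False: no character for the class
print(matches("a[^b]c", "abc"))  # False: 'b' is excluded
```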

The '^' symbol can be used later in the group without having this significance, so '[a^b]' simply means that the character must be 'a' or '^' or 'b'. It should not be necessary to place this character first within such a group, where it would be interpreted as an indicator of a negative character class, but it can be placed there without misinterpretation if it is escaped ('[\^ab]').
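The effect of the position of '^' can be illustrated in the same way. This is again only a sketch that assumes Python's re module handles these three classes as the pattern language does; the helper name matches is hypothetical.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Return True if the entire value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


# '^' after the first position is an ordinary member of the class.
print(matches("[a^b]", "^"))    # True
print(matches("[a^b]", "c"))    # False

# An escaped '^' in first position is also an ordinary member.
print(matches(r"[\^ab]", "^"))  # True
print(matches(r"[\^ab]", "c"))  # False

# An unescaped '^' in first position negates the class.
print(matches("[^ab]", "c"))    # True
print(matches("[^ab]", "a"))    # False
```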

Readers familiar with other expression languages should also note that '^' does not play its usual role as a line-start indicator. Similarly, the '$' symbol is not significant here (as an end-line indicator). Attribute values do not contain lines of text (a multiple-line attribute value is normalized by the parser before the point at which patterns are used to validate the value) and, while element content can contain lines of text, this concept is rarely relevant.

Quantified classes

Quantifiers can be used on character classes. The quantifier or quantity simply follows the ']' class terminator:

[ ]? 
[ ]+ 
[ ]* 
[ ]{5,9} 

When the quantifier indicates that more than one occurrence is allowed, this does not mean that only a single selected character can repeat. For example, '[abc]+' specifies that at least one of the letters 'a', 'b', and 'c' must appear but that additional characters from this set may also appear, so 'abcbca' would be just as valid a match as 'aa', 'bbb', or 'ccccc'.
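A quantified class can be checked against the values mentioned above with a brief sketch. Python's re module is assumed here as an approximation of a schema validator, and the helper name matches is hypothetical.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Return True if the entire value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


# Each repetition may pick a different character from the class.
for value in ["abcbca", "aa", "bbb", "ccccc"]:
    print(value, matches("[abc]+", value))  # True for every value

print(matches("[abc]+", ""))          # False: at least one is required
print(matches("[abc]+", "abd"))       # False: 'd' is not in the class

# A quantity range limits the number of repetitions.
print(matches("[abc]{5,9}", "abca"))   # False: only four characters
print(matches("[abc]{5,9}", "abcab"))  # True: five characters
```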

The earlier ISBN example can now be shortened considerably:

<pattern
   value="0-201-[0123456789]{5}-[0123456789x]" />
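The shortened pattern can be exercised against a few sample values, and compared with the verbose form built from repeated classes. This sketch assumes that Python's re module treats the pattern as a schema validator would; the names matches, SHORT and VERBOSE are hypothetical, and the sample values other than '0-201-77059-8' are invented for illustration.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Return True if the entire value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


# The shortened pattern, using a quantity on the digit class.
SHORT = "0-201-[0123456789]{5}-[0123456789x]"

# The verbose equivalent: the digit class written out five times.
VERBOSE = "0-201-" + "[0123456789]" * 5 + "-[0123456789x]"

samples = [
    "0-201-77059-8",  # five digits and a digit check character
    "0-201-77059-x",  # 'x' check character
    "0-201-7705-8",   # only four digits in the identifier
    "0-201-77059-X",  # upper-case 'X' is not in the class as written
]

for value in samples:
    # Both forms should agree on every sample.
    print(value, matches(SHORT, value), matches(VERBOSE, value))
```

Only the first two samples are accepted. The last sample shows that the class as written admits a lower-case 'x' only.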