This chapter is from the book

15.7 Character Class Ranges

It is not necessary to list every character individually when the options form a sequence. For example, '[abcdefghijklmnopqrstuvwxyz]' is a verbose way to specify that any lowercase letter can occur. When the options have consecutive character code values (according to the Unicode standard, which incorporates the ASCII standard), as in the example above, a range can be specified instead. A '-' separator is placed between the first and last characters of the range. For example, '[a-z]' is a more succinct way of specifying that any lowercase letter is allowed.

Multiple ranges can be given. The following expression allows digits and all non-accented letters to occur:

<pattern value="[a-zA-Z0-9]+" /> 
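The effect of combining ranges can be tried outside a schema validator. The following is a minimal sketch that assumes Python's 're' module as a stand-in; it is a different expression language from the one used in schema patterns, but this simple expression should behave the same way in both. The function name 'is_alnum' is hypothetical, and 're.fullmatch' is used because a schema pattern must match the whole value.

```python
import re

# A schema pattern must match the whole value, so fullmatch is the closest analogue.
ALNUM = re.compile(r"[a-zA-Z0-9]+")

def is_alnum(value: str) -> bool:
    """Return True if every character is an unaccented letter or a digit."""
    return ALNUM.fullmatch(value) is not None

print(is_alnum("Chapter15"))   # True
print(is_alnum("chapter-15"))  # False: '-' falls in none of the three ranges
print(is_alnum("caf\u00e9"))   # False: the accented 'e' is outside 'a-z'
```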

If a literal '-' character is needed within a character class, it can be escaped as '\-'. Outside a character class, no escaping is needed. The escape is also unnecessary when the '-' is the first or last character in the character class, so '[-abc]' and '[abc-]' both specify that the character must be 'a', 'b', 'c', or '-'.
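The hyphen rules are easy to check by experiment. This sketch again assumes Python's 're' module as a stand-in, which treats a leading, trailing, or escaped hyphen in a character class in the same way; the helper name 'allowed' is hypothetical.

```python
import re

def allowed(pattern: str, value: str) -> bool:
    """Return True if the whole value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

print(allowed(r"[-abc]", "-"))   # True: a leading '-' is a literal hyphen
print(allowed(r"[abc-]", "-"))   # True: a trailing '-' is a literal hyphen
print(allowed(r"[a\-c]", "-"))   # True: an escaped '-' is a literal hyphen
print(allowed(r"[a\-c]", "b"))   # False: no range is formed, so 'b' is excluded
print(allowed(r"[a-c]", "b"))    # True: an unescaped '-' in the middle forms a range
print(allowed(r"[a-c]", "-"))    # False: the hyphen itself is not in the range
```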

The ISBN example can now be shortened still further:

<pattern value="0-201-[0-9]{5}-[0-9x]" /> 

Alternatively, here is a simplified ISBN pattern that unfortunately allows the 'x' check digit to occur anywhere in the code and does not specify exactly three hyphens, as earlier examples did:

<pattern value="[0-9x-]{13}" /> 

A more generalized ISBN code can also now be represented. In this case, no assumption is made as to the area or publisher, so the exact number of digits in each part of the code cannot be known:

<pattern value="[0-9]+-[0-9]+-[0-9]+-[0-9x]" /> 
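The differences between the three ISBN patterns become clearer when they are run against the same values. The sketch below assumes Python's 're' module as a stand-in for a schema validator. The names 'FIXED', 'LOOSE', 'GENERAL', and 'matches' are hypothetical, and the sample values are made up for illustration; their check digits have not been verified.

```python
import re

FIXED = r"0-201-[0-9]{5}-[0-9x]"          # fixed area and publisher
LOOSE = r"[0-9x-]{13}"                    # any 13 digits, 'x' characters, or hyphens
GENERAL = r"[0-9]+-[0-9]+-[0-9]+-[0-9x]"  # four parts of unknown length

def matches(pattern: str, value: str) -> bool:
    """Return True if the whole value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

samples = ["0-201-77059-4", "0-13-110362-8", "x-201-77059-4", "0201770594---"]
for value in samples:
    results = [matches(p, value) for p in (FIXED, LOOSE, GENERAL)]
    print(value, results)
```

Only the first value satisfies all three patterns. The last two satisfy only the loose pattern, which illustrates its weakness: it accepts a misplaced 'x' and hyphens in any position.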

An XML character reference such as '&#123;' or '&#xAA;' can be included in a range. This is particularly useful for representing characters that are difficult, or even impossible, to enter directly from a keyboard. Note, however, that the escape sequences that serve the same purpose in similar expression languages, such as the '\uXXXX' form, are not supported here.
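Character references work in a range because the XML parser replaces each one with the character it stands for before the pattern is interpreted. The following sketch illustrates this using Python's standard 'xml.etree.ElementTree' parser and 're' module as stand-ins for a schema processor. The range from '&#xC0;' to '&#xFF;' and the name 'in_range' are assumptions chosen for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical facet: a range from U+00C0 to U+00FF written with character references.
facet = ET.fromstring('<pattern value="[&#xC0;-&#xFF;]+" />')

# The parser has already replaced each reference with the character itself.
pattern = facet.get("value")
print(len(pattern))  # 6: '[', U+00C0, '-', U+00FF, ']', '+'

def in_range(value: str) -> bool:
    """Return True if every character lies between U+00C0 and U+00FF."""
    return re.fullmatch(pattern, value) is not None

print(in_range("\u00e9\u00e8"))  # True: both accented letters are inside the range
print(in_range("e"))             # False: plain 'e' is below U+00C0
```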

A range can still be used even when some of the characters within it are not valid. Individual characters can be selectively removed from the range by following it with a subclass that has a '-' prefix. For example:

<pattern value="[a-z-[aeiou]]+" /> 

This pattern removes all vowels from the set of valid options. That includes the 'a' character, even though it appears in the range, where it is needed to mark the start of the range:

<Consonants>bcd</Consonants>       <!-- OK --> 
<Consonants>xyz</Consonants>       <!-- OK -->
<Consonants>abcdefgh</Consonants>  <!-- ERROR --> 
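Not every expression language supports this subtraction syntax. Python's 're' module, for instance, does not, so the sketch below swaps in a different technique: a negative lookahead that rejects a vowel before the range is tested. It is offered only as a way to reproduce the three results above, and the names 'CONSONANTS' and 'is_consonants' are hypothetical.

```python
import re

# Emulates '[a-z-[aeiou]]+': each character must be in a-z and must not be a vowel.
CONSONANTS = re.compile(r"(?:(?![aeiou])[a-z])+")

def is_consonants(value: str) -> bool:
    """Return True if the whole value consists of lowercase consonants."""
    return CONSONANTS.fullmatch(value) is not None

print(is_consonants("bcd"))       # True
print(is_consonants("xyz"))       # True
print(is_consonants("abcdefgh"))  # False: contains 'a' and 'e'
```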

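Such a subclass can also be subtracted from a negative character class. This can look confusing, because the negation applies only to the group before the '-', and the subtraction is performed afterwards. So '[^a-z-[aeiou]]' first allows every character that is not a lowercase letter and then removes the vowels, which the negation had already excluded. Neither consonants nor vowels are allowed, and the expression is equivalent to '[^a-z]'.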
