This chapter is from the book

15.8 Subexpressions

A complete expression can be embedded within another expression, creating a subexpression. The embedded expression is enclosed by parentheses, '(' and ')'. On its own, however, a subexpression has no effect on the complete pattern. The following two examples are functionally identical:

abcde 
a(bcd)e 

Subexpressions serve at least two purposes: they allow a sequence to be made optional or repeatable, and they allow branches to be inserted into the middle of a larger expression.

Quantified groups

One reason for using a group is to give the enclosed tokens a quantifier. The whole group may be optional or repeatable. The same techniques are used as for single atoms:

a(bcd)?e 
a(bcd){5,9}e 

Note that the first example above is not equivalent to the expression 'ab?c?d?e'. With 'a(bcd)?e', the characters 'b', 'c' and 'd' must all be present (in that order) or must all be absent, whereas 'ab?c?d?e' allows each of them to be present or absent independently.
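The difference between a quantified group and separately quantified characters can be checked with a short sketch. It uses Python's 're' module as a stand-in for an XML Schema validator, which is an assumption: the two regular expression dialects differ in places, but they agree on the constructs used here. Because a pattern facet must match the whole value, 'fullmatch' is used. The helper name 'accepts' is purely illustrative.

```python
import re

# A pattern facet must match the entire value, hence fullmatch below.
grouped = re.compile(r"a(bcd)?e")      # 'bcd' present as a unit, or absent
loose = re.compile(r"ab?c?d?e")        # each letter independently optional
repeated = re.compile(r"a(bcd){5,9}e") # 'bcd' repeated five to nine times

def accepts(pattern, value):
    """Return True when the whole value matches the compiled pattern."""
    return pattern.fullmatch(value) is not None

for value in ["ae", "abcde", "abe", "acde"]:
    print(value, accepts(grouped, value), accepts(loose, value))

print(accepts(repeated, "a" + "bcd" * 5 + "e"))  # five repetitions
print(accepts(repeated, "a" + "bcd" * 4 + "e"))  # only four repetitions
```

Only 'ae' and 'abcde' satisfy the grouped form, while all four values satisfy the looser expression.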

An ISBN code might be allowed to be incomplete if the publisher part of the code can be implied:

<pattern value="(0-201-)?[0-9]{5}-[0-9x]" /> 
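As a rough check of this pattern, the following sketch again assumes that Python's 're' module behaves like the schema's expression language for these constructs. The sample values and the function name 'is_valid' are made up for illustration.

```python
import re

# The optional group lets the publisher prefix be given in full or omitted.
isbn = re.compile(r"(0-201-)?[0-9]{5}-[0-9x]")

def is_valid(value):
    """Return True when the whole value matches the ISBN pattern."""
    return isbn.fullmatch(value) is not None

print(is_valid("0-201-77059-4"))  # complete code
print(is_valid("77059-4"))        # publisher part implied
print(is_valid("201-77059-4"))    # prefix only partly present
```

The third value is rejected because the group '(0-201-)' must be present in full or not at all.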

Branching groups

A group is useful when several options are required at a particular location in the pattern, because a subexpression can contain branches. Consider the following example:

abc(1|2|3)d 

This pattern matches the values 'abc1d', 'abc2d', and 'abc3d'. Of course, with only a single character in each branch, this is just an alternative for the more succinct pattern 'abc[123]d'. However, that much simpler technique cannot work for multicharacter scenarios. In the following example, the values allowed are 'abc111d', 'abc222d', and 'abc333d':

abc(111|222|333)d 
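The following sketch compares the branching forms with their character class counterparts. It assumes Python's 're' module as a stand-in for a schema validator. The fourth pattern, 'abc[123]{3}d', is not taken from the text; it is added only to show why a character class cannot replace multicharacter branches.

```python
import re

branch = re.compile(r"abc(1|2|3)d")
char_class = re.compile(r"abc[123]d")
multi = re.compile(r"abc(111|222|333)d")
repeated_class = re.compile(r"abc[123]{3}d")  # illustrative comparison only

def accepts(pattern, value):
    """Return True when the whole value matches the compiled pattern."""
    return pattern.fullmatch(value) is not None

# Single-character branches and the character class accept the same values.
for value in ["abc1d", "abc2d", "abc3d", "abc4d"]:
    print(value, accepts(branch, value), accepts(char_class, value))

# A repeated class also admits mixed digits, which the branches exclude.
print(accepts(multi, "abc222d"))
print(accepts(multi, "abc123d"))
print(accepts(repeated_class, "abc123d"))
```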

Each branch is a complete expression, and may also contain subexpressions, though this is only needed when there are fixed characters before or after the embedded options:

...(...|aaa(...|...|...)zzz|...)... 

An ISBN code for any book published in France (area code '2') or Poland (area code '83') is quite straightforward to express. Unfortunately, the following formulation permits a missing or extra digit in the publisher or book code. It also does not prevent the hyphen that should separate these two parts from actually occurring before or after them both:

<pattern value="(2|83)-[0-9-]{7,8}-[0-9x]" /> 
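The looseness of this pattern can be observed with a small sketch, assuming once more that Python's 're' module matches these constructs in the same way as a schema validator. All of the sample values are invented for illustration.

```python
import re

# The pattern exactly as given above.
isbn = re.compile(r"(2|83)-[0-9-]{7,8}-[0-9x]")

def is_valid(value):
    """Return True when the whole value matches the pattern."""
    return isbn.fullmatch(value) is not None

samples = [
    "83-1234-567-x",   # hyphen between the two parts
    "83-123-567-x",    # one digit fewer, still accepted
    "83--1234567-x",   # separating hyphen before both parts
    "83-1234567--x",   # separating hyphen after both parts
    "44-1234-567-x",   # area code not covered by the branches
]
for value in samples:
    print(value, is_valid(value))
```

Only the last value is rejected, because the class '[0-9-]' treats the hyphen as just another permitted character and so cannot fix its position.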

15.9 Character Class Escapes

There are various categories of character class escape. The simplest kind, single character escape, has already been discussed. This is an escape sequence for a single character that has a significant role in the expression language, such as '\{' to represent the '{' character (they are listed and discussed in more detail above). The other escape types are

  • multicharacter escapes (such as '\S' (non-whitespace) and '.' (non-line-ending character));

  • general category escapes (such as '\p{L}' and '\p{Lu}') and complementary general category escapes (such as '\P{L}' and '\P{Lu}');

  • block category escapes (such as '\p{IsBasicLatin}' and '\p{IsTibetan}') and complementary block category escapes (such as '\P{IsBasicLatin}' and '\P{IsTibetan}').

Multicharacter escapes

For convenience, a number of short escape codes are provided, each representing a very common set of characters, including

  • non-line-ending characters;

  • whitespace characters and non-whitespace characters;

  • initial XML name characters (and all characters except these characters);

  • subsequent XML name characters (and all characters except these characters);

  • decimal digits (and all characters except these digits).

The '.' character represents every character except a newline or carriage-return character. The sequence '.....' therefore represents a string of five characters that is not broken over lines. The simplest possible pattern for an ISBN code would be thirteen dots (ten digits and three hyphens):

<pattern value="............." /> 
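The thirteen-dot pattern can be approximated in Python, with one caveat: by default, Python's '.' excludes only the newline character, not the carriage return. The sketch below therefore spells out the meaning described above as an explicit negated class. The name 'XSD_DOT' and the sample values are assumptions made for this illustration.

```python
import re

# Python's '.' excludes only '\n', so the meaning described in the text is
# written out as "any character except newline or carriage return".
XSD_DOT = r"[^\n\r]"
thirteen_dots = re.compile(XSD_DOT * 13)

def accepts(value):
    """Return True when the value is thirteen non-line-ending characters."""
    return thirteen_dots.fullmatch(value) is not None

print(accepts("0-201-77059-4"))    # ten digits and three hyphens
print(accepts("abcdefghijklm"))    # any thirteen characters will do
print(accepts("0-201-77059\r4"))   # broken by a carriage return
```

The second value shows how little this pattern actually constrains: it checks the length and nothing else.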

The remaining multicharacter escape characters are escaped in the normal way: by a '\' symbol. They are all defined in pairs, with a lowercase letter representing a particular common requirement, and the equivalent uppercase letter representing the opposite effect.

The escape sequence '\s' represents any whitespace character, including the space, tab, newline and carriage-return characters. The '\S' sequence therefore represents any non-whitespace character.

The escape sequence '\i' represents any XML initial name character ('_', ':', or a letter). The '\I' sequence therefore represents any XML noninitial character. Similarly, the escape sequence '\c' represents any XML name character, and '\C' represents any non-XML name character.

The escape sequence '\d' represents any decimal digit. It is equivalent to '\p{Nd}' (see below). The '\D' sequence therefore represents any other character. The ISBN examples can now be shortened still further. Note that an escape sequence can even be placed within a character class (though not as the start or end of a range of characters). In the following example, this is used to indicate that the check digit may be a digit instead of the letter 'x':

<pattern value="\d*-\d*-\d*-[\dx]" /> 
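A brief sketch shows both the convenience and the permissiveness of this shortened pattern. It assumes Python's 're' module, where '\d' may also be used inside a character class. The sample values are invented.

```python
import re

# '\d' appears both on its own and inside the final character class.
isbn = re.compile(r"\d*-\d*-\d*-[\dx]")

def is_valid(value):
    """Return True when the whole value matches the shortened pattern."""
    return isbn.fullmatch(value) is not None

print(is_valid("0-201-77059-4"))  # digit as check character
print(is_valid("0-201-77059-x"))  # letter 'x' as check character
print(is_valid("---x"))           # '*' allows every digit group to be empty
print(is_valid("0-201-77059-X"))  # uppercase 'X' is not in the class
```

Because '*' permits zero occurrences, the pattern fixes only the number of hyphens and the final character. A quantifier such as '{5}' is needed wherever a length must be enforced.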

The escape sequence '\w' represents all characters except punctuation, separators, and 'other' characters (using a mixture of techniques described above and below, this is equivalent to '[&#x0000;-&#x10FFFF;-[\p{P}\p{Z}\p{C}]]'), whereas the '\W' sequence represents only these characters.
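Python's own '\w' follows a different definition (it accepts the underscore, for example), so the definition given above is better emulated directly from the Unicode general categories. The sketch below uses the standard 'unicodedata' module. The function name is hypothetical, and the results reflect whichever Unicode version the Python build ships with.

```python
import unicodedata

def is_xsd_word_char(ch):
    """True when ch lies outside the P, Z and C general categories."""
    # category() returns a two-letter code such as 'Lu' or 'Zs'; the first
    # letter identifies the major grouping.
    return unicodedata.category(ch)[0] not in "PZC"

for ch in ["a", "9", "+", "_", "-", " ", "\n"]:
    print(repr(ch), unicodedata.category(ch), is_xsd_word_char(ch))
```

Under this definition a symbol such as '+' counts as a word character, whereas '_' and '-' do not, because both are classed as punctuation.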

Quantifiers can be used with these escape sequences. For example, '\d{5}' specifies that five decimal digits are required.

Category escapes

The escape sequence '\p' or '\P' introduces a category escape set. A category token is enclosed within curly brackets, '{' and '}'. These tokens represent predefined sets of characters, such as all uppercase letters (a general kind of category escape) or the Tibetan character set (a block from the Unicode character set).

General category escapes

A general category escape is a reference to a predefined set of characters, such as the uppercase letters, or all of the punctuation characters. These sets of characters have special names, such as 'Lu' for uppercase letters, and 'P' for all punctuation. For example, '\p{Lu}' represents all uppercase letters, and '\P{Lu}' represents all characters except uppercase letters.

Single letter codes are used for major groupings, such as 'L' for all letters (of which uppercase letters are just a subset). The full set of options is listed below:

L         All Letters
    Lu    uppercase
    Ll    lowercase
    Lt    titlecase
    Lm    modifier
    Lo    other

M         All Marks
    Mn    nonspacing
    Mc    spacing combining
    Me    enclosing

N         All Numbers
    Nd    decimal digit
    Nl    letter
    No    other

P         All Punctuation
    Pc    connector
    Pd    dash
    Ps    open
    Pe    close
    Pi    initial quote
    Pf    final quote
    Po    other

Z         All Separators
    Zs    space
    Zl    line
    Zp    paragraph

S         All Symbols
    Sm    math
    Sc    currency
    Sk    modifier
    So    other

C         All Others
    Cc    control
    Cf    format
    Co    private use

For details see http://www.unicode.org/Public/3.1-Update/UnicodeCharacter-Database-3.1.0.html
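Python's 're' module does not support the '\p{...}' and '\P{...}' syntax, so the general category escapes can only be emulated there. The following sketch does so with the standard 'unicodedata' module. The function names are hypothetical, and the categories come from the Unicode version bundled with the Python build, which may differ from the version a schema processor uses.

```python
import unicodedata

def in_category(ch, code):
    """Emulate a general category escape for a single character."""
    # A one-letter code such as 'L' is a prefix of every subcategory
    # ('Lu', 'Ll', ...), while a two-letter code must match exactly.
    return unicodedata.category(ch).startswith(code)

def not_in_category(ch, code):
    """Emulate the complementary escape."""
    return not in_category(ch, code)

print(in_category("A", "Lu"))      # uppercase letter
print(in_category("a", "Lu"))      # lowercase, so not in 'Lu'
print(in_category("a", "L"))       # but still a letter
print(in_category("5", "Nd"))      # decimal digit
print(not_in_category("a", "Lu"))  # complement of 'Lu'
```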

Block category escapes

The Unicode character set is divided into many significant groupings such as musical symbols, Braille characters, and Tibetan characters. A keyword is assigned to each group, for example, 'MusicalSymbols', 'BraillePatterns', and 'Tibetan'.

The following table lists the full set of keywords in alphabetical order:

AlphabeticPresentationForms
Arabic
ArabicPresentationForms-A
ArabicPresentationForms-B
Armenian
Arrows
BasicLatin
Bengali
BlockElements
Bopomofo
BopomofoExtended
BoxDrawing
BraillePatterns
ByzantineMusicalSymbols
Cherokee
CJKCompatibility
CJKCompatibilityForms
CJKCompatibilityIdeographs
CJKCompatibilityIdeographsSupplement
CJKRadicalsSupplement
CJKSymbolsandPunctuation
CJKUnifiedIdeographs
CJKUnifiedIdeographsExtensionA
CJKUnifiedIdeographsExtensionB
CombiningDiacriticalMarks
CombiningHalfMarks
CombiningMarksforSymbols
ControlPictures
CurrencySymbols
Cyrillic
Deseret
Devanagari
Dingbats
EnclosedAlphanumerics
EnclosedCJKLettersandMonths
Ethiopic
GeneralPunctuation
GeometricShapes
Georgian
Gothic
Greek
GreekExtended
Gujarati
Gurmukhi
HalfwidthandFullwidthForms
HangulCompatibilityJamo
HangulJamo
HangulSyllables
Hebrew
HighPrivateUseSurrogates
HighSurrogates
Hiragana
IdeographicDescriptionCharacters
IPAExtensions
Kanbun
KangxiRadicals
Kannada
Katakana
Khmer
Lao
Latin-1Supplement
LatinExtended-A
LatinExtended-B
LatinExtendedAdditional
LetterlikeSymbols
LowSurrogates
Malayalam
MathematicalAlphanumericSymbols
MathematicalOperators
MiscellaneousSymbols
MiscellaneousTechnical
Mongolian
MusicalSymbols
Myanmar
NumberForms
Ogham
OldItalic
OpticalCharacterRecognition
Oriya
PrivateUse (three separate sets)
Runic
Sinhala
SmallFormVariants
SpacingModifierLetters
Specials (two separate sets)
SuperscriptsandSubscripts
Syriac
Tags
Tamil
Telugu
Thaana
Thai
Tibetan
UnifiedCanadianAboriginalSyllabics
YiRadicals
YiSyllables


A reference to one of these blocks uses a keyword formed from the prefix 'Is' followed by a name from the list above, such as 'Tibetan'. For example, '\p{IsTibetan}' represents any Tibetan character and '\P{IsTibetan}' represents any character not from this set.
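Block escapes are likewise unavailable in Python's 're' module, but each block is simply a contiguous range of code points, so membership can be sketched with a lookup table. Only two blocks are included below, with their ranges taken from the Unicode block list. The names 'BLOCKS' and 'in_block' are hypothetical.

```python
# Code point ranges for two of the blocks named above.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Tibetan": (0x0F00, 0x0FFF),
}

def in_block(ch, name):
    """Emulate a block escape: True when ch falls inside the named block."""
    low, high = BLOCKS[name]
    return low <= ord(ch) <= high

print(in_block("A", "BasicLatin"))     # U+0041
print(in_block("\u0F40", "Tibetan"))   # U+0F40, a Tibetan letter
print(in_block("\u00E9", "BasicLatin"))  # U+00E9 lies in Latin-1Supplement
```

The complementary escape corresponds to negating the result of 'in_block'.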
