Java's Lexical Structure
Lexicographer: A writer of dictionaries, a harmless drudge. Samuel Johnson, Dictionary (1755)
This chapter specifies the lexical structure of the Java programming language. Programs are written in Unicode (§ 3.1), but lexical translations are provided (§ 3.2) so that Unicode escapes (§ 3.3) can be used to include any Unicode character using only ASCII characters. Line terminators are defined (§ 3.4) to support the different conventions of existing host systems while maintaining consistent line numbers.
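As a minimal sketch of what the Unicode escape translation means in practice, the following class is written using only ASCII characters yet refers to non-ASCII characters. The class name UnicodeEscapeDemo and its fields are hypothetical names chosen for illustration; they do not come from the specification.

```java
// Hypothetical illustration: every character in this source file is ASCII.
class UnicodeEscapeDemo {
    // The escape \u0041 is translated to the letter A before tokens are formed,
    // so the two constants below hold the same character.
    static final char PLAIN = 'A';
    static final char ESCAPED = '\u0041';

    // An escape may also appear inside an identifier. The field name below
    // ends in U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
    static final String caf\u00e9 = "coffee";

    // A string literal containing one non-ASCII character, written in ASCII.
    static final String E_ACUTE = "\u00e9";

    public static void main(String[] args) {
        System.out.println(PLAIN == ESCAPED);      // compares the two char constants
        System.out.println(E_ACUTE.length());      // number of UTF-16 code units
        System.out.println(caf\u00e9);             // field accessed through the escaped name
    }
}
```

Because the translation happens before any other lexical processing, escapes are also recognized inside comments, which is why the comments above use only well-formed escapes.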
The Unicode characters resulting from the lexical translations are reduced to a sequence of input elements (§ 3.5), which are white space (§ 3.6), comments (§ 3.7), and tokens. The tokens are the identifiers (§ 3.8), keywords (§ 3.9), literals (§ 3.10), separators (§ 3.11), and operators (§ 3.12) of the syntactic grammar.
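The division into input elements can be illustrated with a short, hypothetical class (InputElementsDemo and its method names are assumptions made for this sketch). The comments label each kind of token, and the two methods show that white space and comments separate tokens without otherwise affecting them.

```java
// Hypothetical illustration of input elements: white space, comments, and tokens.
class InputElementsDemo {
    // Tokens in the body of spaced():
    //   keywords:    int, return
    //   identifiers: x
    //   literals:    40, 2
    //   separators:  ;
    //   operators:   =, +
    static int spaced() {
        int x = 40 + 2;
        return x;
    }

    // The same token sequence with most white space removed and a comment
    // placed between two tokens. Comments and white space are input elements,
    // but they are not tokens, so the method behaves identically.
    static int dense() {
        int x=40+/* a comment between tokens */2;return x;
    }

    public static void main(String[] args) {
        System.out.println(spaced() == dense());
    }
}
```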
3.1 Unicode
Programs are written using the Unicode character set. Information about this character set is available from the Unicode Consortium.
Versions of the Java programming language prior to 1.1 used Unicode version 1.1.5 (see The Unicode Standard: Worldwide Character Encoding (§ 1.4) and updates). Later versions prior to JDK version 1.1.7 used Unicode version 2.0. Since JDK version 1.1.7, Unicode 2.1 has been in use. The Java platform will track the Unicode specification as it evolves. The precise version of Unicode used by a given release is specified in the documentation of the class Character.
Except for comments (§ 3.7), identifiers, and the contents of character and string literals (§ 3.10.4, § 3.10.5), all input elements (§ 3.5) in a program are formed only from ASCII characters (or from Unicode escapes (§ 3.3) that result in ASCII characters). ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode character encoding are the ASCII characters.
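The relationship between ASCII and the first 128 Unicode characters can be checked with a small sketch. The class AsciiSubsetDemo and its helper methods are hypothetical names introduced here; the only library facilities assumed are java.nio.charset.StandardCharsets, Character.toChars, and String.chars.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: the first 128 Unicode code points are the ASCII characters.
class AsciiSubsetDemo {
    // Returns true if every code point from 0 to 127 encodes in US-ASCII
    // to a single byte whose value equals the code point.
    static boolean first128AreAscii() {
        for (int cp = 0; cp < 128; cp++) {
            String s = new String(Character.toChars(cp));
            byte[] encoded = s.getBytes(StandardCharsets.US_ASCII);
            if (encoded.length != 1 || encoded[0] != cp) {
                return false;
            }
        }
        return true;
    }

    // Returns true if the text consists only of ASCII characters.
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        System.out.println(first128AreAscii());
        System.out.println(isAscii("return"));   // a keyword is spelled with ASCII characters
        System.out.println(isAscii("\u00e9"));   // a non-ASCII character, legal in a string literal
    }
}
```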