Regular Expressions

Because it was impossible to build a properly working lexical analyzer using JavaScript's built-in RegExp object, JS/CC features its own implementation of regular-expression processing. For the same reason, not all features of the JavaScript RegExp object are provided; back-references and predefined character classes, for example, are missing.

The symbols and operators to be used within JS/CC's own regular-expression language are summarized in the following table. They form a minimal implementation of a regular-expression engine.

Language Element  Description
Character         A single character matches exactly that character. To match a regular-expression operator such as + or * literally, escape it with \.
\ascii-code       One character, defined by its ASCII code; e.g., "\220" matches Ü (U-umlaut) of the extended ASCII table.
\character        Escaped character. Must be used when a character of the meta-language itself should be matched, e.g., "\|".
.                 Any character (character class matching all available characters).
[…]               Character class. If it begins with a circumflex (^), the character class is negated. Character ranges can be specified using a dash. For example, "[A-Za-z]" specifies all upper-case and lower-case letters of the alphabet.
(…)               Sub-expression.
|                 Or-operator. Allows alternative expressions to be specified at the same level.
*                 Kleene-closure operator (none or many), placed after a character, character class, or sub-expression.
+                 Positive-closure operator (one or many), placed after a character, character class, or sub-expression.
?                 Optional-closure operator (one or none), placed after a character, character class, or sub-expression.
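
The character-class rules in the table (ranges written with a dash, negation with a leading circumflex) can be illustrated with a small Python sketch. This is not JS/CC's implementation; the function name expand_char_class and the assumption that the "available characters" are the 256 single-byte codes are hypothetical and chosen only for illustration. Escape sequences inside the class are not handled.

```python
def expand_char_class(spec, alphabet=None):
    """Return the set of characters matched by a class like "[A-Za-z]" or "[^0-9]"."""
    if alphabet is None:
        # Assumption: the available characters are the 256 single-byte codes.
        alphabet = {chr(code) for code in range(256)}
    if len(spec) < 2 or not (spec.startswith("[") and spec.endswith("]")):
        raise ValueError("a character class must be enclosed in [ and ]")

    body = spec[1:-1]
    negated = body.startswith("^")
    if negated:
        body = body[1:]

    members = set()
    i = 0
    while i < len(body):
        # A dash between two characters denotes an inclusive range.
        if i + 2 < len(body) and body[i + 1] == "-":
            first, last = ord(body[i]), ord(body[i + 2])
            members.update(chr(code) for code in range(first, last + 1))
            i += 3
        else:
            members.add(body[i])
            i += 1

    # A leading circumflex selects every character that is NOT listed.
    return alphabet - members if negated else members


letters = expand_char_class("[A-Za-z]")
non_digits = expand_char_class("[^0-9]")
```

Here letters contains the 52 upper-case and lower-case letters, while non_digits contains every character of the assumed alphabet except the ten digits.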

To allow case-insensitive keywords within grammar definitions, a terminal symbol definition can be specified using single-quoted ('…') or double-quoted ("…") strings. A single-quoted string means that the terminal symbol is matched case-sensitively, while a double-quoted string matches the terminal regardless of case. For example, the terminal symbol definition "PRINT" will match Print, print, PrINT, and PRINT, while the definition 'PRINT' will only match PRINT itself.
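
One way to picture the difference between the two quoting styles is to expand a double-quoted keyword into a sequence of two-letter character classes. The Python sketch below does this and checks the result with Python's own re module; the helper name terminal_pattern is hypothetical, and whether JS/CC performs the same expansion internally is an assumption.

```python
import re


def terminal_pattern(definition):
    """Turn a quoted terminal definition into an equivalent Python regex.

    'PRINT' (single quotes) is matched case-sensitively,
    "PRINT" (double quotes) is matched in any combination of cases.
    """
    if len(definition) < 2 or definition[0] not in "'\"" or definition[-1] != definition[0]:
        raise ValueError("definition must be a single- or double-quoted string")

    case_insensitive = definition[0] == '"'
    parts = []
    for ch in definition[1:-1]:
        if case_insensitive and ch.isascii() and ch.isalpha():
            # Each letter becomes a class holding both of its cases.
            parts.append("[" + ch.upper() + ch.lower() + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


insensitive = terminal_pattern('"PRINT"')
sensitive = terminal_pattern("'PRINT'")
```

With these definitions, insensitive is the pattern [Pp][Rr][Ii][Nn][Tt], which accepts Print, print, PrINT, and PRINT, whereas sensitive accepts only PRINT.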

From these regular-expression definitions, JS/CC constructs a deterministic finite automaton that acts as the lexer in the resulting parser.

Ambiguous Regular Expressions

If the terminal definition part contains ambiguous regular expressions (several expressions that match the same string), the expressions defined first take precedence over those defined later. It is therefore recommended to define the most specialized tokens first and the most general tokens last in your token definition part.
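
The effect of definition order can be sketched in a few lines of Python. The example uses Python's re module in place of JS/CC's own engine, and the terminal names PRINT and Identifier as well as the function classify are hypothetical; it only illustrates the "first definition wins" rule described above.

```python
import re


def classify(lexeme, terminals):
    """Return the name of the first terminal whose pattern matches the whole lexeme."""
    for name, pattern in terminals:
        if re.fullmatch(pattern, lexeme):
            return name
    return None


# Both patterns match the string "PRINT", so the definitions are ambiguous.
specialized_first = [("PRINT", r"PRINT"), ("Identifier", r"[A-Za-z]+")]
general_first = [("Identifier", r"[A-Za-z]+"), ("PRINT", r"PRINT")]

keyword = classify("PRINT", specialized_first)   # the keyword is defined first
shadowed = classify("PRINT", general_first)      # the identifier is defined first
```

In the second ordering the general Identifier terminal shadows the keyword, so PRINT can never be recognized as a keyword; this is why the specialized definition should come first.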

Associativity and Precedence

Tokens can be grouped by precedence level and associativity. Weighting terminal symbols in this way resolves grammar conflicts, which allows faster and even smaller grammars to be written.

A group without a group specifier assigns no associativity and a precedence level of zero to all terminal symbols in the group (as in the first example).

Otherwise, if a group begins with the symbol < for left-associativity, > for right-associativity, or ^ for non-associativity, all terminal symbols within this group are assigned the corresponding associativity and precedence level. The precedence level is incremented each time a new group of one of these three types is opened, so groups defined at the bottom of the token definition part take the highest precedence.
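
The numbering rule for groups can be sketched as follows. The data layout (a list of pairs holding a group specifier and its terminals), the function name assign_precedence, and the terminal names are hypothetical; the sketch also assumes that the first group with a specifier receives level 1, which follows from the rule above but is not stated explicitly.

```python
ASSOCIATIVITY = {"<": "left", ">": "right", "^": "nonassoc"}


def assign_precedence(groups):
    """Map each terminal to an (associativity, precedence level) pair.

    `groups` lists (specifier, terminals) pairs in definition order;
    the specifier is "<", ">", "^", or None for a plain group.
    """
    table = {}
    level = 0
    for specifier, terminals in groups:
        if specifier in ASSOCIATIVITY:
            # Every new associativity group opens the next precedence level.
            level += 1
            entry = (ASSOCIATIVITY[specifier], level)
        else:
            # Plain groups carry no associativity and precedence level zero.
            entry = (None, 0)
        for terminal in terminals:
            table[terminal] = entry
    return table


table = assign_precedence([
    (None, ["INT"]),
    ("^", ["=="]),
    ("<", ["+", "-"]),
    ("<", ["*", "/"]),
])
```

In this example * and / end up at level 3 and therefore bind more tightly than + and - at level 2, because their group is defined later.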

The precedence and associativity information is used to resolve conflicts in ambiguous grammars by modifying the parse table's natural content. How this works in practice is described in The Grammar Definition Part, in the section dealing with grammar conflicts and their handling.

Whitespace

The exclamation mark (!) introduces a special type of terminal symbol: the whitespace symbols.

Only a regular expression is allowed in this definition; a label or code part is prohibited. Whitespace tokens specify terminals that should always be ignored, e.g., blanks, tabs, or comments.
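
How whitespace symbols behave during lexing can be sketched with a small Python tokenizer. It uses Python's re module rather than the automaton JS/CC generates, and the names tokenize, INT, and PLUS are hypothetical. The longest-match strategy is an assumption made for this sketch; ties go to the pattern listed first.

```python
import re

WHITESPACE = object()  # marker for patterns whose matches are discarded


def tokenize(text, terminals, whitespace_patterns):
    """Split text into (name, lexeme) pairs, silently dropping whitespace matches."""
    candidates = [(WHITESPACE, p) for p in whitespace_patterns] + list(terminals)
    tokens, pos = [], 0
    while pos < len(text):
        best_label, best_end = None, pos
        for label, pattern in candidates:
            match = re.compile(pattern).match(text, pos)
            # Keep the longest match; on a tie the earlier pattern stays.
            if match and match.end() > best_end:
                best_label, best_end = label, match.end()
        if best_end == pos:
            raise ValueError(f"no terminal matches at position {pos}")
        if best_label is not WHITESPACE:
            tokens.append((best_label, text[pos:best_end]))
        pos = best_end
    return tokens


terminals = [("INT", r"[0-9]+"), ("PLUS", r"\+")]
whitespace = [r"[ \t]+", r"//[^\n]*"]  # blanks, tabs, and line comments
tokens = tokenize("1 +\t22 // sum", terminals, whitespace)
```

The resulting token list contains only the INT and PLUS tokens; the blanks, the tab, and the trailing comment never reach the parser.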