The second — and most important — part of a JS/CC parser definition is the grammar definition
part. In this part, below the ##
symbol, the definition of the context-free grammar to be parsed
by the generated parser is described. This is done by using a Backus-Naur-Form variant meta-language, by defining
non-terminals and their productions.
The general syntax for a non-terminal and its productions is:
non-terminal : production1 semantic-code | production2 semantic-code ⋮ | productionn semantic-code
The non-terminal defines a single-word identifier and acts as the left-hand side for the related productions attached to this symbol.
The production defines a sequence of zero or multiple terminals and non-terminals, defining the different
syntax rules. To specify terminal symbols, it is possible to call them via their (unescaped) regular expression via
a string-value or by their label, which need not (but can) be specified as a string value. In the
example, both methods are used:
FLOAT
is called by its label, while the token '+'
is called by the name generated
from its regular expression.
The semantic-code part behind the productions is optional. It defines an individual semantic code which is
executed right before the according production is reduced to its left-hand side. Same as in the token definition
part, semantic code is enclosed by [*
and *]
and will be concatenated to one code
segment which is associated with the according productions when multiple semantic code segments are specified in
a row. Read more about this in the section below.
Note that in the above syntax scheme the number of productions is completely variable. At least, one right-hand side must be given to an according left-hand side, although this can be an epsilon production.
Each production is separated by a vertical "pipe" bar (|
) from the others, and a non-terminal
definition must always be closed by a semicolon. Else, JS/CC cannot distinguish whether the next symbol belongs
to the right-hand side it currently parses or whether it reads a new non-terminal definition.
Non-terminal symbols can be called on a right-hand side before they are defined, so the order in which the rules are defined is arbitrary. Integrative checks on the grammar are done by JS/CC before the parse tables are constructed.
Within each semantic action attached to a production, as described above, the values of the right-hand side symbols can be accessed via wild cards, which are replaced by the particular variables and offsets later in the resulting parser.
The %n
wild cards are used to address every individual token starting from the left of the right-hand
side.
The %%
wild card relates to the value of the left-hand side which is pushed to the parser's value stack
right after the right-hand side symbols are popped.
So by passing a value to the %%
wild card causes the inheritance of values from the current right-hand
side to another call of the according non-terminal on a right-hand side within the parse tree. The values on the
current right-hand side will be discarded when the reduction process occurs, and the value attached to
%%
is pushed instead, so it can be used elsewhere.
As example, in the given production:
expr: expr '+' term [* %% = %1 + %3; *]
of the expression calculator, the return values
of the expr
and the term
on the right-hand side, which are addressed via
%1
and %3
(%2
addresses the value of the '+'
terminal) are added together, and the result is stored to the left-hand side (the expr
on the far
left in this case). Thus, you have full control over all individual tokens within each production.
If no individual semantic action is given to a right-hand side, the default action
[* %% = %1; *]
is used.
Semantic action between the symbols of a right-hand side is not allowed, only behind them.
Because JS/CC was designed for use with JavaScript, or any other typeless scripting language, it is not necessary
to define — as in yacc — a special union structure to hold the values on the value stack. Both
built-in primary types like String
or Number
objects as well as user-defined objects
(each function in JavaScript is internally represented by an object) can be pushed to and popped off the value
stack.
So don't confuse the values; you have to know which value stands on which position.
To automatically resolve shift-reduce or reduce-reduce conflicts at parse table generation, JS/CC features, by default, two mechanisms.
When a shift-reduce conflict occurs, JS/CC constructs the parse tables in favor of the shift, so the parse tree will grow right-derivative.
When a reduce-reduce conflict occurs, JS/CC resolves the problem by reducing the production which was defined first, so productions which were defined earlier in the grammar will be reduced in favor when this conflict arises.
As described in The Terminal Declaration Part, JS/CC features the possibility of manipulating the natural parse table generated by the LALR(1) table construction algorithm by weighting terminal symbols with a precedence level and an associativity. This information is used within shift-reduce conflicts to better decide if a shift or a reduce operation should be inserted. If a shift-reduce conflict arises, the precedence level and associativity information is compared with the according production's precedence level because every production will, by default, get the same precedence level as its rightmost terminal symbol.
To explain this behavior, let's look at the following example. It defines a calculator like the first example, but with less effort in writing the grammar. Here, we have only two non-terminal symbols instead of four, but we implement the same operator precedence behavior as in the original one.
/~ Tokens to be ignored (e.g., whitespace, comments) ~/
! ' |\t';
/~ Left-associative tokens, lowest precedence ~/
< '\+'
'\-';
/~ Left-associative tokens, highest precedence ~/
< '\*'
'/';
/~ Tokens with no associativity ~/
'\('
'\)'
'[0-9]+' INT [* %match = parseInt( %match ); *]
'[0-9]+\.[0-9]*|[0-9]*\.[0-9]+' FLOAT [* %match = parseFloat ( %match ); *]
;
##
program: expr [* print( %1 ); *]
;
expr: expr '+' expr [* %% = %1 + %3; *]
| expr '-' expr [* %% = %1 - %3; *]
| expr '*' expr [* %% = %1 / %3; *]
| expr '/' expr [* %% = %1 / %3; *]
| '(' expr ')' [* %% = %2; *]
| INT
| FLOAT
;
Sometimes, it will also be necessary to give a production another precedence level than the one of the
rightmost terminal. For example, if we want to add a unary-minus operator to the grammar above, the
production adopts the precedence level of the minus-symbol by default. But this minus-operator was
configured for its use in a binary subtraction, not in a unary subtraction. By simply adding a new
rule for unary minus to the grammar, most simple expressions will return the right result
(e.g., -2+3
), but in expressions like 4/-4*5
, the result will simply be wrong because,
through our precedence rules for multiplication, the generated parser parses 4/(-(4*5))
instead of (4/(-4))*5
.
To resolve this problem, we need to attach a higher precedence level to the production for unary
minus. For this special case, JS/CC features the &
-directive.
The &
-directive must be specified behind the rule's definition and in front of
the semantic code action (if any). Behind the &
-directive, a terminal symbol
(both as string or its label) is specified, which precedence level is taken by the production
instead of its default value.
So by changing the grammar to:
expr: expr '+' expr [* %% = %1 + %3; *]
| expr '-' expr [* %% = %1 - %3; *]
| expr '*' expr [* %% = %1 / %3; *]
| expr '/' expr [* %% = %1 / %3; *]
| '(' expr ')' [* %% = %2; *]
| '-' expr &'*' [* %% = %2 * -1; *]
| INT
| FLOAT
;
we get the right parse tree and result, because our rule with the unary minus has a higher precedence now and reduces instead of shifting in the desired cases.