Parsing converts the stream of tokens provided by the
scanner into some object. Normally this object is a parse tree, but it
does not have to be. For example, the calculator in the SmaCC tutorial
does not produce a parse tree; it produces the result of the calculation,
a number.
In SmaCC the parser is defined by the grammar specification entered in the 'Parser' tab.
The grammar specification has two parts, an optional directives section and the
production rules. The directives section is used to tell SmaCC how to handle
ambiguous grammars as well as how it should generate the code for the parser.
The production rules section contains the grammar for the parser and the code
that executes when a production rule is matched.
Directives
The optional directives section consists of a set of directives. The system currently has
five directives: %left, %right, %nonassoc, %start, and %id. Each
directive begins with a "%" character and the directive keyword, then lists a set of symbols, and
ends with a semicolon, ";".
Ambiguous Grammars and Precedence
SmaCC can handle ambiguous grammars. Given an ambiguous grammar, SmaCC will
still produce a parser, but that parser may not resolve the ambiguity the way
you intend. For an LR parser, there are two basic types of ambiguity:
reduce/reduce conflicts and shift/reduce conflicts. Reduce/reduce conflicts are
the more serious kind. SmaCC has no directives to handle them and simply picks
one of the choices. These conflicts normally require a rewrite of your grammar.
On the other hand, shift/reduce conflicts can be handled by SmaCC without
rewriting your grammar. When SmaCC encounters a shift/reduce conflict, it
performs the shift action by default. However, you can control this action with
the "%left", "%right", and "%nonassoc" directives. If a token has been declared
in a "%left" directive, the token is left-associative, so the parser will
perform a reduce operation. If it has been declared in a "%right" directive,
the token is right-associative, so the parser will perform a shift operation.
A token declared in a "%nonassoc" directive is non-associative: the parser
reports an error if it encounters such a conflict while parsing. For example,
you may want to specify that the equal operator, "=", is non-associative, so
that "a = b = c" is rejected rather than parsed as a valid expression. All
three directives are followed by a list of tokens.
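As a minimal sketch of the three directives, the fragment below declares one token of each kind. The token choices are illustrative and are not taken from a SmaCC sample grammar. Lines and trailing text starting with # are annotations for the reader; this assumes the parser definition accepts # comments, so remove them if your version does not.

```smacc
# Illustrative tokens only; not from a shipped SmaCC grammar.
%nonassoc "=";   # "a = b = c" is reported as a syntax error
%left "-";       # "a - b - c" groups as "(a - b) - c" (reduce)
%right "^";      # "a ^ b ^ c" groups as "a ^ (b ^ c)" (shift)
```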
Additionally, the %left, %right, and %nonassoc directives allow precedence to be specified.
The order of the directives specifies the precedence of the tokens: tokens
declared on later lines have higher precedence than tokens declared on earlier
lines. For example, the following directive section gives the precedence for
the simple calculator in our tutorial:
%left "+" "-";
%left "*" "/";
%right "^";
The "+" and "-" symbols appear on the first line and have the lowest
precedence. They are also left-associative, so "1 + 2 + 3" will be evaluated as
"(1 + 2) + 3". On the next line are the "*" and "/" symbols. Since they appear
on a later line, they have higher precedence than "+" and "-".
Finally, on line three we have the "^" symbol. It has the highest precedence
and is right-associative. Combining all the rules allows us to parse
"1 + 2 * 3 / 4 ^ 2 ^ 3" as "1 + ((2 * 3) / (4 ^ (2 ^ 3)))".
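To show how these directives combine with production rules, here is a sketch of a calculator grammar in the style of the tutorial. It is a reconstruction for illustration, not a copy of the tutorial source: the <number> token is assumed to be defined in the scanner, and the nonterminal names are chosen here. In the actions, '1' and '3' refer to the values of the first and third items of the production. The # lines are reader annotations and assume the parser definition accepts # comments.

```smacc
%left "+" "-";
%left "*" "/";
%right "^";

# Without the directives above, every binary rule below is ambiguous.
Expression
    : Expression "+" Expression {'1' + '3'}
    | Expression "-" Expression {'1' - '3'}
    | Expression "*" Expression {'1' * '3'}
    | Expression "/" Expression {'1' / '3'}
    | Expression "^" Expression {'1' raisedTo: '3'}
    | "(" Expression ")" {'2'}
    | Number {'1'} ;

# <number> is an assumed scanner token; its value is a SmaCCToken,
# so the action converts the token's text to a number.
Number
    : <number> {'1' value asNumber} ;
```

Because the grammar leaves associativity and precedence entirely to the directives, reordering the three directive lines changes how an expression is grouped without touching the rules.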
Start Symbols
By default, the left-hand side
of the first grammar rule is the start symbol. If you want multiple
start symbols, you can specify them by using the "%start"
directive followed by the nonterminals that are additional start
symbols. This is useful when you would otherwise need two parsers whose
grammars are nearly identical. For example, consider a
Smalltalk parser. You can parse methods, and you can parse
expressions. These are two different operations, but have very similar
grammars. Instead of creating two different parsers for parsing
methods and expressions, we can specify one grammar that parses methods
and also specify another starting position for parsing expressions. The
StParser in the SmaCC Example Parsers package has an example of this.
The StParser class>>parseMethod: uses the
#startingStateForMethod position to parse methods and the StParser
class>>parseExpression: uses the #startingStateForSequenceNode
position to parse expressions.
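The idea can be sketched with a much smaller grammar than the StParser one. Everything below is hypothetical: the rules, the <name> and <number> tokens, and the choice of Expression as the extra start symbol are made up for illustration. The # lines are reader annotations and assume the parser definition accepts # comments.

```smacc
# Expression becomes an additional start symbol.
%start Expression;

# First rule: the default start symbol.
Method
    : <name> Statements ;

Statements
    : Expression
    | Statements "." Expression ;

# Reachable from Method, and also usable as a start symbol on its own.
Expression
    : <name>
    | <number> ;
```

By analogy with the StParser selectors named above, a parser generated from this grammar would be expected to offer a #startingStateForExpression position next to #startingStateForMethod. That selector name is inferred from the naming pattern and has not been checked against generated code.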
Id Methods
Internally, the various token types are represented as integers. However,
there are times when you need to reference a token type from your own code. For
example, in the CScanner and CParser classes, the TYPE_NAME token is lexically
identical to the IDENTIFIER token. The IDENTIFIER matching method does a lookup
in the type table, and if it finds a type definition with the same name as the
current IDENTIFIER, it wants to return the TYPE_NAME token type. To determine
what integer this is, the parser was created with an %id directive for
<IDENTIFIER> and <TYPE_NAME>. This generates the IDENTIFIERId and TYPE_NAMEId
methods on the scanner. These methods simply return the number representing
that token type. See the C sample scanner and parser for a good example of how
this is used.
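Written out, the directive described above would look roughly like the line below. This is a sketch based on the description in this section, not a copy of the CParser definition, and the # line is a reader annotation that assumes the parser definition accepts # comments.

```smacc
# Generates the IDENTIFIERId and TYPE_NAMEId methods on the scanner.
%id <IDENTIFIER> <TYPE_NAME>;
```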
Production Rules
The production rules section contains the grammar for the parser. The first
production rule is considered to be the starting rule for the parser. Each
production rule consists of a non-terminal symbol name followed by a ":"
separator, which is followed by a list of possible productions separated by a
vertical bar, "|", and finally terminated by a semicolon, ";".
Each production consists of a sequence of non-terminal symbols, tokens, or
keywords followed by some optional Smalltalk code enclosed in curly brackets,
{}. Non-terminal symbols are valid Smalltalk variable names and must be defined
somewhere in the parser definition. Forward references are valid. Tokens are
enclosed in angle brackets as they are defined in the scanner (e.g., <token>)
and keywords are enclosed in double-quotes (e.g., "then"). A keyword that
contains a double-quote must double each one. For example, a keyword
consisting of a single double-quote character is entered as """" (four
double-quote characters).
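The following sketch puts these pieces together. The rules and the <identifier> token are hypothetical and exist only to show the notation: alternatives separated by "|", keywords in double-quotes, a token in angle brackets, a forward reference, and the doubled double-quote. The # lines are reader annotations and assume the parser definition accepts # comments.

```smacc
# Statement refers to Assignment and Condition before they are defined;
# forward references are valid.
Statement
    : Assignment
    | "if" Condition "then" Statement ;

# <identifier> is an assumed scanner token; ":=" and "=" are keywords.
Assignment
    : <identifier> ":=" <identifier> ;

Condition
    : <identifier> "=" <identifier> ;

# A keyword that is a single double-quote character: four double-quotes.
# Shown on its own; nothing else in this sketch refers to it.
QuoteMark
    : """" ;
```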
The Smalltalk code is evaluated whenever that production is matched. If the
code is a zero-argument or one-argument symbol, then that method is performed.
For a one-argument symbol, the argument is an OrderedCollection that contains
one element for each item in the production. If the code is not a zero-argument
or one-argument symbol, then the code is executed and whatever it returns
is the result of the production. If no Smalltalk code is specified, then the
default action is to execute the #reduceFor: method. This method converts all
items into an OrderedCollection. If one of the items is another
OrderedCollection, then all of its elements are added to the new collection.
Inside the Smalltalk code you can refer to the values of each production item
by using literal strings. The literal string '1' refers to the value of the
first production item. The values for tokens and keywords will be SmaCCToken
objects. The value for all non-terminal symbols will be whatever the Smalltalk
code evaluates to for that non-terminal symbol.
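The four kinds of action described above can be sketched side by side. The rules are hypothetical, <number> is an assumed scanner token, and the selectors #buildNegation: and #emptyList are made-up names for methods you would have to write yourself. The # lines are reader annotations and assume the parser definition accepts # comments.

```smacc
# (1) Ordinary Smalltalk code: its value is the value of the production.
#     '1' and '3' are the values of the two Term items.
Sum
    : Term "+" Term {'1' + '3'} ;

# (2) One-argument symbol: #buildNegation: (hypothetical) is performed with
#     an OrderedCollection holding the "-" token and the Term value.
Negation
    : "-" Term {#buildNegation:} ;

# (3) Zero-argument symbol: #emptyList (hypothetical) is performed.
Empty
    : {#emptyList} ;

# (4) No code: the default #reduceFor: action gathers the three items
#     into an OrderedCollection.
Pair
    : Term "," Term ;

# The token value is a SmaCCToken, so convert its text to a number.
Term
    : <number> {'1' value asNumber} ;
```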
Named Symbols
When entering the Smalltalk code, you can get the value for a symbol by using
the literal strings (e.g., '2'). However, this creates difficulties when
modifying a grammar. If you insert some symbol at the beginning of a production,
then you will need to modify your Smalltalk code changing all literal string
numbers. Instead you can name each symbol in the production and then refer to
the name in the Smalltalk code. To name a symbol (non-terminal, token, or
keyword), add a quoted variable name after the symbol in the grammar. For
example, the following rule creates two named variables, one for the
non-terminal Expression and one for the <number> token:
MySymbol : Expression 'expr' "+" <number> 'num' {expr + num} ;
These variables are then used in the Smalltalk code.
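The maintenance benefit is easier to see with two versions of a rule. The sketch below uses hypothetical rules and assumed <identifier> and <number> scanner tokens; the # lines are reader annotations and assume the parser definition accepts # comments.

```smacc
# Named form: the action does not depend on item positions.
Assignment
    : <identifier> 'target' ":=" Expression 'source' {target value -> source} ;

# The same rule after inserting a leading keyword. The action is unchanged.
# With literal strings, '1' and '3' would have had to become '2' and '4'.
LetAssignment
    : "let" <identifier> 'target' ":=" Expression 'source' {target value -> source} ;

Expression
    : <number> 'literal' {literal value asNumber} ;
```

Here target is a SmaCCToken, so the action sends it value to get the matched text, while source is already whatever the Expression action returned.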
Extended Syntax
SmaCC also has some extended syntax that makes it easier to enter different
grammars. Most of the additions are for the productions, but one change that is
not for productions is the addition of "::=" as the separator between the
non-terminal and the production. The production syntax enhancements are listed
in the following table:
Symbol ?
    Makes the symbol optional. It is equivalent to defining a new
    production rule: "Optional_Symbol : Symbol {'1'} | {nil};".
Symbol * or Symbol +
    Makes a repeating symbol. The "*" repeats zero or more times, and the
    "+" repeats one or more times. It is equivalent to defining a new
    production rule: "Repeat_Symbol : | Repeat_Symbol Symbol ;" for "*" and
    "Repeat_Symbol : Symbol | Repeat_Symbol Symbol ;" for "+".
( Productions )
    Groups the items in Productions. By itself it is not that useful, but
    it can be combined with "?", "*", or "+". It is equivalent to defining
    "Group_Productions : Productions ;".
[ Productions ]
    Equivalent to "( Productions ) ?".
<% Productions %>
    Equivalent to "( Productions ) *".
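As a sketch of how the extended forms shorten a grammar, the fragment below writes a hypothetical call syntax twice: once with the extended syntax and once with plain rules. The rule names and the <identifier> and <number> tokens are assumptions made for illustration, and a real grammar would keep only one of the two spellings. The # lines are reader annotations and assume the parser definition accepts # comments.

```smacc
# Extended form: an optional, comma-separated argument list.
Call ::= <identifier> "(" [ Argument ( "," Argument ) * ] ")" ;

# Roughly the same language spelled out with plain rules.
CallExpanded : <identifier> "(" OptionalArguments ")" ;
OptionalArguments : | Argument MoreArguments ;
MoreArguments : | MoreArguments "," Argument ;

Argument : <identifier> | <number> ;
```

No actions are given, so the values produced by either spelling come from the default action and may be shaped differently; the sketch only compares the languages the two spellings accept.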
Parser Comments
The compile page has three options to generate comments. You should always
select the "Generate definition comments" option. It saves the scanner and parser
definition strings into the scanner and parser classes, which allows your grammar
to be under the same version control system as your Smalltalk code.
The other two comment options should not be needed unless you need to debug a
generated parser. The "generate symbol comments" option will generate a comment
that explains what each symbol is mapped to. When SmaCC compiles a grammar it
converts all symbols into integers. This comment gives you the integer for each
symbol. You may need this information if you have an incorrect scanner
definition. For example, you may have overlapping token definitions and SmaCC is
picking the wrong one (by default it picks the first one in your scanner
definition). When you debug, you can inspect the SmaCCToken object and validate
its "id" with those in the symbol comment. If they are different, then you have
a bug in your scanner.
Finally, the "generate item set comments" option should rarely be needed. It
generates a listing of all LR(1) item sets in the parser. If you are familiar
with LR parsing, then it might be interesting to look at. However, for a
moderately sized grammar (e.g., Java), this comment can be a few MB in size. I
would not recommend generating such comments when using ENVY -- you don't want
to store a 10MB method in your library. For the calculator example in the
tutorial, this comment is 9,000 characters long.
Error Recovery
Normally, when the parser encounters an error, it raises the SmaCCParserError
exception and parsing is immediately stopped. However, there are times when you
may wish to try to parse more of the input. For example, if you are highlighting
code, you do not want to stop highlighting at the first syntax error. Instead
you may wish to attempt to recover after the statement separator -- the period
".". SmaCC uses the error symbol to specify where error recovery should
be attempted. For example, we may have the following rule to specify a list of
Smalltalk statements:
Statements : Expression | Statements "." Expression ;
If we wish to attempt recovery from a syntax error when we encounter a
period, we can change our rule to be:
Statements : Expression | Statements "." Expression | error "." Expression ;
While error recovery allows the parser to continue after a syntax error,
it does not allow you to obtain a parse tree from the input. Once input that
contains errors has been parsed, the parser raises a non-resumable
SmaCCParserError.