SmaCC Parser

Parsing converts the stream of tokens provided by the scanner into some object. Normally, this object will be a parse tree, but it does not have to be a parse tree. For example, the SmaCC tutorial shows a calculator. This calculator does not produce a parse tree; it produces the result, a number.

In SmaCC the parser is defined by the grammar specification entered in the 'Parser' tab. The grammar specification has two parts, an optional directives section and the production rules. The directives section is used to tell SmaCC how to handle ambiguous grammars as well as how it should generate the code for the parser. The production rules section contains the grammar for the parser and the code that executes when a production rule is matched.

Directives

The optional directives section consists of a set of directives. The system currently has five directives: %left, %right, %nonassoc, %start, and %id. Each directive begins with a "%" character followed by the directive keyword, then lists a set of symbols, and ends with a semicolon, ";".

Ambiguous Grammars and Precedence

SmaCC can handle ambiguous grammars. Given an ambiguous grammar, SmaCC will still produce a parser, but that parser may not parse the input the way you intend. For an LR parser, there are two basic types of ambiguities: reduce/reduce conflicts and shift/reduce conflicts. Reduce/reduce conflicts are the more serious kind. SmaCC has no directives to handle them and just picks one of the choices. These conflicts normally require a rewrite of your grammar.

On the other hand, shift/reduce conflicts can be handled by SmaCC without rewriting your grammar. When SmaCC encounters a shift/reduce conflict, it performs the shift action by default. However, you can control this choice with the "%left", "%right", and "%nonassoc" directives, each of which is followed by a list of tokens. If a token has been declared in a "%left" directive, the token is left-associative, so the parser performs a reduce operation. If it has been declared in a "%right" directive, the token is right-associative, so the parser performs a shift operation. A token declared in a "%nonassoc" directive produces an error if that situation is encountered during parsing. For example, you may want to specify that the equal operator, "=", is non-associative, so that "a = b = c" is not parsed as a valid expression.
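
As a minimal sketch of the non-associative case, the fragment below declares "=" with %nonassoc above a deliberately ambiguous rule. The Expression rule and the <identifier> token are hypothetical and assumed to be defined elsewhere; they are not taken from the tutorial.

```smacc
# Hypothetical fragment: "=" is declared non-associative.
%nonassoc "=";

Expression
    : Expression "=" Expression
    | <identifier>
    ;
```

With this declaration, "a = b" should parse, while "a = b = c" should be rejected with a parse error instead of being grouped to either side.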

Additionally, the %left, %right, and %nonassoc directives allow precedence to be specified. The order of the directives specifies the precedence of the tokens: tokens declared on later lines have higher precedence. For example, the following directive section gives the precedence for the simple calculator in our tutorial:

%left "+" "-";
%left "*" "/";
%right "^";

The "+" and "-" symbols appear on the first line and have the lowest precedence. They are also left-associative, so "1 + 2 + 3" will be evaluated as "(1 + 2) + 3". On the next line are the "*" and "/" symbols. Since they appear on a later line, they have higher precedence than "+" and "-". Finally, on line three we have the "^" symbol. It has the highest precedence and is right-associative. Combining all the rules allows us to parse "1 + 2 * 3 / 4 ^ 2 ^ 3" as "1 + ((2 * 3) / (4 ^ (2 ^ 3)))".
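
To see where the ambiguity comes from, the directives can be placed above an expression grammar in which every operator is defined by the same recursive rule. The following is a sketch modeled on the tutorial's calculator; the exact productions, the <number> token, the `value` accessor on the token, and the Smalltalk actions are assumptions and may differ from the tutorial's definition.

```smacc
%left "+" "-";
%left "*" "/";
%right "^";

# Assumed grammar, modeled on the tutorial's calculator.
# Every binary production is ambiguous on its own; the directives above
# decide whether the parser shifts or reduces at each conflict.
Expression
    : Expression "+" Expression {'1' + '3'}
    | Expression "-" Expression {'1' - '3'}
    | Expression "*" Expression {'1' * '3'}
    | Expression "/" Expression {'1' / '3'}
    | Expression "^" Expression {'1' raisedTo: '3'}
    | "(" Expression ")" {'2'}
    | <number> {'1' value asNumber}
    ;
```

Without the three directives, SmaCC would still generate a parser from this grammar, but every conflict would default to a shift, which groups each operator to the right regardless of its usual precedence.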

Start Symbols

By default, the left-hand side of the first grammar rule is the start symbol. If you want multiple start symbols, you can specify them by using the "%start" directive followed by the nonterminals that are additional start symbols. This is useful when you would otherwise create two parsers from two grammars that are similar but slightly different. For example, consider a Smalltalk parser. You can parse methods, and you can parse expressions. These are two different operations, but they have very similar grammars. Instead of creating two different parsers for parsing methods and expressions, we can specify one grammar that parses methods and also specify another starting position for parsing expressions. The StParser in the SmaCC Example Parsers package has an example of this. The StParser class>>parseMethod: uses the #startingStateForMethod position to parse methods and the StParser class>>parseExpression: uses the #startingStateForSequenceNode position to parse expressions.
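
A sketch of such a grammar, using hypothetical rules rather than the actual StParser definition: Method is the default start symbol because its rule comes first, and %start adds SequenceNode as a second entry point. The <name> token is assumed to be defined in the scanner.

```smacc
%start SequenceNode;

# Default start symbol: the left-hand side of the first rule.
Method : <name> SequenceNode ;

# Additional start symbol, declared with %start above.
SequenceNode
    : Statement
    | SequenceNode "." Statement
    ;

# Hypothetical placeholder for a real statement rule.
Statement : <name> ;
```

Following the naming seen in StParser, a grammar like this would be expected to provide both a startingStateForMethod and a startingStateForSequenceNode position.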

Id Methods

Internally, the various token types are represented as integers. However, there are times when you need to reference the token types directly. For example, in the CScanner and CParser classes, the TYPE_NAME token is identical to the IDENTIFIER token. The IDENTIFIER matching method does a lookup in the type table and, if it finds a type definition with the same name as the current IDENTIFIER, it wants to return the TYPE_NAME token type. To determine what integer this is, the parser was created with an %id directive for <IDENTIFIER> and <TYPE_NAME>. This generates the IDENTIFIERId and TYPE_NAMEId methods on the scanner. These methods simply return the number representing that token type. See the C sample scanner and parser for a good example of how this is used.
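
The directive itself is a single line in the directives section. The sketch below uses the two token names from the C example; both tokens are assumed to be defined in the scanner.

```smacc
# Generates the IDENTIFIERId and TYPE_NAMEId methods on the scanner.
%id <IDENTIFIER> <TYPE_NAME>;
```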

Production Rules

The production rules section contains the grammar for the parser. The first production rule is considered to be the starting rule for the parser. Each production rule consists of a non-terminal symbol name, followed by a ":" separator, followed by a list of possible productions separated by a vertical bar, "|", and finally terminated by a semicolon, ";".

Each production consists of a sequence of non-terminal symbols, tokens, or keywords followed by some optional Smalltalk code enclosed in curly brackets, {}. Non-terminal symbols are valid Smalltalk variable names and must be defined somewhere in the parser definition. Forward references are valid. Tokens are enclosed in angle brackets as they are defined in the scanner (e.g., <token>) and keywords are enclosed in double-quotes (e.g., "then"). A keyword that contains a double-quote must double each one. For example, a keyword consisting of a single double-quote character is entered as """" (four double-quote characters).
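
Put together, a set of rules might look like the following sketch. The rule names, the keywords, and the <identifier> and <characters> tokens are hypothetical; the tokens are assumed to be defined in the scanner.

```smacc
# Two alternatives separated by "|" and terminated by ";".
# Expression is used here before it is defined (a forward reference).
Statement
    : "if" Expression "then" Statement
    | Expression
    ;

Expression : <identifier> | QuotedText ;

# The keyword """" stands for a single double-quote character.
QuotedText : """" <characters> """" ;
```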

The Smalltalk code is evaluated whenever that production is matched. If the code is a zero or a one argument symbol, then that method is performed. For a one argument symbol, the argument is an OrderedCollection that contains one element for each item in the production. If the code isn't a zero or one argument symbol, then the code is executed and whatever is returned by the code is the result of the production. If no Smalltalk code is specified, then the default action is to execute the #reduceFor: method. This method converts all items into an OrderedCollection. If one of the items is another OrderedCollection, then all of its elements are added to the new collection.

Inside the Smalltalk code you can refer to the values of each production item by using literal strings. The literal string, '1', refers to the value of the first production item. The values for tokens and keywords will be SmaCCToken objects. The value for all non-terminal symbols will be whatever the Smalltalk code evaluates to for that non-terminal symbol.
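
The action styles described above can be sketched side by side. The rules and the selectors #emptyList and #negate: are hypothetical, so the parser class would have to implement methods with those names; the `value` accessor on SmaCCToken is also an assumption.

```smacc
# Arbitrary Smalltalk code: whatever it returns is the result.
# '1' is the SmaCCToken for <number>.
Number : <number> {'1' value asNumber} ;

# One-argument symbol: #negate: (hypothetical) is performed with an
# OrderedCollection holding one element per item, here the "-" token
# and the value of Number.
Negation : "-" Number {#negate:} ;

# Zero-argument symbol: #emptyList (hypothetical) is performed.
Empty : {#emptyList} ;

# No code: the default #reduceFor: action collects the items.
Pair : Number "," Number ;
```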

Named Symbols

When entering the Smalltalk code, you can get the value for a symbol by using the literal strings (e.g., '2'). However, this creates difficulties when modifying a grammar. If you insert some symbol at the beginning of a production, then you will need to modify your Smalltalk code, changing all literal string numbers. Instead, you can name each symbol in the production and then refer to the name in the Smalltalk code. To name a symbol (non-terminal, token, or keyword), add a quoted variable name after the symbol in the grammar. For example, "MySymbol : Expression 'expr' "+" <number> 'num' {expr + num} ;" creates two named variables, one for the non-terminal Expression and one for the <number> token, which are then used in the Smalltalk code.
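
The difference shows up when a production changes. In the sketch below, the rules are hypothetical and the "let" keyword is added purely for illustration; Expression is assumed to be defined elsewhere and to evaluate to a number.

```smacc
# Positional: the operands were '1' and '3' before "let" was inserted
# and had to be renumbered to '2' and '4'.
SumByPosition : "let" Expression "+" Expression {'2' + '4'} ;

# Named: the code is the same with or without the leading "let".
SumByName : "let" Expression 'left' "+" Expression 'right' {left + right} ;
```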

Extended Syntax

SmaCC also has some extended syntax that makes it easier to enter different grammars. Most of the additions are for the productions, but one change that is not for productions is the addition of "::=" as the separator between the non-terminal and the production. The production syntax enhancements are listed in the following table:

Symbol ? Makes symbol optional. It is equivalent to defining a new production rule: "Optional_Symbol : Symbol {'1'} | {nil};".
Symbol * or Symbol + Makes a repeating symbol. The "*" repeats zero or more times, and the "+" repeats one or more times. It is equivalent to defining a new production rule: "Repeat_Symbol : | Repeat_Symbol Symbol ;" for "*" and "Repeat_Symbol : Symbol | Repeat_Symbol Symbol ;" for "+".
( Productions ) Groups the items in Productions. By itself it is not that useful, but it can be combined with the "?", "*", or "+". It is equivalent to defining "Group_Productions : Productions ;".
[ Productions ] Equivalent to "( Productions ) ?".
<% Productions %> Equivalent to "( Productions ) *"
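
As an illustration, the sketch below writes a call with an optional, comma-separated argument list using the extended syntax. The rule name and the <identifier> token are hypothetical, and Expression is assumed to be defined elsewhere in the grammar.

```smacc
# "::=" is used in place of ":".
# [ ... ] makes the whole argument list optional;
# ( ... )* repeats the "," Expression group zero or more times.
Call ::= <identifier> "(" [ Expression ( "," Expression )* ] ")" ;
```

Written without the extensions, the same rule would need separate helper rules for the optional list and for the repeated group.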

Parser Comments

The compile page has three options to generate comments. You should always select the "Generate definition comments" option. It saves the scanner and parser definition strings into the scanner and parser classes, which allows your grammar to be kept under the same version control system as your Smalltalk code.

The other two comment options should not be needed unless you need to debug a generated parser. The "generate symbol comments" option will generate a comment that explains what each symbol is mapped to. When SmaCC compiles a grammar, it converts all symbols into integers. This comment gives you the integer for each symbol. You may need this information if you have an incorrect scanner definition. For example, you may have overlapping token definitions and SmaCC is picking the wrong one (by default it picks the first one in your scanner definition). When you debug, you can inspect the SmaCCToken object and check its "id" against those in the symbol comment. If they are different, then you have a bug in your scanner.

Finally, the "generate item set comments" option should rarely be needed. It generates a listing of all LR(1) item sets in the parser. If you are familiar with LR parsing, then it might be interesting to look at. However, for a moderate sized grammar (e.g., Java), this comment can be a few MB in size. I would not recommend generating such comments when using ENVY -- you don't want to store a 10MB method in your library. For the calculator example in the tutorial, this comment is 9,000 characters long.

Error Recovery

Normally, when the parser encounters an error, it raises the SmaCCParserError exception and parsing is immediately stopped. However, there are times when you may wish to try to parse more of the input. For example, if you are highlighting code, you do not want to stop highlighting at the first syntax error. Instead you may wish to attempt to recover after the statement separator -- the period ".". SmaCC uses the error symbol to specify where error recovery should be attempted. For example, we may have the following rule to specify a list of Smalltalk statements:

     Statements : Expression | Statements "." Expression ;

If we wish to attempt recovery from a syntax error when we encounter a period, we can change our rule to be:

     Statements : Expression | Statements "." Expression | error "." Expression ;

Although error recovery lets parsing continue after a syntax error, it does not let you obtain a parse tree from the input. Once the input has been parsed with errors, the parser raises a non-resumable SmaCCParserError.