This is a walk-through tutorial to demonstrate many of the
features of SmaCC, the Smalltalk Compiler Compiler. In this example, we
will incrementally develop a simple calculator.
If you haven't already done so, you will first need to
load the source. Once
you have loaded the code, you need to open the SmaCC Parser Generator. In
VisualWorks and VisualAge, it is under Tools menu. Dolphin has it in the
additional tools folder. It will open a window that looks similar to:

Our first calculator is going to be relatively simple. It is going to
take two numbers and add them
together. To start things off, we have to tell the scanner how to
recognize a number. It starts with one or more digits,
possibly followed by an decimal point with zero or more digits after it.
The scanner definition for this token is:
<number> : [0-9]+ (\. [0-9]*) ? ;
Enter that line into the scanner tab on the interface. Let's go
over each part:
<number> |
Names the token. The name inside the <> must be a legal
Smalltalk variable name.
|
: |
Separates the name of the token from the token's definition.
|
[0-9] |
Matches any single character in the range '0' to '9' (a digit).
|
+ |
Matches the previous expression one or more times. In this case,
we are matching one or more digits.
|
( ... ) |
Groups subexpressions.
|
\. |
Matches the '.' character (. has a special meaning in regular
expressions, \ quotes it).
|
* |
Matches the previous expression zero or more times.
|
? |
Matches the previous expression zero or one time (i.e., it is optional).
|
; |
Terminates a token specification.
|
We don't want to have to worry about whitespace in our language,
so we need to define what a whitespace is and to ignore it. To do
this, enter the following token specification on the next line on the
scanner page:
<whitespace> : \s+;
\s matches any whitespace character (space, tab, linefeed, etc.). So how do
we tell the scanner to ignore it? If you look in the SmaCCScanner class, you will find a method named
'whitespace'. If a scanner understands a method that has the same
name as a token name, that method will get called whenever the
scanner matches that kind of token. As you can see, the whitespace
method eats whitespace. There is also a 'comment' method that
behaves similarly.
The only other token that will appear in our system would be the
'+' token for addition. However, since this is token is always the
same, we don't have to tell the scanner what it looks like. It will
figure it out from our grammar.
Speaking of our grammar, let's go ahead and define it. Enter the
following specification in the Parser tab:
Expression :
Expression "+" Number
| Number ;
Number : <number>;
This basically says that an expressions is either a number or
an expression added to a number.
We should be able to compile a
parser now. Switch to the Compile tab. You need to enter the class name for both
the scanner and parser. Use CalculatorScanner and CalculatorParser respectively.
Once the class names are entered, we are ready to compile the parser. Press the
'Compile LALR(1)' button (you should always push this one unless you know what
you are doing. Basically, it will generate smaller parsers than the other
option). This will create new Smalltalk classes for the CalculatorScanner and
CalculatorParser and compile several methods in those classes. All methods that
SmaCC compiles will go into a "generated-*" method protocol. You should not
change those methods since they are replace each time you compile.
Whenever SmaCC creates new classes, they are compiled in the default application/package. If you are using VisualAge,
you will need to make sure that the default application is an open edition and
that it prereqs the SmaCCRuntime application.
If you have already created the scanner and parser classes, you can load their
definitions by using the "..." buttons next to the class name entry fields. If
you answer "Yes" to the dialog, the text in the scanner/parser tabs will be
replaced with the definition that was last compiled (assuming that the "Generate
definition comments" was checked during the last compile).
Now we are ready to test our parser. Go to the “test” pane, enter “3 +
4” (without the quotes), and press the “parse” button; you will see that
the parser correctly parses it. If you press “Parse and
Inspect” you will see and inspector on an OrderedCollection
that contains the parsed tokens. This is because we haven't specified
what the parser is supposed to do when it parses. You can also enter incorrect
items. For example, try to parse "3 + + 4" or "3 + a". An
error message should appear in the text.
Now we need to define the actions that need to happen when we parse our
expressions. Currently, our parser is just validating that the expression is a
bunch of numbers added together. Generally, you will create some structure that
represents what you've parsed (e.g., a parse tree). However, in this case, we
are not concerned about the structure, but we are concerned about the result
(the value of the expression). For our example, you need to modify the grammar
definition to be:
Expression :
Expression "+" Number {'1' + '3'}
| Number {'1'};
Number : <number> {'1' value asNumber};
The text between the braces is Smalltalk code that gets evaluated
when the rule is applied. Strings with a number get replaced with the
corresponding parse node. In the first Expression rule, the '1' will get replaced by the ParseNode that matches
Expression and the '3' gets replaced by the ParseNode that matches
Number. The second item in the rule is the "+" token. Since we already know what
it is, it is not interesting. Compile the new parser. Now when you do a 'Parse
and Inspect' from the test pane, you should see the result: 7.
One problem with the previous code is that if you need to change a rule then you
may also need to change the code for that rule. For example, suppose you
inserted a new token at the beginning of a rule, then you would need to change
all of your references in the Smalltalk code. We can alleviate this problem by
using named expressions. After each part of a rule, we can specify its name.
Names are specified with single quotes and must be legal Smalltalk variable
names. Doing this for our grammar we get:
Expression :
Expression 'expression' "+" Number 'number' {expression + number}
| Number 'number' {number};
Number : <number> 'numberToken' {numberToken value asNumber};
While this will result in the same language being parsed, it makes it easier
to maintain your parsers. Let's extend our language to add subtraction. Here's the new
grammar:
Expression :
Expression 'expression' "+" Number 'number' {expression + number}
| Expression 'expression' "-" Number 'number' {expression - number}
| Number 'number' {number};
Number : <number> 'numberToken' {numberToken value asNumber};
After you've compiled this, '3 + 4 - 2 ' should return '5'. Next,
let's add multiplication and division:
Expression :
Expression 'expression' "+" Number 'number' {expression + number}
| Expression 'expression' "-" Number 'number' {expression - number}
| Expression 'expression' "*" Number 'number' {expression * number}
| Expression 'expression' "/" Number 'number' {expression / number}
| Number 'number' {number};
Number : <number> 'numberToken' {numberToken value asNumber};
Here we run into a problem. If you evaluate "2 + 3 * 4" you end up
with 20. The problem is that in standard mathematics, multiplication
has a higher precedence than addition. Our grammar evaluates strictly
left-to-right. The standard solution for this problem is to define
additional nonterminals to force the sequence of evaluation. Our
grammar with that solution would look like:
Expression :
Term 'term' {term}
| Expression 'expression' "+" Term 'term' {expression + term}
| Expression 'expression' "-" Term 'term' {expression - term};
Term :
Number 'number' {number}
| Term 'term' "*" Number 'number' {term * number}
| Term 'term' "/" Number 'number' {term / number};
Number : <number> 'numberToken' {numberToken value asNumber};
If you compile this grammar, you will see that "2 + 3 * 4" evaluates
to 14 like we would expect. Now, as you can imagine, this gets pretty
complicated as the number of precedence rules increases (e.g., C). We
can use ambiguous grammars and precedence rules to simplify this
situation. Here is the same grammar using precedence to enforce our
evaluation order:
%left "+" "-";
%left "*" "/";
Expression :
Expression 'exp1' "+" Expression 'exp2' {exp1 + exp2}
| Expression 'exp1' "-" Expression 'exp2' {exp1 - exp2}
| Expression 'exp1' "*" Expression 'exp2' {exp1 * exp2}
| Expression 'exp1' "/" Expression 'exp2' {exp1 / exp2}
| Number 'number' {number};
Number : <number> 'numberToken' {numberToken value asNumber};
Notice that we changed the grammar so that there are Expressions on both sides
of the operator.
The two lines that we
added to the top of the grammar mean that “+” and “-”
are evaluated left-to-right and have the same precedence, which is
lower than “*” and “/”. Likewise, the second
line means that “*” and “/” have equal
precedence. Grammars in this form are
usually much more intuitive, especially in cases with many precedence
levels. Just as an example, let's add exponentiation and parentheses:
%left "+" "-";
%left "*" "/";
%right "^";
Expression :
Expression 'exp1' "+" Expression 'exp2' {exp1 + exp2}
| Expression 'exp1' "-" Expression 'exp2' {exp1 - exp2}
| Expression 'exp1' "*" Expression 'exp2' {exp1 * exp2}
| Expression 'exp1' "/" Expression 'exp2' {exp1 / exp2}
| Expression 'exp1' "^" Expression 'exp2' {exp1 raisedTo: exp2}
| "(" Expression 'expression' ")" {expression}
| Number 'number' {number};
Number : <number> 'numberToken' {numberToken value asNumber};
Once you have compiled the grammar, you will be able to evaluate "3 + 4 * 5 ^ 2
^ 2" to get 2503. Since the exponent operator is right associative, this
expression is evaluated like 3 + (4 * (5 ^ (2 ^ 2))). We can also evaluate
expressions with parentheses. For example, evaluating "(3 + 4) * (5 - 2) ^ 3"
results in 189.
|