



 |
Scanning takes an input stream of characters and converts that into a stream
of tokens. The tokens are then passed on to the parsing phase.
The scanner is specified by a collection of token specifications. Each token
is specified by:
TokenName : RegularExpression ;
TokenName is a valid Smalltalk variable name that is surrounded by <>.
For example, "<token>" is a valid TokenName, but "<token name>" is not since
"token name" isn't a valid Smalltalk variable name. The RegularExpression
is a regular expression that matches a token. It should match one or more
characters in the input stream. The colon character, ":", is used to separate
the TokenName and the RegularExpression, and the semicolon
character, ";", is used to terminate the token specification.
Regular Expression Syntax
While the rules are specified as regular expressions, there are many
different syntaxes for regular expressions. We choose a relatively simple syntax
that is specified below. If you wish to have a more rich syntax, you can modify
the scanner's parser: SmaCCScannerScanner & SmaCCScannerParser. These classes
were created using SmaCC.
\character |
Matches a special character. The character
immediately following the backslash is matched exactly, unless it is a
letter. Backslash-letter combinations have other meanings and are specified
below. |
\cLetter |
Matches a control character. Control characters are the
first 26 characters (e.g., \cA equals "Character value: 0"). The letter
that follows the "\c" must be an uppercase letter. |
\d |
Matches a digit, 0-9. |
\D |
Matches anything that is not a digit. |
\f |
Matches a form-feed character, "Character value: 12". |
\n |
Matches a newline character, "Character value: 10". |
\r |
Matches a carriage return character, "Character value: 13". |
\s |
Matches any whitespace character, [ \f\n\r\t\v]. |
\S |
Matches any non-whitespace character. |
\t |
Matches a tab, "Character value: 9". |
\v |
Matches a vertical tab, "Character value: 11" |
\w |
Matches any letter, number or underscore, [A-Za-z0-9_]. |
\W |
Matches anything that is not a letter, number or underscore. |
\xHexNumber |
Matches a character specified by the hex number following
the "\x". The hex number must be at least one character long and no
more than four characters for Unicode characters and two characters for
non-Unicode characters. For example, "\x20" matches the space character
(Character value: 16r20), and "\x1FFF" matches "Character value: 16r1FFF". |
<token> |
Copies the definition of <token> into the current regular
expression. For example, if we have "<hexdigit> : \d | [A-F] ;", we can use
<hexdigit> in a later rule: "<hexnumber> : <hexdigit> + ;". |
[characters] |
Matches one of the characters inside the []. This is
a shortcut for the "|" operator. In addition to single characters, you can
also specify character ranges with the "-" character. For example, "[a-z]"
matches any lower case letter. |
[^characters] |
Matches any character not listed in the characters
block. "[^a]" matches anything except for "a". |
# comment |
Creates a comment that is ignored by SmaCC. Everything from
the # to the end of the line is ignored. |
exp1| exp2 |
Matches either exp1 or exp2. |
exp1 exp2 |
Matches exp1 followed by exp2. "\d \d" matches
two digits. |
exp* |
Matches exp zero or more times. "0*" matches "" and
"000". |
exp? |
Matches exp zero or one time. "0?" matches only "" or
"0". |
exp+ |
Matches exp one or more times. "0+" matches "0" and
"000", but not "". |
exp{min,max} |
Matches exp at least min times but no more
than max times. "0{1,2}" matches only "0" or "00". It does not match
"" or "000". |
(exp) |
Groups exp for precedence. For example, "(a b)*"
matches "ababab". Without the parentheses, "a b *" would match "abbbb" but
not "ababab". |
Since there are multiple ways to combine expressions, we need precedence
rules for their combination. The or operator, "|", has the lowest precedence and
the "*", "?", "+", and "{,}" operators have the highest precedence. For example,
"a | b c *" matches "a" or "bcccc", but not "accc" or "bcbcbc". If you wish to
match "a" or "b" followed by any number of c's, you need to use "(a | b) c *".
Overlapping Tokens
Unlike T-Gen, SmaCC can handle overlapping tokens with any problems. For
example, the following is a legal SmaCC scanner definition:
<variable> : [a-zA-Z] \w* ;
<any_character> : . ;
This definition will match a variable or a single character. A variable can
also be a single character [a-zA-Z], so the two tokens overlap. SmaCC handles
overlapping characters by preferring the first token specified by the grammar.
For example, an "a" could be a <variable> or an <any_character> token, but since
<variable> is specified first, SmaCC will use it.
Matching Methods
If your scanner has a method name that matches the name of the token, (e.g.
whitespace), that method will get called upon a match of that type. The
SmaCCScanner superclass already has a default implementation of #whitespace
and #comment. These methods ignore those tokens by default. Matching methods
can also be used to handle overlapping token classes. For example, in the C
grammar, a type definition is the same as an identifier. The only way that they
can be disambiguated is by looking up the name in the type table. In our example
C parser, we have an IDENTIFIER method that is used to determine whether the
token is really an IDENTIFIER or whether it is a TYPE_NAME.
Unreferenced Tokens
If a token is not referenced from a grammar specification, it will not be
included in the generated scanner, unless the token's name is also a name of a
method (see previous section). This, coupled with the ability to do
substitutions, allows you to have the equivalent of macros within your scanner
specification. However, be aware that if you are simply trying to generate a
scanner, you will have to make sure that you create a dummy parser specification
that references all of the tokens that you want in the final scanner.
Case Insensitive Scanning
You can specify that the scanner should ignore case differences by checking
the "Ignore Case" option on the compile tab. If you have a language that is case
insensitive and has several keywords, this can be a handy feature to have. For
example, if you have "THEN" as a keyword in a case insensitive language, you
would need to specify a token for then as "<then> : [tT] [hH] [eE] [nN] ;". This
is a pain to enter correctly. When the ignore case option is checked, SmaCC will
automatically convert "THEN" into "[tT][hH][eE][nN]".
Unicode Characters
SmaCC compiles the scanner into a bunch of conditional tests on characters.
Normally, it assumes that characters have values between 0 and 255, and it can
make some optimizations based on this fact. With the "Allow Unicode Characters"
option checked, it will assume that characters have values between 0 and 65535.
|