Regular Expression Parser/Matcher

Overview

These classes implement a parser/matcher for regular expressions.

Documentation

The following text is an HTML transcription of the text found in the class methods in "documentation-manual" of the Regex::RxParser class. If you need further information: "use the source, Luke".

These pages will not teach you regular expression usage nor the Smalltalk language.
For regular expressions, the following excellent book is recommended:

For Smalltalk literature, please refer to the "Reading List".

Introduction

A regular expression is a template specifying a class of strings. A regular expression matcher is a tool that determines whether a string belongs to a class specified by a regular expression. This is a common task in user input validation code, and the use of regular expressions can GREATLY simplify and speed up development of such code.
As an example, here is how to verify that a string is a valid hexadecimal number in Smalltalk notation, using this matcher package:
	aString matchesRegex: '16r[[:xdigit:]]+'
(Coding the same ``the hard way'' is left as an exercise for the curious reader).

This matcher is offered to the Smalltalk community in the hope that it will be useful. It is free in terms of money and, to a large extent, in terms of rights of use. Refer to the `Boring Stuff' section for legalese.

Happy hacking,
Vassili Bykov <vassili@objectpeople.com> <vassili@magma.ca>

August 6, 1996 (first release)
April 4, 1999 (rel1.1)

What's new in Version 1.1 (Oct 1999)

Regular expression syntax corrections and enhancements:
  1. Backslash escapes similar to those in Perl are allowed in patterns:
    \w
    any word constituent character (equivalent to [a-zA-Z0-9_])
    \W
    any character but a word constituent (equivalent to [^a-zA-Z0-9_])
    \d
    a digit (same as [0-9])
    \D
    anything but a digit
    \s
    a whitespace character
    \S
    anything but a whitespace character
    \b
    an empty string at a word boundary
    \B
    an empty string not at a word boundary
    \<
    an empty string at the beginning of a word
    \>
    an empty string at the end of a word
    For example, '\w+' is now a valid expression matching any word.

  2. The following backslash escapes are also allowed in character sets (between square brackets):

        \w, \W, \d, \D, \s, and \S.
    

  3. The following grep(1)-compatible named character classes are recognized in character sets as well:

        [:alnum:]
        [:alpha:]
        [:blank:]
        [:cntrl:]
        [:digit:]
        [:graph:]
        [:lower:]
        [:print:]
        [:punct:]
        [:space:]
        [:upper:]
        [:xdigit:]
    

    For example, the following patterns are equivalent:

        '[[:alnum:]]+'
        '\w+'
        '[\w]+'
        '[a-zA-Z0-9_]+'
    

  4. Some non-printable characters can be represented in regular expressions using a common backslash notation:

        \t      tab (Character tab)
        \n      newline (Character lf)
        \r      carriage return (Character cr)
        \f      form feed (Character newPage)
        \e      escape (Character esc)
    

  5. A dot is correctly interpreted as 'any character but a newline' instead of 'anything but whitespace'.

  6. Case-insensitive matching. The easiest way to use it is through new messages that CharacterArray understands:
        #asRegexIgnoringCase
        #matchesRegexIgnoringCase:
        #prefixMatchesRegexIgnoringCase:
    

  7. The matcher (an instance of RxMatcher, the result of String>>asRegex) now provides a collection-like interface to matches in a particular string or on a particular stream, as well as substitution protocol. The interface includes the following messages:

        matchesIn: aString
        matchesIn: aString collect: aBlock
        matchesIn: aString do: aBlock
    
        matchesOnStream: aStream
        matchesOnStream: aStream collect: aBlock
        matchesOnStream: aStream do: aBlock
    
        copy: aString translatingMatchesUsing: aBlock
        copy: aString replacingMatchesWith: replacementString
    
        copyStream: aStream to: writeStream translatingMatchesUsing: aBlock
        copyStream: aStream to: writeStream replacingMatchesWith: aString
    

    Examples:

        '\w+' asRegex matchesIn: 'now is the time'
    

    returns an OrderedCollection containing four strings: 'now', 'is', 'the', and 'time'.

        '\<t\w+' asRegexIgnoringCase
    	    copy: 'now is the Time'
    	    translatingMatchesUsing: [:match | match asUppercase]
    

    returns 'now is THE TIME' (the regular expression matches words beginning with either an uppercase or a lowercase T).

Syntax

[You can `print it' examples in this text. ]

The simplest regular expression is a single character. It matches exactly that character. A sequence of characters matches a string with exactly the same sequence of characters:

    'a' matchesRegex: 'a'                   "-> true"
    'foobar' matchesRegex: 'foobar'         "-> true"
    'blorple' matchesRegex: 'foobar'        "-> false"
The above paragraph introduced a primitive regular expression (a character), and an operator (sequencing). Operators are applied to regular expressions to produce more complex regular expressions. Sequencing (placing expressions one after another) as an operator is, in a certain sense, `invisible'--yet it is arguably the most common.

A more `visible' operator is Kleene closure, more often simply referred to as `a star'. A regular expression followed by an asterisk (`*') matches any number (including zero) of consecutive occurrences of the original expression.
For example:

    'ab' matchesRegex: 'a*b'                "-> true"
    'aaaaab' matchesRegex: 'a*b'            "-> true"
    'b' matchesRegex: 'a*b'                 "-> true"
    'aac' matchesRegex: 'a*b'               "-> false: b does not match"

A star's precedence is higher than that of sequencing. A star applies to the shortest possible subexpression that precedes it. For example, 'ab*' means `a followed by zero or more occurrences of b', not `zero or more occurrences of ab':

    'abbb' matchesRegex: 'ab*'              "-> true"
    'abab' matchesRegex: 'ab*'              "-> false"
To actually make a regex matching `zero or more occurrences of ab', `ab' is enclosed in parentheses:
    'abab' matchesRegex: '(ab)*'            "-> true"
    'abcab' matchesRegex: '(ab)*'           "-> false: c spoils the fun"
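Two other operators similar to `*' are `+' and `?'. `+' matches one or more occurrences of the preceding expression; `?' matches zero or one occurrence, but never more.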
For example:
    'ac' matchesRegex: 'ab*c'               "-> true"
    'ac' matchesRegex: 'ab+c'               "-> false: need at least one b"
    'abbc' matchesRegex: 'ab+c'             "-> true"
    'abbc' matchesRegex: 'ab?c'             "-> false: too many b's"
As we have seen, the characters `*', `+', `?', `(', and `)' have special meaning in regular expressions. If one of them is to be used literally, it should be quoted: preceded with a backslash. (Thus, backslash is also a special character, and needs to be quoted for a literal match--as does any other special character described further on).
    'ab*' matchesRegex: 'ab*'               "-> false: star in the right string is special"
    'ab*' matchesRegex: 'ab\*'              "-> true"
    'a\c' matchesRegex: 'a\\c'              "-> true"
The last operator is `|' meaning `or'.
It is placed between two regular expressions, and the resulting expression matches if one of the expressions matches. It has the lowest possible precedence (lower than sequencing). For example, `ab*|ba*' means `a followed by any number of b's, or b followed by any number of a's':
    'abb' matchesRegex: 'ab*|ba*'           "-> true"
    'baa' matchesRegex: 'ab*|ba*'           "-> true"
    'baab' matchesRegex: 'ab*|ba*'          "-> false"
A bit more complex example is the following expression, matching the name of any of the Lisp-style `car', `cdr', `caar', `cadr', ... functions:
    c(a|d)+r
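For instance, here are a few sample inputs checked against this expression (the inputs are chosen here purely for illustration):
    'car' matchesRegex: 'c(a|d)+r'          "-> true"
    'cdadr' matchesRegex: 'c(a|d)+r'        "-> true"
    'cr' matchesRegex: 'c(a|d)+r'           "-> false: need at least one a or d"
    'cabr' matchesRegex: 'c(a|d)+r'         "-> false: b is neither a nor d"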
It is possible to write an expression matching an empty string, for example: `a|'. However, it is an error to apply `*', `+', or `?' to such an expression: `(a|)*' is an invalid expression.

So far, we have used only characters as the 'smallest' components of regular expressions. There are other, more `interesting', components.

A character set is a string of characters enclosed in square brackets. It matches any single character that appears between the brackets.
For example, `[01]' matches either `0' or `1':

    '0' matchesRegex: '[01]'         "-> true"
    '3' matchesRegex: '[01]'         "-> false"
    '11' matchesRegex: '[01]'        "-> false: a set matches only one character"

Using the plus operator, we can build the following binary number recognizer:
    '10010100' matchesRegex: '[01]+'        "-> true"
    '10001210' matchesRegex: '[01]+'        "-> false"
If the first character after the opening bracket is `^', the set is inverted: it matches any single character *not* appearing between the brackets:
    '0' matchesRegex: '[^01]'               "-> false"
    '3' matchesRegex: '[^01]'               "-> true"
For convenience, a set may include ranges: pairs of characters separated with `-'. This is equivalent to listing all characters between them: `[0-9]' is the same as `[0123456789]'.

Special characters within a set are `^', `-', and `]' that closes the set.
Below are the examples of how to literally use them in a set:

    [01^]           -- put the caret anywhere except the beginning
    [01-]           -- put the dash as the last character
    []01]           -- put the closing bracket as the first character
    [^]01]             (thus, empty and universal sets cannot be specified)
Be careful: `.' and similar special characters are no longer special inside the character set;
therefore:
    '1' matchesRegex: '[1.]'         "-> true"
and:
    '.' matchesRegex: '[1.]'         "-> true"
but not:
    '2' matchesRegex: '[1.]'         "-> false"
Regular expressions can also include the following backslash escapes to refer to popular classes of characters:
    \w      any word constituent character (same as [a-zA-Z0-9_])
    \W      any character but a word constituent
    \d      a digit (same as [0-9])
    \D      anything but a digit
    \s      a whitespace character
    \S      anything but a whitespace character
These escapes are also allowed in character classes: '[\w+-]' means 'any character that is either a word constituent, or a plus, or a minus'.

Character classes can also include the following grep(1)-compatible elements to refer to common classes of characters:

    [:alnum:]               any alphanumeric, i.e., a word constituent, character
    [:alpha:]               any alphabetic character
    [:blank:]               space or tab.
    [:cntrl:]               any control character.
			    In this version, it means any character whose ASCII code is < 32.
    [:digit:]               any decimal digit.
    [:graph:]               any graphical character.
			    In this version, this means any character with ASCII code >= 32.
    [:lower:]               any lowercase character
    [:print:]               any printable character.
			    In this version, this is the same as [:graph:]
    [:punct:]               any punctuation character.
    [:space:]               any whitespace character.
    [:upper:]               any uppercase character.
    [:xdigit:]              any hexadecimal character.
Note that these elements are components of the character classes, i.e. they have to be enclosed in an extra set of square brackets to form a valid regular expression.
For example, a non-empty string of digits would be represented as '[[:digit:]]+'.

The above primitive expressions and operators are common to many implementations of regular expressions. The next primitive expression is unique to this Smalltalk implementation.

A sequence of characters between colons is treated as a unary selector which is supposed to be understood by characters. A character matches such an expression if it answers true to a message with that selector. This allows a more readable and efficient way of specifying character classes; it can also be easily extended by adding appropriate protocol to the Character class.
For example, `[0-9]' is equivalent to `:isDigit:', but the latter is more efficient. Analogously to character sets, character classes can be negated: `:^isDigit:' matches a Character that answers false to #isDigit, and is therefore equivalent to `[^0-9]'.
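A brief sketch of this syntax in use. It assumes only the common Character tests #isDigit and #isLetter; any other unary selector that answers a Boolean could be substituted, and the sample strings are made up for illustration:
    '2024' matchesRegex: ':isDigit:+'               "-> true"
    '20x4' matchesRegex: ':isDigit:+'               "-> false: x is not a digit"
    'abc' matchesRegex: ':^isDigit:+'               "-> true: no character is a digit"
    'x27' matchesRegex: ':isLetter::isDigit:+'      "-> true: a letter followed by digits"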

As an example, so far we have seen the following equivalent ways to write a regular expression that matches a non-empty string of digits:

    '[0-9]+'
    '\d+'
    '[\d]+'
    '[[:digit:]]+'
    ':isDigit:+'
The last group of special primitive expressions includes:
    .       matching any character except a newline;
    ^       matching an empty string at the beginning of a line;
    $       matching an empty string at the end of a line.
    \b      an empty string at a word boundary
    \B      an empty string not at a word boundary
    \<      an empty string at the beginning of a word
    \>      an empty string at the end of a word
Again, all the above three characters (`.', `^' and `$') are special and should be quoted to be matched literally.

Examples:

    'axyzb' matchesRegex: 'a.+b'            "-> true"
    'ax zb' matchesRegex: 'a.+b'            "-> true (space is matched by `.')"
    ('ax' , (String with: Character cr) , 'zb')
	matchesRegex: 'a.+b'                "-> false (newline is not matched by `.')"
    ('ax' , (String with: Character cr) , 'zb')
	matchesRegex: 'a(.|\n)+b'           "-> true"

EXAMPLES

As the introduction said, a great use for regular expressions is user input validation. Following are a few examples of regular expressions that might be handy in checking input entered by the user in an input field. Try them out by entering something between the quotes and print-iting. (Also, try to imagine the Smalltalk code that each validation would require if coded by hand). Most example expressions could have been written in alternative ways.

Checking if aString may represent a nonnegative integer number:

    aString matchesRegex: ':isDigit:+'
or
    aString matchesRegex: '[0-9]+'
or
    aString matchesRegex: '\d+'
Checking if aString may represent an integer number with an optional sign in front:
    aString matchesRegex: '(\+|-)?\d+'
Checking if aString is a fixed-point number, with at least one digit required after the dot:
    aString matchesRegex: '(\+|-)?\d+(\.\d+)?'
The same, but allow notation like `123.':
    aString matchesRegex: '(\+|-)?\d+(\.\d*)?'
Recognizer for a string that might be a name: one word with first capital letter, no blanks, no digits. More traditional:
    aString matchesRegex: '[A-Z][A-Za-z]*'
more Smalltalkish:
    aString matchesRegex: ':isUppercase::isAlphabetic:*'
A date in the format MMM DD, YYYY, with any number of spaces in between, in the 20th century:
    aString matchesRegex: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(\d\d?)[ ]*,[ ]*19(\d\d)'
Note the parentheses around some components of the expression above. As the `Usage' section shows, they will allow us to obtain the actual strings that have matched them (i.e. month name, day number, and year number).

For dessert, coming back to numbers: here is a recognizer for a general number format: anything like 999, or 999.999, or -999.999e+21.

    aString matchesRegex: '(\+|-)?\d+(\.\d*)?((e|E)(\+|-)?\d+)?'

Usage

The preceding section covered the syntax of regular expressions. It used the simplest possible interface to the matcher: sending #matchesRegex: to the sample string, with a regular expression string as the argument.
This section explains hairier ways of using the matcher.

Prefix Matching and Case-Insensitive Matching

A CharacterArray (an EsString in VA) also understands these messages:
    #prefixMatchesRegex: regexString
    #matchesRegexIgnoringCase: regexString
    #prefixMatchesRegexIgnoringCase: regexString
#prefixMatchesRegex: is just like #matchesRegex:, except that the whole receiver is not expected to match the regular expression passed as the argument; matching just a prefix of it is enough.
For example:
    'abcde' matchesRegex: '(a|b)+'          "-> false"
    'abcde' prefixMatchesRegex: '(a|b)+'    "-> true"
The last two messages are case-insensitive versions of matching.
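For example (the sample strings are chosen here for illustration):
    'HELLO' matchesRegex: 'hello'                            "-> false"
    'HELLO' matchesRegexIgnoringCase: 'hello'                "-> true"
    'Hello World' prefixMatchesRegexIgnoringCase: 'hello'    "-> true"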

ENUMERATION INTERFACE

An application can be interested in all matches of a certain regular expression within a String. The matches are accessible using a protocol modelled after the familiar Collection-like enumeration protocol:
    aString regex: regexString matchesDo: aBlock
Evaluates a one-argument aBlock for every match of the regular expression within the receiver string.
    aString regex: regexString matchesCollect: aBlock
Evaluates a one-argument aBlock for every match of the regular expression within the receiver string. Collects the results of the evaluations and answers them as a SequenceableCollection.
    aString allRegexMatches: regexString
Returns a collection of all matches (substrings of the receiver string) of the regular expression.
It is an equivalent of
    aString regex: regexString matchesCollect: [:each | each].
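
A short sketch of the three messages, reusing the `now is the time' sample from the Version 1.1 notes; the blocks are arbitrary illustrations:
    'now is the time' allRegexMatches: '\w+'
        "-> a collection with 'now', 'is', 'the', and 'time'"
    'now is the time' regex: '\w+' matchesCollect: [:each | each size]
        "-> a collection with 3, 2, 3, and 4"
    'now is the time' regex: '\w+' matchesDo: [:each | Transcript showCR: each]
        "-> writes each word on its own line of the Transcript"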

REPLACEMENT AND TRANSLATION

It is possible to replace all matches of a regular expression with a certain string using the message:
    aString copyWithRegex: regexString matchesReplacedWith: replacementString
For example:
    'ab cd ab' copyWithRegex: '(a|b)+' matchesReplacedWith: 'foo'
returns the string: 'foo cd foo'.

A more general substitution is match translation:

    aString copyWithRegex: regexString matchesTranslatedUsing: aBlock
This message evaluates a block passing it each match of the regular expression in the receiver string and answers a copy of the receiver with the block results spliced into it in place of the respective matches.
For example:
    'ab cd ab' copyWithRegex: '(a|b)+' matchesTranslatedUsing: [:each | each asUppercase]
results in the string: 'AB cd AB'.

All messages of enumeration and replacement protocols perform a case-sensitive match. Case-insensitive versions are not provided as part of a CharacterArray protocol. Instead, they are accessible using the lower-level matching interface.

LOWER-LEVEL INTERFACE

Internally, aString matchesRegex: works as follows:
  1. A fresh instance of RxParser is created, and the regular expression string is passed to it, yielding the expression's syntax tree.

  2. The syntax tree is passed as an initialization parameter to an instance of RxMatcher. The instance sets up some data structure that will work as a recognizer for the regular expression described by the tree.

  3. The original string is passed to the matcher, and the matcher checks for a match.

THE MATCHER

If you repeatedly match a number of strings against the same regular expression using one of the messages defined in CharacterArray, the regular expression string is parsed and a matcher is created anew for every match. You can avoid this overhead by building a matcher for the regular expression, and then reusing the matcher over and over again. You can, for example, create a matcher at a class or instance initialization stage, and store it in a variable for future use.

You can create a matcher using one of the following methods:
    RxMatcher class>>forString: regexString
    RxMatcher class>>forString: regexString ignoreCase: aBoolean

A more convenient way is using one of the two matcher-creating messages understood by CharacterArray. Here are four examples of creating a matcher:
    hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+'
    hexRecognizer := RxMatcher forString: '16r[0-9A-Fa-f]+' ignoreCase: false
    hexRecognizer := '16r[0-9A-Fa-f]+' asRegex
    hexRecognizer := '16r[0-9A-F]+' asRegexIgnoringCase
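
A minimal sketch of reusing one matcher for several strings, using the #matches: message described below. The variable names and sample strings are made up for illustration:
    | hexRecognizer inputs |
    hexRecognizer := '16r[0-9A-Fa-f]+' asRegex.
    inputs := #('16rFF' '16rG1' '255').
    inputs collect: [:each | hexRecognizer matches: each]
        "-> true, false, false: the expression is parsed only once"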

MATCHING

The matcher understands these messages (all of them return true to indicate successful match or search, and false otherwise):
matches: aString
True if the whole target string (aString) matches.

matchesPrefix: aString
True if some prefix of the string (not necessarily the whole string) matches.

search: aString
Search the string for the first occurrence of a matching substring. (Note that the first two methods only try matching from the very beginning of the string.) For example, a matcher for `a+' would answer success given the string `baaa', while the previous two methods would fail.

matchesStream: aStream
matchesStreamPrefix: aStream
searchStream: aStream
Respective analogs of the first three methods, taking input from a stream instead of a string. The stream must be positionable and peekable.
All these methods answer a boolean indicating success. The matcher also stores the outcome of the last match attempt and can report it:
lastResult
Answers a Boolean -- the outcome of the most recent match attempt. If no matches were attempted, the answer is unspecified.
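
A small sketch contrasting these messages, using a matcher for `a+' (the sample strings are chosen here for illustration):
    | matcher |
    matcher := 'a+' asRegex.
    matcher matches: 'aaa'.                         "-> true"
    matcher matches: 'aaab'.                        "-> false: trailing b is not matched"
    matcher matchesPrefix: 'aaab'.                  "-> true: the prefix aaa matches"
    matcher matchesPrefix: 'baaa'.                  "-> false: no match at the very beginning"
    matcher search: 'baaa'.                         "-> true: a match is found further on"
    matcher matchesStream: (ReadStream on: 'aaa').  "-> true"
    matcher lastResult                              "-> true: outcome of the last attempt"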

SUBEXPRESSION MATCHES

After a successful match attempt, you can query the specifics of which part of the original string has matched which part of the whole expression. A subexpression is a parenthesized part of a regular expression, or the whole expression. When a regular expression is compiled, its subexpressions are assigned indices starting from 1, depth-first, left-to-right.
For example, `((ab)+(c|d))?ef' includes the following subexpressions with these indices:
	  1:      ((ab)+(c|d))?ef
	  2:      (ab)+(c|d)
	  3:      ab
	  4:      c|d
Be aware that the first subexpression represents the whole match.
After a successful match, the matcher can report what part of the original string matched what subexpression. It understands these messages:
subexpressionCount
Answers the total number of subexpressions: the highest value that can be used as a subexpression index with this matcher. This value is available immediately after initialization and never changes.

subexpression: anIndex
An index must be a valid subexpression index, and this message must be sent only after a successful match attempt. The method answers a substring of the original string the corresponding subexpression has matched to.

subBeginning: anIndex
subEnd: anIndex
Answer positions within the original string or stream where the match of a subexpression with the given index has started and ended, respectively.
This facility provides a convenient way of extracting parts of input strings of complex format. For example, the following piece of code uses the 'MMM DD, YYYY' date format recognizer example from the `Syntax' section to convert a date to a three-element array with year, month, and day strings (you can select and evaluate it right here):
    | matcher |
    matcher := Regex::RxMatcher new initializeFromString: '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ ]+(:isDigit::isDigit:?)[ ]*,[ ]*19(:isDigit::isDigit:)'.
    (matcher matches: 'Aug 6, 1996')
	    ifTrue:
		    [Array
			    with: (matcher subexpression: 4)
			    with: (matcher subexpression: 2)
			    with: (matcher subexpression: 3)]
	    ifFalse: ['no match']
(should answer `#('96' 'Aug' '6')').

ENUMERATION AND REPLACEMENT

The enumeration and replacement protocols exposed in CharacterArray are actually implemented by the matcher.
The following messages are understood:
    matchesIn: aString
    matchesIn: aString do: aBlock
    matchesIn: aString collect: aBlock
    copy: aString replacingMatchesWith: replacementString
    copy: aString translatingMatchesUsing: aBlock

    matchesOnStream: aStream
    matchesOnStream: aStream do: aBlock
    matchesOnStream: aStream collect: aBlock
    copyStream: sourceStream to: targetStream replacingMatchesWith: replacementString
    copyStream: sourceStream to: targetStream translatingMatchesUsing: aBlock
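
Since the CharacterArray protocol is case-sensitive only, case-insensitive enumeration and replacement go through a matcher. A minimal sketch (the sample strings are made up for illustration):
    | matcher |
    matcher := 'smalltalk' asRegexIgnoringCase.
    matcher matchesIn: 'Smalltalk, SMALLTALK and smalltalk'.
        "-> a collection with 'Smalltalk', 'SMALLTALK', and 'smalltalk'"
    matcher copy: 'Smalltalk, SMALLTALK and smalltalk' replacingMatchesWith: 'ST'
        "-> 'ST, ST and ST'"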

ERROR HANDLING

Exception signaling objects are accessible through RxParser class protocol. To handle possible errors, use the protocol described below to obtain the exception objects and use the protocol of the native Smalltalk implementation to handle them.

If a syntax error is detected while parsing expression, RxParser>>syntaxErrorSignal is raised/signaled.

If an error is detected while building a matcher, RxParser>>compilationErrorSignal is raised/signaled.

If an error is detected while matching (for example, if a bad selector was specified using `:<selector>:' syntax, or because of the matcher's internal error), RxParser>>matchErrorSignal is raised/signaled.

RxParser>>regexErrorSignal is the parent of all three. Since any of the three signals can be raised within a call to #matchesRegex:, it is handy if you want to catch them all.

For example:

ANSI Smalltalk (VisualWorks, SmalltalkX, Squeak etc.):

    [ 'abc' matchesRegex: '))garbage[' ]
	on: RxParser regexErrorSignal
	do: [:ex | ex return: nil]
VisualWorks, SmalltalkX:
    RxParser regexErrorSignal
	handle: [:ex | ex returnWith: nil]
	do: [ 'abc' matchesRegex: '))garbage[' ]
VisualAge, SmalltalkX:
    [ 'abc' matchesRegex: '))garbage[' ]
	when: RxParser regexErrorSignal
	do: [:signal | signal exitWith: nil]

Simple matching:
    'hello world' matchesRegex: 'h.*d'
Or:
    |matcher|

    matcher := '.*ll.*' asRegex.
    matcher matches: 'hello world'.
Fetching matched subexpressions:
    |matcher sub1 sub2 sub3|

    matcher := '\D*([0-9]+)\s([0-9]+)\D*.*' asRegex.
    (matcher matches: 'bla bla 123456 123 bla bla') ifTrue:[
	Transcript showCR:(matcher subexpressionCount printString , ' subExpressions').
	sub1 := matcher subexpression:1.
	sub2 := matcher subexpression:2.
	sub3 := matcher subexpression:3.
	Transcript showCR:'subExpr1 is ' , sub1.
	Transcript showCR:'subExpr2 is ' , sub2.
	Transcript showCR:'subExpr3 is ' , sub3.
    ].

Licensing

This addOn package is NOT to be considered part of the base ST/X system. It is provided physically with the ST/X delivery, but only for your convenience.

Legally, it is a freeware or public domain goody, as specified in the goodies copyright notice (see the goodies source).

No Warranty

This goody is provided AS-IS without any warranty whatsoever.

Origin/Authors

Found in and ported from the smalltalk archives.
Author:

Vassili Bykov

See RxParser class>>boringStuff for legal information.


Copyright © 1999 eXept Software AG

<info@exept.de>

Doc $Revision: 1.12 $