Grammars and Parsing

Robert M. Keller

Example - A problem discussed earlier:

Tautology checker:

(a + b)' = a' * b' ?

a + a'*b = a + b ?

a + a*b = a * b ?

Decomposition:

An applet demonstrating a tautology checker may be found at http://www.cs.hmc.edu/~keller/javaExamples/taut. Type a logical expression into the left text area. The applet will tell you whether or not it is a tautology. Identifiers in this version of the tautology checker are limited to single characters. The operators are:

+ for OR
* for AND
' (postfix) for NOT
> for IMPLIES
= for IF-AND-ONLY-IF (or equals)

Parentheses may be used.
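One simple way to decide whether an expression is a tautology is to evaluate it under every assignment of truth values to its variables. The sketch below is only an illustration of that idea, not the applet's code: the three example expressions are hard-coded as Java lambdas rather than parsed, and the names TruthTableSketch and isTautology are assumptions made for this example.

```java
import java.util.function.BiPredicate;

class TruthTableSketch {
    // Returns true if f holds under all four assignments of truth values to (a, b).
    static boolean isTautology(BiPredicate<Boolean, Boolean> f) {
        boolean[] values = { false, true };
        for (boolean a : values) {
            for (boolean b : values) {
                if (!f.test(a, b)) return false;   // found a falsifying assignment
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // (a + b)' = a' * b'
        System.out.println(isTautology((a, b) -> (!(a || b)) == (!a && !b)));
        // a + a'*b = a + b
        System.out.println(isTautology((a, b) -> (a || (!a && b)) == (a || b)));
        // a + a*b = a * b
        System.out.println(isTautology((a, b) -> (a || (a && b)) == (a && b)));
    }
}
```

The first two expressions hold under every assignment. The third fails when a is true and b is false, so it is not a tautology.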

Example - Tree for

a + a'*b = a + b

The process of producing such a tree from an input sequence of characters is called

parsing.

An applet demonstrating parsing to create trees may be found at http://www.cs.hmc.edu/~keller/javaExamples/Parser. Enter arithmetic expressions involving multi-character variables, +, *, and parentheses and watch the tree being constructed.

Please note that syntax trees such as this are usually just an intermediate form used to achieve some other end, such as:

Evaluation to achieve a value (arithmetic, logical, or other)
Machine code, as in the case of a compiler
Analysis, such as in the case of a natural language translator.
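To make the first of these uses concrete, here is a minimal sketch of evaluating a syntax tree to a value. It assumes a tree is either a String (a variable) or a java.util.List of the form [operator, left, right], and that variable values are supplied in a Map. The class name TreeEvalSketch and this representation are assumptions made for illustration only.

```java
import java.util.List;
import java.util.Map;

class TreeEvalSketch {
    // A tree is either a String (a variable) or a List of the form [operator, left, right].
    static int eval(Object tree, Map<String, Integer> env) {
        if (tree instanceof String) {
            return env.get((String) tree);          // leaf: look up the variable's value
        }
        List<?> node = (List<?>) tree;
        String op = (String) node.get(0);
        int left = eval(node.get(1), env);          // evaluate the sub-trees first
        int right = eval(node.get(2), env);
        switch (op) {
            case "+": return left + right;
            case "*": return left * right;
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        // A tree with + at the root, a as the left sub-tree and b * c as the right sub-tree
        Object tree = List.of("+", "a", List.of("*", "b", "c"));
        System.out.println(eval(tree, Map.of("a", 1, "b", 2, "c", 3)));   // prints 7
    }
}
```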

Construction of a Parser

Specify a grammar for the language to be parsed.
Use the grammar to guide the construction of the parser.

Grammar Fundamentals

Many notations are possible, some expressing the essence more naturally than others.
As with programming languages, we need to learn to "see through" the grammatical syntax to the essence.

Grammar Symbols

In a grammar, there are three types of symbols:

Terminal symbols: symbols in the end language (also called the object language)
Auxiliary symbols: symbols standing for syntactic categories (or sets of strings) in the end language
Meta symbols: symbols used to construct syntactic categories

Example: A grammar for additive expressions:

a + b

a + b + c

a + a + d

....

Here there are two syntactic categories:

V: identifier or "variable"

a, b, c, ...., z (say)

A: additive expressions themselves

Because it is additive expressions in which we are interested, we say that A is the root category.

Productions of the grammar specify how syntactic categories are related.

For the additive expressions, two productions suffice:

(i) A -> V { '+' V}

read

"an A is a V followed by 0 or more occurrences of '+' then a V."

or more verbosely

"an additive expression is a variable followed by 0 or more occurrences of '+' then a variable".

(ii) V -> 'a' | 'b' | 'c' | .... | 'z'

read

"a V is an 'a' or a 'b' or a 'c' or .... or a 'z'"

Note: .... is not actually part of the grammar, but is just meant to abbreviate the letters between 'c' and 'z'.

-> indicates string replacement

analogous to => (expression replacement, in rex)

(i) A -> V { '+' V}

(ii) V -> 'a' | 'b' | 'c' | .... | 'z'

Start with the root symbol

Apply(i):

V '+' V '+' V

2 occurrences of '+' V

Apply(ii):

'a' '+' 'b' '+' 'c'

i.e.

"a+b+c"

is the resulting string

Notes:

Replacement of strings is non-deterministic.
Applying productions is synthetic rather than analytic.
To parse, we must apply productions in reverse, ideally analytically and deterministically.
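The synthetic, non-deterministic character of applying productions can be seen in a small generator. This is a sketch under stated assumptions: each choice the grammar leaves open (how many repetitions, which letter) is made at random, and the cap of 3 repetitions is arbitrary. The names DerivationSketch, deriveA and deriveV are made up for this example.

```java
import java.util.Random;

class DerivationSketch {
    // Expands A -> V { '+' V }, choosing the number of repetitions at random.
    static String deriveA(Random rng) {
        StringBuilder sb = new StringBuilder(deriveV(rng));
        int repetitions = rng.nextInt(4);            // 0 to 3 occurrences of '+' V (arbitrary cap)
        for (int i = 0; i < repetitions; i++) {
            sb.append('+').append(deriveV(rng));
        }
        return sb.toString();
    }

    // Expands V -> 'a' | 'b' | .... | 'z', choosing one alternative at random.
    static String deriveV(Random rng) {
        return String.valueOf((char) ('a' + rng.nextInt(26)));
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(deriveA(rng));        // a different additive expression each time
        }
    }
}
```

Every string produced this way is an additive expression, but nothing in the generator helps decide whether a given string is one. That is the job of the parser.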

Parsing by Recursive Descent

Example productions

(i) A -> V { '+' V}

(ii) V -> 'a' | 'b' | 'c' | .... | 'z'

Form one "function" (or procedure or method, depending on setup) for each auxiliary symbol (designating syntactic categories).

A

V

Each function is responsible for recognizing that category in an input string.

The input string should be scanned left-to-right.

Parse Functions tell us how to get stuff from the input:

(i) A -> V { '+' V}

says

"To scan an A:

scan a V (if no V, fail).

While the next character is a '+':

absorb the '+' and scan another V (if none, fail)."

(ii) V -> 'a' | 'b' | 'c' | .... | 'z'

says

"To scan a V:

see if there is an 'a' or a 'b' or a 'c' or .....

If none, fail."

Parse Functions in Java

Our parse functions will return Objects, either

success:

a String, representing a variable (a leaf of tree)

OR

a List, representing a non-leaf tree

failure:

a ParseFailure object

We will construct the lists using Poly; they will print as S expressions.

  // PARSE FUNCTION for A -> V { '+' V }
 
  Object A()
    {
    Object result;
    Object V1 = V();
    if( isFailure(V1) ) return failure;
 
    result = V1;
 
    while( peek() == '+' )
      {
      nextChar();
      Object V2 = V();
      if( isFailure(V2) ) return failure;
      result = Poly.List.list("+", result, V2);
      }
    return result;
    }

Explanation: Each parse function (A or V in this case) scans characters left-to-right from the input stream, access to which is not shown explicitly. It returns either failure or a tree, which is either a single leaf, represented by a String, or a non-leaf tree, represented by a Poly.List.

In A(), V() is first called, corresponding to the first syntactic category on the right-hand side of the production A -> V { '+' V }. If the result is a failure, then the call to A() is a failure. Otherwise, the result value is started with the value of V(). The program then checks to see whether the next character is a '+'. If not, then the production for A is fulfilled, since {....} allows 0 occurrences of what is inside. In this case, the result is just returned. However, if there is a '+', then we absorb that '+' by calling nextChar() (the call to peek() only looked at the character, but did not take it from the input stream). We then call V() again. If that is a failure, the call to A must fail, since we have a '+' not followed by a V. If it succeeds, we build up the tree by forming a new tree with "+" as the root, the former result as the left sub-tree, and the value of V() as the right sub-tree. This build up continues as long as there are '+'s in the input stream.


 
 
// PARSE FUNCTION for V -> a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
 
  Object V()
    {
    if( isVar(peek()) )
      {
      return (new StringBuffer(1).append(nextChar())).toString();
      }      
    return failure;
    }

In V() we look at the next character and if it qualifies as a variable, we make a String out of it, using the Java incantation shown:

StringBuffer is a class representing a modifiable string (objects in class String cannot be modified once created).

We create a StringBuffer (1 character long), append the next character of the input to it, and return the contents of the StringBuffer as a String.

If the next character is not a variable, then we return failure to so indicate.

A convenient way to check whether a character is any one of a specific set is to use a switch statement, as shown below.

  // isVar indicates whether its argument is a variable
 
  boolean isVar(char c)
    {
    switch( c )
      {
      case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':
      case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':
      case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u':
      case 'v': case 'w': case 'x': case 'y': case 'z':
        return true;
      default:
        return false;
      }
    }

Certain sets of characters are pre-defined in the libraries. If we wanted to allow any "letter" as a variable, we could use, in place of isVar(c):

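Character.isLetter(c)

(The older Character.isJavaLetter(c) is deprecated in favor of Character.isJavaIdentifierStart(c), which also accepts characters such as '_' and '$'.)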

The call to A() is embedded in a method parse() which checks to see if there is any garbage remaining on the input line after A() is called. If so, it informs the user, but still regards the result returned by A() as something useable if it is not an error.

  Object parse()        // TOP LEVEL: parse with check for residual input
    {
    Object result = A();
    skipWhitespace();
    if( position < lastPosition )
      {
      System.out.print("*** Residual characters after input: ");
      while( !eof )
        {
        char c = nextChar();
        System.out.print(c);
        }
      System.out.println();
      }
    return result;
    }

The complete program, which includes the support methods such as peek() and nextChar() is in http://www.cs.hmc.edu/~keller/javaExamples/Parse/additive.java.
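For readers without access to that file or to Poly, the following self-contained sketch shows one way the pieces might fit together. It is not the author's program. It assumes the input is held in a String, uses java.util.List in place of Poly.List, and uses a sentinel object in place of the ParseFailure object. The class name AdditiveSketch and the '\0' end-of-input convention are also assumptions.

```java
import java.util.List;

class AdditiveSketch {
    static final Object failure = new Object();     // stand-in for the ParseFailure object
    private final String input;
    private int position = 0;

    AdditiveSketch(String input) { this.input = input; }

    // Look at the next character without consuming it; '\0' marks end of input.
    char peek() { return position < input.length() ? input.charAt(position) : '\0'; }

    // Consume and return the next character (call only after peek() has found one).
    char nextChar() { return input.charAt(position++); }

    boolean isFailure(Object x) { return x == failure; }

    // PARSE FUNCTION for A -> V { '+' V }
    Object A() {
        Object result = V();
        if (isFailure(result)) return failure;
        while (peek() == '+') {
            nextChar();                              // absorb the '+'
            Object v2 = V();
            if (isFailure(v2)) return failure;
            result = List.of("+", result, v2);       // former result becomes the left sub-tree
        }
        return result;
    }

    // PARSE FUNCTION for V -> 'a' | 'b' | .... | 'z'
    Object V() {
        char c = peek();
        if (c >= 'a' && c <= 'z') return String.valueOf(nextChar());
        return failure;
    }

    public static void main(String[] args) {
        System.out.println(new AdditiveSketch("a+b+c").A());            // prints [+, [+, a, b], c]
        System.out.println(new AdditiveSketch("a+").A() == failure);    // prints true
    }
}
```

Because java.util.List is used here, the tree prints with brackets and commas rather than as an S expression.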

Grammar for Additive and Multiplicative Expressions

We use the productions to enforce precedence, e.g. we want * to take precedence over + (or bind more tightly than +).

In the syntax tree, this means that we want + to be closer to the root, i.e. we want

a + b * c

to parse as the tree on the left, not the one on the right below:

The syntactic categories in this case are:

A: Additive expressions

M: Multiplicative expressions

V: Variables

The corresponding Productions are:

A -> M { '+' M }

M -> V { '*' V }

V -> 'a' | 'b' | 'c' | .... | 'z'

Coding of a parser for additive and multiplicative expressions:

A is as before, except that calls to M() replace calls to V()
M is like A was
E simply calls A

Note the analogy between multiplicative and additive in the current grammar:

A is to M as M is to V

Here is the code for the parse functions:

  // PARSE FUNCTION for A -> M { '+' M } 
 
  Object A()
    {
    Object result;
    Object M1 = M();
    if( isFailure(M1) ) return failure;
 
    result = M1;
 
    while( peek() == '+' )
      {
      nextChar();
      Object M2 = M();
      if( isFailure(M2) ) return failure;
      result = Poly.List.list("+", result, M2);
      }
    return result;
    }
 
 
  // PARSE FUNCTION for M -> V { '*' V } 
 
  Object M()
    {
    Object result;
    Object V1 = V();
    if( isFailure(V1) ) return failure;
 
    result = V1;
 
    while( peek() == '*' )
      {
      nextChar();
      Object V2 = V();
      if( isFailure(V2) ) return failure;
      result = Poly.List.list("*", result, V2);
      }
    return result;
    }

The complete program, which includes the support methods such as peek() and nextChar() is in http://www.cs.hmc.edu/~keller/javaExamples/Parse/addMult.java.
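The effect of the two-level grammar on precedence can be checked with a compact, self-contained sketch. It is an illustration under assumptions, not the program at the URL above: trees are built directly as S expression Strings, null stands for failure, and the class name AddMultSketch is made up.

```java
class AddMultSketch {
    private final String input;
    private int position = 0;

    AddMultSketch(String input) { this.input = input; }

    // Look at the next character without consuming it; '\0' marks end of input.
    char peek() { return position < input.length() ? input.charAt(position) : '\0'; }

    // A -> M { '+' M }; returns an S expression, or null on failure
    String A() {
        String result = M();
        if (result == null) return null;
        while (peek() == '+') {
            position++;                              // absorb the '+'
            String m2 = M();
            if (m2 == null) return null;
            result = "(+ " + result + " " + m2 + ")";
        }
        return result;
    }

    // M -> V { '*' V }; returns an S expression, or null on failure
    String M() {
        String result = V();
        if (result == null) return null;
        while (peek() == '*') {
            position++;                              // absorb the '*'
            String v2 = V();
            if (v2 == null) return null;
            result = "(* " + result + " " + v2 + ")";
        }
        return result;
    }

    // V -> 'a' | 'b' | .... | 'z'
    String V() {
        char c = peek();
        if (c < 'a' || c > 'z') return null;
        position++;
        return String.valueOf(c);
    }

    public static void main(String[] args) {
        System.out.println(new AddMultSketch("a+b*c").A());   // prints (+ a (* b c))
        System.out.println(new AddMultSketch("a*b+c").A());   // prints (+ (* a b) c)
    }
}
```

In both outputs + ends up at the root, because A() only ever combines complete results of M().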

Note: Productions enforce grouping as well as precedence.

A -> M { '+' M }

is left grouping

A -> { M '+' } M

would be right grouping.

("grouping" is sometimes called "associativity" but really is independent of whether the operator is an associative operator or not.)

To see that

A -> M { '+' M }

is really left grouping and not right, as might be first inferred, consider the parsing of an expression

a + b + c + d

The a is first parsed as an M (after first parsing it as a V). Then the b is parsed and added to the first, then the c is parsed and added to that, and so on.
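The difference between the two groupings can be sketched by building both trees from the same list of operands. This is an illustration only: the operands are assumed to be already parsed, the trees are written as S expression Strings, and the names GroupingSketch, groupLeft and groupRight are made up for the example.

```java
import java.util.List;

class GroupingSketch {
    // Left grouping: combine the running result with each new operand,
    // which is what the while loop in A() does.
    static String groupLeft(List<String> operands) {
        String result = operands.get(0);
        for (int i = 1; i < operands.size(); i++) {
            result = "(+ " + result + " " + operands.get(i) + ")";
        }
        return result;
    }

    // Right grouping: combine each operand with the already grouped remainder.
    static String groupRight(List<String> operands) {
        int last = operands.size() - 1;
        String result = operands.get(last);
        for (int i = last - 1; i >= 0; i--) {
            result = "(+ " + operands.get(i) + " " + result + ")";
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> operands = List.of("a", "b", "c", "d");
        System.out.println(groupLeft(operands));    // prints (+ (+ (+ a b) c) d)
        System.out.println(groupRight(operands));   // prints (+ a (+ b (+ c d)))
    }
}
```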

Problems:

Extend the grammar to include A^B (A raised to the power B) where ^ takes precedence over *.
Extend the grammar to include parenthesis grouping, where the expression in the group should function as if it were a single variable.

Whitespace refers to characters such as

space: ' '

tab: '\t'

form-feed: '\f' (control-L)

Whitespace is usually not indicated explicitly in grammars, although it could be:

W -> ' ' | '\t' | '\f'

is a production for a single whitespace character. Thus

{ W }

denotes any number of whitespace characters, e.g.

A -> {W} V {W} { '+' {W} V {W} }

would allow whitespace to be inserted before or after any variable.
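In a recursive-descent parser, { W } typically becomes a loop that skips whitespace wherever the production permits it. The sketch below is a recognizer only (it returns true or false rather than a tree) and assumes the production A -> {W} V {W} { '+' {W} V {W} }, so that whitespace may follow every variable. The class name WhitespaceSketch and the helper recognizes are assumptions made for this example.

```java
class WhitespaceSketch {
    private final String input;
    private int position = 0;

    WhitespaceSketch(String input) { this.input = input; }

    // Look at the next character without consuming it; '\0' marks end of input.
    char peek() { return position < input.length() ? input.charAt(position) : '\0'; }

    // W -> ' ' | '\t' | '\f'
    static boolean isWhitespace(char c) { return c == ' ' || c == '\t' || c == '\f'; }

    // { W }: skip zero or more whitespace characters
    void skipWhitespace() {
        while (isWhitespace(peek())) position++;
    }

    // A -> {W} V {W} { '+' {W} V {W} }
    boolean A() {
        skipWhitespace();
        if (!V()) return false;
        skipWhitespace();
        while (peek() == '+') {
            position++;                  // absorb the '+'
            skipWhitespace();
            if (!V()) return false;
            skipWhitespace();
        }
        return true;
    }

    // V -> 'a' | 'b' | .... | 'z'
    boolean V() {
        char c = peek();
        if (c < 'a' || c > 'z') return false;
        position++;
        return true;
    }

    // True if the whole string is an additive expression with optional whitespace.
    static boolean recognizes(String s) {
        WhitespaceSketch p = new WhitespaceSketch(s);
        return p.A() && p.position == s.length();
    }

    public static void main(String[] args) {
        System.out.println(recognizes(" a +\tb "));   // prints true
        System.out.println(recognizes("a + + b"));    // prints false
    }
}
```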

In most languages, whitespace is allowed between most syntactic units, except within identifiers.

An exception is FORTRAN where

DO 10 I

could be a variable DO10I or the start of a DO statement.

Another useful grammar construct:

[ .... ]

means 0 or 1 occurrence of ...., i.e. that .... is optional.

Example:

U is unsigned numerals

N is optionally signed numerals

D is a digit

Then the productions are:

N -> ['+' | '-'] U

U -> D {D}
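In a parse function, [ .... ] becomes an if rather than a while: the optional part is scanned at most once. The following sketch of N and U is written under assumptions: D is taken to be one of the characters '0' through '9', the functions return the matched text or null on failure, and the class name NumeralSketch is made up.

```java
class NumeralSketch {
    private final String input;
    private int position = 0;

    NumeralSketch(String input) { this.input = input; }

    // Look at the next character without consuming it; '\0' marks end of input.
    char peek() { return position < input.length() ? input.charAt(position) : '\0'; }

    // N -> ['+' | '-'] U
    String N() {
        String sign = "";
        if (peek() == '+' || peek() == '-') {        // optional part: 0 or 1 occurrence
            sign = String.valueOf(input.charAt(position++));
        }
        String u = U();
        return u == null ? null : sign + u;
    }

    // U -> D {D}
    String U() {
        int start = position;
        while (peek() >= '0' && peek() <= '9') position++;
        // At least one digit is required, since the first D is not inside braces.
        return position > start ? input.substring(start, position) : null;
    }

    public static void main(String[] args) {
        System.out.println(new NumeralSketch("-42").N());   // prints -42
        System.out.println(new NumeralSketch("7").N());     // prints 7
        System.out.println(new NumeralSketch("+").N());     // prints null
    }
}
```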

Problem:

Give a grammar for the floating-point numerals, e.g.

123.

.456

123.456

1e-10

123.456e10

etc.

Problem:

Write the grammar and parser for the tautology checker. The operator symbol precedence is:

' (not) tightest

* (and)

+ (or)

> (implies)

= (equals, if-and-only-if)

Make * and + associative and > and = non-associative (i.e. a>b>c is not allowed; it must be either a>(b>c) or (a>b)>c).

Parentheses are allowed. Variables are single letters. Constants are 0 and 1.

Closing notes on Notation:

Sometimes { } is replaced by recursive productions:

{A}

is the same as

B

where

B -> empty string

B -> A B
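The correspondence carries over to parse functions: { } becomes a loop, while the recursive productions become a recursive function. In this sketch the category written A above is assumed, purely for illustration, to be the single character 'a'. Both functions return the position just past the repeated items, and the names RepetitionSketch, loopForm and recursiveForm are made up.

```java
class RepetitionSketch {
    // Loop form of { A }: advance past zero or more occurrences of 'a'.
    static int loopForm(String s, int pos) {
        while (pos < s.length() && s.charAt(pos) == 'a') pos++;
        return pos;
    }

    // Recursive form, following the two productions for B.
    static int recursiveForm(String s, int pos) {
        if (pos < s.length() && s.charAt(pos) == 'a') {
            return recursiveForm(s, pos + 1);        // B -> A B
        }
        return pos;                                  // B -> empty string
    }

    public static void main(String[] args) {
        System.out.println(loopForm("aaab", 0));        // prints 3
        System.out.println(recursiveForm("aaab", 0));   // prints 3
    }
}
```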

Sometimes superscript * denotes { }, e.g.

(R | S)*

is the same as

{R | S}