Building Parse Trees


Classes that Need to be Declared

In order to construct a parse tree you will need to declare a class for each non-terminal and each production in your grammar. The class for each non-terminal should be declared as abstract and each of the non-terminal's productions should be declared as a subclass. If a terminal carries information, such as a number or id, then that terminal should also have its own top-level class. If a terminal has only one possible value or is a punctuation character, there is no need to store it since you will know its value based on its production. For example, if you have the production E -> E - E, there is no need to store the minus sign since you will know that the production represents a minus expression.

In each production's subclass you will need to have pointers to nodes that represent non-terminals on the right hand side of the production. If there are terminals that carry information, such as numbers or ids, then you will need pointers to those nodes as well. The reason that productions should be subclasses of their left hand side nonterminal is that they expand that nonterminal and therefore represent one of the potential subtrees rooted at that nonterminal.

As an example of how you might construct a parse tree, consider the following grammar:

Pgm -> Exp
Exp -> number | Exp + Exp | Exp - Exp
The nonterminals are Pgm and Exp so we need abstract classes for these two nonterminals:
abstract class Exp {}
abstract class Pgm {}
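The terminal number needs to have a class as well, which stores the value of the number. Because number appears alone on the right hand side of the production Exp -> number, this class also serves as the subclass for that production and therefore extends Exp: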
class Number extends Exp {
    int number; 

    public Number(String num) {
       number = Integer.parseInt(num);
    }
}
Pgm has only one production so it has one subclass:
class PgmExpression extends Pgm {
    Exp child;
    public PgmExpression(Exp e) { child = e; }
}
Note that this class has a pointer to an expression because the parse tree rooted at Pgm will expand to an Exp node. Also note that the Exp node represents an abstract class and hence any production subclass of Exp can be passed as the child node.

Next we define the remaining two subclass productions for Exp:

class MinusExp extends Exp {
    Exp child1;
    Exp child2;

    public MinusExp(Exp left, Exp right) {
        child1 = left;
        child2 = right;
    }
}

class PlusExp extends Exp {
    Exp child1;
    Exp child2;

    public PlusExp(Exp left, Exp right) {
        child1 = left;
        child2 = right;
    }
}
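
To see how these classes fit together, here is a minimal sketch that builds the parse tree for the string 1 + 2 - 3 by hand, grouping it as (1 + 2) - 3. The node classes are repeated from above so that the sketch compiles on its own. The ParseTreeDemo class and its build method are hypothetical names introduced only for this illustration.

```java
// Node classes from the text, repeated so the sketch is self-contained.
abstract class Exp {}
abstract class Pgm {}

class Number extends Exp {
    int number;
    Number(String num) { number = Integer.parseInt(num); }
}

class PgmExpression extends Pgm {
    Exp child;
    PgmExpression(Exp e) { child = e; }
}

class MinusExp extends Exp {
    Exp child1, child2;
    MinusExp(Exp left, Exp right) { child1 = left; child2 = right; }
}

class PlusExp extends Exp {
    Exp child1, child2;
    PlusExp(Exp left, Exp right) { child1 = left; child2 = right; }
}

// Hypothetical driver class, not part of the original notes.
class ParseTreeDemo {
    // Builds the tree for "1 + 2 - 3", grouped as (1 + 2) - 3.
    static Pgm build() {
        Exp sum = new PlusExp(new Number("1"), new Number("2"));
        Exp difference = new MinusExp(sum, new Number("3"));
        return new PgmExpression(difference);
    }

    public static void main(String[] args) {
        PgmExpression root = (PgmExpression) build();
        MinusExp top = (MinusExp) root.child;
        // The left operand of the minus node is the "1 + 2" subtree.
        System.out.println(top.child1 instanceof PlusExp);
        // The right operand is the leaf holding 3.
        System.out.println(((Number) top.child2).number);
    }
}
```

Notice that the operators never appear as fields. The class of each node (PlusExp or MinusExp) is what records which production was used.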


Building Parse Trees in Antlr

Now suppose that we want to use Antlr to build a parse tree for strings that can be generated using this grammar. Here is how the Antlr specification might be written:

pgm returns [Pgm value] 
      : e=exp { $value = new PgmExpression($e.value); }
      ;
exp returns [Exp value]
      :  n=NUM { $value = new Number($n.text); }
      |  l=exp '+' r=exp { $value = new PlusExp($l.value, $r.value); }
      |  l=exp '-' r=exp { $value = new MinusExp($l.value, $r.value); }
      ;


Building Abstract Syntax Trees

Often you want to abstract away parts of the parse tree that do not add "meaning" to your program. In the above example, you actually started constructing an abstract syntax tree, because the parse tree you constructed did not include the operator symbols '+' and '-'. Instead, their meaning was embedded in the two classes you created, PlusExp and MinusExp. You also were not concerned with preserving the start non-terminal. What you really wanted was an expression tree for the expression input by the user. You could abstract away the Pgm non-terminal by having its action return the expression tree, rather than a Pgm object:

pgm returns [Exp value] 
      : e=exp { $value = $e.value; }
      ;
As another example, suppose we extended the above expression grammar by adding parenthesized expressions:
Pgm -> Exp
Exp -> number | Exp + Exp | Exp - Exp | ( Exp )
The parentheses are there to allow us to properly group the operands associated with operators, and can be discarded from the parse tree. Additionally, we do not need to define a subclass for the parenthesized expression. Instead we can simply return its expression tree, thus abstracting away the parentheses:
exp returns [Exp value]
    :  other productions
    | '(' e=exp ')' { $value = $e.value; }
    ;
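
The following sketch illustrates why no node is needed for the parentheses: the grouping they express is already captured by the shape of the tree. It builds by hand the trees that the actions above would return for 1 - (2 - 3) and (1 - 2) - 3. The AstDemo class and its eval helper are hypothetical, added only to make the difference between the two trees visible. The node classes are repeated from earlier so the sketch compiles on its own.

```java
// Node classes from the text, repeated so the sketch is self-contained.
abstract class Exp {}

class Number extends Exp {
    int number;
    Number(String num) { number = Integer.parseInt(num); }
}

class MinusExp extends Exp {
    Exp child1, child2;
    MinusExp(Exp left, Exp right) { child1 = left; child2 = right; }
}

class PlusExp extends Exp {
    Exp child1, child2;
    PlusExp(Exp left, Exp right) { child1 = left; child2 = right; }
}

// Hypothetical demo class, not part of the original notes.
class AstDemo {
    // Hypothetical helper: computes the value of an expression tree.
    static int eval(Exp e) {
        if (e instanceof Number) {
            return ((Number) e).number;
        }
        if (e instanceof PlusExp) {
            PlusExp p = (PlusExp) e;
            return eval(p.child1) + eval(p.child2);
        }
        if (e instanceof MinusExp) {
            MinusExp m = (MinusExp) e;
            return eval(m.child1) - eval(m.child2);
        }
        throw new IllegalArgumentException("unknown node type");
    }

    // Tree for "1 - (2 - 3)": the parenthesized part becomes the right child.
    static Exp rightGrouped() {
        return new MinusExp(new Number("1"),
                            new MinusExp(new Number("2"), new Number("3")));
    }

    // Tree for "(1 - 2) - 3": the parenthesized part becomes the left child.
    static Exp leftGrouped() {
        return new MinusExp(new MinusExp(new Number("1"), new Number("2")),
                            new Number("3"));
    }

    public static void main(String[] args) {
        System.out.println(eval(rightGrouped())); // 1 - (2 - 3) = 2
        System.out.println(eval(leftGrouped()));  // (1 - 2) - 3 = -4
    }
}
```

Both trees use only Number and MinusExp nodes, yet they evaluate differently, because the parenthesized subexpression is simply attached as the left or right child.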