<head>
<link rel="stylesheet" type="text/css" href="../cs461_notes.css" />
</head>

<center>
<h2> Building Abstract Syntax Trees </h2>
</center>
<hr>
The files associated with this lecture are
<ul>
<li>flex input file <a href="exp.lex"> exp.lex</a>
<li>bison input file <a href="exp.yacc"> exp.yacc</a> 
<li>class declarations <a href="exp.h"> exp.h</a> 
<li>class definitions <a href="exp.cpp"> exp.cpp</a> 
<li><a href="makefile">makefile</a>
</ul>

<hr>
<h2> Classes that Need to be Declared </h2>
<p>
In order to construct a parse tree you will need to declare a class for
each non-terminal and each production in your grammar. The class for each 
non-terminal should be declared as abstract and each of the non-terminal's
productions should be declared as a subclass. If a terminal carries information,
such as a number or id, then that terminal should also have its own
top-level class. If a terminal has only one possible value or is a punctuation
character, there is no need to store it since you will know its value based
on its production. For example, if you have the production <tt>E -> E - E</tt>,
there is no need to store the minus sign since you will know that the
production represents a minus expression. 
<p>
In each production's subclass
you will need to have pointers to nodes that represent non-terminals on
the right hand side of the production. If there are terminals that carry
information, such as numbers or ids, then you will need pointers to
those nodes as well. The reason that productions should be subclasses of
their left hand side nonterminal is that they expand that nonterminal and
therefore represent one of the potential subtrees rooted at that nonterminal.
<p>
As an example of how you might construct a parse tree, consider the following
expression grammar:
<pre>
   pgm -> stmt*
   stmt -> id = exp 
        |  print id
   exp -> exp + exp | exp - exp | exp * exp | exp / exp 
        | ( exp ) | - exp | id | number 
</pre>
The nonterminals are <i>pgm</i>, <i>stmt</i> and <i>exp</i> so we need abstract classes for
these three nonterminals:
<pre>
class pgm {};
class statement {};
class exp_node {};
</pre>
We will start by defining the subclasses for the productions associated
with an expression.
The easiest subclasses are those that represent the terminals <b>number</b>
and <b>id</b>.
The subclass for <b>number</b> needs to store the value
of the number:
<pre>
class number_node : public exp_node {
  protected:
    float num; 

  public:
    number_node(float value) : num(value) {}
};
</pre>
Likewise, the subclass for <b>id</b> needs to store the string value of
the id:
<pre>
class id_node : public exp_node {
  protected:
    string id;

  public:
    id_node(string value) : id(value) {}
};
</pre>
<p>
Next we define subclasses for the arithmetic productions (+, -, *, /). I
am only showing the subclasses for Plus and Times:
<pre>
class plus_node : public exp_node {
  protected:
    exp_node *left;
    exp_node *right;

  public:
    plus_node(exp_node *L, exp_node *R): left(L), right(R) {}
};

class times_node : public exp_node {
  protected:
    exp_node *left;
    exp_node *right;

  public:
    times_node(exp_node *L, exp_node *R): left(L), right(R) {}
};
</pre>
The plus and times nodes look pretty similar, and in fact they are so 
similar that we can factor their common code into a
shared superclass, called <tt>operator_node</tt>:
<pre>
class operator_node : public exp_node {
  protected:
    exp_node *left;
    exp_node *right;
  public:
    operator_node(exp_node *L, exp_node *R): left(L), right(R) {}
};

class plus_node : public operator_node {
  public:
    plus_node(exp_node *L, exp_node *R): operator_node(L, R) {}
};
class times_node : public operator_node {
  public:
    times_node(exp_node *L, exp_node *R): operator_node(L, R) {}
};
</pre>
For now we will continue to create classes for our remaining productions.
The next class is for the unary minus production:
<pre>
class unary_minus_node : public exp_node {
  protected:
    exp_node *exp;

  public:
    unary_minus_node(exp_node *expToNegate): exp(expToNegate) {}
};
</pre>
Finally we come to the parenthesized expression production. Parentheses
matter only during parsing, where they determine how operands are grouped;
once the tree is built, that grouping is already encoded in the tree's shape.
Hence we will not explicitly represent this production in the syntax tree.
Instead, its action passes the pointer for the inner expression up to the
next node in the chain (in Bison terms, <tt>$$ = $2</tt>). The important
thing is that we do not need to create a class for the
parenthesized expression production.
<hr>
<h2> Adding Attributes and Attribute Evaluation Rules </h2>
<p>
After we have defined the classes and subclasses for our grammar, we need
to consider the set of attributes that we will need. For our simple expression
grammar we need only one attribute, which is the value of the 
expression. Remember that attributes are declared in the class associated
with the non-terminal, and that the subclasses define methods for evaluating
these attributes. Hence the <i>value</i> attribute is declared in the
<b>exp_node</b> class, where it is stored in the field <tt>num</tt>:
<pre>
class exp_node {
  protected:
    float num;
};
</pre>
Next we define a method for evaluating the <i>value</i> attribute. I will
call it <tt>evaluate</tt>. Each production will have its own rule for
evaluating the <i>value</i> attribute, so we need to declare the <tt>evaluate</tt>
method to be a pure virtual method:
<pre>
class exp_node {
  protected:
    float num;
  <b>public:
    virtual float evaluate() = 0;</b>
};
</pre>
Each expression subclass will now define an appropriate evaluate method.
Here is a representative sample:
<pre>
// a number just returns its value
float number_node::evaluate() { 
  return num; 
}

// an id looks itself up in the idTable
float id_node::evaluate() { 
  return idTable[id]; 
}

// a plus operator evaluates its left and right operands and
// then adds them together
float plus_node::evaluate() {
  float left_num, right_num;

  left_num  = left->evaluate();
  right_num = right->evaluate();

  num = left_num + right_num;
  return num;
}
</pre>
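<p>
To see the evaluation rules working together, here is a minimal, self-contained
sketch that assembles the classes from this section and evaluates a hand-built
tree for <tt>-(x + 3) * 4</tt>. Several details are assumptions made for
illustration and may differ from exp.h and exp.cpp: <tt>idTable</tt> is taken
to be a <tt>std::map</tt> from names to floats, each <tt>evaluate</tt> returns
its result directly instead of caching it in <tt>num</tt>, the member
<tt>operand</tt>, the helper <tt>demo</tt>, and the destructors are hypothetical
additions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumed symbol table: maps identifier names to their current values.
std::map<std::string, float> idTable;

// Abstract class for the exp non-terminal.
class exp_node {
  public:
    virtual ~exp_node() {}
    virtual float evaluate() = 0;
};

class number_node : public exp_node {
    float num;
  public:
    explicit number_node(float value) : num(value) {}
    float evaluate() override { return num; }
};

class id_node : public exp_node {
    std::string id;
  public:
    explicit id_node(const std::string &value) : id(value) {}
    // An id that was never assigned evaluates to 0 (map default).
    float evaluate() override { return idTable[id]; }
};

// Shared superclass for the binary operators.
class operator_node : public exp_node {
  protected:
    exp_node *left;
    exp_node *right;
  public:
    operator_node(exp_node *L, exp_node *R) : left(L), right(R) {
      assert(L != nullptr && R != nullptr);
    }
    ~operator_node() override { delete left; delete right; }
};

class plus_node : public operator_node {
  public:
    using operator_node::operator_node;  // inherit the (L, R) constructor
    float evaluate() override { return left->evaluate() + right->evaluate(); }
};

class times_node : public operator_node {
  public:
    using operator_node::operator_node;
    float evaluate() override { return left->evaluate() * right->evaluate(); }
};

class unary_minus_node : public exp_node {
    exp_node *operand;
  public:
    explicit unary_minus_node(exp_node *e) : operand(e) {}
    ~unary_minus_node() override { delete operand; }
    float evaluate() override { return -operand->evaluate(); }
};

// Hypothetical helper: builds the tree for -(x + 3) * 4 by hand.
// Note that the parentheses produce no node of their own.
float demo() {
  idTable["x"] = 2.0f;
  exp_node *tree = new times_node(
      new unary_minus_node(
          new plus_node(new id_node("x"), new number_node(3.0f))),
      new number_node(4.0f));
  float result = tree->evaluate();
  delete tree;  // each node deletes its own children
  return result;
}
```

Each node only knows how to combine the values of its children, so evaluating
the root walks the whole tree through virtual dispatch.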
<hr>
<h2> Building Abstract Syntax Trees in Bison</h2>
<p>
Now we are ready to use Bison to build an abstract syntax
tree for strings that
can be generated using this expression grammar. 
There are two steps:
<p>
<ol>
<li> Declare the types of the semantic values that the parser passes
     around. Each non-terminal's value will be a pointer to the tree
     node built for it, so we declare these values as pointers to
     classes. This is
     done via the <tt>%union</tt> and <tt>%type</tt> directives.
     We add a field to the <tt>%union</tt> that declares a pointer
     to an <tt>exp_node</tt>, and then use the <tt>%type</tt> directive
     to make the value of the <tt>exp</tt> non-terminal
     be a pointer to an <tt>exp_node</tt>:
<pre>
%union {
  float num;
  char *id;
  exp_node *expnode;
}
  
%type &lt;expnode&gt; exp
</pre>
<li> For each production in our grammar that has a subclass defined
     for it, write
     an action rule to create an instance of that class. Typically
     the action rule will simply pass pointers to the production's
     children to the constructor. For example, to construct a 
     plus_node, we pass it the pointers to its two operands:
<pre>
exp:	exp PLUS exp {
	  $$ = new plus_node($1, $3); }
</pre>
</ol>
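<p>
The same pattern extends to the other expression productions. The fragment
below is a sketch rather than the contents of exp.yacc: the token names
<tt>NUMBER</tt>, <tt>ID</tt>, <tt>MINUS</tt>, <tt>TIMES</tt>, <tt>DIVIDE</tt>,
<tt>LPAREN</tt>, <tt>RPAREN</tt> and the precedence-only token <tt>UMINUS</tt>
are assumed, as are the precedence declarations. It shows how the
parenthesized production creates no node and simply forwards its child.

```yacc
/* assumed token declarations: values come from the %union fields */
%token <num> NUMBER
%token <id>  ID

/* assumed precedences: later lines bind more tightly */
%left  PLUS MINUS
%left  TIMES DIVIDE
%right UMINUS

%%

exp : exp PLUS exp            { $$ = new plus_node($1, $3); }
    | exp TIMES exp           { $$ = new times_node($1, $3); }
    | MINUS exp %prec UMINUS  { $$ = new unary_minus_node($2); }
    | LPAREN exp RPAREN       { $$ = $2; /* no node for parentheses */ }
    | NUMBER                  { $$ = new number_node($1); }
    | ID                      { $$ = new id_node($1); }
    ;
```

The <tt>%prec UMINUS</tt> annotation gives unary minus a higher precedence than
the binary operators, which would otherwise share the precedence of
<tt>MINUS</tt>.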
<hr>
<h2> Flattening Lists </h2>
<p>
We frequently want to represent a list-like production
as a single root node with a list of children, as opposed
to a spindly "vine" of nodes. For example, our production for
a program is
ideally written as:
<pre>
pgm -> stmt<sup>*</sup>
</pre>
and is ideally represented in an abstract syntax tree as:
<pre>
     ---- pgm-----
    /    /        \
stmt1  stmt2 ...  stmt<sub>n</sub>
</pre>
However, to get Bison to recognize the grammar, we must rewrite
it as:
<pre>
pgm -> stmtlist
stmtlist -> stmtlist stmt
         |  &epsilon;
</pre>
which creates a vine-like tree:
<pre>
                 pgm
                  |
              stmtlist
              /      \
          stmtlist stmt<sub>n</sub>
             ...
          /      \
      stmtlist  stmt2
      /      \
   stmtlist stmt1
      |
      &epsilon;
</pre>
We can flatten a tree by starting a list when we encounter the
&epsilon; production, and simply appending children to this list
thereafter. For example:
<pre>
%union {
  ...
  list&lt;statement *&gt; *stmts;
  pgm *prog;
}

%type &lt;stmts&gt; stmtlist
%type &lt;prog&gt; program

// $1 is a list of stmts
program : stmtlist { $$ = new pgm($1); }
;

stmtlist : stmtlist stmt NEWLINE 
            { // copy up the list and add the stmt to it
              $$ = $1;
              $1->push_back($2);
            }
         |  { $$ = new list&lt;statement *&gt;(); }
;
</pre>
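<p>
The effect of these two actions can be imitated in plain C++ without running
Bison. The sketch below is only an illustration: <tt>reduce_empty</tt> and
<tt>reduce_append</tt> are hypothetical helpers that mirror the two
<tt>stmtlist</tt> actions, and the <tt>statement</tt> and <tt>pgm</tt> classes
are minimal stand-ins (with an assumed <tt>label</tt> field) rather than the
ones in exp.h.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Minimal stand-in for the statement class; the label is hypothetical.
struct statement {
    std::string label;
    explicit statement(const std::string &l) : label(l) {}
};

// The root node owns one flat list of statements.
class pgm {
    std::list<statement *> *stmts;
  public:
    explicit pgm(std::list<statement *> *stmtlist) : stmts(stmtlist) {
      assert(stmtlist != nullptr);
    }
    ~pgm() {
      for (statement *s : *stmts) delete s;
      delete stmts;
    }
    std::size_t size() const { return stmts->size(); }
    // Joins the statement labels in list order, separated by spaces.
    std::string labels() const {
      std::string out;
      for (statement *s : *stmts) {
        if (!out.empty()) out += ' ';
        out += s->label;
      }
      return out;
    }
};

// Mirrors the action for: stmtlist -> epsilon
std::list<statement *> *reduce_empty() {
  return new std::list<statement *>();
}

// Mirrors the action for: stmtlist -> stmtlist stmt
std::list<statement *> *reduce_append(std::list<statement *> *lst,
                                      statement *s) {
  lst->push_back(s);  // append to the existing list
  return lst;         // "copy up" the same pointer
}

// Replays the reductions a parser would perform for three statements.
std::string flatten_demo() {
  std::list<statement *> *lst = reduce_empty();
  lst = reduce_append(lst, new statement("stmt1"));
  lst = reduce_append(lst, new statement("stmt2"));
  lst = reduce_append(lst, new statement("stmt3"));
  pgm program(lst);  // mirrors: program -> stmtlist
  return program.labels();
}
```

Because the grammar is left recursive, the &epsilon; reduction happens first
and the statements are appended in source order, so no reversal is needed.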
<hr>
<h2> Returning the Abstract Syntax Tree from the Parser </h2>
<p>
You will need to store a pointer to the root of your abstract
syntax tree in a global variable, since <tt>yyparse</tt> always
returns an <tt>int</tt> status code (0 for success, nonzero for failure)
rather than the tree itself. Here's
how I did it in the expression grammar:
<pre>
%{
// definitions section
// the root of the abstract syntax tree
pgm *root;
%}

%%

  /* rules section */ 
program : stmtlist { $$ = new pgm($1); root = $$; }
;
</pre>
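<p>
On the calling side, the usual pattern is to check the status code first and
only then touch the global. The following sketch shows that pattern under
stated assumptions: <tt>parse_stub</tt> is a hypothetical stand-in for the
Bison-generated <tt>yyparse</tt>, <tt>run_parser</tt> is a hypothetical driver,
and <tt>pgm</tt> is reduced to a struct with an assumed
<tt>statement_count</tt> field so that the example compiles on its own.

```cpp
#include <cassert>
#include <cstdio>

// Stand-in for the real pgm class.
struct pgm {
    int statement_count;
};

// The root of the abstract syntax tree, set by the parser.
pgm *root = nullptr;

// Hypothetical stand-in for yyparse(): returns 0 on success and sets
// root, or returns 1 on failure and leaves root untouched.
int parse_stub(bool succeed) {
  assert(root == nullptr);  // the caller resets root before parsing
  if (!succeed) return 1;
  root = new pgm{2};
  return 0;
}

// Hypothetical driver: returns the number of statements parsed,
// or -1 if parsing failed.
int run_parser(bool succeed) {
  root = nullptr;
  if (parse_stub(succeed) != 0 || root == nullptr) {
    std::fprintf(stderr, "parse failed\n");
    return -1;
  }
  int count = root->statement_count;  // safe: the parse succeeded
  delete root;
  root = nullptr;
  return count;
}
```

In a real program the call to <tt>parse_stub</tt> would be a call to
<tt>yyparse()</tt>, and the work done with <tt>root</tt> would be the tree
traversal.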