In order to construct a parse tree you will need to declare a class for each non-terminal and each production in your grammar. The class for each non-terminal should be declared as abstract and each of the non-terminal's productions should be declared as a subclass. If a terminal carries information, such as a number or id, then that terminal should also have its own top-level class. If a terminal has only one possible value or is a punctuation character, there is no need to store it since you will know its value based on its production. For example, if you have the production E -> E - E, there is no need to store the minus sign since you will know that the production represents a minus expression.
In each production's subclass you will need to have pointers to nodes that represent non-terminals on the right hand side of the production. If there are terminals that carry information, such as numbers or ids, then you will need to pointers to those nodes as well. The reason that productions should be subclasses of their left hand side nonterminal is that they expand that nonterminal and therefore represent one of the potential subtrees rooted at that nonterminal.
As an example of how you might construct a parse tree, consider the following expression grammar:
pgm -> stmt* stmt -> id = exp | print id exp -> exp + exp | exp - exp | exp * exp | exp / exp | ( exp ) | - exp | id | numberThe nonterminals are pgm, stmt and exp so we need abstract classes for these three nonterminals:
class pgm {} class statement {} class exp_node {}We will start by defining the subclasses for the productions associated with an expression. The easiest subclasses are those that represent the terminals number and id. The subclass for number needs to store the value of the number:
class number_node : public exp { protected: int num; public: number_node::number_node(float value) { num = value; } };Likewise, the subclass for id needs to store the string value of the id:
class id_node : public exp { protected: string id; public: id_node(string value) : id(value) {} };
Next we define subclasses for the arithmetic productions (+, -, *, /). I am only showing the subclasses for Plus and Times:
class plus_node : public node { protected: node *left; node *right; public: plus_node(node *L, node *R): left(L), right(R) {} }; class times_node : public node { protected: node *left; node *right; public: times_node(node *L, node *R): left(L), right(R) {} };The plus and times nodes look pretty similar, and in fact they are so similar that we can factor some of their common node into a shared superclass, called operator_node:
class operator_node : public node { protected: node *left; node *right; public: operator_node(node *L, node *R): left(L), right(R) {} }; class plus_node : public operator_node { public: plus_node(node *L, node *R): operator_node(L, R) {} }; class times_node : public operator_node { public: times_node(node *L, node *R): operator_node(L, R) {} };For now we will continue to create classes for our remaining productions. The next class is for the unary minus production:
class unary_minus_node : public node { protected: node *exp; public: unary_minus_node(node *expToNegate): exp(expToNegate) {} };Finally we come to the parenthesized expression production. This production is one that is useful for deriving the syntactic meaning of the program, but is useless for deriving the semantic meaning. Hence we will not explicitly represent this production in the syntax tree, but instead pass its expression up to the next node in the chain. We will see how to do that below. For now the important thing is that we do not need to create a class for the parenthesized expression production.
After we have defined the classes and subclasses for our grammar, we need to consider the set of attributes that we will need. For our simple expression grammar we need only one attribute, which is the value of the expression. Remember that attributes are declared in the class associated with the non-terminal, and that the subclasses define methods for evaluating these attributes. Hence the value attribute is declared in the exp class:
class exp { protected: float num; };Next we define a method for evaluating the value attribute. I will call it evaluate. Each production will have its own rule for evaluating the value attribute, so we need to declare the evaluate method to be a pure virtual method:
class exp { protected: float num; public: virtual evaluate() = 0; };Each expression subclass will now define an appropriate evaluate method. Here is a representative sample:
// a number just returns its value float number_node::evaluate() { return num; } // an id looks itself up in the idTable float id_node::evaluate() { return idTable[id]; } // a plus operator evaluates its left and right operands and // then adds them together float plus_node::evaluate() { float left_num, right_num; left_num = left->evaluate(); right_num = right->evaluate(); num = left_num + right_num; return num; }
Now we are ready to use Bison to build an abstract syntax tree for strings that can be generated using this expression grammar. There are two steps:
%union { float num; char *id; exp_node *expnode; } %type <expnode> exp
exp: exp PLUS exp { $$ = new plus_node($1, $3); }
We frequently want to represent productions that represent lists as a single root node with a list of children, as opposed to a spindly "vine" of nodes. For example, our production for a program is ideally written as:
pgm -> stmt*and is ideally represented in an abstract syntax tree as:
---- pgm----- / / \ stmt1 stmt2 ... stmtnHowever, to get Bison to recognize the grammar, we must rewrite it as:
pgm -> stmtlist stmtlist -> stmtlist stmt | εwhich creates a viney looking tree:
pgm | stmtlist / \ stmtlist stmtn ... / \ stmtlist stmt2 / \ stmtlist stmt1 | εWe can flatten a tree by starting a list when we encounter the &epsilon production, and simply appending children to this list thereafter. For example:
%union { ... list*stmts; pgm *prog; } %type <stmts> stmtlist %type <prog> program // $1 is a list of stmts program : stmtlist { $$ = new pgm($1); } ; stmtlist : stmtlist stmt NEWLINE { // copy up the list and add the stmt to it $$ = $1; $1->push_back($2); } | { $$ = new list (); }
You will need to store a pointer to the root of your abstract syntax tree in a global variable, since yyparse always returns an int (0 for success, 1 for failure). Here's how I did it in the expression grammar:
%{ // definitions section // the root of the abstract syntax tree pgm *root; %} %% /* rules section */ program : stmtlist { $$ = new pgm($1); root = $$; } ;