Bison

The GNU Project yacc replacement

These notes provide a summary of Bison. The full reference manual is located at http://www.gnu.org/software/bison/manual/bison.html.
Normally Bison is run in batch (offline) mode. In these notes, I am presenting an expression interpreter that needs to run in interactive mode, because an expression should be evaluated every time a newline character is entered. If you are running Bison in interactive mode, add the directive
```
%option interactive
```
to the top of your flex file. If you fail to do so, then flex will wait to read your last token until it sees more input. As this last token is the newline character, my parser will fail to reduce the last expression until the next expression is entered.

Bison is a parser generator, just as flex is a lexer generator. It requires an input file that you write which is similar to that required by flex. Generally bison input files are given a .y or .yacc extension. Bison generates a parser function named yyparse(), which you can then call from a main program.

A bison input file consists of 3 sections: definitions, rules, and user subroutines. These sections are separated by lines containing two percent signs (%%). The first 2 sections are required, although either may be empty. The third section, which is the user subroutine section, and its preceding %% are optional.

Definition Section

Like flex, this section can contain a literal block of C/C++ code which is copied verbatim to the beginning of the generated C file. This block is set off using %{ and %}. Other declarations which can be contained here are %union, %start, %left, %right, %token, %type, and %nonassoc. It can also contain C style comments.

%start

The %start directive tells bison which non-terminal is the start symbol of your grammar. Here the start non-terminal is program:

%start program

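The return value of yyparse() is a status code, not the semantic value of your start non-terminal: it is 0 if parsing succeeds, 1 if parsing fails because of invalid input, and 2 if the parser runs out of memory.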

%union

The %union declaration defines the data types that the values of tokens and non-terminals can have. The lexer uses it to pass lexeme information to the parser. In my sample parser, you will see:

%union {
        float num;
        char *id;
        };

This causes bison to create a union datatype in the file y.tab.h and declare 'yylval' to be of this datatype. The lexer can now use yylval to pass lexeme information to the parser.

%token

You will also notice in the sample parser the following lines:

%token <num> NUMBER
%token <id> ID
%token NEWLINE
%token EQUALS PRINT

%token lines define symbols that represent the values returned by the lexer and that correspond to the terminals in your rules. All symbols used as tokens must be defined in this section, although they need not all be on the same line. These tokens will be assigned values in the y.tab.h file that bison will create. The <var_name> tag declares that the following tokens are associated with the data type of var_name, where var_name is a field specified in your %union. It is a promise to the parser that if you access the value of a NUMBER token, you will treat it as a float, and if you access the value of an ID token, you will treat it as a char *.
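
To make the relationship between %union, %token, and yylval concrete, here is a minimal hand-written sketch of what the generated header and a lexer rule set amount to. It is not bison or flex output: the numeric token codes, the YYSTYPE layout written out by hand, and the function toy_yylex are assumptions for illustration only.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical stand-in for the token codes bison writes to y.tab.h.
// The numeric values are illustrative; always use the generated header.
enum TokenCode { NUMBER = 258, ID, NEWLINE, EQUALS, PRINT };

// Hand-written copy of the %union from these notes.
union YYSTYPE {
    float num;  // value of tokens declared with <num>
    char *id;   // value of tokens declared with <id>
};

YYSTYPE yylval;  // slot the lexer fills before returning a token code

// Toy replacement for the flex rules: classify one lexeme, store its value
// in the matching union field, and return the token code.
int toy_yylex(const std::string &lexeme) {
    assert(!lexeme.empty());
    if (lexeme == "\n") return NEWLINE;
    if (lexeme == "=") return EQUALS;
    if (lexeme == "print") return PRINT;
    if (std::isdigit(static_cast<unsigned char>(lexeme[0]))) {
        yylval.num = std::strtof(lexeme.c_str(), nullptr);
        return NUMBER;
    }
    // Copy the name so it outlives the lexer's buffer; the caller owns it.
    char *copy = new char[lexeme.size() + 1];
    std::strcpy(copy, lexeme.c_str());
    yylval.id = copy;
    return ID;
}
```

Only the tokens declared with a <var_name> tag touch yylval. NEWLINE, EQUALS, and PRINT carry no value, so the lexer simply returns their codes.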

%type

The %type line specifies names for non-terminals in your rules. These may or may not have a type assigned. In exp.yacc the two type declarations are:

%type <num> exp 
%type <num> stmt

which declare both exp and stmt to be floats. You cannot directly declare a non-terminal to have a C type. Instead you must do so indirectly by assigning it one of the field names from your %union statement.

%left, %right, and %nonassoc

You declare operator associativity and precedence using the following three directives:

%left -- specifies tokens which are left associative
%right -- specifies tokens which are right associative
%nonassoc -- tokens which are non-associative

From the Bison Manual:

The associativity of an operator op determines how repeated uses of the operator nest: whether `x op y op z' is parsed by grouping x with y first or by grouping y with z first. %left specifies left-associativity (grouping x with y first) and %right specifies right-associativity (grouping y with z first). %nonassoc specifies no associativity, which means that `x op y op z' is considered a syntax error.
The precedence of an operator determines how it nests with other operators. All the tokens declared in a single precedence declaration have equal precedence and nest together according to their associativity. When two tokens declared in different precedence declarations associate, the one declared later has the higher precedence and is grouped first.

In the sample expression parser, there are four associativity declarations:

%left PLUS MINUS 
%left TIMES DIVIDE 
%left LPAREN RPAREN
%nonassoc UMINUS

which say that all the arithmetic operators are left associative and that unary minus (the negation operator) is non-associative. They also say that times and divide have precedence over plus and minus, and that unary minus has precedence over times and divide.
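
The effect of associativity on a result can be checked by hand. The sketch below is not parser code. It evaluates a chain of subtractions under the two groupings described in the quoted passage, and the function names group_left and group_right are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// x - y - z grouped as (x - y) - z, the grouping that %left requests.
float group_left(const std::vector<float> &v) {
    assert(!v.empty());
    float acc = v.front();
    for (std::size_t i = 1; i < v.size(); ++i) acc = acc - v[i];
    return acc;
}

// x - y - z grouped as x - (y - z), the grouping that %right would request.
float group_right(const std::vector<float> &v) {
    assert(!v.empty());
    float acc = v.back();
    for (std::size_t i = v.size() - 1; i-- > 0;) acc = v[i] - acc;
    return acc;
}
```

For the input 8 - 3 - 2, group_left computes (8 - 3) - 2 = 3, while group_right computes 8 - (3 - 2) = 7. Declaring MINUS with %left is what makes the interpreter produce the first answer.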

%{ and %} for user-defined code

Finally the definitions section is the place to put user-defined code that you plan to use in either your parser or your scanner. As in a lex/flex specification, you place user-defined code between %{ and %} delimiters:

%{
#include <iostream>
#include <string>
#include <stdlib.h>
#include <map>

using namespace std;

// our hash table for variable names
  map<string, float> idTable;

// for keeping track of line numbers in the program we are parsing
  int line_num = 1;  

// we need the yylex and yyerror prototypes because bison uses them
// in its generated code, but does not declare them for us automatically.
int yylex();
void yyerror(char * s);
%}

Rules Section

The rules section is similar to that of flex. You specify a pattern for a production and the code, if any, that is executed when the rule is matched. The pattern for a production is similar to what you have seen in class, with a LHS non-terminal and a RHS set of symbols. The only difference is that the arrow that is between the LHS and RHS is replaced with a colon (:). Here are several sample productions from the expression parser:

stmt: ID EQUALS exp
    | PRINT ID 

exp: MINUS exp %prec UMINUS
    | exp PLUS exp
    | exp MINUS exp
    | NUMBER

%prec

The %prec statement tells Bison to override the usual precedence for the MINUS token, which is less than the precedence for multiplication or division. It says to assign this production the precedence of a unary minus token, rather than the precedence of a minus token. Notice that the UMINUS token is a completely made up token that is not generated by the lexer. That is because the lexer cannot distinguish between a minus sign used to denote unary minus and a minus sign used to denote subtraction. Also I can only specify associativity for a token once, and so I chose to express the associativity for the subtraction minus sign. I then introduced the imaginary token UMINUS, so that I could boost the precedence of the unary minus production in the rules section.

Converting Extended BNF Form to BNF Form

The style in which you have seen grammars written in class is called Extended Backus Naur Form (EBNF) after the two creators of this style. Bison expects your grammar to be written in standard BNF form, which means without the shorthand notation for repetition (* for 0 or more, and + for 1 or more) and without the shorthand notation for optional items (?). Here is how to convert each of these three notations into a standard BNF form:

*: A star represents a list of 0 or more elements. For example, I might write:
```
exp -> stmt*
```
This type of list can be written in standard BNF as:
```
exp -> stmtList
stmtList -> stmtList stmt
         |
```
Notice that the second production for stmtList derives the empty string, which is what allows us to have an empty list.
+: A plus represents a list of 1 or more elements. For example, I might write:
```
exp -> stmt+
```
This type of list can be written in standard BNF as:
```
exp -> stmtList
stmtList -> stmtList stmt
         | stmt
```
Notice that the second production for stmtList derives a single stmt, which forces us to have at least a one element list.
?: A question mark represents an optional element. For example, I might write:
```
edge -> label ([thickness = NUM])? ([color = STRING])?
```
An optional element can be written in two different forms in standard BNF:
```
Style 1:
  edge -> label
      |   label [thickness = NUM]
      |   label [color = STRING]
      |   label [thickness = NUM] [color = STRING]

Style 2:
  edge -> label optionalThickness optionalColor
  optionalThickness -> [thickness = NUM]
       |
  optionalColor -> [color = STRING]
       |
```
Style 1 writes out all possible variations of the production, while Style 2 removes each optional component to a separate set of 2 productions, one which derives the optional component and one which derives an empty string. Either one is fine. Style 1 is often more readable for productions that contain 1 or 2 optional elements, but it blows up exponentially with the number of optional elements. Hence for 3 or more optional elements, it is often more readable to use Style 2.
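
The exponential growth of Style 1 is easy to see by generating the alternatives mechanically. The following sketch does so under the assumption that each optional part is an opaque string. style1_alternatives is a hypothetical helper, not part of bison.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build every Style 1 alternative for a required head followed by optional
// parts. Each part is either present or absent, so k parts give 2^k alternatives.
std::vector<std::string> style1_alternatives(
        const std::string &head, const std::vector<std::string> &optional) {
    std::vector<std::string> result;
    const std::size_t count = std::size_t(1) << optional.size();
    for (std::size_t mask = 0; mask < count; ++mask) {
        std::string alt = head;
        for (std::size_t i = 0; i < optional.size(); ++i) {
            // Bit i of mask decides whether optional part i is included.
            if (mask & (std::size_t(1) << i)) alt += " " + optional[i];
        }
        result.push_back(alt);
    }
    return result;
}
```

Called with the head label and the two optional parts from the edge example, it yields the 4 alternatives listed under Style 1. With 5 optional parts it would yield 32, whereas Style 2 would need only 5 extra non-terminals.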

Left versus Right Recursion

An important part of parsing a grammar is recursion. Bison does not really care whether you use left or right recursion, but for efficiency you should try to use left recursion. Left recursion is like

List: List Exp

and right recursion is the opposite

List: Exp List

The reason to prefer left recursion over right recursion is that you use less stack space with left recursion. Right recursion forces the parser to shift all the recursive elements recognized by the rule onto the stack before it can perform a single reduction, which can make the stack arbitrarily deep.

Additionally, always make sure that every recursive rule has at least one non-recursive alternative to avoid the possibility of infinite recursion.
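
The difference in stack usage can be seen with a small simulation. This is a simplified model of a shift-reduce stack for a list of n Exp items, not bison's actual automaton. It assumes the non-recursive alternative List: Exp in both grammars, and the function names are made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum Symbol { EXP, LIST };

// List: Exp | List Exp   (left recursion)
// A reduction is possible immediately after each Exp is shifted.
std::size_t max_depth_left(std::size_t n) {
    std::vector<Symbol> stack;
    std::size_t deepest = 0;
    for (std::size_t i = 0; i < n; ++i) {
        stack.push_back(EXP);  // shift Exp
        deepest = std::max(deepest, stack.size());
        if (stack.size() == 1) {
            stack.back() = LIST;  // reduce Exp -> List
        } else {
            stack.pop_back();     // reduce List Exp -> List:
            stack.pop_back();     // pop both symbols,
            stack.push_back(LIST);  // then push the new List
        }
    }
    return deepest;
}

// List: Exp | Exp List   (right recursion)
// No reduction is possible until the last Exp has been shifted.
std::size_t max_depth_right(std::size_t n) {
    std::vector<Symbol> stack;
    std::size_t deepest = 0;
    for (std::size_t i = 0; i < n; ++i) {
        stack.push_back(EXP);  // shift Exp and keep waiting
        deepest = std::max(deepest, stack.size());
    }
    if (!stack.empty()) stack.back() = LIST;  // end of input: reduce Exp -> List
    while (stack.size() > 1) {                // reduce Exp List -> List
        stack.pop_back();
        stack.pop_back();
        stack.push_back(LIST);
    }
    return deepest;
}
```

In this model the left-recursive grammar never holds more than 2 symbols, no matter how long the list is, while the right-recursive grammar holds all n items at once.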

Rule Actions

Productions typically have actions associated with them. In a later set of notes you will see how you can use actions to build parse trees. In this set of notes the actions illustrate how to implement an expression interpreter. A sample action from the expression interpreter is:

stmt: ID EQUALS exp { 
   idTable[$1] = $3;       // update the value of the id in the hash table
   cout << $3 << endl;     // print the value of the expression to the console
   cout << ">>> ";
   $$ = $3;                // make the value of stmt equal to the value of expression
}

This code looks pretty normal except for the $n elements. These elements represent the values, if any, associated with the terminals and non-terminals:

$$ represents the value assigned to the LHS non-terminal.
$1, $2, ..., $n represent the values that have already been assigned to the RHS symbols. If $i is a non-terminal, then the value was assigned by the action of a previously recognized production. If $i is a terminal, then the value was assigned by the scanner. Bison reads the correct value for the terminal because the %token statement told it which field in yylval contains the value of this token.
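
One way to picture $$ and $n is as reads and writes on typed fields of the symbol values. The sketch below is a hand-written equivalent of the sample action, not the code bison generates. The Value union, the rhs vector, and reduce_assignment are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the %union in these notes.
union Value {
    float num;
    const char *id;
};

std::map<std::string, float> idTable;

// Hand-written equivalent of the action for: stmt: ID EQUALS exp
// rhs holds the values of the three right-hand-side symbols, so rhs[0] plays
// the role of $1 and rhs[2] the role of $3. The return value plays the role of $$.
Value reduce_assignment(const std::vector<Value> &rhs) {
    assert(rhs.size() == 3);
    const char *name = rhs[0].id;  // $1, read through the <id> field
    float value = rhs[2].num;      // $3, read through the <num> field
    idTable[name] = value;         // update the value of the id in the table
    Value result;
    result.num = value;            // $$ = $3, stored through the <num> field
    return result;
}
```

The field used for each symbol is exactly the one named in its %token or %type declaration, which is why an undeclared or wrongly declared type shows up as a compile error or a garbage value in the action.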

User Subroutines

This section is copied verbatim to the C file. Normally you put main here and have it call yyparse(), which is the name of the parsing function generated by bison.
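
A main program usually does little more than call the parser and check its status. Here is a minimal sketch of that pattern. It takes the parse function as a parameter so that it can be compiled without a generated parser, and the names ParseFn and run_parser are hypothetical.

```cpp
#include <iostream>

// Type of a parse function such as the bison-generated yyparse().
typedef int (*ParseFn)();

// Call the parser and translate its status into a process exit code.
// yyparse() returns 0 when parsing succeeds and nonzero when it fails.
int run_parser(ParseFn parse) {
    std::cout << ">>> ";  // initial prompt, as in the interpreter in these notes
    int status = parse();
    if (status != 0) {
        std::cerr << "parsing failed with status " << status << std::endl;
        return 1;
    }
    return 0;
}
```

In the user subroutines section of a real bison file, main would then reduce to a single statement, return run_parser(yyparse);.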

Errors

Your parser will encounter errors. There are a number of methods to handle them but I will cover the simple versions. The parsing function yyparse() calls yyerror() when it gets input that does not match any rule. By default yyparse() returns after calling yyerror() and exits the parser.

This is probably not the behavior you want. Instead you would like to resume parsing and find any additional errors in the input. To do so, you can judiciously place the reserved word error in one or more grammar rules. For example, in exp.yacc I have included the production:

stmtlist : stmtlist error NEWLINE { cout << ">>> "; yyclearin; }

error tells the parser that if it cannot match any of the other productions for stmt, then it should scan to the next NEWLINE character and recognize this error production. In essence, error is like the .* pattern in flex. It recognizes everything until a specified delimiter is reached. Usually you have the error production read everything until the parser can get back to a normal parsing state. A normal parsing state is typically the start of a new statement, which is why my production causes the parser to consume everything up to the newline character.

yyclearin is a bison macro that discards the current lookahead token. If you do not use yyclearin, then the parser will attempt to re-use that token when it resumes parsing. If that token is itself erroneous, then you will get two error messages, which can be confusing to the user.

Another potential point of confusion is that bison will recognize an error-free prefix of a statement, before generating an error. Hence if the user types:

>>> a = 20 ? 30

the user receives the output:

error token: ? on line 1
20
line 1: syntax error, unexpected NUMBER, expecting NEWLINE
>>> >>>

In this case bison recognized the statement "a = 20". The first ">>>" is printed by the action of the production that recognizes the statement "a = 20" and the redundant ">>>" is printed by the action of the error production.

The message about the syntax error is generated by the function yyerror. Even though yyerror is not called by the action, it is still called by Bison. yyerror is discussed more in the next section.

Usually Bison recovers properly after recognizing an error production, but sometimes your action must execute the macro yyerrok to indicate that error recovery is complete. For example:

stmtlist: stmtlist error NEWLINE { cout << ">>> "; yyclearin; yyerrok; }

One final note from the bison manual:

To prevent a lot of error messages, the parser will output no error message for another syntax error that happens shortly after the first; only after three consecutive input tokens have been successfully shifted will error messages resume. Rules which accept the error token may have actions, just as any other rules can.
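
The three-token rule in this passage, together with the effect of yyerrok, can be modeled with a single counter. The sketch below is a simplified model for illustration rather than bison's generated code. The event encoding and the name reported_errors are assumptions.

```cpp
#include <string>

// Count how many syntax-error messages would be printed for a sequence of
// events: 'E' = syntax error detected, 'S' = token shifted successfully,
// 'K' = an action executed yyerrok.
int reported_errors(const std::string &events) {
    int quiet = 0;     // successful shifts still required before messages resume
    int reported = 0;
    for (char ev : events) {
        if (ev == 'E') {
            if (quiet == 0) ++reported;  // report only outside the quiet period
            quiet = 3;                   // (re)start the three-token quiet period
        } else if (ev == 'S') {
            if (quiet > 0) --quiet;
        } else if (ev == 'K') {
            quiet = 0;                   // yyerrok ends the quiet period at once
        }
    }
    return reported;
}
```

In this model a second error that follows the first after only one shifted token goes unreported, whereas the same error after three shifted tokens, or after yyerrok, is reported.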

yyerror()

The parser expects to report errors by calling an error reporting function named yyerror(), which you must define in the user code section. It is called by yyparse() whenever a syntax error is found, and it receives one argument, which is a pointer to a bison-generated string describing the error. For a syntax error, the string is normally "syntax error". The following definition suffices in simple programs:

void yyerror (char *s)
{
  cerr << s << endl;
}

Unfortunately, the terse message "syntax error" is normally not helpful to the user. If you place the directive:

%error-verbose

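in the definitions section, then bison will generate more meaningful error messages. (Bison 3.0 and later spell this directive %define parse.error verbose and treat %error-verbose as deprecated.) For example: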

>>> a ? 10
syntax error, unexpected NUMBER, expecting EQUALS

While this suffices for our interpreter, usually the user will also want to know the line number on which the error occurred, and will also want to know that ? is an unrecognized token. You can provide line numbers by taking the following steps:

In the definitions section of bison, define a line number variable. For example:
```
%{
  int line_num = 1;
%}
```
In the definitions section of flex, define this line number variable as an external variable:
```
%{ 
  extern int line_num;
%}
```
Place a rule in flex that recognizes the newline character (\n) and increments line_num by 1.
Place rules in flex for recognizing error tokens, and print them out, prefixed with the line number on which they occurred.
Use the line_num variable in yyerror to report the line number on which an error occurred.
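
The last step above might look like the following sketch. It assumes the line_num variable from the first step and the yyerror prototype declared earlier in these notes. format_error is a hypothetical helper, split out only so that the message format is easy to check on its own.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Defined in the bison definitions section and incremented by the lexer
// each time it recognizes a newline.
int line_num = 1;

// Hypothetical helper: prefix a parser message with a line number.
std::string format_error(int line, const char *msg) {
    std::ostringstream out;
    out << "line " << line << ": " << msg;
    return out.str();
}

// Error callback invoked by yyparse() when it detects a syntax error.
void yyerror(char *s) {
    std::cerr << format_error(line_num, s) << std::endl;
}
```

With verbose error messages enabled, this produces messages of the form shown in the earlier example, such as "line 1: syntax error, unexpected NUMBER, expecting NEWLINE".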

This amount of error reporting will suffice for this course. In a professional parser you might also want to provide the character position on the line where the error occurred, but this requires much more complicated code and is beyond the scope of this course.

Compiling and Debugging

Bison supports many options but the most important are

-d -- this causes bison to create the header file (y.tab.h when combined with -y), which contains the union and token definitions.
-y -- this causes the output files to be named in the yacc manner, that is y.tab.c and y.tab.h.
-t -- this causes the debugging information to be output in the y.tab.c file.

The debugging information used by bison consists of printing to stderr the states and transitions used to parse a given input. In order to make use of this, you must generate the y.tab.c file with the -t option AND put in the statement 'yydebug=1;' before yyparse() is called (see the commented out yydebug statement in exp.yacc). It doesn't matter what value is used so long as it is nonzero. This output will be quite large even for a small input. See the attached listing. Hopefully this debugging information will help you fix your rules so the grammar is correctly parsed.

Files

The files associated with this lecture are

flex input file exp.lex
bison input file exp.yacc
Makefile
y.tab.h bison header file, generated by bison with -d option
test input file
debugging output listing

References

Quoted passages are taken all or in part from "Bison: The YACC-compatible Parser Generator", November 1995, Bison Version 1.25, by Charles Donnelly and Richard Stallman.