    %option interactive

to the top of your flex file. If you fail to do so, flex will wait to read your last token until it sees more input. Because this last token is the newline character, my parser will fail to reduce the last expression until the next expression is entered.
A bison input file consists of three sections: definitions, rules, and user subroutines. The sections are separated by lines containing two percent signs (%%). The first two sections are required, although either one may be empty. The third section, the user subroutines section, and its preceding %% are optional.
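The %start directive tells bison which non-terminal is the start symbol of your grammar. Here the start non-terminal is program:

    %start program

Note that yyparse() does not return the value of your start non-terminal. It returns 0 if parsing succeeds, 1 if parsing fails because of invalid input, and 2 if the parser runs out of memory. If you need the value computed for the start non-terminal, have its action store that value in a variable.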
    %union {
        float num;
        char *id;
    };

This causes bison to create a union datatype in the file y.tab.h and to declare yylval to be of this datatype. The lexer can now use yylval to pass lexeme information to the parser.
    %token <num> NUMBER
    %token <id> ID
    %token NEWLINE
    %token EQUALS PRINT

%token lines define symbols that represent the values returned by the lexer and that correspond to the terminals in your rules. All symbols used as tokens must be defined in this section, although they need not all be on the same line. These tokens will be assigned values in the y.tab.h file that bison creates. The <var_name> declares that the tokens that follow are associated with the data type of var_name, where var_name is a member of your %union. It is a promise to the parser that if you access the value of a NUMBER token you will treat it as a float, and if you access the value of an ID token you will treat it as a char *.
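To make that promise concrete, here is a minimal C++ sketch of what the lexer side does with yylval. It does not use the generated y.tab.h: the YYSTYPE union, the token codes 258 and 259, and the helper set_token_value are stand-ins written for illustration only. In a real build bison generates the union and the token values, and the flex actions do the assignment.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <cstring>

// Stand-in for the union bison would generate from the %union declaration.
union YYSTYPE {
    float num;
    char *id;
};

// Stand-in token codes; bison assigns the real ones in y.tab.h.
enum TokenKind { NUMBER = 258, ID = 259 };

// In a real parser this variable is defined by the generated code.
YYSTYPE yylval;

// Hypothetical helper: store the lexeme in the union member that matches
// the token's declared type, then return the token code to the parser.
int set_token_value(const char *lexeme) {
    assert(lexeme != nullptr && lexeme[0] != '\0');
    if (std::isdigit(static_cast<unsigned char>(lexeme[0]))) {
        yylval.num = std::strtof(lexeme, nullptr);  // <num> tokens use .num
        return NUMBER;
    }
    // <id> tokens use .id; copy the text because the lexer reuses its buffer.
    char *copy = static_cast<char *>(std::malloc(std::strlen(lexeme) + 1));
    assert(copy != nullptr);
    std::strcpy(copy, lexeme);
    yylval.id = copy;  // the parser is now responsible for freeing this copy
    return ID;
}
```

Calling set_token_value("20") stores 20 in yylval.num and returns NUMBER, which is the pairing that a rule reads back through $n when it refers to a NUMBER token.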
    %type <num> exp
    %type <num> stmt

which declares both exp and stmt to be floats. You cannot directly declare a non-terminal to have a C type. Instead you must do so indirectly, by assigning it one of the member names from your %union statement.
The associativity of an operator op determines how repeated uses of the operator nest: whether 'x op y op z' is parsed by grouping x with y first or by grouping y with z first. %left specifies left-associativity (grouping x with y first) and %right specifies right-associativity (grouping y with z first). %nonassoc specifies no associativity, which means that 'x op y op z' is considered a syntax error.

The precedence of an operator determines how it nests with other operators. All the tokens declared in a single precedence declaration have equal precedence and nest together according to their associativity. When two tokens declared in different precedence declarations associate, the one declared later has the higher precedence and is grouped first.
In the sample expression parser, there are four associativity declarations:
    %left PLUS MINUS
    %left TIMES DIVIDE
    %left LPAREN RPAREN
    %nonassoc UMINUS

which say that all the arithmetic operators are left-associative and that unary minus (the negation operator) is non-associative. They also say that times and divide have precedence over plus and minus, and that unary minus has precedence over times and divide.
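The effect of these declarations on grouping can be tried out with a small stand-alone evaluator. The sketch below is not how bison works internally (bison uses the declarations to resolve conflicts in its parse table); it swaps in a hand-written technique, precedence climbing, purely to show the resulting grouping. The names precedence, Evaluator, and evaluate are made up for this illustration, and parentheses are left out to keep it short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Precedence level of a binary operator. A higher number corresponds to a
// %left line that appears later in the declarations.
int precedence(char op) {
    switch (op) {
        case '+': case '-': return 1;  // %left PLUS MINUS
        case '*': case '/': return 2;  // %left TIMES DIVIDE
        default:            return 0;  // not a binary operator
    }
}

class Evaluator {
public:
    explicit Evaluator(std::string text) : text_(std::move(text)), pos_(0) {}

    // Parse and evaluate operators whose precedence is at least min_prec.
    double expr(int min_prec) {
        double lhs = operand();
        for (;;) {
            char op = peek();
            int prec = precedence(op);
            if (prec == 0 || prec < min_prec) return lhs;
            ++pos_;                       // consume the operator
            double rhs = expr(prec + 1);  // prec + 1 gives left associativity
            if (op == '+') lhs += rhs;
            else if (op == '-') lhs -= rhs;
            else if (op == '*') lhs *= rhs;
            else lhs /= rhs;
        }
    }

private:
    // Skip blanks and return the next character, or '\0' at end of input.
    char peek() {
        while (pos_ < text_.size() && text_[pos_] == ' ') ++pos_;
        return pos_ < text_.size() ? text_[pos_] : '\0';
    }

    // An operand is an unsigned integer, optionally preceded by unary minus.
    double operand() {
        if (peek() == '-') {              // unary minus binds tightest
            ++pos_;
            return -operand();
        }
        assert(std::isdigit(static_cast<unsigned char>(peek())));
        double value = 0;
        while (pos_ < text_.size() &&
               std::isdigit(static_cast<unsigned char>(text_[pos_])))
            value = value * 10 + (text_[pos_++] - '0');
        return value;
    }

    std::string text_;
    std::size_t pos_;
};

double evaluate(const std::string &text) {
    return Evaluator(text).expr(1);
}
```

With left associativity, evaluate("8 - 3 - 2") groups as (8 - 3) - 2 and yields 3; passing prec instead of prec + 1 to the recursive call would group it as 8 - (3 - 2), which is what %right would request. Because times has the higher level, evaluate("2 + 3 * 4") yields 14.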
Finally the definitions section is the place to put user-defined code that you plan to use in either your parser or your scanner. As in a lex/flex specification, you place user-defined code between %{ and %} delimiters:
    %{
    #include <iostream>
    #include <string>
    #include <stdlib.h>
    #include <map>
    using namespace std;

    // our hash table for variable names
    map<string, float> idTable;

    // for keeping track of line numbers in the program we are parsing
    int line_num = 1;

    // we need the yylex and yyerror prototypes because bison uses them
    // in its generated code, but does not declare them for us automatically.
    int yylex();
    void yyerror(char * s);
    %}
    stmt: ID EQUALS exp
        | PRINT ID
        ;

    exp: MINUS exp %prec UMINUS
       | exp PLUS exp
       | exp MINUS exp
       | NUMBER
       ;
The style in which you have seen grammars written in class is called Extended Backus Naur Form (EBNF), after the two creators of this style. Bison expects your grammar to be written in standard BNF, which means without the shorthand notation for repetition (* for 0 or more, and + for 1 or more) and without the shorthand notation for optional items (?). Here is how to convert each of these three notations into standard BNF:
    exp -> stmt*

This type of list can be written in standard BNF as:

    exp -> stmtList
    stmtList -> stmtList stmt
              |

Notice that the second production for stmtList derives the empty string, which is what allows us to have an empty list.

    exp -> stmt+

This type of list can be written in standard BNF as:

    exp -> stmtList
    stmtList -> stmtList stmt
              | stmt

Notice that the second production for stmtList derives a single stmt, which forces the list to have at least one element.
    edge -> label ([thickness = NUM])? ([color = STRING])?

An optional element can be written in two different forms in standard BNF:

    Style 1:

    edge -> label
          | label [thickness = NUM]
          | label [color = STRING]
          | label [thickness = NUM] [color = STRING]

    Style 2:

    edge -> label optionalThickness optionalColor
    optionalThickness -> [thickness = NUM]
                       |
    optionalColor -> [color = STRING]
                   |

Style 1 writes out all possible variations of the production, while Style 2 moves each optional component into a separate pair of productions, one that derives the optional component and one that derives the empty string. Either one is fine. Style 1 is often more readable for productions that contain 1 or 2 optional elements, but the number of alternatives grows exponentially with the number of optional elements. Hence for 3 or more optional elements, Style 2 is usually more readable.
    List: List Exp

and right recursion is the opposite:

    List: Exp List
The reason to prefer left recursion over right recursion is that left recursion uses less stack space. With left recursion the parser can reduce after each element, so the stack stays shallow. Right recursion forces the parser to shift every element of the list onto the stack before it can perform the first reduction, so the stack can grow arbitrarily deep.
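The difference in stack use can be seen with a toy simulation. The sketch below is not bison's parser; it only pushes and pops grammar symbols the way a shift-reduce parser would for the two List rules, assuming each rule also has the non-recursive alternative List: Exp. The function names max_depth_left and max_depth_right are invented for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Deepest stack reached for n items with  List: Exp | List Exp
std::size_t max_depth_left(std::size_t n) {
    assert(n >= 1);
    std::vector<std::string> stack;
    std::size_t deepest = 0;
    for (std::size_t i = 0; i < n; ++i) {
        stack.push_back("Exp");                 // shift the next item
        deepest = std::max(deepest, stack.size());
        // Reduce right away: Exp -> List for the first item,
        // List Exp -> List for every later one.
        stack.pop_back();
        if (!stack.empty()) stack.pop_back();
        stack.push_back("List");
    }
    return deepest;
}

// Deepest stack reached for n items with  List: Exp | Exp List
std::size_t max_depth_right(std::size_t n) {
    assert(n >= 1);
    std::vector<std::string> stack;
    std::size_t deepest = 0;
    for (std::size_t i = 0; i < n; ++i) {
        stack.push_back("Exp");                 // nothing can be reduced yet
        deepest = std::max(deepest, stack.size());
    }
    stack.pop_back();                           // last item: Exp -> List
    stack.push_back("List");
    while (stack.size() > 1) {                  // unwind: Exp List -> List
        stack.pop_back();
        stack.pop_back();
        stack.push_back("List");
    }
    return deepest;
}
```

For a list of 1000 items the left-recursive version never holds more than 2 symbols, while the right-recursive version holds all 1000 before the first reduction of the recursive rule.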
Additionally, always make sure that every recursive rule has at least one non-recursive alternative, to avoid the possibility of infinite recursion.
    stmt: ID EQUALS exp {
              idTable[$1] = $3;      // update the value of the id in the hash table
              cout << $3 << endl;    // print the value of the expression to the console
              cout << ">>> ";
              $$ = $3;               // make the value of stmt equal to the value of expression
          }

This code looks pretty normal except for the $n elements. These elements represent the values, if any, associated with the terminals and non-terminals:
This is probably not the behavior you want. Instead you would like to resume parsing and find any additional errors in the input. To do so, you can judiciously place the reserved word error in one or more grammar rules. For example, in exp.yacc I have included the production:
    stmtlist : stmtlist error NEWLINE { cout << ">>> "; yyclearin; }

error tells the parser that if it cannot match any of the other productions for stmt, then it should scan to the next NEWLINE character and recognize this error production. In essence, error is like the .* pattern in flex: it recognizes everything until a specified delimiter is reached. Usually you have the error production read everything until the parser can get back to a normal parsing state. A normal parsing state is typically the start of a new statement, which is why my production causes the parser to consume everything up to the newline character.

The yyclearin statement is a bison macro that tells the parser to discard the current lookahead token, that is, the last token it read. If you do not use yyclearin, then the parser will attempt to re-use that token when it resumes parsing. If the token was an error token, then you will get two error messages, which can be confusing to the user.
Another potential point of confusion is that bison will recognize an error-free prefix of a statement before generating an error. Hence if the user types:

    >>> a = 20 ? 30

the user receives the output:

    error token: ? on line 1
    20
    line 1: syntax error, unexpected NUMBER, expecting NEWLINE
    >>> >>>

In this case bison recognized the statement "a = 20". The first ">>>" is printed by the production that recognizes the statement "a = 20" and the redundant ">>>" is printed by the error production.
The message about the syntax error is generated by the function yyerror. Even though yyerror is not called by the action, it is still called by Bison. yyerror is discussed more in the next section.
Usually Bison recovers properly after recognizing an error production, but sometimes your action must execute the macro yyerrok to indicate that error recovery is complete. For example:
    stmtlist: stmtlist error NEWLINE { cout << ">>> "; yyclearin; yyerrok; }

One final note from the bison manual:
To prevent a lot of error messages, the parser will output no error message for another syntax error that happens shortly after the first; only after three consecutive input tokens have been successfully shifted will error messages resume. Rules which accept the error token may have actions, just as any other rules can.
    void yyerror (char *s) { cerr << s << endl; }

Unfortunately, the terse message "syntax error" is normally not helpful to the user. If you place the directive:
    %error-verbose

in the definitions section, then bison will generate more meaningful error messages (bison 3.0 and later prefer the equivalent spelling %define parse.error verbose). For example:

    >>> a ? 10
    syntax error, unexpected NUMBER, expecting EQUALS

While this suffices for our interpreter, usually the user will also want to know the line number on which the error occurred, and will also want to know that ? is an unrecognized token. You can provide line numbers by taking the following steps:
%{ int line_num = 1; %}
%{ extern int line_num; %}
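With line_num shared between the two files as declared above, a yyerror that reports line numbers could look like the following minimal sketch. It assumes the lexer increments line_num each time it matches a newline. The helper format_error is a hypothetical name introduced here so the message can be checked without capturing stderr, and the parameter is declared const char * rather than char * so that string literals can be passed without a compiler warning.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Shared line counter; assumed to be incremented by the lexer on each newline.
int line_num = 1;

// Hypothetical helper: prefix the message bison supplies with the line number.
std::string format_error(const char *msg) {
    assert(msg != nullptr);
    std::ostringstream out;
    out << "line " << line_num << ": " << msg;
    return out.str();
}

// Called by the generated parser whenever it detects a syntax error.
void yyerror(const char *s) {
    std::cerr << format_error(s) << std::endl;
}
```

The resulting message has the same "line 1: syntax error, ..." shape as the sample output shown in the error recovery discussion.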
The debugging information produced by bison consists of a trace, printed to stderr, of the states and transitions used to parse a given input. To make use of it, you must generate the y.tab.c file with the -t option and put the statement 'yydebug=1;' before the call to yyparse() (see the commented-out yydebug statement in exp.yacc). It doesn't matter what value is used so long as it is greater than 0. This output will be quite large even for a small input. See the attached listing. Hopefully this debugging information will help you fix your rules so that the grammar is parsed correctly.