JCup Notes

These notes show a sample JLex and JCup specification that implements the following grammar:

cond -> if booleanExp then stmt (else stmt)?

booleanExp -> id == id
           |  id != id

stmt -> { stmt⁺ }
     |  id = num;
     |  cond

A sample program in this grammar might be:

if a != b then
    if a == c then {
        if b != c then {
	    a = 20;
	    b = 30;
	}
	x = 30;
    }
    else 
        if a != b  then
	    x = 20;
	else {
	    x = 40;
	    y = 60;
	}

The indentation shows the way in which we would like the else clauses to be associated with if clauses.

Steps in Creating a JCup Specification

There are four major tasks we must perform in order to create a JCup specification for this grammar:

Determine the terminals, which will become the tokens recognized by JLex.
Determine the non-terminals.
Determine whether we need any precedence or associativity rules to disambiguate the grammar.
Write the productions and any associated actions.

Determine the Terminals

In general, terminals fall into one of five categories:

keywords: In the above specification, if, then, and else fall into this category.
punctuation and delimiters: The semi-colon (;) and the curly braces {} fall into this category.
operators: The ==, !=, and assignment (=) operators fall into this category.
constants: Numbers fall into this category.
user-defined names: Variable names (ids) fall into this category. In a more complete language, such as Java, class names and function names would also fall into this category.

You need to create a name for each token and declare it using JCup's terminal keyword:

 
terminal IF, THEN, ELSE;
terminal Integer NUM;
terminal String ID;
terminal EQUALS, NOT_EQUALS;
terminal ASSIGNMENT;
terminal SEMI;
terminal LBRACE, RBRACE;

If a terminal has a lexeme value, such as NUM and ID, then you need to declare the type of the lexeme value, in this case Integer and String. The type of the lexeme value must match the type of object that you pass as the second argument to the Symbol constructor in your JLex specification and this type must be a class, not a primitive type. Additionally, the terminal name must match the token name in the JLex specification. In the JLex specification you will need to prefix the token name with the phrase "sym.", because JCup will define the token/terminal name as a numeric constant in the sym class. Hence to pass a NUM token to the JCup parser, your JLex specification must have a statement of the form:

return(new Symbol(sym.NUM, new Integer(yytext())));

associated with the regular expression that defines a NUM.
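To make the link between token names and lexeme values concrete, here is a minimal hand-written tokenizer in plain Java. It is only a stand-in for the scanner that JLex would generate: the Token class and the tokenize and classify methods are hypothetical names invented for this sketch, with Token playing the role of java_cup.runtime.Symbol and the token names held as strings rather than as the numeric constants in the sym class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for java_cup.runtime.Symbol: a token name plus an
// optional lexeme value. The value is an Object, which is why NUM carries
// an Integer rather than a primitive int.
class Token {
    final String name;
    final Object value;
    Token(String name, Object value) { this.name = name; this.value = value; }
    @Override public String toString() {
        return value == null ? name : name + "(" + value + ")";
    }
}

class TokenSketch {
    // Alternatives are tried left to right, so "==" must come before "=".
    private static final Pattern LEXEME = Pattern.compile(
        "\\s*(==|!=|=|;|\\{|\\}|[0-9]+|[a-zA-Z][a-zA-Z0-9_]*)");

    // Maps one lexeme to a token; keywords are checked before identifiers.
    static Token classify(String lexeme) {
        switch (lexeme) {
            case "if":   return new Token("IF", null);
            case "then": return new Token("THEN", null);
            case "else": return new Token("ELSE", null);
            case "==":   return new Token("EQUALS", null);
            case "!=":   return new Token("NOT_EQUALS", null);
            case "=":    return new Token("ASSIGNMENT", null);
            case ";":    return new Token("SEMI", null);
            case "{":    return new Token("LBRACE", null);
            case "}":    return new Token("RBRACE", null);
            default:
                if (Character.isDigit(lexeme.charAt(0))) {
                    return new Token("NUM", Integer.valueOf(lexeme)); // Integer value
                }
                return new Token("ID", lexeme);                       // String value
        }
    }

    static List<Token> tokenize(String input) {
        String text = input.trim();
        List<Token> tokens = new ArrayList<>();
        Matcher m = LEXEME.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Illegal character at offset " + pos);
            }
            tokens.add(classify(m.group(1)));
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [IF, ID(a), NOT_EQUALS, ID(b), THEN, ID(x), ASSIGNMENT, NUM(20), SEMI]
        System.out.println(tokenize("if a != b then x = 20;"));
    }
}
```

Only NUM and ID carry a value, which mirrors the two typed terminal declarations above; every other token is fully described by its name.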

Determine the Non-terminals

The non-terminals are all the symbols that appear on the left side of a production. In this case they are cond, booleanExp, and stmt. We must declare them as follows using JCup's non terminal keywords (note the space between non and terminal):

non terminal cond, booleanExp, stmt;

You can also declare non terminals to have types. For example, I might have classes denoting each of these non-terminals so that I could save information about each of them in the parse tree that I construct. Then my declaration might look as follows:

non terminal Cond cond;
non terminal BooleanExp booleanExp;
non terminal Stmt stmt;

I'll cover this topic more thoroughly in another set of notes. For the homework you do not need to associate types with the non-terminals. For the second project assignment it is probably good enough to make the non-terminals be of type String. That lets you pass the names of ids up to the formula level, where you can check that the id declared on the left side of a formula agrees with the id used on the right side.
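As a rough idea of what typed non-terminals could hold, here is a sketch of parse-tree classes in plain Java. Only the class names Cond, BooleanExp, and Stmt come from the declarations above. The fields, the Assign and Block subclasses, the decision to make Cond a subclass of Stmt, and the ParseTreeSketch driver are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parse-tree classes. Only the names Cond, BooleanExp, and Stmt
// appear in the notes; everything else here is an assumption.
class BooleanExp {
    final String left, op, right;            // e.g. "a", "!=", "b"
    BooleanExp(String left, String op, String right) {
        this.left = left; this.op = op; this.right = right;
    }
}

abstract class Stmt { }

class Assign extends Stmt {                  // stmt -> id = num;
    final String id;
    final Integer num;
    Assign(String id, Integer num) { this.id = id; this.num = num; }
}

class Block extends Stmt {                   // stmt -> { stmt+ }
    final List<Stmt> body;
    Block(List<Stmt> body) { this.body = body; }
}

class Cond extends Stmt {                    // stmt -> cond
    final BooleanExp test;
    final Stmt thenPart;
    final Stmt elsePart;                     // null when there is no else clause
    Cond(BooleanExp test, Stmt thenPart, Stmt elsePart) {
        this.test = test; this.thenPart = thenPart; this.elsePart = elsePart;
    }
}

class ParseTreeSketch {
    // Builds by hand the tree for: if a != b then x = 20; else { y = 60; }
    static Cond sample() {
        return new Cond(new BooleanExp("a", "!=", "b"),
                        new Assign("x", 20),
                        new Block(Arrays.<Stmt>asList(new Assign("y", 60))));
    }

    public static void main(String[] args) {
        Cond c = sample();
        System.out.println(c.test.left + " " + c.test.op + " " + c.test.right);
    }
}
```

In a real specification the parser's actions would construct these objects; here the tree is built by hand only to show its shape.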

Precedence and Associativity

As noted in class, it is often easier and more straightforward to specify an ambiguous grammar than to specify a non-ambiguous grammar. Parser generators provide two mechanisms that allow you to write an ambiguous grammar and then disambiguate it. The first mechanism is called operator precedence and the second mechanism is called operator associativity. Operator precedence refers to the tightness with which different operators bind to operands. For example, in normal arithmetic, the * and / operators bind more tightly to operands than the + or - operators. Hence, we typically parse the expression a + b * c as a + (b * c) rather than (a + b) * c. In this case the * operator binds more tightly to b than the + operator.

Operator associativity refers to how to group operands involving the same operator. Left associativity indicates that the operands are grouped left-to-right, while right associativity indicates that the operands are grouped right-to-left. Arithmetic operators are typically left associative, while assignment operators are typically right associative. For example, we will parse the expression a + b + c as (a + b) + c and the expression a = b = c as a = (b = c).
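Java itself follows these conventions, so a few lines of Java can confirm the groupings described above. The class name PrecedenceDemo and the sample values are arbitrary choices for this sketch. Subtraction is used instead of addition for the associativity check because the two possible groupings of a + b + c produce the same value.

```java
class PrecedenceDemo {
    // Returns { a + b * c, a - b - c, x, y } for a = 2, b = 3, c = 4.
    static int[] run() {
        int a = 2, b = 3, c = 4;

        // Precedence: parsed as a + (b * c) = 14, not (a + b) * c = 20.
        int prec = a + b * c;

        // Left associativity: parsed as (a - b) - c = -5, not a - (b - c) = 3.
        int left = a - b - c;

        // Right associativity: parsed as x = (y = c), so both end up as 4.
        int x, y;
        x = y = c;

        return new int[] { prec, left, x, y };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2] + " " + r[3]); // prints 14 -5 4 4
    }
}
```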

Our conditional grammar is ambiguous because of what is called the "dangling" else problem. The example program at the top of these notes illustrates this problem perfectly. The indentation I used makes it clear how I would like the else clauses to be associated with the if clauses. However, the parser typically does not take advantage of indentation. Here is a reprise of that code. The if statements are labeled with numbers and the else statements are labeled with letters:

1) if a != b then
2)    if a == c then {
3)        if b != c then {
	    a = 20;
	    b = 30;
	  }
	  x = 30;
      }
a)    else 
4)        if a != b  then
	    x = 20;
b)        else {
	    x = 40;
	    y = 60;
	  }

It's clear from the way that I have indented the code that I want if statement 2 matched with else statement a and that I want if statement 4 matched with else statement b. However, given the way that the grammar is written, it is equally valid for the parser to assume that either else statement a or else statement b is paired with if statement 1 because it can find a set of derivations that will allow it to create these pairings.
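Java's own if statement has the same dangling else ambiguity, and the Java language resolves it by pairing each else with the nearest unmatched if. The following sketch shows that indentation has no influence on the pairing; the class and method names are made up for the example.

```java
class DanglingElseDemo {
    static String classify(boolean outer, boolean inner) {
        String result = "no branch taken";
        // Misleading indentation: the else lines up with the outer if,
        // but Java pairs it with the nearest unmatched if (the inner one).
        if (outer)
            if (inner)
                result = "inner then";
        else
            result = "else";
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, false));   // prints else
        System.out.println(classify(false, false));  // prints no branch taken
    }
}
```

If the else belonged to the outer if, the second call would print else. It does not, which shows that the else was attached to the inner if.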

I can use JCup's precedence declarations to resolve this problem by assigning precedence levels to else and if:

precedence nonassoc ELSE;
precedence nonassoc IF;

The nonassoc keyword says that IF and ELSE are non-associative, which makes sense if you think of the fact that else's get paired with if's, not with other else's, and the same for if's. This precedence order will cause an else keyword to want to bind with the nearest possible "operands", which are the stmt associated with the previous if keyword and the stmt associated with the else keyword. In other words, given the string if booleanExp then stmt else stmt, the else wants to "bind" to the two statements on either side of it. This desire causes the parser to shift an else onto the stack when it sees one, rather than try to reduce an existing if statement to the production cond -> if booleanExp then stmt. Note that if you remove the two precedence statements from my JCup specification and feed it to JCup, you will get a shift/reduce error because JCup no longer knows your intentions.
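The effect of preferring the shift can be imitated in a small hand-written recognizer. Note that this is a recursive-descent sketch, not the LR parser that JCup generates, and the class and method names are hypothetical. To keep it short it assumes that tokens are separated by whitespace (so write "x = 1 ;" rather than "x = 1;") and it does not check that ids and nums are well formed. The key line is the test for else inside cond(): taking the else whenever one is available attaches it to the nearest if.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical recursive-descent recognizer for the conditional grammar.
// It records each production as it finishes recognizing it.
class GreedyElseRecognizer {
    private final List<String> tokens;
    private int pos = 0;
    final List<String> trace = new ArrayList<>();

    GreedyElseRecognizer(String input) {
        tokens = Arrays.asList(input.trim().split("\\s+"));
    }

    static List<String> recognize(String input) {
        GreedyElseRecognizer r = new GreedyElseRecognizer(input);
        r.cond();
        if (r.pos != r.tokens.size()) {
            throw new IllegalStateException("unexpected token " + r.peek());
        }
        return r.trace;
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<eof>"; }

    private void expect(String t) {
        if (!peek().equals(t)) {
            throw new IllegalStateException("expected " + t + " but saw " + peek());
        }
        pos++;
    }

    void cond() {
        expect("if");
        booleanExp();
        expect("then");
        stmt();
        if (peek().equals("else")) {      // the "shift" choice: take the else now
            pos++;
            stmt();
            trace.add("cond -> if booleanExp then stmt else stmt");
        } else {                          // the "reduce" choice: no else available
            trace.add("cond -> if booleanExp then stmt");
        }
    }

    void booleanExp() {
        pos++;                            // id (not validated in this sketch)
        String op = peek();
        if (!op.equals("==") && !op.equals("!=")) {
            throw new IllegalStateException("expected == or != but saw " + op);
        }
        pos++;
        pos++;                            // id (not validated in this sketch)
        trace.add("booleanExp -> id " + op + " id");
    }

    void stmt() {
        if (peek().equals("{")) {
            pos++;
            do { stmt(); } while (!peek().equals("}"));
            expect("}");
            trace.add("stmt -> { stmt+ }");
        } else if (peek().equals("if")) {
            cond();
            trace.add("stmt -> cond");
        } else {
            pos++;                        // id
            expect("=");
            pos++;                        // num
            expect(";");
            trace.add("stmt -> id = num;");
        }
    }

    public static void main(String[] args) {
        String program = "if a != b then if a == c then x = 1 ; else x = 2 ;";
        for (String production : recognize(program)) {
            System.out.println(production);
        }
    }
}
```

For the sample input in main, the inner if is recognized with the else clause and the outer if without one, which is the pairing the precedence declarations are meant to produce.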

If you want to declare an operator as either left or right associative, use the left or right keywords after the precedence keyword. JCup reads precedence declarations from lowest precedence to highest, so the last declaration binds most tightly. For example, the following precedence statements make the *, /, +, and - operators left associative and the assignment operator right associative. They also give assignment the lowest precedence, followed by the + and - operators, which have equal precedence, and the * and / operators, which have the highest, but also equal, precedence:

precedence right ASSIGNMENT;
precedence left PLUS, MINUS;
precedence left MULTIPLY, DIVIDE;

Writing Productions

The last thing you must do is write the productions for your grammar. Typically you can pretty much copy the grammar you have been given, with a couple of exceptions. First, you use ::= rather than an arrow (->) to separate the left and right sides of a production. Second, you cannot use the ?, +, or * idioms that you can use with regular expressions. To express an optional part of a production, you have to write it twice. Thus the production

cond -> if booleanExp then stmt (else stmt)?

becomes

cond -> if booleanExp then stmt 
     | if booleanExp then stmt else stmt

Similarly, the production:

stmt -> { stmt⁺ }

becomes

stmt -> { stmtList }
stmtList -> stmt
         |  stmtList stmt

Notice that I need to introduce an additional non-terminal called stmtList. Also notice that I made the second production for stmtList be left recursive rather than right recursive. Parser generators for LR grammars can produce parsers with smaller stack sizes if you use left recursion rather than right recursion.
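The stack-size claim can be checked with a simplified simulation of the shift and reduce steps for a list of n statements. This is a sketch under assumptions, not JCup's actual parsing algorithm: it tracks only the stmt and stmtList symbols on the stack, treats each stmt as already reduced when it is pushed, and ignores the surrounding braces. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class RecursionStackDemo {
    // stmtList -> stmt | stmtList stmt
    // A reduction is possible as soon as each stmt arrives.
    static int maxDepthLeftRecursive(int n) {
        if (n < 1) throw new IllegalArgumentException("need at least one stmt");
        Deque<String> stack = new ArrayDeque<>();
        int max = 0;
        for (int i = 0; i < n; i++) {
            stack.push("stmt");                       // shift the next stmt
            max = Math.max(max, stack.size());
            if (i == 0) {
                stack.pop();                          // reduce stmtList -> stmt
            } else {
                stack.pop();                          // reduce stmtList -> stmtList stmt
                stack.pop();
            }
            stack.push("stmtList");
        }
        return max;
    }

    // stmtList -> stmt | stmt stmtList
    // Nothing can be reduced until the last stmt has been seen.
    static int maxDepthRightRecursive(int n) {
        if (n < 1) throw new IllegalArgumentException("need at least one stmt");
        Deque<String> stack = new ArrayDeque<>();
        int max = 0;
        for (int i = 0; i < n; i++) {
            stack.push("stmt");                       // shift every stmt first
            max = Math.max(max, stack.size());
        }
        stack.pop();                                  // reduce stmtList -> stmt
        stack.push("stmtList");
        while (stack.size() > 1) {
            stack.pop();                              // reduce stmtList -> stmt stmtList
            stack.pop();
            stack.push("stmtList");
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(maxDepthLeftRecursive(100));   // prints 2
        System.out.println(maxDepthRightRecursive(100));  // prints 100
    }
}
```

With left recursion the stack never holds more than two list symbols, while with right recursion it grows with the number of statements.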

Finally, you can perform actions after a production has been recognized by placing Java code between a pair of {: :} delimiters. In the following JCup specification my actions print the production that has been recognized:

cond ::= IF booleanExp THEN stmt
           {: System.out.printf("cond -> if booleanExp then stmt\n"); :}
       | IF booleanExp THEN stmt ELSE stmt
           {: System.out.printf("cond -> if booleanExp then stmt else stmt\n"); :}
       ;

booleanExp ::= ID EQUALS ID
                 {: System.out.printf("booleanExp -> id == id\n"); :}
             | ID NOT_EQUALS ID
                 {: System.out.printf("booleanExp -> id != id\n"); :}
             ;

stmt ::= LBRACE stmtList RBRACE
           {: System.out.printf("stmt -> { stmtList }\n"); :}
       | ID ASSIGNMENT NUM SEMI
           {: System.out.printf("stmt -> id = num;\n"); :}
       | cond
           {: System.out.printf("stmt -> cond\n"); :}
       ;

stmtList ::= stmt
               {: System.out.printf("stmtList -> stmt\n"); :}
           | stmtList stmt
               {: System.out.printf("stmtList -> stmtList stmt\n"); :}
           ;

Pulling It All Together

I have included the full JLex and JCup specifications for the conditional grammar at the end of these notes. If you want to try running the parser, type:

java -cp .:..:/usr/local/lib/jar/:/home/bvz/cs365/ conditional.parser

You can use the code shown at the beginning of these notes as an example input file.

JLex Specification

package conditional;
import java_cup.runtime.*;
%%
%cup
%line
NUM = [0-9]
LETTER = [a-zA-Z]
WhiteSpace = [ \t\r\n\f]
%%
"==" {return new Symbol(sym.EQUALS);}
"!=" {return new Symbol(sym.NOT_EQUALS);}
"=" {return new Symbol(sym.ASSIGNMENT);}
";" {return new Symbol(sym.SEMI);}
"{" {return new Symbol(sym.LBRACE);}
"}" {return new Symbol(sym.RBRACE);}
"if" {return new Symbol(sym.IF);}
"then" {return new Symbol(sym.THEN);}
"else" {return new Symbol(sym.ELSE);}
{NUM}+ {return new Symbol(sym.NUM, new Integer(yytext()));}
{LETTER}({LETTER}|{NUM}|_)* {return new Symbol(sym.ID, new String(yytext()));}
{WhiteSpace} { /* ignore white space. */ }
. { System.err.println("Illegal Character: "+yytext()+" at line "+yyline);}

JCup Specification

package conditional;
import java_cup.runtime.*;

parser code {:
    public static void main(String args[]) throws Exception {
        new parser(new Yylex(System.in)).parse();
    }
:}

terminal IF, THEN, ELSE;
terminal Integer NUM;
terminal String ID;
terminal EQUALS, NOT_EQUALS;
terminal ASSIGNMENT;
terminal SEMI;
terminal LBRACE, RBRACE;

non terminal cond, booleanExp, stmt;
non terminal stmtList;

precedence nonassoc ELSE;
precedence nonassoc IF;

cond ::= IF booleanExp THEN stmt
           {: System.out.printf("cond -> if booleanExp then stmt\n"); :}
       | IF booleanExp THEN stmt ELSE stmt
           {: System.out.printf("cond -> if booleanExp then stmt else stmt\n"); :}
       ;

booleanExp ::= ID EQUALS ID
                 {: System.out.printf("booleanExp -> id == id\n"); :}
             | ID NOT_EQUALS ID
                 {: System.out.printf("booleanExp -> id != id\n"); :}
             ;

stmt ::= LBRACE stmtList RBRACE
           {: System.out.printf("stmt -> { stmtList }\n"); :}
       | ID ASSIGNMENT NUM SEMI
           {: System.out.printf("stmt -> id = num;\n"); :}
       | cond
           {: System.out.printf("stmt -> cond\n"); :}
       ;

stmtList ::= stmt
               {: System.out.printf("stmtList -> stmt\n"); :}
           | stmtList stmt
               {: System.out.printf("stmtList -> stmtList stmt\n"); :}
           ;