Parsing Notes

Brad Vander Zanden


I. Top-Down Parsing (also called recursive-descent parsing)

    A. A parsing technique that builds the parse tree top-down, starting
	at the root, and always expanding the left-most, non-terminal node
	first 

	1. Traces out a left-most derivation

	2. When at node n, labeled with nonterminal A, select one of
		the productions for A and construct children at n for
		the symbols on the right side of the production

	3. Predictive parser: A recursive-descent parser that needs no
		backtracking. In other words, given the current input
		symbol a and the nonterminal A to be expanded, 
		there is a unique production among the alternative
		productions for A (A -> α1 | α2 | ... | αk)
		that derives a string beginning with a.

	    a. LL(1) Grammar: A grammar that can be recognized by a
		non-backtracking, predictive parser using one lookahead symbol.

	    b. The first L stands for scanning the input left-to-right, the
		second L stands for producing a left-most derivation, and
		the 1 stands for one symbol of lookahead.

	    c. To determine whether each input symbol a selects a unique
		production for A, we must compute the First and Follow sets
		for A, and the Predict set for each production of A:

  	        i. First(X): First(X) is the set of terminals that begin
		    strings derived from X (X may be a terminal, a
		    nonterminal, or a string of symbols; in our case X is
                    the nonterminal A). If X has productions numbered
                    p₁ to p_n, then:

                    First(X) = First(p₁) U First(p₂) U ... U First(p_n)

                    where p_i is the right hand side string of
                    production i.

		    Here is an algorithm for computing First(X):

		    1. if X is a terminal, First(X) is {X}
		    2. if X -> ε then put ε in First(X)
		    3. if X -> Y₁Y₂...Y_n then put First(Y₁) - {ε} in First(X)
		    4. if X -> Y₁Y₂...Y_n and Y₁, Y₂, ..., Y_{i-1} can all derive the
                       empty string, then add First(Y_i) to First(X). More formally:
                       "if ε ∈ First(Y_j) for all 1 ≤ j ≤ i-1, then
		       put First(Y_i) - {ε} in First(X)". If every one of
		       Y₁, ..., Y_n can derive the empty string, also put ε
		       in First(X).
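
The rules above can be sketched as a fixed-point computation. This is a
minimal illustration, not a definitive implementation: the grammar encoding
(a dict from nonterminal to a list of right-hand-side tuples, with the empty
tuple standing for ε), the names EPS, first_sets and GRAMMAR, and the example
expression grammar are all assumptions made for this sketch.

```python
EPS = "ε"  # marker for the empty string (a naming assumption of this sketch)


def first_sets(grammar):
    """Compute First(X) for every nonterminal X by iterating to a fixed point."""
    first = {nt: set() for nt in grammar}

    def first_of(symbol):
        # Rule 1: the First set of a terminal is the terminal itself.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                before = len(first[nt])
                all_nullable = True
                for sym in rhs:
                    f = first_of(sym)
                    first[nt] |= f - {EPS}      # rules 3 and 4
                    if EPS not in f:
                        all_nullable = False
                        break
                if all_nullable:                # rule 2 (also covers rhs == ())
                    first[nt].add(EPS)
                if len(first[nt]) != before:
                    changed = True
    return first


# Hypothetical expression grammar used only for illustration.
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
FIRST = first_sets(GRAMMAR)
```

The loop repeats until no First set grows, which is needed because a rule
for one nonterminal may depend on a First set that is filled in later.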

	        ii. Follow(X): The set of terminals that can appear immediately
		   to the right of X in some sentential form
		   (i.e., S =>^* α X a β, where a is a terminal).

                     Here is a good algorithm for computing Follow(X):

                     1) if A -> αXYβ, then put {First(Y)-ε} in Follow(X)
                     2) if A -> αX, then put Follow(A) in Follow(X).
                     3) if A -> αXβ and β can derive the
                         empty string, then put Follow(A) in Follow(X).
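
A sketch of these three rules in code, under stated assumptions: the grammar
encoding, the names END, FIRST, first_of_string and follow_sets, and the
example grammar are hypothetical. The First sets are written out by hand for
this grammar, and the sketch also places an end marker "$" in the Follow set
of the start symbol, which is a common convention that the rules above do not
state.

```python
EPS, END = "ε", "$"

# Hypothetical expression grammar; () stands for an ε production.
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
# First sets of the nonterminals, written out by hand for this grammar.
FIRST = {
    "E": {"(", "id"}, "E'": {"+", EPS},
    "T": {"(", "id"}, "T'": {"*", EPS},
    "F": {"(", "id"},
}


def first_of_string(symbols):
    """First set of a string of symbols; contains EPS if the string can vanish."""
    result = set()
    for sym in symbols:
        f = FIRST.get(sym, {sym})       # a terminal's First set is itself
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)
    return result


def follow_sets(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)              # convention assumed by this sketch
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for rhs in productions:
                for i, x in enumerate(rhs):
                    if x not in grammar:
                        continue        # Follow is computed for nonterminals only
                    rest = first_of_string(rhs[i + 1:])
                    new = rest - {EPS}              # rule 1
                    if EPS in rest:                 # rules 2 and 3
                        new |= follow[a]
                    if not new <= follow[x]:
                        follow[x] |= new
                        changed = True
    return follow


FOLLOW = follow_sets(GRAMMAR, "E")
```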

                iii. Predict(A -> α): The predict set for a
                     production is the first set for that production,
                     augmented with the Follow set for A if that production
                     can generate the empty string. Formally we can write
                     the Predict set as:
             
                     Predict(A -> α) = First(α) 
                                   ∪ (if α =>^* ε) then Follow(A) else ∅

                If, for each non-terminal A, the Predict sets of the
                productions of A are pairwise disjoint, then the grammar is LL(1).
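
The Predict definition and the LL(1) test can be sketched as follows. The
grammar, the hand-written FIRST and FOLLOW tables, and the function names
predict and is_ll1 are assumptions made for illustration only.

```python
from itertools import combinations

EPS = "ε"

# Hypothetical expression grammar; () stands for an ε production.
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
# First and Follow sets written out by hand for this grammar.
FIRST = {
    "E": {"(", "id"}, "E'": {"+", EPS},
    "T": {"(", "id"}, "T'": {"*", EPS},
    "F": {"(", "id"},
}
FOLLOW = {
    "E": {"$", ")"}, "E'": {"$", ")"},
    "T": {"+", "$", ")"}, "T'": {"+", "$", ")"},
    "F": {"*", "+", "$", ")"},
}


def first_of_string(symbols):
    """First set of a string of symbols; contains EPS if the string can vanish."""
    result = set()
    for sym in symbols:
        f = FIRST.get(sym, {sym})       # a terminal's First set is itself
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)
    return result


def predict(a, rhs):
    """Predict(a -> rhs): First(rhs), plus Follow(a) when rhs can derive ε."""
    f = first_of_string(rhs)
    return (f - {EPS}) | (FOLLOW[a] if EPS in f else set())


def is_ll1(grammar):
    """True if the Predict sets of each nonterminal are pairwise disjoint."""
    for a, productions in grammar.items():
        sets = [predict(a, rhs) for rhs in productions]
        if any(s & t for s, t in combinations(sets, 2)):
            return False
    return True
```

For this grammar, the two productions of E' are predicted by "+" and by the
symbols in Follow(E') respectively, so one lookahead symbol is enough to
choose between them.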
                
    B. Usefulness: Top-down parsing is useful for recognizing:

        1) command-based languages where each command begins with a unique
           command name
        2) flow-of-control constructs because they
	   are usually tagged by unique keywords, such as if, while, for, etc.
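
As an illustration of the first case, here is a minimal recursive-descent
sketch for a made-up command language. The commands (move, turn, repeat),
the whitespace-separated token format, and all class and function names are
hypothetical. Each command starts with a unique keyword, so one token of
lookahead selects the production and no backtracking is needed.

```python
class ParseError(Exception):
    pass


class Parser:
    """Predictive parser for a hypothetical grammar:
         program  -> commands $
         commands -> command commands | ε
         command  -> move NUMBER | turn (left | right)
                   | repeat NUMBER { commands }
    """

    def __init__(self, tokens):
        self.tokens = tokens + ["$"]    # "$" marks the end of the input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, predicate, what):
        tok = self.peek()
        if not predicate(tok):
            raise ParseError(f"expected {what}, found {tok!r}")
        self.pos += 1
        return tok

    def program(self):
        cmds = self.commands()
        self.expect(lambda t: t == "$", "end of input")
        return cmds

    def commands(self):
        cmds = []
        # The lookahead token decides whether another command follows.
        while self.peek() in ("move", "turn", "repeat"):
            cmds.append(self.command())
        return cmds

    def command(self):
        tok = self.peek()
        if tok == "move":
            self.pos += 1
            return ("move", int(self.expect(str.isdigit, "a number")))
        if tok == "turn":
            self.pos += 1
            direction = self.expect(lambda t: t in ("left", "right"),
                                    "left or right")
            return ("turn", direction)
        if tok == "repeat":
            self.pos += 1
            count = int(self.expect(str.isdigit, "a number"))
            self.expect(lambda t: t == "{", "'{'")
            body = self.commands()
            self.expect(lambda t: t == "}", "'}'")
            return ("repeat", count, body)
        raise ParseError(f"unknown command {tok!r}")


def parse(text):
    return Parser(text.split()).program()
```

There is one method per nonterminal, and each method mirrors the right side
of the production it chooses.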

    C. Requirements

	1. No left recursion: A grammar is left recursive if there is
		a nonterminal A such that A =>⁺ A α

	2. No two productions for the same nonterminal start with the same terminal

		a. Can use left factoring to factor out a common prefix

		    i. If A -> α β1 | α β2
			
		  	then

			A -> α A'
			A' -> β1 | β2

			is the factored form
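
One round of left factoring can be sketched as below. The function names,
the tuple encoding of right-hand sides (with () standing for ε), and the
priming scheme for fresh nonterminal names are assumptions of this sketch;
a full implementation would repeat the step until no common prefixes remain.

```python
def common_prefix(seqs):
    """Longest prefix shared by every sequence in seqs."""
    prefix = []
    for column in zip(*seqs):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return tuple(prefix)


def left_factor(name, alternatives):
    """One round of left factoring for the alternatives of nonterminal `name`."""
    groups = {}
    for alt in alternatives:
        groups.setdefault(alt[:1], []).append(alt)   # group by first symbol
    result = {name: []}
    fresh = name
    for key, group in groups.items():
        if len(group) == 1 or key == ():
            result[name].extend(group)               # nothing to factor
            continue
        alpha = common_prefix(group)
        fresh += "'"                                 # A', A'', ... (assumed naming)
        result[name].append(alpha + (fresh,))        # A  -> α A'
        result[fresh] = [alt[len(alpha):] for alt in group]   # A' -> β1 | β2
    return result


factored = left_factor("stmt", [
    ("if", "exp", "then", "stmt"),
    ("if", "exp", "then", "stmt", "else", "stmt"),
    ("other",),
])
```

Here the two if alternatives share the prefix "if exp then stmt", so the
result has stmt -> if exp then stmt stmt' | other and stmt' -> ε | else stmt.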

II. Bottom-Up Parsing

    A. So called because this technique grows the parse tree from the
	leaves up

	1. The building of the parse tree traces out a right-most derivation
		in reverse. 

	2. Given a string of symbols, α, and a lookahead symbol, either

	    a. Shift: shift the input symbol onto the end of the string, or

	    b. Reduce: replace the last k symbols in the string
		(call the string formed by these k symbols β)
		with a nonterminal A using the production A -> β.

	3. Example:

	    (1) E -> E + E 
	    (2) E -> E * E
	    (3) E -> ( E )
	    (4) E -> id

		Input		Stack		Action
		id+id*id$			Shift
		+id*id$		id		Reduce by E -> id
		+id*id$		E		Shift
		id*id$		E+		Shift
		*id$		E+id		Reduce by E -> id
		*id$		E+E		Shift
		id$		E+E*		Shift
		$		E+E*id		Reduce by E -> id
		$		E+E*E		Reduce by E -> E * E
		$		E+E		Reduce by E -> E + E
		$		E		accept

	4. The sequence of reductions represents a rightmost derivation
		in reverse:

		E => E + E
		  => E + E * E
		  => E + E * id
		  => E + id * id
		  => id + id * id
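
The trace in the example can be replayed mechanically. The sketch below does
not decide when to shift or reduce (the grammar is ambiguous, so the action
list is supplied by hand, copied from the Action column); it only applies
the actions and records the sentential form after each reduction. The names
replay, TOKENS and ACTIONS are assumptions of this sketch.

```python
PRODUCTIONS = {
    1: ("E", ("E", "+", "E")),
    2: ("E", ("E", "*", "E")),
    3: ("E", ("(", "E", ")")),
    4: ("E", ("id",)),
}


def replay(tokens, actions):
    """Apply a given sequence of shift/reduce actions to the token list."""
    stack, pos, forms = [], 0, []
    for action in actions:
        if action == "shift":
            stack.append(tokens[pos])
            pos += 1
        else:
            lhs, rhs = PRODUCTIONS[action]
            k = len(rhs)
            if tuple(stack[-k:]) != rhs:
                raise ValueError(f"cannot reduce by production {action}")
            del stack[-k:]              # pop the last k symbols ...
            stack.append(lhs)           # ... and replace them by the nonterminal
            # Stack plus remaining input is the current sentential form.
            forms.append(" ".join(stack + tokens[pos:]))
    return stack, forms


TOKENS = ["id", "+", "id", "*", "id"]
# "shift", or the number of the production to reduce by.
ACTIONS = ["shift", 4, "shift", "shift", 4, "shift", "shift", 4, 2, 1]
stack, forms = replay(TOKENS, ACTIONS)
```

Reading forms from last to first gives the sentential forms of the rightmost
derivation, ending just before the original input string.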

	5. An ambiguous grammar will cause a parser to confront one of
		two possible types of conflicts:

	    a. Shift/Reduce conflict: The parser does not know whether to
		shift the symbol onto the stack or to replace a portion of
		the string on the stack with a nonterminal (i.e., to reduce
		the stack).

		i. classic example is the dangling else problem:

		    conditional -> if exp then stmt
				       |  if exp then stmt else stmt
		
 		    Given the following configuration, the parser does not
			know whether to shift the else onto the stack or
			to replace the "if exp then stmt" symbols with the
			conditional nonterminal:

		    Stack				Input

		    ... if exp then stmt		else ... $

	    b. Reduce/Reduce conflict: Given the current lookahead symbol,
		there are two nonterminals that could be used to replace
		symbols on the stack.

		i. example

		    VarDecl -> Type []
		    ArrayAccess -> Name [exp]
		    Type -> id | int
		    Name -> id

		   Given the following configuration and allowing only one
		 	lookahead symbol, the parser does not know whether to
			replace id with Type or with 
			Name:

		   Stack				Input
	
		   ... id				[ ... $

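		   This conflict could be resolved by looking ahead two
		 	symbols: after the "[", a VarDecl continues with "]"
			whereas an ArrayAccess continues with the start of
			an exp. Hence, this grammar can be recognized by
			a parser using two lookahead symbols but not by
			one using only one lookahead symbol.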
		
III. LR Parsers

    A. LR(k) Grammar: A grammar that can be recognized by a parser 
	performing a rightmost derivation in reverse using k 
	lookahead symbols.

	1. The L stands for scanning left-to-right and the R stands for
		constructing a rightmost derivation in reverse

	2. A grammar is LR(k) if we can recognize the occurrence of the
		right side of a production, having seen all of the string
		derived by that right side, plus an additional k
		input symbols.

	3. A grammar is LL(k) if we can recognize the occurrence of the
		right side of a production, having seen only the first
		k symbols of what its right side derives. Hence
		LR grammars describe more languages than LL grammars.

    B. Why use LR parsing?

	1. Can recognize virtually all programming language constructs
		for which context free grammars can be written

	2. Most general nonbacktracking shift-reduce technique known,
		yet it is as efficient as other shift-reduce techniques

	3. Grammars recognized by predictive parsers are a proper
		subset of grammars recognizable by LR parsers

    C. LR Parsing Algorithm

	1. To create an LR parser, we construct an action table in which the
		rows correspond to states and the columns correspond to terminals
		and non-terminals. Each table entry holds one of four possible
		values: 1) shift, 2) reduce, 3) accept, and 4) error

		a. shift state #: Indicates that the current symbol should be
			pushed onto the stack and that the automaton should
			transition to state #. The state # should also
			be pushed onto the stack.

		b. reduce production: The production is typically a number
			that refers to an entry in a production table. That
			entry gives the number of state/symbol pairs to be
			popped from the stack and the non-terminal to be
			pushed back onto the input stream.

		c. accept: Accept the input string as valid if there are no
			symbols left on the input

		d. error: The current state does not have a valid transition
			for the current input symbol. The input string is
			invalid.
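
The four kinds of entries can be seen at work in a small driver loop. This is
a sketch under stated assumptions: the grammar (1) S -> ( S ), (2) S -> x is
hypothetical, the ACTION table was worked out by hand for it, and the stack
holds only state numbers because the symbol paired with each state is not
needed to run the loop. As described above, a reduce pushes the non-terminal
back onto the front of the input.

```python
# Production number -> (left side, number of symbols on the right side).
PRODUCTIONS = {1: ("S", 3), 2: ("S", 1)}

# Hand-built table for the hypothetical grammar; missing entries mean "error".
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3), (0, "S"): ("shift", 1),
    (1, "$"): ("accept",),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3), (2, "S"): ("shift", 4),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}


def lr_parse(tokens):
    """Return the list of productions reduced by, or None on a syntax error."""
    stack = [0]                          # start in the initial state
    pending = list(tokens) + ["$"]       # remaining input plus end marker
    reductions = []
    while True:
        entry = ACTION.get((stack[-1], pending[0]))
        if entry is None:                # error: no valid transition
            return None
        if entry[0] == "shift":
            stack.append(entry[1])       # push the new state
            pending.pop(0)               # consume the shifted symbol
        elif entry[0] == "reduce":
            lhs, length = PRODUCTIONS[entry[1]]
            del stack[-length:]          # pop one state per right-side symbol
            pending.insert(0, lhs)       # push the non-terminal back on the input
            reductions.append(entry[1])
        else:                            # accept
            return reductions
```

For the input ( ( x ) ) the driver reduces by production 2 once and then by
production 1 twice before accepting.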

	3. Constructing an action table for an SLR grammar

	    a. Item: A production and a position in the production

		i. An item represents how much of the production is
			currently on the stack and how much is left
			to be recognized

		ii. The portion of an item before the dot represents
			a viable prefix. A viable prefix 
			is a string of grammar symbols that can comprise
			the first part of the right side of a production.

			Example: a, aX, aXY, and aXYb are viable prefixes
			   of the production A -> aXYb

		iii. Items are grouped into states: The items in a
			state represent the productions that could
			be recognized given the current contents of
			the stack

	    b. Items are constructed using two operations:

		i. Closure: Given an item A -> α . B β, the closure 
			operation finds all the productions that can derive
			a portion of the string B β.

		    Given a set of items I,

		    1) every item in I is in closure(I)
		
		    2) If A -> α . B β is in closure(I)
			and B -> γ is a production, then add the
			item B -> . γ to closure(I), if it is not already there.
			Apply this rule until no more new items can be added
			to closure(I).

		ii. goto(I, X): If I is the set of items that can generate
			the viable prefix γ, then goto(I, X) is
			the set of items that can generate the viable prefix
			γ X.

		    1) Formal definition: goto(I, X) is the closure of the set
			of all items [A -> α X . β] such that
			[A -> α . X β] is in I.
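
The two operations can be sketched directly from their definitions. The item
encoding (left side, right side, dot position), the function names, and the
small augmented grammar S' -> S, S -> ( S ) | x are assumptions made for this
illustration.

```python
# Hypothetical augmented grammar.
GRAMMAR = {
    "S'": [("S",)],
    "S":  [("(", "S", ")"), ("x",)],
}


def closure(items):
    """Add B -> . γ for every nonterminal B that appears right after a dot."""
    result = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for gamma in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over `symbol` wherever possible, then take the closure."""
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)


I0 = closure({("S'", ("S",), 0)})
```

Returning frozensets makes the resulting item sets hashable, so they can
later be compared and used as dictionary keys when states are collected.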

		iii. LR(0) item: The items constructed in this fashion are 
			called LR(0) items. The reason for the 0 is that
			these items are created without considering the
			next symbol that could appear in the input. It turns
			out that if we consider the next symbol that could
			appear in the input, we could make our items more
			sophisticated (such items are called LR(1) items).
			LR(0) items are used for SLR (Simple LR) parsing,
			whereas LR(1) items are used for canonical LR parsing.

	    c. Constructing sets of items

		i. Augment the Grammar with a distinguished start symbol S'
			and add one production [S' -> S].

		ii. The algorithm: 

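		    1) Basic idea: Start with closure({[S' -> . S]}). For each
			set of items I and each grammar symbol X that appears
			immediately after a dot in some item of I, compute
			goto(I, X) (which already includes the closure step)
			and add the resulting set of items to the sets-of-items
			collection if it is not already there. Repeat until no
			new sets are added.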

		 iii. If each of the resulting set of items is considered a
			state, and the goto function is considered the
			transition function, then you get a deterministic
			finite automaton that recognizes the viable prefixes
			of the grammar.

	    d. Shift/Reduce conflicts: Shift/reduce conflicts occur when there
	    	are two items in a set of the following form:

		A -> γ α .
		B -> β α . λ

		where γ and β may be null and λ is non-null.
		Further, the next input symbol, a, can both follow
		the nonterminal A and be the first symbol in
		a string derived from λ. In this case we do not know
		whether we should reduce to A or shift a onto the
		stack, thus resulting in a shift/reduce conflict.

		In order to determine whether or not we have a shift/reduce
		conflict, we must compute the Follow set for A and
		the First set for λ.

	        i. Follow(X): The set of terminals that can appear immediately 
		   to the right of X in some sentential form 
		   (i.e., S =>^* α X a β).

  	        ii. First(X): First(X) is the set of terminals that begin
		    strings derived from X (X may be a terminal, a
		    nonterminal, or a string of symbols).

		We only have a conflict if a is in both Follow(A) and
		First(λ). If a is in only one of these sets, then
		we can perform a reduction or shift, depending on the set in
		which a resides.
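
The test described above amounts to two set-membership checks. In this
sketch the function name decide and the example sets are hypothetical; the
sets are written by hand on the assumption that, in the dangling-else
grammar, else can follow a conditional.

```python
def decide(lookahead, follow_a, first_lambda):
    """Choose between reducing to A and shifting, given one lookahead symbol."""
    can_reduce = lookahead in follow_a       # a can follow A
    can_shift = lookahead in first_lambda    # a can begin the rest of B's item
    if can_reduce and can_shift:
        return "conflict"
    if can_reduce:
        return "reduce"
    if can_shift:
        return "shift"
    return "error"


# Dangling else: the items are
#   conditional -> if exp then stmt .
#   conditional -> if exp then stmt . else stmt
FOLLOW_CONDITIONAL = {"else", "$"}   # assumed by hand for this illustration
FIRST_REST = {"else"}
```

With lookahead else both checks succeed, which is exactly the shift/reduce
conflict; with lookahead $ only the reduction is possible.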

	    e. Constructing the action table

	        1) Construct the collection of sets of LR(0) items
		2) Set state_i = set_i. 
		3) Actions for terminal symbols: Consider each
		    item in set_i:
		    a) Items of the form [A -> α . a β]: If
		       Goto(set_i, a) = set_j, then
		       set action[i, a] = shift j.
		    b) Items of the form [A -> α .]: set action[i,a] to
		       reduce A -> α for all a in
		       Follow(A). A may not be S', the start symbol.
		    c) [S' -> S.]: action[i,$] = accept where $ represents
		    	"end of string".
		4) Actions for nonterminal symbols in state_i: If
		    Goto(set_i,A) = set_j, then 
		    action[i, A] = shift j where A gets shifted onto the stack.
		5) All entries not defined by rules 3 and 4 are made "error".
		6) The initial state is the one constructed from the set of
		    items containing [S' -> .S].
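
The construction steps can be sketched end to end for a small grammar. All
names here are assumptions of the sketch, as are the hypothetical grammar
(1) S -> ( S ), (2) S -> x, the hand-written Follow set, and the state
numbering, which simply follows the order in which item sets are discovered.

```python
GRAMMAR = {"S'": [("S",)], "S": [("(", "S", ")"), ("x",)]}
FOLLOW = {"S": {")", "$"}}                      # written out by hand
PROD_NUM = {("S", ("(", "S", ")")): 1, ("S", ("x",)): 2}


def closure(items):
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for gamma in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto(items, symbol):
    return closure({(l, r, d + 1) for l, r, d in items
                    if d < len(r) and r[d] == symbol})


def item_sets():
    """Step 1: the collection of sets of LR(0) items, numbered from 0."""
    start = closure({("S'", ("S",), 0)})
    sets, index = [start], {start: 0}
    i = 0
    while i < len(sets):
        current = sets[i]
        symbols = {rhs[dot] for _, rhs, dot in current if dot < len(rhs)}
        for x in sorted(symbols):               # sorted for a stable numbering
            nxt = goto(current, x)
            if nxt not in index:
                index[nxt] = len(sets)
                sets.append(nxt)
        i += 1
    return sets, index


def slr_table():
    sets, index = item_sets()
    action = {}

    def put(key, value):
        if action.get(key, value) != value:
            raise ValueError(f"conflict at {key}: {action[key]} vs {value}")
        action[key] = value

    for i, items in enumerate(sets):            # step 2: state i = set i
        for lhs, rhs, dot in items:
            if dot < len(rhs):                  # steps 3a and 4: shift
                put((i, rhs[dot]), ("shift", index[goto(items, rhs[dot])]))
            elif lhs == "S'":                   # step 3c: accept
                put((i, "$"), ("accept",))
            else:                               # step 3b: reduce on Follow(A)
                for a in FOLLOW[lhs]:
                    put((i, a), ("reduce", PROD_NUM[(lhs, rhs)]))
    return action                               # step 5: missing keys are errors


TABLE = slr_table()
```

The put helper raises an error when two different actions land in the same
entry, which is how a shift/reduce or reduce/reduce conflict would show up
for a grammar that is not SLR(1).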

	4. Why SLR(1) parsers can fail

	    a. Suppose a set of items contains item [A -> α .] and that
		a is in Follow(A). In some situations, when the state
		containing this item appears on top of the stack, the viable
		prefix β α on the stack is such that β A
		cannot be followed by a in any right-sentential form, so
		reducing by A -> α on lookahead a would be incorrect.

	    b. LR(0) states do not include the information that a
		cannot follow A in this state since they do not include
		lookahead information. It is possible to build this 
		information into a state by including one symbol of
		lookahead. Such states are called LR(1) states, and they
		are used in LR(1) parsing.

	    c. LALR(1) Parsers: The parsers produced by Yacc and (by default)
		Bison are LALR(1) parsers. LALR(1) (lookahead-LR) grammars
		form a slightly smaller class than LR(1) grammars, but the
		class includes most of the grammars that arise in practice
		and the parsing tables are much smaller. The problem with
		canonical LR(1) parsers is that adding lookahead information
		to the states causes a state explosion: the same set of
		LR(0) items is duplicated multiple times, with a different
		set of lookahead symbols in each copy. LALR eliminates the
		state explosion by collapsing states that differ only in
		their lookahead symbols into a single state and merging the
		lookahead symbols. This can occasionally introduce
		reduce/reduce conflicts, but it rarely happens.
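
The merging step can be sketched as grouping LR(1) states by their core,
that is, by their items with the lookahead symbols stripped. The item
encoding (left side, right side, dot position, lookahead), the function
names, and the three example states are assumptions for illustration; the
states are written by hand in the style of the grammar S -> C C,
C -> c C | d.

```python
from collections import defaultdict


def core(state):
    """The LR(0) part of an LR(1) state: its items without lookaheads."""
    return frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)


def merge_by_core(states):
    """Collapse LR(1) states with identical cores, merging their lookaheads."""
    merged = defaultdict(set)
    for state in states:
        merged[core(state)] |= set(state)
    return [frozenset(items) for items in merged.values()]


# Two copies of "C -> d ." that differ only in lookahead, plus one other state.
STATE_A = frozenset({("C", ("d",), 1, "c"), ("C", ("d",), 1, "d")})
STATE_B = frozenset({("C", ("d",), 1, "$")})
STATE_C = frozenset({("C", ("c", "C"), 2, "$")})

LALR_STATES = merge_by_core([STATE_A, STATE_B, STATE_C])
```

STATE_A and STATE_B share a core, so they collapse into one state whose
lookahead set is the union c, d, $, while STATE_C is left as it is. A full
LALR construction would also redirect the transitions into the merged states.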