I. Top-Down Parsing (also called recursive-descent parsing)
   A. A parsing technique that builds the parse tree top-down, starting at the root and always
      expanding the left-most nonterminal node first
      1. Traces out a left-most derivation
      2. When at node n, labeled with nonterminal A, select one of the productions for A and
         construct children at n for the symbols on the right side of the production
      3. Predictive parser: A recursive-descent parser that needs no backtracking. In other
         words, given the current input symbol a and the nonterminal A to be expanded, there
         is a unique production among the alternative productions for A
         (A -> α1 | α2 | ... | αk) that derives a string beginning with a.
         a. LL(1) Grammar: A grammar that can be recognized by a non-backtracking, predictive
            parser using one lookahead symbol.
         b. The first L stands for scanning left-to-right and the second L stands for
            producing a left-most derivation.
         c. To determine whether, for each symbol a, we can find a unique production for A,
            we must compute the First and Follow sets for A, and the Predict set for each
            production of A:
            i. First(X): The set of terminals that begin strings derived from X (X may be a
               terminal, a nonterminal, or a string of symbols; in our case X is the
               nonterminal A). If X has productions numbered p_1 to p_n, then:
                  First(X) = First(p_1) ∪ First(p_2) ∪ ... ∪ First(p_n)
               where p_i is the right-hand-side string of production i. Here is an algorithm
               for computing First(X):
               1. if X is a terminal, First(X) is {X}
               2. if X -> ε, then put ε in First(X)
               3. if X -> Y_1 Y_2 ... Y_n, then put First(Y_1) in First(X)
               4. if X -> Y_1 Y_2 ... Y_n and Y_1, Y_2, ..., Y_{i-1} can all derive the empty
                  string, then add First(Y_i) to First(X). More formally: "if ε ∈ First(Y_j)
                  for 1 ≤ j ≤ i-1, then put First(Y_i) in First(X)". (In particular, if all
                  of Y_1 ... Y_n can derive the empty string, then ε is in First(X).)
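The First algorithm above, together with the Follow and Predict computations defined next, can be sketched as a fixed-point iteration. The code below is a minimal sketch; the toy grammar S -> a S b | ε, the symbol names, and the helper functions are assumptions made for illustration, with ε represented by the empty tuple.

```python
# Assumed toy LL(1) grammar: S -> a S b | ε  (the empty tuple is the ε right side)
GRAMMAR = {'S': [('a', 'S', 'b'), ()]}
START = 'S'
EPS, END = 'ε', '$'

def first_of_seq(seq, first):
    """First set of a string of grammar symbols (rule 3/4 generalized to a sequence)."""
    out = set()
    for X in seq:
        out |= first[X] - {EPS}
        if EPS not in first[X]:          # X cannot derive ε: stop here
            return out
    out.add(EPS)                         # every symbol in seq can derive ε
    return out

def compute_first(grammar):
    terminals = {X for rhss in grammar.values() for rhs in rhss
                 for X in rhs if X not in grammar}
    first = {t: {t} for t in terminals}  # rule 1: First(terminal) = {terminal}
    first.update({A: set() for A in grammar})
    changed = True
    while changed:                       # iterate until no First set grows
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                new = first_of_seq(rhs, first) - first[A]
                if new:
                    first[A] |= new
                    changed = True
    return first

def compute_follow(grammar, first):
    follow = {A: set() for A in grammar}
    follow[START].add(END)               # by convention, $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, X in enumerate(rhs):
                    if X not in grammar:           # only nonterminals get Follow sets
                        continue
                    tail = first_of_seq(rhs[i + 1:], first)
                    new = (tail - {EPS}) | (follow[A] if EPS in tail else set())
                    if new - follow[X]:
                        follow[X] |= new
                        changed = True
    return follow

def predict(A, rhs, first, follow):
    f = first_of_seq(rhs, first)
    return (f - {EPS}) | (follow[A] if EPS in f else set())
```

For this grammar, Predict(S -> a S b) = {a} and Predict(S -> ε) = Follow(S) = {b, $}; the two sets are disjoint, so the grammar is LL(1).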
            ii. Follow(X): The set of terminals that can appear immediately to the right of X
                in some sentential form (i.e., a ∈ Follow(X) if S =>* α X a β). Here is an
                algorithm for computing Follow(X):
                1) if A -> α X Y β, then put First(Y) - {ε} in Follow(X)
                2) if A -> α X, then put Follow(A) in Follow(X)
                3) if A -> α X β and β can derive the empty string, then put Follow(A) in
                   Follow(X)
            iii. Predict(A -> α): The Predict set for a production is the First set for that
                 production, augmented with the Follow set for A if that production can
                 generate the empty string. Formally:
                    Predict(A -> α) = (First(α) - {ε}) ∪ (if α =>* ε then Follow(A) else ∅)
                 If, for each nonterminal A, the Predict sets of A's productions have pairwise
                 empty intersections, then the grammar is LL(1).
   B. Usefulness: Top-down parsing is useful for recognizing:
      1) command-based languages where each command begins with a unique command name
      2) flow-of-control constructs, because they are usually tagged by unique keywords such
         as if, while, and for
   C. Requirements
      1. No left recursion: A grammar is left recursive if there is a nonterminal A such that
         A =>+ A α
      2. No two productions for the same nonterminal start with the same terminal
         a. Can use left factoring to factor out a common prefix
            i. If A -> α β1 | α β2, then
                  A -> α A'
                  A' -> β1 | β2
               is the factored form
II. Bottom-Up Parsing
   A. So called because this technique grows the parse tree from the leaves up
      1. The building of the parse tree traces out a right-most derivation in reverse.
      2. Given a string of symbols, α, and a lookahead symbol, either
         a. Shift: shift the input symbol onto the end of the string, or
         b. Reduce: replace the last k symbols in the string (call the string formed by these
            k symbols β) with a nonterminal A, using the production A -> β.
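For contrast with the shift/reduce scheme just introduced, the predictive technique of section I can be sketched as a recursive-descent parser. The grammar S -> a S b | ε and the function names are assumptions for illustration: since Predict(S -> a S b) = {a}, one lookahead token is enough to pick the production with no backtracking.

```python
def parse_S(tokens, i=0):
    """Recursive-descent parse of S -> a S b | ε. Returns the index just past the
    portion of tokens derived from S, choosing the production by one lookahead."""
    if i < len(tokens) and tokens[i] == 'a':   # Predict(S -> a S b) = {a}
        i = parse_S(tokens, i + 1)             # match a, then recursively match S
        if i >= len(tokens) or tokens[i] != 'b':
            raise SyntaxError('expected b')
        return i + 1                           # match b
    return i                                   # Predict(S -> ε) = {b, $}: derive ε

def accepts(tokens):
    try:
        return parse_S(tokens) == len(tokens)  # S must derive the entire input
    except SyntaxError:
        return False
```

Each nonterminal becomes one function, and the lookahead test mirrors its Predict sets; this is why predictive parsers need no explicit stack management in user code.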
      3. Example:
            (1) E -> E + E
            (2) E -> E * E
            (3) E -> ( E )
            (4) E -> id

            Input        Stack      Action
            id+id*id$               Shift
            +id*id$      id         Reduce by E -> id
            +id*id$      E          Shift
            id*id$       E+         Shift
            *id$         E+id       Reduce by E -> id
            *id$         E+E        Shift
            id$          E+E*       Shift
            $            E+E*id     Reduce by E -> id
            $            E+E*E      Reduce by E -> E * E
            $            E+E        Reduce by E -> E + E
            $            E          accept

      4. The sequence of reductions represents a rightmost derivation in reverse:
            E => E + E => E + E * E => E + E * id => E + id * id => id + id * id
      5. An ambiguous grammar will cause a parser to confront one of two possible types of
         conflicts:
         a. Shift/Reduce conflict: The parser does not know whether to shift the symbol onto
            the stack or to replace a portion of the string on the stack with a nonterminal
            (i.e., to reduce the stack).
            i. The classic example is the dangling-else problem:
                  conditional -> if exp then stmt
                               | if exp then stmt else stmt
               Given the following configuration, the parser does not know whether to shift
               the else onto the stack or to replace the "if exp then stmt" symbols with the
               conditional nonterminal:
                  Stack                       Input
                  ... if exp then stmt        else ... $
         b. Reduce/Reduce conflict: Given the current lookahead symbol, there are two
            nonterminals that could be used to replace symbols on the stack.
            i. Example:
                  VarDecl -> Type [ ]
                  ArrayAccess -> Name [ exp ]
                  Type -> id | int
                  Name -> id
               Given the following configuration and allowing only one lookahead symbol, the
               parser does not know whether to replace id with Type or with Name:
                  Stack        Input
                  ... id       [ ... $
               This ambiguity could be resolved by looking ahead two symbols. Hence, this
               grammar can be recognized by a parser using two lookahead symbols but not by
               one using only one lookahead symbol.
V. LR Parsers
   A. LR(k) Grammar: A grammar that can be recognized by a parser performing a rightmost
      derivation in reverse using k lookahead symbols.
      1. The L stands for scanning left-to-right and the R stands for constructing a
         rightmost derivation in reverse
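The trace above can be reproduced by a small shift-reduce loop. This is only a sketch of the id/+/* fragment of the grammar: the operator-precedence test used to decide when to reduce E op E is an assumed oracle supplied for this example, not part of the general shift-reduce algorithm (a real parser would consult a parse table).

```python
PREC = {'+': 1, '*': 2}   # assumed precedence oracle for choosing reductions

def parse(tokens):
    """Shift-reduce loop for the id/+/* fragment; records the action sequence."""
    stack, actions, i = [], [], 0
    while True:
        la = tokens[i] if i < len(tokens) else '$'
        if stack and stack[-1] == 'id':                    # reduce E -> id
            stack[-1] = 'E'
            actions.append('reduce E -> id')
        elif (len(stack) >= 3 and stack[-1] == 'E' and stack[-3] == 'E'
              and stack[-2] in PREC and PREC.get(la, 0) <= PREC[stack[-2]]):
            op = stack[-2]                                 # reduce E -> E op E only when
            stack[-3:] = ['E']                             # the lookahead binds no tighter
            actions.append('reduce E -> E %s E' % op)
        elif la != '$':
            stack.append(la)                               # shift the lookahead
            i += 1
            actions.append('shift')
        else:
            actions.append('accept' if stack == ['E'] else 'error')
            return actions
```

Running parse(['id', '+', 'id', '*', 'id']) yields exactly the action column of the table above.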
      2. A grammar is LR(k) if we can recognize the occurrence of the right side of a
         production after having seen all of the string derived by that right side, plus an
         additional k input symbols.
      3. A grammar is LL(k) if we can recognize the occurrence of the right side of a
         production having seen only the first k symbols of what its right side derives.
         Hence LR grammars describe more languages than LL grammars.
   B. Why use LR parsing?
      1. Can recognize virtually all programming language constructs for which context-free
         grammars can be written
      2. Most general nonbacktracking shift-reduce technique known, yet it is as efficient as
         other shift-reduce techniques
      3. Grammars recognized by predictive parsers are a proper subset of the grammars
         recognizable by LR parsers
   C. LR Parsing Algorithm
      1. To create an LR parser, we construct an action table in which the rows correspond to
         states and the columns correspond to terminals and nonterminals. Each table entry
         can have one of four possible values: 1) shift, 2) reduce, 3) accept, and 4) error
         a. shift state#: Indicates that the current symbol should be pushed onto the stack
            and that the automaton should transition to state#. The state# should also be
            pushed onto the stack.
         b. reduce production: The production is typically a number that refers to a table
            entry; that entry indicates the number of state/symbol pairs to be popped from
            the stack and the nonterminal to be pushed back onto the input stream.
         c. accept: Accept the input string as valid if there are no symbols left on the
            input
         d. error: The current state does not have a valid transition for the current input
            symbol. The input string is invalid.
      3. Constructing an action table for an SLR grammar
         a. Item: A production and a position in the production
            i. An item represents how much of the production is currently on the stack and
               how much is left to be recognized
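The shift/reduce/accept/error dispatch described in C.1 can be sketched as a table-driven driver. The ACTION and GOTO tables below were built by hand as an assumption for illustration, for the toy grammar S' -> S, S -> ( S ) | x; the encoding of reduce entries (left-hand side plus number of pairs to pop) is also an assumed representation.

```python
# Hand-built SLR(1) tables (assumed) for: S' -> S, S -> ( S ), S -> x
# ACTION entries: ('s', state) = shift, ('r', lhs, n) = reduce popping n pairs, ('acc',)
ACTION = {
    (0, '('): ('s', 2), (0, 'x'): ('s', 3),
    (1, '$'): ('acc',),
    (2, '('): ('s', 2), (2, 'x'): ('s', 3),
    (3, ')'): ('r', 'S', 1), (3, '$'): ('r', 'S', 1),   # reduce S -> x
    (4, ')'): ('s', 5),
    (5, ')'): ('r', 'S', 3), (5, '$'): ('r', 'S', 3),   # reduce S -> ( S )
}
GOTO = {(0, 'S'): 1, (2, 'S'): 4}

def lr_parse(tokens):
    """Generic LR driver: consult ACTION on (state, lookahead) until accept/error."""
    tokens = list(tokens) + ['$']
    stack, i = [0], 0                        # stack of states; state 0 is initial
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:                      # missing entry means error
            return False
        if act[0] == 's':                    # shift: push the new state, advance input
            stack.append(act[1])
            i += 1
        elif act[0] == 'r':                  # reduce: pop |rhs| states, take GOTO on lhs
            _, lhs, n = act
            del stack[len(stack) - n:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:                                # accept
            return True
```

Note that the driver itself never changes; only the tables depend on the grammar, which is why LR parsers are generated by tools rather than written by hand.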
            ii. The portion of an item before the dot represents a viable prefix. A viable
                prefix is a string of grammar symbols that can comprise the first part of the
                right side of a production. Example: a, aX, aXY, and aXYb are viable prefixes
                of the production A -> aXYb
            iii. Items are grouped into states: the items in a state represent the
                 productions that could be recognized given the current contents of the stack
         b. Items are constructed using two operations:
            i. Closure: Given an item A -> α . B β, the closure operation finds all the
               productions that can derive a portion of the string B β. Given a set of items
               I:
               1) every item in I is in closure(I)
               2) if A -> α . B β is in closure(I) and B -> γ is a production, then add the
                  item B -> . γ to closure(I), if it is not already there. Apply this rule
                  until no more new items can be added to closure(I).
            ii. Goto(I, X): If I is the set of items that can generate the viable prefix γ,
                then goto(I, X) is the set of items that can generate the viable prefix γX.
                1) Formal definition: goto(I, X) is the closure of the set of all items
                   [A -> α X . β] such that [A -> α . X β] is in I.
            iii. LR(0) item: Items constructed in this fashion are called LR(0) items. The
                 reason for the 0 is that these items are created without considering the
                 next symbol that could appear in the input. If we do consider the next input
                 symbol, we can make our items more sophisticated; such items are called
                 LR(1) items. LR(0) items are used for SLR parsing (Simple LR parsing),
                 whereas LR(1) items are used for LR parsing.
         c. Constructing sets of items
            i. Augment the grammar with a distinguished start symbol S' and add one
               production [S' -> S].
            ii. The algorithm:
                1) Basic idea: For each set of items I, find a subset of items, I', that have
                   the dot before a common grammar symbol (e.g., . X). Compute goto(I', X)
                   (which is already closed) and add the resulting set of items to the
                   sets-of-items collection.
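The closure and goto operations above translate almost directly into code. In this sketch an item is a triple (lhs, rhs, dot), and the grammar S' -> S, S -> ( S ) | x is an assumed example.

```python
# Assumed toy grammar, augmented with S' -> S
GRAMMAR = {"S'": [('S',)], 'S': [('(', 'S', ')'), ('x',)]}

def closure(items, grammar):
    """Items are (lhs, rhs, dot) triples; rhs is a tuple of grammar symbols."""
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:   # dot sits before a nonterminal B
            for prod in grammar[rhs[dot]]:
                item = (rhs[dot], prod, 0)           # add B -> . γ
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)

def goto(items, X, grammar):
    """Move the dot over X in every item where it can, then close the result."""
    moved = {(l, r, d + 1) for l, r, d in items if d < len(r) and r[d] == X}
    return closure(moved, grammar)
```

For example, closure({[S' -> . S]}) pulls in [S -> . ( S )] and [S -> . x], since the dot stands before the nonterminal S.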
            iii. If each of the resulting sets of items is considered a state, and the goto
                 function is considered the transition function, then you get a deterministic
                 finite automaton that recognizes the viable prefixes of the grammar.
         d. Shift/Reduce conflicts: Shift/reduce conflicts occur when a set contains two
            items of the following form:
               A -> γ α .
               B -> β α . λ
            where γ and β may be null and λ is non-null. Further, the next input symbol, a,
            can both follow the nonterminal A and be the first symbol in a string derived
            from λ. In this case we do not know whether we should reduce to A or shift a onto
            the stack, resulting in a shift/reduce conflict. To determine whether we have a
            shift/reduce conflict, we must compute the Follow set for A and the First set for
            λ.
            i. Follow(X): The set of terminals that can appear immediately to the right of X
               in some sentential form (i.e., S =>* α X a β).
            ii. First(X): The set of terminals that begin strings derived from X (X may be a
                terminal, a nonterminal, or a string of symbols).
            We only have a conflict if a is in both Follow(A) and First(λ). If a is in only
            one of these sets, then we can perform a reduction or a shift, depending on the
            set in which a resides.
         e. Constructing the action table
            1) Construct the collection of sets of LR(0) items.
            2) Set state_i = set_i.
            3) Actions for terminal symbols: Consider each item in set_i:
               a) Items of the form [A -> α . a β]: if goto(set_i, a) = set_j, then set
                  action[i, a] = shift j.
               b) Items of the form [A -> α .]: set action[i, a] to reduce A -> α for all a
                  in Follow(A). A may not be S', the start symbol.
               c) [S' -> S .]: action[i, $] = accept, where $ represents "end of string".
            4) Actions for nonterminal symbols in state_i: if goto(set_i, A) = set_j, then
               action[i, A] = shift j, where A gets shifted onto the stack.
            5) All entries not defined by rules 3 and 4 are made "error".
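Step 1) of the table construction, building the collection of sets of LR(0) items, can be sketched as a worklist algorithm; the transitions it records are exactly the DFA of c.iii. The closure and goto helpers are repeated so the sketch is self-contained, and the toy grammar S' -> S, S -> ( S ) | x is again an assumed example.

```python
GRAMMAR = {"S'": [('S',)], 'S': [('(', 'S', ')'), ('x',)]}   # assumed toy grammar

def closure(items, grammar):
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for prod in grammar[rhs[dot]]:
                if (rhs[dot], prod, 0) not in result:
                    result.add((rhs[dot], prod, 0))
                    work.append((rhs[dot], prod, 0))
    return frozenset(result)

def goto(items, X, grammar):
    return closure({(l, r, d + 1) for l, r, d in items
                    if d < len(r) and r[d] == X}, grammar)

def canonical_collection(grammar, start="S'"):
    """Build every set of items reachable from I0 = closure({[S' -> . S]})."""
    I0 = closure({(start, grammar[start][0], 0)}, grammar)
    states, trans, work = [I0], {}, [I0]
    while work:
        I = work.pop()
        for X in {r[d] for _, r, d in I if d < len(r)}:   # symbols right after a dot
            J = goto(I, X, grammar)
            if J not in states:                           # new state discovered
                states.append(J)
                work.append(J)
            trans[(states.index(I), X)] = states.index(J) # DFA transition on X
    return states, trans
```

For this grammar the construction yields six states; turning trans into the ACTION/GOTO tables then follows rules 3) through 5) above, using Follow sets for the reduce entries.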
            6) The initial state is the one constructed from the set of items containing
               [S' -> . S].
      4. Why SLR(1) parsers can fail
         a. Suppose a set of items contains the item [A -> α .] and that a is in Follow(A).
            In some situations, when the state containing this item appears on top of the
            stack, the viable prefix β α is such that β A cannot be followed by a in any
            right-sentential form. SLR(1) nevertheless reduces, because a ∈ Follow(A).
         b. LR(0) states do not include the information that a cannot follow A in this
            state, since they contain no lookahead information. It is possible to build this
            information into a state by including one symbol of lookahead. Such states are
            called LR(1) states, and they are used in LR(1) parsing.
         c. LALR(1) Parsers: The parsers produced by Yacc and Bison are LALR(1) parsers. An
            LALR(1) grammar (lookahead-LR) is a slightly weaker form of grammar than LR(1),
            but it accepts most of the grammars that arise in practice and it is much more
            efficient. The problem with LR(1) grammars is that they cause a state explosion
            when the lookahead information is added to the states: the same LR(0) state is
            duplicated multiple times, with a different set of lookahead symbols in each
            copy. LALR(1) eliminates the state explosion by collapsing these states into one
            and merging their lookahead symbols. This merging can occasionally introduce
            conflicts, but that rarely happens in practice.