Parsing Notes
I. Top-Down Parsing (also called recursive-descent parsing)
A. A parsing technique that builds the parse tree top-down, starting
at the root, and always expanding the left-most, non-terminal node
first
1. Traces out a left-most derivation
2. When at node n, labeled with nonterminal A, select one of
the productions for A and construct children at n for
the symbols on the right side of the production
3. Predictive parser: A recursive-descent parser that needs no
backtracking. In other words, given the current input
symbol a and the nonterminal A to be expanded,
there is a unique production among the alternative
productions for A (A -> α1 | α2 | ... | αk)
that derives a string beginning with a.
a. LL(1) Grammar: A grammar that can be parsed by a
non-backtracking, predictive parser using one lookahead symbol.
b. The first L stands for scanning left-to-right, the second
L stands for producing a left-most derivation, and the 1
stands for one symbol of lookahead.
c. To determine whether each input symbol a selects
a unique production for A, we must
compute the First and Follow sets for A,
and the Predict set for each production of A:
i. First(X): First(X) is the set of terminals that begin
strings derived from X (X may be either a terminal, a
nonterminal, or a string of symbols; in our case X is
the nonterminal A). If X has productions numbered
p1 to pn, then:
First(X) = First(p1) ∪ First(p2) ∪ ... ∪ First(pn)
where pi is the right hand side string of
production i.
Here is an algorithm for computing First(X):
1. if X is a terminal, First(X) is {X}
2. if X -> ε then put ε in First(X)
3. if X -> Y1Y2...Yn then put First(Y1) - {ε} in First(X)
4. if X -> Y1Y2...Yn and Y1, Y2, ..., Yi-1 can all derive the empty string,
then add First(Yi) - {ε} to First(X). More formally, we
would state this rule as "if ε ∈ First(Yj) for 1 <= j <= i-1, then
put First(Yi) - {ε} in First(X)". If every one of Y1, ..., Yn can
derive the empty string, then also put ε in First(X).
ii. Follow(X): The set of terminals that can appear immediately
to the right of X in some sentential form
(i.e., S =>* α X a β).
Here is an algorithm for computing Follow(X):
1) put $ (the end-of-input marker) in Follow(S), where S is
the start symbol
2) if A -> αXβ, then put First(β) - {ε} in Follow(X)
3) if A -> αX, then put Follow(A) in Follow(X).
4) if A -> αXβ and β can derive the
empty string, then put Follow(A) in Follow(X).
iii. Predict(A -> α): The predict set for a
production is the first set for that production,
augmented with the Follow set for A if that production
can generate the empty string. Formally we can write
the Predict set as:
Predict(A -> α) = (First(α) - {ε})
∪ (if α =>* ε then Follow(A) else ∅)
If, for each non-terminal A, the Predict sets of the
productions of A are pairwise disjoint, then the grammar is LL(1).
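The First, Follow, and Predict definitions above can be sketched in Python as a fixed-point computation. This is only an illustrative sketch: the grammar representation (a dict from nonterminal to a list of right-hand-side tuples, with the empty tuple standing for an ε-production), the function names, and the small expression grammar are assumptions made for the example, not part of these notes.

```python
EPS = "ε"   # stands for the empty string
END = "$"   # end-of-input marker

def first_of_string(symbols, first):
    """First set of a string of grammar symbols (rules 3 and 4)."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)                 # every symbol (or none at all) can vanish
    return result

def compute_sets(grammar, start):
    """Return (first, follow) for a grammar {nonterminal: [rhs tuples]}."""
    nonterminals = set(grammar)
    terminals = {s for rhss in grammar.values() for rhs in rhss for s in rhs}
    terminals -= nonterminals
    first = {t: {t} for t in terminals}
    first.update({n: set() for n in nonterminals})
    follow = {n: set() for n in nonterminals}
    follow[start].add(END)
    changed = True
    while changed:                  # repeat until no set grows any more
        changed = False
        for lhs, rhss in grammar.items():
            for rhs in rhss:
                new_first = first_of_string(rhs, first)
                if not new_first <= first[lhs]:
                    first[lhs] |= new_first
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in nonterminals:
                        continue
                    tail = first_of_string(rhs[i + 1:], first)
                    new_follow = tail - {EPS}
                    if EPS in tail:             # the rest of the rhs can vanish
                        new_follow |= follow[lhs]
                    if not new_follow <= follow[sym]:
                        follow[sym] |= new_follow
                        changed = True
    return first, follow

def predict(lhs, rhs, first, follow):
    """Predict set of the production lhs -> rhs."""
    f = first_of_string(rhs, first)
    return (f - {EPS}) | (follow[lhs] if EPS in f else set())

def is_ll1(grammar, first, follow):
    """True if the Predict sets of each nonterminal are pairwise disjoint."""
    for lhs, rhss in grammar.items():
        seen = set()
        for rhs in rhss:
            p = predict(lhs, rhs, first, follow)
            if seen & p:
                return False
            seen |= p
    return True

# Hypothetical example grammar: E -> T E'   E' -> + T E' | ε   T -> id | ( E )
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("id",), ("(", "E", ")")],
}
FIRST, FOLLOW = compute_sets(GRAMMAR, "E")
print(sorted(predict("E'", (), FIRST, FOLLOW)))   # ['$', ')']
print(is_ll1(GRAMMAR, FIRST, FOLLOW))             # True
```

For the ε-production of E', First contributes nothing, so the Predict set is exactly Follow(E'), which is why that production is chosen on ) or end of input.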
B. Usefulness: Top-down parsing is useful for recognizing:
1) command-based languages where each command begins with a unique
command name
2) flow-of-control constructs because they
are usually tagged by unique keywords, such as if, while, for, etc.
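As an illustration of why keyword-tagged constructs suit top-down parsing, here is a minimal recursive-descent sketch in Python. The tiny statement grammar (stmt -> if id then stmt | while id do stmt | print id), the tuple-shaped parse trees, and all class and function names are assumptions made for this example only.

```python
KEYWORDS = {"if", "then", "while", "do", "print"}

class ParseError(Exception):
    """Raised when the input does not match the grammar."""

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks end of input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, token):
        if self.peek() != token:
            raise ParseError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def identifier(self):
        token = self.peek()
        if token in KEYWORDS or token == "$":
            raise ParseError(f"expected an identifier, found {token!r}")
        self.pos += 1
        return token

    def stmt(self):
        # One lookahead token selects the production: no backtracking needed.
        token = self.peek()
        if token == "if":
            self.expect("if")
            cond = self.identifier()
            self.expect("then")
            return ("if", cond, self.stmt())
        if token == "while":
            self.expect("while")
            cond = self.identifier()
            self.expect("do")
            return ("while", cond, self.stmt())
        if token == "print":
            self.expect("print")
            return ("print", self.identifier())
        raise ParseError(f"no production of stmt starts with {token!r}")

def parse(tokens):
    parser = Parser(tokens)
    tree = parser.stmt()
    parser.expect("$")               # all input must be consumed
    return tree

print(parse("while x do if y then print z".split()))
# ('while', 'x', ('if', 'y', ('print', 'z')))
```

Each production of stmt begins with a different keyword, so the Predict sets are trivially disjoint and the parser is predictive.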
C. Requirements
1. No left recursion: A grammar is left recursive if there is
a nonterminal A such that A =>+ A α
2. No two productions for the same nonterminal start with the same prefix
a. Can use left factoring to factor out a common prefix
i. If A -> α β1 | α β2
then
A -> α A'
A' -> β1 | β2
is the factored form
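The factoring step can be sketched in Python as follows. This is a single round of left factoring under simplifying assumptions: alternatives are tuples of symbols, the empty tuple stands for ε, and the new nonterminal is named by appending a prime to the original name (the helper names are hypothetical, and a real tool would also check that the new name is unused and repeat the step until no common prefixes remain).

```python
def common_prefix(sequences):
    """Longest prefix shared by every sequence in the list."""
    prefix = []
    for column in zip(*sequences):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return tuple(prefix)

def left_factor(lhs, alternatives):
    """One round of left factoring for the alternatives of lhs.

    Returns a dict {nonterminal: [right-hand sides]}.
    """
    groups = {}
    for alt in alternatives:
        key = alt[0] if alt else None       # group by first symbol
        groups.setdefault(key, []).append(alt)

    result = {lhs: []}
    fresh = lhs
    for key, group in groups.items():
        if key is None or len(group) == 1:
            result[lhs].extend(group)       # nothing to factor
            continue
        prefix = common_prefix(group)
        fresh += "'"                        # A', A'', ... (assumed unused)
        result[lhs].append(prefix + (fresh,))
        result[fresh] = [alt[len(prefix):] for alt in group]
    return result

factored = left_factor("stmt", [
    ("if", "exp", "then", "stmt"),
    ("if", "exp", "then", "stmt", "else", "stmt"),
    ("other",),
])
for nonterminal, rhss in factored.items():
    print(nonterminal, "->", rhss)
```

Here the two if-alternatives share the prefix if exp then stmt, so they become stmt -> if exp then stmt stmt' with stmt' -> ε | else stmt.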
II. Bottom-Up Parsing
A. So called because this technique grows the parse tree from the
leaves up
1. The building of the parse tree traces out a right-most derivation
in reverse.
2. Given a string of symbols, α, and a lookahead symbol, either
a. Shift: shift the input symbol onto the end of the string, or
b. Reduce: replace the last k symbols in the string
(call the string formed by these k symbols β)
with a nonterminal A using the production A -> β.
3. Example:
(1) E -> E + E
(2) E -> E * E
(3) E -> ( E )
(4) E -> id
Input        Stack      Action
id+id*id$    (empty)    Shift
+id*id$      id         Reduce by E -> id
+id*id$      E          Shift
id*id$       E+         Shift
*id$         E+id       Reduce by E -> id
*id$         E+E        Shift
id$          E+E*       Shift
$            E+E*id     Reduce by E -> id
$            E+E*E      Reduce by E -> E * E
$            E+E        Reduce by E -> E + E
$            E          accept
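The trace above can be reproduced with a small shift-reduce loop. The sketch below is specific to this example grammar and rests on one assumption that the notes do not state at this point: when E op E is on top of the stack, the choice between shifting and reducing is made by operator precedence (* binds tighter than +, both left-associative). The function names and the precedence table are illustrative only.

```python
PRECEDENCE = {"+": 1, "*": 2}    # assumption: * binds tighter than +

def next_action(stack, lookahead):
    """Return (kind, rhs) where kind is shift, reduce, accept or error."""
    if stack[-1:] == ["id"]:
        return "reduce", ["id"]                      # E -> id
    if stack[-3:] == ["(", "E", ")"]:
        return "reduce", ["(", "E", ")"]             # E -> ( E )
    if (len(stack) >= 3 and stack[-3] == "E" and stack[-1] == "E"
            and stack[-2] in PRECEDENCE):
        op = stack[-2]
        # Reduce unless the lookahead operator binds tighter than op.
        if PRECEDENCE.get(lookahead, 0) <= PRECEDENCE[op]:
            return "reduce", ["E", op, "E"]
        return "shift", None
    if lookahead == "$":
        return ("accept", None) if stack == ["E"] else ("error", None)
    return "shift", None

def shift_reduce(tokens):
    """Parse tokens; return the list of (input, stack, action) steps."""
    remaining = list(tokens) + ["$"]
    stack, trace = [], []
    while True:
        kind, rhs = next_action(stack, remaining[0])
        if kind == "reduce":
            label = "Reduce by E -> " + " ".join(rhs)
        else:
            label = "Shift" if kind == "shift" else kind
        trace.append(("".join(remaining), "".join(stack), label))
        if kind == "accept":
            return trace
        if kind == "error":
            raise SyntaxError(f"cannot parse at {remaining[0]!r}")
        if kind == "shift":
            stack.append(remaining.pop(0))
        else:
            del stack[-len(rhs):]                    # pop the k symbols of β
            stack.append("E")                        # push the nonterminal

for step in shift_reduce(["id", "+", "id", "*", "id"]):
    print("%-12s %-10s %s" % step)
```

The step with E+E on the stack and * as the lookahead is where the precedence assumption matters: shifting there is what makes E * E reduce before E + E.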
4. The sequence of reductions represents a rightmost derivation
in reverse:
E => E + E
=> E + E * E
=> E + E * id
=> E + id * id
=> id + id * id
5. An ambiguous grammar will cause a parser to confront one of
two possible types of conflicts:
a. Shift/Reduce conflict: The parser does not know whether to
shift the symbol onto the stack or to replace a portion of
the string on the stack with a nonterminal (i.e., to reduce
the stack).
i. classic example is the dangling else problem:
conditional -> if exp then stmt
| if exp then stmt else stmt
Given the following configuration, the parser does not
know whether to shift the else onto the stack or
to replace the "if exp then stmt" symbols with the
conditional nonterminal:
Stack                       Input
... if exp then stmt        else ... $
b. Reduce/Reduce conflict: Given the current lookahead symbol,
there are two nonterminals that could be used to replace
symbols on the stack.
i. example
VarDecl -> Type []
ArrayAccess -> Name [exp]
Type -> id | int
Name -> id
Given the following configuration and allowing only one
lookahead symbol, the parser does not know whether to
replace id with Type or with
Name:
Stack        Input
... id       [ ... $
This ambiguity could be resolved by looking ahead two
symbols. Hence, this grammar can be recognized by
a parser using two lookahead symbols but not by
one using only one lookahead symbol.
V. LR Parsers
A. LR(k) Grammar: A grammar that can be recognized by a parser
performing a rightmost derivation in reverse using k
lookahead symbols.
1. The L stands for scanning left-to-right and the R stands for
constructing a rightmost derivation in reverse
2. A grammar is LR(k) if we can recognize the occurrence of the
right side of a production, having seen all of the string
derived by that right side, plus an additional k
input symbols.
3. A grammar is LL(k) if we can recognize the occurrence of the
right side of a production, having seen only the first
k symbols of what its right side derives. Hence
LR grammars describe more languages than LL grammars.
B. Why use LR parsing?
1. Can recognize virtually all programming language constructs
for which context free grammars can be written
2. Most general nonbacktracking shift-reduce technique known,
yet it is as efficient as other shift-reduce techniques
3. Grammars recognized by predictive parsers are a proper
subset of grammars recognizable by LR parsers
C. LR Parsing Algorithm
1. To create an LR parser, we construct an action table in which the
rows correspond to states and the columns correspond to terminals
and non-terminals. Each table entry holds one of four possible
values: 1) shift, 2) reduce, 3) accept, and 4) error
a. shift state #: Indicates that the current symbol should be
pushed onto the stack and that the automaton should
transition to state #. The state # should also
be pushed onto the stack.
b. reduce production: Indicates that the right side of the
production is on top of the stack and should be replaced
by its left side. The production is typically a number
that refers to an entry in a production table. That entry
gives the number of state/symbol pairs to be popped and
the non-terminal to be pushed back onto the input stream.
c. accept: Accept the input string as valid if there are no
symbols left on the input
d. error: The current state does not have a valid transition
for the current input symbol. The input string is
invalid.
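The four kinds of table entry can be seen at work in a small driver loop. The sketch below follows the convention described above, in which a reduce pops state/symbol pairs and pushes the nonterminal back onto the input so that it is then shifted like any other symbol. The action table is hand-built for an assumed example grammar, (1) S -> ( S ) and (2) S -> x; the state numbers and the function name are illustrative.

```python
# Production number -> (left side, number of symbols on the right side).
PRODUCTIONS = {1: ("S", 3), 2: ("S", 1)}

# Hand-built table for S -> ( S ) | x. Missing entries mean "error".
ACTION = {
    (0, "("): ("shift", 1), (0, "S"): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "("): ("shift", 1), (1, "S"): ("shift", 4), (1, "x"): ("shift", 3),
    (2, "$"): ("accept", None),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}

def lr_parse(tokens):
    """Return the production numbers used, in the order they were reduced."""
    remaining = list(tokens) + ["$"]
    stack = [(None, 0)]                 # (symbol, state) pairs; start state 0
    reductions = []
    while True:
        state = stack[-1][1]
        kind, arg = ACTION.get((state, remaining[0]), ("error", None))
        if kind == "shift":
            stack.append((remaining.pop(0), arg))
        elif kind == "reduce":
            lhs, length = PRODUCTIONS[arg]
            del stack[len(stack) - length:]   # pop `length` pairs
            remaining.insert(0, lhs)          # push the nonterminal onto the input
            reductions.append(arg)
        elif kind == "accept":
            return reductions
        else:
            raise SyntaxError(
                f"no action in state {state} for {remaining[0]!r}")

print(lr_parse(["(", "(", "x", ")", ")"]))   # [2, 1, 1]
```

The result [2, 1, 1] is the rightmost derivation S => ( S ) => ( ( S ) ) => ( ( x ) ) read in reverse.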
3. Constructing an action table for an SLR grammar
a. Item: A production and a position in the production
i. An item represents how much of the production is
currently on the stack and how much is left
to be recognized
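ii. The portion of an item before the dot is the part of the
right side that is already on the stack, so it forms the end
of a viable prefix. A viable prefix is a prefix of a
right-sentential form that does not extend past the right
end of the handle; viable prefixes are exactly the strings
of grammar symbols that can appear on the stack.
Example: if α is on the stack below the right side of the
production A -> aXYb, then αa, αaX, αaXY, and αaXYb are
viable prefixes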
iii. Items are grouped into states: The items in a
state represent the productions that could
be recognized given the current contents of
the stack
b. Items are constructed using two operations:
i. Closure: Given an item A -> α . B β, the closure
operation adds an item for every production of B, since
a string derived from B is what the parser expects to
see next.
Given a set of items I,
1) every item in I is in closure(I)
2) If A -> α . B β is in closure(I)
and B -> gamma is a production, then add the
item B -> . gamma to closure(I), if it is not already there.
Apply this rule until no more new items can be added
to closure(I).
ii. GoTo(I, X): If I is the set of items that can generate
the viable prefix gamma, then goto(I, X) is
the set of items that can generate the viable prefix
gamma X.
1) Formal definition: goto(I, X) is the closure of the set
of all items [A -> α X . β] such that
[A -> α . X β] is in I.
iii. LR(0) item: The items constructed in this fashion are
called LR(0) items. The reason for the 0 is that
these items are created without considering the
next symbol that could appear in the input. It turns
out that if we consider the next symbol that could
appear in the input, we could make our items more
sophisticated (such items are called LR(1) items).
LR(0) items are used for SLR (Simple LR) parsing,
whereas LR(1) items are used for canonical LR parsing.
c. Constructing sets of items
i. Augment the Grammar with a distinguished start symbol S'
and add one production [S' -> S].
ii. The algorithm:
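1) Basic idea: Start with the set closure({[S' -> . S]}).
For each set of items I and each grammar symbol X that
appears immediately after a dot in I (e.g., . X),
compute goto(I, X), which advances the dot over X in
those items and takes the closure. Add
the resulting set of items to the sets-of-items
collection if it is not already there, and repeat until
no new sets can be added.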
iii. If each of the resulting set of items is considered a
state, and the goto function is considered the
transition function, then you get a deterministic
finite automaton that recognizes the viable prefixes
of the grammar.
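The closure and goto operations and the sets-of-items construction can be sketched in Python as below. The representation is an assumption made for the example: an LR(0) item is a tuple (left side, right side, dot position), a grammar is a dict from nonterminal to a list of right-side tuples, and the example grammar S -> ( S ) | x is hypothetical.

```python
def closure(items, grammar):
    """All items reachable from `items` by the closure rule."""
    result = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:   # dot is before nonterminal B
            for gamma in grammar[rhs[dot]]:
                item = (rhs[dot], gamma, 0)          # B -> . gamma
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)

def goto(items, symbol, grammar):
    """Advance the dot over `symbol`, then take the closure."""
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved, grammar)

def canonical_collection(grammar, start):
    """Return (states, transitions) of the LR(0) automaton."""
    augmented = dict(grammar)
    augmented[start + "'"] = [(start,)]              # S' -> S
    states = [closure({(start + "'", (start,), 0)}, augmented)]
    transitions = {}
    for i, state in enumerate(states):               # `states` grows as we go
        symbols = {rhs[dot] for _, rhs, dot in state if dot < len(rhs)}
        for symbol in sorted(symbols):
            target = goto(state, symbol, augmented)
            if target not in states:
                states.append(target)
            transitions[(i, symbol)] = states.index(target)
    return states, transitions

GRAMMAR = {"S": [("(", "S", ")"), ("x",)]}
STATES, TRANSITIONS = canonical_collection(GRAMMAR, "S")
print(len(STATES), "states")                          # 6 states
for (source, symbol), target in sorted(TRANSITIONS.items()):
    print(f"goto({source}, {symbol}) = {target}")
```

Each entry of TRANSITIONS is one edge of the deterministic finite automaton over viable prefixes; state 0 is the set built from [S' -> . S].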
d. Shift/Reduce conflicts: Shift/reduce conflicts occur when there
are two items in a set of the following form:
A -> gamma α .
B -> β α . lambda
where gamma and β may be null and lambda is non-null.
Further, the next input symbol, a, can both follow
the nonterminal A and can be the first symbol in
a string derived from lambda. In this case we do not know
whether we should reduce to A or shift a onto the
stack, thus resulting in a shift/reduce conflict.
In order to determine whether or not we have a shift/reduce
conflict, we must compute the Follow set for A and
the First set for lambda.
i. Follow(X): The set of terminals that can appear immediately
to the right of X in some sentential form
(i.e., S =>* α X a β).
ii. First(X): First(X) is the set of terminals that begin
strings derived from X (X may be either a terminal, a
nonterminal, or a string of symbols).
We only have a conflict if a is in Follow(A) and
First(lambda). If a is in only one of these sets, then
we can perform a reduction or shift, depending on the set in
which a resides.
e. Constructing the action table
1) Construct the collection of sets of LR(0) items
2) Let state i correspond to set_i.
3) Actions for terminal symbols: Consider each
item in set_i:
a) Items of the form [A -> α . a β]: If
Goto(set_i, a) = set_j, then
set action[i, a] = shift j.
b) Items of the form [A -> α .]: set action[i, a] to
reduce A -> α for all a in
Follow(A). A may not be S', the start symbol.
c) [S' -> S .]: action[i, $] = accept, where $ represents
"end of string".
4) Actions for nonterminal symbols in state i: If
Goto(set_i, A) = set_j, then
action[i, A] = shift j, where A gets shifted onto the stack.
5) All entries not defined by rules 3 and 4 are made "error".
6) The initial state is the one constructed from the set of
items containing [S' -> .S].
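The table-filling rules can be sketched in Python as follows. To keep the example short, the collection of LR(0) item sets, the goto function, and the Follow sets are written out by hand for an assumed grammar S -> ( S ) | x rather than computed; items are (left side, right side, dot position) tuples, and all names are illustrative. Trying to fill one entry with two different actions is reported as a conflict.

```python
# Hand-written LR(0) collection for S' -> S, S -> ( S ) | x (an assumption).
PAREN = ("(", "S", ")")
STATES = [
    {("S'", ("S",), 0), ("S", PAREN, 0), ("S", ("x",), 0)},   # set 0
    {("S", PAREN, 1), ("S", PAREN, 0), ("S", ("x",), 0)},     # set 1
    {("S'", ("S",), 1)},                                      # set 2
    {("S", ("x",), 1)},                                       # set 3
    {("S", PAREN, 2)},                                        # set 4
    {("S", PAREN, 3)},                                        # set 5
]
GOTO = {(0, "("): 1, (0, "S"): 2, (0, "x"): 3,
        (1, "("): 1, (1, "S"): 4, (1, "x"): 3,
        (4, ")"): 5}
FOLLOW = {"S": {")", "$"}}

def build_action_table(states, goto, follow, start="S'"):
    """Fill action[i, symbol]; entries left out are errors (rule 5)."""
    action = {}

    def set_entry(key, value):
        if action.get(key, value) != value:
            raise ValueError(f"conflict at {key}: {action[key]} vs {value}")
        action[key] = value

    for i, items in enumerate(states):
        for lhs, rhs, dot in items:
            if dot < len(rhs):
                # Rules 3a and 4: shift the terminal or nonterminal after the dot.
                symbol = rhs[dot]
                set_entry((i, symbol), ("shift", goto[(i, symbol)]))
            elif lhs == start:
                # Rule 3c: [S' -> S .] accepts on end of string.
                set_entry((i, "$"), ("accept", None))
            else:
                # Rule 3b: reduce on every terminal in Follow(A).
                for terminal in follow[lhs]:
                    set_entry((i, terminal), ("reduce", (lhs, rhs)))
    return action

TABLE = build_action_table(STATES, GOTO, FOLLOW)
for key in sorted(TABLE):
    print(key, TABLE[key])
```

A ValueError from set_entry corresponds to a shift/reduce or reduce/reduce conflict, which means the grammar is not SLR(1).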
4. Why SLR(1) parsers can fail
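a. Suppose a set of items contains item [A -> α .] and that
a is in Follow(A), so the SLR table calls for a reduction
on a. In some situations, when the state
containing this item appears on top of the stack, the viable
prefix β α on the stack is such that β A
cannot be followed by a in any right-sentential form. The
reduction is then invalid, and it may conflict with a
legitimate shift or reduce action on a.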
b. LR(0) states do not include the information that a
cannot follow A in this state since they do not include
lookahead information. It is possible to build this
information into a state by including one symbol of
lookahead. Such states are called LR(1) states, and they
are used in LR(1) parsing.
c. LALR(1) Parsers: The parsers produced by Yacc and Bison are
LALR(1) parsers. LALR(1) (lookahead-LR) parsing handles a
slightly smaller class of grammars than canonical LR(1), but
that class covers most of the grammars that arise in practice
and the tables are much smaller. The problem with canonical
LR(1) parsers is that adding lookahead information to the
states causes a state explosion: the same set of LR(0) items
is duplicated multiple times, with a different set of
lookahead symbols in each copy. LALR eliminates the state
explosion by collapsing the states that contain the same
items into a single state and merging their lookahead
symbols. Merging can occasionally introduce reduce/reduce
conflicts that the LR(1) parser did not have, but this
rarely happens.
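The merging step, and the way it can introduce a conflict, can be sketched in Python. The representation is an assumption for the example: an LR(1) item is a tuple (left side, right side, dot position, lookahead). The two states below are the ones reached after reading "a c" and "b c" in the grammar S -> a A d | b B d | a B e | b A e, A -> c, B -> c, which is LR(1) but not LALR(1).

```python
from collections import defaultdict

def core(state):
    """The LR(0) items of a state, with lookaheads stripped."""
    return frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)

def merge_by_core(states):
    """Collapse LR(1) states with identical cores, merging lookaheads."""
    merged = defaultdict(set)
    for state in states:
        merged[core(state)] |= set(state)
    return [frozenset(items) for items in merged.values()]

def reduce_reduce_conflicts(state):
    """Lookaheads on which two different complete items both reduce."""
    by_lookahead = defaultdict(set)
    for lhs, rhs, dot, lookahead in state:
        if dot == len(rhs):                      # dot at the end: a reduction
            by_lookahead[lookahead].add((lhs, rhs))
    return {a for a, prods in by_lookahead.items() if len(prods) > 1}

# LR(1) states reached after "a c" and after "b c" in the grammar above.
AFTER_AC = frozenset({("A", ("c",), 1, "d"), ("B", ("c",), 1, "e")})
AFTER_BC = frozenset({("A", ("c",), 1, "e"), ("B", ("c",), 1, "d")})

MERGED = merge_by_core([AFTER_AC, AFTER_BC])
print(len(MERGED))                                   # 1
print(sorted(reduce_reduce_conflicts(AFTER_AC)))     # []
print(sorted(reduce_reduce_conflicts(MERGED[0])))    # ['d', 'e']
```

Before merging, each state reduces to A on one lookahead and to B on the other, so there is no conflict. After merging, both reductions are possible on d and on e.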