Lexing Notes
1. Lexical analysis involves breaking a program into a stream of tokens
that can be passed to a parser.
2. Definitions
a. Token: A set of strings that all go by the same name
e.g., id could stand for the set of all legitimate identifiers
b. Pattern: The rule that describes which strings belong to a token
c. Lexeme: The value of the token (alternatively, the sequence
of characters in the program that matches the pattern for
the token)
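
The three definitions above can be illustrated with a minimal sketch.
The token names (num, id, assign, plus), their patterns, and the
tokenize helper are all assumptions made for this example, not part
of the notes. Each pattern describes a token's set of strings, and
each matched piece of the input is a lexeme.

```python
import re

# Hypothetical token set: each token name is paired with its pattern.
PATTERNS = [
    ("num", r"[0-9]+"),
    ("id", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("assign", r"="),
    ("plus", r"\+"),
]


def tokenize(source):
    """Return a list of (token name, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        for name, pattern in PATTERNS:
            match = re.compile(pattern).match(source, pos)
            if match:
                # match.group() is the lexeme: the characters that
                # matched the pattern for this token.
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"no pattern matches at position {pos}")
    return tokens


print(tokenize("count = count + 1"))
# [('id', 'count'), ('assign', '='), ('id', 'count'), ('plus', '+'), ('num', '1')]
```

Both occurrences of count are different lexemes of the same token, id.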
3. Specification of Tokens
a. Rules may be constructed from
i. union
ii. concatenation
iii. Kleene closure (*)
iv. positive closure (+)
b. Start with primitive elements (e.g., letters) and then combine
them to form more powerful languages
i. language = a set of strings over a fixed alphabet
c. Regular expressions
i. epsilon is a regular expression denoting the set {epsilon},
the set containing only the empty string
ii. if a is a symbol in an alphabet, then a is
a regular expression that denotes {a}, the set
containing the string a.
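iii. if r and s are regular expressions denoting the languages
L(r) and L(s), then
r | s denotes the union of L(r) and L(s)
rs denotes the concatenation of L(r) and L(s)
r* denotes the Kleene closure of L(r)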
d. ? means zero or one occurrence: For example, a?b is a regular
expression denoting exactly two strings, ab and b.
e. [] means the union of the characters between the brackets
(e.g., [abc] is a | b | c, and [0-9] is any single digit)
f. Use arrows to name regular expressions (e.g., digit -> [0-9]) so
that the name can be reused in later expressions
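
The operators and shorthands in this section can be tried out with
Python's re module, whose syntax for |, concatenation, *, +, ?, and []
agrees with the notation used here. This is only a sketch: the names
digit, letter, ident, and number, and the matches helper, are
assumptions chosen for illustration.

```python
import re

# Named definitions, each built from earlier ones.
digit = r"[0-9]"                          # digit  -> [0-9]
letter = r"[A-Za-z]"                      # letter -> [A-Za-z]
ident = rf"{letter}({letter}|{digit})*"   # ident  -> letter (letter | digit)*
number = rf"{digit}+"                     # number -> digit+


def matches(pattern, text):
    """True if the whole of text is in the language denoted by pattern."""
    return re.fullmatch(pattern, text) is not None


assert matches(r"a|b", "a") and matches(r"a|b", "b")    # union
assert matches(r"ab", "ab")                              # concatenation
assert matches(r"a*", "") and matches(r"a*", "aaa")      # Kleene closure
assert matches(r"a+", "a") and not matches(r"a+", "")    # positive closure
assert matches(r"a?b", "ab") and matches(r"a?b", "b")    # zero or one a
assert not matches(r"a?b", "aab")
assert matches(ident, "x1") and not matches(ident, "1x")
assert matches(number, "42")
```

The only difference between * and + shown here is the empty string:
a* accepts it and a+ does not.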
4. Recognition of Tokens
a. State transition diagrams
b. A lexer may have several transition diagrams, which are tried in
order, so put the ones that recognize longer lexemes first
(e.g., the diagram for >= before the diagram for >)
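
The ordering rule can be seen in a small sketch that encodes each
transition diagram as a table and tries the diagrams one after
another. The diagram encoding, the token names ge and gt, and the
helpers run_diagram and next_token are assumptions made for this
example.

```python
# Each diagram is (token name, transitions, accepting states).
# State 0 is the start state; transitions map (state, char) -> state.
GE = ("ge", {(0, ">"): 1, (1, "="): 2}, {2})   # recognizes >=
GT = ("gt", {(0, ">"): 1}, {1})                # recognizes >


def run_diagram(diagram, text, pos):
    """Follow one diagram from pos; return the lexeme's end position,
    or None if the diagram does not stop in an accepting state."""
    _, transitions, accepting = diagram
    state = 0
    while pos < len(text) and (state, text[pos]) in transitions:
        state = transitions[(state, text[pos])]
        pos += 1
    return pos if state in accepting else None


def next_token(diagrams, text, pos=0):
    """Try the diagrams in order; the first one that accepts wins."""
    for diagram in diagrams:
        end = run_diagram(diagram, text, pos)
        if end is not None:
            return diagram[0], text[pos:end]
    raise ValueError(f"no diagram accepts at position {pos}")


print(next_token([GE, GT], ">= 1"))   # ('ge', '>=')
print(next_token([GE, GT], "> 1"))    # ('gt', '>')
print(next_token([GT, GE], ">= 1"))   # ('gt', '>')  wrong order
```

With the shorter diagram first, the input >= is cut after > and the
= is left behind, which is why the diagram for the longer lexeme
should be tried first.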