Lexing Notes

Brad Vander Zanden


1. Lexical analysis involves breaking a program into a stream of tokens
   that can be passed to a parser.

2. Definitions

    a. Token: A name that stands for a set of strings

        e.g., id could stand for the set of all legitimate identifiers

    b. Pattern: The rule that describes the set of strings in the token set

    c. Lexeme: The value of the token (alternatively, the sequence
	of characters in the program that matches the pattern for
	the token)
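
	A minimal sketch in Python (using the standard re module; the
	token name ID and the sample input are made up for illustration)
	showing how the three terms relate:

	    import re

	    # Pattern: the rule describing the token's set of strings,
	    # written here as a regular expression.
	    ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

	    source = "count = count + 1"

	    # Each match of the pattern is a lexeme; the token name (ID)
	    # stays the same for every one of them.
	    for m in ID_PATTERN.finditer(source):
	        print("token = ID, lexeme =", m.group())
	    # prints: token = ID, lexeme = count   (twice)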

3. Specification of Tokens

    a. Rules may be constructed from

	i. union
	ii. concatenation
	iii. Kleene closure (*)
	iv. positive closure (+)

    b. Start with primitive elements (e.g., letters) and then combine
	them to form more powerful languages (see the sketch below)

	i. language = a set of strings over a fixed alphabet
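
	A small sketch (the set names are hypothetical) that treats
	languages as Python sets of strings and combines two primitive
	one-symbol languages with union and concatenation:

	    # A language is just a set of strings over a fixed alphabet.
	    letters = {"a", "b"}
	    digits = {"0", "1"}

	    def concat(A, B):
	        # Concatenation: every string of A followed by
	        # every string of B.
	        return {x + y for x in A for y in B}

	    print(letters | digits)         # union of the two sets
	    print(concat(letters, digits))  # {'a0', 'a1', 'b0', 'b1'}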

    c. Regular expressions

	i. epsilon is a regular expression denoting the set {epsilon}

	ii. if a is a symbol in an alphabet, then a is
		a regular expression that denotes {a}, the set
		containing the string a.

	iii. r | s = union
	     rs    = concatenation
	     r*    = Kleene closure
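
	The same operators written as Python regular expressions (re uses
	| for union, juxtaposition for concatenation, * for the Kleene
	closure, and + for the positive closure of 3a); the test strings
	are made up for illustration:

	    import re

	    # r | s : union -- matches either "cat" or "dog"
	    print(bool(re.fullmatch(r"cat|dog", "dog")))  # True

	    # rs : concatenation -- "a" followed by "b"
	    print(bool(re.fullmatch(r"ab", "ab")))        # True

	    # r* : Kleene closure -- zero or more a's
	    print(bool(re.fullmatch(r"a*", "")))          # True

	    # r+ : positive closure -- one or more a's
	    print(bool(re.fullmatch(r"a+", "")))          # False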

    d. ? means zero or one occurrence: for example, a?b is a regular
	expression denoting two possible strings, ab and b.

    e. [] means the union of everything between the brackets
	(e.g., [abc] = a|b|c)
    
    f. Use arrows to name regular expressions (e.g., digit -> [0-9])
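
	Putting d-f together in Python: the arrow notation is not itself
	executable, so named Python variables stand in for the named
	regular expressions, and the number pattern is an illustrative
	assumption:

	    import re

	    digit = r"[0-9]"          # digit  -> [0-9]
	    digits = digit + r"+"     # digits -> digit+
	    # ? makes the fractional part optional (0 or 1 occurrence)
	    number = digits + r"(\." + digits + r")?"

	    for s in ["42", "3.14", ".5"]:
	        print(s, bool(re.fullmatch(number, s)))
	    # 42 True, 3.14 True, .5 False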

4. Recognition of Tokens

    a. State transition diagrams

    b. There may be multiple transition diagrams, one per token; try them
	in order, putting diagrams that recognize longer lexemes first so
	the longest possible lexeme is matched, as in the sketch below
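
	A minimal hand-coded transition diagram in Python (the states,
	token names, and two-operator alphabet are assumptions for
	illustration); because the diagram checks for the longer lexeme
	<= before accepting <, "<=" is never mis-read as "<" followed
	by "=":

	    def relop(s, i):
	        # Transition diagram for < and <=, starting at position i.
	        # Returns (token, lexeme) on success, None if it fails.
	        if i < len(s) and s[i] == "<":
	            if i + 1 < len(s) and s[i + 1] == "=":
	                return ("LE", "<=")   # accepting state for <=
	            return ("LT", "<")        # accepting state for <
	        return None                   # fail; try the next diagram

	    print(relop("<= b", 0))   # ('LE', '<=')
	    print(relop("< b", 0))    # ('LT', '<')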