Introduction to Flex

<head>
<link rel="stylesheet" type="text/css" href="../cs461_notes.css" />
<style>
table
{
border-collapse:collapse;
}
table,th, td
{
border: 1px solid black;
padding: 5px;
}
</style>
</head>
<center>
<h1>Introduction to Flex</h1>
<h2>Brad Vander Zanden and Ray Byler</h2>
</center>
<hr>
<h2>Overview</h2>
<p>
This text is meant to provide a brief introduction to the Flex
lexical analyzer and to show how you integrate it with the Bison
parser to produce a compiler front end. The mechanics of Bison
itself will be covered later. Flex and Bison are essentially
improved versions of Lex and Yacc: they are more flexible and
produce faster code. For the homework and class discussion, we
will be using Flex and Bison. You are welcome to use different
lexical/parsing tools for your project. For example, JLex/JCup
is a lexer/parser pair for Java that is based on the Lex/Yacc
model, and you can find other Java lexer/parser generators built
on the same model by searching the internet.
<p>
You can use Flex and Bison independently, but they have been engineered
to work well together. Bison produces a parser from an input file
that you provide. The parser expects to receive a token stream from
a lexer of your choice, and it expects your lexer to provide it with
a function named yylex() that it can call to retrieve tokens from this
token stream. Flex generates the yylex()
function automatically when you provide it with a .l file (e.g., graph.l).
We will discuss how the .l file should be written later in these notes,
and we will discuss how to create an input file for bison in a separate
set of notes.

<p>
The yylex() function produced by Flex 
uses simulated finite-state machines (FSMs) to recognize strings (or 
lexemes) and then passes this information to the parser in the form of integer 
tokens. Flex generates these simulated FSMs from the regular expressions 
that you write. The parser 
parses (think about parse trees and context-free grammars) this sequence of 
tokens to verify that the statements formed conform to the grammar of the 
language.  The lexical analyzer is often one of the more time-consuming parts of a 
compiler front end because it is the only phase that has to read every single character of the input file.
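<p>
To make the idea of integer tokens concrete, here is a minimal sketch of a
scanner whose actions return token codes. The token names and values
(<tt>NUMBER</tt>, <tt>IDENT</tt>) are made up for illustration; with bison they
would come from the header that bison generates rather than from a hand-written
enum. The <tt>main</tt> function is only a stand-in for the parser.
<pre>
%option noyywrap
%{
#include &lt;stdio.h&gt;
/* Hypothetical token codes, defined by hand for this sketch only. */
enum { NUMBER = 258, IDENT = 259 };
%}

%%
[0-9]+                  { return NUMBER; }
[a-zA-Z_][a-zA-Z0-9_]*  { return IDENT; }
[ \t\n]+                { /* skip whitespace: no token is returned */ }
.                       { return yytext[0]; /* any other character stands for itself */ }
%%

/* Stand-in for the parser: pull tokens until yylex() returns 0 at end of input. */
int main(void) {
  int token;
  while ((token = yylex()) != 0)
    printf("token %d, lexeme \"%s\"\n", token, yytext);
  return 0;
}
</pre>
Each call to <tt>yylex()</tt> resumes scanning where the previous call stopped,
which is how a parser can ask for one token at a time.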

<hr>
<h2>A Sample Flex Specification</h2>

Here is a sample flex specification that reads lines from stdin and
checks to see whether each line contains a valid credit card number.
A valid credit card number is defined as one that has four groups of
four digits each, with adjacent groups separated by an optional space or 
dash (-).

<pre>
%option noyywrap
%{
/* * * * * * * * * * * *
 * * * DEFINITIONS * * *
 * * * * * * * * * * * */
%}

%{
// recognize whether or not a credit card number is valid
int line_num = 1;
%}

digit [0-9]
group {digit}{4}
%%

%{
/* * * * * * * * * 
 * * * RULES * * *
 * * * * * * * * */
%}
   /* The caret (^) says that a credit card number must start at the
      beginning of a line and the $ says that the credit card number
      must end the line. */
^{group}([ -]?{group}){3}$  { printf(" credit card number: %s\n", yytext); }

   /* The .* accumulates all the characters on any line that does not
      match a valid credit card number */
.* { printf("%d: error: %s \n", line_num, yytext); }
\n { line_num++; }
%%

/* * * * * * * * * * * 
 * * * USER CODE * * *
 * * * * * * * * * * *
 */
int main(int argc, char *argv[]) {
  yylex();
}
</pre>
Here are things to note about the above code:
<p>
<ol>
<li> To run the code, type the following commands:
   <pre>
flex credit_card.lex    // assumes you stored the specification in credit_card.lex
gcc lex.yy.c            
</pre>
   Flex specifications are conventionally stored in files with a <tt>.l</tt>
   or <tt>.lex</tt> extension. By default flex writes the generated scanner to a
   file named lex.yy.c. You can change this name (for example with flex's
   <tt>-o</tt> option), but it usually does not matter. 
<p>
<li> The specification is divided into three sections with "%%" delimiters
     placed between sections:
  <p>
    <ol>
      <li> A definitions section where you can 1) place code that you want
	   to go at the beginning of the generated scanner, 2)
	   define names that are in effect macros that can be expanded
	   within a pattern, and 3) define states that control when
	   rules are active (see <a href="#states">States</a> for more
	   details about how this is done). You place
	   code that should go at the beginning of the scanner
	   between %{ and %} delimiters. Normally you place 3 types of
	   code in these delimiters:
	   <ol>
	     <li> comments: You can use the %{ ... %} anywhere in either
	          the definitions or rules section to safely embed comments
	          in your code. flex is very picky about when and where it
	          is safe to place comments, so to be safe, you can always
	          place comments in a %{ %} block.
	     <li> .h files: When you are using your scanner with a bison-generated
	          parser, you will often include the token header that bison generates (traditionally <tt>y.tab.h</tt>)
	          in a %{ %} block, along with any other .h files that you
	          may need to use.
	     <li> function definitions that you plan to use in your actions
	          (see rules section below for a definition of actions):
	          Often you will define error-handling functions that you
	          can call from various action routines when you encounter
	          erroneous tokens.
	  </ol>
	   <p>
	     A name definition has the form:
	     <pre>
name pattern
</pre>     where <tt>name</tt> is the name of your macro and pattern is
           a regular expression. As you can see from the definition of
           <tt>group</tt> in the example specification, you 
           can use previously defined names in your patterns. 
	   <p>
      <li> A rules section where you have rules of the form:
<pre>
pattern action
</pre>
           A pattern is a regular expression and the action is normally
           C/C++ code to do something with the string that matches the
           pattern. The string that matches the pattern is placed in a
           pre-defined <tt>char *</tt> variable named <tt>yytext</tt>.
           As you can see from the example, actions are normally enclosed
           in C-style curly braces ({}).
	   <p>
      <li> A user code section where you can place code to invoke the
	   scanner. Flex generates a function named <tt>yylex</tt> that
	   you can call from <tt>main</tt> if you want to use your scanner
	   as a standalone application. <tt>yylex</tt> will read input from
	   stdin until it is exhausted, or until you tell it to return
	   with a return statement in an action.
	   Normally you will be using the scanner with a bison-generated
	   parser, and you will leave this section empty, because the parser
	   will call <tt>yylex</tt> for you.
      </ol>
	   <p>
  <li> You normally should keep track of the line count yourself, so that
       you can print out error messages when you encounter an unrecognizable
       token. Flex will keep track of the current line number in a variable 
       named <tt>yylineno</tt> if you include the following line at the
       very top of your flex specification:
<pre>
%option yylineno
</pre>
       However, yylineno is not in the POSIX specification for lex, so
       your lex specification will not be portable should you decide to
       use it.
<p>
   <li> Flex is very finicky about indenting. You will avoid any pitfalls
        if you start <i>all</i> flex commands on the first column. If you
        indent a flex command, such as a <tt>%{</tt> delimiter even one
        space, flex may or may not complain, but flex will not generate
        a valid lex.yy.c file and gcc will probably generate dozens of
        useless error messages.
   <p>
   <li> Annoyingly, the generated scanner calls a function named <tt>yywrap</tt>
        when it reaches the end of its input, but flex does
        not define that function for you. You typically won't need it if you have only
        one input file. You can tell flex not to use this function by
        placing the directive:
<pre>
%option noyywrap
</pre>
        at the top of your flex file. Alternatively you can tell flex
        to find a default version of <tt>yywrap</tt> in the <tt>fl</tt>
        (flex) library, with the <tt>-lfl</tt> compilation flag:
<pre>
gcc lex.yy.c -lfl
</pre>
  </ol>
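<p>
As an illustration of the <tt>%option yylineno</tt> note above, the following
sketch lets flex maintain the line count instead of a hand-written counter. The
rules and messages are invented for this example, and it assumes you are willing
to depend on a flex-specific feature.
<pre>
%option noyywrap
%option yylineno
%{
#include &lt;stdio.h&gt;
%}

%%
[0-9]+      { printf("line %d: number %s\n", yylineno, yytext); }
[ \t\n]+    { /* skip whitespace; flex updates yylineno for each newline it scans */ }
.           { fprintf(stderr, "line %d: unexpected character '%s'\n", yylineno, yytext); }
%%

int main(void) {
  yylex();
  return 0;
}
</pre>
It can be built with the same <tt>flex</tt> and <tt>gcc</tt> commands shown
earlier; no <tt>-lfl</tt> flag should be needed because of
<tt>%option noyywrap</tt>.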
<hr>
<h2> Patterns </h2>
<p>
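Flex looks for 
the longest possible match (i.e., it is a greedy matcher).  
One consequence is that a scanner generally runs faster when its rules 
match longer tokens, because the fixed per-token work (setting up yytext and running an action) is done less often.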
<p>
Generally, you should start with simple elements (e.g. letters) and then combine 
them to form more powerful expressions/languages.
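<p>
One way to build up patterns in layers is with name definitions. The sketch
below is only an illustration: the names <tt>digit</tt>, <tt>letter</tt>,
<tt>id</tt>, <tt>integer</tt>, and <tt>real</tt> are chosen for this example and
are not part of flex.
<pre>
%option noyywrap
%{
#include &lt;stdio.h&gt;
%}

digit    [0-9]
letter   [a-zA-Z_]
id       {letter}({letter}|{digit})*
integer  {digit}+
real     {digit}+"."{digit}*

%%
{id}       { printf("identifier: %s\n", yytext); }
{real}     { printf("real: %s\n", yytext); }
{integer}  { printf("integer: %s\n", yytext); }
[ \t\n]+   { /* ignore whitespace */ }
.          { printf("unexpected character: %s\n", yytext); }
%%

int main(void) {
  yylex();
  return 0;
}
</pre>
Because each name is defined in terms of earlier ones, a change to
<tt>digit</tt> or <tt>letter</tt> carries through to every pattern that uses it.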
<p>
Here is a list of the most useful patterns. You can find a
complete list at <a href="http://flex.sourceforge.net/manual/Patterns.html#Patterns">http://flex.sourceforge.net/manual/Patterns.html#Patterns</a>.
<p>
<table>
<tr><td>x</td><td>matches the character x</td></tr>
<tr><td>.</td><td> (Period) matches any single character except a newline.</td></tr>
<tr><td>\n</td><td> matches a newline character</td></tr>
<tr><td>\* or "*"</td><td> \ is used both as an escape character, so that you can use a reserved
               character as a literal, and to specify certain control
               characters, such as newline characters (\n) and tabs (\t). 
               If the \ does not specify a control character, then it
               escapes the character.
               For example, \* is a literal asterisk, rather than an
               asterisk meaning 0 or more occurrences of a regular expression.
               Alternatively you can use quotes (&quot; &quot;) to specify
               that a reserved character should be interpreted literally as
               that character.
               </td></tr>
<tr><td>&lt;&lt;EOF&gt;&gt;</td><td>A special pattern that matches end of input (EOF).
              For example, <tt>&lt;&lt;EOF&gt;&gt; { return 0; }</tt>. Note that a quoted
              <tt>"$"</tt> matches a literal dollar sign, not EOF. Normally you
              do not care about EOF unless you need to do some sort
              of special processing, such as switching to another input
              file.</td></tr>
<tr><td>r$</td><td>When placed at the end of a regular expression, $ specifies
        that the string that matches the regular expression r must be at 
        the end of the
        current line of input.</td></tr>
<tr><td>[xyz]</td><td> a character class that matches any of the characters
        between the []'s. In this case the character class matches any
        of x, y, or z</td></tr>
<tr><td>[a-zA-Z]</td><td>the '-' denotes a range of ASCII characters. This
        specification matches any lower or upper case letter. Do not make
        the mistake of writing [A-z], because there are ASCII punctuation characters between
        uppercase 'Z' and lowercase 'a' that would be included in the pattern.
        ([a-Z] is not a valid range at all, because 'a' comes after 'Z' in ASCII.)</td></tr>
<tr><td>[0-9]</td><td> any single digit</td></tr>
<tr><td>[ \t\n\r\f]</td><td>matches any whitespace character. \r and \f
        stand for "carriage return" and "form feed"; \r in particular is often present in
        Windows generated files.</td></tr>
<tr><td>[^A-Z]</td><td>A ^ that is the <i>first</i> character inside the
        character class negates that character class, or alternatively,
        says any character but the characters in that character class.
        In this case [^A-Z] says anything except an uppercase letter</td></tr>
<tr><td>^r</td><td>When placed at the beginning of a pattern, the ^ says
        that the string which matches the regular expression
        r must start at the beginning of a line of input.</td></tr>
<tr><td>[a-z]{-}[aeiou]</td><td>The set difference operator ({-}) subtracts
        anything in the second character class from the first character
        class. In this case the pattern specifies the consonants.</td></tr>
<tr><td>r*</td><td>0 or more r's, where r is any regular expression.</td></tr>
<tr><td>r+</td><td>1 or more r's, where r is any regular expression.</td></tr>
<tr><td>r?</td><td>0 or 1 r's, where r is any regular expression. You may
       also think of ? as saying that the regular expression is optional.
       For example,
       <b>-?[0-9]</b> matches a single digit with an optional 
       leading minus sign.</td></tr>
<tr><td>r{2,5}</td><td>Matches anywhere from 2 to 5 r's</td></tr>
<tr><td>r{4,}</td><td>Matches 4 or more r's</td></tr>
<tr><td>r{4}</td><td>Matches exactly 4 r's</td></tr>
<tr><td>rs</td><td>the concatenation of the regular expressions r and s.
       You can also think of the pattern as r followed by s.</td></tr>
<tr><td>r|s</td><td>either r or s (i.e., the union operation).</td></tr>
<tr><td>[0-9]+</td><td> any unsigned integer (one or more digits)</td></tr>
<tr><td>.|\n</td><td> matches any character, including a newline.</td></tr>
<tr><td>(brad|bvz)*</td><td>parentheses are used to group regular expressions
             and to override precedence. For example,
             <tt>brad|bvz*</tt> would typically match either "brad" or "bv" followed by 0 or more
	     z's. To instead match 0 or more occurrences of either 
	     "brad" or "bvz", you would use parentheses: 
	     <tt>(brad|bvz)*</tt>.</td></tr>
<tr><td>{DIGIT}+"."{DIGIT}*</td><td>A name that is placed between curly braces
        ({}) will be replaced by its associated pattern from the definitions
        section. If DIGIT were defined as [0-9] in the definitions section,
        then this pattern specifies a number that consists of
        1 or more digits, followed by a period, followed by 0 or more digits.
        Note that the decimal point had to be placed in quotes to prevent
        it from being interpreted as a pattern that matches any single
        character.</td></tr>
<tr><td>&lt;s&gt;r</td><td>A regular expression that is active only when state
        <tt>s</tt> is enabled. See Section <a href="#states">States</a> for
	more details.</td></tr>
<tr><td>&lt;*&gt;r</td><td>A regular expression that is active in any state.</td></tr>
<tr><td>&lt;s1,s2,s3&gt;r</td><td>A regular expression that is active only when
        state s1, s2, or s3 is active.</td></tr>
</table>
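<p>
The following sketch combines several of the operators from the table in one
small scanner. It is an illustration only: the messages are invented, and it
assumes a flex version that supports the <tt>{-}</tt> set difference operator
and the <tt>&lt;&lt;EOF&gt;&gt;</tt> end-of-file pattern.
<pre>
%option noyywrap
%{
#include &lt;stdio.h&gt;
%}

%%
^"#".*              { printf("comment line: %s\n", yytext); }
[a-z]{-}[aeiou]     { printf("consonant: %s\n", yytext); }
[aeiou]             { printf("vowel: %s\n", yytext); }
-?[0-9]{1,3}        { printf("short integer: %s\n", yytext); }
"*"|\+              { printf("operator: %s\n", yytext); }
.|\n                { /* ignore any other character */ }
&lt;&lt;EOF&gt;&gt;             { printf("end of input\n"); return 0; }
%%

int main(void) {
  yylex();
  return 0;
}
</pre>
Note that none of the patterns contains an unquoted space, and that the
<tt>.|\n</tt> rule comes after the more specific rules so that it only picks up
characters that nothing else matched.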

<hr>
<h3><a name="states"/>States</h3>

<p>
Different states are like different FSMs. The lexer reacts to strings 
differently depending on what state it is in.  Two common situations where
we want to use states are where we want to recognize C-style comments and
where we are processing errors. C-style comments are difficult to handle
as a normal regular expression and errors may require us to consume some part
of the input until we get back to a part of the input where we are prepared
to resume scanning. For example, in the credit card section, we used the
pattern ".*" to mop up any characters before the newline character. However,
we did not need to use an error processing state in that example.
<p>
Before discussing the specifics of states, let's look at the following
set of rules from the <a href="http://flex.sourceforge.net/manual/Start-Conditions.html#Start-Conditions">flex</a> manual
for both recognizing open ended C-style comments (i.e., comments
of the form /* ... */) and keeping track of the current line number whenever
a comment spans multiple lines:
<pre>
"/*"         BEGIN(comment);
     
&lt;comment&gt;[^*\n]*        /* consume anything that's not a '*' or newline */
&lt;comment&gt;"*"+[^*/\n]*   /* this pattern is similar to the previous one but
                           it also consumes 1 or more leading *'s. For example, if
                           the comment were "brad **** smiley" then the
                           first pattern would consume "brad " and this
                           pattern would consume "**** smiley". */
&lt;comment&gt;\n             ++line_num;
&lt;comment&gt;"*"+"/"        { /* end of comment--resume initial state. Note that
                             this pattern specifies one or more "*"'s followed
                             by a "/", so it picks up comments ended by an
                             arbitrary number of *'s */
                          BEGIN(INITIAL); 
                        }
</pre>
The BEGIN action places the scanner in the specified state. INITIAL is
the default initial state for the scanner and any rule that is not
prefixed with a state is automatically active in the INITIAL state. 

<p>
Any rule that is prefixed with a state name enclosed in &lt;&gt;'s will be
active only when that state is active. 
<p>
States are defined in the definitions section using either %x or %s. For
example:
<pre>
%x comment
%s brad
</pre>
<p>
%x means that only rules prefixed with the state name will be active when
the scanner is in that state (think of %x as meaning eXclusive). %s
means that both rules prefixed by the state name and rules with no state
prefix will be active when the scanner is in that state (%s means inclusive,
as in including the unprefixed rules that are active in the INITIAL state).
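<p>
Putting the pieces together, here is a sketch of a complete specification that
strips C-style comments from its input and copies everything else to stdout. It
reuses the comment rules from the flex manual; the <tt>&lt;&lt;EOF&gt;&gt;</tt>
rule, the error message, and the two echoing rules are additions made for this
example.
<pre>
%option noyywrap
%{
#include &lt;stdio.h&gt;
int line_num = 1;
%}

%x comment

%%
"/*"                    { BEGIN(comment); }
&lt;comment&gt;[^*\n]*        { /* comment text with no '*' or newline */ }
&lt;comment&gt;"*"+[^*/\n]*   { /* '*'s that are not followed by a '/' */ }
&lt;comment&gt;\n             { ++line_num; }
&lt;comment&gt;"*"+"/"        { BEGIN(INITIAL); }
&lt;comment&gt;&lt;&lt;EOF&gt;&gt;        { fprintf(stderr, "line %d: unterminated comment\n", line_num);
                          return 0; }
\n                      { ++line_num; ECHO; }
.                       { ECHO; }
%%

int main(void) {
  yylex();
  return 0;
}
</pre>
Because <tt>comment</tt> is declared with %x, the last two rules are inactive
while the scanner is inside a comment, so comment text is never echoed.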

If you want to see an example of handling error tokens using error states,
go <a href="#error">here</a>.
<p>
<hr>
<h2> Frequent Problems With a Flex Specification </h2>
<p>
This section describes frequent problems that occur in a flex specification:
<p>
<ol>
<li> You indented a flex statement, such as %{, and either flex is telling
     you that you have an error at the end of your flex file or flex seems
     to be working but gcc is producing dozens of error messages. Make sure
     that you start <i>all</i> flex statements in the first column.
<p>
<li> <a name="error"/> Your rules do not cover all possible input cases: If no pattern matches
     at the current input position, then a flex-generated scanner applies its
     default rule, which echoes the unmatched character to
     stdout. This is not the behavior you want. Typically this would occur
     because the string does not match any token in your language and you
     therefore want to treat it as an error token and print an error message.
     Placing a <tt>.</tt> pattern as the last pattern in your rules section
     will ensure that you catch any string that does not match another rule.
     Since . only matches one character, if you print out an error message,
     your error token will be only one character long. Frequently you want
     to scan and consume input until you reach the beginning of the next
     viable token. The beginning of the next viable token is a character
     that can start a viable token. Hence the
     best thing to do is switch into an error state and then keep adding
     characters to yytext until you reach a character that can start a
     viable token. You can call the function <tt>yymore()</tt> in an action so that
     the next match is appended to the current value of yytext, rather than causing yytext to be
     erased and replaced with the new string.
     You can use <tt>yyless(x)</tt> to subtract one or
     more characters from yytext and push those characters back onto the
     input stream. Counterintuitively, the argument to yyless is the
     number of characters from yytext that you wish to keep, and the 
     remainder will be pushed back onto the input stream. <tt>yyleng</tt>
     keeps track of the current length of yytext, so the command:
<pre>
yyless(yyleng-n)
</pre>
     will push back the last <tt>n</tt> characters of input and keep
     the first "yyleng-n" characters in yytext. Here is how you could
     handle an error condition in a language where a token can begin
     with a letter, digit, underscore ('_'), one of the arithmetic operators,
     or the assignment operator:
<pre>
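   /* assumes the definitions section declares the state with: %x error */
. { BEGIN(error); yymore(); }
   /* the '-' is placed last in the class so that it is a literal minus
      sign rather than part of a range from '+' to '/' */
&lt;error&gt;[^a-zA-Z0-9_+/*=-] { yymore(); }
&lt;error&gt;. { /* anything else is the beginning of a valid token */
           yyless(yyleng-1);  // push back the last character
           printf("error token: %s\n", yytext);
           BEGIN(INITIAL);
         }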
</pre>
   <li> A reserved word is not getting identified and instead an id
        is getting identified. For example, you might have the rules:
<pre>
[a-zA-Z]+   { /* action for an id */ }
while       { /* action for a while token */ }
</pre>
        The problem is that when two patterns match an equal length string,
        the first pattern wins. The solution is to put more specific patterns
        first, and more general patterns last. Here that means putting all
        your reserved keyword patterns before your id recognizing pattern.
        Since flex always prefers the longest match, it does not
        matter how you order two patterns where one pattern is a prefix
        of the other. For example, the two orderings:
<pre>
"&lt;"          "&lt;="
"&lt;="   and   "&lt;"
</pre>
        should be equivalent, because the scanner will always match
        <tt>&lt;=</tt> if possible.
<p>
<li> You put spaces between parts of your pattern to make it more readable,
     but then the rule does not behave as intended. For example, to recognize a
     floating point number, you write the pattern:
<pre>
{DIGIT}+ "." {DIGIT}*
</pre>
     but strings of the form "35.878" do not get recognized. The issue is
     that flex treats the first unquoted space as the end of the pattern and
     everything after it as the action. Here the pattern is only
     <tt>{DIGIT}+</tt>, and <tt>"." {DIGIT}*</tt> is copied into the scanner as
     (invalid) C code, so gcc will usually complain. Spaces are never just
     pretty printing in a pattern. The solution is to get rid of the spaces. If you
     need to match a literal space, put it in quotes or in a character class.
</ol>
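<p>
The fixes described in this section can be combined into one small
specification. The sketch below is illustrative only: the token set and the
messages are invented, and it makes no attempt to report an error token that
runs all the way to the end of the input.
<pre>
%option noyywrap
%{
#include &lt;stdio.h&gt;
%}

DIGIT [0-9]
%x error

%%
while                   { printf("keyword: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("id: %s\n", yytext); }
{DIGIT}+"."{DIGIT}*     { printf("float: %s\n", yytext); }
{DIGIT}+                { printf("int: %s\n", yytext); }
[-+*/=]                 { printf("operator: %s\n", yytext); }
[ \t\n]+                { /* skip whitespace */ }
.                       { BEGIN(error); yymore(); }

&lt;error&gt;[^a-zA-Z0-9_+*/= \t\n-]  { yymore(); /* still inside the bad token */ }
&lt;error&gt;.|\n             { yyless(yyleng - 1);   /* push back the character that ended it */
                          printf("error token: %s\n", yytext);
                          BEGIN(INITIAL);
                        }
%%

int main(void) {
  yylex();
  return 0;
}
</pre>
The keyword rule comes before the identifier rule, the float pattern contains no
spaces, and the catch-all <tt>.</tt> rule is last, so each of the problems listed
above is avoided.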