Introduction to Flex

Brad Vander Zanden and Ray Byler


Overview

This text provides a brief introduction to the Flex lexical analyzer and shows how to integrate it with the Bison parser to produce a compiler front end. The mechanics of Bison itself will be covered later. Flex and Bison are essentially improved versions of Lex and Yacc: they are more flexible and produce faster code. For the homework and class discussion, we will be using Flex and Bison. You are welcome to use different lexical/parsing tools for your project. For example, JLex/JCup is a lexer/parser pair for Java that is based on the Lex/Yacc model, and you can find other Java lexer/parser tools based on the same model by searching the internet.

You can use Flex and Bison independently, but they have been engineered to work well together. Bison produces a parser from an input file that you provide. The parser expects to receive a token stream from a lexer of your choice, and it expects your lexer to provide it with a function named yylex() that it can call to retrieve tokens from this token stream. Flex generates the yylex() function automatically when you provide it with a .l file (e.g., graph.l). We will discuss how the .l file should be written later in these notes, and we will discuss how to create an input file for bison in a separate set of notes.

The yylex() function produced by Flex uses simulated finite-state machines (FSMs) to recognize strings (or lexemes) and then passes this information to the parser in the form of integer tokens. These simulated FSMs are generated by Flex from the regular expressions that you write. The parser parses (think about parse trees and context-free grammars) this sequence of tokens to verify that the statements formed conform to the grammar of the language. The lexical analyzer is often one of the more time-consuming parts of a compiler front end, since it is the only phase that has to read every single character of the input file.


A Sample Flex Specification

Here is a sample flex specification that reads lines from stdin and checks whether each line contains a valid credit card number. For this example, a valid credit card number is defined as one that has four groups of four digits each, with adjacent groups separated by an optional space or dash (-):
%option noyywrap
%{
/* * * * * * * * * * * *
 * * * DEFINITIONS * * *
 * * * * * * * * * * * */
%}

%{
// recognize whether or not a credit card number is valid
int line_num = 1;
%}

digit [0-9]
group {digit}{4}
%%

%{
/* * * * * * * * * 
 * * * RULES * * *
 * * * * * * * * */
%}
   /* The caret (^) says that a credit card number must start at the
      beginning of a line and the $ says that the credit card number
      must end the line. */
^{group}([ -]?{group}){3}$  { printf(" credit card number: %s\n", yytext); }

   /* The .* accumulates all the characters on any line that does not
      match a valid credit card number */
.* { printf("%d: error: %s \n", line_num, yytext); }
\n { line_num++; }
%%

/* * * * * * * * * * * 
 * * * USER CODE * * *
 * * * * * * * * * * *
 */
int main(int argc, char *argv[]) {
  yylex();
}
Here are things to note about the above code:

  1. To build the scanner, type the following commands:
    flex credit_card.lex    // assumes you stored the specification in credit_card.lex
    gcc lex.yy.c            // produces an executable named a.out, which reads from stdin
    
    Flex specifications are typically stored in files with a .l or .lex extension. flex generates a file named lex.yy.c. You can change this name but it usually does not matter.

  2. The specification is divided into three sections with "%%" delimiters placed between sections:

    1. A definitions section where you can 1) place code that you want to go at the beginning of the generated scanner, 2) define names that are in effect macros that can be expanded within a pattern, and 3) define states that control when rules are active (see States for more details about how this is done). You place code that should go at the beginning of the scanner between %{ and %} delimiters. Normally you place 3 types of code in these delimiters:
      1. comments: You can use the %{ ... %} anywhere in either the definitions or rules section to safely embed comments in your code. flex is very picky about when and where it is safe to place comments, so to be safe, you can always place comments in a %{ %} block.
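      2. .h files: When you are using your scanner with a bison-generated parser, you will often include the token header that bison generates in a %{ %} block, along with any other .h files that you may need to use. That header is named y.tab.h when bison is run with the -y and -d flags; without -y it is named after the grammar file (for example, graph.tab.h for graph.y).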
      3. function definitions that you plan to use in your actions (see rules section below for a definition of actions): Often you will define error-handling functions that you can call from various action routines when you encounter erroneous tokens.

      A name definition has the form:

      name pattern
      
      where name is the name of your macro and pattern is a regular expression. As you can see from the definition of group in the example specification, you can use previously defined names in your patterns.

    2. A rules section where you have rules of the form:
      pattern action
      
      A pattern is a regular expression and the action is normally C/C++ code to do something with the string that matches the pattern. The string that matches the pattern is placed in a pre-defined char * variable named yytext. As you can see from the example, actions are normally enclosed in C-style curly braces ({}).

    3. A user code section where you can place code to invoke the scanner. Flex generates a function named yylex that you can call from main if you want to use your scanner as a standalone application. yylex will read input from stdin until it is exhausted, or until you tell it to return with a return statement in an action. Normally you will be using the scanner with a bison-generated parser, and you will leave this section empty, because the parser will call yylex for you.

  3. You normally should keep track of the line count yourself, so that you can print out error messages when you encounter an unrecognizable token. Flex will keep track of the current line number in a variable named yylineno if you include the following line at the very top of your flex specification:
    %option yylineno
    
    However, yylineno is not in the POSIX specification for lex, so your lex specification will not be portable should you decide to use it.

  4. Flex is very finicky about indenting. You will avoid any pitfalls if you start all flex commands in the first column. If you indent a flex command, such as a %{ delimiter, by even one space, flex may or may not complain, but it will not generate a valid lex.yy.c file and gcc will probably generate dozens of useless error messages.

  5. Annoyingly, the scanner that flex generates calls a function named yywrap when it reaches the end of its input, but flex does not define that function for you. yywrap exists so that you can switch to another input file, so you typically won't need it if you have only one input file. You can tell flex not to use this function by placing the directive:
    %option noyywrap
    
    at the top of your flex file. Alternatively you can tell the linker to use the default version of yywrap in the fl (flex) library, with the -lfl flag:
    gcc lex.yy.c -lfl
    

Patterns

Flex looks for the longest possible match (i.e., it is a greedy matcher). One consequence is that a scanner tends to run faster when each match consumes more input, because the fixed cost of setting up yytext and running an action is paid less often.

Generally, you should start with simple elements (e.g. letters) and then combine them to form more powerful expressions/languages.

Here is a list of the most useful patterns. You can find a complete list at http://flex.sourceforge.net/manual/Patterns.html#Patterns.

x    matches the character x.
.    (period) matches any single character except a newline.
\n    matches a newline character.
\* or "*"    \ is used both as an escape character, so that you can use a reserved character as a literal, and to specify certain control characters, such as newline characters (\n) and tabs (\t). If the \ does not specify a control character, then it escapes the character. For example, \* is a literal asterisk, rather than an asterisk meaning 0 or more occurrences of a regular expression. Alternatively you can use quotes (" ") to specify that a reserved character should be interpreted literally as that character.
<<EOF>>    a special pattern meaning end of input (EOF). For example, <<EOF>> { return 0; }. Note that "$" in quotes matches a literal dollar sign, not end of input. Normally you do not care about EOF unless you need to do some sort of special processing, such as switching to another input file.
r$    When placed at the end of a regular expression, $ specifies that the string that matches the regular expression r must be at the end of the current line of input.
[xyz]    a character class that matches any of the characters between the []'s. In this case the character class matches any of x, y, or z.
[a-zA-Z]    the '-' denotes a range of ASCII characters. This specification matches any lower or upper case letter. Do not make the mistake of writing [A-z], because there are ASCII characters between uppercase 'Z' and lowercase 'a' (for example '[' and '_') that would be included in the pattern.
[0-9]    any single digit.
[ \t\n\r\f]    matches any whitespace character. \r and \f stand for "carriage return" and "form feed"; \r in particular is often present in files generated on Windows.
[^A-Z]    A ^ that is the first character inside the character class negates that character class, or alternatively, says any character but the characters in that character class. In this case [^A-Z] says anything except an uppercase letter.
^r    When placed at the beginning of a pattern, the ^ says that the string which matches the regular expression r must start at the beginning of a line of input.
[a-z]{-}[aeiou]    The set difference operator ({-}) subtracts anything in the second character class from the first character class. In this case the pattern specifies the lowercase consonants.
r*    0 or more r's, where r is any regular expression.
r+    1 or more r's, where r is any regular expression.
r?    0 or 1 r's, where r is any regular expression. You may also think of ? as saying that the regular expression is optional. For example, -?[0-9] matches a single digit with an optional leading minus sign.
r{2,5}    matches anywhere from 2 to 5 r's.
r{4,}    matches 4 or more r's.
r{4}    matches exactly 4 r's.
rs    the concatenation of the regular expressions r and s. You can also think of the pattern as r followed by s.
r|s    either r or s (i.e., the union operation). Do not put spaces around the |.
[0-9]+    one or more digits, i.e., any unsigned integer.
.|\n    matches any single character, including a newline.
(brad|bvz)*    parentheses are used to group regular expressions and to override precedence. For example, brad|bvz* would match either "brad" or "bv" followed by 0 or more z's. To instead match 0 or more occurrences of either "brad" or "bvz", you would use parentheses: (brad|bvz)*.
{DIGIT}+"."{DIGIT}*    A name that is placed between curly braces ({}) will be replaced by its associated pattern from the definitions section. If DIGIT were defined as [0-9] in the definitions section, then this pattern specifies a number that consists of 1 or more digits, followed by a period, followed by 0 or more digits. Note that the decimal point had to be placed in quotes to prevent it from being interpreted as a pattern that matches any single character.
<s>r    A regular expression that is active only when state s is enabled. See the States section for more details.
<*>r    A regular expression that is active in any state.
<s1,s2,s3>r    A regular expression that is active only when state s1, s2, or s3 is active.


States

Different states are like different FSMs. The lexer reacts to strings differently depending on what state it is in. Two common situations where we want to use states are where we want to recognize C-style comments and where we are processing errors. C-style comments are difficult to handle as a normal regular expression and errors may require us to consume some part of the input until we get back to a part of the input where we are prepared to resume scanning. For example, in the credit card section, we used the pattern ".*" to mop up any characters before the newline character. However, we did not need to use an error processing state in that example.

Before discussing the specifics of states, let's look at the following set of rules from the flex manual for both recognizing open ended C-style comments (i.e., comments of the form /* ... */) and keeping track of the current line number whenever a comment spans multiple lines:

"/*"         BEGIN(comment);
     
<comment>[^*\n]*        /* consume anything that's not a '*' or newline */
<comment>"*"+[^*/\n]*   { /* this pattern is similar to the previous one but
                           it also consumes 1 or more leading *'s. For example, if
                           the comment were "brad **** smiley" then the
                           first pattern would consume "brad " and this
                           pattern would consume "**** smiley". */ }
<comment>\n             ++line_num;
<comment>"*"+"/"        { /* end of comment--resume initial state. Note that
                             this pattern specifies one or more "*"'s followed
                             by a "/", so it picks up comments ended by an
                             arbitrary number of *'s */
                          BEGIN(INITIAL); 
                        }
The BEGIN action places the scanner in the specified state. INITIAL is the default initial state for the scanner and any rule that is not prefixed with a state is automatically active in the INITIAL state. Any rule that is prefixed with a state name enclosed in <>'s will be active only when that state is active. States are defined in the definitions section using either %x or %s. For example:
%x comment
%s brad
%x means that only rules prefixed with the state name will be active when the scanner is in that state (think of %x as meaning eXclusive). %s means that both rules prefixed by the state name and rules with no state name will be active when the scanner is in that state (%s means inclusive, as in including the rules that are active in the INITIAL state). For an example of handling error tokens using an error state, see item 2 under Frequent Problems With a Flex Specification.


Frequent Problems With a Flex Specification

This section describes frequent problems that occur in a flex specification:

  1. You indented a flex statement, such as %{, and either flex is telling you that you have an error at the end of your flex file or flex seems to be working but gcc is producing dozens of error messages. Make sure that you start all flex statements in the first column.

  2. Your rules do not cover all possible input cases: If no pattern matches the current input, then a flex-generated scanner echoes the unmatched character to stdout. This is not the behavior you want. Typically this would occur because the string does not match any token in your language and you therefore want to treat it as an error token and print an error message. Placing a . pattern as the last pattern in your rules section will ensure that you catch any character (other than a newline) that does not match another rule. Since . only matches one character, if you print out an error message, your error token will be only one character long.

    Frequently you want to scan and consume input until you reach the beginning of the next viable token, that is, a character that can start a viable token. Hence the best thing to do is switch into an error state and then keep adding characters to yytext until you reach such a character. Calling yymore() in an action tells the scanner to append the next matched string to the current value of yytext, rather than erasing yytext and replacing it with the new string. You can use yyless(x) to remove one or more characters from the end of yytext and push those characters back onto the input stream. Counterintuitively, the argument to yyless is the number of characters from yytext that you wish to keep, and the remainder will be pushed back onto the input stream. yyleng keeps track of the current length of yytext, so the command:
    yyless(yyleng-n)
    
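    will push back the last n characters of input and keep the first "yyleng-n" characters in yytext. Here is how you could handle an error condition in a language where a token can begin with a letter, digit, underscore ('_'), one of the arithmetic operators, and the assignment operator. The rules assume that the error state was declared in the definitions section with %x error, and the - inside the character class is escaped so that +-/ is not read as a range of characters:
    . { BEGIN(error); yymore(); }
    <error>[^a-zA-Z0-9_+\-/*=] { yymore(); }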
    <error>. { /* anything else is the beginning of a valid token */
               yyless(yyleng-1);  // push back the last character
               printf("error token: %s\n", yytext);
               BEGIN(INITIAL);
             }
    
  3. A reserved word is not getting identified and instead an id is getting identified. For example, you might have the rules:
    [a-zA-Z]+   { /* action for an id */ }
    while       { /* action for a while token */ }
    
    The problem is that when two patterns match a string of equal length, the pattern listed first wins. The solution is to put more specific patterns first, and more general patterns last. Here that means putting all your reserved keyword patterns before your id recognizing pattern. Since flex always tries to match the longest string, it does not matter how you order two patterns where one pattern matches a prefix of what the other matches. For example, the two orderings (the quotes keep flex from reading < as the start of a state prefix):
    "<"     and   "<="
    "<="          "<"
    
    should be equivalent, because the scanner will always match <= if possible.

  4. You put spaces between parts of your pattern to make it more readable, but then the scanner does not recognize the strings you expect. For example, to recognize a floating point number, you write the pattern:
    {DIGIT}+ "." {DIGIT}*
    
    but strings of the form "35.878" do not get recognized. The issue is that you used spaces to separate the "." from the integer and fractional parts of the number. These spaces are not pretty printing: in flex, the first unquoted space ends the pattern, and everything after it on the line is treated as the action. Flex therefore sees {DIGIT}+ as the entire pattern and copies the rest of the line into lex.yy.c as C code, which usually shows up as a compile error. The solution is to get rid of the spaces.