CS140 Lecture notes -- Some more on Malloc and Fields

  • Jim Plank--with modifications by Brad Vander Zanden
  • Directory: ~cs140/www-home/notes/MalField
  • Lecture notes: http://www.cs.utk.edu/~cs140/notes/MalField
  • Tue Sep 15 16:51:51 EDT 1998

    A simple malloc() example

    Try to write the following program: variance. Variance takes one command line argument, which is the number n. Variance then expects to read n floating point numbers on standard input. It will then calculate the average of these numbers, and then for each of the n numbers, the square of the difference between that number and the average. The variance is the average of these values. It will print out the average and the variance.

    In other words, suppose n is 3 and the numbers are 1, 2, and 3. Then the average is 2. The squares of the differences are 1, 0 and 1, so the variance is 2/3.

    Suppose n is 3 and the numbers are 10, 5 and 3. The average is 18/3 = 6. The squares of the differences are (10-6)(10-6) = 16, (5-6)(5-6) = 1, and (3-6)(3-6) = 9. Thus, the variance is (16+1+9)/3 = 26/3 = 8.6667.
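
    In symbols, the two quantities can be written as follows. The notation x_1, ..., x_n for the inputs is introduced here for illustration and is not part of the original notes.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\text{variance} = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
\]
```

    Note that the sum is divided by n (not n-1), which matches both worked examples above.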

    Ok, now take a few minutes and try to write variance.

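    Here's the strategy for writing variance:

      • Get n from the command line arguments.
      • malloc() space for n doubles.
      • Read the n values in using scanf().
      • Compute their average.
      • Compute the sum of the squares of the differences from the average.
      • Divide that sum by n to get the variance, and print out both the average and the variance.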

    And here's the code (also in variance.c):

    #include <stdio.h>
    #include <stdlib.h>   /* for atoi(), malloc() and exit() */
    
    int main(int argc, char **argv)
    {
      int n, i;
      double *values;
      double avg;
      double variance;
    
      
      /*  First you need to get n from the command line arguments. */
      
      if (argc != 2) {
        fprintf(stderr, "usage: variance n\n");
        exit(1);
      }
      n = atoi(argv[1]);
      if (n <= 0) exit(1);
    
      /*  Next, you need to malloc() space for n doubles.   */
    
      values = (double *) malloc(sizeof(double)*n);
      if (values == NULL) {
        perror("malloc");
        exit(1);
      }
    
      /*  Next, you read them in using scanf(). */
    
      for (i = 0; i < n; i++) {
        if (scanf("%lf", &(values[i])) != 1) exit(1);
      }
    
      /*  Next, you compute their average. */
    
      avg = 0;
      for (i = 0; i < n; i++) {
        avg += values[i];
      }
      avg /= n;
    
      /*  Now, you compute the sum of the squares of the differences. */
    
      variance = 0;
      for (i = 0; i < n; i++) {
        variance += ((values[i]-avg)*(values[i]-avg));
      }
    
      /*  Finally, you compute the variance and print them both out. */
    
      variance /= n;
    
      printf("Average:  %lf\n", avg);
      printf("Variance: %lf\n", variance);
      free(values);
      return 0;
    }
    
    It works quite nicely:
    UNIX> variance 
    usage: variance n
    UNIX> variance 3
    1 2 3
    Average:  2.000000
    Variance: 0.666667
    UNIX> variance 3
    1
           2                    3
    Average:  2.000000
    Variance: 0.666667
    UNIX> variance 3
    10
    
    
    5
             3
    Average:  6.000000
    Variance: 8.666667
    UNIX> 
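
    One weak spot in variance.c is atoi(), which cannot tell "3" from "3x" and returns 0 for "Frog" without reporting an error. A minimal sketch of a stricter check using the standard strtol() follows. The helper name parse_count() is hypothetical and is not part of variance.c.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not in variance.c): parses s as a positive
   decimal int.  Returns 1 and stores the value in *out on success,
   and returns 0 (leaving *out untouched) on any error. */
int parse_count(const char *s, int *out)
{
  char *end;
  long v;

  errno = 0;
  v = strtol(s, &end, 10);

  if (end == s || *end != '\0') return 0;       /* no digits, or trailing junk */
  if (errno == ERANGE || v > INT_MAX) return 0; /* too big for an int */
  if (v <= 0) return 0;                         /* variance needs n > 0 */

  *out = (int) v;
  return 1;
}
```

    In variance.c this would replace the atoi() call, for example: if (!parse_count(argv[1], &n)) exit(1);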
    

    Different input

    Now, suppose that the input can have comment lines. If a line begins with the '#' character, then it should be ignored. This throws a real monkey wrench into scanf(): you would have a very hard time doing this with scanf(). Why? Because scanf("%lf", ...) treats a newline like any other whitespace, so you are basically ignoring line information. So how can you figure out if a line begins with '#'?

    We'll solve this by using the fields library instead. We'll read in lines using get_line() and if a line begins with '#', we'll ignore it. If it doesn't, then we'll use sscanf() to turn each of its fields into a double. The code is in variance2.c. Here are the only changes from variance.c:

    #include "fields.h"
    
    int main(int argc, char **argv)
    {
      int n, i, j;
      IS is;
       
      ...
      
      /*  Next, you read them in using the fields library. */
    
      is = new_inputstruct(NULL);
      i = 0;
      while (i < n && get_line(is) >= 0) {
        if (is->text1[0] != '#') {
          for (j = 0; j < is->NF; j++) {
            if (i < n) {
              if (sscanf(is->fields[j], "%lf", &(values[i])) != 1) exit(1);
              i++;
            }
          }
        }
      }
      if (i < n) exit(1);
      
    
    Note, I've checked for some common input errors such as not entering a double, and not entering enough numbers. Go over this code until you understand exactly what it is doing. I will expect you to be able to write code like this without problems.

    When we run it, we see it works:

    UNIX> make variance2
    gcc -g -I/home/cs140/spring-2004/include -c variance2.c
    gcc -g -I/home/cs140/spring-2004/include -o variance2 variance2.o /home/cs140/spring-2004/objs/libfdr.a
    UNIX> variance2 3
    1 2 3
    Average:  2.000000
    Variance: 0.666667
    UNIX> variance2 3
    1 2 Frog
    UNIX> variance2 3
    10
    # Hi!!!!
    5
    3
    Average:  6.000000
    Variance: 8.666667
    UNIX> 
    
    Note, when I typed Frog, it exited, because sscanf() returned zero.
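
    For readers who do not have the fields library at hand, the same idea can be sketched with only the standard library: read whole lines with fgets(), skip those that start with '#', and convert the rest with strtod(). This is not the approach the notes take. The function name read_doubles() and the 1024-byte line limit are assumptions made for this sketch.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads up to n doubles from f into values[],
   ignoring every line whose first character is '#'.
   Returns the number of values read, or -1 if a token is not a number.
   Assumes no input line is longer than 1023 characters. */
int read_doubles(FILE *f, double *values, int n)
{
  char line[1024];
  int i = 0;

  while (i < n && fgets(line, sizeof(line), f) != NULL) {
    char *p = line;

    if (line[0] == '#') continue;          /* comment line: skip it */

    for (;;) {
      char *end;
      double d;

      /* Skip whitespace between numbers. */
      while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n') p++;
      if (*p == '\0' || i >= n) break;     /* end of line, or have enough */

      d = strtod(p, &end);
      if (end == p) return -1;             /* e.g. "Frog": not a number */
      values[i++] = d;
      p = end;
    }
  }
  return i;
}
```

    A caller would treat any return value other than n as an input error, which corresponds to the two exit(1) checks in variance2.c.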

    Better structure: tokens

    The code in variance2.c is not the cleanest, and it is not readily readable. What you want to do instead is something like you did with the original scanf() statement:
      for (i = 0; i < n; i++) {
        values[i] = get_next_double();
      }
    
    In order to do this, we're going to define a new ``data structure'' called a ``token generator.'' A data structure is an organization of data plus procedures to access the data. In the case of our token generator, our data is going to consist of an inputstruct and an integer:
    typedef struct {
      IS is;
      int field;
    } TokenGen;
    
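    We'll define two procedures to access data from the token generator:

      • new_tokengen(fn) allocates a token generator that reads from the file named fn, or from standard input if fn is NULL.
      • tokengen_get_token(tg) returns the next token (field) from the input, skipping comment lines, or NULL when there is no more input.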

    Now, before we write the code for the token generator, let's see how such a data structure makes writing variance3.c easier. Now, the relevant sections of code are going to look as follows:

    int main(int argc, char **argv)
    {
      int n, i, j;
      TokenGen *tg;
      char *s;
    
      ...
       
      /*  Next, you read them in using the token generator */
    
      tg = new_tokengen(NULL);
      if (tg == NULL) exit(1);
      for (i = 0; i < n; i++) {
        s = tokengen_get_token(tg);
        if (s == NULL) exit(1);
        if (sscanf(s, "%lf", &(values[i])) != 1) exit(1);
      }
    
    That's much cleaner, no? Now, let's write the TokenGen procedures. Basically, the struct holds an inputstruct and a current field. When tokengen_get_token(tg) is called, we're going to return tg->is->fields[tg->field], and then increment tg->field. When tg->field is greater than (or equal to) tg->is->NF, then we need to call get_line(tg->is) and set tg->field to zero. We also need to ignore lines that begin with '#'.

    Here's the code. One additional thing that I did was to set tg->field to -1 if the current line should not be used. This happens when I first allocate the TokenGen, and whenever I see a line with a comment.

    typedef struct {
      IS is;
      int field;
    } TokenGen;
    
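    TokenGen *new_tokengen(char *fn)
    {
      TokenGen *tg;
    
      tg = (TokenGen *) malloc(sizeof(TokenGen));
      if (tg == NULL) return NULL;
    
      tg->is = new_inputstruct(fn);
      if (tg->is == NULL) {
        free(tg);        /* don't leak the struct if the input can't be opened */
        return NULL;
      }
      tg->field = -1;    /* -1 means "the current line should not be used" */
      return tg;
    }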
    
    char *tokengen_get_token(TokenGen *tg)
    {
      char *s;
    
      while(tg->field == -1 || tg->field >= tg->is->NF) {
        if (get_line(tg->is) < 0) return NULL;
        if (tg->is->text1[0] == '#') {
          tg->field = -1;
        } else {
          tg->field = 0;
        }
      }
    
      s = tg->is->fields[tg->field];
      tg->field++;
      return s;
    }
    
    The entire code is in variance3.c.

    Read over that code, and make sure you understand it. If you don't, you will have troubles with your labs. This is a very nice example of malloc(), structs, procedure calls, etc.
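
    To experiment with the token generator idea without linking against libfdr.a, here is a self-contained sketch that keeps the same interface shape but reads lines with fgets(). It is an illustration under stated assumptions, not the code in variance3.c: the names SimpleTokenGen, stg_new() and stg_get_token() and the 1024-byte line limit are all hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STG_MAXLINE 1024   /* assumed maximum line length */

typedef struct {
  FILE *f;                 /* where lines come from */
  char line[STG_MAXLINE];  /* the current line, chopped up in place */
  char *next;              /* where to resume scanning, or NULL if a
                              new line must be read first */
} SimpleTokenGen;

SimpleTokenGen *stg_new(FILE *f)
{
  SimpleTokenGen *tg = (SimpleTokenGen *) malloc(sizeof(SimpleTokenGen));
  if (tg == NULL) return NULL;
  tg->f = f;
  tg->next = NULL;         /* plays the role of field == -1 */
  return tg;
}

/* Returns the next whitespace-separated token, skipping lines that
   begin with '#', or NULL at end of input.  The returned pointer
   points into tg->line and is only valid until the next line is read. */
char *stg_get_token(SimpleTokenGen *tg)
{
  char *tok;

  for (;;) {
    if (tg->next == NULL) {
      if (fgets(tg->line, STG_MAXLINE, tg->f) == NULL) return NULL;
      if (tg->line[0] == '#') continue;           /* comment: read another */
      tg->next = tg->line;
    }
    tg->next += strspn(tg->next, " \t\r\n");      /* skip leading blanks */
    if (*tg->next == '\0') {                      /* line used up */
      tg->next = NULL;
      continue;
    }
    tok = tg->next;
    tg->next += strcspn(tg->next, " \t\r\n");     /* find end of token */
    if (*tg->next != '\0') *tg->next++ = '\0';    /* terminate it */
    return tok;
  }
}
```

    The loop in main() would then look just like the one in variance3.c, with stg_new(stdin) in place of new_tokengen(NULL).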


    Making TokenGen more general

    It certainly seems as though the TokenGen might be useful in other programs. For example, it would be really useful in reading pgm files that can have comments in them. Perhaps we'd like to have a token library like the fields library.

    As a first pass in doing this, I've made two files: tg.h and tg.c. Tg.h contains the typedef for the TokenGen struct, and function prototypes for new_tokengen() and tokengen_get_token(). The compiler uses these prototypes to type check calls for you when procedures are defined in one C file and used in another. The extern keyword on the prototypes signals that these procedures are defined in a separate C file.

    Tg.c simply includes tg.h and then defines the two procedures. Take a look at both files so that you see what is defined where.
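
    Since tg.h is not reproduced in these notes, here is a rough sketch of what such a header might contain, based only on the description above. The include guard name TG_H and the parameter names are assumptions, and the real file may be laid out differently.

```c
#ifndef TG_H            /* include guard; the macro name is assumed */
#define TG_H

#include "fields.h"     /* for the IS type */

typedef struct {
  IS is;                /* input struct from the fields library */
  int field;            /* index of the next field to return, or -1 */
} TokenGen;

extern TokenGen *new_tokengen(char *fn);
extern char *tokengen_get_token(TokenGen *tg);

#endif
```

    Both tg.c and variance4.c would then start with #include "tg.h", so the struct and the prototypes are written only once.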

    Now variance4.c uses tg.c and tg.h. Note, it is exactly the same as variance3.c, except it doesn't define the TokenGen procedures -- it simply uses them.

    Note the compilation process for variance4:

    UNIX> make variance4
    gcc -g -I/home/cs140/spring-2004/include -c variance4.c
    gcc -g -I/home/cs140/spring-2004/include -c tg.c
    gcc -g -I/home/cs140/spring-2004/include -o variance4 variance4.o /home/cs140/spring-2004/objs/libfdr.a tg.o
    UNIX> variance4 3
    1
    2
    #HI!!
    3
    Average:  2.000000
    Variance: 0.666667
    UNIX> 
    


    The extern Keyword

    You saw in the previous section that the extern keyword can be used to specify that procedures are defined in a different file. It can also be used to specify that variables are defined in a different file. Why would one want to use the extern keyword? You need it once you start using multiple files to define a program. Suppose for example that in variance4.c you want to call tokengen_get_token, which is defined in tg.c. If variance4.c contains no declaration of tokengen_get_token, then depending on its version gcc will either give you a compile error saying that tokengen_get_token is undeclared, or warn and wrongly assume that it returns an int.

    You tell gcc that a function is defined in another file by using the extern keyword and a function prototype. A function prototype is a statement that declares a function's return type, its name, and its parameter types. It typically looks just like a function header followed by a semicolon. For example, the function prototype for tokengen_get_token is:

    char *tokengen_get_token(TokenGen *);
    

    Notice that a function prototype does not need to include the names of the parameters, only the types of the parameters.
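
    A small sketch of that point: the two prototypes below declare exactly the same function, so parameter names in a prototype serve only as documentation. The function scale() is a hypothetical example and has nothing to do with tg.c.

```c
/* Two prototypes for the same (hypothetical) function.  The parameter
   names are optional in a prototype, and both forms declare the same type. */
double scale(double, int);
double scale(double value, int times);

/* The definition must name its parameters, since the body uses them. */
double scale(double value, int times)
{
  return value * times;
}
```

    Repeating a compatible prototype is legal, which is why a header's prototype and the later definition can coexist in one file.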

    Strictly speaking, the extern keyword is optional in front of a function prototype: a prototype is only a declaration, so it means the same thing with or without extern, and many programmers include the keyword just to make the intent explicit. Either way, if the function is not defined in the same file, gcc will put a "note" into the .o file that tells the linker that it needs to resolve the function name when it creates the binary executable. The linker will do this by trying to find the function definition in the other .o files and libraries that are used to create the executable.

    Typically you see the extern keyword only in .h files since the same function is likely to be used in multiple files and therefore it is easier to move the extern declaration to a .h file where it only has to be written once. If the function is later modified, you will also only have to modify one extern declaration. Whenever you use the extern keyword you must have one file that defines the function or variable declared by the extern keyword.

    When you declare a variable using the extern keyword, you are saying that it is a global variable whose storage is defined elsewhere. We will not often use global variables in this course but you will encounter them in later courses. Like an extern function, an extern variable must be defined in exactly one file.

    You might wonder if it is legal to declare a variable/function to be extern in a file and to also define it there. The answer is yes it is legal. For example the following code is legal:

    extern int counter;
    extern int get_counter(void);
    
    int counter;
    
    int get_counter(void) { return counter; }
    

    Note that because such code is legal, a .c file can include a .h file without worrying about whether that header declares as extern some variables or functions that the .c file itself defines.