CS360 Lecture Notes -- Fields


The fields library is a suite of routines that make reading input easier than using getchar(), scanf() or gets(). This is a library that I wrote -- it is not standard in Unix, but it should work with any C compiler (this includes on DOS/Windows). If you want to take the fields library with you after class, go ahead and do so. The source code is in this repo, in the directory "Libfdr".

In order to use the fields procedures in this class, you should include the file fields.h. On the EECS machines, this can be found in the directory /home/plank/cs360/include. Instead of including the full path name in your C file, just do:

#include "fields.h"
and then compile the program with:
gcc -I/home/plank/cs360/include
When you link your object files to make an executable, you need to follow the directions in the Libfdr notes.

The makefile in this directory does both of these things for you. When you look over the file printwords.c, make sure you figure out how to compile it so that it finds fields.h, and so that the compilation links with libfdr.a.


The fields library defines and implements a data structure that simplifies input processing in C. The data structure consists of a type definition and four procedure calls. All are defined in fields.h:

#include <stdio.h>
#define MAXLEN 1001
#define MAXFIELDS 1000

typedef struct inputstruct {
  char *name;               /* File name */
  FILE *f;                  /* File descriptor */
  int line;                 /* Line number */
  char text1[MAXLEN];       /* The line */
  char text2[MAXLEN];       /* Working -- contains fields */
  int NF;                   /* Number of fields */
  char *fields[MAXFIELDS];  /* Pointers to fields */
  int file;                 /* 1 for file, 0 for popen */
} *IS;

extern IS new_inputstruct(/* FILENAME -- NULL for stdin */);
extern IS pipe_inputstruct(/* COMMAND -- NULL for stdin */);
extern int get_line(/* IS */); /* returns NF, or -1 on EOF.  Does not close the file */
extern void jettison_inputstruct(/* IS */);  /* frees the IS and fcloses the file */

To read a file with the fields library, you call new_inputstruct() with the proper filename. New_inputstruct() takes the file name as its argument (NULL for standard input), and returns an IS as a result. Note that the IS is a pointer to a struct inputstruct. This is malloc()'d for you in the new_inputstruct() call. If new_inputstruct() cannot open the file, it will return NULL, and you can call perror() to print out the reason for the failure (read the man page on perror() if you want to learn about it).
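
As a minimal sketch of that error check, here is the same pattern written with plain fopen() instead of new_inputstruct(), so that it compiles without the library. The helper name open_or_report() is made up for this illustration. It relies on the failing call leaving its reason in errno, which is what lets perror() explain the failure.

```c
#include <stdio.h>

/* Hypothetical helper: open a file for reading.  On failure, print the
   reason (taken from errno) to stderr, prefixed with the file name, and
   return NULL.  This mirrors what you do when new_inputstruct() returns NULL. */
FILE *open_or_report(const char *filename)
{
  FILE *f;

  f = fopen(filename, "r");
  if (f == NULL) perror(filename);   /* prints "<filename>: <reason>" */
  return f;
}
```

Passing the file name to perror() is a convention, not a requirement: it makes the error message say which file could not be opened.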

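Once you have an IS, you call get_line() on it to read a line. Get_line() changes the state of the IS to reflect the reading of the line. Specifically, it puts the contents of the line into text1, breaks the line into whitespace-separated words, sets NF to the number of words, and points the first NF entries of fields at those words. It sets line to the line number of the line that it just read. It returns NF as its return value, or -1 when it reaches the end of the file.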

Jettison_inputstruct() closes the file associated with the IS and deallocates (frees) the IS. Do not worry about pipe_inputstruct() for now.
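
To make the idea of "breaking a line into fields" concrete, here is a rough sketch of how a line can be split into whitespace-separated words without allocating any memory. This is not the library's actual code. The function name split_fields() and its parameters are invented for this illustration, and it splits the buffer in place, where the struct comments suggest that the library works on a second copy of the line (text2).

```c
#include <ctype.h>

/* Split buf in place into whitespace-separated words.  Each word is
   null-terminated by overwriting the whitespace character that follows it,
   and fields[i] is set to point at the i-th word inside buf.
   Returns the number of words found (at most maxfields). */
int split_fields(char *buf, char **fields, int maxfields)
{
  int nf = 0;
  char *p = buf;

  while (nf < maxfields) {
    while (*p != '\0' && isspace((unsigned char) *p)) p++;   /* skip whitespace */
    if (*p == '\0') break;                                   /* no more words */
    fields[nf++] = p;                                        /* start of a word */
    while (*p != '\0' && !isspace((unsigned char) *p)) p++;  /* skip over the word */
    if (*p != '\0') *p++ = '\0';                             /* terminate the word */
  }
  return nf;
}
```

Note that the pointers in fields all point into buf. Nothing is copied, so the words are only valid until buf is overwritten.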


These procedures are very convenient for processing input files. For example, the following program (in printwords.c) prints out every word of an input file prepended with its line number:

#include <stdio.h>
#include <stdlib.h>
#include "fields.h"

int main(int argc, char **argv)
{
  IS is;
  int i;

  if (argc != 2) { fprintf(stderr, "usage: printwords filename\n"); exit(1); }
 
  /* Open the file as an inputstruct.  Error check. */

  is = new_inputstruct(argv[1]);
  if (is == NULL) {
    perror(argv[1]);
    exit(1);
  }

  /* Read each line with get_line().  Print out each word. */

  while(get_line(is) >= 0) {
    for (i = 0; i < is->NF; i++) {
      printf("%d: %s\n", is->line, is->fields[i]);
    }
  }

  /* Free up the memory allocated with new_inputstruct, and
     close the open file.  This is not necessary in this program, 
     since we are exiting anyway, but I just want to show how you free it up. */

  jettison_inputstruct(is);
  exit(0);
}

So, for example, if the file rex-1.txt contains the following three lines:

June: Hi ... I missed you!
Rex:  Same here!  You're all I could think about!
June: I was?

Then running printwords on rex-1.txt results in the following output:

UNIX> ./printwords rex-1.txt
1: June:
1: Hi
1: ...
1: I
1: missed
1: you!
2: Rex:
2: Same
2: here!
2: You're
2: all
2: I
2: could
2: think
2: about!
3: June:
3: I
3: was?
UNIX>

One important thing to note about fields.o is that only new_inputstruct() calls malloc(). Get_line() simply fills in the fields of the IS structure --- it does not perform memory allocation. This means that if you want to store a line or its fields, and not have it be overwritten by the next get_line() call, then you need to make a copy of it.

This is very important, so please pay attention to this. The most common mistake that students make with the fields library, and with fgets() in general, is not to make a copy when they need a copy. I'm going to illustrate that bug here, which will also help you with pointers and malloc().

Our goal will be to write the program tail, which prints the last n lines of standard input. The value of n defaults to 10, but you should be able to specify it on the command line. Let's start by writing tail10-bad.c, which will attempt to print out the last 10 lines using the fields library. This will illustrate the common bug that I'm talking about above. Here's the code, which is pretty straightforward. We'll have an array of 10 char *'s, which we'll simply set to is->text1 whenever we read each line:

#include <stdio.h>
#include <stdlib.h>
#include "fields.h"

int main(int argc, char **argv) 
{
  IS is;
  int i, n;
  char *lines[10];
  
  /* Read the lines of standard input, and only keep the last ten. */

  is = new_inputstruct(NULL);
  n = 0;
  while (get_line(is) >= 0) {
    lines[n%10] = is->text1;        /* This is the bad line -- it doesn't copy the string. */
    n++;
  }

  /* Print the last 10 lines, or fewer if there are fewer lines. 
     Remember that is->text1 has a newline at the end. */

  i = (n >= 10) ? (n-10) : 0;
  for ( ; i < n; i++) printf("%s", lines[i%10]);

  exit(0);
}

I have an input file with 15 lines full of random names. Take a look at what happens when I run tail10-bad on it:

UNIX> cat tail-input-15.txt
     1	Elijah Christian Shatterproof
     2	Cameron Ostracod
     3	Ryan Sargent
     4	Christopher Tempest
     5	Aiden Circumferential
     6	Carson Carcass
     7	Caroline Jazz
     8	Molly Jade
     9	Jordan Equivalent MD
    10	Aaron Nagging
    11	Isaac Bandwidth
    12	Leah Bulk
    13	Victoria Glutamate
    14	Lucas Workmen
    15	Sofia Godlike
UNIX> 
UNIX> ./tail10-bad < tail-input-15.txt
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
UNIX> 

That looks like a bug to me -- what has happened is that each entry of lines points to the same (char *) -- this is is->text1, which gets overwritten by each get_line() call.

To highlight this, in tail10-bad-print.c, I have put the following print statement right after I set lines[n%10]:

    printf("I have set lines[%d] to 0x%lx, which is currently %s",
           n%10, (unsigned long) (lines[n%10]), lines[n%10]);

You can see, when I run this, that every entry of lines is equal to the same pointer, which is getting overwritten at each get_line() call:

UNIX> ./tail10-bad-print < tail-input-15.txt
I have set lines[0] to 0x7fc014002614, which is currently      1	Elijah Christian Shatterproof
I have set lines[1] to 0x7fc014002614, which is currently      2	Cameron Ostracod
I have set lines[2] to 0x7fc014002614, which is currently      3	Ryan Sargent
I have set lines[3] to 0x7fc014002614, which is currently      4	Christopher Tempest
I have set lines[4] to 0x7fc014002614, which is currently      5	Aiden Circumferential
I have set lines[5] to 0x7fc014002614, which is currently      6	Carson Carcass
I have set lines[6] to 0x7fc014002614, which is currently      7	Caroline Jazz
I have set lines[7] to 0x7fc014002614, which is currently      8	Molly Jade
I have set lines[8] to 0x7fc014002614, which is currently      9	Jordan Equivalent MD
I have set lines[9] to 0x7fc014002614, which is currently     10	Aaron Nagging
I have set lines[0] to 0x7fc014002614, which is currently     11	Isaac Bandwidth
I have set lines[1] to 0x7fc014002614, which is currently     12	Leah Bulk
I have set lines[2] to 0x7fc014002614, which is currently     13	Victoria Glutamate
I have set lines[3] to 0x7fc014002614, which is currently     14	Lucas Workmen
I have set lines[4] to 0x7fc014002614, which is currently     15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
UNIX> 
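
The same aliasing can be reproduced without the fields library. In the sketch below, next_line() is a hypothetical stand-in for get_line(): it overwrites one shared buffer (playing the role of is->text1) and returns a pointer to it. The helper copy_line() is also made up for this illustration, and it does the same job as strdup().

```c
#include <stdlib.h>
#include <string.h>

static char shared_buf[64];   /* plays the role of is->text1 */

/* Hypothetical stand-in for get_line(): overwrite the one shared buffer
   with the next line and hand back a pointer to it. */
char *next_line(const char *s)
{
  strncpy(shared_buf, s, sizeof(shared_buf) - 1);
  shared_buf[sizeof(shared_buf) - 1] = '\0';
  return shared_buf;
}

/* Allocate space for a private copy of s and copy it, as strdup() does.
   Returns NULL if malloc() fails.  The caller must free() the result. */
char *copy_line(const char *s)
{
  char *copy = malloc(strlen(s) + 1);

  if (copy != NULL) strcpy(copy, s);
  return copy;
}
```

If you save the pointers returned by two successive next_line() calls, they compare equal, and both "lines" read as the second one. If you save copy_line(next_line(...)) instead, each saved pointer is distinct and keeps its own contents.
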
The simple fix is to use strdup(). This will allocate memory for a copy of the line, and then copy the line. The code is in tail10-memory-leak.c, which as you can tell by its name, is going to have some issues of its own. The only change is that we no longer set lines[n%10] to is->text1, but instead we make a copy with strdup():

  is = new_inputstruct(NULL);
  n = 0;
  while (get_line(is) >= 0) {
    lines[n%10] = strdup(is->text1);
    n++;
  }

It runs fine on tail-input-15.txt:

UNIX> ./tail10-memory-leak < tail-input-15.txt 
     6	Carson Carcass
     7	Caroline Jazz
     8	Molly Jade
     9	Jordan Equivalent MD
    10	Aaron Nagging
    11	Isaac Bandwidth
    12	Leah Bulk
    13	Victoria Glutamate
    14	Lucas Workmen
    15	Sofia Godlike
UNIX> 
In fact, it will run just fine on most input. However, as intimated by its name, it has a memory leak. Whenever n is greater than or equal to 10, the strdup() line overwrites the pointer that is currently in lines[n%10], and the pointer is gone forever. The memory that it points to, however, is still allocated, and will not be deallocated until the program exits. That is the very definition of a memory leak. If we run this on input with a lot of lines, the memory usage of the program will blow up, eventually grinding your machine to a halt and/or terminating when strdup() fails.

You can try this on your own machine. This is on my mac -- I have an awk script print an infinite number of lines with X's, and pipe that to tail10-memory-leak:

UNIX> echo "" | awk '{ while (1) print "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" }' | ./tail10-memory-leak  &
Then, I take a look at how the program is running with top:
UNIX> top
PID    COMMAND      %CPU  TIME     #TH   #WQ  #PORT MEM    PURG   CMPRS  PGRP  PPID  STATE
.....  All my running processes ....
Here is how the program is running at 6 seconds, 30 seconds and 60 seconds:
84909  tail10-memor 99.8  00:06.13 1/1   0    12    745M+  0B     0B     84907 79428 running
...
84909  tail10-memor 99.8  00:30.87 1/1   0    12    3807M+ 0B     0B     84907 79428 running
...
84909  tail10-memor 99.8  01:00.83 1/1   0    12    7469M+ 0B     0B     84907 79428 running
The "745M+" says that at 6 seconds, the process is consuming 745 Megabytes of memory. As you can see, that number goes up to 3.8 GB at 30 seconds, and roughly 7.5 GB at 60 seconds. It should need almost no memory, since it only has to maintain the last 10 lines at any point.

So, let's fix it. strdup() calls malloc(), so when you no longer need the string, and are about to overwrite the pointer, free the string. Here's the changed code (in tail10-good.c):

  is = new_inputstruct(NULL);
  n = 0;
  while (get_line(is) >= 0) {
    if (n >= 10) free(lines[n%10]);    /* This line prevents the memory leak. */
    lines[n%10] = strdup(is->text1);
    n++;
  }

Now, when we run it on the infinite input, you'll see that the process size stays stable at 492 KBytes (why this brain-dead program needs half a meg of memory is beyond me, but that's life...)

PID    COMMAND      %CPU  TIME     #TH   #WQ  #PORT MEM    PURG   CMPRS  PGRP  PPID  STATE
85101  tail10-good  99.7  00:07.44 1/1   0    12    492K   0B     0B     85099 79428 running
...
85101  tail10-good  99.9  00:30.15 1/1   0    12    492K   0B     0B     85099 79428 running
...
85101  tail10-good  99.9  01:00.11 1/1   0    12    492K   0B     0B     85099 79428 running
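
The free-before-overwrite pattern from tail10-good.c can be packaged as a small helper. This is only a sketch: the name ring_store() and its parameters are invented for this illustration, and it makes its copy with malloc() and strcpy(), which is equivalent to calling strdup().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: store a private copy of line in slot (count % cap)
   of ring, freeing the copy that previously lived there.  count is the
   number of lines stored so far.  Returns the new count, or -1 if
   malloc() fails (in which case ring is left unchanged). */
int ring_store(char **ring, int cap, int count, const char *line)
{
  char *copy;

  copy = malloc(strlen(line) + 1);
  if (copy == NULL) return -1;
  strcpy(copy, line);

  if (count >= cap) free(ring[count % cap]);   /* slot is occupied: release it first */
  ring[count % cap] = copy;
  return count + 1;
}
```

With this helper, at most cap copies are allocated at any one time, no matter how many lines are read. Once count is at least cap, the oldest surviving line is in slot count % cap.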

tailanyf

Here's the general version of tail, where you specify the number of lines on the command line. This program illustrates a few things that you should be getting used to, such as error checking the command line with sscanf() and allocating an array with malloc(). The program is in tailanyf.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "fields.h"

int main(int argc, char **argv)
{
  char **lastn;
  int nlines, i, n;
  IS is;

  /* Error check the command line. */

  if (argc != 2) { fprintf(stderr, "usage: tailanyf n\n"); exit(1); }
  if (sscanf(argv[1], "%d", &n) != 1 || n <= 0) {   /* sscanf() returns EOF, not 0, on an empty string */
    fprintf(stderr, "usage: tailanyf n\n");
    fprintf(stderr, "       bad n: %s\n", argv[1]);
    exit(1);
  }

  /* Allocate the array */

  lastn = (char **) malloc(sizeof(char *)*n);
  if (lastn == NULL) { perror("malloc"); exit(1); }
 
  /* Allocate the IS */

  is = new_inputstruct(NULL);
  if (is == NULL) { perror("stdin"); exit(1); }

  /* Read the input */

  nlines = 0;
  while (get_line(is) >= 0) {
    if (nlines >= n) free(lastn[nlines%n]);
    lastn[nlines%n] = strdup(is->text1);
    nlines++;
  }

  /* Print the last n lines */

  i = (nlines < n) ? 0 : nlines-n;
  for ( ; i < nlines; i++) {
    printf("%s", lastn[i%n]);
  }

  /* Don't bother freeing stuff when you're just exiting anyway. */

  exit(0);
}
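
One limitation of the sscanf() check above is that "%d" stops at the first character that cannot be part of a number, so an argument like "10abc" is accepted as 10. If you want to reject such input, one option is strtol(), which reports where it stopped parsing. The helper below is a sketch of that approach. The name parse_positive_int() is made up for this illustration and is not part of tailanyf.c.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse s as a positive decimal int.
   Returns 1 and stores the value in *out on success.  Returns 0 if s
   has no digits, has trailing junk, does not fit in an int, or is <= 0. */
int parse_positive_int(const char *s, int *out)
{
  char *end;
  long v;

  errno = 0;
  v = strtol(s, &end, 10);
  if (end == s || *end != '\0') return 0;        /* no digits, or junk after them */
  if (errno == ERANGE || v > INT_MAX) return 0;  /* too big for an int */
  if (v <= 0) return 0;                          /* zero or negative */
  *out = (int) v;
  return 1;
}
```

Like sscanf(), strtol() skips leading whitespace, so " 7" is still accepted.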


pipe_inputstruct()

This lets you read from a pipe that gets opened with popen(). The program pipetest.c uses pipe_inputstruct() to count the number of lines in all the .c files in the current directory. It does this by using pipe_inputstruct() to get the standard output of 'cat *.c' into an inputstruct:

#include <stdio.h>
#include <stdlib.h>
#include "fields.h"

int main()
{
  IS is;
  int nlines;

  is = pipe_inputstruct("cat *.c");
  if (is == NULL) { perror("cat *.c"); exit(1); }

  nlines = 0;
  while (get_line(is) >= 0) nlines++;

  printf("# lines in *.c: %d\n", nlines);
  exit(0);
}
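
For comparison, here is a sketch of the same line count written directly on top of popen(), without the fields library. The helper name count_output_lines() is invented for this illustration. It assumes a POSIX system, since popen() and pclose() are not part of standard C.

```c
#define _POSIX_C_SOURCE 200809L   /* for popen() and pclose() */
#include <stdio.h>

/* Hypothetical helper: run command through the shell and count the
   newline characters on its standard output.  Returns -1 if the pipe
   cannot be opened. */
long count_output_lines(const char *command)
{
  FILE *p;
  long nlines;
  int c;

  p = popen(command, "r");
  if (p == NULL) return -1;

  nlines = 0;
  while ((c = getc(p)) != EOF) {
    if (c == '\n') nlines++;
  }
  pclose(p);     /* a stream from popen() is closed with pclose(), not fclose() */
  return nlines;
}
```

Calling count_output_lines("cat *.c") performs the same count as pipetest.c. Keep in mind that popen() succeeds as long as the shell can be started, so a command that fails (for example, because there are no .c files) simply produces zero lines of output rather than a NULL return.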