CS360 Lecture Notes -- Fields


The fields library is a suite of routines that make reading input easier than using getchar(), scanf() or fgets(). This is a library that I wrote -- it is not standard in Unix, but it should work with any C compiler (including compilers on DOS/Windows). If you want to take the fields library with you after class, go ahead and do so. The source code is in this repo, in the directory "Libfdr".

In order to use the fields procedures in this class, you should include the file fields.h. On the EECS machines, this can be found in the directory /home/jplank/cs360/include. Instead of including the full path name in your C file, just do:

#include "fields.h"

and then compile the program with:

gcc -I/home/jplank/cs360/include

When you link your object files to make an executable, you need to follow the directions in the Libfdr notes.

The makefile in this directory assumes that Libfdr is in the directory ../Libfdr. If you pulled the lecture note repo, everything should be set up, although you may have to do:

( cd ../Libfdr ; make )

so that ../Libfdr/libfdr.a is compiled and ready.

The fields library defines and implements a data structure that simplifies input processing in C. The data structure consists of a type definition and four procedure calls. All are defined in fields.h:

/* The fields library -- making input processing easier */

#include <stdio.h>
#define MAXLEN 1001
#define MAXFIELDS 1000

typedef struct inputstruct {
  const char *name;         /* File name */
  FILE *f;                  /* File descriptor */
  int line;                 /* Line number */
  char text1[MAXLEN];       /* The line */
  char text2[MAXLEN];       /* Working -- contains fields */
  int NF;                   /* Number of fields */
  char *fields[MAXFIELDS];  /* Pointers to fields */
  int file;                 /* 1 for file, 0 for popen */
} *IS;

extern IS new_inputstruct(const char *filename);       /* Use NULL for stdin. Returns NULL on failure. */
extern IS pipe_inputstruct(const char *shell_command); /* Returns NULL on failure. */
extern int get_line(IS inputstruct);                   /* returns NF, or -1 on EOF. */
extern void jettison_inputstruct(IS inputstruct);      /* frees the IS and fcloses/pcloses the file */

To read a file with the fields library, you call new_inputstruct(). New_inputstruct() takes the file name as its argument (NULL for standard input), and returns an IS as a result. Note that the IS is a pointer to a struct inputstruct. This is malloc()'d for you in the new_inputstruct() call. If new_inputstruct() cannot open the file, it will return NULL, and you can call perror() to print out the reason for the failure (read the man page on perror() if you want to learn about it).

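Once you have an IS, you call get_line() on it to read a line. Get_line() changes the state of the IS to reflect the reading of the line. Specifically, is->text1 holds the line as it was read (including the newline at the end), is->NF holds the number of whitespace-separated words on the line, the first NF entries of is->fields point to those words, each null-terminated, and is->line holds the line number. Get_line() returns NF, or -1 when it reaches the end of the file.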

Jettison_inputstruct() closes the file associated with the IS and deallocates (frees) the IS. Do not worry about pipe_inputstruct() for now.


These procedures are very convenient for processing input files. For example, the following program (in src/printwords.c) prints out every word of an input file, preceded by its line number:

/* Use the fields library to print each word of the file named on the command line, labeled with its line number. */

#include <stdio.h>
#include <stdlib.h>
#include "fields.h"

int main(int argc, char **argv)
{
  IS is;
  int i;

  if (argc != 2) { fprintf(stderr, "usage: printwords filename\n"); exit(1); }
 
  /* Open the file as an inputstruct.  Error check. */

  is = new_inputstruct(argv[1]);
  if (is == NULL) {
    perror(argv[1]);
    exit(1);
  }

  /* Read each line with get_line().  Print out each word. */

  while(get_line(is) >= 0) {
    for (i = 0; i < is->NF; i++) {
      printf("%d: %s\n", is->line, is->fields[i]);
    }
  }

  /* Free up the memory allocated with new_inputstruct, and
     close the open file.  This is not necessary in this program, 
     since we are exiting anyway, but I just want to show how you free it up. */

  jettison_inputstruct(is);
  return 0;
}

So, for example, if the file txt/rex-1.txt contains the following three lines:

June: Hi ... I missed you!
Rex:  Same here!  You're all I could think about!
June: I was?

Then running printwords on rex-1.txt results in the following output:

UNIX> bin/printwords txt/rex-1.txt
1: June:
1: Hi
1: ...
1: I
1: missed
1: you!
2: Rex:
2: Same
2: here!
2: You're
2: all
2: I
2: could
2: think
2: about!
3: June:
3: I
3: was?
UNIX>

Malloc is only called during new_inputstruct()

One important thing to note about the fields library is that the only time that malloc() is called is during new_inputstruct(). Get_line() simply fills in the fields of the IS structure --- it does not perform memory allocation. This means that if you want to store a line or its fields, and not have it be overwritten by the next get_line() call, then you need to make a copy of it, typically with strdup().

This is very important, so please pay attention to this. The most common mistake that students make with the fields library, and with fgets() in general, is not to make a copy when they need a copy. I'm going to illustrate that bug here, which will also help you with pointers and malloc().

Our goal will be to write the program tail, which prints the last n lines of standard input. The value of n defaults to 10, but you should be able to specify it on the command line. Let's start by writing src/tail10-bad.c, which will attempt to print out the last 10 lines using the fields library. This will illustrate the common bug that I'm talking about above. Here's the code, which is pretty straightforward. We'll have an array of 10 char *'s, which we'll simply set to is->text1 whenever we read each line:

/* A buggy program to print the last 10 lines of standard input. */

#include <stdio.h>
#include <stdlib.h>
#include "fields.h"

int main(int argc, char **argv) 
{
  IS is;
  int i, n;
  char *lines[10];    /* This array will hold the last 10 lines of standard input. */
  
  /* Read the lines of standard input, and only keep the last ten. */

  is = new_inputstruct(NULL);
  n = 0;
  while (get_line(is) >= 0) {
    lines[n%10] = is->text1;        /* This is the bad line -- it doesn't copy the string. */
    n++;
  }

  /* Print the last 10 lines, or fewer if there are fewer lines. 
     Remember that is->text1 has a newline at the end. */

  i = (n >= 10) ? (n-10) : 0;                      /* This is the line number of the 10th line from the end. */
  for ( ; i < n; i++) printf("%s", lines[i%10]);   /* Print this line to the last line. */

  return 0;
}

I have an input file with 15 lines full of random names. Take a look at what happens when I run tail10-bad on it:

UNIX> cat txt/tail-input-15.txt
     1	Elijah Christian Shatterproof
     2	Cameron Ostracod
     3	Ryan Sargent
     4	Christopher Tempest
     5	Aiden Circumferential
     6	Carson Carcass
     7	Caroline Jazz
     8	Molly Jade
     9	Jordan Equivalent MD
    10	Aaron Nagging
    11	Isaac Bandwidth
    12	Leah Bulk
    13	Victoria Glutamate
    14	Lucas Workmen
    15	Sofia Godlike
UNIX> 
UNIX> bin/tail10-bad < txt/tail-input-15.txt
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
UNIX> 

That looks like a bug to me -- what has happened is that each entry of lines points to the same (char *) -- this is is->text1, which gets overwritten by each get_line() call.

To highlight this, in src/tail10-bad-print.c, I have put the following print statement right after I set lines[n%10]:

    printf("I have set lines[%d] to 0x%lx, which is currently %s",
           n%10, (unsigned long) (lines[n%10]), lines[n%10]);

You can see, when I run this, that every entry of lines is equal to the same pointer, which is getting overwritten at each get_line() call:

UNIX> bin/tail10-bad-print < txt/tail-input-15.txt
I have set lines[0] to 0x7fc014002614, which is currently      1	Elijah Christian Shatterproof
I have set lines[1] to 0x7fc014002614, which is currently      2	Cameron Ostracod
I have set lines[2] to 0x7fc014002614, which is currently      3	Ryan Sargent
I have set lines[3] to 0x7fc014002614, which is currently      4	Christopher Tempest
I have set lines[4] to 0x7fc014002614, which is currently      5	Aiden Circumferential
I have set lines[5] to 0x7fc014002614, which is currently      6	Carson Carcass
I have set lines[6] to 0x7fc014002614, which is currently      7	Caroline Jazz
I have set lines[7] to 0x7fc014002614, which is currently      8	Molly Jade
I have set lines[8] to 0x7fc014002614, which is currently      9	Jordan Equivalent MD
I have set lines[9] to 0x7fc014002614, which is currently     10	Aaron Nagging
I have set lines[0] to 0x7fc014002614, which is currently     11	Isaac Bandwidth
I have set lines[1] to 0x7fc014002614, which is currently     12	Leah Bulk
I have set lines[2] to 0x7fc014002614, which is currently     13	Victoria Glutamate
I have set lines[3] to 0x7fc014002614, which is currently     14	Lucas Workmen
I have set lines[4] to 0x7fc014002614, which is currently     15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
    15	Sofia Godlike
UNIX> 

The simple fix is to use strdup(). This will allocate memory for a copy of the line, and then copy the line. The code is in src/tail10-memory-leak.c, which as you can tell by its name, is going to have some issues of its own. The only change to the loop is that we no longer set lines[n%10] to is->text1, but instead we make a copy with strdup() (which is declared in string.h):

  is = new_inputstruct(NULL);
  n = 0;
  while (get_line(is) >= 0) {
    lines[n%10] = strdup(is->text1);       /* This is the only change - we call strdup(). */
    n++;
  }

It runs fine on tail-input-15.txt:

UNIX> bin/tail10-memory-leak < txt/tail-input-15.txt 
     6	Carson Carcass
     7	Caroline Jazz
     8	Molly Jade
     9	Jordan Equivalent MD
    10	Aaron Nagging
    11	Isaac Bandwidth
    12	Leah Bulk
    13	Victoria Glutamate
    14	Lucas Workmen
    15	Sofia Godlike
UNIX> 

In fact, it will run just fine on most input. However, as intimated by its name, it has a memory leak. Whenever n is greater than or equal to 10, the strdup() line overwrites the pointer that is currently in lines[n%10], and the pointer is gone forever. The memory that it points to, however, is still allocated, and will not be deallocated until the program exits. That is the very definition of a memory leak. If we run this on input with a lot of lines, the memory usage of the program will blow up, eventually grinding your machine to a halt and/or terminating when strdup() fails.

You can try this on your own machine. This is on my Mac -- the awk script prints lines of X's forever, and we pipe that to bin/tail10-memory-leak:

UNIX> echo "" | awk '{ while (1) print "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" }' | bin/tail10-memory-leak  &

Then, I take a look at how the program is running with top:

UNIX> top
PID    COMMAND      %CPU  TIME     #TH   #WQ  #PORT MEM    PURG   CMPRS  PGRP  PPID  STATE
.....  All my running processes ....

Here is how the program is running at 6 seconds, 30 seconds and 60 seconds:

84909  tail10-memor 99.8  00:06.13 1/1   0    12    745M+  0B     0B     84907 79428 running
...
84909  tail10-memor 99.8  00:30.87 1/1   0    12    3807M+ 0B     0B     8490  79428 running
...
84909  tail10-memor 99.8  01:00.83 1/1   0    12    7469M+ 0B     0B     84907 79428 running

The "745M+" says that at 6 seconds, the process is consuming 745 megabytes of memory. As you can see, that number goes up to 3.8 GB, and then to 7.4 GB. The program should need only a small, constant amount of memory, since it only has to keep the last 10 lines at any point.

So, let's fix it. strdup() calls malloc(), so when you no longer need the string, and are about to overwrite the pointer, free the string. Here's the changed code (in src/tail10-good.c):

  is = new_inputstruct(NULL);
  n = 0;
  while (get_line(is) >= 0) {
    if (n >= 10) free(lines[n%10]);    /* This line prevents the memory leak. */
    lines[n%10] = strdup(is->text1);
    n++;
  }

Now, when we run it on the infinite input, you'll see that the process size stays stable at 492 KBytes (why this brain-dead program needs half a meg of memory is beyond me, but that's life...)

PID    COMMAND      %CPU  TIME     #TH   #WQ  #PORT MEM    PURG   CMPRS  PGRP  PPID  STATE
85101  tail10-good  99.7  00:07.44 1/1   0    12    492K   0B     0B     85099 79428 running
...
85101  tail10-good  99.9  00:30.15 1/1   0    12    492K   0B     0B     85099 79428 running
...
85101  tail10-good  99.9  01:00.11 1/1   0    12    492K   0B     0B     85099 79428 running

tailanyf

I often don't go over this program in class. It is a straightforward extension of the program tail10-good.c.

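Here's the general version of tail, where you specify the number of lines on the command line. This program illustrates a few things that you should be getting used to: converting and error checking a command line argument with sscanf(), using malloc() to allocate an array whose size isn't known until the program runs, and freeing a strdup()'d string before overwriting the pointer to it.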

The program is in src/tailanyf.c:

/* This program is more like tail -- it takes the number of lines, n, 
   as a command line argument, and prints the last n lines of standard input. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "fields.h"

int main(int argc, char **argv)
{
  char **lastn;
  int nlines, i, n;
  IS is;

  /* Error check the command line. */

  if (argc != 2) { fprintf(stderr, "usage: tailanyf n\n"); exit(1); }
  if (sscanf(argv[1], "%d", &n) != 1 || n <= 0) {   /* sscanf() returns EOF, not 0, on an empty string. */
    fprintf(stderr, "usage: tailanyf n\n");
    fprintf(stderr, "       bad n: %s\n", argv[1]);
    exit(1);
  }

  /* Allocate the array */

  lastn = (char **) malloc(sizeof(char *)*n);
  if (lastn == NULL) { perror("malloc"); exit(1); }
 
  /* Allocate the IS */

  is = new_inputstruct(NULL);
  if (is == NULL) { perror("stdin"); exit(1); }

  /* Read the input */

  nlines = 0;
  while (get_line(is) >= 0) {
    if (nlines >= n) free(lastn[nlines%n]);     /* Prevent the memory leak. */
    lastn[nlines%n] = strdup(is->text1);
    nlines++;
  }

  /* Print the last n lines */

  i = (nlines < n) ? 0 : nlines-n;
  for ( ; i < nlines; i++) {
    printf("%s", lastn[i%n]);
  }

  /* Don't bother freeing stuff when you're just exiting anyway. */

  return 0;
}


pipe_inputstruct()

I also don't typically go over this program in class, but just keep it here for reference in case you want to use pipe_inputstruct().

This lets you read from a pipe that gets opened with popen(). The program src/pipetest.c uses pipe_inputstruct() to count the number of lines in all the .c files in the src directory. It does this by using pipe_inputstruct() to get the standard output of "cat src/*.c" into an inputstruct:

/* pipetest.c counts the number of lines in all the .c files in the  
   src directory.  It does this by using pipe_inputstruct to get
   the standard output of the cat command into an inputstruct */

#include <stdio.h>
#include <stdlib.h>
#include "fields.h"

int main()
{
  IS is;
  int nlines;

  is = pipe_inputstruct("cat src/*.c");
  if (is == NULL) { perror("cat src/*.c"); exit(1); }

  nlines = 0;
  while (get_line(is) >= 0) nlines++;

  printf("# lines in src/*.c: %d\n", nlines);
 
  return 0;
}

In case you were wondering what happens when you put "/*" into a string -- this does compile and run correctly. Would you have bet a family member on that?