CS360 Lecture notes -- Dup

Directory: /blugreen/homes/plank/cs360/notes/Dup

Lecture notes: http://www.cs.utk.edu/~plank/plank/classes/cs360/360/notes/Dup/lecture.html

Open files

First, read section 3.10 in the book. This discusses the various ways you can share open files. Basically, what it says is the following:

The operating system (OS) has three kinds of data structures for files:

``File table entry'' -- it has one of these for each time a file is opened (as dup() shows below, more than one file descriptor can point to the same entry). It contains information like the current lseek pointer, and a pointer to a ``vnode'' (see below). So, when you start your program, the OS has three file table entries for you -- one each for stdin, stdout, stderr. Each time you call open(), a new file table entry is created for you in the OS.
``vnode'' -- There is one of these for each physical file that has been opened. It contains a pointer to the file's inode, the file's size, etc.
``inode'' -- There is one of these for each file on disk. It contains all the information returned by stat().

The difference between a vnode and an inode is where it's located and when it's valid. Inodes are located on disk and are always valid because they contain information that is always needed such as ownership and protection. Vnodes are located in the operating system's memory, and only exist when a file is opened. However, just one vnode exists for every physical file that is opened.
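To make the relationship between file table entries and vnodes concrete, here is a toy model in C. It is only a sketch: the names toy_vnode, toy_file_entry and toy_write are made up for illustration and are not real kernel structures. The point is just that the seek pointer lives in the file table entry, while the file's size lives in the shared vnode.

```c
#include <stddef.h>

/* Toy model only: hypothetical names, not real kernel structures. */
struct toy_vnode {              /* one per physical file that is open */
  size_t size;                  /* current size of the file           */
};

struct toy_file_entry {         /* one per open() call                */
  size_t offset;                /* the "lseek pointer"                */
  struct toy_vnode *vnode;      /* the file this entry refers to      */
};

/* Simulates write(): advances this entry's offset only, and grows the
   shared file size if the write extends past the current end. */
void toy_write(struct toy_file_entry *fe, size_t nbytes)
{
  fe->offset += nbytes;
  if (fe->offset > fe->vnode->size) fe->vnode->size = fe->offset;
}
```

If two toy_file_entry structs point to the same toy_vnode, a toy_write() through one of them leaves the other's offset alone, but both see the same file size.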

So, look at the following program:

#include <fcntl.h>

int main()
{
  int fd1, fd2;

  /* Open the same file twice. */
  fd1 = open("file1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  fd2 = open("file1", O_WRONLY);

  return 0;
}

Now, what has happened? The OS has created two file table entries, one for each open() call, but only one vnode. This is because there is only one file. Both file table entries point to the same vnode, but they each have different seek pointers. Thus, if we expand the above program into: (This is file fs1.c)

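#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main()
{
  int fd1, fd2;

  /* Two separate open() calls: two file table entries, one vnode. */
  fd1 = open("file1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  fd2 = open("file1", O_WRONLY);

  write(fd1, "Jim\n", strlen("Jim\n"));
  write(fd2, "Plank\n", strlen("Plank\n"));

  close(fd1);
  close(fd2);
  return 0;
}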

Then what will happen? Well, the first write() call will write the string "Jim\n" into file1, starting at offset 0. Then the second write() call will also start at offset 0, and will overwrite it with "Plank\n". This is because each fd points to its own file table entry, which has its own lseek pointer, and thus the first write() does not update the lseek pointer of fd2.

To make this more clear, fs1a.c prints out the values of each fd's seek pointer at each step of the program. As you can see, even though the two fd's are for the same file, since they each have their own file table entry, they each have their own seek pointer:

UNIX> fs1
UNIX> cat file1
Plank
UNIX> fs1a
Before writing Jim:   lseek(fd1, 0, 1): 0.  lseek(fd2, 0, 1): 0
Before writing Plank: lseek(fd1, 0, 1): 4.  lseek(fd2, 0, 1): 0
After writing Plank:  lseek(fd1, 0, 1): 4.  lseek(fd2, 0, 1): 6
UNIX> cat file1
Plank
UNIX>
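If you want to check the seek pointers yourself, the following is a minimal sketch of the idea. It is not the code in fs1a.c: the function name two_opens and its parameters are my own invention for illustration. SEEK_CUR is the symbolic name for the 1 that appears in the lseek() calls in the output above.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not the code in fs1a.c. Opens path twice, writes
   through each descriptor, and records both seek pointers after each
   write. Returns 0 on success, -1 on any error. */
int two_opens(const char *path, off_t after_jim[2], off_t after_plank[2])
{
  int fd1, fd2, ok;

  fd1 = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd1 < 0) return -1;
  fd2 = open(path, O_WRONLY);
  if (fd2 < 0) { close(fd1); return -1; }

  ok = (write(fd1, "Jim\n", 4) == 4);
  after_jim[0] = lseek(fd1, 0, SEEK_CUR);   /* fd1 moved past "Jim\n" */
  after_jim[1] = lseek(fd2, 0, SEEK_CUR);   /* fd2 has not moved      */

  ok = ok && (write(fd2, "Plank\n", 6) == 6);
  after_plank[0] = lseek(fd1, 0, SEEK_CUR); /* fd1 is unaffected      */
  after_plank[1] = lseek(fd2, 0, SEEK_CUR); /* fd2 moved past "Plank\n" */

  close(fd1);
  close(fd2);
  return ok ? 0 : -1;
}
```

Called on a scratch file, this should fill after_jim with 4 and 0, and after_plank with 4 and 6, which are the same numbers that fs1a prints.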

Dup()

Now, the system call dup(int fd) duplicates a file descriptor fd. What this does is return a second file descriptor that points to the same file table entry as fd does. So now you can treat the two file descriptors as identical.

Look at an alteration of fs1.c (in fs2.c). Instead of calling open() to initialize fd2, it calls dup(fd1). Thus, after the first write, the lseek pointer of fd2 has been updated to reflect the write to fd1 -- this is because the two file descriptors point to the same file table entry.

Thus, after running fs2.c, the file "file2" should say "Jim\nPlank\n". Like fs1a.c, fs2a.c prints out the lseek pointers of fd1 and fd2 at each step. As you can see, the write() to fd1 updates the lseek pointer for fd2:

UNIX> fs2
UNIX> cat file2
Jim
Plank
UNIX> fs2a
Before writing Jim:   lseek(fd1, 0, 1): 0.  lseek(fd2, 0, 1): 0
Before writing Plank: lseek(fd1, 0, 1): 4.  lseek(fd2, 0, 1): 4
After writing Plank:  lseek(fd1, 0, 1): 10.  lseek(fd2, 0, 1): 10
UNIX> cat file2 
Jim
Plank
UNIX>
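The same experiment with dup() can be sketched as follows. Again, this is not the code in fs2a.c, and the function name dup_offsets is hypothetical. The only real difference from opening the file twice is that fd2 comes from dup(fd1), so both descriptors share one file table entry.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not the code in fs2a.c. Opens path once, dups the
   descriptor, writes through each one, and records both seek pointers
   after each write. Returns 0 on success, -1 on any error. */
int dup_offsets(const char *path, off_t after_jim[2], off_t after_plank[2])
{
  int fd1, fd2, ok;

  fd1 = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd1 < 0) return -1;
  fd2 = dup(fd1);                 /* same file table entry as fd1 */
  if (fd2 < 0) { close(fd1); return -1; }

  ok = (write(fd1, "Jim\n", 4) == 4);
  after_jim[0] = lseek(fd1, 0, SEEK_CUR);
  after_jim[1] = lseek(fd2, 0, SEEK_CUR);   /* moved along with fd1 */

  ok = ok && (write(fd2, "Plank\n", 6) == 6);
  after_plank[0] = lseek(fd1, 0, SEEK_CUR);
  after_plank[1] = lseek(fd2, 0, SEEK_CUR);

  close(fd1);
  close(fd2);
  return ok ? 0 : -1;
}
```

Here both entries of after_jim should be 4 and both entries of after_plank should be 10, matching the fs2a output.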

Now, when fork() is called, ALL FILE DESCRIPTORS ARE DUPLICATED, AS IF dup() WERE CALLED. Thus, look at the following program (fs3.c):

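#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main()
{
  char s[1000];
  int i, fd;

  fd = open("file3", O_WRONLY | O_CREAT | O_TRUNC, 0644);

  i = fork();       /* 0 in the child, the child's pid in the parent */
  sprintf(s, "fork() = %d.\n", i);
  write(fd, s, strlen(s));
  return 0;
}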

What should happen? Well, whichever process gets control of the CPU first after the fork() will write s to file3. Then the other process will append its string s to file3. For example:

UNIX> fs3
UNIX> cat file3
fork() = 0.
fork() = 22107.
UNIX> fs3
UNIX> cat file3
fork() = 0.
fork() = 22110.
UNIX> fs3
UNIX> cat file3
fork() = 22113.
fork() = 0.
UNIX>

Now, this is because the file descriptor fd is duplicated across fork() calls. Were it not duplicated, but instead re-opened, then one write() would overwrite the other.
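One way to convince yourself that parent and child really share a seek pointer is to have the parent look at its own pointer after the child has finished. The sketch below does that. The function name fork_shared_offset and the two message strings are made up for this example. If the descriptor were re-opened rather than duplicated, the parent would only see its own 7 bytes.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: opens path, forks, lets parent and child each do
   one write(), and returns the parent's seek pointer after the child has
   exited. Returns -1 on any error. */
off_t fork_shared_offset(const char *path)
{
  const char *child_msg = "child\n";     /* 6 bytes */
  const char *parent_msg = "parent\n";   /* 7 bytes */
  int fd, status;
  pid_t pid;
  off_t end;

  fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return -1;

  pid = fork();
  if (pid < 0) { close(fd); return -1; }

  if (pid == 0) {                        /* child */
    if (write(fd, child_msg, strlen(child_msg)) < 0) _exit(1);
    _exit(0);
  }

  /* parent */
  if (write(fd, parent_msg, strlen(parent_msg)) < 0) { close(fd); return -1; }
  if (waitpid(pid, &status, 0) < 0) { close(fd); return -1; }

  end = lseek(fd, 0, SEEK_CUR);          /* counts the child's bytes too */
  close(fd);
  return end;
}
```

The order of the two lines in the file depends on scheduling, just as with fs3, but the returned offset should be 13 either way.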

Perhaps you're thinking, ``He opened a file and then called fork(). Does he have to worry about that buffer copying problem in the last lab?'' The answer is no, because I'm using write(), which is a system call, and there is no buffering. You have to worry about the buffering problem when the standard I/O library is being used, and the buffer is not empty when fork() is called. For example, look at fs3a.c and fs3b.c. They use fprintf() instead of write(). When I call them, I get the following:

UNIX> fs3a
UNIX> cat file3
fork() = 0.
fork() = 3716.
UNIX> fs3b
UNIX> cat file3
This is file3
fork() = 3719.
This is file3
fork() = 0.
UNIX>

Do you see where the copied buffer is a problem? Make sure you can explain this phenomenon.

Dup2()

Dup2() is a system call that duplicates an open file descriptor onto a specific file descriptor number that you choose.

int dup2(int fd1, int fd2)

     With dup2(), fd2 specifies the desired value of the new
     descriptor. If descriptor fd2 is already in use, it is
     first deallocated as if it were closed by close(2V).

Dup2() is most often used so that you can redirect standard input or output. When you call dup2(fd, 0) and the dup2() call is successful, then whenever your program reads from standard input, it will read from fd. Similarly, when you call dup2(fd, 1) and the dup2() call is successful, then whenever your program writes to standard output, it will write to fd.
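Here is a minimal sketch of redirecting standard output with dup2(). The function name write_with_stdout_redirected is hypothetical. It deliberately uses write() on fd 1 rather than printf(), so that standard I/O buffering does not enter the picture, and it saves and restores the old fd 1 so that the caller's standard output is left as it was.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: points fd 1 at path, writes msg to fd 1, then puts
   fd 1 back the way it was. Returns 0 on success, -1 on any error. */
int write_with_stdout_redirected(const char *path, const char *msg)
{
  int fd, saved, rc = 0;
  size_t len = strlen(msg);

  fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return -1;

  saved = dup(1);                 /* remember where fd 1 points now */
  if (saved < 0) { close(fd); return -1; }

  if (dup2(fd, 1) < 0) { close(fd); close(saved); return -1; }
  close(fd);                      /* fd 1 still refers to the file  */

  if (write(1, msg, len) != (ssize_t) len) rc = -1;

  if (dup2(saved, 1) < 0) rc = -1;   /* restore the old fd 1 */
  close(saved);
  return rc;
}
```

Note that closing fd right after the dup2() call does not close the file: fd 1 still points to the same file table entry.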

For example, look at dup2ex.c. This opens the file file4 for writing, and then uses dup2() to redirect standard output to that file. When it's done, you'll see that everything has gone into file4:

UNIX> dup2ex
UNIX> cat file4
Standard output now goes to file4
It goes even after we closed file descriptor 3
putchar works
And fwrite
And write
UNIX>

Why did I make the fflush() call in dup2ex.c? Take it out and see. Make sure that you can explain this.

Now, suppose you want to execute, for example, "cat < f1 > f2" by calling fork(), exec() and dup2() instead of doing it from the shell. You can do this in catf1f2.c. This opens f1 for reading on stdin (fd 0), and f2 for writing on stdout (fd 1).

Study this program closely, because you will find it greatly helpful in the jsh lab.
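For reference, the general shape of the fork()/dup2()/exec() pattern can be sketched as below. This is my own minimal version under stated assumptions, not a copy of catf1f2.c: the names run_cat and redirect are hypothetical, the file names are passed in as parameters, and I assume that a cat program can be found on the PATH.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: makes target refer to the same open file as fd,
   then closes fd. Does nothing if fd already is target. */
static int redirect(int fd, int target)
{
  if (fd == target) return 0;
  if (dup2(fd, target) < 0) return -1;
  close(fd);
  return 0;
}

/* Hypothetical helper: runs the equivalent of "cat < in_path > out_path".
   Returns cat's exit status, or -1 on error. */
int run_cat(const char *in_path, const char *out_path)
{
  int fd, status;
  pid_t pid;

  pid = fork();
  if (pid < 0) return -1;

  if (pid == 0) {                           /* child */
    fd = open(in_path, O_RDONLY);
    if (fd < 0 || redirect(fd, 0) < 0) _exit(1);

    fd = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || redirect(fd, 1) < 0) _exit(1);

    execlp("cat", "cat", (char *) NULL);    /* fds 0 and 1 survive the exec */
    _exit(127);                             /* only reached if exec failed  */
  }

  /* parent: its own fds 0 and 1 were never touched */
  if (waitpid(pid, &status, 0) < 0) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The redirection is done in the child, after the fork() and before the exec, so the parent's standard input and output are left alone.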

Note: not all properties of a process change across an exec call. In other words, the new program inherits a number of properties from the calling process:

process ID and parent process ID
real user ID and real group ID
supplementary group IDs
process group ID
session ID
controlling terminal
time left on alarm clock
current working directory
root directory
file mode creation mask
file locks
process signal mask
pending signals
resource limits
file descriptors without close-on-exec flag set

Obviously, here we care the most about open file descriptors. While we say that exec replaces the old process image with a new one, most open file descriptors remain open. Care to venture where all of the above information is stored?
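The last item in the list is something your program controls. If you do not want a descriptor to stay open across an exec call, you can set its close-on-exec flag with fcntl(). A minimal sketch follows; the function name set_cloexec is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: marks fd so that it is closed automatically when
   the process calls exec. Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
  int flags;

  flags = fcntl(fd, F_GETFD);             /* read the descriptor flags */
  if (flags < 0) return -1;
  if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) return -1;
  return 0;
}
```

A descriptor returned by a plain open() does not have this flag set, which is why redirection with dup2() before an exec works.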