In Unix, processes are created with the system call fork(). Apart from the very first process, which the kernel sets up itself at boot, this is traditionally without exception. What fork() does is the following:
It creates a new process which is a copy of the calling process. That means that it copies the caller's memory (code, globals, heap and stack), registers, and open files. The calling process is called the "parent" process, and the newly created process is called the "child" process. The parent returns from fork() with the pid of the child. The child, being a duplicate of the parent, also returns from the fork() call: it has the same memory and registers, and thus the same stack pointer, frame pointer, and program counter, so it too has to return from fork(). It returns with a value of zero. This is how you know which process you're in when fork() returns. (If fork() fails, no child is created and the caller gets back -1.)
In Unix programming, fork() used to be considered quite expensive, and this was one of the major motivations for using threads (light-weight processes). However, process creation has been sped up over the years, especially on Linux, and forking a process is no longer considered expensive enough to warrant special attention, except in a few rare cases. Many techniques have contributed to this, a major one being COW (copy-on-write). With COW, the parent's data, stack and heap are shared by the parent and child, and the kernel changes their protection to read-only. If either process tries to modify these regions, the kernel then makes a copy of just that piece of memory, typically a page. COW is a big win, especially since the first call the child makes after fork() is most likely an exec, a call that replaces the current process image with another program. Aided by other techniques, process creation on Linux has been reported to be faster than thread creation on some other operating systems.
Look at simpfork.c:
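#include <stdio.h>
#include <unistd.h>

int main(void)
{
  int i;

  printf("simpfork: pid = %d\n", getpid());
  i = fork();   /* from here on, two processes run this code */
  printf("Did a fork. It returned %d. getpid = %d. getppid = %d\n",
         i, getpid(), getppid());
  return 0;
}

When it is run, the following happens: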
UNIX> simpfork
simpfork: pid = 914
Did a fork. It returned 915. getpid = 914. getppid = 381
Did a fork. It returned 0. getpid = 915. getppid = 914
UNIX>
So, what is going on? When simpfork is executed, it has a pid of 914. Next it calls fork(), creating a duplicate process with a pid of 915. The parent gains control of the CPU and returns from fork() with a return value of 915, which is the child's pid. It prints out this return value, its own pid, and the pid of its own parent, csh, which is still 381. Then it exits. Next, the child gets the CPU and returns from fork() with a value of 0. It prints out that value, its pid, and the pid of the parent.
Note, there is no guarantee which process gains control of the CPU first after a fork(). It could be the parent, and it could be the child. When I executed simpfork a second time, the child got control first:
UNIX> simpfork
simpfork: pid = 928
Did a fork. It returned 0. getpid = 929. getppid = 928
Did a fork. It returned 929. getpid = 928. getppid = 381
UNIX>

(On some machines, it does appear that the child always gets control first, but you should not rely on such a fact when writing code.)
UNIX> simpfork2
Child. getpid() = 1301, getppid() = 1300
Parent exiting now
UNIX> After sleeping. getpid() = 1301, getppid() = 1

Note that the "UNIX>" prompt returns once the parent exits, even though the child is still running. This is because csh waits only for the parent to complete, not for any other processes. By the time the child wakes up, its parent is gone, so the child has been adopted by init and getppid() returns 1.
UNIX> simpfork3
Before forking: j = 200, K = 300
After forking, child: j = 201, K = 301
After forking, parent: j = 200, K = 300
UNIX>

Interestingly, if we redirect the output of simpfork3 to a file, we see the following behavior:
UNIX> simpfork3 > output
UNIX> cat output
Before forking: j = 200, K = 300
After forking, child: j = 201, K = 301
Before forking: j = 200, K = 300
After forking, parent: j = 200, K = 300
UNIX>

This is explained in the book, and I'll explain it here. When stdout goes to a terminal, it is buffered line by line -- that is, once you do a putchar('\n') or equivalent, the buffer is written to standard output with a "write(1, ...)". However, when stdout is redirected to a file, the stdio library buffers on a coarser scale -- not writing until some large buffer (probably 4K or 8K characters) is full, or until the process exits. Thus, at the time of the fork() call, the "Before forking:" string has not been written to fd=1. Instead, it is still sitting in the buffer of the standard I/O library. That buffer is part of simpfork3's address space, and is thus copied to the child process when fork() is called. Each process then flushes its own copy of the buffer, so the "Before forking: ..." string is written to the file twice. This is an important thing to realize. It looks strange but has a logical explanation.
UNIX> simpfork4
UNIX> cat tmpfile
Before forking
Child: After forking: Seek pointer = 15
Parent: After forking: Seek pointer = 55
UNIX>
This is also explained in the book.
I will go over more on sharing file descriptors in the dup lecture.