CS360 Lecture notes -- Thread #3 -- Condition Variables; The printer simulation


In this lecture, we go over a simulation program that uses the producer/consumer paradigm with a bounded buffer, which requires monitors/condition variables.

The Printer Simulation Problem

This lecture revolves around a simulation that we will write. This is a simulation of a system that has users and printers and a fixed-size buffer that connects them. In particular, we have nusers users, and nprinters printers. We'll assume that all the printers are identical (e.g. in a machine room) so that when a user wants to print something out, it doesn't matter which printer it comes out on.

Now, in our simulation, every so often, a user will decide to print something. When this happens, the print job will be submitted, and if any printer is available, it will print the job, taking 4 seconds a page. If all the printers are printing something, then the job will be queued until one of the printers is ready. The print queue will be of a fixed size (this is what makes it a "bounded buffer" problem). If the queue is full, then the user must wait until the queue is not full to submit the job.

We are going to use threads for this simulation. Each user will have its own thread, and each printer will have its own thread. The threads will communicate through shared memory.


printqsim

This program has a specific format, which the "Bonding" lab will share. There is a header file, in this case include/printqsim.h, that defines some data structures plus some subroutine prototypes. In the labs, these subroutines are the ones that you have to write to make the lab work.

There is also a driver program, in this case src/printqsim.c. This defines a main() routine which sets up the threads. Together with your definitions of the subroutines, the driver program will solve the problem.

You are not allowed to change either the header or driver files. Instead, you are to provide a C file that defines the subroutines in the header file, and when this is compiled with the driver program, the resulting program solves the problem.

In this case, our job is to define initialize_v(), submit_job() and get_print_job() so that together with printqsim.c, our program performs the user/printer simulation correctly. I'll go into more detail later.

For now, let's look at src/printqsim.c. Its main() takes 6 command-line arguments:

  1. nusers: The number of users.
  2. nprinters: The number of printers.
  3. arrtime: The average time that users will take between submitting print jobs.
  4. maxpages: The maximum size of a print job (in pages).
  5. bufsize: The size of the print queue.
  6. nevents: The number of print jobs that each user will make.
The main() routine sets up a Spq struct. This is defined in include/printqsim.h:

/* This is the main struct for the simulation.  Each thread gets its own 
   copy of this struct, and every copy is identical with the exception of the id.
   You'll note that there's a (void *) here.  That's what you get to define and
   use in your procedures. */

typedef struct {
  int nusers;          // Number of users in the simulation
  int nprinters;       // Number of printers in the simulation
  int arrtime;         // Average interarrival time of printer jobs from a user.
  int maxpages;        // Maximum size of a printer job.
  int bufsize;         // Size of the printer queue.
  int nevents;         // Number of print jobs that each user will generate in the simulation.
  int id;              // Id of the user/printer. 
  int starttime;       // Starting time of the simulation (you can ignore this) 
  pthread_mutex_t *stdiolock;   // A mutex to protect stdio (you can ignore this)
  void *v;             // You get to define this and use it.
} Spq;

Each user and printer thread will receive a pointer to one of these structs as its argument, and all information that the thread needs will be in this struct. Each of the command line arguments has a field in the Spq struct, plus there are four extra fields (id, starttime, stdiolock and v), which are described in the comments above.

After setting up the Spq struct, the main thread calls initialize_v(), again defined in the header file:

/* This is called before the threads are created.  The 
   argument is the initial Spq struct, which is copied to all of
   the Spq structs that go to the users/printers. */

extern void initialize_v(Spq *s);

You are to write this procedure, and your job is to allocate and initialize the (void *) however you please. You'll use that to implement the other two procedures.

Next, the main thread sets up a random number generator, and then creates nusers user threads, and nprinters printer threads. Each thread gets its own copy of the Spq struct as its argument. The only thing that differs in each Spq struct is the id. Everything else (including the pointer to v) is the same. Finally, the main thread exits, leaving only the user and printer threads.
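To make the "copy of the struct" point concrete, here is a minimal sketch in C. Every name in it (Args, Shared, worker, run_threads) is hypothetical, and Args stands in for Spq with only id and v kept. Each thread gets its own copy of the struct, so the ids differ, but copying a struct copies the pointer v rather than what it points to, so all threads reach the same shared object.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for Spq, keeping only the two fields that matter here. */
typedef struct {
  int id;     /* differs in every copy */
  void *v;    /* the same pointer in every copy */
} Args;

/* Hypothetical shared state that every copy's v points to. */
typedef struct {
  pthread_mutex_t lock;
  int count;    /* how many threads have run */
  int id_sum;   /* sum of the ids the threads saw */
} Shared;

static void *worker(void *arg)
{
  Args *a = (Args *) arg;         /* this thread's private copy */
  Shared *sh = (Shared *) a->v;   /* the state all threads share */

  pthread_mutex_lock(&sh->lock);
  sh->count++;
  sh->id_sum += a->id;
  pthread_mutex_unlock(&sh->lock);
  return NULL;
}

/* Gives each of n threads its own copy of one prototype struct, changing
   only the id, then waits for them all and returns the shared state. */
static Shared *run_threads(int n)
{
  Shared *sh;
  Args proto, *copies;
  pthread_t *tids;
  int i;

  assert(n > 0);
  sh = malloc(sizeof(Shared));
  copies = malloc(sizeof(Args) * n);
  tids = malloc(sizeof(pthread_t) * n);
  if (sh == NULL || copies == NULL || tids == NULL) abort();

  pthread_mutex_init(&sh->lock, NULL);
  sh->count = 0;
  sh->id_sum = 0;
  proto.id = 0;
  proto.v = sh;

  for (i = 0; i < n; i++) {
    copies[i] = proto;   /* copies the pointer v, not the Shared it points to */
    copies[i].id = i;
    if (pthread_create(&tids[i], NULL, worker, &copies[i]) != 0) abort();
  }
  for (i = 0; i < n; i++) pthread_join(tids[i], NULL);
  free(copies);          /* safe only because every thread has been joined */
  free(tids);
  return sh;
}
```

Unlike the driver, whose main thread simply exits, this sketch joins its threads so that it can check the result and free the copies; freeing the copies while threads were still using them would be a bug.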

User threads

Each user thread does the same thing. It iterates for nevents iterations. In each iteration, it sleeps for a random period of time (between 1 and arrtime*2 seconds -- this yields a mean waiting time of roughly arrtime), and then submits a print job. This job is represented by a Job struct (defined in the header file):

/* This struct defines a print job.  When a user submits a job, it submits one
   of these.  When a printer "gets" a job, it gets one of these. */

typedef struct {
  int jobsize;         // Number of pages to print.
  int userid;          // Id of the user (duh)
  int jobid;           // Print job number.
} Job;

The user thread submits the job using submit_job():

/* This is what a user calls to submit a job.  It should only return when
   the job is on the printer queue. */

extern void submit_job(Spq *s, Job *j); 

It's your job to write submit_job() using the information in the Spq (including your (void *)). After submitting nevents jobs, the user thread exits. The user thread prints out when it sleeps, and when it submits a job.
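The sleep-time claim can be checked with a little arithmetic. Assuming the driver draws an integer uniformly from 1 to arrtime*2 (the notes do not show its exact formula, so user_delay() below is a hypothetical reconstruction), the mean is (2*arrtime + 1)/2 = arrtime + 0.5, which is why the mean is only roughly arrtime. A sketch:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical: draw a sleep time from 1..2*arrtime.  rand() % n has a
   small modulo bias, which does not matter for a simulation like this. */
static int user_delay(int arrtime)
{
  assert(arrtime > 0);
  return 1 + rand() % (2 * arrtime);
}

/* The exact mean of the integers 1..2*arrtime, computed by brute force.
   It equals (2*arrtime + 1) / 2, i.e. arrtime + 0.5. */
static double exact_mean(int arrtime)
{
  long sum = 0;
  int k;

  for (k = 1; k <= 2 * arrtime; k++) sum += k;
  return (double) sum / (2 * arrtime);
}
```

With arrtime = 5, as in the runs below, the delays range from 1 to 10 seconds and the mean is 5.5 seconds.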

Printer threads

Each printer thread does the same thing. It iterates forever. Each time, it gets a job using get_print_job(), and it simulates printing the job by sleeping for 4 seconds for each page. After printing, it repeats the process. The printer thread prints out when it asks for a job, and when it prints a job.

The prototype for get_print_job() is below:

/* This is what a printer calls to get jobs to print.  It returns a job
   to print (otherwise, it blocks until it has a job to print). When the simulation
   is over, it will return NULL. */

extern Job *get_print_job(Spq *);

You write get_print_job() as well.
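The header comments imply a simple contract for the printer loop: keep calling get_print_job() until it returns NULL. Here is a sketch of that loop in C. toy_get_print_job() is a hypothetical single-threaded stand-in that hands out a fixed list of jobs and then returns NULL; the real driver's loop is in src/printqsim.c and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

typedef struct {
  int jobsize;
  int userid;
  int jobid;
} Job;

/* Hypothetical stand-in for get_print_job(): hands out a fixed list of
   jobs, then returns NULL to mean "the simulation is over". */
static Job jobs[] = { {3, 1, 0}, {4, 3, 0}, {2, 0, 0} };
static int next_job = 0;

static Job *toy_get_print_job(void)
{
  if (next_job == (int) (sizeof(jobs) / sizeof(jobs[0]))) return NULL;
  return &jobs[next_job++];
}

/* Sketch of a printer thread's loop: get a job, "print" it by sleeping
   seconds_per_page per page (4 in the simulation), and stop on NULL.
   Returns the number of pages printed. */
static int printer_loop(unsigned int seconds_per_page)
{
  Job *j;
  int pages = 0;

  while ((j = toy_get_print_job()) != NULL) {
    assert(j->jobsize > 0);
    sleep(seconds_per_page * j->jobsize);
    pages += j->jobsize;
  }
  return pages;
}
```

printer_loop(4) would print at the simulation's 4 seconds per page; printer_loop(0) runs the same loop without waiting, which is handy when testing.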

Now, let's walk through how I'd solve this problem if it were a lab. I'm going to illustrate using mutexes and condition variables along the way.


A dummy solution

To reiterate, were this a lab, your job would be to write initialize_v(), submit_job() and get_print_job() so that they work with include/printqsim.h and src/printqsim.c. You would not be allowed to modify include/printqsim.h or src/printqsim.c.

Now, look at src/ps1.c.

This is one solution to the problem. It's not a working solution, but it is one that will compile and run. What it does is set s->v to NULL, ignore print jobs when they are submitted, and force the printer threads to exit.

void initialize_v(Spq *s)
{
  s->v = NULL;
}

void submit_job(Spq *s, Job *j)
{
  return;
}

Job *get_print_job(Spq *s)
{
  return NULL;
}

Try running it:

UNIX> bin/ps1 5 3 5 5 5 3
   0: user  0/000: Sleeping for  6 seconds
   0: user  1/000: Sleeping for  7 seconds
   0: user  2/000: Sleeping for  6 seconds
   0: user  3/000: Sleeping for  1 seconds
   0: user  4/000: Sleeping for 10 seconds
   0: prnt  0/000: ready to print
   0: prnt  0/000: Done
   0: prnt  1/000: ready to print
   0: prnt  1/000: Done
   0: prnt  2/000: ready to print
   0: prnt  2/000: Done
   1: user  3/000: Submitting a job with size 4
   1: user  3/001: Sleeping for  7 seconds
   6: user  2/000: Submitting a job with size 5
   6: user  2/001: Sleeping for  8 seconds
   6: user  0/000: Submitting a job with size 2
   6: user  0/001: Sleeping for  8 seconds
   7: user  1/000: Submitting a job with size 5
   7: user  1/001: Sleeping for  6 seconds
   8: user  3/001: Submitting a job with size 2
   8: user  3/002: Sleeping for  4 seconds
  10: user  4/000: Submitting a job with size 5
  10: user  4/001: Sleeping for  8 seconds
  12: user  3/002: Submitting a job with size 3
  12: user  3/003: Done
  13: user  1/001: Submitting a job with size 3
  13: user  1/002: Sleeping for  5 seconds
  14: user  0/001: Submitting a job with size 1
  14: user  0/002: Sleeping for 10 seconds
  14: user  2/001: Submitting a job with size 5
  14: user  2/002: Sleeping for  9 seconds
  18: user  4/001: Submitting a job with size 5
  18: user  4/002: Sleeping for  6 seconds
  18: user  1/002: Submitting a job with size 3
  18: user  1/003: Done
  23: user  2/002: Submitting a job with size 4
  23: user  2/003: Done
  24: user  0/002: Submitting a job with size 2
  24: user  0/003: Done
  24: user  4/002: Submitting a job with size 4
  24: user  4/003: Done
UNIX> 
This created a simulation with 5 users, 3 printers, an average of 5 seconds between print jobs, a maximum job size of 5 pages, a print queue size of 5, and three print jobs per user.

You'll note that the simulation did run, but not correctly. Why? Well, the printers never printed anything, for starters. Moreover, since no printer ever removed a job, the queue should have been full after 5 submissions, and later users should have had to wait. Instead, all 15 jobs were accepted.

This may seem like a boneheaded example, but it illustrates something important -- solutions to a problem may compile and run, but you have to check their output for correctness. I will provide "solutions" like this one for your thread lab that will be incorrect, but give you a starting point.


ps2: Starting on a real solution

To actually solve this problem, you'll need to make use of that (void *) named v. You'll need to set up a queue of print jobs. This queue will have bufsize elements. When a user submits a job, if there are fewer than bufsize elements in the queue, you will put the job there. Otherwise, you'll have to wait for a printer to remove one of the jobs.

Since you have multiple threads accessing the buffer, you'll need to protect it with a mutex. The above is all done in src/ps2.c.

First, it defines a Buffer struct that uses an array as a circular queue, with head, tail and njobs defining the state of the queue. It also has a mutex.

/* We're going to define a Buffer here, and store it in the void *.
   It has a fixed size array of job pointers, plus head, tail, and
   njobs variables to define a queue of jobs within the buffer.
   Finally, there is a mutex to protect access to the buffer. */

typedef struct {
  Job **b;       
  int head;
  int tail;
  int njobs;
  pthread_mutex_t *lock;
} Buffer;
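The head/tail/njobs bookkeeping is easiest to see without threads. Here is a sketch of the circular queue on its own. Ring, ring_new(), ring_put() and ring_get() are hypothetical names; Ring adds a size field because there is no Spq to read bufsize from, and it omits the mutex, so in the threaded code the caller would hold b->lock around each call. The writer advances head, the reader advances tail, and njobs distinguishes a full queue from an empty one (head == tail in both cases).

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
  int jobsize;
  int userid;
  int jobid;
} Job;

typedef struct {
  Job **b;
  int head;    /* next slot to write */
  int tail;    /* next slot to read */
  int njobs;   /* number of jobs currently queued */
  int size;    /* capacity (bufsize) */
} Ring;

static Ring *ring_new(int size)
{
  Ring *r = malloc(sizeof(Ring));

  if (r == NULL) abort();
  r->b = malloc(sizeof(Job *) * size);
  if (r->b == NULL) abort();
  r->head = r->tail = r->njobs = 0;
  r->size = size;
  return r;
}

/* Returns 1 on success, 0 if the ring is full. */
static int ring_put(Ring *r, Job *j)
{
  if (r->njobs == r->size) return 0;
  r->b[r->head] = j;
  r->head = (r->head + 1) % r->size;   /* wrap around at the end */
  r->njobs++;
  assert(r->njobs <= r->size);
  return 1;
}

/* Returns the oldest job, or NULL if the ring is empty. */
static Job *ring_get(Ring *r)
{
  Job *j;

  if (r->njobs == 0) return NULL;
  j = r->b[r->tail];
  r->tail = (r->tail + 1) % r->size;
  r->njobs--;
  return j;
}
```

Putting three jobs into a ring of size 2 fails on the third; after one ring_get(), the third put succeeds and wraps head back around.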

In initialize_v(), the buffer is allocated and initialized, and v is set to be the buffer.

/* Initialize_v() sets up the buffer and then copies the pointer
   to the buffer to the (void *). */

void initialize_v(Spq *s)
{
  Buffer *b;

  b = (Buffer *) malloc(sizeof(Buffer));
  b->b = (Job **) malloc(sizeof(Job *)*s->bufsize);
  b->head = 0;
  b->tail = 0;
  b->njobs = 0;
  b->lock = (pthread_mutex_t *) malloc(sizeof(pthread_mutex_t));
  pthread_mutex_init(b->lock, NULL);
  s->v = (void *) b;
}

Moreover, submit_job() now inserts the job into the buffer if there's room. If there's not room, the user thread exits.

/* This puts a job onto the buffer if there's room.
   If there's not room, it simply kills the thread. */

void submit_job(Spq *s, Job *j)
{
  Buffer *b;

  b = (Buffer *) s->v;

  while(1) {
    pthread_mutex_lock(b->lock);
    if (b->njobs < s->bufsize) {
      b->b[b->head] = j;
      b->head = (b->head + 1) % s->bufsize;
      b->njobs++;
      pthread_mutex_unlock(b->lock);
      return;
    } else {
      pthread_mutex_unlock(b->lock);
      printf("%4ld: user %2d -- the queue is full -- exiting\n", time(0)-s->starttime, s->id);
      fflush(stdout);
      pthread_exit(NULL);
    }
  }
}

Nothing is done with get_print_job() -- it still returns NULL which makes the printers exit. This is an example of programming incrementally -- you try one thing and test it to make sure it works before going on.

When we call this with the same arguments as before, we see that 5 jobs get submitted, and then the users all exit. This is what we expect, so the code is working:

UNIX> bin/ps2 5 3 5 5 5 3
   0: user  0/000: Sleeping for 10 seconds
   0: user  1/000: Sleeping for  5 seconds
   0: user  2/000: Sleeping for  8 seconds
   0: user  3/000: Sleeping for  3 seconds
   0: user  4/000: Sleeping for  6 seconds
   0: prnt  0/000: ready to print
   0: prnt  0/000: Done
   0: prnt  1/000: ready to print
   0: prnt  1/000: Done
   0: prnt  2/000: ready to print
   0: prnt  2/000: Done
   3: user  3/000: Submitting a job with size 2
   3: user  3/001: Sleeping for  1 seconds
   4: user  3/001: Submitting a job with size 2
   4: user  3/002: Sleeping for  6 seconds
   5: user  1/000: Submitting a job with size 5
   5: user  1/001: Sleeping for  6 seconds
   6: user  4/000: Submitting a job with size 2
   6: user  4/001: Sleeping for  2 seconds
   8: user  2/000: Submitting a job with size 2
   8: user  2/001: Sleeping for  6 seconds
   8: user  4/001: Submitting a job with size 3
   8: user  4 -- the queue is full -- exiting
  10: user  3/002: Submitting a job with size 3
  10: user  3 -- the queue is full -- exiting
  10: user  0/000: Submitting a job with size 5
  10: user  0 -- the queue is full -- exiting
  11: user  1/001: Submitting a job with size 3
  11: user  1 -- the queue is full -- exiting
  14: user  2/001: Submitting a job with size 5
  14: user  2 -- the queue is full -- exiting
UNIX> 

A semi-working solution

Now the question is -- what should we do when the queue is full? Moreover, when we start writing get_print_job(), what do we do when the queue is empty and there are no jobs to print? Well, ps3.c provides one solution. It is not a good solution, but it works. When submit_job() is called and the queue is full, the mutex is released, and sleep(1) is called. Then the queue is checked again. In this way, if a printer thread calls get_print_job() during that second, it can take a job off the queue, and then the user's job may be submitted. Similarly, when the queue is empty and a printer calls get_print_job(), it sleeps for a second and checks again. Note that it has to release the mutex before it sleeps so that a user thread can actually put a job on the queue.

Here's the code for submit_job(). You can see the new code that has the thread sleep when the queue is full:

void submit_job(Spq *s, Job *j)
{
  Buffer *b;

  b = (Buffer *) s->v;

  while(1) {
    pthread_mutex_lock(b->lock);
    if (b->njobs < s->bufsize) {
      b->b[b->head] = j;
      b->head = (b->head + 1) % s->bufsize;
      b->njobs++;
      pthread_mutex_unlock(b->lock);
      return;
    } else {         // Here's the new code -- when the queue is full, unlock the mutex and sleep.
      pthread_mutex_unlock(b->lock);
      printf("%4ld: user %2d sleeping because the queue is full\n", time(0)-s->starttime, s->id);
      fflush(stdout);
      sleep(1);
    }
  }
}
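The printer side of the polling solution is not shown. Here is a sketch of what it might look like, assuming jobs are removed at tail (the ps2 code only shows insertion at head). The Spq here is a hypothetical cut-down version with just id, bufsize and v, and the message omits the timestamp.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef struct { int jobsize; int userid; int jobid; } Job;
typedef struct { int id; int bufsize; void *v; } Spq;   /* hypothetical cut-down Spq */

typedef struct {
  Job **b;
  int head;
  int tail;
  int njobs;
  pthread_mutex_t *lock;
} Buffer;

/* Polling version: if the queue is empty, release the mutex, sleep for a
   second, and look again. */
Job *get_print_job(Spq *s)
{
  Buffer *b = (Buffer *) s->v;
  Job *j;

  while (1) {
    pthread_mutex_lock(b->lock);
    assert(b->njobs >= 0 && b->njobs <= s->bufsize);
    if (b->njobs > 0) {
      j = b->b[b->tail];
      b->tail = (b->tail + 1) % s->bufsize;
      b->njobs--;
      pthread_mutex_unlock(b->lock);
      return j;
    }
    pthread_mutex_unlock(b->lock);   /* must unlock, or no user could add a job */
    printf("prnt %2d sleeping because the queue is empty\n", s->id);
    fflush(stdout);
    sleep(1);                        /* polling: check again in a second */
  }
}
```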

The code works. Try it out:

UNIX> bin/ps3 5 3 5 5 5 3
   0: user  0/000: Sleeping for 10 seconds
   0: user  1/000: Sleeping for  1 seconds
   0: user  2/000: Sleeping for  4 seconds
   0: user  3/000: Sleeping for  1 seconds
   0: user  4/000: Sleeping for 10 seconds
   0: prnt  0/000: ready to print
   0: prnt  0 sleeping because the queue is empty    # The queue is empty, so the printers sleep.
   0: prnt  1/000: ready to print
   0: prnt  1 sleeping because the queue is empty
   0: prnt  2/000: ready to print
   0: prnt  2 sleeping because the queue is empty
   1: user  1/000: Submitting a job with size 3
   1: user  1/001: Sleeping for  7 seconds
   1: user  3/000: Submitting a job with size 4
   1: user  3/001: Sleeping for  4 seconds
   1: prnt  0/000: Printing job   0 from user  1 size   3
   1: prnt  1/000: Printing job   0 from user  3 size   4
   1: prnt  2 sleeping because the queue is empty
   2: prnt  2 sleeping because the queue is empty
   3: prnt  2 sleeping because the queue is empty
   4: user  2/000: Submitting a job with size 4
   4: user  2/001: Sleeping for 10 seconds
   4: prnt  2/000: Printing job   0 from user  2 size   4
   5: user  3/001: Submitting a job with size 1
   5: user  3/002: Sleeping for  2 seconds
   7: user  3/002: Submitting a job with size 2
   7: user  3/003: Done
   8: user  1/001: Submitting a job with size 5
   8: user  1/002: Sleeping for  4 seconds
  10: user  4/000: Submitting a job with size 3
  10: user  4/001: Sleeping for  9 seconds
  10: user  0/000: Submitting a job with size 5
  10: user  0/001: Sleeping for  5 seconds
  12: user  1/002: Submitting a job with size 3
  12: user  1 sleeping because the queue is full             # Now the queue is full, and user 1 sleeps.
  13: prnt  0/001: ready to print
  13: prnt  0/001: Printing job   1 from user  3 size   1    # Printer 0 has a new job, which opens up a slot in the buffer, for user 1 to use
  13: user  1/003: Done                                      # And user 1 now has submitted the job.
  14: user  2/001: Submitting a job with size 1
  14: user  2 sleeping because the queue is full             # The queue is full again, so user 2 sleeps.
  15: user  0/001: Submitting a job with size 1
  15: user  0 sleeping because the queue is full             # As does user 0
  15: user  2 sleeping because the queue is full
  16: user  2 sleeping because the queue is full
  16: user  0 sleeping because the queue is full
  17: prnt  1/001: ready to print
  17: prnt  1/001: Printing job   2 from user  3 size   2
  17: user  0/002: Sleeping for  3 seconds
  17: prnt  0/002: ready to print
...
  60: prnt  1 sleeping because the queue is empty        # Eventually the users are done, and the printers spin away, sleeping and sleeping.
  60: prnt  2/004: ready to print
  60: prnt  2 sleeping because the queue is empty
  60: prnt  0 sleeping because the queue is empty
  61: prnt  2 sleeping because the queue is empty
  61: prnt  0 sleeping because the queue is empty
  61: prnt  1 sleeping because the queue is empty
  62: prnt  0 sleeping because the queue is empty
  62: prnt  1 sleeping because the queue is empty
  62: prnt  2 sleeping because the queue is empty
  63: prnt  0 sleeping because the queue is empty
  63: prnt  1 sleeping because the queue is empty
  63: prnt  2 sleeping because the queue is empty
  64: prnt  1 sleeping because the queue is empty
< CNTL-C >
UNIX> 
It all works fine. When all the user jobs are done, the printer threads keep sleeping and checking the queue, so you eventually have to Control-C out of the program.

This is a workable solution, but it is not a good one. The technique of periodically checking the queue is called polling. It's not really what you want because you'd like for a printer thread to wake up and start printing as soon as a job is inserted into the queue, instead of up to a second afterward. Similarly, you'd like the user to complete submitting a job as soon as a printer thread empties a space in the queue instead of up to a second afterward.

In short, polling is ok, but not great. I show it to you because it's good for you to see, but I don't want to see any polling in your labs -- if you do it, you will get points taken off.


Monitors and condition variables

Monitors and condition variables together form a very convenient tool for synchronization. There are two ways to discuss monitors and condition variables -- as part of a threaded language, or as part of a threads library.

The CS361 textbook (Silberschatz & Galvin, Chapter 6 -- or at least that's what it was last time I checked) discusses them as part of a threaded language, but I'm going to discuss them as part of a threads library, since that's how you will use them.

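A monitor is a data structure that a thread can "enter" and "exit". Only one thread may be in the monitor at a time. This is just like a mutex, and in pthreads, there is no entity called a "monitor". You just use a mutex. Condition variables allow you to do more sophisticated things with monitors. A condition variable must be associated with a specific monitor. There are three procedures that act on condition variables, and whenever you call them, you should have entered the relevant monitor (i.e. locked the relevant mutex); for pthread_cond_wait() this is required:

  1. pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex): Atomically unlocks the mutex and blocks the calling thread on cond. When the thread is unblocked, it relocks the mutex before pthread_cond_wait() returns.
  2. pthread_cond_signal(pthread_cond_t *cond): Unblocks at least one thread that is blocked on cond. If no thread is blocked on cond, it does nothing.
  3. pthread_cond_broadcast(pthread_cond_t *cond): Unblocks all threads that are blocked on cond.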

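Let me advocate testing the return values of all monitor and condition variable calls. It is easy to make mistakes with these calls, and testing the return value can save you hours of debugging. Note that the pthread calls return 0 on success and an error number on failure; they do not return -1 and set errno.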
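Here is a sketch of what checking return values can look like, and of how much it can tell you. report() and make_errorcheck_mutex() are hypothetical helpers; the pthread calls are standard POSIX. A mutex of type PTHREAD_MUTEX_ERRORCHECK turns two common mistakes, locking a mutex you already hold and unlocking one you don't hold, into error returns (EDEADLK and EPERM) instead of a hang or undefined behavior.

```c
#define _XOPEN_SOURCE 700   /* exposes PTHREAD_MUTEX_ERRORCHECK under strict compilers */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* pthread calls return 0 or an error number (they do not set errno).
   report() prints a readable message for a nonzero code and returns it. */
static int report(int rc, const char *what)
{
  if (rc != 0) {
    fprintf(stderr, "%s failed: %s\n", what, strerror(rc));
    if (rc == EDEADLK) fprintf(stderr, "  (does this thread already hold the mutex?)\n");
  }
  return rc;
}

/* An error-checking mutex reports misuse instead of deadlocking. */
static void make_errorcheck_mutex(pthread_mutex_t *m)
{
  pthread_mutexattr_t attr;

  if (report(pthread_mutexattr_init(&attr), "pthread_mutexattr_init") != 0) abort();
  if (report(pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK),
             "pthread_mutexattr_settype") != 0) abort();
  if (report(pthread_mutex_init(m, &attr), "pthread_mutex_init") != 0) abort();
  pthread_mutexattr_destroy(&attr);
}

/* Usage: each misuse below is reported rather than hanging the program. */
static void demo_errorcheck(void)
{
  pthread_mutex_t m;
  int rc;

  make_errorcheck_mutex(&m);
  rc = report(pthread_mutex_lock(&m), "first lock");
  assert(rc == 0);
  rc = report(pthread_mutex_lock(&m), "second lock");       /* same thread */
  assert(rc == EDEADLK);
  rc = report(pthread_mutex_unlock(&m), "first unlock");
  assert(rc == 0);
  rc = report(pthread_mutex_unlock(&m), "second unlock");   /* not held */
  assert(rc == EPERM);
  pthread_mutex_destroy(&m);
}
```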

Now, here is an odd thing -- you don't need to own the mutex when you call pthread_cond_signal() or pthread_cond_broadcast(). However, I advocate that you do own the mutex. It will make your code easier to reason about. What's confusing is that the thread that you are unblocking will have locked the mutex when it called pthread_cond_wait(). This at first appears to be a contradiction, but you must remember that the waiting thread unlocks the mutex while it is blocked. When it is unblocked, it must relock the mutex before returning from pthread_cond_wait().

My personal philosophy is that you should unlock the mutex right after you call pthread_cond_signal() or pthread_cond_broadcast(). You don't have to do this (actually, you don't have to own the mutex), but again, it will make your code less bug-prone. My code will always do this.
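Here is a minimal sketch of the whole pattern: the waiter locks the mutex and calls pthread_cond_wait() in a while loop, and the signaler locks the mutex, changes the shared state, signals, and unlocks right away. The Mailbox type and its functions are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical one-slot mailbox: the mutex is the monitor. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t ready;   /* signaled when has_value becomes 1 */
  int has_value;
  int value;
} Mailbox;

/* Waiter: enter the monitor, wait while there is nothing to read.
   pthread_cond_wait() unlocks while blocked and relocks before returning. */
static void *waiter(void *arg)
{
  Mailbox *mb = (Mailbox *) arg;
  int *got = malloc(sizeof(int));

  if (got == NULL) abort();
  pthread_mutex_lock(&mb->lock);
  while (!mb->has_value) pthread_cond_wait(&mb->ready, &mb->lock);
  *got = mb->value;
  mb->has_value = 0;
  pthread_mutex_unlock(&mb->lock);
  return got;
}

/* Signaler: lock, change the state, signal, and unlock right away. */
static void post(Mailbox *mb, int v)
{
  pthread_mutex_lock(&mb->lock);
  mb->value = v;
  mb->has_value = 1;
  pthread_cond_signal(&mb->ready);
  pthread_mutex_unlock(&mb->lock);
}

/* Starts a waiter, posts v, and returns what the waiter received. */
static int round_trip(int v)
{
  Mailbox mb;
  pthread_t tid;
  void *res;
  int got;

  pthread_mutex_init(&mb.lock, NULL);
  pthread_cond_init(&mb.ready, NULL);
  mb.has_value = 0;
  if (pthread_create(&tid, NULL, waiter, &mb) != 0) abort();
  post(&mb, v);
  pthread_join(tid, &res);
  got = *(int *) res;
  free(res);
  assert(mb.has_value == 0);   /* the waiter consumed the value */
  pthread_mutex_destroy(&mb.lock);
  pthread_cond_destroy(&mb.ready);
  return got;
}
```

If post() runs before the waiter has blocked, the signal wakes nobody and is lost, but has_value is already 1, so the waiter never calls pthread_cond_wait() at all. The shared state, not the signal, is what carries the information.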


Using condition variables

Now, adding condition variables to our program is straightforward. We need two condition variables -- one for when the queue is full and one for when the queue is empty. We'll call pthread_cond_wait() in submit_job() when the queue is full, and pthread_cond_signal() in get_print_job() when a printer thread removes a job from a full queue.

Likewise, we'll call pthread_cond_wait() in get_print_job() when the queue is empty, and pthread_cond_signal() in submit_job() when a user thread inserts a job into an empty queue.

The code is in src/ps4.c. Here's the change to the data structure -- I won't show the initialization code, because it is straightforward:

typedef struct {
  Job **b;
  int head;
  int tail;
  int njobs;
  pthread_mutex_t *lock;
  pthread_cond_t *full;         /* The users wait on this when the queue is full. */
  pthread_cond_t *empty;        /* The printers wait on this when the queue is empty. */
} Buffer;
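Since the initialization code is omitted, here is a sketch of one way to write it, extending the ps2 initialize_v() with the two condition variables. The Spq here is a hypothetical cut-down version with only bufsize and v; unlike the ps2 version, it checks every allocation and init call.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct { int jobsize; int userid; int jobid; } Job;
typedef struct { int bufsize; void *v; } Spq;   /* hypothetical cut-down Spq */

typedef struct {
  Job **b;
  int head;
  int tail;
  int njobs;
  pthread_mutex_t *lock;
  pthread_cond_t *full;    /* users wait on this when the queue is full */
  pthread_cond_t *empty;   /* printers wait on this when the queue is empty */
} Buffer;

/* Allocates the buffer, its mutex and both condition variables, then
   stores the buffer in s->v.  Aborts if anything fails. */
void initialize_v(Spq *s)
{
  Buffer *b;

  assert(s->bufsize > 0);
  b = malloc(sizeof(Buffer));
  if (b == NULL) abort();
  b->b = malloc(sizeof(Job *) * s->bufsize);
  b->lock = malloc(sizeof(pthread_mutex_t));
  b->full = malloc(sizeof(pthread_cond_t));
  b->empty = malloc(sizeof(pthread_cond_t));
  if (b->b == NULL || b->lock == NULL || b->full == NULL || b->empty == NULL) abort();

  b->head = 0;
  b->tail = 0;
  b->njobs = 0;
  if (pthread_mutex_init(b->lock, NULL) != 0) abort();
  if (pthread_cond_init(b->full, NULL) != 0) abort();
  if (pthread_cond_init(b->empty, NULL) != 0) abort();
  s->v = (void *) b;
}
```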

I'll show the code for submit_job(), because get_print_job() is very similar. Note that submit_job() and get_print_job() both use while loops: when pthread_cond_wait() returns, the queue may have become full/empty again in the time between when the waiting thread unblocked and the time that it reacquired the mutex, so it may have to wait again. (POSIX also allows pthread_cond_wait() to return spuriously, which is another reason to recheck the condition in a loop.)

void submit_job(Spq *s, Job *j)
{
  Buffer *b;

  b = (Buffer *) s->v;

  pthread_mutex_lock(b->lock);
  while(1) {
    if (b->njobs < s->bufsize) {
      b->b[b->head] = j;
      b->head = (b->head + 1) % s->bufsize;
      b->njobs++;
      if (b->njobs == 1) pthread_cond_signal(b->empty);   // New code: Signal the printers when the queue was empty.
      pthread_mutex_unlock(b->lock);
      return;
    } else {
      printf("%4ld: user %2d blocking because the queue is full\n", 
             time(0)-s->starttime, s->id);
      fflush(stdout);
      pthread_cond_wait(b->full, b->lock);               // New code: wait when the queue is full.
    }
  }
}
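get_print_job() is described as very similar to submit_job() but is not shown. Here is a sketch of the printer side, mirroring the code above: take the job at tail, signal full, and wait on empty when there is nothing to print. The Spq is a hypothetical cut-down version (id, bufsize, v). One deliberate difference: ps4 guards its signals with if statements, while this sketch signals full unconditionally, which is the fix the notes adopt in src/ps5.c. It also ignores the end of the simulation, when the header says get_print_job() should return NULL.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

typedef struct { int jobsize; int userid; int jobid; } Job;
typedef struct { int id; int bufsize; void *v; } Spq;   /* hypothetical cut-down Spq */

typedef struct {
  Job **b;
  int head;
  int tail;
  int njobs;
  pthread_mutex_t *lock;
  pthread_cond_t *full;
  pthread_cond_t *empty;
} Buffer;

Job *get_print_job(Spq *s)
{
  Buffer *b = (Buffer *) s->v;
  Job *j;

  pthread_mutex_lock(b->lock);
  while (1) {
    assert(b->njobs >= 0 && b->njobs <= s->bufsize);
    if (b->njobs > 0) {
      j = b->b[b->tail];
      b->tail = (b->tail + 1) % s->bufsize;
      b->njobs--;
      pthread_cond_signal(b->full);   /* unconditional: wakes a blocked user, if any */
      pthread_mutex_unlock(b->lock);
      return j;
    } else {
      printf("prnt %2d blocking because the queue is empty\n", s->id);
      fflush(stdout);
      pthread_cond_wait(b->empty, b->lock);   /* releases the lock while blocked */
    }
  }
}
```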

When you run it, everything seems to work just fine.


A bug

However, there is a problem with this code, which arises with a scenario like the following:
njobs    User 0                  User 1                 Printer 0             Printer 1
-----    -----------             -----------            -----------           ---------------
  0                                                     get_print_job()
  0                                                     pthread_cond_wait()
  0                                                                           get_print_job()
  0                                                                           pthread_cond_wait()
  0      submit_job()                                                                            
  0      pthread_mutex_lock()                                                                    
  1      Add job to the queue                                                                    
  1      pthread_cond_signal()                                                                   
  1                                                                           pthread_cond_wait() will return when it gets the mutex
  1                              submit_job()                                                    
  1                              pthread_mutex_lock()-blocks
  1      pthread_mutex_unlock()  
  1                              pthread_mutex_lock()-unblocks
  2                              Add job to the queue  
  2                              Doesn't call pthread_cond_signal()
  2                              pthread_mutex_unlock()
  2                                                                           pthread_cond_wait() returns
  1                                                                           prints user 0's job.
  1                                                     This printer never wakes up.
The key here is that pthread_cond_signal() is only called when njobs becomes one. In the scenario above, User 1's job raises njobs to two, so no signal is sent, even though Printer 0 is still blocked and should be woken up.

Look at ps4-bad.txt. This is exactly what happens. There are three user threads and five printer threads. Initially, all of the printer threads block. At the 3 second mark, two user threads submit jobs, but only one printer thread (0) is signaled. Then, more jobs are put onto the print queue, but since njobs is greater than 1, no more printers get awakened.

The fix (in src/ps5.c) is simple: remove the if statements around the pthread_cond_signal() calls. This means that submit_job() always signals the empty condition variable, and get_print_job() always signals the full condition variable. This works fine -- if there are no blocked threads, pthread_cond_signal() does nothing, and if, for example, a user thread is unblocked and there is no room on the queue, it will simply call pthread_cond_wait() again. Try it out. If you look at ps5-good.txt, you'll see the same scenario as in ps4-bad.txt at the 27 second mark, and that it is handled just fine.
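To exercise the fix, here is a self-contained sketch that mirrors the shape of ps4-bad.txt: 3 producer threads ("users") and 5 consumer threads ("printers") sharing a bounded buffer of ints with unconditional signals. All names are hypothetical. Each consumer takes a fixed quota so that the program ends without the NULL-return logic the real lab needs, and the run returns the total of everything consumed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define CAP 2        /* tiny queue, so producers really do block */
#define NPROD 3      /* "users" */
#define NCONS 5      /* "printers" */
#define PER_PROD 10  /* items per producer: 30 in all ... */
#define PER_CONS 6   /* ... so each consumer takes exactly 6 */

/* A ps5-style bounded buffer of ints (hypothetical names throughout). */
static struct {
  int b[CAP];
  int head, tail, n, max_n;   /* max_n records the fullest the queue got */
  long sum;                   /* total of everything consumed */
  pthread_mutex_t lock;
  pthread_cond_t full;        /* producers wait here when n == CAP */
  pthread_cond_t empty;       /* consumers wait here when n == 0 */
} q;

static void bb_put(int x)
{
  pthread_mutex_lock(&q.lock);
  while (q.n == CAP) pthread_cond_wait(&q.full, &q.lock);
  q.b[q.head] = x;
  q.head = (q.head + 1) % CAP;
  q.n++;
  if (q.n > q.max_n) q.max_n = q.n;
  pthread_cond_signal(&q.empty);   /* unconditional, as in ps5 */
  pthread_mutex_unlock(&q.lock);
}

static void bb_get_and_add(void)
{
  pthread_mutex_lock(&q.lock);
  while (q.n == 0) pthread_cond_wait(&q.empty, &q.lock);
  q.sum += q.b[q.tail];
  q.tail = (q.tail + 1) % CAP;
  q.n--;
  pthread_cond_signal(&q.full);    /* unconditional, as in ps5 */
  pthread_mutex_unlock(&q.lock);
}

static void *producer(void *arg)
{
  int id = *(int *) arg, i;

  for (i = 1; i <= PER_PROD; i++) bb_put(id * 100 + i);
  return NULL;
}

static void *consumer(void *arg)
{
  int i;

  (void) arg;
  for (i = 0; i < PER_CONS; i++) bb_get_and_add();
  return NULL;
}

/* Starts the consumers first (so they are likely to block on an empty
   queue, like the printers in ps4-bad.txt), then the producers.
   Returns the total consumed.  Meant to be called once. */
static long run(void)
{
  pthread_t p[NPROD], c[NCONS];
  int ids[NPROD], i;

  assert(NPROD * PER_PROD == NCONS * PER_CONS);   /* no consumer is left waiting */
  pthread_mutex_init(&q.lock, NULL);
  pthread_cond_init(&q.full, NULL);
  pthread_cond_init(&q.empty, NULL);
  q.head = q.tail = q.n = q.max_n = 0;
  q.sum = 0;
  for (i = 0; i < NCONS; i++)
    if (pthread_create(&c[i], NULL, consumer, NULL) != 0) abort();
  for (i = 0; i < NPROD; i++) {
    ids[i] = i;
    if (pthread_create(&p[i], NULL, producer, &ids[i]) != 0) abort();
  }
  for (i = 0; i < NPROD; i++) pthread_join(p[i], NULL);
  for (i = 0; i < NCONS; i++) pthread_join(c[i], NULL);
  assert(q.n == 0);
  return q.sum;
}
```

The producers generate 1..10, 101..110 and 201..210, so a correct run returns 55 + 1055 + 2055 = 3165. With ps4's conditional signals in place of the unconditional ones, a run like this can leave consumers blocked forever, depending on timing.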


So

So, you've learned what monitors/condition variables are, and you've seen a detailed example of their use. You have also seen that synchronization problems can be subtle, and you have to examine your program's output carefully to make sure that it is working like you think it should.