CS360 Lecture notes -- Threads #5 - Multiple threads working on one problem


An example of multiple threads working on one problem: Determining primes

This program is slightly contrived, but it will help you with your jtalk lab. The three files numbers-20.txt, numbers-100.txt and numbers-5000.txt contain 20, 100 and 5000 large numbers (up to 16 digits), and we want to know which ones are prime. We're going to write a multithreaded program so that multiple threads can test the primality of these numbers. The first version is in src/prime_1_simple.c. I'm only going to show the code that the threads run. The setup code in the main() thread is straightforward and very much like the code in the mutex examples.

/* Each thread will have its own private information, and some shared information.  
   Here's the shared information.   At this point, there's little to share, because
   the threads are simply reading from standard input and writing to standard output. */

struct shared {
  int debug;          /* This is 0 or 1.  If 1, it will print more information. */
};

/* Here's the private information.  This includes a pointer to the shared information. */

struct info {
  int id;
  struct shared *s;
};

/* This is what each thread runs.  
   Each thread reads longs from standard input, and prints out the primes on standard output. */
   
void *worker(void *arg)
{
  struct info *info;               /* Private Information. */
  struct shared *s;                /* Shared Information. */
  long prime_to_test;              /* The number to determine primality. */
  long i;
  int prime;                       /* Boolean -- is it prime? */

  /* Set the shared and private information from the argument. */

  info = (struct info *) arg;
  s = info->s;

  while (1) {

    /* Read a number from standard input.  Return if standard input is done. */

    if (scanf("%ld", &prime_to_test) != 1) {
      if (s->debug) printf("Thread %d - Input_over - Exiting.\n", info->id);
      return NULL;
    }

    if (s->debug) {
      printf("Thread %d testing %ld\n", info->id, prime_to_test);
      fflush(stdout);
    }

    /* Determine if the number is prime, and print out if so. */

    prime = (prime_to_test >= 2);
    for (i = 2; prime && i*i <= prime_to_test; i++) {
        if (prime_to_test % i == 0) prime = 0;
    }

    if (prime) {
      printf("Thread %d found prime %ld\n", info->id, prime_to_test);
    } else if (s->debug) {
      printf("Thread %d found composite %ld.  Exited the loop when i = %ld\n", 
              info->id, prime_to_test, i-1);
    } 
  }
}

(BTW, this won't work on a 32-bit machine like the Pi, because it assumes that longs are 64 bits. The 16-digit inputs don't fit into a 32-bit long.)
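If you want the same test to work regardless of how wide a long is, one option is to use the fixed-width types from <stdint.h>. This is a minimal sketch and not the code in src/prime_1_simple.c; the names is_prime64() and read_number64() are made up for illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Trial division on a 64-bit value, regardless of how wide long is.
   Returns 1 if n is prime and 0 otherwise. */
int is_prime64(int64_t n)
{
  int64_t i;

  if (n < 2) return 0;
  for (i = 2; i <= n / i; i++) {     /* i <= n/i is i*i <= n, without the overflow. */
    if (n % i == 0) return 0;
  }
  return 1;
}

/* Read one number from f.  Returns 1 on success, like the scanf() test in worker(). */
int read_number64(FILE *f, int64_t *n)
{
  return fscanf(f, "%" SCNd64, n) == 1;
}
```

To print such a value portably, use the matching macro: printf("%" PRId64 "\n", n).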

We run this on one thread, and it finds two prime numbers in numbers-20.txt. It finds 122 primes in numbers-5000.txt, which takes about one minute and 46 seconds on my Macbook (in 2019):

UNIX> bin/prime_1_simple 1 n < numbers-20.txt
Thread 0 found prime 7075339107081019
Thread 0 found prime 8608832394453737
UNIX> time bin/prime_1_simple 1 n < numbers-5000.txt > output-5000.txt
105.853u 0.195s 1:46.36 99.6%	0+0k 0+0io 0pf+0w
UNIX> wc output-5000.txt
     122     610    4623 output-5000.txt
UNIX> head -n 5 output-5000.txt 
Thread 0 found prime 7463817079068967
Thread 0 found prime 9023127434641013
Thread 0 found prime 9985493102083921
Thread 0 found prime 361166542956361
Thread 0 found prime 9343670048266007
UNIX> tail -n 5 output-5000.txt
Thread 0 found prime 3609439943333089
Thread 0 found prime 6564031501227023
Thread 0 found prime 9114759606402677
Thread 0 found prime 6889406536167677
Thread 0 found prime 6371913195199121
UNIX> 
Below, I'll increase the thread count from 1 to 8. In each case, I verify that the outputs match each other. Take a look: the number of threads is the first command line argument, and the wall-clock time is the third field of each timing line (for example, 1:42.51):
UNIX> time bin/prime_1_simple 1 n < numbers-5000.txt > tmp.txt
102.430u 0.051s 1:42.51 99.9%	0+0k 0+3io 0pf+0w
UNIX> sed 's/Thread ./Thread 0/' < tmp.txt | sort | openssl md5 ; rm tmp.txt
(stdin)= b4c0be00e5c83840b5286c5d0505029e
UNIX> time bin/prime_1_simple 2 n < numbers-5000.txt > tmp.txt
99.482u 0.071s 0:50.25 198.1%	0+0k 0+5io 0pf+0w
UNIX> sed 's/Thread ./Thread 0/' < tmp.txt | sort | openssl md5 ; rm tmp.txt
(stdin)= b4c0be00e5c83840b5286c5d0505029e
UNIX> time bin/prime_1_simple 3 n < numbers-5000.txt > tmp.txt
101.360u 0.119s 0:34.40 294.9%	0+0k 0+3io 0pf+0w
UNIX> sed 's/Thread ./Thread 0/' < tmp.txt | sort | openssl md5 ; rm tmp.txt
(stdin)= b4c0be00e5c83840b5286c5d0505029e
UNIX> time bin/prime_1_simple 4 n < numbers-5000.txt > tmp.txt
105.055u 0.115s 0:26.82 392.0%	0+0k 1+3io 2pf+0w
UNIX> sed 's/Thread ./Thread 0/' < tmp.txt | sort | openssl md5 ; rm tmp.txt
(stdin)= b4c0be00e5c83840b5286c5d0505029e
UNIX> time bin/prime_1_simple 5 n < numbers-5000.txt > tmp.txt
124.551u 0.118s 0:25.63 486.3%	0+0k 0+5io 0pf+0w
UNIX> sed 's/Thread ./Thread 0/' < tmp.txt | sort | openssl md5 ; rm tmp.txt
(stdin)= b4c0be00e5c83840b5286c5d0505029e
UNIX> time bin/prime_1_simple 6 n < numbers-5000.txt > tmp.txt
142.507u 0.120s 0:24.70 577.4%	0+0k 0+3io 0pf+0w
UNIX> sed 's/Thread ./Thread 0/' < tmp.txt | sort | openssl md5 ; rm tmp.txt
(stdin)= b4c0be00e5c83840b5286c5d0505029e
UNIX> time bin/prime_1_simple 7 n < numbers-5000.txt > tmp.txt
159.279u 0.220s 0:24.05 663.1%	0+0k 0+1io 0pf+0w
UNIX> sed 's/Thread ./Thread 0/' < tmp.txt | sort | openssl md5 ; rm tmp.txt
(stdin)= b4c0be00e5c83840b5286c5d0505029e
UNIX> time bin/prime_1_simple 8 n < numbers-5000.txt > tmp.txt
170.716u 0.128s 0:22.47 760.2%	0+0k 0+3io 0pf+0w
UNIX> sed 's/Thread ./Thread 0/' < tmp.txt | sort | openssl md5 ; rm tmp.txt
(stdin)= b4c0be00e5c83840b5286c5d0505029e
UNIX> 
As you can see, from one thread to four, the addition of an extra thread speeds things up, pretty much proportionally to the number of threads. After that, though, there is no significant speedup. That's because my Macintosh has four cores.
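To put numbers on that, you can compute the speedup (one-thread time divided by n-thread time) and the efficiency (speedup divided by n) from the wall-clock times above. The two helper functions below are hypothetical, just to make the arithmetic explicit. With four threads, 102.51 / 26.82 gives a speedup of roughly 3.82 and an efficiency of roughly 0.96. With eight threads, 102.51 / 22.47 gives a speedup of roughly 4.56, but the efficiency drops to roughly 0.57.

```c
/* Speedup of an n-thread run relative to the one-thread run.
   Both arguments are wall-clock times in seconds. */
double speedup(double t1, double tn)
{
  return t1 / tn;
}

/* Parallel efficiency: speedup divided by the number of threads.
   A value near 1.0 means that each added thread is paying for itself. */
double efficiency(double t1, double tn, int nthreads)
{
  return t1 / tn / nthreads;
}
```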

Take a look at what I'm doing in the commands that begin with sed. I'll break it down for you:

sed 's/Thread ./Thread 0/' < tmp.txt  - This makes every line begin with "Thread 0"
| sort                                  - This sorts the output
| openssl md5                           - This calculates the MD5 hash of the output
; rm tmp.txt                            - And this deletes tmp.txt

The sed command gets rid of any differences that the outputs may have because of the thread numbers -- it simply makes every line begin with "Thread 0". I need the sort command, because with threads, the primes can be printed in any order, depending on which thread does what. When I sort, the output will be the same, regardless of the order that the primes are printed. An MD5 hash takes the contents of the file and turns them into a 128-bit hash -- when two files differ, the probability that their hashes are the same is 1 in 2^128. So, we can be reasonably assured that if two outputs have the same hash, they are identical, and if you look at the output, they are. Thus, I'm convinced that they work.

Look at the output when we look at 20 numbers and I turn on the debugging flag -- it's interesting:

UNIX> bin/prime_1_simple 4 y < numbers-20.txt
Thread 0 testing 8185099285209145
Thread 1 testing 7075339107081019    - This is a prime number
Thread 2 testing 640904115652591
Thread 3 testing 699059234573950
Thread 0 found composite 8185099285209145.  Exited the loop when i = 5
Thread 2 found composite 640904115652591.  Exited the loop when i = 7
Thread 0 testing 8608832394453737    - This is a prime number
Thread 2 testing 3821701465442891
Thread 3 found composite 699059234573950.  Exited the loop when i = 2
Thread 3 testing 9370646097390169
Thread 2 found composite 3821701465442891.  Exited the loop when i = 71
Thread 2 testing 3231405834979810
Thread 3 found composite 9370646097390169.  Exited the loop when i = 23
Thread 3 testing 1850884017234395
Thread 2 found composite 3231405834979810.  Exited the loop when i = 2
Thread 2 testing 3801829741075466
Thread 3 found composite 1850884017234395.  Exited the loop when i = 5
Thread 3 testing 540168881271137       - This is not prime, but it will take a while. 
Thread 2 found composite 3801829741075466.  Exited the loop when i = 2
Thread 2 testing 2545171349758192
Thread 2 found composite 2545171349758192.  Exited the loop when i = 2
Thread 2 testing 1597181395944134
Thread 2 found composite 1597181395944134.  Exited the loop when i = 2
Thread 2 testing 4808764804932068
Thread 2 found composite 4808764804932068.  Exited the loop when i = 2
Thread 2 testing 7581664330568676
Thread 2 found composite 7581664330568676.  Exited the loop when i = 2
Thread 2 testing 2281999321403870
Thread 2 found composite 2281999321403870.  Exited the loop when i = 2
Thread 2 testing 4533540853644025
Thread 2 found composite 4533540853644025.  Exited the loop when i = 5
Thread 2 testing 8421794371578008
Thread 2 found composite 8421794371578008.  Exited the loop when i = 2
Thread 2 testing 5935296786582469
Thread 2 found composite 5935296786582469.  Exited the loop when i = 41
Thread 2 testing 6884845122873466
Thread 2 found composite 6884845122873466.  Exited the loop when i = 2
Thread 2 - Input_over - Exiting.
Thread 3 found composite 540168881271137.  Exited the loop when i = 8688607
Thread 3 - Input_over - Exiting.
Thread 1 found prime 7075339107081019
Thread 1 - Input_over - Exiting.
Thread 0 found prime 8608832394453737
Thread 0 - Input_over - Exiting.
UNIX> 
When thread 1 gets 7075339107081019, it spends the duration of the program calculating that it is prime. Fortunately, we have other threads to help do the work.

Shortly thereafter, thread 0 gets 8608832394453737, which is also a prime; so it too spends all of its time calculating the primality of the number. After that, it's up to threads 2 and 3 to do all of the testing, which they do, until thread 3 gets 540168881271137. This number isn't prime, but it takes a while to figure that out, because it is equal to 8688607 * 62169791. For that reason, thread 2 is the thread that finishes up all of the other calculations.

This is a nice example of having multiple threads split up the work, and they do so by splitting up the CPU time necessary, rather than, for example, giving 5 numbers to each of the four threads.
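The same idea works when the numbers are already in memory. Here is a minimal sketch, not taken from the lecture's source files: instead of handing each thread a fixed slice of the array, the threads share an index of the next unclaimed number and claim one at a time under a mutex. The names struct pool, pool_worker() and count_primes_dynamic() are made up for this example.

```c
#include <pthread.h>

/* Hypothetical shared state: an array of numbers and the index of the next
   unclaimed one.  The mutex protects next and count. */
struct pool {
  const long *nums;
  int n;
  int next;
  int count;                 /* Number of primes found so far. */
  pthread_mutex_t lock;
};

static int is_prime(long v)
{
  long i;

  if (v < 2) return 0;
  for (i = 2; i <= v / i; i++) if (v % i == 0) return 0;
  return 1;
}

/* Each worker claims one number at a time, so a slow number ties up one
   thread while the others keep draining the array. */
static void *pool_worker(void *arg)
{
  struct pool *p = (struct pool *) arg;
  int mine, prime;

  while (1) {
    pthread_mutex_lock(&p->lock);
    mine = p->next++;
    pthread_mutex_unlock(&p->lock);
    if (mine >= p->n) return NULL;

    prime = is_prime(p->nums[mine]);      /* Do the slow part outside the lock. */

    if (prime) {
      pthread_mutex_lock(&p->lock);
      p->count++;
      pthread_mutex_unlock(&p->lock);
    }
  }
}

/* Run nthreads workers (at most 16 in this sketch) and return the number
   of primes in nums[0..n-1]. */
int count_primes_dynamic(const long *nums, int n, int nthreads)
{
  struct pool p;
  pthread_t tids[16];
  int i;

  if (nthreads < 1) nthreads = 1;
  if (nthreads > 16) nthreads = 16;
  p.nums = nums; p.n = n; p.next = 0; p.count = 0;
  pthread_mutex_init(&p.lock, NULL);
  for (i = 0; i < nthreads; i++) pthread_create(&tids[i], NULL, pool_worker, &p);
  for (i = 0; i < nthreads; i++) pthread_join(tids[i], NULL);
  pthread_mutex_destroy(&p.lock);
  return p.count;
}
```

With a static split, the thread that happened to get the two big primes would still be grinding long after the others had finished their shares.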

You'll note that this code does not have any mutexes in it. That's because the standard I/O library is "thread-safe" -- you don't get race conditions on that scanf() call. You can take a look at http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_01 if you want more information about that...
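One caveat: the guarantee is per call. Each printf() is atomic with respect to the stream, but two printf() calls in a row can be separated by another thread's output. If you need a group of calls to stay together, POSIX provides flockfile() and funlockfile(). The sketch below is an illustration only; print_report() is a made-up function that is not part of the lecture's code.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Print a two-line report so that no other thread's output can land between
   the lines.  Each fprintf() is atomic on its own; flockfile() extends that
   guarantee across the pair. */
void print_report(FILE *f, int id, long n)
{
  flockfile(f);
  fprintf(f, "Thread %d found prime\n", id);
  fprintf(f, "  value: %ld\n", n);
  funlockfile(f);
}
```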

Let me draw a picture of how these threads are organized. It's a simplistic picture, but it will be useful when we look at the remaining programs. As you can see, all of the worker threads simply read from standard input and write to standard output. Because the standard I/O library is thread-safe, everything works nicely.


Adding a thread to process output

Now, suppose we'd like to process the output in a more sophisticated manner. A nice way to do that is to add a thread to the system whose sole purpose is to process output. It's clean, because your threads have specific purposes: the worker threads test numbers for primality, and the output thread handles the output. To do this, we need a way for the worker threads to communicate their primes to the output thread. And we need a way for the output thread to know when there are primes to output. The first part of that is easy -- we can use a dllist, and protect it with a mutex: worker threads append to the dllist, and the output thread reads from the dllist.

The second part is trickier, and we'll make use of a condition variable, which we introduced in the printer simulation. The next few paragraphs are repetition, but a little repetition never hurts...

A condition variable facilitates those times when you want a thread to block, because there is a certain "condition". And then it facilitates unblocking the thread when that condition is no longer true. Like the mutex, the condition variable has three procedures that act on it:

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);   /* We always pass NULL for attr. */
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *lock);
int pthread_cond_signal(pthread_cond_t *cond);

Like the mutex, you initialize a condition variable with pthread_cond_init(). You call pthread_cond_wait() when you want your thread to block. You call it on a condition variable and a mutex, and you must hold the mutex when you call it. pthread_cond_wait() will atomically release the mutex and block your thread. It will store the fact that your thread is blocked as part of the condition variable.

When a thread calls pthread_cond_signal() on a condition variable, then the thread system checks to see if there are any threads that are blocked because they called pthread_cond_wait() on that condition variable. If there are none, then the pthread_cond_signal() call does nothing. However, if there are any threads that are blocked, then the thread system unblocks one of them, and that thread tries to re-lock the mutex. When it does lock the mutex, it returns from the pthread_cond_wait() call that blocked it.

Once again -- condition variables are designed for when a thread needs to block because some condition is present in the system. Another thread will unblock it when the condition is no longer present by calling pthread_cond_signal(). The safety of the conditions is protected by the mutex.
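Here is a minimal sketch of that pattern in isolation. It is not from the lecture's source files, and struct gate, waiter() and gate_demo() are made-up names. The thing to notice is that the waiter tests the condition in a while loop: POSIX allows pthread_cond_wait() to return without a matching signal, and the signal may also arrive before the waiter ever blocks, so the shared variable is what counts, not the wakeup.

```c
#include <pthread.h>

/* Hypothetical example: one thread blocks until another sets ready. */
struct gate {
  int ready;                  /* The condition: 0 means "keep waiting". */
  int observed;               /* What the waiter saw after waking up. */
  pthread_mutex_t lock;
  pthread_cond_t cond;
};

static void *waiter(void *arg)
{
  struct gate *g = (struct gate *) arg;

  pthread_mutex_lock(&g->lock);
  while (!g->ready) {                       /* Re-check after every wakeup. */
    pthread_cond_wait(&g->cond, &g->lock);  /* Releases the lock while blocked. */
  }
  g->observed = g->ready;                   /* We hold the lock again here. */
  pthread_mutex_unlock(&g->lock);
  return NULL;
}

/* Start a waiter, open the gate with the given nonzero value, and return
   what the waiter saw. */
int gate_demo(int value)
{
  struct gate g;
  pthread_t tid;

  g.ready = 0;
  g.observed = 0;
  pthread_mutex_init(&g.lock, NULL);
  pthread_cond_init(&g.cond, NULL);
  pthread_create(&tid, NULL, waiter, &g);

  pthread_mutex_lock(&g.lock);
  g.ready = value;                          /* Change the condition under the lock... */
  pthread_cond_signal(&g.cond);             /* ...then wake a waiter, if there is one. */
  pthread_mutex_unlock(&g.lock);

  pthread_join(tid, NULL);
  pthread_cond_destroy(&g.cond);
  pthread_mutex_destroy(&g.lock);
  return g.observed;
}
```

This works whichever thread gets the mutex first. If the main thread sets ready before the waiter locks the mutex, the waiter never calls pthread_cond_wait() at all.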

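In the case of our primes program, we'll have the worker threads communicate with the output thread as follows: when a worker finds a prime, it locks the mutex, appends the prime to the dllist, signals the condition variable, and unlocks the mutex. The output thread holds the mutex while it prints and deletes everything on the dllist, and when the list is empty, it waits on the condition variable.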

Here's the schematic picture:

The code is in src/prime_2_output_thread.c. I'm not going to include all of the code -- just the relevant changes. Here is the addition to the shared data:

/* Each thread will have its own private information, and some shared information.  
   Here's the shared information.   In addition to the debug flag, we have a 
   dllist, mutex and condition variable that allow the workers to communicate with
   the output thread. */

struct shared {
  int debug;                       /* This is 0 or 1.  If 1, it will print more information. */
  Dllist primes;                   /* List of primes found by the worker threads. */
  pthread_mutex_t *output_mutex;   /* Mutex for the list of primes. */
  pthread_cond_t  *output_cond;    /* Condition variable to wake up the output thread. */
};

Here's the code that we add to main() to initialize the data:

  /* Set up the shared and private information. */

  nthreads = atoi(argv[1]);
  S.debug = (strcmp(argv[2], "y") == 0);
  S.primes = new_dllist();
  S.output_mutex = (pthread_mutex_t *) malloc(sizeof(pthread_mutex_t));
  S.output_cond = (pthread_cond_t *) malloc(sizeof(pthread_cond_t));
  pthread_mutex_init(S.output_mutex, NULL);
  pthread_cond_init(S.output_cond, NULL);
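The snippet above ignores return values, which is fine for lecture code. If you'd like to be more careful, here is one hedged way to package the allocation and initialization. The helpers sync_pair_create() and sync_pair_destroy() are hypothetical and do not appear in the lecture's source.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical helper: allocate and initialize a mutex/condition-variable pair.
   Returns 0 on success and -1 on failure, leaving both pointers NULL. */
int sync_pair_create(pthread_mutex_t **m, pthread_cond_t **c)
{
  *m = malloc(sizeof(pthread_mutex_t));
  *c = malloc(sizeof(pthread_cond_t));
  if (*m != NULL && *c != NULL && pthread_mutex_init(*m, NULL) == 0) {
    if (pthread_cond_init(*c, NULL) == 0) return 0;
    pthread_mutex_destroy(*m);         /* The mutex worked but the condition didn't. */
  }
  free(*m);                            /* free(NULL) is harmless. */
  free(*c);
  *m = NULL;
  *c = NULL;
  return -1;
}

/* Tear the pair down again, once no thread is using it. */
void sync_pair_destroy(pthread_mutex_t *m, pthread_cond_t *c)
{
  pthread_cond_destroy(c);
  pthread_mutex_destroy(m);
  free(c);
  free(m);
}
```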

Here is the change in the worker thread code when it finds a prime number:

    /* If it's prime, then put it onto the list and signal the output thread. */

    if (prime) {
      pthread_mutex_lock(s->output_mutex);
      dll_append(s->primes, new_jval_l(prime_to_test));
      pthread_cond_signal(s->output_cond);
      pthread_mutex_unlock(s->output_mutex);
    }

And finally, here is the code for the output thread:

/* Here's the code for the output thread.  It takes as its argument the shared data.
   It monitors the dllist of primes, blocking when it is empty, and printing/clearing
   its contents when it is non-empty. */

void *output_thread(void *arg)
{
  struct shared *s;                /* Shared Information. */
  int counter;                     /* We'll keep track of the number of primes, too. */

  counter = 0;
  s = (struct shared *) arg;

  pthread_mutex_lock(s->output_mutex);
  while (1) {                                   /* Print the primes if there are any */
    while (!dll_empty(s->primes)) {
      printf("Prime %5d: %20ld\n", counter, s->primes->flink->val.l);
      counter++;
      dll_delete_node(s->primes->flink);
    }                                           /* When the list is empty, wait. */
    pthread_cond_wait(s->output_cond, s->output_mutex);
  }
  return NULL;
}

When we run it on the 20-number input, it finds and prints the primes correctly. Unfortunately, it doesn't terminate:

UNIX> bin/prime_2_output_thread 4 n < numbers-20.txt
Prime     0:     7075339107081019
Prime     1:     8608832394453737
It hangs here.
(The reason is that the output thread is still blocked in pthread_cond_wait() after the workers have exited, and there is nobody left to signal it. A process does not exit just because all of its remaining threads are blocked. It exits when its last thread terminates, when main() returns, or when some thread calls exit().)
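If you want to convince yourself that a thread blocked on a condition variable simply stays blocked until someone signals it, you can wait with a deadline instead. The sketch below is a made-up diagnostic, not part of the lecture's code: wait_with_timeout() waits on a condition variable that nobody signals and reports how pthread_cond_timedwait() returned.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Wait on a condition variable that nobody will ever signal, but give up
   after ms milliseconds.  Returns the result of pthread_cond_timedwait(). */
int wait_with_timeout(long ms)
{
  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
  struct timespec deadline;
  int rv = 0;

  clock_gettime(CLOCK_REALTIME, &deadline);       /* The deadline is absolute. */
  deadline.tv_sec += ms / 1000;
  deadline.tv_nsec += (ms % 1000) * 1000000L;
  if (deadline.tv_nsec >= 1000000000L) {
    deadline.tv_sec++;
    deadline.tv_nsec -= 1000000000L;
  }

  pthread_mutex_lock(&lock);
  while (rv == 0) {                                /* Loop past spurious wakeups. */
    rv = pthread_cond_timedwait(&cond, &lock, &deadline);
  }
  pthread_mutex_unlock(&lock);
  return rv;
}
```

The call comes back with ETIMEDOUT. With plain pthread_cond_wait() there is no deadline, so the thread would sit there forever.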

Fixing the program so that it terminates

To fix this, we're going to add a boolean variable workers_alive to the shared data. It will be set to one initially. We'll have the main thread call pthread_join() on the worker threads, so that it can figure out when they are all done. When that happens, it sets workers_alive to zero, and signals the condition variable. We add some code to the output thread to exit when the primes list is empty, and when workers_alive is zero.

The code is in src/prime_3_fix_end.c. I'll show two snippets. The first is the code in main() that joins with the workers, and then signals the condition variable:

  /* Wait for the worker threads to exit, and then signal the output thread to exit. */

  for (i = 0; i < nthreads; i++) pthread_join(tids[i], &rv);
  pthread_mutex_lock(S.output_mutex);
  S.workers_alive = 0;
  pthread_cond_signal(S.output_cond);
  pthread_mutex_unlock(S.output_mutex);

  pthread_exit(NULL);

The second is the output thread, which now exits when the list is empty and there are no more workers:

  /* Change the loop to exit when the list is empty and there are no more workers. */

  pthread_mutex_lock(s->output_mutex);
  while (!dll_empty(s->primes) || s->workers_alive) {
    while (!dll_empty(s->primes)) {          /* Print and delete the primes on the list. */
      printf("Prime %5d: %20ld\n", counter, s->primes->flink->val.l);
      counter++;
      dll_delete_node(s->primes->flink);
    }                                           /* When the list is empty, wait, unless the */
                                                /* workers are gone and nobody can signal us. */
    if (s->workers_alive) pthread_cond_wait(s->output_cond, s->output_mutex);
  }
  pthread_mutex_unlock(s->output_mutex);
  return NULL;
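Here is a self-contained sketch of the same shutdown handshake that you can compile without the dllist library. It is an illustration under stated assumptions, not the code in src/prime_3_fix_end.c: a small array stands in for the dllist, the workers hand over the values 1 through 5 instead of primes, and run_pipeline() is a made-up driver. The output thread checks workers_alive after draining and before waiting, so a shutdown signal that arrives while it is not blocked is not lost.

```c
#include <pthread.h>

#define QSIZE 64                      /* Hypothetical capacity; big enough for the demo. */

struct shared {
  long items[QSIZE];                  /* Stands in for the dllist of primes. */
  int nitems;
  int workers_alive;
  long total;                         /* Sum of everything the output thread consumed. */
  pthread_mutex_t lock;
  pthread_cond_t cond;
};

static void *worker(void *arg)
{
  struct shared *s = (struct shared *) arg;
  long v;

  for (v = 1; v <= 5; v++) {          /* "Find" five values and hand each one over. */
    pthread_mutex_lock(&s->lock);
    s->items[s->nitems++] = v;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
  }
  return NULL;
}

static void *output_thread(void *arg)
{
  struct shared *s = (struct shared *) arg;

  pthread_mutex_lock(&s->lock);
  while (1) {
    while (s->nitems > 0) s->total += s->items[--s->nitems];   /* Drain the queue. */
    if (!s->workers_alive) break;     /* Nothing left and nobody to refill it. */
    pthread_cond_wait(&s->cond, &s->lock);
  }
  pthread_mutex_unlock(&s->lock);
  return NULL;
}

/* Run nworkers workers (at most 8 here) plus one output thread; return the total. */
long run_pipeline(int nworkers)
{
  struct shared s;
  pthread_t out, tids[8];
  int i;

  if (nworkers > 8) nworkers = 8;
  s.nitems = 0; s.workers_alive = 1; s.total = 0;
  pthread_mutex_init(&s.lock, NULL);
  pthread_cond_init(&s.cond, NULL);
  pthread_create(&out, NULL, output_thread, &s);
  for (i = 0; i < nworkers; i++) pthread_create(&tids[i], NULL, worker, &s);
  for (i = 0; i < nworkers; i++) pthread_join(tids[i], NULL);

  pthread_mutex_lock(&s.lock);        /* The shutdown handshake. */
  s.workers_alive = 0;
  pthread_cond_signal(&s.cond);
  pthread_mutex_unlock(&s.lock);

  pthread_join(out, NULL);
  pthread_cond_destroy(&s.cond);
  pthread_mutex_destroy(&s.lock);
  return s.total;
}
```

Each worker contributes 1+2+3+4+5 = 15, so run_pipeline(4) returns 60 no matter how the threads interleave.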


Adding a thread to process input

The final structural change is to add an input thread, whose sole job is to process input. This is a bit more complex than the output thread, because we don't want it to read all of the input at once. Instead, we're going to have it read ten numbers at a time, and then only read more numbers when those ten numbers have been used by the worker threads.

To make this work, we're going to have another dllist, primes_to_test, which the input thread will fill with ten numbers at a time. The worker threads will grab their testing numbers from this list, deleting the numbers as they go.

The synchronization here requires two condition variables -- one for workers and one for the input thread. The input thread will wait on its condition variable whenever the list is non-empty. When it's empty, it will fill it with ten numbers, and signal the worker condition variable.

On the flip side, workers have no problem when the list is non-empty -- they grab a number and delete it from the list. When the list is empty, however, they will signal the input thread to wake up, and then they will wait on the worker condition variable.

There's a subtlety here about waking up the workers. When the input thread fills the dllist, it signals the worker's condition variable. That will wake up one worker. That worker, in turn, checks to see if there's work to do, and if so, it signals the worker's condition variable again, in case there is another worker that is waiting. In this way, the input thread only has to call signal once, and the workers wake each other up.

I have a variable for the input thread to set when it hits EOF. When the list is empty, the worker threads check this variable, and exit if it has been set.
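Before looking at the real code, here is a stripped-down sketch of that protocol that compiles on its own. It makes several assumptions that are not in the lecture's source: an in-memory array stands in for standard input, a ten-element array stands in for the dllist, the workers add up the numbers instead of testing them, and run_batches() is a made-up driver.

```c
#include <pthread.h>

#define BATCH 10

struct shared {
  const long *input;          /* Stands in for standard input. */
  int ninput, nread;
  long batch[BATCH];          /* Stands in for the primes_to_test dllist. */
  int nbatch;
  int input_over;
  long sum;                   /* Sum of the numbers that the workers consumed. */
  pthread_mutex_t lock;
  pthread_cond_t input_cond, worker_cond;
};

static void *input_thread(void *arg)
{
  struct shared *s = (struct shared *) arg;

  pthread_mutex_lock(&s->lock);
  while (1) {
    if (s->nbatch == 0) {                        /* Empty: refill with up to ten. */
      while (s->nbatch < BATCH && s->nread < s->ninput) {
        s->batch[s->nbatch++] = s->input[s->nread++];
      }
      if (s->nread == s->ninput) s->input_over = 1;
      pthread_cond_signal(&s->worker_cond);      /* One signal; workers chain the rest. */
      if (s->input_over) break;
    } else {                                     /* Non-empty: block until it's drained. */
      pthread_cond_wait(&s->input_cond, &s->lock);
    }
  }
  pthread_mutex_unlock(&s->lock);
  return NULL;
}

static void *worker(void *arg)
{
  struct shared *s = (struct shared *) arg;

  while (1) {
    pthread_mutex_lock(&s->lock);
    while (!s->input_over && s->nbatch == 0) {
      pthread_cond_signal(&s->input_cond);            /* Ask for a refill... */
      pthread_cond_wait(&s->worker_cond, &s->lock);   /* ...and block until it arrives. */
    }
    pthread_cond_signal(&s->worker_cond);        /* Pass the wakeup along. */
    if (s->nbatch == 0) {                        /* Empty and input is over: done. */
      pthread_mutex_unlock(&s->lock);
      return NULL;
    }
    s->sum += s->batch[--s->nbatch];             /* A real worker would copy the number */
    pthread_mutex_unlock(&s->lock);              /* and test it after unlocking. */
  }
}

/* Run one input thread and nworkers workers (at most 8); return the sum consumed. */
long run_batches(const long *input, int ninput, int nworkers)
{
  struct shared s;
  pthread_t in, tids[8];
  int i;

  if (nworkers > 8) nworkers = 8;
  s.input = input; s.ninput = ninput; s.nread = 0;
  s.nbatch = 0; s.input_over = 0; s.sum = 0;
  pthread_mutex_init(&s.lock, NULL);
  pthread_cond_init(&s.input_cond, NULL);
  pthread_cond_init(&s.worker_cond, NULL);
  pthread_create(&in, NULL, input_thread, &s);
  for (i = 0; i < nworkers; i++) pthread_create(&tids[i], NULL, worker, &s);
  for (i = 0; i < nworkers; i++) pthread_join(tids[i], NULL);
  pthread_join(in, NULL);
  return s.sum;
}
```

Every number is consumed exactly once, so the sum that comes back equals the sum of the input, whatever the interleaving.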

Here's the diagram of how the threads and shared data fit together:

Here's the relevant code, in src/prime_4_input_thread.c. First, the new shared data:

struct shared {
  int debug;                       /* This is 0 or 1.  If 1, it will print more information. */
  int workers_alive;               /* Boolean for whether there are still worker threads. */
  int input_over;                  /* Boolean for whether there is no more input. */
  Dllist primes;                   /* List of primes found by the worker threads. */
  Dllist primes_to_test;           /* List of numbers to test for primality. */
  pthread_mutex_t *output_mutex;   /* Mutex for the list of primes. */
  pthread_cond_t  *output_cond;    /* Condition variable to wake up the output thread. */
  pthread_mutex_t *input_mutex;    /* Mutex for the list of numbers to test. */
  pthread_cond_t *input_cond;      /* Condition variable for the input thread to wait. */
  pthread_cond_t *worker_cond;     /* Condition variable for the workers to wait. */
};

Next, the input thread:

/* Here's the code for the input thread.  When the primes_to_test list is empty, it
   reads ten numbers from standard input and puts them onto the list.  It then
   signals the worker_cond once.  If there are multiple workers waiting,
   they will wake each other up. */
   
void *input_thread(void *arg)
{
  struct shared *s;                /* Shared Information. */
  int i;
  long l;

  s = (struct shared *) arg;

  pthread_mutex_lock(s->input_mutex);

  while (1) {
    /* If the list is empty, then put ten items on it (if there are ten items to
       put on it).  Then signal the worker_cond to wake up a blocking worker.
       That worker will unblock the others if necessary. */

    if (dll_empty(s->primes_to_test)) {
      for (i = 0; i < 10; i++) {

        /* Read an integer and put it on the list. */
        if (scanf("%ld", &l) == 1) {
          dll_append(s->primes_to_test, new_jval_l(l));
          if (s->debug) printf("Reader thread put %ld onto primes_to_test.\n", l);

        /* If we're at EOF, then flag that input is over, signal the worker thread and exit. */
        } else {
          s->input_over = 1;
          pthread_cond_signal(s->worker_cond);
          if (s->debug) printf("Reader thread is done.\n");
          pthread_mutex_unlock(s->input_mutex);
          return NULL;
        }
      }
      pthread_cond_signal(s->worker_cond);

    /* If the list isn't empty, simply block and wait for a worker thread to signal you
       that the list is empty. */

    } else {
      if (s->debug) printf("Reader thread is blocking until primes_to_test is empty.\n");
      pthread_cond_wait(s->input_cond, s->input_mutex);
    }
  }
 
  /* Keep the compiler quiet. */
  return NULL;
}

Finally, here's the code that the workers run to get their testing numbers:

    /* Get a number from the primes_to_test list.  If it's empty, signal the
       input thread, and wait to be unblocked. */

    pthread_mutex_lock(s->input_mutex);
    while (!s->input_over && dll_empty(s->primes_to_test)) {
      if (s->debug) printf("Thread %d - waking up the reader thread and blocking.\n", info->id);
      pthread_cond_signal(s->input_cond);
      pthread_cond_wait(s->worker_cond, s->input_mutex);
    }
    pthread_cond_signal(s->worker_cond);   /* See the lecture notes for why I do this. */

    if (!dll_empty(s->primes_to_test)) {  /* Grab a prime to test from the list. */
      prime_to_test = s->primes_to_test->flink->val.l;
      dll_delete_node(s->primes_to_test->flink);
    } else {                              /* Otherwise, the input is done - so return */
      pthread_mutex_unlock(s->input_mutex);
      if (s->debug) printf("Thread %d - exiting.\n", info->id);
      return NULL;
    }

    pthread_mutex_unlock(s->input_mutex);
 
    /* All of the rest of the code is the same. */

It works nicely, and runs at roughly the same speed as the first program.

UNIX> bin/prime_4_input_thread 4 n < numbers-20.txt
Prime     0:     7075339107081019
Prime     1:     8608832394453737
UNIX> time bin/prime_4_input_thread 4 n < numbers-5000.txt > tmp.txt
93.606u 0.033s 0:23.83 392.9%	0+0k 0+3io 0pf+0w
UNIX> 
UNIX> sed 's/.* //' output-5000.txt | sort | openssl md5      This shows that the primes are the same.
(stdin)= 5aad6da280e11ecd46f34b622ec77933
UNIX> sed 's/.* //' tmp.txt | sort | openssl md5
(stdin)= 5aad6da280e11ecd46f34b622ec77933
UNIX> 

Lessons Learned from this Lecture

You've gotten further illustration of race conditions and how to help prevent them with a mutex. You've also learned how to use a condition variable to control a thread that needs to block and unblock because of certain conditions. In this example, we had three classes of threads: an input thread, worker threads and an output thread. In your chat server lab, you'll have an output thread that is very similar to the one here.