CS360 Lecture notes -- Thread #6 - Sockets and Performance


In this lecture, we go over race conditions in more detail, focusing on the use of mutexes and on the trade-off between safety and performance.

SSNSERVER

The lecture revolves around a piece of code that maintains a database of people/ages/social security numbers. The main code is in src/ssnserver.c. It maintains a red-black tree (t) keyed on a person's name (in the order last, first). The val field points to an Entry struct, which contains the person's name again, his/her age, and his/her social-security number, stored as a string.
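For concreteness, here is roughly what the Entry and the insertion look like. This is a sketch, not the actual code -- I'm assuming the JRB red-black-tree library, and the field and variable names are my guesses:

#include <stdio.h>
#include <string.h>
#include "jrb.h"                     /* assumed: the red-black tree library */

typedef struct {                     /* one person in the database */
  char *fn;                          /* first name */
  char *ln;                          /* last name */
  int age;
  char *ssn;                         /* social security number, kept as a string */
} Entry;

JRB t;                               /* the tree, keyed on "last first" */

/* Adding an entry e: build the key and insert, with the val pointing to e. */
void add_entry(Entry *e)
{
  char key[200];

  sprintf(key, "%s %s", e->ln, e->fn);
  jrb_insert_str(t, strdup(key), new_jval_v((void *) e));
}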

src/ssnserver.c creates the tree and then accepts four kinds of input from standard input:

  1. ADD fn ln age ssn -- This adds an entry to the tree.
  2. DELETE fn ln -- This deletes an entry from the tree.
  3. PRINT -- This prints the tree.
  4. DONE -- This causes the program to exit.
Try it out:
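For example, you can type something like this at it (I'm assuming the executable is bin/ssnserver; I'm not showing the program's output, and the people are made up):

UNIX> bin/ssnserver
ADD George Washington 57 000-00-0001
ADD John Adams 61 000-00-0002
PRINT
DELETE John Adams
PRINT
DONE
UNIX>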

INPUTGEN

Ok, now look at src/inputgen.c. This is a program that I wrote to really beat on ssnserver. As input, it takes a number of events, a random number seed, and a file of last names. The file of last names that I've created is lns.txt, which is simply a dictionary of words copied into a file. The program reads the last names into the array lns, and it has an array fns of 65 first names. Now, what it does is create nevents random input events for src/ssnserver.c. The first 50 events are random ADD events, and thereafter, it will create either ADD, DELETE or PRINT events (these in the ratio 5/5/1). It ends with a PRINT and a DONE event.

In order to create DELETE events that correspond to entries in the tree, inputgen uses an rb-tree of its own. This tree is keyed on a random number, and its val field is one of the names that it added previously. When it creates a DELETE event, it chooses the first name in the tree (since the tree is keyed on random numbers, this will be an effectively random name), deletes it from the tree, and then uses that name for the DELETE event.
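The DELETE-event trick looks roughly like this (a sketch, assuming the JRB library; the variable names are mine, and I'm assuming each val holds a "fn ln" string):

JRB del, node;                      /* del is keyed on random integers */
char *name;

node = jrb_first(del);              /* smallest random key, so an effectively random name */
name = node->val.s;
printf("DELETE %s\n", name);        /* name holds "fn ln" */
jrb_delete_node(node);              /* remove it so we never DELETE the same person twice */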

So, this is a little complex, but you should be able to understand it. Inputgen is set up so that the tree that it manages will average around 50 elements, regardless of the number of events that it generates. To prove this to yourself, try it:
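For example (I'm guessing at the executable names and at inputgen's argument order -- number of events, random seed, last-name file):

UNIX> bin/inputgen 1000 100 lns.txt | bin/ssnserver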

You'll note that the tree printed at the end has 50 elements.

Turning ssnserver into a real server

Now, look at src/ssnserver1.c.

What this does is turn ssnserver into a real server. It sets up a server socket, calls accept_connection(), and then creates a server_thread() thread to service the connection. The server_thread() thread works just like src/ssnserver.c, except that the tree is a global variable.
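The main() is structured roughly as follows. This is a sketch: I'm assuming the serve_socket()/accept_connection() calls from the course's socket library, and the rest, including the decision to join each thread before accepting another connection, is my guess at the code:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include "jrb.h"                    /* assumed tree library */
#include "sockettome.h"             /* assumed socket library: serve_socket(), accept_connection() */

JRB t;                              /* the tree is now a global variable */

void *server_thread(void *arg);     /* services one connection, as in ssnserver.c */

int main(int argc, char **argv)
{
  int port, s, fd;
  pthread_t tid;

  port = atoi(argv[1]);
  t = make_jrb();
  s = serve_socket(port);           /* set up the server socket on the given port */
  while (1) {
    fd = accept_connection(s);      /* block until a client connects */
    pthread_create(&tid, NULL, server_thread, (void *) (long) fd);
    pthread_join(tid, NULL);        /* one connection at a time; ssnserver2 drops this
                                       join so that connections are serviced simultaneously */
  }
  return 0;
}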

Try it out with nc. For example, in one window on hydra4 I do:
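Something like this, assuming ssnserver1 just takes a port number on the command line (8889 is arbitrary):

UNIX> bin/ssnserver1 8889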

while in another, I connect to it with nc (giving it hydra4 and that port) and type commands at it. It works just fine.

I modified src/inputgen.c to work as a socket client -- the code is in src/inclient.c. It is straightforward, and it uses a second thread to read the socket's output and print it to standard output. Try it out on the same server.

Now, look at src/ssnserver2.c. This works just like ssnserver1, except that it can service multiple connections simultaneously by forking off one server_thread() per connection. Note, however, that access to t is not protected by mutexes. This presents a problem because, for example, one thread may be adding an element to the tree while another is deleting a nearby element. If the first thread is interrupted before it finishes adding its element, then the red-black tree's pointers may not be where they should be when the second thread tries to delete. This will result in an error, probably a core dump.

To help illustrate this, I wrote a shell script called kill_it.sh. It forks off a given number of inclient processes, which all blast away at the given ssnserver2 server.

Try it out: On one machine, start an ssnserver2. For example, I did the following on hydra4:
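The commands were something like the following. kill_it.sh's arguments are the server's host, the port, the number of events per client, and the number of clients (the same form as the timing runs later in these notes); the port number is arbitrary:

(on hydra4)  UNIX> bin/ssnserver2 8889
(on hydra3)  UNIX> sh kill_it.sh hydra4 8889 1000 5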

Then, on hydra3, I had 5 inclients each send 1000 entries to the server simultaneously (the kill_it.sh line above).

Within a few seconds, the ssnserver2 process died with a segmentation violation. This doesn't always happen, but it usually does. The reason is that access to the tree t is not protected.

Adding a mutex

Now look at src/ssnserver3.c. This adds a mutex that each thread locks for as long as it is processing its connection. This solves the problem with accessing t, because no two threads may access t simultaneously. Try kill_it.sh again -- ssnserver3 on hydra4, the clients on hydra3 -- and this time there is no core dump!
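In code, the change looks roughly like this. pthread_mutex_lock() and pthread_mutex_unlock() are the real pthread calls; the surrounding structure and the names are my sketch of the code:

#include <stdio.h>
#include <string.h>
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* protects the global tree t */

void *server_thread(void *arg)
{
  int fd = (int) (long) arg;
  FILE *fin = fdopen(fd, "r");      /* read the client's commands a line at a time */
  char line[1000];

  pthread_mutex_lock(&lock);        /* held for the connection's entire lifetime */
  while (fgets(line, sizeof(line), fin) != NULL) {
    if (strncmp(line, "DONE", 4) == 0) break;
    /* ... parse ADD / DELETE / PRINT here and operate on t ... */
  }
  pthread_mutex_unlock(&lock);

  fclose(fin);
  return NULL;
}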

So, this solves the mutual exclusion problem, but it is like stapling papers with a sledge hammer. By having each thread lock the mutex throughout its lifetime, we have serialized the server -- no two threads can do anything simultaneously, and this is a performance problem. For example, a client could open a connection and then do nothing, thereby disabling the server!

We solve this problem in a very standard way, with src/ssnserver4.c. Instead of holding the mutex for a thread's whole lifetime, each thread locks it only while it is actually accessing the tree. This is within the code for ADD, DELETE and PRINT.
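For example, the ADD path now looks roughly like this. Again a sketch: make_entry() is a hypothetical helper that mallocs and fills in an Entry, and I'm assuming the JRB calls:

Entry *e;
char key[200];

e = make_entry(fn, ln, age, ssn);       /* hypothetical helper -- builds the Entry */
sprintf(key, "%s %s", ln, fn);

pthread_mutex_lock(&lock);              /* grab the mutex only around the tree operation */
jrb_insert_str(t, strdup(key), new_jval_v((void *) e));
pthread_mutex_unlock(&lock);            /* released before we go back to reading the socket */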

You can now test that multiple clients may indeed access the server simultaneously by using nc as the client.

When I went to demonstrate the performance of the two programs, I was confronted with something that happens all too often in computer science research. I couldn't make sense of the results. What I did was the following. I ran the ssnserver3 on port 8889 of hydra3, and then I ran the following on mamba, the machine on my desk:

UNIX> time sh kill_it.sh hydra3 8889 120000 10
26.268u 11.264s 1:13.13 51.3% 0+0k 8+0io 0pf+0w
UNIX>
Next, I ran ssnserver4 on port 8889 of hydra3:
 
UNIX> time sh kill_it.sh hydra3 8889 120000 10
37.515u 15.702s 1:45.63 50.3% 0+0k 0+0io 0pf+0w
UNIX>
That is odd indeed -- 1:13 for the version that serializes everything, and 1:45 for the version that allows the clients to work in parallel. It doesn't make sense!

I have been a professor long enough now to know that many students, when they get results like these, graph them and call it a day. You need to avoid that temptation, and instead try to explain what is going on. To probe further, I printed out the starting time of the shell script, and then the completion times of the 10 clients in the serialized version (their time(0) values):

T0 1524685224
T0 1524685226
T0 1524685228
T0 1524685231
T0 1524685237
T0 1524685244
T0 1524685252
T0 1524685261
T0 1524685272
T0 1524685283
T0 1524685296
You'll note that successive clients take the following times to complete: 2 seconds, 2, 3, 6, 7, 8, 9, 11, 11 and 13. Can you explain why each client takes longer than the one before it?

Here's the reason -- when the first client runs, it puts 50 elements into the tree, and then processes its 119,950 other events. When it's done, the tree has roughly 50 elements. When the second client runs, it puts 50 more elements into the tree, and then processes its 119,950 other events. When it's done, the tree has roughly 100 elements. Each successive client works on a tree that gets incrementally bigger.

Now, think about the clients when the server allows them to work simultaneously. Pretty much instantly, the tree has 500 elements, because each of the 10 clients adds its initial 50. That means every client does all of its work on a tree of roughly 500 elements. The server therefore averages roughly 10.5 seconds of work per client (105 seconds total divided by 10 clients), compared to an average of 7.2 seconds per client in the serialized run, where the early clients worked on much smaller trees.

In other words, we have set up a bad experiment for comparing the two servers. Now, I'm not going to set up a better one, because I don't have the time, and I don't think that it adds value to the class. A better experiment would be a program that populates the tree first, and then fires off clients that just do ADDs, DELETEs and PRINTs.

Remember this lecture, because you will likely see something like it in your near future, when you are trying to analyze something involving computers.



ssnserver5

There's another improvement that you can make to increase the amount of parallelism. Ask yourself: does the mutex really need to be locked while printing the tree? No, not really. You can use some buffering to help you. While holding the mutex, build the string that you'll use to print the tree -- this takes some time, but not nearly as much as writing the string to the socket. Then release the mutex and write the string. This is done in src/ssnserver5.c. I wish that this improved performance more than it does -- it improved the time of the above test from 1:45 to 1:39. Someday, I'll explore this one more, but for now, just let it be known that it is a nice application of buffering that does improve performance somewhat.
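Roughly, the PRINT code now does something like this. It's a sketch assuming the JRB library; the two-pass buffer building and the names are my own, and fout stands for the FILE* wrapped around the socket:

JRB node;
Entry *e;
char *buf, *p;
int len;

pthread_mutex_lock(&lock);
len = 0;
jrb_traverse(node, t) {                 /* pass 1: figure out how big the output is */
  e = (Entry *) node->val.v;
  len += strlen(e->fn) + strlen(e->ln) + strlen(e->ssn) + 30;
}
buf = (char *) malloc(len + 1);
p = buf;
jrb_traverse(node, t) {                 /* pass 2: build the string while still locked */
  e = (Entry *) node->val.v;
  p += sprintf(p, "%s %s %d %s\n", e->fn, e->ln, e->age, e->ssn);
}
pthread_mutex_unlock(&lock);

fputs(buf, fout);                       /* the slow socket write happens unlocked */
fflush(fout);
free(buf);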

The lesson to be learned

The lesson to be learned here is that you need to think carefully about your use of synchronization primitives. There are two issues: correctness and performance. You want to make sure that there are no race conditions in your code, as there were in src/ssnserver2.c. However, you want to eliminate those race conditions in a way that maximizes performance. That means holding a mutex for no longer than you actually need it. If you are performing a very time-consuming operation (such as writing to a socket or file) while holding the mutex, then you should consider using buffering so that you can move the time-consuming operation out of the code that holds the mutex. This is what ssnserver5 does.