# Parallel Processing

Parallel processing is the ability to carry out multiple tasks at the same time.  All of us, as humans, do this:  you listen to music while doing your homework, chat on your cellphone while walking, etc.  Computers nowadays do this as well, but many older computers did not do any parallel processing.   Raw processing power is usually measured in megaflops--millions of floating-point operations per second.  This is the traditional "number-crunching":  floating-point arithmetic, where you need to worry about decimal points, requires more work than ordinary integer arithmetic.

The Cray-1, which dates back to 1976, cost about \$15 million in 2007 dollars.  It was rated at about 180 megaflops peak speed.  It required very elaborate cooling equipment and a team of skilled people to maintain it.  By contrast, a 3 GHz Pentium PC at home costs about \$800 and has a peak speed of about 3 gigaflops.  Times have changed!  Parallel processors nowadays tend to be clusters of off-the-shelf components.  You might, for example, connect together sixteen 3 GHz Pentiums by using network equipment that runs in the gigabit range:  this might cost, say, \$25K.  At JICS (Joint Institute for Computational Sciences) at Oak Ridge, there's a computing equipment room about the size of a football field:  processing power there is measured in teraflops (trillions of floating-point operations per second) and petaflops (quadrillions of flops).
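
To put the prices and speeds quoted above side by side, here is a small back-of-the-envelope calculation of dollars per megaflop.  The cluster figure is an assumption:  it simply multiplies one PC's peak speed by 16 and ignores any time lost to communication, so treat it as an upper bound rather than a measurement.

```python
# Rough cost-per-megaflop comparison using the figures quoted in the text.
# Assumption: the cluster's peak speed is 16 x one PC's peak speed, which
# ignores all communication overhead.

MACHINES = {
    # name: (cost in dollars, peak speed in megaflops)
    "Cray-1 (1976, 2007 dollars)": (15_000_000, 180),
    "3 GHz Pentium PC": (800, 3_000),
    "16-PC cluster": (25_000, 16 * 3_000),
}


def dollars_per_megaflop(cost, megaflops):
    """Return how many dollars one megaflop of peak speed costs."""
    return cost / megaflops


for name, (cost, megaflops) in MACHINES.items():
    print(f"{name}: ${dollars_per_megaflop(cost, megaflops):,.2f} per megaflop")
```

On these numbers the Cray-1 works out to tens of thousands of dollars per megaflop, while the PC and the cluster both come in under a dollar.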

Your PC at home does some parallel processing.  The CPU on a much older machine would have just one ALU (arithmetic-logic unit) which could carry out one arithmetic operation at a time.  Current CPUs have multiple ALUs, and so can be doing multiple arithmetic operations at once.  In fact, the underlying hardware is carefully designed to squeeze as much speed as possible out of the CPU.  Let's look at a sequence of machine instructions the CPU needs to execute (we'll assume we have 3 ALUs here):
1) R1 = R2 + R3     (add R2 and R3, store the result in R1)
2) R4 = R6 - R1
3) R7 = R8 * R9
4) R11 = R12 + R13
This sequence of instructions poses some interesting problems.  Let's begin by having ALU #1 start instruction #1:  R1 = R2 + R3.  When can we start instruction #2?  Instruction #2 reads R1, so it must wait for instruction #1 to finish before the correct result is in R1.  Your PC's CPU can figure out that instruction #3 need not wait on the completion of instructions 1 or 2--it uses only registers 7, 8, and 9--so ALU #2 starts carrying out instruction 3:  R7 = R8 * R9.  Likewise, instruction #4 uses only registers 11, 12, and 13, so ALU #3 can start it right away.  The problem is known as "register dependency", and starting later instructions ahead of earlier ones that are still waiting is called "out-of-order execution of instructions".  If this is done correctly, it can substantially speed up the overall effective execution speed.  It's tricky--but it works, and this is done by your PC:  arithmetic ops are being done in parallel.
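
The dependency check described above can be sketched in a few lines of Python.  This is only a toy model, not how any real CPU is built:  the `Instr` class, the `conflicts` and `schedule` functions, and the assumption that every instruction takes exactly one cycle are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    number: int    # position in program order
    dest: str      # register the instruction writes
    srcs: tuple    # registers the instruction reads


# The four instructions from the text.
PROGRAM = [
    Instr(1, "R1", ("R2", "R3")),
    Instr(2, "R4", ("R6", "R1")),
    Instr(3, "R7", ("R8", "R9")),
    Instr(4, "R11", ("R12", "R13")),
]


def conflicts(earlier, later):
    """Return True if `later` must wait for `earlier` to finish."""
    return (
        earlier.dest in later.srcs      # later reads what earlier writes
        or earlier.dest == later.dest   # both write the same register
        or later.dest in earlier.srcs   # later overwrites what earlier reads
    )


def schedule(program, num_alus=3):
    """Return, cycle by cycle, the instruction numbers started in that cycle.

    Assumption: every instruction takes exactly one cycle to finish.
    """
    if num_alus < 1:
        raise ValueError("need at least one ALU")
    finished = set()
    pending = list(program)
    cycles = []
    while pending:
        started = []
        for instr in pending:
            if len(started) == num_alus:
                break  # every ALU is busy this cycle
            unfinished_earlier = [
                p for p in program
                if p.number < instr.number and p.number not in finished
            ]
            if not any(conflicts(e, instr) for e in unfinished_earlier):
                started.append(instr)
        finished.update(i.number for i in started)
        pending = [i for i in pending if i.number not in finished]
        cycles.append([i.number for i in started])
    return cycles


print(schedule(PROGRAM))  # [[1, 3, 4], [2]]
```

With 3 ALUs, instructions 1, 3, and 4 start together and instruction 2 follows one cycle later, so the four instructions take two cycles instead of four.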

If you are willing to spend some additional money, you could get a "dual-processor" Pentium computer.  This actually has two CPUs working together on problems.  Extra work must be done by the CPUs to coordinate tasks, but it can make your PC quite a bit faster.
You may also have a system with dual video cards for extra video horsepower--this is popular with some gamers.
---------------------------------------------------------------------------
Other than a simple (actually, far from simple) PC with one CPU and its multiple ALUs, we see two basic approaches to parallel processors:  the first is called "shared-memory" parallel processors, the other is known as "distributed memory" parallel processors.  And just to complicate matters, there are systems that do not have a clear dividing line--some memory may be distributed, while other memory might be shared.

First:  shared-memory parallel processors.  The dual-processor Pentium is an example of this.  You have one memory in the PC--say 2 gigabytes of RAM.  Both CPUs access this same shared memory.  There are inherent conflicts:  suppose both CPUs want to fetch the same piece of data from memory at once?  Or suppose one wants to store to a location that the other wants to fetch from?  Each CPU has its own cache (think back to that).  Keeping the caches coordinated is a major problem (the "cache-coherence problem").  Suppose, for example, that CPU #1 has changed the variable NUMBER in its cache by writing to it, and CPU #2 wants to fetch the variable NUMBER from its own cache.  NUMBER might have the value -3 in CPU 1's cache, and it might still have the old value 46 in CPU 2's cache.  Coordination is vital here so that CPUs are not working with incorrect or stale values (and I might note that on your own PC at home with its single CPU, there can also be cache coherence problems).  In general, imagine, say, a shared memory system with 16 CPUs and one large memory bank.  Hardware must resolve conflicts about which CPU gets what values.  The old Crays and, again, dual-processor Pentiums are examples of shared-memory parallel processors.
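
The NUMBER example above can be acted out with a toy model.  This is a deliberately simplified sketch, not real hardware behavior:  the `Cache` class and `demo` function are invented for illustration, main memory is just a Python dictionary, and the coordination shown (write to memory and throw away every other cache's copy) is only one simple scheme among the several that real machines use.

```python
class Cache:
    """A toy private cache sitting in front of a shared main memory."""

    def __init__(self, memory):
        self.memory = memory   # shared dict standing in for main RAM
        self.lines = {}        # this CPU's private copies
        self.peers = []        # other caches to notify (empty = no coordination)

    def read(self, name):
        if name not in self.lines:             # cache miss: go to main memory
            self.lines[name] = self.memory[name]
        return self.lines[name]

    def write(self, name, value):
        self.lines[name] = value
        self.memory[name] = value              # write through to main memory
        for peer in self.peers:
            peer.lines.pop(name, None)         # invalidate the other copies


def demo(coordinated):
    """Return what CPU 1 and CPU 2 each see after CPU 1 writes NUMBER."""
    memory = {"NUMBER": 46}
    cpu1, cpu2 = Cache(memory), Cache(memory)
    if coordinated:
        cpu1.peers.append(cpu2)
        cpu2.peers.append(cpu1)
    cpu1.read("NUMBER")          # both CPUs pull NUMBER into their caches
    cpu2.read("NUMBER")
    cpu1.write("NUMBER", -3)     # CPU 1 changes it
    return cpu1.read("NUMBER"), cpu2.read("NUMBER")


print(demo(coordinated=False))  # (-3, 46): CPU 2 is working with a stale value
print(demo(coordinated=True))   # (-3, -3): CPU 2 was forced to re-fetch
```

Without coordination, CPU 2 keeps reading its old copy even though main memory already holds the new value.  That is exactly the situation the hardware has to prevent.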

Second, distributed-memory parallel processors.  Here, each CPU has its own bank of memory to work with, as well as its own cache or caches.  There are no inherent memory clashes.  What is needed is a good communications network between the machines, along with communication instructions.  You and 3 friends playing EverQuest on 4 different PCs are an example, of sorts, of a distributed-memory system.  Each of you is controlling a different character, but each of you is seeing the same thing on your monitor.  Your PCs have to send information back and forth.  As your paladin draws his sword and slashes at the hobgoblin, your friends will see that action and movement on their monitors, and will have their own characters act accordingly.  There must be coordination between the computers:  your computer must send out information to the other PCs about what your character has just done.  If it did not send out this information, multiplayer games would be worthless.  So think of a cluster of PCs--say 32 PCs with a high-speed (say 1 gigabit) interconnection network.  PC #17 can send a message to PC #5 about what it has done, what it needs PC #5 to do, etc.  The components must cooperate.  One of the common tools is MPI--Message Passing Interface--which can run on any set of networked PCs, Macs, UNIX workstations, etc.  The programming languages that are used need special additions of the form
SendMessage(message, size of message, destination host, etc)  and RecvMessage(message, size of message, sending host, etc)
A library (like Tk, random, etc) is brought in which has these MPI features.
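
To give a feel for the send/receive style, here is a small simulation that runs on a single PC.  It is not MPI:  threads stand in for the separate machines, one `queue.Queue` per node stands in for the network, and the names `send_message`, `recv_message`, and `run_cluster` are made up to mirror the SendMessage/RecvMessage forms above (real MPI spells them `MPI_Send` and `MPI_Recv`, and takes more arguments).  Each node adds up its own share of the numbers 1..n, then every node except node 0 sends its partial sum to node 0, which receives them and totals them.  With the default settings the printed total is 5050.

```python
import queue
import threading


def send_message(mailboxes, message, dest, source):
    """Deliver `message` to node `dest`, tagged with the sender's number."""
    mailboxes[dest].put((source, message))


def recv_message(mailboxes, me):
    """Wait for the next message addressed to node `me`."""
    return mailboxes[me].get(timeout=5)


def run_cluster(num_nodes=4, n=100):
    """Sum 1..n on `num_nodes` simulated nodes that only exchange messages."""
    mailboxes = [queue.Queue() for _ in range(num_nodes)]  # the "network"
    answer = {}  # used only to hand node 0's final result back to the caller

    def node(rank):
        # Each node works only on its own slice of the data.
        local_sum = sum(range(rank + 1, n + 1, num_nodes))
        if rank != 0:
            send_message(mailboxes, local_sum, dest=0, source=rank)
        else:
            total = local_sum
            for _ in range(num_nodes - 1):
                _source, value = recv_message(mailboxes, me=0)
                total += value
            answer["total"] = total

    threads = [threading.Thread(target=node, args=(r,)) for r in range(num_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return answer["total"]


print(run_cluster())
```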
---------------------------
The other main problem here is how to make effective use of a parallel system.  We'll look at that in part 2.