Parallel Processing
Parallel processing is the ability to
carry out multiple tasks at the same time. All of us, as humans,
do this: you listen to music while doing your homework, chat on
your cellphone while walking, etc. Computers nowadays do this as
well, but many older computers did not do any parallel
processing. Raw processing power is usually measured in
megaflops--millions of floating-point operations per second. This
is the traditional "number-crunching": floating-point arithmetic,
where you need to worry about decimal points, requires more work than
ordinary integer arithmetic.
The Cray-1, which dates back to 1976, cost about $15 million in 2007
dollars. It was rated at about 180 megaflops peak speed, and it
required very elaborate cooling equipment and a team of skilled people
to maintain it. By contrast, a 3 GHz Pentium PC at home costs about
$800 and has a peak speed of about 3 gigaflops. Times have
changed! Parallel processors nowadays tend to be clusters of
off-the-shelf components. You might, for example, connect together
sixteen 3 GHz Pentiums using network equipment that runs in the gigabit
range; this might cost, say, $25K.
At JICS (the Joint Institute for Computational Sciences) at Oak Ridge,
there's a computing equipment room about the size of a football
field; processing power there is measured in teraflops (trillions of
floating-point operations per second) and petaflops (quadrillions of
flops).
Your PC at home does some parallel processing. The CPU on a much
older machine would have just one ALU (arithmetic-logic unit) which
could carry out one arithmetic operation at a time. Current CPUs
have multiple ALUs, and so can be doing multiple arithmetic operations
at once. In fact, the underlying hardware is carefully designed
to squeeze as much speed as possible out of the CPU. Let's look at
a sequence of machine instructions the CPU needs to execute (we'll
assume we have 3 ALUs here):
1) R1 = R2 + R3      (add R2 and R3, store the result in R1)
2) R4 = R6 - R1
3) R7 = R8 * R9
4) R11 = R12 + R13
This sequence of instructions poses some interesting problems.
Let's begin by having ALU #1 start instruction #1: R1 = R2 +
R3. When can we start instruction #2? Instruction #2 needs
instruction #1 to finish so we have the correct result in R1.
Your PC's CPU can figure out that instruction #3 need not wait on the
completion of instructions 1 or 2--it's using registers 7, 8, and 9--so
ALU #2 starts carrying out instruction 3: R7 = R8 * R9.
Likewise, instruction #4 uses only registers 11, 12, and 13, so ALU #3
can start on it at the same time.
This problem is known as "register dependency", and what we're doing is
called "out-of-order execution of instructions". If this is done
correctly, it can substantially increase the overall effective
execution speed. It's tricky--but it works, and this is done by
your PC: arithmetic ops are being done in parallel.
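To make that idea concrete, here's a small sketch in Python (my own
illustration, not something a CPU actually runs--real CPUs do this with
dedicated hardware, and a name like NUM_ALUS is just made up for the
sketch). It takes the four instructions above and groups them into
"waves" that could be issued at the same time: an instruction has to
wait if one of its source registers is written by an earlier
instruction that hasn't run yet.

    # Toy register-dependency analysis: group the four instructions from
    # the text into "waves" that could be issued to the ALUs together.
    # Each entry: (instruction number, destination register, source registers)
    instructions = [
        (1, "R1",  ("R2", "R3")),    # R1 = R2 + R3
        (2, "R4",  ("R6", "R1")),    # R4 = R6 - R1   (needs R1 from #1)
        (3, "R7",  ("R8", "R9")),    # R7 = R8 * R9   (independent)
        (4, "R11", ("R12", "R13")),  # R11 = R12 + R13 (independent)
    ]

    NUM_ALUS = 3                     # we assumed 3 ALUs above
    remaining = list(instructions)
    wave = 0

    while remaining:
        wave += 1
        # Registers that will be written by instructions not yet executed
        unfinished_dests = {dest: num for num, dest, _ in remaining}

        # An instruction is ready if no source register is produced by an
        # earlier instruction that hasn't executed yet.
        ready = [
            (num, dest, sources)
            for num, dest, sources in remaining
            if not any(src in unfinished_dests and unfinished_dests[src] < num
                       for src in sources)
        ]

        issued = ready[:NUM_ALUS]    # only NUM_ALUS instructions can go at once
        print(f"wave {wave}: issue instructions {[num for num, _, _ in issued]}")
        remaining = [ins for ins in remaining if ins not in issued]

Running it prints that instructions 1, 3, and 4 go out in the first
wave and instruction 2 (which needs R1) goes in the second--exactly the
schedule described above.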
If you are willing to spend some additional money, you could get a
"dual-processor" Pentium computer. This actually has two CPUs
working together on problems. Extra work must be done by the CPUs
to coordinate tasks, but it can make your PC quite a bit faster.
You may also have a system with dual video cards for extra video
horsepower--this is popular with some gamers.
---------------------------------------------------------------------------
Beyond a simple (actually, far from simple) PC with one CPU and its
multiple ALUs, we see two basic approaches to parallel
processors: the first is called "shared-memory" parallel
processors, the other is known as "distributed-memory" parallel
processors. And just to complicate matters, there are systems
that do not have a clear dividing line--some memory may be distributed,
while other memory might be shared.
First: shared-memory parallel processors. The
dual-processor Pentium is an example of this. You have one memory
in the PC--say 2 gigabytes of RAM. Both CPUs access this same
shared memory. There are inherent conflicts: what if both
CPUs want to fetch the same piece of data from memory at once? Or
one wants to store at a location that the other CPU wants to fetch
from? Each CPU has its own cache (think back to that).
Keeping the caches coordinated is a major problem (the "cache-coherence
problem"). Suppose, for example, that CPU #1 has changed the
variable NUMBER in its cache by writing to it, and CPU #2 wants to
fetch the variable NUMBER from its own cache. NUMBER might have
the value -3 in CPU 1's cache, and it might have the value 46 in CPU
2's cache. Coordination is vital here so that CPUs are not
working with incorrect or obsolete values (and I might note that even on
your own PC at home, with its single CPU, there can be cache-coherence
problems). In general, imagine, say, a shared-memory
system with 16 CPUs and one large memory bank. Hardware must
resolve conflicts about which CPU gets what values. The old Crays
and, again, dual-processor Pentiums are examples of shared-memory
parallel processors.
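The coordination problem above is described at the hardware level, but
you can see the same issue in software on any shared-memory machine.
Here is a minimal sketch in Python (my choice of language and example,
not something from these notes) using the standard multiprocessing
module: two processes add to one counter that lives in shared memory,
and the lock plays roughly the role of the hardware arbitration just
described. Without the lock, some of the read-modify-write updates
could be lost.

    # Two processes incrementing one shared counter.  The lock ensures
    # that only one process updates the counter at a time.
    from multiprocessing import Process, Value

    def worker(counter, n):
        for _ in range(n):
            with counter.get_lock():     # coordinate access to shared memory
                counter.value += 1

    if __name__ == "__main__":
        counter = Value("i", 0)          # one integer shared by both processes
        procs = [Process(target=worker, args=(counter, 100_000)) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print("final count:", counter.value)   # 200000, thanks to the lock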
Second: distributed-memory parallel processors. Here, each CPU
has its own bank of memory to work with, as well as its own cache or
caches. There are no inherent memory clashes. What is
needed is a good communications network between the machines, along
with communication instructions. You and three friends playing
EverQuest on four different PCs are an example, of sorts, of a
distributed-memory system. Each of you is controlling a
different character, but each of you is seeing the same thing on your
monitor. Your PCs have to send information back and forth.
As your paladin draws his sword and slashes at the hobgoblin, your
friends will see that action and movement on their monitors, and will
have their own characters act accordingly. There must be
coordination between the computers: your computer must send out
information to the other PCs about what your character has just
done. If it did not send out this information, multiplayer games
would be worthless. So think of a cluster of PCs--say 32 PCs with
a high-speed (say 1 gigabit) interconnection network. PC #17 can
send a message to PC #5 about what it has done, what it needs PC #5 to
do, etc. The components must cooperate. One of the common
tools is MPI--Message Passing Interface--which can run on any set of
networked PCs, Macs, UNIX workstations, etc. The programming
languages that are used need special additions of the form

    SendMessage(message, size of message, destination host, etc.)
    RecvMessage(message, size of message, sending host, etc.)

A library is brought in (much as you bring in Tk or random in Python)
that provides these MPI features.
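SendMessage and RecvMessage above are just generic placeholders. As a
concrete sketch (my own, using mpi4py, the Python bindings for MPI--the
send/recv calls are real mpi4py calls, but the "game update" payload
and the two-process setup are invented for illustration), here is what
passing one message looks like. You would launch it with something
like "mpiexec -n 2 python update.py".

    # Process 0 plays the role of "PC #17": it sends a message about what
    # its character just did.  Process 1 plays "PC #5" and receives it.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()                   # this process's ID: 0, 1, 2, ...

    if rank == 0:
        update = {"player": "paladin", "action": "slash", "target": "hobgoblin"}
        comm.send(update, dest=1, tag=0)     # send to process 1
        print("process 0 sent:", update)
    elif rank == 1:
        update = comm.recv(source=0, tag=0)  # wait for process 0's message
        print("process 1 received:", update)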
---------------------------
The other main problem here is how to make effective use of a parallel
system. We'll look at that in part 2.