There are many possible reasons to employ parallelism. Some ways of implementing parallelism are quite intrusive and require your code to be refactored, or sometimes completely rewritten. Other times the interface is so well hidden from sight that you don't even realize there is parallel computing underneath; CPU pipelines are a good example. Given the rich set of alternative ways to implement parallelism, it is always important to first consider what your reasons are for using parallel computing, and more specifically in this case, parallel visualization.
Let us show a few almost trivial ways to get parallelism.
Say you have a visualization code that extracts an isosurface. The basic steps are very straightforward: read the data file, process it, and write out the output. Normally you would write one serial program (let's call this code mc) to perform these tasks sequentially. In this case, mc will first be I/O bound, then CPU bound, and then I/O bound again. Usually when people say "let's try to parallelize mc", they imply rewriting it using MPI. But there is a much simpler way, as follows.
- step 1: write a piece of code: dumbread
- step 2: write another piece of code: dumbwrite
dumbread simply reads a file from disk byte by byte and writes the bytes to stdout. dumbwrite simply reads from stdin and writes the bytes to disk. Now:
- step 3: change mc to read from stdin and write to stdout. Then, you can run the code as:
UNIX> dumbread volumedata | mc | dumbwrite geometry-meshdata
This causes three processes to run concurrently on your machine: a reader, a worker, and a writer. While data is being read, mc can start processing as data becomes incrementally available. As long as mc works in a streaming manner, this new approach is likely a whole lot faster. Taking another look, what is this doing? Well, it's a pipeline.
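To make this concrete, here is a minimal sketch of what dumbread could look like, assuming the byte-by-byte behavior described above (dumbwrite would be the mirror image: read stdin, write the named file):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *fp;
    int c;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s file\n", argv[0]);
        exit(1);
    }
    if ((fp = fopen(argv[1], "rb")) == NULL) {
        perror(argv[1]);
        exit(1);
    }
    /* stream the file to stdout byte by byte; the next process in the
       pipe can start consuming as soon as bytes become available */
    while ((c = getc(fp)) != EOF)
        putchar(c);

    fclose(fp);
    return 0;
}

In practice you would read and write in larger chunks for speed, but the streaming structure is the point.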
What's good about it? First, you can easily single out the different functional modules of your code. Second, you get parallel computing without doing anything special, and you probably won't introduce new bugs into your mc code.
What's bad about it? Primarily, this model only applies when your code follows the filter design pattern. If there is a runtime need to go back to previously seen data, this model can run into problems, because that potentially requires storing up to three complete copies of the dataset in memory.
A few gotchas to watch for if you want to program in this manner (knowing how to do so is often a trait of elegant systems programmers):
- know that stdin, stdout, and stderr serve totally different purposes.
- know how to refactor or partition data into a streaming model.
- know the differences between system calls and buffered library calls (see the sketch below).
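On the last point, here is a tiny, purely illustrative sketch of the difference between a buffered C library call and a raw system call; both write to stdout, but fwrite() may sit in a user-space buffer until it is flushed, while write() goes straight to the kernel:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *buffered = "via the buffered C library (fwrite)\n";
    const char *raw      = "via the write() system call\n";

    fwrite(buffered, 1, strlen(buffered), stdout);
    fflush(stdout);                          /* force the library buffer out */

    write(STDOUT_FILENO, raw, strlen(raw));  /* one direct system call */
    return 0;
}

Mixing the two without flushing is a classic way to get out-of-order output in a pipeline.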
Also, a few numbers to keep in mind. If your code gets 100% hits in the disk cache, you can get amazing I/O bandwidths; numbers on the order of GB/second are common. Of course, in the real world that ideal case won't happen. The normal I/O bandwidth for a standard disk is still on the order of tens of MB/second. The bandwidth for sending data between dumbread and mc is somewhere around several hundred MB/second.
Now, even better news: you don't have to write dumbread and dumbwrite yourself. A program is already there for you to use. It is the ever-present but always underappreciated cat. So the actual run looks like:
UNIX> cat volumedata | mc | cat > geometry-meshdata

If you have experienced Unix-style programming before, you are probably wondering about running the following:
UNIX> mc < volumedata > geometry-meshdata

It also works. What is the difference? This runs a single process to handle all tasks; there is no pipelining at the process level anymore.
UNIX> cat volumedata | mc | gzip -c > geometry-meshdata.Z

Just for fun, what does this one do? And what about the following:
UNIX> zcat volumedata.Z | mc | cat > geometry-meshdata
The purpose here is quite simple: to get you actively thinking about why to use parallelism to begin with. There are many good reasons, such as the following:
- to get more cycles per second (or FLOPS)
- to get more memory
- to get higher throughput (similar to how web servers are designed)
- to get functional separation to actually create real interfaces
Accordingly, there are different methodologies for developing parallel code, to name a few: data parallel, task parallel, pipeline parallel, etc. Implementation-wise, there are probably just as many paradigms of parallel programming. Concurrent processes are one that we have already seen. Others include threads (pthreads, Windows threads), MPI (Message Passing Interface), OpenMP, etc. At this point I want to emphasize the following: although parallel visualization most often uses MPI, do remember that it is not bound to one tool or another. The core creativity comes from how you study the complex problem and factor it into scalable components.
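As a small illustration that parallelism is not bound to MPI, here is a hedged sketch of a data-parallel loop written with OpenMP (shared memory, threads on a single machine); the array and the arithmetic are made up for the example:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static float data[N];
    int i;

    /* the iterations are independent, so OpenMP can split them
       across threads: classic data parallelism */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        data[i] = data[i] * 0.5f + 1.0f;

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}

Compile with an OpenMP-capable compiler (e.g., gcc -fopenmp) and the same loop runs serially or in parallel without any other change to the code.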
MPI stands for Message Passing Interface. It defines the standard interfaces and programming constructs needed by parallel programs that assume a distributed memory model, and hence distributed memory address spaces. In other words, each running process acts totally independently of the others and communicates only by explicitly sending messages (both point-to-point and collective). If your code needs to run across multiple machines, MPI is the de facto standard tool.
Let's look at an MPI hello world program.
#include <stdio.h>
#include <mpi.h>

main(int argc, char **argv)
{
   int node;

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &node);

   printf("Hello World from Node %d\n",node);

   MPI_Finalize();
}
This program needs to be compiled using mpicc, which is a thin wrapper around cc or one of its variations. mpicc makes sure the correct parallel computing libraries are linked into your executable. Other than that, from a user's point of view, the invocation of mpicc is virtually no different from cc. So we do:
UNIX> cat phello.c
#include <stdio.h>
#include <mpi.h>

main(int argc, char **argv)
{
   int node;

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &node);

   printf("Hello World from Node %d\n",node);

   MPI_Finalize();
}
UNIX> mpicc -c phello.c
UNIX> mpicc -o phello phello.o
UNIX> ll
total 48
-rwxr-x--- 1 huangj huangj 12732 Aug 12 13:58 phello*
-rw-r----- 1 huangj huangj   238 Aug 12 13:57 phello.c
-rw-r----- 1 huangj huangj   828 Aug 12 13:58 phello.o
To use a small to medium computing cluster, there is often a machine file, also called a hostfile, to set up that describes the configuration of the cluster. Depending on which implementation you use (MPICH, Open MPI, etc.), the exact format of the hostfile can be slightly different. The following is just an example:
<hostname> slots=<number of processors on host> max_slots=<maximum amount of mpi processes to run on host>
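For instance, filling in that template for a hypothetical two-node cluster (the hostnames are made up) could look like:

node01.cluster slots=4 max_slots=8
node02.cluster slots=4 max_slots=8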
The hostfile is used by the launcher mpirun in the following manner:
UNIX> mpirun --hostfile my-hostfile -np 3 phello
mpirun launches the executable (here phello) on 3 processes (as requested by "-np 3"). Which processors are used? Well, the first 3 from the list in my-hostfile. The same command works for any MPI executable, such as a parallel version of the marching cube extraction code, which we will build below.
What if no hostfile is specified? In that case, the system typically defaults to the standard hostfile stored with the MPI installation. Here is a good man page that describes the hostfile: http://linux.die.net/man/1/mpirun. On some systems, aprun is used instead of mpirun; the functionality is very similar.
Our example then looks like the following:
UNIX> mpirun -np 3 phello
Hello World from Node 0
Hello World from Node 2
Hello World from Node 1
Please note that on large-scale supercomputers (for instance, with 64000 processors), users don't get to control the hostfile. A job queuing system, accessed through commands such as qsub or its variations like cqsub, dictates which job runs and where it will run.
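As a hedged illustration only, a minimal batch script for a PBS/Torque-style queue might look like the following; the job name, resource numbers, and script are assumptions, and every site has its own conventions:

#!/bin/sh
#PBS -N phello
#PBS -l nodes=2:ppn=4
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
mpirun -np 8 ./phello

You would submit it with something like "qsub phello.pbs", and the queuing system decides when and on which nodes it actually runs.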
Finally, getting a cluster of your own may take some effort :-). But with today's personal computers coming with 2, 4, or more cores, there is nothing stopping you from installing mpich on your laptop. Look at: http://www.mcs.anl.gov/mpi/mpich/.
Now let us look at a first complete example, pcut.c, in which each process reads only the slab of the dataset it is responsible for and writes that slab back out to disk:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "vcbutils.h"
#include <math.h>
#include <mpi.h>

int main(int argc, char ** argv)
{
  vcbdatatype datatype;
  int ndims, orig[3], dsize[3], nattribs;
  int cutsize[3], lbounds[3], ubounds[3];
  int ncuts, k, interval, headlen, i;
  void * outdata;
  FILE * fp;

  /* See if user passed an argument. */
  if (argc != 3) {
    fprintf(stderr, "Usage: %s datafile outrootname\n", argv[0]);
    exit(1);
  }

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &k);
  MPI_Comm_size(MPI_COMM_WORLD, &ncuts);

  vcbReadHeaderBinm(argv[1], &datatype, &ndims, orig, dsize, &nattribs);

  /* set up partition boundaries */
  interval = (int)ceil(((double)dsize[0])/ncuts);
  lbounds[0] = interval * k;
  ubounds[0] = interval * (k+1);
  if (ubounds[0] >= dsize[0]) ubounds[0] = dsize[0]-1;
  cutsize[0] = ubounds[0] - lbounds[0] + 1;

  /* set up other bounds */
  lbounds[1] = 0; ubounds[1] = dsize[1] - 1;
  lbounds[2] = 0; ubounds[2] = dsize[2] - 1;
  cutsize[1] = dsize[1];
  cutsize[2] = dsize[2];

  outdata = malloc(cutsize[0]*cutsize[1]*cutsize[2]*nattribs*vcbSizeOf(datatype));

  fp = fopen(argv[1],"rb");
  headlen = (3+2*ndims)*sizeof(int)+sizeof(datatype); /*vcbSizeOf(datatype);*/

  vcbFileGrablk(fp, headlen, outdata, nattribs, vcbSizeOf(datatype),
                ndims, dsize, lbounds, ubounds, NULL);

  for (i = 0; i < ndims; orig[i] += lbounds[i], i++);

  vcbGenBinm(argv[2], datatype, ndims, orig, cutsize, nattribs, outdata);

  free(outdata);
  fclose(fp);

  MPI_Finalize();
  return 0;
}
This looks a bit too long at first glance. So, a second try, as an outline:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "vcbutils.h"
#include <math.h>
#include <mpi.h>

int main(int argc, char ** argv)
{
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &k);
  MPI_Comm_size(MPI_COMM_WORLD, &ncuts);

  /* get header information about the dataset */
  /* set up partition boundaries */

  outdata = malloc(interval*nattribs*vcbSizeOf(datatype));

  /* only read the part I am in charge of (k out of ncuts) */
  /* write my part out to disk */
  /* clean up */
}

How about we make this more useful by doing marching cubes on it? It looks like this:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "vcbutils.h"
#include <math.h>
#include <mpi.h>

int main(int argc, char ** argv)
{
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &k);
  MPI_Comm_size(MPI_COMM_WORLD, &ncuts);

  /* get header information about the dataset */
  /* set up partition boundaries */

  outdata = malloc(interval*nattribs*vcbSizeOf(datatype));

  /* only read the part I am in charge of (k out of ncuts) */
  /* extract isosurface using marching cube */
  /* write the surface geometry out to disk */
  /* clean up */
}
The actual code looks like the following. Not complicated at all.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include "vcbutils.h"
#include "vcbmcube.h"
#include <mpi.h>

int main(int argc, char ** argv)
{
  vcbdatatype datatype;
  int ndims, orig[3], dsize[3], nattribs;
  int cutsize[3], lbounds[3], ubounds[3];
  int ncuts, k, interval, headlen, i;
  void * outdata;
  FILE * fp;

  /* application specific data starts here */
  int nverts, nfacets, * fdata;
  float isoval, * vdata, * fnormals = NULL; /* fnormals is never assigned;
                                               keep it NULL so free() below is safe */
  /* end of application specific data */

  /* See if user passed an argument. */
  if (argc != 2) {
    fprintf(stderr, "Usage: %s datafile\n", argv[0]);
    exit(1);
  }

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &k);
  MPI_Comm_size(MPI_COMM_WORLD, &ncuts);

  vcbReadHeaderBinm(argv[1], &datatype, &ndims, orig, dsize, &nattribs);

  /* set up partition boundaries */
  interval = (int)ceil(((double)dsize[0])/ncuts);
  lbounds[0] = interval * k;
  ubounds[0] = interval * (k+1);
  if (ubounds[0] >= dsize[0]) ubounds[0] = dsize[0]-1;
  cutsize[0] = ubounds[0] - lbounds[0] + 1;

  /* set up other bounds */
  lbounds[1] = 0; ubounds[1] = dsize[1] - 1;
  lbounds[2] = 0; ubounds[2] = dsize[2] - 1;
  cutsize[1] = dsize[1];
  cutsize[2] = dsize[2];

  outdata = malloc(cutsize[0]*cutsize[1]*cutsize[2]*nattribs*vcbSizeOf(datatype));

  fp = fopen(argv[1],"rb");
  headlen = (3+2*ndims)*sizeof(int)+sizeof(datatype);

  vcbFileGrablk(fp, headlen, outdata, nattribs, vcbSizeOf(datatype),
                ndims, dsize, lbounds, ubounds, NULL);

  for (i = 0; i < ndims; orig[i] += lbounds[i], i++);

  isoval = 45.f;
  if ((nfacets = vcbMarchingCubeBlk(datatype, outdata, isoval, cutsize,
         VCB_WITHNORMAL, &nverts, &vdata, &fdata)) <= 0) {
    fprintf(stderr, "p%d: vcbMarchingCubeBlk: extract zero triangles at isovalue %f\n",
            k, isoval);
  }

  /* shift vertex coordinates from the local block into the global volume */
  for (i = 0; i < nverts; i ++) {
    vdata[i*6+0] += orig[0];
    vdata[i*6+1] += orig[1];
    vdata[i*6+2] += orig[2];
  }

  /* output the surface mesh (vertex array and triangle array) in endian safe formats */
  vcbGenBinm("vertexarray.bin", VCB_FLOAT, 1, NULL, &nverts,  6, vdata);
  vcbGenBinm("triangarray.bin", VCB_INT,   1, NULL, &nfacets, 3, fdata);

  free(vdata);
  free(fdata);
  free(fnormals);
  free(outdata);
  fclose(fp);

  MPI_Finalize();
  return 0;
}
The library used is the Visualization CookBook library vcblib. The code and makefile are available as the following: makefile, pcut.c, pmc.c.
Now that you have seen how things work from a high-level point of view, let me list a few of the top things that must be considered in a real-world application:
- data partition
- job partition
- results composition
- task scheduler
- minimizing runtime communication
Please note that in the code we listed, there are no calls like MPI_Barrier() or MPI_Send()/MPI_Recv(). Hence the method used can be termed embarrassingly parallel (believe me, this is a good word :-)). What if this is not the case for your application? Well, time to get creative and work hard. Do remember one thing: do not ever guess where the bottlenecks are. Measure and analyze!
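For contrast, here is a small, hypothetical sketch of what explicit communication could look like if, say, we wanted rank 0 to report the total number of triangles extracted across all processes. This is not part of the code above; the local count is faked here, whereas in pmc.c it would come from vcbMarchingCubeBlk:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nfacets_local, nfacets_total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* stand-in for the per-process triangle count */
    nfacets_local = 1000 + rank;

    /* collective communication: sum the per-process counts onto rank 0 */
    MPI_Reduce(&nfacets_local, &nfacets_total, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total triangles across all processes: %d\n", nfacets_total);

    MPI_Finalize();
    return 0;
}

The moment calls like this appear, so do questions of load balance and communication cost, which is exactly when measuring and analyzing matters most.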