The implementation of CSI-MSVD for a network of workstations (NOW) using PVM has also been successfully ported to the Cray T3D massively parallel computing system. Due to differences in PVM implementations on the Cray T3D (see [22]), some syntactic modifications were required. Only the minimal changes required to successfully port the NOW version were attempted so that a portable and modular implementation of the CSI-MSVD algorithm could be maintained across multiple computing platforms. This section presents the observed elapsed wall-clock times for the CSI-MSVD algorithm on a 256-processor Cray T3D at the Advanced Computing Laboratory, Los Alamos National Laboratory. The Cray Standard C compiler (Version 4.0.3.2) and MPPLDR loader (Version 10.x) were used for this implementation of CSI-MSVD.
Compiler options enabling aggressive vectorization, suppression of redundant symbol tables, and the use of branches instead of jumps to external functions were selected.
The specific modifications made to port the NOW implementation of the CSI-MSVD algorithm to the Cray T3D are discussed in greater detail in [22]. Because the Cray T3D does not support pvm_spawn() and therefore requires all nodes to run the same executable program, heterogeneity was achieved by using myid in a driver program to determine whether a given node would participate in the computations involving MATVEC, PHI, or GAMMA. The driver program is available in [22].
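The idea of selecting a node's role from myid within a single executable can be sketched as follows. This is only an illustration and not the driver from [22]: the function role_for, the enum names, the group-size parameters, and the contiguous assignment of ids to groups are all assumptions made for the example, and how myid is obtained on the Cray T3D is not shown.

```c
#include <assert.h>

/* Hypothetical role labels, named after the MATVEC, PHI, and GAMMA
   computations mentioned in the text. ROLE_IDLE is an assumption. */
typedef enum { ROLE_MATVEC, ROLE_PHI, ROLE_GAMMA, ROLE_IDLE } role_t;

/* Assumed layout (not taken from [22]): ids 0 .. n_matvec-1 form the
   MATVEC group, the next n_phi ids handle PHI, the next n_gamma ids
   handle GAMMA, and any remaining nodes stay idle. */
role_t role_for(int myid, int n_matvec, int n_phi, int n_gamma)
{
    assert(myid >= 0 && n_matvec >= 0 && n_phi >= 0 && n_gamma >= 0);

    if (myid < n_matvec)
        return ROLE_MATVEC;
    if (myid < n_matvec + n_phi)
        return ROLE_PHI;
    if (myid < n_matvec + n_phi + n_gamma)
        return ROLE_GAMMA;
    return ROLE_IDLE;
}

/* Printable name for a role, e.g. for logging from the driver. */
const char *role_name(role_t r)
{
    switch (r) {
    case ROLE_MATVEC: return "MATVEC";
    case ROLE_PHI:    return "PHI";
    case ROLE_GAMMA:  return "GAMMA";
    default:          return "IDLE";
    }
}
```

Every node would run the same program, call role_for with its own myid, and branch on the result, so that no call to pvm_spawn() is needed to start different tasks on different nodes.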
The elapsed wall-clock times for approximating the 10 largest singular
values and corresponding singular vectors of selected matrices on
the Cray T3D are shown in
Figure 10 along with the corresponding times for
execution on a network of workstations.
connectivity can be clearly seen. For
matrix TECH, the improvement in
execution time through an increase in parallelism is no longer damped
by communication overhead. In fact, when CSI-MSVD is applied to
both TECH and ENCY, the rate
of increase in communication latency is much lower with the Cray T3D,
so that when the matrices are partitioned across 125 processors,
the execution time is considerably less than the execution time on a network
of workstations with the matrix stored on only one processor. Also,
the minimum execution time is still observed when the size of the
MATVEC group ranges between 5 and 12 for most matrices considered. This
lends some support to the heuristic established
by Figure 4. A comparison of this minimum execution time
between the networked and MPP versions of PVM is shown in Table
5. It can be seen that the MPP implementation is about
2 to 10 times faster than the networked implementation. The largest
differences in execution time are observed for
matrices such as CRAN and CISI which have the smallest number of
non-zeros, indicating that these matrices are most critically
affected by communication overhead in NOW environments.
accuracy using CSI-MSVD. Cray
T3D and networked versions of PVM were used. The times reported
here were obtained with the PVM configurations that result in the
minimum execution time for the respective platforms.
On the Cray T3D, program performance can be profiled using the Cray MPP Apprentice tool, a window-based performance analysis tool available on Cray computing systems. The Apprentice tool can be configured to report the time spent in each subroutine on tasks such as parallel computations, I/O, and communications. When CSI-MSVD is used to approximate the 10 largest singular triplets of the matrix KNOXNS with 5 processors defining the MATVEC PVM group, the output of Apprentice (see [22]) indicates that 47% of the total time taken by the CSI-MSVD program is accounted for by pvm_recv alone.
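A percentage of this kind is the ratio of the time accumulated inside one routine to the total elapsed time. The following minimal sketch shows that bookkeeping for manually timed intervals. It does not describe how Apprentice gathers its data, and the type profile_t and the functions profile_add and recv_share are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator for manually timed intervals. */
typedef struct {
    double recv_seconds;   /* time accumulated inside receive calls */
    double total_seconds;  /* total time accumulated over all intervals */
} profile_t;

/* Record one timed interval; is_recv is nonzero when the interval
   was spent waiting in a receive call. */
void profile_add(profile_t *p, double seconds, int is_recv)
{
    assert(p != NULL && seconds >= 0.0);
    p->total_seconds += seconds;
    if (is_recv)
        p->recv_seconds += seconds;
}

/* Fraction of the total time spent in receive calls, in [0, 1].
   Returns 0 when nothing has been recorded yet. */
double recv_share(const profile_t *p)
{
    assert(p != NULL);
    if (p->total_seconds <= 0.0)
        return 0.0;
    return p->recv_seconds / p->total_seconds;
}
```

A large value of recv_share suggests that nodes spend much of their time waiting for messages rather than computing.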