
5.5 Results of Cray T3D Implementation

The implementation of CSI-MSVD for a network of workstations (NOW) using PVM has also been successfully ported to the Cray T3D massively parallel computing system. Due to differences in the PVM implementation on the Cray T3D (see [22]), some syntactic modifications were required. Only the minimal changes needed to port the NOW version were made, so that a portable and modular implementation of the CSI-MSVD algorithm could be maintained across multiple computing platforms. This section presents the observed elapsed wall-clock times for the CSI-MSVD algorithm on a 256-processor Cray T3D at the Advanced Computing Laboratory, Los Alamos National Laboratory.

The C compiler used for these experiments was Version 4.0.3.2 of the Cray Standard C compiler, and the loader was Version 10.x of MPPLDR. Compiler options enabling aggressive vectorization, suppressing redundant symbol tables, and using branches instead of jumps to external functions were selected.

The specific modifications made to port the NOW implementation of the CSI-MSVD algorithm to the Cray T3D are discussed in greater detail in [22]. Since the Cray T3D does not support pvm_spawn() and therefore requires all nodes to run the same executable, heterogeneity was achieved by using the node identifier myid in a driver program to determine whether a given node participates in the computations involving MATVEC, PHI, or GAMMA; a sketch of this dispatch logic is given below. The complete driver program is listed in [22].
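The following is a minimal, hypothetical sketch of such an SPMD dispatch, not the actual driver of [22]. The group name, the group-size constants, and the task routines (matvec_task, phi_task, gamma_task) are illustrative assumptions; only the PVM calls (pvm_mytid, pvm_joingroup, pvm_barrier, pvm_exit) are standard PVM 3 routines.

#include "pvm3.h"

#define NMATVEC 5            /* assumed size of the MATVEC group   */
#define NPHI    1            /* assumed number of PHI nodes        */
#define NGAMMA  1            /* assumed number of GAMMA nodes      */
#define NTOTAL  (NMATVEC + NPHI + NGAMMA)

extern void matvec_task(int inst);   /* hypothetical task routines */
extern void phi_task(int inst);
extern void gamma_task(int inst);

int main(void)
{
    int myid;

    pvm_mytid();                            /* enroll this PE in PVM     */
    myid = pvm_joingroup("csi-msvd");       /* instance number in group  */

    pvm_barrier("csi-msvd", NTOTAL);        /* wait for all PEs to join  */

    if (myid < NMATVEC)                     /* dispatch on instance no.  */
        matvec_task(myid);                  /* sparse matrix-vector ops  */
    else if (myid < NMATVEC + NPHI)
        phi_task(myid - NMATVEC);           /* PHI recurrence            */
    else
        gamma_task(myid - NMATVEC - NPHI);  /* GAMMA recurrence          */

    pvm_exit();                             /* leave PVM before exiting  */
    return 0;
}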

Figure 10: Wall-clock times for execution using the Cray T3D's MPP version of PVM, compared with times using PVM on a network of workstations. The 10-largest singular values and corresponding vectors were computed to the prescribed accuracy.

The elapsed wall-clock times for approximating the 10-largest singular values and corresponding singular vectors of selected matrices on the Cray T3D are shown in Figure 10, along with the corresponding times for execution on a network of workstations. The effect of the Cray T3D's superior interprocessor connectivity is clearly visible. For the matrix TECH, the improvement in execution time from increased parallelism is no longer damped by communication overhead. In fact, when CSI-MSVD is applied to both TECH and ENCY, communication latency grows much more slowly on the Cray T3D, so that when the matrices are partitioned across 125 processors, the execution time is considerably less than on a network of workstations with the matrix stored on a single processor. Also, the minimum execution time is still observed when the size of the MATVEC group ranges between 5 and 12 for most matrices considered, which supports the heuristic established by Figure 4. A comparison of these minimum execution times for the networked and MPP versions of PVM is given in Table 5: the MPP implementation is about 2 to 10 times faster than the networked implementation. The largest differences in execution time occur for matrices such as CRAN and CISI, which have the fewest non-zeros, indicating that these matrices are the most severely affected by communication overhead in NOW environments.

Table 5: Comparison of elapsed wall-clock times to approximate the 10-largest singular triplets to the prescribed accuracy using CSI-MSVD, for the Cray T3D and networked versions of PVM. The times reported here were obtained with the PVM configurations that yield the minimum execution time on the respective platforms.

On the Cray T3D, program performance can be profiled using the Cray MPP Apprentice tool, a window-based performance-analysis tool available on Cray computing systems. Apprentice can be configured to report the time each subroutine spends on tasks such as parallel computation, I/O, and communication. When CSI-MSVD is used to approximate the 10-largest singular triplets of the matrix KNOXNS with 5 processors defining the MATVEC PVM group, the Apprentice output (see [22]) indicates that 47% of the total time taken by the CSI-MSVD program is accounted for by pvm_recv alone.
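Where a profiler such as Apprentice is not available, a rough estimate of the time spent in blocking receives can be obtained by bracketing each pvm_recv call with wall-clock timers. The sketch below is an illustrative assumption rather than part of the implementation in [22]; timed_recv, wall_seconds, and recv_seconds are hypothetical names, while pvm_recv(tid, tag) is the standard PVM 3 blocking receive.

#include <sys/time.h>
#include "pvm3.h"

static double recv_seconds = 0.0;   /* accumulated time in pvm_recv */

/* Return elapsed wall-clock time in seconds. */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* Timed wrapper around a blocking receive; tid/tag as in pvm_recv. */
static int timed_recv(int tid, int tag)
{
    double t0;
    int    bufid;

    t0 = wall_seconds();
    bufid = pvm_recv(tid, tag);      /* blocks until a message arrives */
    recv_seconds += wall_seconds() - t0;
    return bufid;                    /* report recv_seconds at exit    */
}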


