5.4 Scalability

In this section, elapsed wall-clock times for CSI-MSVD executing on a network of SPARCstation 5 machines (see Section 4.2) are presented. For the test matrices from Table 1 with over a million nonzero elements, the time to compute the matrix-vector product when the entire matrix is stored on one processor is considerably larger than the corresponding time when the matrix is partitioned across multiple processors. This is largely due to a reduction in page faults per process for the partitioned matrices, and a corresponding super-linear speedup can be observed in such cases.
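To make the column partitioning concrete, the following C sketch shows the partial product that each MATVEC process would compute for its own block of columns, assuming a compressed sparse column (CSC) storage scheme, with the per-process partial vectors summed afterwards (e.g., via PVM messages). The routine and storage format here are illustrative assumptions only, not the actual PVM implementation described in this work.

    #include <stdio.h>

    /* Partial sparse matrix-vector product y = A(:, jlo:jhi) * x(jlo:jhi)
     * for a matrix stored in compressed sparse column (CSC) format.
     * Each MATVEC process would call this on its own column block; the
     * partial y vectors would then be summed by the host.  Hypothetical
     * illustration, not the authors' implementation. */
    void partial_matvec(int nrows, int jlo, int jhi,
                        const int *colptr, const int *rowind,
                        const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++)
            y[i] = 0.0;
        for (int j = jlo; j < jhi; j++)           /* columns owned by this process */
            for (int k = colptr[j]; k < colptr[j + 1]; k++)
                y[rowind[k]] += val[k] * x[j];    /* scatter column j scaled by x[j] */
    }

    int main(void)
    {
        /* 3x3 example: A = [1 0 2; 0 3 0; 4 0 5], x = [1 1 1] */
        int    colptr[] = {0, 2, 3, 5};
        int    rowind[] = {0, 2, 1, 0, 2};
        double val[]    = {1, 4, 3, 2, 5};
        double x[]      = {1, 1, 1};
        double y[3];

        partial_matvec(3, 0, 3, colptr, rowind, val, x, y); /* one "process" owns all columns */
        printf("%g %g %g\n", y[0], y[1], y[2]);             /* expected: 3 3 9 */
        return 0;
    }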

Figures 7 and 8 illustrate the elapsed wall-clock times for matrices having similar numbers of nonzeros. For both figures, the optimal number of processors p for MATVEC appears to be about 5 when each SUN SPARCstation 5 has 32 MB of memory and communicates at Ethernet (10 Mbps) speed. Although this distributed implementation of CSI-MSVD cannot exploit the parallelism afforded by a larger number of processors, it does obtain peak performance on a modest number of workstations.
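The existence of an optimal p can be understood from a simple timing model: the per-iteration compute time falls roughly as 1/p while the communication cost grows with the number of MATVEC partners, so the total time T(p) has a minimum at a modest p. The C sketch below evaluates such a toy model; the constants W and C are invented purely for illustration (chosen so the minimum falls near p = 5) and are not measurements from these experiments.

    #include <stdio.h>

    /* Toy timing model for one distributed MATVEC iteration:
     *   T(p) = W / p  +  (p - 1) * C
     * where W is the single-processor compute time and C is a per-partner
     * communication cost on the Ethernet.  The constants are hypothetical,
     * not measured values from the paper. */
    int main(void)
    {
        const double W = 10.0;   /* assumed compute time on one processor (s) */
        const double C = 0.4;    /* assumed per-partner communication cost (s) */
        int    best_p = 1;
        double best_t = W;

        for (int p = 1; p <= 16; p++) {
            double t = W / p + (p - 1) * C;
            printf("p = %2d   T(p) = %6.3f s\n", p, t);
            if (t < best_t) { best_t = t; best_p = p; }
        }
        printf("model minimum near p = %d\n", best_p);
        return 0;
    }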

 


Figure 7: CSI-MSVD wall-clock times (seconds) for matrices having similar numbers of nonzeros on an Ethernet-linked network of SUN SPARCstation 5's (each with 32 MB of memory).

   

 


Figure 8: CSI-MSVD wall-clock times (seconds) for matrices having similar numbers of nonzeros on an Ethernet-linked network of SUN SPARCstation 5's (each with 32 MB of memory).

Figure 9 illustrates the speed improvements that can be obtained when Fast Ethernet (100 Mbps) is used to link the network of SUN SPARCstation 5's and each workstation is upgraded to 96 MB of internal memory. For computing the 10-largest singular triplets of the matrix KNOXNS, the optimal number of processors p for MATVEC is closer to 7, and speeds almost twice those of the original (Ethernet) network are obtained as the number of processors increases. Such improvements for large unstructured matrices reflect the improved network latencies rather than any gains from page-fault reduction associated with the increase in internal memory. However, as discussed in [22], for large test matrices (see Table 1) such as KNOXNS, the average number of page faults per process is affected by factors such as the number of columns of the matrix assigned to a MATVEC process, the distribution of nonzeros in these columns, and hardware features such as cache size.

 


Figure 9: Performance of CSI-MSVD for computing the 10-largest singular triplets of the matrix KNOXNS on both an Ethernet and a Fast Ethernet network of workstations.

  

 


Table 4: Tabulated speedups attained for CSI-MSVD when computing the 10-largest singular triplets of the selected matrices on a PVM network of SUN SPARCstation 5's (Ethernet, 32 MB) with p processors in the PVM group MATVEC.

Table 4 summarizes the speedups (see Equation (38)) obtained by CSI-MSVD when the 10-largest singular triplets of the following matrices were approximated: AMOCO, KNOXNS, ENCY, and TREC3. The AMOCO matrix with 1,159,116 nonzeros gives rise to the smallest computational load, while the largest computational load is produced by the TREC3 matrix, which has 2,343,775 nonzeros. Although ENCY has considerably more nonzeros than TREC3, it is the sparser of the two matrices (see Table 1). The low sparsity ratio for ENCY produces a detrimental memory-access pattern, which is aggravated by its larger number of nonzeros, so that when ENCY is stored on one processor, the elapsed wall-clock time T(1) is larger than the corresponding value for TREC3. This in turn explains why the speedup T(1)/T(p) for ENCY is larger than the speedup for TREC3 when p≤5.

However, since the number of rows in ENCY (25,629) is approximately twice that for TREC3 (10,836), the communication overhead associated with MATVEC processes for ENCY is correspondingly higher. Thus, as p increases, the larger communication overhead degrades the speedup, and CSI-MSVD achieves the higher speedups for TREC3 when p>5. Since speedup is sensitive to the choice of T(1), it cannot be used as an absolute estimate of the degree of parallelization achieved. However, Table 4 does provide a heuristic for estimating an upper bound on the number of processors that should define the virtual machine for such large, sparse, and unstructured matrices.
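As a small illustration of this sensitivity, the C sketch below computes T(1)/T(p) for two hypothetical sets of wall-clock times whose parallel times are nearly identical but whose single-processor times differ (for example, because of page-fault behavior); the larger T(1) alone inflates the reported speedup. The numbers are invented for illustration and are not the entries of Table 4.

    #include <stdio.h>

    /* Speedup S(p) = T(1)/T(p), as in Equation (38), for two hypothetical
     * matrices "a" and "b" with similar T(p) for p >= 2 but different T(1).
     * All timings are made up for illustration; they are not Table 4 values. */
    int main(void)
    {
        /* index 1..7 holds T(p) in seconds; index 0 is unused */
        double Ta[8] = { 0, 300.0, 160.0, 115.0, 95.0, 85.0, 88.0, 92.0 };
        double Tb[8] = { 0, 220.0, 158.0, 114.0, 94.0, 84.0, 87.0, 91.0 };

        printf(" p   S_a(p)   S_b(p)\n");
        for (int p = 1; p <= 7; p++)
            printf("%2d   %6.2f   %6.2f\n", p, Ta[1] / Ta[p], Tb[1] / Tb[p]);
        /* matrix "a" shows larger speedups solely because its T(1) is larger */
        return 0;
    }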


