Figures 7 and 8 illustrate the elapsed wall-clock times for matrices having similar numbers of nonzeros. In both figures, the optimal number of processors p for MATVEC appears to be about 5 when each SUN SPARCstation 5 has 32MB memory and is communicating at Ethernet (10 Mbps) speed. Although this distributed implementation of CSI-MSVD cannot exploit the parallelism afforded by a larger number of processors, it does obtain peak performance on a modest number of workstations.
Figure 9 illustrates the speed improvements that can be obtained when FastEthernet (100 Mbps) is used to link the network of SUN SPARCstation 5's and each workstation is upgraded to 96 MB internal memory. For computing the 10-largest singular triplets of the matrix KNOXNS, the optimal number of processors p for MATVEC is closer to 7, and speeds almost twice those of the original (Ethernet) network are obtained as the number of processors increases. Such improvements for large unstructured matrices reflect the improved network latencies rather than any gains from page-fault reduction associated with the increase in internal memory. However, as discussed in , for the large test matrices (see Table 1) such as KNOXNS, the average number of page faults per process is affected by factors such as the number of columns of the matrix assigned to a MATVEC process, the distribution of nonzeros in these columns, and hardware features such as cache size.
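The column-wise assignment of the matrix to MATVEC processes can be sketched as follows. This is a minimal serial simulation of the distributed scheme, not the CSI-MSVD implementation itself; all function names and the column-major nonzero layout are illustrative assumptions.

```python
# Sketch of a column-partitioned sparse matrix-vector product (MATVEC).
# Each "process" owns a contiguous block of columns and forms a partial
# product; the partial vectors are then sum-reduced into y = A*x.
# Names and data layout are hypothetical, for illustration only.

def split_columns(ncols, p):
    """Assign a contiguous block of columns to each of p processes."""
    block = (ncols + p - 1) // p
    return [range(k * block, min((k + 1) * block, ncols)) for k in range(p)]

def matvec_partial(col_nonzeros, x, cols, nrows):
    """Partial product over the assigned columns j: y += A[:, j] * x[j].
    col_nonzeros maps a column index j to its list of (row, value) pairs."""
    y = [0.0] * nrows
    for j in cols:
        for i, a_ij in col_nonzeros.get(j, []):
            y[i] += a_ij * x[j]
    return y

def distributed_matvec(col_nonzeros, x, nrows, p):
    """Simulate p MATVEC processes, then sum-reduce the partial vectors."""
    partials = [matvec_partial(col_nonzeros, x, cols, nrows)
                for cols in split_columns(len(x), p)]
    return [sum(part[i] for part in partials) for i in range(nrows)]
```

Note that a contiguous block split balances columns but not nonzeros; as observed above, the distribution of nonzeros within the assigned columns also influences per-process behavior (e.g., page faults), which a scheme like this does not account for.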
Table 4 summarizes the speedups (see Equation (38)) obtained by CSI-MSVD when the 10-largest singular triplets of the following matrices were approximated: AMOCO, KNOXNS, ENCY, and TREC3. The AMOCO matrix with 1,159,116 nonzeros gives rise to the smallest computational load, while the largest computational load is produced by the TREC3 matrix, which has 2,343,775 nonzeros. Although ENCY is considerably larger in dimension than TREC3, it is the sparser of the two matrices (see Table 1). The low sparsity ratio for ENCY produces a detrimental memory-access pattern, which is aggravated by its large number of nonzeros, so that when ENCY is stored on one processor, the elapsed wall-clock time T(1) is larger than the corresponding value for TREC3. This in turn explains why the speedup T(1)/T(p) for ENCY is larger than the speedup for TREC3 for small values of p.
However, since the number of rows in ENCY (25,629) is approximately twice that of TREC3 (10,836), the communication overhead associated with MATVEC processes for ENCY is correspondingly higher. Thus, as p increases, the larger communication overhead degrades speedup, and CSI-MSVD achieves the higher speedups for TREC3 when p>5. Since speedup is sensitive to the choice of T(1), it cannot be used as an absolute estimate of the degree of parallelization achieved. However, Table 4 does provide a heuristic for estimating an upper bound on the number of processors that should define the virtual machine for such large, sparse, and unstructured matrices.
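The heuristic suggested by Table 4 can be made concrete as in the following sketch. The timing values, the efficiency threshold, and the function names here are all hypothetical illustrations, not the measured results from Table 4.

```python
# Speedup S(p) = T(1)/T(p), as in Equation (38), and a simple heuristic
# for an upper bound on the number of processors: the largest p whose
# parallel efficiency S(p)/p stays above a chosen threshold.
# All timings below are hypothetical, for illustration only.

def speedup(t1, tp):
    """Speedup S(p) = T(1) / T(p)."""
    return t1 / tp

def max_worthwhile_p(times, min_efficiency=0.5):
    """Largest p with efficiency S(p)/p >= min_efficiency.
    times maps processor count p to elapsed wall-clock time T(p)."""
    t1 = times[1]
    return max(p for p, tp in times.items()
               if speedup(t1, tp) / p >= min_efficiency)

# Hypothetical wall-clock times T(p) in seconds, indexed by p.
times = {1: 120.0, 2: 70.0, 5: 45.0, 7: 40.0}
```

With these made-up timings, `max_worthwhile_p(times)` would return 5, since at p=7 the efficiency (120/40)/7 ≈ 0.43 falls below the 0.5 threshold. As noted above, the result depends on T(1), so it is a sizing heuristic rather than an absolute measure of parallelization.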