Figures 7 and 8 illustrate the elapsed wall-clock times for matrices having similar numbers of nonzeros. For both figures, the optimal number of processors p for MATVEC would appear to be about 5 when each SUN SPARCstation 5 has 32MB memory and is communicating at Ethernet (10 Mbps) speed. Although this distributed implementation of CSI-MSVD cannot exploit the parallelism afforded by a larger number of processors, it does obtain peak performance on a modest number of workstations.
nonzeros on a Ethernet-linked network of SUN
SPARCstation 5's (each with 32 MB memory).
nonzeros on a Ethernet-linked network of SUN
SPARCstation 5's (each with 32 MB memory).
Figure 9 illustrates the speed improvements
that can be obtained when FastEthernet (100 Mbps) is used
to link the network of SUN SPARCstation 5's and each workstation
is upgraded to have 96 MB internal memory. For the computing
the 10-largest singular triplets of the
matrix KNOXNS, the optimal number of
processors p for MATVEC is closer to 7 and speeds almost
twice that of the original (Ethernet) network are obtained as
the number of processors increase. Such improvements for
large unstructured matrices are more
reflective of the improved network latencies rather than any
gains from page-fault reduction associated with an increase
in internal memory. However, as discussed in [22]
for the large test matrices (see Table 1) such as
KNOXNS, the average number of
page faults per process is affected by
factors such as number of columns of the matrix assigned to a MATVEC
process, the
distribution of non-zeros in these columns, and hardware features such
as cache size.
Table 4 summarizes the speedups
(see Equation (38)) obtained by CSI-MSVD when
the 10-largest singular triplets of the following matrices were
approximated: AMOCO, KNOXNS, ENCY, and TREC3.
The AMOCO matrix with 1,159,116 nonzeros gives rise to
the smallest computational load while the largest computational load is
produced by the TREC3 matrix which has 2,343,775 nonzeros.
Although the ENCY has considerably more nonzeros than TREC3,
it is the sparser of the two matrices (see Table 1).
The low sparsity ratio for ENCY produces
a detrimental memory-access pattern which is
aggravated by the larger value of nonzeros, so that
when ENCY is stored on one processor, the elapsed wall-clock
time T(1) is larger than the corresponding value for TREC3. This in
turn explains why the speedup T(1)/T(p) for ENCY is larger than the
speedup for TREC3 when
.
However, since the number of rows in ENCY (25,629) is approximately twice that for TREC3 (10,836), the communication overhead associated with MATVEC processes for ENCY is correspondingly higher. Thus, as p increases, the larger communication overhead has a deteriorating effect on speedup, and CSI-MSVD achieves the higher speedups for for TREC3 when p>5. Since speedup is sensitive to the choice of T(1), it cannot be used as an absolute estimate of the degree of parallelization achieved. However, Table 4 does provide a heuristic for estimating the upper bound on the number of processors that should define the virtual machine for such large, sparse and unstructured matrices.