Figures 7 and 8 illustrate the
elapsed wall-clock times for matrices having similar numbers
of nonzeros. For both figures, the
optimal number of processors **p** for MATVEC would appear to be
about 5 when each SUN SPARCstation 5 has 32 MB of memory and is
communicating at Ethernet (10 Mbps) speed. Although this
distributed implementation of CSI-MSVD cannot exploit the parallelism
afforded by a larger number
of processors, it does obtain peak performance on a modest number
of workstations.
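
One way to see why a modest optimum arises is through a simple cost model. This is only an illustrative sketch, not a model fitted to these experiments, and the quantities W, α, and β below are assumptions introduced here:

$$
T(p) \;\approx\; \frac{W}{p} \;+\; p\,(\alpha + \beta m),
$$

where W is the local MATVEC work (growing with the number of nonzeros), α is the per-message startup cost of the network, β is the per-word transfer cost, and m is the length of the partial result each MATVEC process must return per iteration. The first term shrinks as **p** grows while the second grows with **p**, so the elapsed time bottoms out at a modest number of processors and deteriorates beyond it; a faster network (smaller α and β) pushes this optimum higher.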

Figure 9 illustrates the speed improvements
that can be obtained when FastEthernet (100 Mbps) is used
to link the network of SUN SPARCstation 5's and each workstation
is upgraded to 96 MB of internal memory. For computing
the 10-largest singular triplets of the
matrix KNOXNS, the optimal number of
processors **p** for MATVEC is closer to 7, and speeds almost
twice those of the original (Ethernet) network are obtained as
the number of processors increases. Such improvements for
large unstructured matrices are more
reflective of the improved network latency than of any
gains from page-fault reduction associated with the increase
in internal memory. However, as discussed in [22],
for the large test matrices (see Table 1) such as
KNOXNS, the average number of
page faults per process is affected by
factors such as the number of columns of the matrix assigned to a MATVEC
process, the
distribution of nonzeros in these columns, and hardware features such
as cache size.
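
To make these factors concrete, the following minimal sketch shows the local work of a single MATVEC process, assuming compressed sparse column (CSC) storage and a contiguous block of columns assigned to the process; the function, its arguments, and the block partition are illustrative assumptions, not the actual CSI-MSVD implementation:

```c
#include <stddef.h>

/* Local piece of y = A*x for one MATVEC process (illustrative sketch only).
 * The process owns columns [jlo, jhi) of an m-row matrix stored in CSC
 * format (colptr, rowind, val).  Its floating-point work is proportional to
 * colptr[jhi] - colptr[jlo], i.e., to the nonzeros in the assigned columns,
 * while its memory-access pattern follows the scattered row indices. */
void local_matvec(size_t m, size_t jlo, size_t jhi,
                  const size_t *colptr, const size_t *rowind,
                  const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;                          /* partial result of length m */

    for (size_t j = jlo; j < jhi; j++) {     /* columns owned by this process */
        const double xj = x[j];
        for (size_t k = colptr[j]; k < colptr[j + 1]; k++)
            y[rowind[k]] += val[k] * xj;     /* scattered writes over m rows */
    }
    /* The m-length partial vectors from all processes must then be summed,
     * so per-iteration communication grows with the row dimension m. */
}
```

Under such a partition, both the number of columns a process receives and how the nonzeros are spread over those columns determine the process's working set, which is why page-fault behavior varies from matrix to matrix.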

Table 4 summarizes the speedups
(see Equation (38)) obtained by CSI-MSVD when
the 10-largest singular triplets of the following matrices were
approximated: AMOCO, KNOXNS, ENCY, and TREC3.
The AMOCO matrix with **1,159,116** nonzeros gives rise to
the smallest computational load while the largest computational load is
produced by the TREC3 matrix which has **2,343,775** nonzeros.
Although ENCY has considerably more nonzeros than TREC3,
it is the sparser of the two matrices (see Table 1).
The low sparsity ratio for ENCY produces
a detrimental memory-access pattern which is
aggravated by the larger number of nonzeros, so that
when ENCY is stored on one processor, the elapsed wall-clock
time T(1) is larger than the corresponding value for TREC3. This in
turn explains why the speedup T(1)/T(**p**) for ENCY is larger than the
speedup for TREC3 when **p≤5**.
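
For reference, the speedup reported in Table 4 (Equation (38)) is the ratio of the single-processor elapsed time to the **p**-processor elapsed time,

$$
S(p) \;=\; \frac{T(1)}{T(p)},
$$

so a matrix with an inflated serial baseline T(1), as is the case for ENCY here, can report a larger speedup even when its multiprocessor times are no better.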

However, since the number of rows in ENCY (25,629) is
approximately twice that for TREC3 (10,836), the communication
overhead associated with MATVEC processes
for ENCY is correspondingly higher. Thus, as **p**
increases, the larger communication overhead has a detrimental
effect on speedup, and CSI-MSVD achieves the higher speedups
for TREC3 when **p>5**. Since
speedup is sensitive to the choice of T(1), it cannot be used
as an absolute estimate of the degree of parallelization achieved.
However, Table 4 does provide
a heuristic for estimating the upper bound on the number of
processors that should define the virtual machine for such large,
sparse and unstructured matrices.
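
Under the illustrative cost model sketched earlier (again, an assumption rather than a fitted model), the elapsed time T(p) ≈ W/p + p(α + βm) is minimized near

$$
p^{*} \;\approx\; \sqrt{\frac{W}{\alpha + \beta m}},
$$

so a larger row dimension m lowers the processor count beyond which communication dominates. This is consistent with ENCY (25,629 rows) losing its speedup advantage to TREC3 (10,836 rows) once **p>5**, and it suggests the kind of upper bound on **p** that Table 4 can be used to estimate.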
