Four hypertext matrices used for performance comparisons among the symbolic and spectral-based methods presented in Sections 3.3 through 3.5 are listed in Table 3. The first two matrices, MAN1 and MAN2, were constructed from the See Also entries of the BSD 4.3 Unix manpages. The manpage of who, for example, contains the See Also entries getuid and utmp. Hence, two links are associated with who, namely who getuid and who utmp. Parsing all 625 manpages produced 1853 links, and hence the rows and columns of MAN1 correspond to links and manpages, respectively. The nonzero elements of both MAN1 and MAN2 are all 1's and simply reflect the incidence rather than frequency of linkage. The MAN2 matrix was derived from the MAN1 matrix by removing duplicate links (i.e., who getuid is the same as getuid who) and removing 18 manpages (columns) whose links were not connected to the main graph of the MAN1 matrix. The resulting MAN2 matrix had 1426 unique links corresponding to 607 manpages.
The 850 entries under the letter A of the Concise Columbia Encyclopedia (1989 Second Edition, on-line version) produced the 1778 cross-references or links for the CCE-A matrix listed in Table 3. For the 14-th entry ABDOMEN shown below, there are five cross-references or links indicated by brackets: [stomach], [liver], [gall bladder], [pancreas], and [kidneys]. Hence, the 14-th column of CCE-A has five nonzeros (or 1's) in row positions 25 ([stomach]) through 29 ([kidneys]), which correspond to the links in order of their occurrence in the text.
ABDOMEN in vertebrates, portion of the trunk between the diaphragm and lower pelvis. In humans the abdominal cavity is lined with a thin membrane, the peritoneum, which encloses the [stomach], intestines, [liver], and [gall bladder]. The [pancreas], [kidneys], urinary bladder, and, in the female, reproductive organs are also located within the abdominal cavity.The NHSE matrix in Table 3 was derived from 400 of the distributed HTML documents accessible from the National HPCC Software Exchange (or NHSE) [BDG+95] homepage on the World Wide Web (WWW). The selected documents fell under the NHSE's HPCC Programs and Activities heading, and reflect a breadth-first tree search of links of the form <A HREF="..."</A> to 3 three levels which produced a total of 4233 hypertext links.
The density and average number of nonzeros per row () and column () of each of the four hypertext matrices are also provided in Table 3. The Density of each matrix is defined to be the ratio (Nonzeros) (Rows Columns). Clearly, these matrices are quite sparse with only 1 or 2 links associated with each document (manpage, article, or HTML page).
Table 3: Sparse hypertext matrix specifications.
Table 4 lists the bandwidth , envelope size , and the alternative bandwidth measure (see in Section 3.1) for each of the hypertext matrices, and Figure 1 illustrates the nonzero patterns for three of the four matrices considered. The upper-triangular structure for the nonzeros of matrices MAN2 and CCE-A reflects the identification of links (cross-references) in their order of occurrence in the original (on-line) text.
Table 4: Profiles of the hypertext matrices prior to reordering;
is the envelope size, denotes the bandwidth,
and is the alternative (non-diagonal) bandwidth measure.
The goal of the techniques in the next two sections will be to reorder rows and columns of the term-by-document matrix to reduce both and . A similar problem for symmetric matrices arises in the context of sparse Cholesky factorization. Since the Cholesky decomposition is stable under any symmetric permutation, various schemes have been proposed to reorder the rows and columns to reduce the number of generated nonzero values during the factorization. Several of the techniques derived below have their antecedents in symmetric envelope reduction algorithms. The correspondence analysis technique described in Section 3.5 is the primary exception, since its expense is too large for the Cholesky reordering problem.