Lexical matching methods for information retrieval can be quite inaccurate when used for query processing. Given the prevalence of synonyms and polysemous words, a more desirable approach would allow users to retrieve information from databases according to a relevant topic or meaning. Latent Semantic Indexing (LSI) [BDO95,DDF+90] is a vector space information retrieval model that addresses the problems of lexical matching by using statistically derived conceptual indices rather than individual words for retrieval. Assuming an underlying or latent structure in word usage that is partially obscured by variability in word choice, LSI employs a truncated singular value decomposition (SVD) [GL89] to estimate that structure in word usage across documents. Retrieval is then performed using the database of singular values and vectors obtained from the truncated SVD. Empirical results on a wide variety of text collections suggest that these statistically derived vectors are more robust indicators of meaning than individual terms.
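The LSI retrieval step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the 5-term, 4-document matrix and the query are hypothetical, and only the standard truncated-SVD query folding is shown.

```python
import numpy as np

# Hypothetical toy term-by-document matrix: rows = terms, columns = documents.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
])

k = 2  # number of retained singular values (the latent dimensions)

# Truncated SVD: A is approximated by U_k * diag(s_k) * Vt_k.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

def query_scores(q):
    """Fold a term-space query into the k-dimensional latent space and
    score each document by cosine similarity in that space."""
    q_hat = (q @ U_k) / s_k             # query in latent space
    docs = Vt_k.T                       # documents in latent space
    num = docs @ q_hat
    den = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    return num / den

q = np.array([1.0, 0.0, 0.0, 0.0, 1.0])   # query containing terms 0 and 4
scores = query_scores(q)
best = int(np.argmax(scores))              # index of the best-matching document
```

Because scoring is done in the reduced latent space, a document can rank highly even if it shares no terms with the query, which is precisely how LSI mitigates the synonymy problem of lexical matching.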
Retrieval methods can be applied to edge-vertex incidence matrices [OSG92] corresponding to graphs of hypertext, i.e., text objects with links or cross-references between them [BDO95]. The link structure can be represented by the nonzero patterns of the sparse document-by-link incidence matrices associated with hypertext [Siz94]. Whereas LSI can use relevant information stored in links, current hypertext search implementations based on keyword or string search do not usually exploit link structure. We propose a new approach for utilizing the information associated with links by permuting the corresponding document-by-link incidence matrices to reveal document and link clusters.
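A document-by-link incidence matrix of the kind just described can be constructed directly from the link structure. The sketch below uses a hypothetical four-document, four-link hypertext; the data and variable names are illustrative only.

```python
import numpy as np

# Hypothetical hypertext: doc_links[d] = set of link (edge) identifiers
# incident to document d. Each link joins exactly two documents.
doc_links = {
    "doc0": {"l0", "l1"},
    "doc1": {"l0", "l2"},
    "doc2": {"l1", "l2", "l3"},
    "doc3": {"l3"},
}

docs = sorted(doc_links)
links = sorted(set().union(*doc_links.values()))

# Document-by-link incidence matrix: B[i, j] = 1 iff link j touches doc i.
B = np.zeros((len(docs), len(links)), dtype=int)
for i, d in enumerate(docs):
    for link in doc_links[d]:
        B[i, links.index(link)] = 1
```

Since every hypertext link connects two documents, each column of `B` contains exactly two nonzeros, i.e., `B` is the edge-vertex incidence matrix of the underlying hypertext graph; its nonzero pattern is what the reordering schemes of this work permute.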
The primary focus of this work is to compare a variety of sparse matrix reordering schemes (spectral and symbolic) for generating narrow-banded (or clustered) nonzero patterns from hypertext incidence matrices. Such nonzero patterns allow the immediate detection of document and link clusters, and serve as textual browsers for hypertext and similar on-line information. Narrow-banded nonzero patterns also improve the detection of additional or implicit hypertext links, thereby facilitating automatic hypertext construction. Vector space information retrieval models such as LSI, which are based on spectral decompositions (e.g., the SVD), can exploit banded incidence matrices in two ways: through reduced indirect addressing (band storage rather than gather-scatter access), and through optimal partitioning of nonzero elements (weighted term frequencies) across processors in parallel implementations of sparse matrix-vector multiplication, the kernel of iterative methods such as Lanczos [Ber92].
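As one concrete illustration of bandwidth-reducing reordering (not necessarily one of the specific schemes compared in this work), the classical Reverse Cuthill-McKee ordering can be applied to a symmetric connectivity pattern such as the one obtained from a document-by-link incidence matrix B via B B^T. The adjacency matrix below is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    """Maximum |i - j| over the nonzero entries of a dense matrix."""
    rows, cols = np.nonzero(M)
    return int(np.max(np.abs(rows - cols))) if rows.size else 0

# Hypothetical symmetric connectivity pattern (e.g., the nonzero pattern
# of B @ B.T for a document-by-link incidence matrix B, diagonal kept).
A = np.array([
    [1, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1],
])

# Reverse Cuthill-McKee returns a permutation of the vertices; applying it
# symmetrically to rows and columns draws the nonzeros toward the diagonal.
perm = reverse_cuthill_mckee(csr_matrix(A), symmetric_mode=True)
A_reordered = A[np.ix_(perm, perm)]
```

The reordered matrix has bandwidth no larger than the original, so its nonzeros can be stored and traversed in band form, which is the property the sparse matrix-vector products above exploit.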
Section 2 reviews some of the basic concepts needed to understand IR models such as LSI, and provides a sample term-by-document matrix corresponding to a small text collection. In Section 3, both symbolic and spectral approaches for reordering document-by-link incidence matrices are presented; the term-by-document matrix from the constructive LSI example in Section 2 is used to explain both approaches. Section 4 provides a performance comparison of the three methods from Section 3 using sparse hypertext-based document-by-link matrices generated from the Condensed Columbia Encyclopedia, the UNIX BSD 4.3 man pages, and a subset of HyperText Markup Language (HTML) pages from the National High-performance Software Exchange (NHSE) on the World-Wide Web (WWW). Finally, a brief summary and a discussion of future work comprise Section 5.