Lexical matching methods for information retrieval can be quite inaccurate
when they are used for query processing. Given the common occurrence of
synonyms and polysemous words, a more desirable approach
would allow users to retrieve information from databases according to
a relevant topic or meaning. Latent
Semantic Indexing (LSI) [BDO95,DDF+90] is an example
of a *vector space* information retrieval model which addresses
the problems of lexical matching retrieval methods
by using statistically derived conceptual indices
instead of individual words for retrieval. Assuming an underlying
or latent structure in word usage that is somewhat
obscured by variability in word choice, LSI employs a truncated
singular value
decomposition (SVD) [GL89] to estimate the structure in word
usage across documents. Retrieval is then performed using the database
of singular values and vectors obtained from the truncated SVD.
Empirical data suggest that these statistically derived vectors are
more robust indicators of meaning than individual terms when applied
to a wide variety of text collections.
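The LSI retrieval step described above can be sketched as follows. This is not code from the paper; it is a minimal NumPy illustration, with a made-up 5-term by 4-document matrix, of how a truncated SVD yields low-rank term and document coordinates and how a query is projected into that space for ranking.

```python
import numpy as np

# Hypothetical 5-term x 4-document term-frequency matrix
# (rows: terms, columns: documents); values are raw counts.
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Truncated SVD: keep only the k largest singular triplets,
# A ~ U_k diag(s_k) V_k^T.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

# Project a query (a bag of terms) into the k-dimensional
# latent space: q_hat = q^T U_k diag(s_k)^{-1}.
q = np.array([1, 1, 0, 0, 0], dtype=float)
q_hat = q @ Uk / sk

# Rank documents by cosine similarity to the projected query;
# document j is represented by row j of V_k scaled by s_k.
doc_coords = Vk * sk
cos = (doc_coords @ q_hat) / (
    np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_hat))
ranking = np.argsort(-cos)
```

Retrieval then reports documents in the order given by `ranking`; because the comparison happens in the rank-k space rather than on raw term overlap, documents sharing no literal query terms can still rank highly.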

Retrieval methods can be applied to edge-vertex incidence matrices
[OSG92] corresponding to graphs of hypertext, i.e.,
text objects with links or cross-references between them [BDO95].
The link structure can be represented by the nonzero
patterns of the sparse *document-by-link* incidence matrices
associated with hypertext [Siz94]. Whereas LSI can use
relevant information stored
in links, current hypertext search implementations based on
keyword or string search do not usually exploit link
structure. We propose a new approach for utilizing the information
associated with links by permuting the corresponding document-by-link
incidence matrices to reveal document and link clusters.

The primary focus of this work is
to compare a variety of sparse matrix reordering schemes (spectral
and symbolic) for generating
narrow-banded (or clustered) nonzero patterns from hypertext
incidence matrices. Such nonzero patterns allow
the immediate detection of document and link clusters, and
can serve as visual maps for browsing hypertext and other similar
on-line information. These narrow-banded nonzero patterns also improve
the detection of additional or *implicit* hypertext links, thereby
facilitating automatic hypertext construction. Vector
space information retrieval models
such as LSI, which are based on spectral decompositions (e.g., SVD),
can exploit banded incidence matrices through
reduced indirect addressing (band storage rather than
gather-scatter access) and optimal partitioning of nonzero elements
(weighted term frequencies) across processors for parallel
implementations of sparse matrix-vector multiplication (used by
iterative methods such as Lanczos [Ber92]).
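One classical spectral reordering of the kind compared here sorts vertices by the entries of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue), which tends to pull connected vertices together and shrink the bandwidth. The sketch below uses an invented six-document adjacency matrix, not one of the paper's collections, purely to show the mechanics.

```python
import numpy as np

# Hypothetical symmetric adjacency matrix of a small document
# graph: two triangles {0,1,2} and {3,4,5} joined by edge (2,3).
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# Graph Laplacian L = D - A (D = diagonal degree matrix).
L = np.diag(A.sum(axis=1)) - A

# eigh returns eigenvalues in ascending order; the Fiedler
# vector is the eigenvector of the second-smallest eigenvalue.
w, V = np.linalg.eigh(L)
fiedler = V[:, 1]

# Reorder vertices by their Fiedler-vector entries and apply
# the permutation symmetrically to rows and columns.
perm = np.argsort(fiedler)
A_perm = A[np.ix_(perm, perm)]
```

Once the matrix is narrow-banded, a sparse matrix-vector product can sweep contiguous band storage instead of performing gather-scatter on scattered indices, which is the access-pattern advantage noted above for Lanczos-style iterations.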

Section 2 reviews some of the basic concepts needed to understand IR
models such as LSI,
and provides a sample term-by-document matrix corresponding to a
small text collection. In Section 3, both symbolic and spectral
approaches for reordering document-by-link incidence matrices are
presented. The term-by-document matrix from the constructive
LSI example in Section 2 is used to explain both spectral and
nonspectral approaches in Section 3. A performance
comparison of the three methods from Section 3 using
sparse hypertext-based document-by-link matrices generated from
the Condensed Columbia Encyclopedia, UNIX BSD 4.3 *man* pages,
and a subset of HyperText Markup Language (HTML) pages from the National
High-performance Software Exchange (NHSE) on the World-Wide Web (WWW)
is provided in Section 4. Finally, a brief summary and discussion of
future work comprise Section 5.

Michael W. Berry (berry@cs.utk.edu)

Mon Jan 29 14:30:24 EST 1996