**Michael W. Berry (Presenting Author)**
Department of Computer Science
University of Tennessee
Knoxville, TN 37996-1301
Office: 615-974-3838, FAX: 615-974-4404

**Susan T. Dumais**
Bellcore
Information Science Research Group
445 South Street, Room 2L-371
Morristown, NJ 07962-1910

**Todd A. Letsche**
Department of Computer Science
University of Tennessee
Knoxville, TN 37996-1301

**Keywords:** indexing, information, latent, matrices,
retrieval, semantic, singular value decomposition, sparse, updating

Typically, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching methods can be inaccurate when they are used to match a user's query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy), so terms in a user's query will literally match terms in irrelevant documents. A better approach would allow users to retrieve information on the basis of a conceptual topic or meaning of a document.

Latent Semantic Indexing (LSI) [5] tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval. LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice. A truncated singular value decomposition (SVD) [15] is used to estimate the structure in word usage across documents. Retrieval is then performed using the database of singular values and vectors obtained from the truncated SVD. Performance data shows that these statistically derived vectors are more robust indicators of meaning than individual terms. A number of software tools have been developed to perform operations such as parsing document texts, creating a term by document matrix, computing the truncated SVD of this matrix, creating the LSI database of singular values and vectors for retrieval, matching user queries to documents, and adding new terms or documents to an existing LSI database [5][24]. The bulk of LSI processing time is spent in computing the truncated SVD of the large sparse term by document matrices.

Section 2 is a review of basic concepts needed to understand LSI.
Section 3 uses a constructive example to illustrate how LSI represents terms
and documents in the same semantic space, how a query is represented,
how additional documents are added (or folded-in), and how SVD-updating
represents additional documents. In Section 4, an algorithm for
SVD-updating is discussed along with a comparison to the folding-in
process with regard to robustness of query
matching and computational complexity. Section 5
surveys promising applications of LSI along with parameter
estimation problems that arise with its use.

The singular value decomposition of an **m** by **n** term-document matrix **A** of rank **r** is written

$$A = U \Sigma V^T,$$

where $U^T U = I_m$ and $V^T V = I_n$. The first **r** columns of the orthogonal matrices $U$ and $V$ define the orthonormal eigenvectors associated with the **r** nonzero eigenvalues of $AA^T$ and $A^T A$, respectively, and $\Sigma$ is the diagonal matrix of singular values.

The following classical result (due to Eckart and Young) illustrates how the SVD can reveal important information about the structure of a matrix: among all matrices of rank at most **k**, the truncated SVD

$$A_k = U_k \Sigma_k V_k^T \qquad (2)$$

minimizes the distance to **A** in the Frobenius norm, where $U_k$ and $V_k$ contain the first **k** columns of $U$ and $V$, and $\Sigma_k$ the **k** largest singular values.

Proof. See [15].

In other words, $A_k$, which is constructed from the **k**-largest singular triplets of **A**, is the closest rank-**k** approximation to **A**.
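This optimality property is easy to verify numerically. The sketch below uses a small hypothetical frequency matrix (not one from the paper) to form the truncated SVD with numpy and check that the approximation error equals the norm of the discarded singular values:

```python
import numpy as np

# Hypothetical 4-term by 3-document frequency matrix (illustrative only).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Full SVD: A = U diag(s) V^T with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncated SVD A_k built from the k largest singular triplets (Equation (2)).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius-norm error of the best rank-k approximation
# equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_k, 'fro')
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```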

In order to implement Latent Semantic Indexing [5][12] a matrix of terms by documents must be constructed. The elements of the term-document matrix are the occurrences of each word in a particular document, i.e., $A = [a_{ij}]$,

where $a_{ij}$ denotes the frequency with which term **i** occurs in document **j**. In practice, local and global weightings are applied to increase or decrease the relative importance of terms within and among documents, i.e.,

$$a_{ij} = L(i,j) \times G(i), \qquad (5)$$

where **L(i,j)** is the local weighting for term **i** in document **j**, and **G(i)** is the global weighting for term **i**.

Figure 1 is a mathematical representation of the singular value decomposition. $U$ and $V$ contain the term and document vectors, respectively, and $\Sigma$ holds the singular values. The shaded regions in $U$ and $V$ and the diagonal line in $\Sigma$ represent $A_k$ from Equation (2).

It is important for the LSI method that the derived matrix $A_k$
not reconstruct the original term-document matrix **A** exactly.
The truncated SVD, in one sense, captures most of the
important underlying structure in the association of terms
and documents, yet at the same time removes the noise or
variability in word usage that plagues word-based retrieval
methods. Intuitively, since the number of dimensions, **k**, is much smaller than the number of unique terms, minor differences in terminology are ignored.

Consider the words *car*, *automobile*, *driver*, and
*elephant*. The terms
*car* and *automobile* are synonyms, *driver* is a related
concept and *elephant* is unrelated. In most retrieval systems,
the query *automobiles* is no more likely to retrieve documents about
cars than documents about elephants, if neither used precisely
the term *automobile* in the documents. It would be preferable
if a query about *automobiles* also retrieved articles about
*cars* and even articles about drivers to a lesser extent.
The derived **k**-dimensional feature space can represent these
useful term inter-relationships. Roughly speaking, the words
*car* and *automobile* will occur with many of the same words (e.g., *motor*, *model*, *vehicle*), and they will therefore have similar representations in the **k**-dimensional space, while *car* and *elephant* will not.
For purposes of information retrieval, a user's query must be represented
as a vector in ** k**-dimensional space and compared to documents. A
query (like a document) is a set of words. For example,
the user query can be represented by

$$\hat{q} = q^T U_k \Sigma_k^{-1}, \qquad (6)$$

where **q** is simply the vector of words in the user's query, multiplied by
the appropriate term weights (see Equation (5)).
The sum of these **k**-dimensional term vectors is reflected by the
$q^T U_k$ term in Equation (6), and the
right multiplication by $\Sigma_k^{-1}$ differentially weights
the separate dimensions. Thus, the query vector is located at the weighted
sum of its constituent term vectors. The query
vector can then be compared to all existing document vectors, and
the documents ranked by their similarity (nearness) to the query. One
common measure of similarity is the cosine between the query vector and
document vector. Typically, the documents whose cosine with the query vector exceeds some threshold are returned to the user.
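A minimal numpy sketch of this query-matching step, using a small hypothetical term-document matrix (the term labels are invented for illustration):

```python
import numpy as np

# Hypothetical 5-term by 4-document matrix; rows are terms, columns documents.
A = np.array([[1.0, 0.0, 0.0, 1.0],   # "car"
              [0.0, 1.0, 0.0, 1.0],   # "automobile"
              [0.0, 0.0, 1.0, 0.0],   # "elephant"
              [1.0, 1.0, 0.0, 0.0],   # "driver"
              [0.0, 0.0, 1.0, 1.0]])  # "zoo"
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # Vk: one row per document

# Project the raw query vector into the k-dimensional space:
# q_hat = q^T U_k inv(Sigma_k).
q = np.array([0.0, 1.0, 0.0, 0.0, 0.0])              # query: "automobile"
q_hat = q @ Uk @ np.linalg.inv(Sk)

# Rank documents by cosine similarity between q_hat and each document vector.
cos = Vk @ q_hat / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat))
ranking = np.argsort(-cos)
print(ranking, np.round(cos, 3))
```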

Suppose an LSI-generated database already exists.
That is, a collection of text objects has been parsed, a term-document matrix
has been generated, and the SVD of the term-document matrix has been computed.
If more terms and documents must be added,
two alternatives for incorporating them currently exist: recomputing the
SVD of a new term-document matrix or *folding-in* the new terms and
documents.

Four terms are defined below to avoid confusion when discussing updating.
*Updating* refers to the general process of adding new terms and/or
documents to an existing LSI-generated database.
Updating can mean either folding-in or SVD-updating.
*SVD-updating* is the new method of updating developed in [24].
*Folding-in* terms or documents
is a much simpler alternative that uses an existing SVD to represent
new information.
*Recomputing the SVD* is not an updating method, but a way of creating an
LSI-generated database with new terms and/or documents from scratch which
can be compared to either updating method.

Recomputing the SVD of a larger term-document matrix requires more computation
time and, for large problems, may be impossible due to memory constraints.
Recomputing the SVD allows the new terms and documents to directly
influence the latent semantic structure, whereas folding-in (discussed
below) represents new information using the existing structure only.

Folding-in documents is essentially the process described in Section 2.2 for query representation. Each new document is represented as a weighted sum of its component term vectors. Once a new document vector has been computed, it is appended to the set of existing document vectors in $V_k$ (see Figure 2). Similarly, new terms can be represented as a weighted sum of the vectors for documents in which they appear. Once the term vector has been computed, it is appended to the set of existing term vectors in $U_k$ (see Figure 3).

**Figure 2:** Mathematical representation of folding-in ** p** documents.

**Figure 3:** Mathematical representation of folding-in ** q** terms.

To fold-in a new **m** by 1 document vector, **d**, a projection, $\hat{d}$, of **d** onto the span of the current term vectors is computed by

$$\hat{d} = d^T U_k \Sigma_k^{-1}. \qquad (7)$$

Similarly, to fold-in a new 1 by **n** term vector, **t**, a projection, $\hat{t}$, of **t** onto the span of the current document vectors is determined by

$$\hat{t} = t\, V_k \Sigma_k^{-1}.$$
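The two fold-in projections can be sketched in a few lines of numpy (the matrix below is hypothetical; the point is only the shape and placement of the appended vectors):

```python
import numpy as np

# Hypothetical 4-term by 3-document matrix and its rank-2 truncated SVD.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Fold in a new m-by-1 document d via d_hat = d^T U_k inv(Sigma_k) (Equation (7)).
d = np.array([1.0, 0.0, 2.0, 0.0])
d_hat = d @ Uk @ np.linalg.inv(Sk)
Vk_new = np.vstack([Vk, d_hat])        # append to the existing document vectors

# Fold in a new 1-by-n term t via t_hat = t V_k inv(Sigma_k).
t = np.array([0.0, 1.0, 1.0])
t_hat = t @ Vk @ np.linalg.inv(Sk)
Uk_new = np.vstack([Uk, t_hat])        # append to the existing term vectors

print(Vk_new.shape, Uk_new.shape)
```

Note that $U_k$, $\Sigma_k$, and $V_k$ themselves are untouched, which is exactly why folding-in is cheap but cannot reposition existing terms or documents.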

In this section, Latent Semantic Indexing (LSI) and the folding-in process discussed in Section 2.3 are applied to a small database of medical topics. Table 2 lists 14 topics taken from the testbed of 1033 MEDLINE abstracts on biomedicine obtained from the National Library of Medicine (see MED collection in [5]). All the underlined words in Table 2 denote keywords which are used as referents to the medical topics. The parsing rule used for this sample database required that keywords appear in more than one topic. Of course, alternative parsing strategies can increase or decrease the number of indexing keywords (or terms).

**Table 2:** Database of medical topics from *MEDLINE*. Keywords are underlined.

**Table 3:** The 18 by 14 term-document matrix corresponding to the medical topics in Table 2.

Corresponding to the text in Table 2 is the 18 by 14
term-document matrix shown in Table 3.
The elements of this matrix are the frequencies in which a term occurs in
a document or medical topic (see Section 4). For example,
in medical topic **M2**, the second column of the term-document matrix,
*culture*, *discharge*, and *patients* all occur once. For
simplicity, term weighting is not used in this
example matrix. Now compute the truncated SVD (with **k** = 2) of this
term-document matrix.

Using the first column of $U_2$ multiplied by the first singular value, $\sigma_1$, for the x-coordinates and the second column of $U_2$ multiplied by the second singular value, $\sigma_2$, for the y-coordinates, the terms can be represented on the Cartesian plane. Similarly, the first column of $V_2$ scaled by $\sigma_1$ gives the x-coordinates and the second column of $V_2$ scaled by $\sigma_2$ the y-coordinates for the documents (medical topics). Figure 4 is a two-dimensional plot of the terms and documents for the 18 by 14 sample term-document matrix.
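The coordinate construction is a one-liner per axis; a sketch on a small hypothetical matrix (not the 18 by 14 example):

```python
import numpy as np

# Hypothetical 4-term by 3-document matrix.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column i of U (or V) scaled by sigma_i gives the i-th plot coordinate.
term_xy = U[:, :2] * s[:2]        # x = sigma_1 * U[:,0], y = sigma_2 * U[:,1]
doc_xy = Vt[:2, :].T * s[:2]      # document coordinates, scaled the same way

print(term_xy.shape, doc_xy.shape)
```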

Notice the documents and terms pertaining to *patient behavior or
hormone production* are clustered above the x-axis while terms and documents
related to *blood disease or fasting* are
clustered near the lower y-axis. Such groupings suggest that subsets
of medical topics such as {**M2, M3, M4**} and {**M10, M11, M12**}
each contain topics of *similar* meaning. Although topics
**M1** and **M2** share the polysemous terms *culture* and
*discharge* they are not represented by nearly identical vectors
by LSI. The *meaning* of those terms in topics **M1** and **M2** are
clearly different and literal-matching indexing schemes have difficulty
resolving such context changes.

Suppose we are interested in the documents that contain information related to
the *age of children with blood abnormalities*. Recall
that a query vector (**q**) can be represented as a pseudo-document
($\hat{q}$) via $\hat{q} = q^T U_k \Sigma_k^{-1}$ (see Equation
(6)). Since
the words *age*, *blood*, and *abnormalities* are keywords of the sample
database, the query vector contains nonzero entries in the rows
corresponding to those three terms.

**Figure 5:** Derived coordinates for the query of **age
blood abnormalities**.

**Figure 6:** 2-D plot of terms and documents along with the
query **age blood abnormalities**.

A different cosine threshold, of course, could have been used so that a
larger or smaller set of documents would be returned.
The cosine is merely used to rank-order documents and its numerical value
is not always an adequate measure of relevance
[24][30].

In this example, LSI has been applied using two factors, i.e.,
$A_2$ is used to approximate the original 18 by 14 term-document matrix.
Using a cosine threshold of 0.85, three medical topics related to *blood abnormalities and kidney failure* were returned:
topics **M8**, **M9**, and **M12**. If the cosine
threshold were reduced, then medical topics
**M7** and **M11** (which are somewhat related) would also be returned.
With lexical-matching, five medical topics (**M1**, **M8**, **M10**,
**M11**, **M12**) would be returned. Clearly, topics **M1** and
**M10** are not relevant and topic **M9** would be missed. LSI,
on the other hand, is able to retrieve the most relevant topic
from Table 2 (i.e., **M9**)
to the original query *age of children with blood abnormalities*
since *christmas disease* is the name associated with hemophilia in
young children. This
ability to retrieve relevant information based on context or meaning rather
than literal term usage is the main motivation for using LSI.

Table 4 lists the LSI-ranked documents (medical topics) with
different numbers of factors (** k**).
The documents returned in Table 4 satisfy a cosine
threshold of 0.40, i.e., returned documents are within a cosine of 0.40
of the pseudo-document used to represent the query. As alluded to earlier,
the cosine best serves as a measure for rank-ordering only; Table
4 clearly demonstrates that the value associated
with returned documents can vary significantly with changes in the number of
factors (**k**).

**Table 4:** Returned documents based on different numbers of LSI factors.

Suppose the fictitious topics listed in Table 5 are to be
added to the original set of medical topics in Table 2.
These new topics (**M15** and **M16**) use essentially the same terms as the
original topics in Table 2, but in a somewhat different
sense or context. Topic **M15** relates a *rise in oestrogen*
with the behavior of *rats* rather than *patients*. Topic
**M16** uses the term *pressure* in the context of
*behavior* rather than *blood*. As with Table 2,
all underlined words in Table 5 are considered significant
since they appear in more than one title (across
all 16 topics from Tables 2 and 5).
Folding-in (see Section 2.3) is one approach for updating
the original LSI-generated database with the 2 new medical topics.
Figure 7 demonstrates how these topics are folded into the
database based on **k** = 2 LSI factors via Equation (7).
The new medical topics are denoted on the graph by their document labels
(in a different boldface font).
Notice that the coordinates of the original topics stay fixed, and hence the
new data has no effect on the clustering of existing terms or documents.

Ideally, the most robust way to produce the best rank-**k** approximation
to a term-document matrix which has been updated with new terms
and documents is to simply compute the SVD of the reconstructed term-document
matrix, say $A'$. Updating methods which can approximate the SVD
of the larger term-document matrix
become attractive in the presence of memory or time constraints.
As discussed in [24],
the accuracy of SVD-updating approaches can easily be compared to that
obtained when the SVD of $A'$ is explicitly computed.

**Figure 8:** Two-dimensional plot of terms and documents using the SVD
of a reconstructed term-document matrix.

Suppose the topics from Table 5 are combined with those of Table 2 in order to create a new 18 by 16 term-document matrix $A'$. Following Figure 1, we then construct the rank-2 approximation $A_2' = U_2' \Sigma_2' (V_2')^T$ to $A'$.

Figure 8 is a two-dimensional plot of the 18 terms and 16
documents (medical topics) using the elements of $U_2' \Sigma_2'$ and
$V_2' \Sigma_2'$ for term and document coordinates, respectively.
Notice the difference in term and document positions between Figures
7 and 8. Clearly, the
new medical topics from Table 5 have helped
redefine the underlying latent structure when the SVD
of $A'$ is computed. That is, one can discuss blood *pressure*
and behavioral *pressure* in different contexts. Note that in
Figure 8 (unlike Figure 7) the
topics (old and new) related to the use of *rats* form
a well-defined *cluster* or subset of documents.
Folding-in the 2 new medical topics based on the existing rank-2 approximation
to **A** (defined by Table 3) may not accurately reproduce the
true LSI representation of the new (or updated) database.

In practice, the difference between folding-in and
SVD-updating is likely to depend on the number of
new documents and terms relative to the number in
the original SVD of ** A**. Thus, we expect SVD-updating
to be especially valuable for rapidly changing
databases.

The process of SVD-updating discussed in Section 2.3
can also be illustrated using the medical topics
from Tables 2 and 5.
The three steps required to perform a complete SVD-update involve adding new
documents, adding new terms, and correction for changes in term weightings.
The order of these steps, however, need not
follow the ordering presented in this section (see [24]).

Let **D** denote the **m** by **p** matrix of **p** new document vectors to be appended to the columns of $A_k$, and define the **m** by (**n** + **p**) matrix

$$B \equiv [\, A_k \mid D \,]. \qquad (10)$$

The matrices $U_B$, $\Sigma_B$, and $V_B$ of the SVD of **B**
are computed.
This is almost the same process as recomputing the SVD, only **A** is replaced
by $A_k$.

Let **T** denote the **q** by **n** matrix of **q** new term vectors to be appended to the rows of $A_k$, and define the (**m** + **q**) by **n** matrix

$$C \equiv \begin{bmatrix} A_k \\ T \end{bmatrix}. \qquad (11)$$

The matrices $U_C$, $\Sigma_C$, and $V_C$ of the SVD of **C**
are computed.

The correction step for incorporating changes in term weights
(see Equation (5)) is performed
after any terms or documents have
been SVD-updated and the term weightings of the original
matrix have changed.
For a change of weightings in **j** terms, let $Y$ be an
**m** by **j** matrix specifying the differences between old and new
weights for each of the **j** terms, and let $Z$ be an **n** by **j**
matrix of zeros and ones indicating the affected entries, so that the
corrected matrix is the rank-**j** update $A_k + Y Z^T$.
The mathematical computations required in each phase of the
SVD-updating process are detailed in this section.
SVD-updating incorporates new term or document information into an
existing semantic model ($A_k$ from Equation (2)) using the sparse
matrices **D** and **T** of new documents and terms.

In the complexity estimates below, *it* denotes the number of iterations required by a Lanczos-type procedure
[2] to approximate the eigensystem of the updated matrix, and *trp*
is the number of accepted singular triplets (i.e. singular values
and corresponding left and right singular vectors). The
additional multiplication by ** G** is required to extract the left singular
vector given approximate singular values and their corresponding right
singular vector approximations from a Lanczos procedure. A brief summary
of the required computations for updating an existing rank-**k** LSI model is given below.

Let $B = [\,A_k \mid D\,]$ from Equation (10) and define SVD(**B**) $= U_B \Sigma_B V_B^T$. Then

$$U_B \Sigma_B V_B^T = [\, U_k \Sigma_k V_k^T \mid D \,],$$

since $A_k = U_k \Sigma_k V_k^T$. If $F = [\, \Sigma_k \mid U_k^T D \,]$ and SVD($F$) $= U_F \Sigma_F V_F^T$, then it follows that

$$U_B = U_k U_F, \qquad \Sigma_B = \Sigma_F, \qquad V_B = \begin{bmatrix} V_k & 0 \\ 0 & I_p \end{bmatrix} V_F.$$

Hence $U_B$ and $V_B$ are **m** by **k** and (**n** + **p**) by **k** matrices, respectively.
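These document-update identities can be checked numerically. The sketch below uses a random hypothetical matrix; the new documents **D** are restricted to the span of $U_k$ so the factorization is exact (in general, the component of **D** outside that span is discarded by this update):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k, p, n = 2, 3, 5
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
Ak = Uk @ Sk @ Vk.T

# p new document vectors, chosen inside span(U_k) so the identities are exact.
D = Uk @ rng.random((k, p))
B = np.hstack([Ak, D])                       # B = [A_k | D]

# SVD of the small k x (k + p) matrix F = [Sigma_k | U_k^T D] ...
F = np.hstack([Sk, Uk.T @ D])
UF, sF, VtF = np.linalg.svd(F, full_matrices=False)

# ... yields the SVD of B without ever factoring the large matrix.
UB = Uk @ UF                                 # left singular vectors of B
blk = np.block([[Vk, np.zeros((n, p))],
                [np.zeros((p, k)), np.eye(p)]])
VB = blk @ VtF.T                             # right singular vectors of B
print(np.allclose(B, UB @ np.diag(sF) @ VB.T))
```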

Let $C = \begin{bmatrix} A_k \\ T \end{bmatrix}$ from Equation (11) and define SVD($C$) $= U_C \Sigma_C V_C^T$. Then

$$U_C \Sigma_C V_C^T = \begin{bmatrix} U_k \Sigma_k V_k^T \\ T \end{bmatrix}.$$

If $H = \begin{bmatrix} \Sigma_k \\ T V_k \end{bmatrix}$ and SVD($H$) $= U_H \Sigma_H V_H^T$, then it follows that

$$U_C = \begin{bmatrix} U_k & 0 \\ 0 & I_q \end{bmatrix} U_H, \qquad \Sigma_C = \Sigma_H, \qquad V_C = V_k V_H.$$

Hence $U_C$ and $V_C$ are
(**m** + **q**) by **k** and **n** by **k** matrices, respectively.

Let $Q = A_k + Y Z^T$,
where $Y$
is
**m** by **j** and $Z$ is **n** by **j** (see the correction step above).

If $F = \Sigma_k + (U_k^T Y)(V_k^T Z)^T$ and
SVD(**Q**) $= U_Q \Sigma_Q V_Q^T$ with SVD($F$) $= U_F \Sigma_F V_F^T$, then it follows that

$$U_Q = U_k U_F, \qquad \Sigma_Q = \Sigma_F, \qquad V_Q = V_k V_F,$$

since $Q = U_k \left( \Sigma_k + U_k^T Y Z^T V_k \right) V_k^T$ whenever the columns of $Y$ and $Z$ lie in the spans of $U_k$ and $V_k$, respectively.
Hence $U_Q$ and $V_Q$ are
**m** by **k** and **n** by **k** matrices, respectively.

Table 7 contains the complexities for folding-in
terms and documents, recomputing the SVD, and the three phases of
SVD-updating.
Using the complexities in Table 7 the required
number of floating-point operations (or flops)
for each method can be compared for varying numbers of added documents
or terms. As shown in [24] for a condensed encyclopedia
test case, the computational advantage of one scheme over another
depends on the values of the variables listed in Table 6.
For example, the relative cost of SVD-updating is reduced if
the sparsity of the **D** matrix from Equation (10) reflects that
of the original term-document matrix **A**.

One important distinction between the folding-in (see Section
2.3) and the SVD-updating processes lies in
the guarantee of orthogonality in the vectors (or axes) used for
term and document coordinates. Recall that
an orthogonal matrix **Q** satisfies
$Q^T Q = I_n$, where $I_n$ is the n-th order identity matrix.
Let $\hat{D}$ denote the collection of all folded-in documents, where each
column of $\hat{D}$ is a projected document vector computed via
Equation (7).
Folding-in does not maintain the orthogonality of the columns of $U_k$ or $V_k$, since arbitrary vectors of weighted terms or documents are appended to $U_k$ or $V_k$, respectively. However, the amount by which the folding-in method perturbs this orthogonality does indicate how much distortion has occurred due to the addition of new terms or documents.
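A minimal sketch of such a distortion measure, using the Frobenius norm of $\hat{V}^T \hat{V} - I$ on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Columns of V_k are orthonormal before any folding-in.
before = np.linalg.norm(Vk.T @ Vk - np.eye(k), 'fro')

# Fold in 4 new documents (Equation (7)) and re-measure the departure from I.
D = rng.random((8, 4))
Vk_new = np.vstack([Vk, D.T @ Uk @ np.linalg.inv(Sk)])
after = np.linalg.norm(Vk_new.T @ Vk_new - np.eye(k), 'fro')
print(before, after)
```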

The trade-off between computational complexity and loss of orthogonality
in the coordinate axes for updating databases using LSI poses
interesting future research questions. Though the SVD-updating process is
considerably more expensive [24] than folding-in, a
true lower-rank approximation to the updated term-document matrix
defined by Figure 1 is maintained. Significant
insights in the future could be gained by monitoring the loss of
orthogonality associated with folding-in and correlating it to the number
of relevant documents returned within particular cosine thresholds
(see Section 3.1).

To illustrate SVD-updating, suppose the medical topics
in Table 5 are to be added to the original set of
medical topics
in Table 2. In this example, only documents are added and weights
are not adjusted, hence only the SVD of the matrix ** B** in Equation
(10) is computed.

Initially, an 18 by 2 term-document matrix, **D**, corresponding to
the two new topics in Table 5 is generated and then
appended to $A_2$ to form the 18 by 16 matrix $B = [\,A_2 \mid D\,]$, whose SVD is then computed,

where the columns of $U_B$ and $V_B$ are the left and right
singular vectors, respectively, corresponding to the two largest singular
values of **B**.

Figure 9 is a two-dimensional plot of the 18 terms and 16 documents (medical topics) using the elements of $U_B \Sigma_B$ and $V_B \Sigma_B$ for term and document coordinates, respectively. Notice the similar clustering of terms and medical topics in Figures 8 (recomputing the SVD) and 9 (SVD-updating), and the difference in document and term clustering from Figure 7 (folding-in).

**Figure 9:** Two-dimensional plot of terms and documents using the SVD-updating process.

**A three-dimensional animation** of the terms and documents
from the updating example of Section 4.7
can be used to show the placement of the original 18 terms and 14
documents, the two documents folded-in to the rank-** 3**
LSI model, and the movement of the terms and documents
when SVD-updating is applied. In the video, the blue axis
represents the x-axis, the green axis represents the y-axis,
and the red axis represents the z-axis. Coordinates for these
axes, of course, are derived from the 3-largest singular triplets
of the appropriate matrix.

Folding-in and SVD-updating are illustrated by first showing the
terms and documents before topics **M15** and **M16** are
added. Then, the new topics, represented by green spheres labeled M15
and M16, are folded-in to the term-document space. After a
slight pause, all terms and documents are shown
moving to the positions they would assume if SVD-updating is
used to add the topics **M15** and **M16** to the
term-document space. Notice that SVD-updating appropriately moves the medical
topic **M16** to the *centroid* of the term vectors
corresponding to *depressed*, *patients*,
*pressure*, and *fast*.

Because the frames of the video are computer-generated,
Autodesk FLI/FLC format was used to animate the images rather than
MPEG encoding. On X-windows, videos in FLI format may be viewed
with the **xanim** viewer. On an IBM PC or Apple Macintosh, they
may be viewed using **Waaply** and **Fli-Viewer**, respectively. For
more information about the FLI/FLC animation format and viewing
FLI/FLC videos, please consult **http://crusty.er.usgs.gov/flc.html**.

To configure **xanim** as the FLI viewer, add the line `video/fli flc fli`
to your $HOME/.mimetypes file and add the line `video/fli; xanim %s` to
your $HOME/.mailcap file, where `%s` denotes the name of the video file.

In this section, several applications of LSI
are discussed ranging from information retrieval
and filtering to models of human memory. Some open
computational and statistical-based issues related
to the practical use of LSI for such applications
are also mentioned.

Latent Semantic Indexing was initially developed for information
retrieval applications. In these applications, a fixed database is
indexed and users pose a series of retrieval queries. The effectiveness
of retrieval systems is often evaluated using "test collections" developed
by the information retrieval community. These collections consist
of a set of documents, a set of user queries, and relevance judgements
(i.e., for each query every document in the collection has been
judged as relevant or not to the query).

One can evaluate the effectiveness of different systems in retrieving relevant documents and at the same time not returning irrelevant documents. Two measures, *precision* (the proportion of returned documents that are relevant) and *recall* (the proportion of relevant documents that have been returned), are commonly used for this purpose.

Results were obtained for LSI and compared
against published or computed results for other retrieval
techniques, notably the standard keyword vector method in
SMART [25].
For several information science test collections,
the average precision using LSI ranged from
comparable to 30% better than that
obtained using standard keyword vector methods.
See [5][7]
[13] for details of these evaluations.
The LSI method performs best relative to standard vector methods
when the queries and relevant documents do not share many words,
and at high levels of recall.

One of the common and usually effective methods for improving
retrieval performance in vector methods is to transform the raw
frequency of occurrence of a term in a document
(i.e. the value of a cell in the term by document matrix)
by some function (see Equation 5).
Such transformations normally have two components.
Each term is assigned a *global weight*,
indicating its overall importance in the collection
as an indexing term.
The same global weighting is applied to an entire row (term)
of the term-document matrix.
It is also possible to transform the term's frequency
in the document;
such a transformation is called a *local weighting*,
and is applied to each cell in the matrix.

The performance of several weighting schemes has been
compared in [7]. A transformed matrix is automatically computed,
the truncated SVD shown in Figure 1 is computed, and performance
is evaluated. A *log* transformation of the local cell entries combined
with a global *entropy* weight for terms is the most effective
term-weighting scheme. Averaged over five test collections, *log by
entropy* weighting was 40% more effective than raw term
weighting.
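One common formulation of this *log by entropy* scheme (exact formulas vary slightly across papers; this variant assumes every term occurs at least once) can be sketched as:

```python
import numpy as np

def log_entropy(F):
    """Weight a term-document frequency matrix F with log local weighting
    and entropy global weighting:
        L(i,j) = log(1 + f_ij)
        G(i)   = 1 + sum_j p_ij log(p_ij) / log(n),  p_ij = f_ij / sum_j f_ij
    Assumes every term (row) occurs at least once."""
    F = np.asarray(F, dtype=float)
    n = F.shape[1]                                  # number of documents
    gf = F.sum(axis=1, keepdims=True)               # global term frequencies
    p = np.where(F > 0, F / gf, 1.0)                # 1.0 placeholder: log(1) = 0
    entropy_term = np.where(F > 0, p * np.log(p), 0.0).sum(axis=1)
    G = 1.0 + entropy_term / np.log(n)
    return np.log1p(F) * G[:, np.newaxis]

F = np.array([[1.0, 1.0, 1.0],      # term spread evenly over the collection
              [3.0, 0.0, 0.0]])     # term concentrated in one document
W = log_entropy(F)
print(np.round(W, 3))
```

As the two rows illustrate, the entropy weight suppresses terms that are spread evenly over the collection (global weight near zero) and favors terms concentrated in few documents.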

The idea behind relevance feedback is quite simple.
Users are very unlikely to be able to specify their
information needs adequately, especially on the first try.
In interactive retrieval situations,
it is possible to take advantage of user feedback about
relevant and non-relevant documents [26].
Systems can use information about which documents are relevant
in many ways. Typically the weight given to terms occurring
in relevant documents is increased and the weight of terms
occurring in non-relevant documents is decreased.
Most of the tests using LSI have involved a method in which
the initial query is *replaced* with the vector sum
of the documents the user has selected as relevant. The
use of negative information has not yet been exploited in LSI;
for example, by moving the query away from documents
which the user has indicated are irrelevant.
Replacing the user's query with the first relevant document
improves performance by an average of 33% and replacing it
with the average of the first three relevant documents improves
performance by an average of 67% (see [7] for details).
Relevance feedback provides sizable and consistent retrieval advantages.
One way of thinking about the success of these methods is that many words
(those from relevant documents) augment the initial query which is
usually quite impoverished. LSI does some of this kind of query
expansion or enhancement even without relevance information, but
can be augmented with relevance information.
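A sketch of the replacement-style feedback just described, operating on hypothetical reduced-space vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
doc_vectors = rng.standard_normal((10, 2))   # 10 documents in a k=2 LSI space

# Replace the query with the average of the documents judged relevant.
relevant = [3, 7, 9]
q_fb = doc_vectors[relevant].mean(axis=0)

def rank(q, docs):
    """Order documents by cosine similarity to the query vector q."""
    cos = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    return np.argsort(-cos)

print(rank(q_fb, doc_vectors)[:3])
```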

Choosing the number of dimensions (**k**) for $A_k$ shown in
Figure 1 is an interesting problem. While
a reduction in **k** can remove much of the noise in word usage, keeping
too few dimensions may lose important structure in the term-document
associations.

Information filtering is a problem that is closely related to information retrieval [1]. In information filtering applications, a user has a relatively stable long-term interest or profile, and new documents are constantly received and matched against this standing interest. Selective dissemination of information, information routing, and personalized information delivery are also used to refer to the matching of an ongoing stream of new information to relatively stable user interests.

Applying LSI to information filtering applications is straightforward. An initial sample of documents is analyzed using standard LSI/SVD tools. A user's interest is represented as one (or more) vectors in this reduced-dimension LSI space. Each new document is matched against the vector, and if it is similar enough to the interest vector, it is recommended to the user. Learning methods like relevance feedback can be used to improve the representation of interest vectors over time.

Foltz [11] compared LSI and keyword vector methods for filtering
*Netnews* articles, and found 12% - 23% advantages for LSI.
Dumais and Foltz in [12] compared several different methods for
representing users' interests for filtering technical memoranda.
The most effective method used vectors derived from known
relevant documents (like relevance feedback) combined with
LSI matching.

Recently, LSI has been used for both information filtering and information retrieval in TREC (Text REtrieval Conference), a large-scale retrieval conference sponsored by NIST [8][9]. The TREC collection contains more than 1,000,000 documents (representing more than 30 gigabytes of ASCII text), 200 queries, and relevance judgements pooled from the return sets of more than 30 systems. The content of the collections varies widely ranging from news sources (AP News Wire, Wall Street Journal, San Jose Mercury News), to journal abstracts (Ziff Davis, DOE abstracts), to the full text of the Federal Register and U.S. Patents. The queries are very long and detailed descriptions, averaging more than 50 words in length. While these queries may be representative of information requests in filtering applications, they are quite unlike the short requests seen in previous IR collections or in interactive retrieval applications (where the average query is only one or two words long). The fact that the TREC queries are quite rich means that smaller advantages would be expected for LSI or any other methods that attempt to enhance users' queries.

The big challenge in this collection was to extend the LSI
tools to handle collections of this size. The results were
quite encouraging. At the time of the TREC conferences it was
not reasonable to compute $A_k$ from Figure 1 for the
complete collection. Instead,
a sample (see [8][9])
of about 70,000 documents and 90,000 terms was used. Such
term by document matrices (** A**) are
quite sparse, containing only 0.001-0.002% non-zero entries.
Computing $A_{200}$, i.e. the 200-largest singular values and
corresponding singular vectors, by a single-vector Lanczos algorithm
[3] required about 18 hours of CPU time on a SUN
SPARCstation 10 workstation.
Documents not in the original LSI analysis were subsequently folded-in to the reduced-dimension space (see Section 2.3).

Although it is very difficult to compare across systems in any detail because of large pre-processing, representation and matching differences, LSI performance was quite good [9]. For filtering tasks, using information about known relevant documents to create a vector for each query was beneficial. The retrieval advantage of 31% was somewhat smaller than that observed for other filtering tests and is attributable to the good initial queries in TREC. For retrieval tasks, LSI showed 16% improvement when compared with the keyword vector methods. Again the detailed original queries account for the somewhat smaller advantages than previously observed.

The computation of $A_k$ for the
large sparse TREC matrices ** A** was accomplished without difficulty
(numerical or convergence problems) using sophisticated
implementations of the Lanczos algorithm from SVDPACKC [3].
However, the computational and memory requirements posed by the
TREC collection greatly motivated the development of the SVD-updating
procedures discussed in Section 4.

Because LSI is a completely automatic method, it is widely applicable to new collections of texts (including different languages, as described below). The fact that both terms and documents are represented in the same reduced-dimension space adds another dimension of flexibility to the LSI retrieval model. Queries can be either terms (as in most information retrieval applications), documents, or combinations of the two (as in relevance feedback). Queries can even be represented as multiple points of interest [18]. Similarly, the objects returned to the user are typically documents, but there is no reason that similar terms could not be returned. Returning nearby terms is useful for some applications like online thesauri (that are automatically constructed by LSI), or for suggesting index terms for documents for publications which require them.

Although term-document matrices have been used for simplicity,
the LSI method can be applied to any descriptor-object matrix.
We typically use only single terms to describe documents,
but phrases or n-grams could also be included as rows in the matrix.
Similarly, an entire document is usually the text object of interest,
but smaller, more topically coherent units of text
(e.g., paragraphs, sections) could be represented as well.
For example, LSI has been incorporated as a *fuzzy search* option in NETLIB [6] for retrieving
algorithms, code descriptions, and short articles
from the NA-Digest electronic newsletter.

Regardless of how the original descriptor-object matrix is derived,
a reduced-dimension approximation can be computed.
The important idea in LSI is to go beyond the original descriptors
to more reliable statistically derived indexing dimensions.
The wide applicability of the LSI analysis is further illustrated
by describing several applications in more detail.

It is important to note that the LSI analysis makes no
use of English syntax or semantics. *Words* are identified
by looking for white spaces and punctuation in ASCII text.
Further, no stemming is used to collapse words with the same
morphology. If words with the same stem are used in similar
documents they will have similar vectors in the truncated
SVD defined in Figure 1; otherwise, they will not.
(For example, in analyzing an encyclopedia as done in
[4],
*doctor* is quite near *doctors* but not as similar to
*doctoral*.) This means that LSI is applicable to any language.
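The parsing just described (words delimited by white space and punctuation, with no stemming) can be sketched as follows; the documents and helper names are hypothetical:

```python
import re
from collections import Counter

def tokenize(text):
    # Words are identified purely by white space and punctuation;
    # no stemming, so "doctor" and "doctors" remain distinct terms.
    return re.findall(r"[a-z]+", text.lower())

docs = [
    "The doctor treated the patient.",
    "Doctors and nurses work in the hospital.",
]

# Build the term-by-document count matrix (rows: terms, cols: documents).
vocab = sorted({w for d in docs for w in tokenize(d)})
counts = [Counter(tokenize(d)) for d in docs]
A = [[c[t] for c in counts] for t in vocab]
print(vocab)
```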
In addition, it can be used for cross-language retrieval, in which
documents exist in several languages and user queries
(again in several languages) can match documents in any language.
What is required for cross-language applications is
a common space in which words from many languages are represented.

Landauer and Littman in [21] described one method for creating
such an LSI space. The original term-document matrix is formed
using a collection of abstracts that have versions in more
than one language (French and English, in their experiments).
Each abstract is treated as the *combination* of its French and
English versions. The truncated SVD is computed for this term by
combined-abstract matrix **A**.
The resulting space consists of combined-language abstracts,
English words and French words.
English words and French words which occur in similar combined abstracts
will be near each other in the reduced-dimension LSI space.
After this analysis, monolingual abstracts can be folded into the same
reduced-dimension space, so that queries and abstracts in either
language can be matched against one another.

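Folding a new monolingual abstract into such a combined-language space amounts to projecting its term vector through the truncated SVD factors, i.e. d_hat = d^T U_k diag(1/S_k). A minimal numpy sketch with random stand-in data (the real factors would come from the combined French-English abstract matrix):

```python
import numpy as np

# Stand-in for a truncated SVD already computed from the
# combined-language abstracts: A ~ Uk diag(Sk) Vt[:k].
rng = np.random.default_rng(0)
m, n, k = 6, 4, 2
A = rng.random((m, n))
U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk = U[:, :k], S[:k]

# A new monolingual abstract: a raw term-count vector over the same vocabulary.
d = rng.random(m)

# Fold the document into the k-dimensional space.
d_hat = d @ Uk @ np.diag(1.0 / Sk)
print(d_hat.shape)
```

As a sanity check, folding in a column of the original matrix with the full (untruncated) factors reproduces exactly the corresponding column of V^T.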
Landauer and Dumais [20] have recently used LSI spaces to model
some of the associative relationships observed in human memory.
They were interested in term-term similarities.
LSI is often described intuitively as a method for finding synonyms -
words which occur in similar patterns of documents will be near
each other in the LSI space even if they never co-occur in
a single document (e.g., *doctor*, *physician* both occur with many
of the same words like *nurse*, *hospital*, *patient*,
*treatment*, etc.).
Landauer and Dumais tested how well an LSI space would mimic the
knowledge needed to pass a synonym test. They used the synonym
test from ETS's Test Of English as a Foreign Language (TOEFL).
The test consists of 80 multiple choice test items each with
a stem word (e.g., *levied*) and
four alternatives (e.g., *imposed*, *believer*, *requested*,
*correlated*), one of which is the synonym.
An LSI analysis was performed on an encyclopedia represented
by a 61,000 word by 30,473 article matrix **A**.
For the synonym test they simply computed the similarity of
the stem word to each alternative and picked the closest one
as the synonym (for the above example, *imposed* would be chosen).

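The nearest-alternative decision rule can be sketched with cosine similarity over term vectors. The 2-D vectors below are hand-made stand-ins for the rows of the encyclopedia-derived U_k diag(S_k); in the real analysis they come from the 61,000-term SVD.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two term vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 2-D LSI term vectors (illustrative values only).
terms = {
    "levied":     np.array([0.9, 0.3]),
    "imposed":    np.array([0.8, 0.4]),
    "believer":   np.array([-0.2, 0.9]),
    "requested":  np.array([0.1, -0.8]),
    "correlated": np.array([-0.7, -0.3]),
}

stem = "levied"
alternatives = ["imposed", "believer", "requested", "correlated"]

# Pick the alternative whose vector lies closest to the stem word's.
answer = max(alternatives, key=lambda w: cosine(terms[stem], terms[w]))
print(answer)
```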
In a couple of applications, LSI has been used to return the
best matching *people* instead of documents.
In these applications, people were represented by articles
they had written. In one application [13],
known as the Bellcore Advisor, a system was developed to find
local experts relevant to users' queries. A query was matched
to the nearest documents and project descriptions, and the author's
organization was returned as the most relevant internal group.
In another application [10], LSI was used to
automate the assignment of reviewers to submitted conference papers.
Several hundred reviewers were described by means of texts they
had written, and this formed the basis of the LSI analysis.
Hundreds of submitted papers were represented by their abstracts,
and matched to the closest reviewers. These LSI similarities, along
with additional constraints to ensure that each paper was reviewed
**p** times and that each reviewer received no more than a fixed
number of papers, were used to make the final assignments.

Because LSI does not depend on literal keyword matching, it
is especially useful when the text input is *noisy*, as in OCR
(Optical Character Recognition), open input, or spelling errors.
If there are scanning errors and a word (*Dumais*) is misspelled
(as *Duniais*), many of the other words in the document will be
spelled correctly.
If these correctly spelled context words also occur in documents
which contained a correctly spelled version of *Dumais*, then *Dumais*
will probably be near *Duniais* in the **k**-dimensional space
determined by the truncated SVD (see Equation 2 or Figure 1).

Nielsen et al. in [23] used LSI to index a small collection
of abstracts input by a commercially available pen machine in
its standard recognizer mode. Even though the recognizer introduced
errors at the word level, information retrieval performance using LSI
was not disrupted (compared with the same uncorrupted texts).
Kukich [19] used LSI for a related problem, spelling correction.
In this application, the rows were unigrams and bigrams and the
columns were correctly spelled words.
An input word (correctly or incorrectly spelled) was broken down
into its unigrams and bigrams, the query vector was located at
the weighted vector sum of these elements, and the nearest word
in LSI space was returned as the suggested correct spelling.
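Kukich's setup can be sketched as follows, using raw unigram/bigram counts in place of the LSI-reduced representation; the lexicon and the uniform weighting are illustrative assumptions.

```python
import numpy as np

def ngrams(word):
    # Decompose a word into its unigrams and bigrams.
    return list(word) + [word[i:i + 2] for i in range(len(word) - 1)]

lexicon = ["dumais", "domain", "dermis"]   # hypothetical correctly spelled words
grams = sorted({g for w in lexicon for g in ngrams(w)})
gi = {g: i for i, g in enumerate(grams)}

# n-gram-by-word count matrix (rows: n-grams, cols: correctly spelled words).
A = np.zeros((len(grams), len(lexicon)))
for j, w in enumerate(lexicon):
    for g in ngrams(w):
        A[gi[g], j] += 1

def correct(word):
    # Query vector: sum of the input word's n-gram elements.
    q = np.zeros(len(grams))
    for g in ngrams(word):
        if g in gi:
            q[gi[g]] += 1
    # Return the nearest word by cosine similarity.
    sims = A.T @ q / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
    return lexicon[int(np.argmax(sims))]

print(correct("duniais"))
```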

Literal word matching results in surprisingly poor retrieval.
LSI can improve retrieval substantially by replacing individual
words with a smaller number of more robust statistically derived
indexing concepts. LSI is completely automatic and widely applicable,
including to different languages. Furthermore, since both terms
and documents are represented in the same space, both
queries and returned items can be either words or documents.
This flexibility has led to a growing number of novel applications. The
authors would welcome the opportunity to index a subset of *accepted*
HTML-formatted papers at **Supercomputing '95** for LSI demonstration
purposes.

There are a number of computational/statistical improvements that would make LSI even more useful, especially for large collections:

- computing the truncated SVD of extremely large sparse matrices (i.e., much larger than the usual 100,000 by 60,000 term-by-document matrix processed on RISC workstations with under 500 megabytes of RAM),
- performing SVD-updating (see Section 4) in real time for databases that change frequently, and
- efficiently comparing queries to documents (i.e., finding near neighbors in high-dimensional spaces).
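The last item is, at bottom, a fold-in followed by a nearest-neighbor search: the query's term vector is projected into the k-dimensional space and ranked against the document vectors by cosine. A minimal numpy sketch with random stand-in data (scaling document vectors by the singular values is one common convention, not necessarily the authors' exact choice):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 8, 5, 2

# Truncated SVD of a stand-in term-by-document matrix.
A = rng.random((m, n))
U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], S[:k], Vt[:k].T

# Documents as rows of Vk * diag(Sk); the query folds into the same space.
doc_vecs = Vk * Sk
q = rng.random(m)                    # raw query term counts
q_hat = q @ Uk @ np.diag(1.0 / Sk)   # query in the k-dimensional space

# Rank documents by cosine similarity to the query.
sims = doc_vecs @ q_hat / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat))
ranking = np.argsort(-sims)
print(ranking)
```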

A number of other researchers
are using related linear algebra methods for information
retrieval and classification work.
Schutze [27] and Gallant [14] have used SVD and related
dimension reduction ideas for word sense disambiguation and information
retrieval work. Hull [17] and Yang and Chute [29] have used
LSI/SVD as the first step in conjunction with statistical classification
(e.g. discriminant analysis).
Using the LSI-derived dimensions effectively
reduces the number of predictor variables for classification.
Wu et al. in [28] also used LSI/SVD to reduce the
training set dimension for a neural network protein
classification system used in human genome research.

The authors would like to thank Gavin O'Brien
at the National Institute of Standards and Technology
(NIST) for his help with the SVD-updating software
and algorithm design, and David Buck (Ontario, Canada)
for use of the DKBTrace Ray-Tracer Version 2.12 in the
3-D animation of document updating for LSI. This
research was supported in part by the
National Science Foundation under grant Nos. NSF-CDA-9115428 and
NSF-ASC-92-03004.

**[1]** N. J. Belkin and W. B. Croft, *Information filtering and information retrieval: Two sides of the same coin?*, Communications of the ACM, 35 (1992), pp. 29-38.
**[2]** M. W. Berry, *Large scale singular value computations*, International Journal of Supercomputer Applications, 6 (1992), pp. 13-49.
**[3]** M. W. Berry et al., *SVDPACKC: Version 1.0 User's Guide*, Tech. Rep. CS-93-194, University of Tennessee, Knoxville, TN, October 1993.
**[4]** M. W. Berry, S. T. Dumais, and A. T. Shippy, *A case study of latent semantic indexing*, Tech. Rep. CS-95-271, University of Tennessee, Knoxville, TN, January 1995.
**[5]** S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, *Indexing by latent semantic analysis*, Journal of the American Society for Information Science, 41 (1990), pp. 391-407.
**[6]** J. J. Dongarra and E. Grosse, *Distribution of mathematical software via electronic mail*, Communications of the ACM, 30 (1987), pp. 403-407.
**[7]** S. T. Dumais, *Improving the retrieval of information from external sources*, Behavior Research Methods, Instruments, & Computers, 23 (1991), pp. 229-236.
**[8]** S. T. Dumais, *LSI meets TREC: A status report*, in The First Text REtrieval Conference (TREC1), D. Harman, ed., March 1993, pp. 137-152. National Institute of Standards and Technology Special Publication 500-207.
**[9]** S. T. Dumais, *Latent Semantic Indexing (LSI) and TREC-2*, in The Second Text REtrieval Conference (TREC2), D. Harman, ed., March 1994, pp. 105-116. National Institute of Standards and Technology Special Publication 500-215.
**[10]** S. T. Dumais and J. Nielsen, *Automating the assignment of submitted manuscripts to reviewers*, in SIGIR'92: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, N. Belkin, P. Ingwersen, and A. M. Pejtersen, eds., Copenhagen, Denmark, June 1992, ACM Press, pp. 233-244.
**[11]** P. W. Foltz, *Using Latent Semantic Indexing for information filtering*, in Proceedings of the ACM Conference on Office Information Systems (COIS), 1990, pp. 40-47.
**[12]** P. W. Foltz and S. T. Dumais, *Personalized information delivery: An analysis of information filtering methods*, Communications of the ACM, 35 (1992), pp. 51-60.
**[13]** G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum, *Information retrieval using a singular value decomposition model of latent semantic structure*, in Proceedings of SIGIR, 1988, pp. 465-480.
**[14]** S. I. Gallant, *A practical approach for representing contexts and for performing word sense disambiguation using neural networks*, Neural Computation, 3 (1991), pp. 293-309.
**[15]** G. Golub and C. Van Loan, *Matrix Computations*, Johns Hopkins University Press, Baltimore, second ed., 1989.
**[16]** G. Golub and C. Reinsch, *Handbook for Automatic Computation II, Linear Algebra*, Springer-Verlag, New York, 1971.
**[17]** D. Hull, *Improving text retrieval for the routing problem using Latent Semantic Indexing*, in Proceedings of the Seventeenth Annual International ACM-SIGIR Conference, 1994, pp. 282-291.
**[18]** Y. Kane-Esrig, L. Streeter, S. T. Dumais, W. Keese, and G. Casella, *The relevance density method for multi-topic queries in information retrieval*, in Proceedings of the 23rd Symposium on the Interface, E. Keramidas, ed., 1991, pp. 407-410.
**[19]** K. Kukich, *A comparison of some novel and traditional lexical distance metrics for spelling correction*, in Proceedings of INNC-90-Paris, 1990, pp. 309-313.
**[20]** T. K. Landauer and S. T. Dumais, *Latent Semantic Analysis and the measurement of knowledge*, in Proceedings of the First Educational Testing Service Conference on Applications of Natural Language Processing in Assessment and Education, 1994. To appear.
**[21]** T. K. Landauer and M. L. Littman, *Fully automatic cross-language document retrieval using latent semantic indexing*, in Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, Waterloo, Ontario, 1990, pp. 31-38.
**[22]** L. Mirsky, *Symmetric gauge functions and unitarily invariant norms*, Quarterly Journal of Mathematics, 11 (1960), pp. 50-59.
**[23]** J. Nielsen, V. L. Phillips, and S. T. Dumais, *Retrieving imperfectly recognized handwritten notes*, Behaviour and Information Technology, (1994). Submitted.
**[24]** G. W. O'Brien, *Information Management Tools for Updating an SVD-Encoded Indexing Scheme*, Master's thesis, The University of Tennessee, Knoxville, TN, 1994.
**[25]** G. Salton, *Automatic Information Organization and Retrieval*, McGraw-Hill, New York, 1968.
**[26]** G. Salton and C. Buckley, *Improving retrieval performance by relevance feedback*, Journal of the American Society for Information Science, 41 (1990), pp. 288-297.
**[27]** H. Schutze, *Dimensions of meaning*, in Proceedings of Supercomputing '92, 1992, pp. 787-796.
**[28]** C. Wu, M. Berry, S. Shivakumar, and J. McLarty, *Neural networks for full-scale protein sequence classification: Sequence encoding with singular value decomposition*, Machine Learning, (1995). To appear.
**[29]** Y. Yang and C. G. Chute, *An application of least squares fit mapping to text information retrieval*, in Proceedings of the Sixteenth Annual International ACM-SIGIR Conference, 1993, pp. 281-290.
**[30]** P. G. Young, *Cross-Language Information Retrieval Using Latent Semantic Indexing*, Master's thesis, The University of Tennessee, Knoxville, TN, 1994.

Michael Berry (berry@cs.utk.edu)

Fri Feb 24 19:58:34 EST 1995