Latent Semantic Indexing
(or LSI) is a concept-based information retrieval model. Terms and documents
are both encoded in a vector space representation so that documents may be
clustered semantically near one another even when they share no common terms.
LSI addresses the two fundamental problems that plague traditional
lexical-matching indexing schemes: synonymy and polysemy.
Content Analyst Company, LLC owns the original patent to LSI:
Computer information retrieval using latent semantic structure
U.S. Patent No. 4,839,853, June 13, 1989.
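The core LSI computation can be sketched with a truncated SVD. The toy example below (plain NumPy, not SVDPACK; the matrix and vocabulary are invented for illustration) shows two documents that share no terms landing close together in the latent space because a third document bridges the synonyms:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# d0 uses only "car", d1 only "automobile": no shared terms.
# d2 bridges the two synonyms; d3 covers an unrelated topic.
terms = ["car", "automobile", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

# Rank-k truncated SVD: A ~= U_k diag(s_k) V_k^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs = (np.diag(s[:k]) @ Vt[:k]).T   # document vectors in the latent space

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# d0 and d1 share no terms, yet are nearly identical in LSI space,
# while d0 and d3 remain unrelated.
print(cosine(docs[0], docs[1]))   # close to 1
print(cosine(docs[0], docs[3]))   # close to 0
```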
SVDPACK comprises four numerical (iterative) methods for computing the singular
value decomposition (SVD) of large sparse matrices using double precision ANSI
Fortran-77. A compatible ANSI-C version (SVDPACKC) is also available. SVDPACK
and SVDPACKC implement Lanczos and subspace iteration-based methods for
determining several of the largest singular triplets for large sparse matrices.
The development of SVDPACK was motivated by the need to compute large-rank
approximations to sparse term-document matrices from information
retrieval applications such as Latent Semantic Indexing
(described above). SVDPACKC was used in the InfoMap project
developed in the Computational Semantics Laboratory.
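The kind of computation SVDPACK performs can be approximated today with SciPy's `svds`, which likewise uses a Lanczos-type method (ARPACK) to extract the largest singular triplets of a large sparse matrix without densifying it. A small sketch on a random sparse matrix (sizes and density are arbitrary):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A random 1000 x 500 sparse "term-document" matrix (1% nonzeros).
A = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)

# svds (ARPACK: implicitly restarted Lanczos/Arnoldi) finds the
# k largest singular triplets using only sparse matrix-vector products.
k = 6
U, s, Vt = svds(A, k=k)
s = s[::-1]  # svds returns singular values in ascending order

# Check against a dense SVD (feasible only at this toy size).
s_dense = np.linalg.svd(A.toarray(), compute_uv=False)[:k]
print(np.max(np.abs(s - s_dense)))
```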
The Integrated Modeling Project (IMP) sponsored by the
Environmental Impacts Program of the USDA Forest Service
is an integrated forest health and productivity
assessment of southern and southeastern forests in relation to
changing climate, air quality, and land use. The
primary research focus of Prof. Michael W. Berry and Research
Associate Karen S. Minser (Dept. of Computer Science)
is the development of a problem-solving environment or PSE
which facilitates the horizontal integration of
forest responses to environmental stresses and disturbances
through the use of micro-scale cellular automata.
The Interactive Cluster
Analysis Toolkit (or ICAT)
utilizes the Enhanced Hoshen-Kopelman algorithm to provide a highly adaptable
method for cluster analysis. Within the context of diabetic retinopathy,
different neighborhood rules implemented within ICAT
provide better approaches for
classifying retinal features such as neovascularization and
exudates. The flexible design of ICAT allows
new metrics for characterizing cluster geometry or new neighborhood rules
for cluster identification to be easily incorporated.
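The Enhanced Hoshen-Kopelman algorithm and ICAT's configurable neighborhood rules are not reproduced here, but the classical Hoshen-Kopelman labeling on which they build is a one-pass union-find scan over a grid. A minimal sketch with the standard 4-neighborhood:

```python
import numpy as np

def hoshen_kopelman(grid):
    """Label 4-connected clusters of nonzero cells in a 2D binary grid
    using the classical Hoshen-Kopelman (union-find) algorithm."""
    labels = np.zeros_like(grid, dtype=int)
    parent = [0]  # parent[i] = parent of label i (label 0 unused)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
        return min(ra, rb)

    next_label = 1
    rows, cols = grid.shape
    for i in range(rows):
        for j in range(cols):
            if not grid[i, j]:
                continue
            up = labels[i - 1, j] if i > 0 else 0
            left = labels[i, j - 1] if j > 0 else 0
            if up and left:
                labels[i, j] = union(up, left)  # merge the two clusters
            elif up or left:
                labels[i, j] = up or left       # extend an existing cluster
            else:
                parent.append(next_label)       # start a new cluster
                labels[i, j] = next_label
                next_label += 1
    # Second pass: resolve label equivalences to canonical roots.
    for i in range(rows):
        for j in range(cols):
            if labels[i, j]:
                labels[i, j] = find(labels[i, j])
    return labels

grid = np.array([[1, 0, 1, 0],
                 [1, 1, 1, 0],
                 [0, 0, 0, 1]])
labels = hoshen_kopelman(grid)
print(len(np.unique(labels[labels > 0])))  # 2 clusters
```

New neighborhood rules (e.g., 8-connectivity) amount to changing which previously scanned cells are inspected at each step.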
RSim is a Regional Simulation model designed
to integrate environmental effects of on-base military training and
testing as well as off-base development. Effects considered include
air and water quality, noise, and habitats for endangered and game
species. A risk assessment approach is being used to determine
impacts of single and integrated risks. The RSim simulation
will eventually be available on the Web and will be used in a
gaming mode so that users can explore repercussions of
military and land-use decisions. RSim is currently being developed
for the region around Fort Benning, Georgia but is broadly applicable.
This project is sponsored by the
Strategic Environmental Research &
Development Program (SERDP) -- an initiative funded by the
U.S. Departments of Energy and Defense and the U.S. Environmental
Protection Agency (EPA). A
for RSim is currently under development.
The Land-Use Change Analysis System (LUCAS) simulates land-cover changes
in a heterogeneous (distributed) computing environment. LUCAS generates
new maps of land cover representing the amount of
land-cover change so that issues such as biodiversity
conservation, assessing the importance of landscape
elements to meet conservation goals, and long-term
landscape integrity can be addressed.
Computer Science and Engineering:
Dr. Michael W. Berry is
serving as the Applications area editor of the Encyclopedia of
Computer Science and Engineering (Wiley Interscience) which is
being edited by
Prof. Benjamin Wah
at the University of Illinois at Urbana-Champaign. Publication
anticipated for 2004.
Three-Day Seminar Course on Information Retrieval,
Matemáticas, Universidad Autónoma de
Yucatán (UADY), Mérida, México.
News Release. All links are PowerPoint files (password protected).
Whole Genome Phylogeny:
As whole genome sequences continue to expand in number and
complexity, effective methods for comparing and categorizing both genes
and species represented within extremely large datasets are required.
Current methods have generally utilized incomplete (and likely
insufficient) subsets of the available data even as additional data
becomes available at
a rapid rate. In collaboration with Prof. Gary Stuart at Indiana
State University, an accurate and efficient method for
producing robust gene and species phylogenies using very large whole genome
protein datasets has been developed.
This method relies on multidimensional protein vector
definitions supplied by the singular value decomposition (SVD) of
large sparse data matrices in which each protein is uniquely represented as a
vector of overlapping tetrapeptide frequencies. The link above points to
presentation slides shown on March 23 at the
Bioinformatics Summit 2002, and an updated presentation
was made at a
School of Informatics Colloquium on Nov. 14, 2003 (audio/slides).
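The tetrapeptide encoding described above is straightforward to sketch. The example below uses a reduced 4-letter alphabet and invented toy sequences to keep the matrix small; real proteins use the 20 amino acids, giving 20^4 = 160,000 possible tetrapeptides and hence very sparse matrices:

```python
import numpy as np
from itertools import product

def tetrapeptide_vector(protein, index):
    """Count overlapping 4-mers (tetrapeptides) in a protein sequence."""
    v = np.zeros(len(index))
    for i in range(len(protein) - 3):
        kmer = protein[i:i + 4]
        if kmer in index:
            v[index[kmer]] += 1
    return v

# Reduced toy alphabet; real work uses all 20 amino acid letters.
alphabet = "ACDE"
index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=4))}

proteins = ["ACDEACDE", "ACDEACDA", "EDCAEDCA"]  # invented toy sequences
A = np.column_stack([tetrapeptide_vector(p, index) for p in proteins])

# The SVD of the tetrapeptide-by-protein matrix yields low-dimensional
# protein vectors whose pairwise similarities can feed a distance-based
# phylogeny method.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
prot_vecs = (np.diag(s[:k]) @ Vt[:k]).T
print(prot_vecs.shape)  # (3, 2)
```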
Understanding the functional relationship between genes remains a
major challenge in the interpretation of genomic data. Bioinformatics tools
to automate extraction and utilization of gene information from the
biological databases and the scientific literature are being developed.
We present a new software environment called Semantic Gene Organizer
© (SGO) which utilizes Latent Semantic Indexing (LSI), a
concept-based vector space model, to automatically extract gene
relationships from titles and abstracts in MEDLINE citations.
We have developed a Web-based bioinformatics tool called Feature
Annotation Using Nonnegative matrix factorization
(FAUN) to facilitate
both the discovery and classification of functional relationships among
genes. Both the computational complexity and parameterization of
nonnegative matrix factorization (NMF) for processing gene sets are
currently being investigated.
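FAUN's NMF machinery is not reproduced here, but the underlying factorization can be sketched with the standard Lee-Seung multiplicative updates for the Frobenius objective (toy matrix and parameters invented for illustration):

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (terms x genes) as V ~= W H using
    Lee-Seung multiplicative updates for the Frobenius norm."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update coefficients
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis (features)
    return W, H

# Toy nonnegative "term-gene" matrix with two obvious feature groups.
V = np.array([[3., 3., 0., 0.],
              [2., 2., 0., 0.],
              [0., 0., 4., 4.],
              [0., 0., 1., 1.]])
W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err)  # small relative reconstruction error
```

The columns of W act as interpretable "features" (here, the two term groups), which is what makes NMF attractive for annotating gene sets; the choice of rank k is one of the parameterization questions mentioned above.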
FAUN has been tested on several manually constructed gene
collections (size ranging from 50 to 800 genes) and has been particularly
engineered to analyze several microarray-derived gene sets obtained
from the developing cerebellum in normal and mutant mice.
FAUN provides utilities for collaborative knowledge discovery and
identification of new gene relationships from text streams and
repositories (e.g., MEDLINE). It
is particularly useful for the validation and analysis of gene
associations suggested by microarray experimentation. A video
about NIMBioS with Elina Tjioe demonstrating FAUN is available. This project
is supported by the Gene
Regulation in Time & Space project (funded by the NIH).
Retreat Poster (March 14, 2008, 4.7MB ppt)
UT-ORNL-KBRIN Poster (March 28-30, 2008); published in BMC Bioinformatics
July 8, 2008
The Grid Computing for Ecological Modeling and Spatial Control of Wildfires
project is a National Science Foundation (NSF) funded research project
which began in 2005 and concluded in 2008. The project involved several
students and postdoctoral fellows who developed multiple fire-spread
models and methods to evaluate how spatial control might be used
to limit the spread of a wildfire. The software
simulated a fire starting at a variety of possible burnable locations on a
map. The fire would then spread based upon burnable/non-burnable
(green/black) areas in the map, in the simplest case, with the possibility
of including a local fire load which would affect the magnitude of local
burns, as well as the probability of spread. The unique aspect of this
project involved the computation for optimal placement of a fire break with
the objective of enclosing the fire and sparing as much of the region as
possible from burning. The overall goal of the project was to improve the
accuracy of responses to fire spread, to develop effective control
strategies, and to produce a method that might be useful in training
fire-suppression personnel.
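The simplest case described above (burnable/non-burnable spread with a probability of ignition) can be sketched as a small cellular automaton; this is an illustrative toy, not one of the project's actual models, and the grid, break placement, and parameters are invented:

```python
import numpy as np

# Cell states: 0 = non-burnable, 1 = burnable (green), 2 = burning, 3 = burned.
def spread_fire(grid, ignite, p_spread=1.0, steps=50, seed=0):
    """Spread fire from an ignition cell over a burnable grid; each step,
    a burning cell ignites each burnable 4-neighbor with prob. p_spread."""
    rng = np.random.default_rng(seed)
    g = grid.copy()
    g[ignite] = 2
    for _ in range(steps):
        burning = np.argwhere(g == 2)
        if burning.size == 0:
            break  # fire is out
        for i, j in burning:
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if (0 <= ni < g.shape[0] and 0 <= nj < g.shape[1]
                        and g[ni, nj] == 1 and rng.random() < p_spread):
                    g[ni, nj] = 2
        g[tuple(burning.T)] = 3  # this step's burning cells burn out
    return g

grid = np.ones((5, 5), dtype=int)
grid[:, 2] = 0  # a vertical fire break of non-burnable cells
result = spread_fire(grid, ignite=(2, 0))
print((result[:, 3:] == 1).all())  # cells beyond the break are spared
```

Optimal placement of such a break, the project's distinctive contribution, then becomes a search over candidate break geometries scored by how much of the region each spares.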
Python for Biologists:
The intent of this tutorial, created from a COSC 670 course project
during Spring Semester 2012, is to
introduce computational biologists to some useful features of the
Python programming language for problem solving.
This material is intended to accompany a one-day, in-person,
hands-on workshop and to serve as a post-workshop resource for workshop