Exercise 6, Computer Science P573, Fall 2001


This experiment explores the singular value decomposition in the context of latent semantic indexing (LSI). A bit of nomenclature first: in the SVD, written in Matlab syntax as
   [U,D,V] = svd(A)
the columns of U are sometimes called "the left singular vectors of A" and the columns of V "the right singular vectors of A". Normally, document preparation and processing are far more important and time-consuming in LSI than the mathematical information reduction, but we'll skirt that issue by doing only very primitive processing via shell scripts.
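As a quick sanity check of this factorization (a minimal sketch on a small random matrix, not part of the exercise proper):

B = rand(5,3);              % any small test matrix
[U,D,V] = svd(B);
norm(B - U*D*V')            % should be near machine epsilon
norm(U'*U - eye(5))         % columns of U are orthonormal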

The Documents

You will process a set of man pages listed in the file manpages. Those were chosen primarily to be man pages for commands in /bin from the RedHat Linux installation, with a few other common ones added. The following csh script reads the list from the file and then, for each man page X in it, creates a file called X.doc. That file contains a list of the 40 (or fewer) most frequently occurring terms, with some common "stop words" removed. The csh script is create_docs:
#! /bin/csh
# For each man page listed in the file "manpages": strip the formatting,
# put one lowercased word per line, drop blank lines and stop words, and
# keep the 40 most frequent terms, sorted alphabetically with their counts.
foreach f (`cat manpages`)
   echo $f
   man $f \
   | col -b \
   | tr -cs 'A-Za-z' '\012' \
   | tr 'A-Z' 'a-z' \
   | egrep '^..*$' \
   | egrep -v '^([a-z]|the|or|of|it|above|below|after|an|and)$'\
   | egrep -v '^(on|not|that|used|use|see|then)$'\
   | egrep -v '^(may|with|as|are|but|does|one|have|no)$'\
   | egrep -v '^(in|if|to|from|is|be|by|do|like|only|you)$'\
   | sort | uniq -c | sort -n | tail -40 | sort -k 2 > $f.doc
end
Note that on the nations cluster there are at least three different "tr" commands, and they have different behaviour. Be sure to use the one in /usr/local/bin/tr, either by setting your path or by modifying the script to refer explicitly to the right one. Also note that you should be able to explain what each of the commands above does, perhaps by referring to the man pages for those commands (col, tr, egrep, sort, ...).

The Dictionary

You next need to create the dictionary for the set of terms that will be used. A csh script for doing this is create_dict:
#! /bin/csh
# Merge the words from all the term-list files given as arguments
# into a single sorted list of unique terms.
cat $argv[*] |\
	tr -cs 'A-Za-z' '\012' |\
	tr 'A-Z' 'a-z' | egrep '^..*$' |\
	sort |\
	uniq > dictionary
which is run by "./create_dict *.doc" in the directory containing all the man page term lists. Make sure you don't also read in any Microsoft Word files by accident (they use the same .doc extension); one way to avoid this is to put all the man page term-list files into a subdirectory "docs". The file resulting from running create_dict (called "dictionary") will contain all the terms from the lists; a more realistic approach might be to use the word counts and keep just the 200 most commonly occurring ones.

The dictionary file should then be processed to be readable from Matlab. You can either edit the script above, write another script, or use a text editor to create a file dictionary.m that uses strvcat to build a character matrix of the dictionary words:

dict = strvcat(...
              'accent',...
              'access',...
              'accurate',...
              'acm',...
              'active',...
              'addr',...
              'address',...
              'addressed',...
               ...
              'zip');
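If you would rather not build dictionary.m by hand, an alternative (a sketch, assuming the textread function is available in your Matlab installation) is to read the word list directly from the dictionary file:

w = textread('dictionary', '%s');  % one term per line, read into a cell array
dict = strvcat(w{:});              % stack into a blank-padded character matrix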
Similarly, create a Matlab list of the files containing the term lists (put it in a file docs.m so it can be run as a script below), e.g.,
doclist = strvcat(...
              'docs/arch.doc',...
              'docs/ash.doc',...
              'docs/awk.doc',...
              'docs/basename.doc',...
              'docs/bash.doc',...
              'docs/bsh.doc',...
              ...
              'docs/zcat.doc');

Creating The Term-Document Matrix

Within Matlab, we next want to create the term-document matrix, with rows corresponding to terms in the dictionary and columns corresponding to documents (the man page .doc files). The (i,j) entry in that matrix is the number of times term i appears in document j. Most of the entries will be zero, since a typical term appears in only a few of the documents.

The way things are set up now, dict and doclist are Matlab string arrays. Those have to be rectangular, so strvcat will pad any short words with blanks so that every row has the same number of characters. [Another way of approaching this exercise would be to use Matlab cells and/or structs]. To strip away the blanks, we'll have to use deblank(), and to compare two strings we'll use strcmp(). [We cannot use the Matlab comparison == for this; why?].
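A small illustration of the padding, and of why deblank() is needed:

s = strvcat('ls', 'grep');        % 'ls' is padded with blanks to 4 characters
strcmp(s(1,:), 'ls')              % returns 0: the padded row does not match
strcmp(deblank(s(1,:)), 'ls')     % returns 1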

First, read in the dictionary and the document list, and set up a matrix of all zeros of the right size:

dictionary                % runs dictionary.m, defining dict
docs                      % runs docs.m, defining doclist
nterms = size(dict,1);    % one row per term (length() would be unreliable here)
ndocs = size(doclist,1);  % one row per document
A = zeros(nterms,ndocs);
Next, go through each document (corresponding to a column of A) and fill in the word counts at the proper rows of that column. Normalize each column, and then compute the (economy-size) SVD of A:
for k = 1:ndocs
    [counts,words] = getdoc(deblank(doclist(k,:)));  % read counts and terms
    I = findrows(dict, words);       % rows of dict where those terms appear
    A(I,k) = counts;
    A(:,k) = A(:,k)/norm(A(:,k));    % normalize the column
end
[U,D,V] = svd(A,0);                  % economy-size SVD
The helper functions getdoc.m and findrows.m do the file reading and the dictionary lookup, respectively.
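For reference, here is a minimal sketch of the kind of thing getdoc.m must do (the supplied helper may be implemented differently); it assumes each line of a .doc file has the "count term" layout produced by uniq -c above:

function [counts, words] = getdoc(fname)
% GETDOC  Read one "count term" pair per line from a term-list file.
[counts, w] = textread(fname, '%d %s');  % counts as a vector, terms as a cell array
words = strvcat(w{:});                   % blank-padded matrix, one term per row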

Query Matching

To match a query against the entire matrix, create a term array, get the indices of its terms in the dictionary, and set up the corresponding normalized query vector:
     qterms = strvcat('all', 'files', 'list');
     I = findrows(dict,qterms);
     q = zeros(nterms,1);
     q(I) = 1;
     q = q/norm(q);
Now you can find the document that best matches the query vector in the straightforward way described in class:
     cosines = A'*q;
     [a,b] = max(cosines);
     disp(['Best match for list,files,all: ',doclist(b,:)])

The Actual Assignment

All of the above is mainly to get the term-document matrix. Your first task is to fix my findrows() function. It assumes the dictionary and the list of terms handed to it are in alphabetical order, which is alright - the scripts do the necessary sorting. However, it also assumes every term will be found in the dictionary; modify it so that this is not a requirement.
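A hypothetical test of the modified function (this tiny dictionary is made up for illustration); one reasonable contract is for unmatched terms simply to be skipped:

smalldict = strvcat('cat', 'dog', 'fish');
qterms    = strvcat('bird', 'cat', 'fish');  % 'bird' is not in smalldict
I = findrows(smalldict, qterms)
% a reasonable result: I = [1; 3], the rows of 'cat' and 'fish' only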

Examine the singular values from decomposing the matrix. Is the rank of the matrix A obvious from looking at them (i.e., is there a clear drop-off to values that are effectively zero when going from some singular value s_r to the next one, s_{r+1})? What is the rank? You can view a "movie" of the right singular vectors with the loop

for k = 1:ndocs
	plot(V(:,k), '+')
	axis([0 ndocs+1  -1  1])
	title(['Right Singular Vector Number ', num2str(k)])
	drawnow
	pause(1)
end
Does anything interesting happen when you reach the drop-off point in the singular values? If so, explain it using the terms/documents terminology.
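It may also help to plot the singular values themselves on a logarithmic scale when looking for the drop-off (a minimal sketch):

s = diag(D);                    % the singular values, largest first
semilogy(1:length(s), s, 'o')
title('Singular values of A')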

Next, examine how well using a low-rank approximation works for finding query matches. For this, form the corresponding query vectors q1, q2, q3, and q4 from the four sets of query terms

  1. all, files, list
  2. vi, edit, show
  3. compress, history, zip, bzip2, store
  4. program, purpose, install, name, report, there, better
[For the last one, the terms are fairly common and may not have a strong match even with the full matrix!] See what the top five matching documents are for each of those (a sketch for doing this follows the list below), using
  1. the SVD rank-1 approximation to A
  2. the SVD rank-4 approximation to A
  3. the SVD rank-8 approximation to A
  4. the SVD rank-16 approximation to A
  5. the SVD rank-32 approximation to A
  6. the full matrix A
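One way to set this up (a sketch; here k and q stand for whichever rank and query vector you are testing):

k  = 8;                               % e.g., the rank-8 case
Ak = U(:,1:k)*D(1:k,1:k)*V(:,1:k)';   % best rank-k approximation to A
cosines = Ak'*q;
[c,order] = sort(-cosines);           % negate so the sort is descending
disp(doclist(order(1:5),:))           % the five best-matching man pages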
How drastically does the approximation change the matched pages? According to theory, the square of the Frobenius norm of the difference between the original matrix and its rank-k SVD approximation equals the sum of squares of the neglected singular values. How does that error vary as you increase the rank of the approximant in the six cases above? Does it seem to have a direct correspondence with the quality of the matching documents?
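The identity is easy to check numerically (a short sketch, reusing Ak and k from above):

s = diag(D);
norm(A - Ak, 'fro')^2     % squared Frobenius error of the rank-k approximation
sum(s(k+1:end).^2)        % sum of squares of the neglected singular values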


Mon Oct 29 09:49:17 EST 2001