[U,D,V] = svd(A)the columns of U are sometimes called "the left singular vectors of A" and the columns of V are called "the right singular vectors of A". Normally document preparation and processing is far more important and time-consuming in LSI than the mathematical information reduction used, but we'll skirt that issue by doing very primitive processing via shell scripts.
#! /bin/csh foreach f (`cat manpages`) echo $f man $f \ | col -b \ | tr -cs 'A-Za-z' '\012' \ | tr 'A-Z' 'a-z' \ | egrep '^..*$' \ | egrep -v '^([a-z]|the|or|of|it|above|below|after|an|and)$'\ | egrep -v '^(on|not|that|used|use|see|then)$'\ | egrep -v '^(may|with|as|are|but|does|one|have|no)$'\ | egrep -v '^(in|if|to|from|is|be|by|use|do|like|only|you)$'\ | sort | uniq -c | sort -n | tail -40 | sort -k 2 > $f.doc endNote that on the nations cluster there are at least three different "tr" commands, and they have different behaviour. Be sure to use the one in /usr/local/bin/tr either by setting your path, or modifying the script to explicitly refer to the right one. Also note that you should be able to explain what each of the commands above does - perhaps by referring to the man pages for those commands (col, tr, egrep, sort, ....).
#! /bin/csh cat $argv[*] |\ tr -cs 'A-Za-z' '\012' |\ tr 'A-Z' 'a-z' | egrep '^..*$' |\ sort |\ uniq |\ sort > dictionarywhich is run by "./create_dict *.doc" in the directory where all the man page term lists are. Make sure you don't also read in any Microsoft Word files by accident; you might do this by putting all the man page term list files into a subdirectory "docs". The file resulting from running create_dict (called "dictionary") will contain all the terms from the lists - a more realistic approach might be to use the word counts and just have the 200 most commonly occuring ones.
The dictionary file should then be processed to be readable from Matlab. You can either edit the script above, write another script, or use a text editor to create a file dictionary.m that uses strvcat to create a matrix of the dictionary words:
dict = strvcat(... 'accent',... 'access',... 'accurate',... 'acm',... 'active',... 'addr',... 'address',... 'addressed',... ... 'zip');Similarly, create a Matlab list of the files containing the term lists, e.g.,
doclist = strvcat(... 'docs/arch.doc',... 'docs/ash.doc',... 'docs/awk.doc',... 'docs/basename.doc',... 'docs/bash.doc',... 'docs/bsh.doc',... ... 'docs/zcat.doc');
The way things are set up now, dict and doclist are Matlab string arrays. Those have to be rectangular, so strvcat will pad any short words with blanks so that every row has the same number of characters. [Another way of approaching this exercise would be to use Matlab cells and/or structs]. To strip away the blanks, we'll have to use deblank(), and to compare two strings we'll use strcmp(). [We cannot use the Matlab comparison == for this; why?].
First, read in the document list and the dictionary, and set up a matrix of all zeros of the right size:
dictionary docs nterms = length(dict); ndocs = length(doclist); A = zeros(nterms,ndocs);Next, go through each document (corresponding to each column of A) and get the word counts and their location within the column. Normalize each column, and then compute the SVD of A:
for k = 1:ndocs [counts,words] = getdoc(deblank(doclist(k,:))); I = findrows(dict, words); A(I,k) = counts; A(:,k) = A(:,k)/norm(A(:,k)); end [U,D,V] = svd(A,0);The helper functions getdoc.m and findrows.m do this.
qterms = strvcat('all', 'files', 'list'); I = findrows(dict,qterms); q = zeros(nterms,1); q(I) = 1; q = q/norm(q);Now you can find the document that best matches the query vector in the straightforward way described in class:
cosines = A'*q; [a,b] = max(cosines); disp(['Best match for list,files,all: ',doclist(b,:)])
Examine the singular values from decomposing the matrix. Is the rank of the matrix A obvious from looking at them (i.e., is there a clear drop-off to an number effectively zero when going from some singular value sr to the next sr+1? What is the rank? You can view a "movie" of the right singular vectors by the loop
for k = 1:ndocs plot(V(:,k), '+') axis([0 ndocs+1 -1 1]) title(['Right Singular Vector Number ', num2str(k)]) drawnow pause(1) endDoes anything interesting happen when you reach the drop-off place for the size of singular values? If so, explain it in the terms/documents terminology.
Next, examine how well using a low-rank approximation works for finding query matches. For this, form the corresponding query vectors q1, q2, q3, and q4 from the four sets of query terms