next up previous
Next: Sample Hypertext Matrices Up: Reordering Techniques Previous: Reordering Techniques

3.1. Metrics for Evaluating Term-Document Matrix Reorderings

 

Term-document matrices are sparse, nonsymmetric and typically, highly overdetermined. As mentioned above, it is desirable that these matrices be reordered so that nonzeroes are clustered evenly about a line from the upper left to the lower right corner of the matrix. This line, though visually a diagonal, is not the conventional diagonal of a nonsquare matrix. We define metrics suitable for evaluating reorderings by adapting some well established metrics used in symmetric matrix computations.

The bandwidth () and envelope size ( ) are two measures used in the context of symmetric sparse matrices reordered to a band form. Let C be an symmetric matrix; , the bandwidth of row i, is the difference between i (the row number) and the smallest column number j such that . Let be the maximum of bandwidth values over all rows, i.e., . The bandwidth is defined as . We chose this definition over other alternatives (such as defining the bandwidth as ) because its natural extension to overdetermined term-document matrices seems more suitable as a metric for evaluating reorderings. The envelope of C incorporates the variation in bandwidth over all rows. The envelope size is defined by

Observe that these terms capture the distance from the diagonal to the farthest nonzero on each row of the matrix.

To extend these definitions to evaluate reorderings of a nonsymmetric term-document matrix A, we consider the straight line from the upper left (row 1, column 1) to the lower right (row m, column n) corner. The equation to this visual diagonal line can be easily computed; the diagonal subscript of a row is defined as the abscissa obtained using as the ordinate value in this equation. To account for nonsymmetry, we define , the bandwidth of row i, as the largest difference between and any column number j such that . The bandwidth and the envelope size are as defined as earlier but using the the new definition of . Defining as twice the maximum over row bandwidths measures how evenly the nonzero clusters are centered about the diagonal.

We also provide values of , a quantity which measures the size of the nonzero band but unlike does not take into account the displacement from the diagonal. Let be the difference between the largest and smallest nonzero subscript in row i of the matrix. Define as the maximum of over all rows; now differs from in not being relative to the diagonal. However, the sum of over all rows is still the same as which was defined earlier in terms of .



next up previous
Next: Sample Hypertext Matrices Up: Reordering Techniques Previous: Reordering Techniques



Michael W. Berry (berry@cs.utk.edu)
Mon Jan 29 14:30:24 EST 1996