Common IR Test Collections

Glasgow Repository: 423 Time Magazine Articles (from 1963), Cranfield Collection (1,400 Abstracts), Medlars Collection (1,033 Abstracts), ADI, CACM, CISI

SMART's English Stoplist

Text REtrieval Conference (TREC)

Reuters-21578 Text Categorization Test Collection

Four Universities Data Set (webpages)

Linguistic Data Consortium (Catalog)

Sample Document-By-Term Matrices and Term Lists

CISI: 1,460 docs by 5,609 terms (Matrix, Term List)
CRAN: 1,398 docs by 4,612 terms (Matrix, Term List)
MED: 1,033 docs by 5,831 terms (Matrix, Term List)

Each Matrix file is based on the Harwell-Boeing (compressed column) sparse matrix format. Each record in a Term List file contains the term, its id, and its global weight based on a Log-Entropy weighting scheme.
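
The exact log-entropy variant used to produce the Term List weights is not stated here. As a rough sketch, assuming the common formulation g_i = 1 + sum_j (p_ij log p_ij) / log n, with p_ij = tf_ij / gf_i, gf_i the total count of term i, and n the number of documents, the global weights could be computed directly from compressed-column arrays. The function name `log_entropy_global_weights` and the tiny 4-doc by 3-term matrix are hypothetical, and the pointers are 0-based for convenience (Harwell-Boeing files store 1-based indices).

```python
import math

def log_entropy_global_weights(col_ptr, values, n_docs):
    """Return one global weight per column (term) of a docs-by-terms matrix.

    The matrix is in compressed-column form: the nonzero counts of term j
    are values[col_ptr[j]:col_ptr[j + 1]]. Pointers are 0-based here
    (an assumption for this sketch; Harwell-Boeing files are 1-based).
    """
    if n_docs < 2:
        raise ValueError("need at least two documents (log n must be nonzero)")
    log_n = math.log(n_docs)
    weights = []
    for j in range(len(col_ptr) - 1):
        counts = values[col_ptr[j]:col_ptr[j + 1]]
        gf = sum(counts)  # global frequency of term j
        # Sum of p * log(p) over documents containing the term; this is
        # zero or negative, and an empty column contributes 0.
        entropy = sum((c / gf) * math.log(c / gf) for c in counts)
        weights.append(1.0 + entropy / log_n)
    return weights

# Hypothetical 4-doc by 3-term count matrix in compressed-column form.
col_ptr = [0, 1, 5, 7]            # column j spans col_ptr[j]..col_ptr[j+1]
row_idx = [0, 0, 1, 2, 3, 1, 3]   # document ids (not needed for global weights)
values = [3, 1, 1, 1, 1, 1, 1]    # raw term counts

weights = log_entropy_global_weights(col_ptr, values, n_docs=4)
```

Under this formulation, a term concentrated in a single document gets a weight near 1, a term spread evenly over all documents gets a weight near 0, and the third term above (one occurrence in each of two of the four documents) lands at about 0.5.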