Document Analysis
Document purification
Text organization (title, author, body, etc.)
Location of critical information - tables, charts, graphics, images
What should be indexed and what should not?
Token or term extraction
Which words (or phrases) should be used as referents?
How can semantic content (meaning) be captured?
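
As a rough illustration of token or term extraction, the sketch below lowercases a text, splits it into alphabetic tokens, drops common function words, and counts what remains as candidate index terms. The stopword list, the function names, and the sample sentence are assumptions made for this example, not part of the original outline. It does not attempt to capture semantic content.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration only; real systems use
# much longer lists or none at all.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "be"}


def extract_terms(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def term_frequencies(text):
    """Count how often each surviving term occurs in the text."""
    return Counter(extract_terms(text))


if __name__ == "__main__":
    sample = "The index of a document is the set of terms extracted from the document."
    for term, count in term_frequencies(sample).most_common():
        print(term, count)
```

Counting single words is the simplest choice of referent. Phrases, stemming, or weighting schemes would each need additional steps not shown here.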