The Columbus,
A Renaissance Hotel
Columbus, OH

May 1, 2010

to be held in conjunction with

Tenth SIAM International Conference on Data Mining (SDM 2010)

History of the Workshop

This is the eighth in the series of Text Mining workshops held in conjunction with SDM. Previous workshops took place in 2001, 2002, 2003, 2006, 2007, 2008, and 2009. At the most recent workshop (2009) in Sparks, NV, 41 authors representing industry, academia and national research laboratories from 6 different countries submitted a total of 15 papers. After careful review, 8 papers were selected for publication and presentation. In addition, SAS and Small Bear Consulting, LLC sponsored the workshop and provided funds for student travel expenses.

General Topics

The proliferation of digital computing devices and their use in communication has resulted in an increased demand for systems and algorithms capable of mining textual data. Thus, the development of techniques for mining unstructured, semi-structured, and fully structured textual data has become quite important in both academia and industry. As a result, this Workshop will survey the emerging field of Text Mining - the application of techniques of machine learning in conjunction with natural language processing, information extraction and algebraic/mathematical approaches to computational information retrieval. Many issues are being addressed in this field ranging from the development of new learning approaches to the parallelization of existing algorithms. The goal of this workshop is to provide a venue for researchers to share initial approaches and preliminary results of recent research in Text Mining. Through the careful selection and review of submitted workshop papers, we hope to provide a suitable selection of topics that will both generate interest and provide insight into the state of the field of Text Mining.

Special Topics - Text Mining with the Enron Data Set and VAST 2008/2009 Contest Data

Because of the continued interest generated by the availability of the Enron data set of 1.3 million email messages (see Enron Email Dataset) and its versatility in terms of potential research topics (link analysis, pattern matching), researchers are encouraged to submit papers that use this data set. In addition, the text-based datasets of news events and scenario definition used in the IEEE Symposium on Visual Analytics Science and Technology (VAST) 2008 and 2009 Contests are interesting corpora for research in topic detection/tracking, role playing, and scenario analysis (see VAST 2008 or VAST 2009 contests for more details on those datasets).

Other Specific Topics of Interest Include:

    Algorithms and Models

  • Bayesian Models
  • Concept Decomposition
  • Orthogonal Decomposition
  • Probabilistic Models
  • Vector Space Models
  • Latent Semantic Indexing
  • Graph-based Models
  • Text Streaming Models
  • Clustering
  • Factor Analysis
  • Visualization Techniques
  • Metadata Generation
  • Information Extraction
  • Text Classification
  • Text Purification
  • Text Segmentation
  • Text Summarization
  • Query Structures
  • Trend Detection
  • Distributed Storage and Retrieval


Registration

Attendees are required to register for SDM 2010; no separate registration is needed for this workshop.
A one-day registration for the conference is available, so workshop attendees do not have to register at the complete conference rate.

Submission Requirements

To submit a paper, upload it in PDF format through the review system. Papers should be printable on 8.5 × 11 inch paper only and be roughly 10 pages in length, using an 11pt font in a two-column format with 1 inch margins.

In the Authors section you will find the instructions:

1. Use the abstract submission interface to provide the main information on your paper. You will be given an id/password which must later be used to access the system during the following steps, so save the login information message that you will receive from the system.

2. Once an abstract has been submitted, you can upload your paper.

To guarantee consideration, manuscripts must be received by January 15, 2010. Submission of work in progress is also encouraged.

Important Dates

Papers due: January 15, 2010 (deadline passed)

Notifications sent: February 5, 2010 (deadline passed)

Camera-ready final papers due: February 12, 2010 (deadline passed)

Keynote speakers:
John Tredennick (Founder and CEO) and Bruce Kiefer of Catalyst Repository Systems, Inc., Denver, CO
Title of Presentation:
Using Advanced Mathematical Techniques to Help Bring Electronic Discovery Under Control


With digital content exploding, corporations are seeking new and better ways to tame the eDiscovery dragon and reduce the cost of litigation review. In the past, legal teams would divide up the case documents and put eyeballs on every page. With million-document cases now becoming almost routine, clients are balking at the costs. Rather than pay expensive hourly rates to look at documents, they are turning to advanced mathematics and statistical techniques to reduce review populations. In this talk, we will explore real-world techniques used by Catalyst Repository Systems, one of the leading international document hosting companies for complex legal matters. Catalyst's founder and CEO, John Tredennick, a longtime trial lawyer handling these kinds of big cases, will talk about the problem space and how traditional methods for dealing with eDiscovery have failed. Bruce Kiefer, Catalyst's VP of Operations, will talk about how we use advanced clustering and search-based analytics derived from non-negative matrix factorization to help address the problem.

John Tredennick
Over the past thirty years, John Tredennick has spoken before more national and international audiences on legal and technology issues than he or anyone else can remember. He's written and edited five best-selling books and countless articles on litigation and technology issues. Recently, he was named one of the Top 100 Global Technology Leaders by London's CityTech magazine. He also serves as a member of the Short Course Faculty at the University of Virginia Law School where he teaches a course called Electronic Discovery in a Global Environment.

John began his career as a trial lawyer and litigation partner with one of the largest law firms in the Rocky Mountains. He got interested in technology in the late 1980s when his wife brought a computer home for graduate business school. After erasing the data on her hard drive twice, and making lots of other mistakes, he began to see promise in the machines. Working with a team of technologists, he began building software to help his 10-office firm manage complex litigation.

In 1995 John became the firm's Chief Information Officer (CIO), the first in the country for a major law firm. He also continued his full-time practice as a trial lawyer and litigation partner. His passion for finding ways to use technology to improve law practice led the firm to international prominence as a technology pioneer. It also led to a 1999 induction into the Smithsonian Institution Archives as an Information Innovation Pioneer.

John and his team began building web-based litigation repositories in 1998, while still a partner at his firm. He founded Catalyst (originally called CaseShare) in 2000. The company has since grown to more than 100 employees in the U.S. and India. The company is headquartered and maintains two data centers in Colorado, with representatives in major cities across the U.S.

Over the years, John and the company have won numerous awards, including being named Rocky Mountain Entrepreneur of the Year by Ernst and Young and Technology Entrepreneur of the Year by the Colorado Software and Internet Association (CSIA). His company was named Top Company for Technology/Media/Telecommunications by ColoradoBiz Magazine and Colorado Company to Watch by the Colorado Office of Economic Development. It has also been named repeatedly to the Deloitte FAST 50 and FAST 500 (for rapid growth) and received recognition by Socha-Gelbmann and Law Technology News as a Top eDiscovery Provider.

Bruce Kiefer
Mr. Kiefer leads the Hosting Applications Division and directs the Research and Development Group at Catalyst Repository Systems, Inc. Bruce has had many careers, but has found a home in the technology sector. For the last twelve years he has built, deployed, managed, scaled, and repaired networks and systems that solve problems. Recently, he worked for Viawest Internet Services as Vice President of Operations. During Bruce's tenure at Viawest, he built all of the internal tools, grew the network to four states, and took over product management for Viawest's managed hosting offering.

Bruce has recently completed his MBA and used the new knowledge to help Viawest move the hosting product from a marginal product line to a key revenue contributor. Bruce is excited about being able to use these skills to help product development and operations at Catalyst.

Presentation slides (PDF): Tredennick, Kiefer

Program in PDF format (final version posted on February 15, 2010)

Presentation Slides (PDF): Farahat, Hassan, Kontostathis, Liang, Ratner, Vacek, Wang

Sponsors: SAS Institute Inc. of Cary, NC, Small Bear Technical Consulting, LLC of Thorn Hill, TN, and Catalyst Repository Systems Inc. of Denver, CO; Press Release by Catalyst Repository Systems Inc. (April 15, 2010).

Program Committee

Co-Chairs: Michael W. Berry, University of Tennessee and Jacob Kogan, University of Maryland, Baltimore County

Loulwah AlSumait, Kuwait University
Murray Browne, Turner Broadcasting Systems, Inc.
Malu Castellanos, Hewlett-Packard Laboratories
Carlotta Domeniconi, George Mason University
William Ferng, Boeing
Kyle Gallivan, Florida State University
Efstratios Gallopoulos, University of Patras, Greece
Wilfried Gansterer, University of Vienna
Efim Gendler,
Peg Howland, Utah State University
April Kontostathis, Ursinus College

Choudur Lakshminarayan, Hewlett-Packard Laboratories
Bill Pottenger, DIMACS, Rutgers
Padma Raghavan, Penn State University
Andrea Tagarelli, University of Calabria, Italy
Judith Vogel, Stockton College
Zeev Volkovich, Ort Braude College, Israel
Yu Xia, INRIA Grenoble, France

Organizational Committee

Michael W. Berry
Department of Electrical Engineering & Computer Science
203 Claxton Complex
University of Tennessee
Knoxville, TN 37996-3450
Phone: (865) 974-3838
Fax: (865) 974-4404
berry AT eecs DOT utk DOT edu

Jacob Kogan
Department of Mathematics and Statistics
University of Maryland, Baltimore County
Baltimore, MD 21250
Phone: (410) 455-3297
Fax: (410) 455-1066
kogan AT math DOT umbc DOT edu

Last modified on May 5, 2010