Radisson University Hotel, Minneapolis, MN, April 28, 2007
Photos: Mehran Sahami, Andreas Janecek, Jacob Kogan, Peg Howland
Contest links:
First Place: Cyril Goutte, NRC Institute for Information Technology, Canada
Second Place: The Wake Forest Team: Edward G. Allan, Michael R. Horvath, Christopher V. Kopek, Brian T. Lamb, and Thomas S. Whaples (all students in the Dept. of Computer Science, Wake Forest University, Winston-Salem, NC); Michael W. Berry (Advisor, Department of Computer Science, University of Tennessee, Knoxville)
Third Place: Mostafa Keikha and Narjes Sharif-razavian; Dr. Farhad Oroumchian (Advisor) University of Tehran, Iran
Contest Summary by Matthew Otey (NASA Ames Research Center): The 2007 Text Mining Workshop, held in conjunction with the Seventh SIAM International Conference on Data Mining, is the first to feature a text mining competition. The competition was organized and judged by members of the Intelligent Data Understanding group at NASA Ames Research Center in Moffett Field, California. Because this was the first such competition held as part of the workshop, we did not expect the large number of contestants that more established competitions such as the KDD Cup attract, but we did receive five submissions, though one contestant later withdrew.
A training data set was provided over a month in advance of the deadline, giving the contestants time to develop their approaches. Two days before the deadline we released the test data set. Each contestant submitted their labeling of the test data set, their confidences in the labeling, and source code implementing their approach. The scores of the submissions were calculated using a small Java program that implemented the score function detailed in the contest rules. The program and its source code were released to the contestants prior to the submission deadline so that they could both validate its correctness and use it to tune their algorithms. In addition to scoring the submissions, we ran each contestant's code to ensure that it worked and produced the same output that was submitted, and we inspected the source code to ensure that the contestants properly followed the rules of the contest.
All of the contestants' submissions ran successfully and passed our inspection, and we announced our three winners. In first place is Cyril Goutte at the NRC Institute for Information Technology, Canada, with a score of 1.69. In second place is a team consisting of Edward G. Allan, Michael R. Horvath, Christopher V. Kopek, Brian T. Lamb, and Thomas S. Whaples of Wake Forest University, and their advisor, Michael W. Berry of the University of Tennessee at Knoxville, with a score of 1.27. In third place is a team consisting of Mostafa Keikha and Narjes Sharif-razavian of the University of Tehran in Iran, and their advisor, Farhad Oroumchian of the University of Wollongong, Dubai, United Arab Emirates, with a score of 0.97. At NASA, we evaluated Schapire and Singer's BoosTexter approach and achieved a maximum score of 0.82 on the test data, which shows that the contestants made significant improvements over standard approaches.
Photo: M.W. Berry/WFU (2nd Place) and C. Goutte/NRC-Canada (1st Place)
Photo: Winners with NASA-Ames Researchers (D. McIntosh, P. Castle, M. Otey)
History of the Workshop
This is the fifth in the series of Text Mining workshops held in conjunction with SDM. Previous ones have taken place in 2001, 2002, 2003 and 2006. Last year at the Bethesda, Maryland workshop, 25 researchers from all over the globe submitted papers. Nine papers were accepted and presented at the workshop.
General Topics
The proliferation of digital computing devices and their use in
communication has resulted in an increased demand for systems and
algorithms capable of mining textual data. Thus, the development of
techniques for mining unstructured, semi-structured, and fully
structured textual data has become quite important in both academia and
industry. As a result, this Workshop will survey the emerging field of
Text Mining - the application of techniques of machine learning in
conjunction with natural language processing, information extraction
and algebraic/mathematical approaches to computational information
retrieval. Many issues are being addressed in this field ranging from
the development of new learning approaches to the parallelization of
existing algorithms. The goal of this workshop is to provide a venue
for researchers to share initial approaches and preliminary results of
recent research in Text Mining. Through the careful selection and
review of submitted workshop papers, we hope to provide a suitable
selection of topics that will both generate interest and provide
insight into the state of the field of Text Mining.
Special Topic - Text Mining with the Enron Data Set
Because of the continued interest generated by the availability of the Enron data set of 1.3 million email messages (see Enron Email Dataset) and its versatility in terms of potential research topics (link analysis, pattern matching), researchers are encouraged to submit papers on this data set to the workshop. Researchers interested in the social network aspects of the Enron data set should contact the organizers of the SDM Link Analysis Workshop.
Other Specific Topics of Interest Include:
Attendees must register for SDM 2007; no separate registration is needed for this workshop.
Papers should be printable on 8.5 × 11 paper only and be roughly 10 pages in length, using an 11pt font in two-column format with 1-inch margins.
To submit a paper, upload it in PDF format by accessing the review system via
http://www.cs.utk.edu/TextMiningPapers.
In the Authors section you will find the instructions:
1. Use the abstract submission interface to provide the main information
on your paper. You will be given an id/password which must later be used
to access the system during the following steps, so save the login information
message that you will receive from the system.
2. Once an abstract has been submitted, you can upload your paper.
To guarantee consideration, manuscripts must be received by
January 21, 2007 (deadline has passed).
Submission of work in progress is also encouraged.
Papers due: January 8, 2007, extended to January 21, 2007 (deadline has passed)
Notifications sent: February 5, 2007 (deadline has passed)
Camera-ready final papers due to workshop: February 15, 2007 (deadline has passed)
The keynote speaker is Mehran Sahami of Google.
Title:
Using Text Mining to Measure Similarity Between Words and Objects
Abstract:
The World Wide Web provides a wealth of data that can be harnessed to help
improve information retrieval and increase understanding of the
relationships between different entities. In many cases, we are
interested in determining how similar two entities may be to each other,
where the entities may be pieces of text or descriptions of some object. In
this work, we examine multiple instances of this problem, and show how they
can be addressed by harnessing data mining techniques applied to large
web-based data sets. Specifically, we examine the problems of determining
the similarity of short texts (even those that may not share any terms in
common) and also of learning similarity functions for semi-structured data
to address tasks such as record linkage between objects. While we present
rather different techniques for each problem, we show how measuring
similarity between entities in these domains has a direct application to the
overarching goal of improving information access for users of web-based
systems.
Biography:
Mehran Sahami is a Senior Research Scientist at Google. His research
interests include machine learning, data mining, and information retrieval
on the Web. Mehran was previously a Lecturer in the Computer Science
Department at Stanford University (where he received his PhD), and prior to
Google he was involved in a number of commercial and research machine learning
projects at Epiphany, Xerox PARC and Microsoft Research. He has published
dozens of refereed technical papers, served on numerous conference
program/organizing committees and has several patents pending.
Co-Chairs: Malu Castellanos, HP Labs and Michael W. Berry, University of Tennessee
William Ferng, Boeing
Kyle Gallivan, Florida State University
Mei Kobayashi, IBM Tokyo Research Lab
Stephen Soderland, University of Washington
Haesun Park, Georgia Tech
Peg Howland, Utah State University
April Kontostathis, Ursinus University
Padma Raghavan, Penn State University
Efstratios Gallopoulos, University of Patras, Greece
Choudur Lakshminarayan, Hewlett-Packard Laboratories
Pierre Senellart, INRIA (France)
Roger Bilisoly, Central Connecticut State University
Co-Chairs:
Malu Castellanos
Hewlett-Packard Laboratories, Palo Alto, CA
Phone: (650) 857-3074
Fax: (650) 852-8137
malu.castellanos AT hp DOT com
Michael W. Berry
Department of Computer Science
204 Claxton Complex
University of Tennessee
Knoxville, TN 37996-3450
Phone: (865) 974-3838
Fax: (865) 974-4404
berry AT cs DOT utk DOT edu
Publicity and Coordination
Murray Browne
Department of Computer Science
203 Claxton Building
University of Tennessee
Knoxville, TN 37996-3450
(865) 974-3510
mbrowne AT cs DOT utk DOT edu