ICDM'04 Invited Talks
Wray Buntine, Helsinki Institute of Information Technology, Finland
Open-source Search Engines: A Data Mining Platform
The ALVIS Consortium (http://www.alvis.info) and SearchInaBox are two
projects that build open source search engines. The ALVIS partners
believe the open source philosophy is ideal for next generation search
platforms, and are bringing data mining researchers together
with search and information retrieval professionals to develop
such a platform. This talk will outline the tasks and the
relevance of the effort to the data mining community.
Curriculum vitae:
Dr. Wray Buntine is known for his theoretical and applied work in
decision trees, e.g. the IND software, and Bayesian
graphical modeling, e.g. the most cited article "Operations
for Learning with Graphical Models" in JAIR.
Dr. Buntine is a former researcher with NASA and
has taught graduate courses at Stanford University.
In addition to his academic career, he is a long-time fan of Linux
and has a development background as a lead architect and developer
for a 24/7-uptime, scalable, client-server Internet
product delivering personalization. Before joining CoSCo
at Helsinki Institute of Information Technology (HIIT)
in January 2002 he was a consultant at Google
and at NASA Ames Research Center working on statistical computing.
He is now a Senior Research Scientist responsible for the ALVIS
Consortium and the SearchInaBox projects with Prof. Henry Tirri,
and is developing statistical language models for information retrieval.
David Hand, Imperial College, London
Deception, distortion, and discovery: data quality in data mining
Data mining is typically a process of secondary data analysis, using
data which were originally collected for some other purpose. They may
have been of high quality for that purpose, but of low quality for the
unspecified future analyses of data mining, and it may be economically
impracticable to require high quality data for all possible future
analyses. This talk gives an overview of data quality, covering
definitions, measurement, monitoring, and improvement. Some important
special topics are discussed in detail, including missing values,
anomaly detection, and deliberate data distortion. The talk is
illustrated with real examples from a wide variety of areas.
Curriculum vitae:
David Hand is Professor of Statistics and Head of the Statistics Section at Imperial College London.
He has published twenty books on statistics and related areas, including Discrimination and Classification,
Analysis of Repeated Measures, Practical Longitudinal Data Analysis, Construction and Assessment of Classification Rules,
Intelligent Data Analysis, Statistics in Finance, and Principles of Data Mining.
He is a Fellow of the Royal Statistical Society and an Honorary Fellow of the Institute of Actuaries.
He launched the journal Statistics and Computing in 1991, and also served a term of office as editor of the Journal
of the Royal Statistical Society, Series C. He was awarded the Thomas L. Saaty Prize for Applied Advances in the
Mathematical and Management Sciences in 2001 and the Royal Statistical Society’s Guy Medal in Silver in 2002,
and was elected Fellow of the British Academy, the UK’s leading learned society for the humanities and social sciences,
in 2003. His research interests include classification methods, the fundamentals of statistics, and data mining,
and his applications interests include medicine and finance. He has acted as a consultant to a wide range of organisations,
including governments, banks, pharmaceutical companies, manufacturing industry, and health service providers.
Thorsten Joachims, Cornell University, USA
Learning to Predict Complex Objects
Over the last decade, much of the research on discriminative learning
has focused on problems like classification and regression, where the
prediction is a single univariate variable. But what if we need to
predict complex objects like trees, sequences, or orderings? Such
problems arise, for example, when a natural language parser needs to
predict the correct parse tree for a given sentence, when a navigation
assistant needs to predict the route a user prefers for getting to the
destination, or when a search engine needs to predict which ranking is
best for a given query.
This talk will explore the challenges in predicting complex
objects. In particular, I will discuss support vector approaches that
cover some of these problems. They generalize the idea of margins to
complex prediction problems and a large range of loss functions. While
the resulting training problems have exponential size, there is a
simple algorithm that allows training in polynomial time. Empirical
results will be given for several examples.
Curriculum vitae:
Thorsten Joachims is an Assistant Professor in the Department of Computer Science at
Cornell University. In 2001, he finished his dissertation with the title "The
Maximum-Margin Approach to Learning Text Classifiers: Methods, Theory, and
Algorithms", advised by Prof. Katharina Morik at the University of Dortmund. From
there he also received his Diplom in Computer Science in 1997 with a thesis on
WebWatcher, a browsing assistant for the Web. His research interests center on a
synthesis of theory and system building in the field of machine learning, with a
focus on Support Vector Machines and machine learning with text. He authored the
SVMLight algorithm and software for support vector learning. From 1994 to 1996 he
was a visiting scientist at Carnegie Mellon University with Prof. Tom Mitchell.

Ming Li, University of Waterloo, Canada
Faster and More Sensitive Homology Search
Homology search is the most popular task in bioinformatics.
It is probably one of the largest data mining tasks, second only
to internet search. In the late 1970's, dynamic programming for homology search
was introduced. In the 1980's, the Blast heuristics were introduced
to trade sensitivity for speed. Today, a large fraction of the world's
supercomputing time is consumed by Blast and Smith-Waterman
dynamic programming. The explosive growth of genomics data
demands significantly more sensitive (than Blast) and faster
(than both Smith-Waterman and Blast) homology search software.
Can we speed up homology search without compromising sensitivity?
We introduce the fundamental ideas and the mathematical theory of optimized
spaced seeds. Equipped with optimal spaced seeds, our program
PatternHunter runs many times faster than Blast, at higher sensitivity levels.
With multiple optimized spaced seeds, PatternHunter runs 3000 times faster
than Smith-Waterman, at the same (full) sensitivity, bringing
homology search full circle.
Curriculum vitae:
Ming Li is a CRC Chair in Bioinformatics and Professor of Computer Science
at the University of Waterloo. He is a recipient of
Canada's E.W.R. Steacie Fellowship Award in 1996, and the 2001 Killam
Fellowship. Together with Paul Vitanyi, he pioneered applications of
Kolmogorov complexity and co-authored the book "An Introduction to Kolmogorov
Complexity and Its Applications" (Springer-Verlag, 1993, 2nd Edition, 1997).
He is a co-managing editor of the Journal of Bioinformatics
and Computational Biology. He currently also serves on the editorial
boards of the Journal of Computer and System Sciences, Information
and Computation, SIAM Journal on Computing, Journal of Combinatorial
Optimization, Journal of Software, and Journal
of Computer Science and Technology.