Tutorial 1 : 10 Data Mining Mistakes -- and How to Avoid Them

John F. Elder IV (Chief Scientist, Elder Research, Inc.)


This tutorial will reveal the top mistakes data analysts can make, from the simple to the subtle, using real-world (often humorous) stories. The topics will be presented from case studies of real projects and the (often overlooked) symptoms that suggested something might be amiss.

The goal will be to learn "best practices" from their flip side -- mistakes. But also, following the introduction of a topic (e.g., bootstrapping) we'll review how to do it right -- that is, we'll have mini-tutorials on the key principles to keep in mind when using a particular DM technique.

The best background for attendees to have is a problem they want to solve, and experience trying an analysis technique. We'll focus on how to think rightly about a problem, and less on technical equations or details. Practical illustrations emphasize the "uncommon common sense" necessary to practice well the art of Data Mining. The subtler mistakes (e.g., improper sampling) may elude novice practitioners, but the story-symptom-solution format of the tutorial should be very accessible to all conference attendees. Previous versions of this tutorial have received very high marks for clarity and accessibility (and humor) from novices and experts alike.


Dr. John Elder heads a data mining consulting team in Charlottesville, Virginia, and Washington, DC (www.datamininglab.com), founded in 1995. Elder Research, Inc. focuses on investment and commercial applications of pattern discovery, including stock selection, image recognition, biometrics, cross-selling, drug efficacy, credit scoring, market timing, and fraud detection. John obtained a BS and MEE in Electrical Engineering from Rice University, and a PhD in Systems & Information Engineering from the University of Virginia, where he's an adjunct professor teaching Optimization. Prior to ERI, John spent 5 years in high-tech defense consulting, 4 heading research at an investment management firm, and 2 in Rice's Computational & Applied Mathematics department.

Dr. Elder has authored influential articles and innovative data mining tools, is active on Statistics, Engineering, and Finance conferences and boards, and is a program co-chair of the 2004 Knowledge Discovery and Data Mining conference. Since the Fall of 2001, he's served on a panel appointed by Congress to guide technology for the National Security Agency.

Tutorial 2 : Data Mining In Time Series Databases

Eamonn Keogh (University of California, Riverside)


In this tutorial we will review the state of the art in time series data mining. In addition to the ubiquitous classification and similarity search problems, we will also consider clustering, anomaly detection, visualization, motif discovery and other exciting tasks. The ideas presented will be motivated by case studies in domains as diverse as video surveillance, cardiology, text mining, space telemetry monitoring, handwriting indexing, query by humming and motion capture/animation. Rather than simply review previous work, we have taken the time to reimplement and compare most of the work in the literature. For example, we have:

* Reimplemented 52 time series distance measures.
* Reimplemented more than 20 time series representations.
* Done more than 15 billion Dynamic Time Warping calculations.
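
For readers unfamiliar with Dynamic Time Warping, the following is a minimal Python sketch of the textbook dynamic-programming formulation. It is an illustration only, not the presenter's implementation: the function name `dtw_distance`, the squared-difference local cost, and the absence of any warping-window constraint are all assumptions made here.

```python
import math


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) Dynamic Time Warping distance.

    Assumptions (not from the tutorial): squared-difference local cost,
    no warping-window constraint, square root taken at the end.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return math.sqrt(cost[n][m])
```

Under this formulation a sequence and a locally stretched copy of it (for example `[1, 2, 3]` and `[1, 1, 2, 3]`) are at distance zero, which is the property that distinguishes warping from a point-by-point Euclidean comparison.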

The tutorial will be presented in a "math lite", but highly intuitive and graphic-intensive format, accessible to professors, grad students, people in industry and advanced undergrad students. All attendees will receive a free CD-ROM with a full copy of the tutorial, the world's largest collection of time series datasets, and a host of other useful teaching/research materials.


Dr. Keogh is an assistant professor of Computer Science at the University of California, Riverside. His research interests include Data Mining, Machine Learning and Information Retrieval. He has published papers on time series in all the top data mining conferences and journals, including VLDB, SIGKDD, SIGIR, SIGMOD, SIGGRAPH, EDBT, PKDD, PAKDD, IEEE ICDM, SIAM SDM, TODS, DMKD and KAIS. Several of his papers have won "best paper" awards. He recently won a 5-year NSF Career Award for "Efficient Discovery of Previously Unknown Patterns and Relationships in Massive Time Series Databases". His papers on time series data mining have been referenced well over 1,000 times (see http://www.cs.ucr.edu/~eamonn/selected_publications.php).

Tutorial 3 : Algorithmic Excursions in Data Streams

Sudipto Guha (University of Pennsylvania)


For many recent applications, the concept of a data stream is more appropriate than a data set. By nature, a stored data set is an appropriate model when significant portions of the data are queried repeatedly, and updates are small and/or relatively infrequent. In contrast, a data stream is a more appropriate model in scenarios where large volumes of data or updates arrive continuously and it is either unnecessary or impractical to store the data in some form of memory. Many applications naturally generate data streams as opposed to simple data sets.

The stream view challenges basic assumptions in data mining, such as random access to data. It also raises several fundamental questions, such as whether there are effective techniques for mining streams.

In this tutorial we will present a survey of algorithms and applications related to data streams. We begin by presenting the basic data stream model of computation and its variations. We will subsequently cover algorithmic techniques for summarizing data streams, namely, estimating frequency moments, computing wavelets, histograms, Fourier decompositions, etc. Most of these techniques will be motivated in the context of various networking and database applications. Finally, we will cover various mining primitives on streams, such as computing frequent itemsets, clustering, and decision trees. We will conclude by relating streaming to existing models such as online algorithms.
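
To make the one-pass, bounded-memory setting concrete, here is a minimal Python sketch of the Misra-Gries frequent-items summary. It is offered only as a small example of stream summarization and is not claimed to be among the specific algorithms covered in the tutorial; the function name `misra_gries` and the parameter `k` are illustrative assumptions.

```python
def misra_gries(stream, k):
    """Single pass over `stream`, keeping at most k - 1 counters.

    Guarantee of the Misra-Gries summary: every item that occurs more than
    n / k times in a stream of length n is still present in the result.
    The stored counts are underestimates of the true frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The stream is consumed exactly once and never stored, so the input can be any iterator (for example a generator reading from a socket), and memory use depends on `k` rather than on the stream length.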


Sudipto Guha has been an assistant professor in the Department of Computer and Information Sciences at the University of Pennsylvania since Fall 2001. He completed his PhD in 2000 at Stanford University, working on approximation algorithms, and spent a year working as a senior member of technical staff in the Network Optimizations and Analysis Research department at AT&T Shannon Labs Research. He is an Alfred P. Sloan Research Fellow.

Tutorial 4 : Data Grid Management Systems (DGMS)

Arun Swaran Jagatheesan (University of California at San Diego)


A data grid infrastructure facilitates a logical view of heterogeneous distributed resources that are shared between autonomous administrative domains. Data grids are being built around the world as next-generation data-handling infrastructures for coordinated sharing of data and storage resources. A data grid infrastructure provides a location-independent logical namespace, consisting of persistent global identifiers for data resources, storage resources and users in an inter/intra-organizational enterprise. Data Grid Management Systems (DGMS) provide services on the data grid infrastructure for inter/intra-organizational information storage management.

The tutorial's objective is to introduce data grid technologies and their relevance to the Data Mining community. Sharing storage infrastructure and data among autonomous domains introduces new challenges in mining distributed grid data and deriving knowledge and information from it. Both novices and distributed computing experts should benefit from this tutorial. The tutorial will cover an introduction, use cases in large projects, design philosophies, existing technologies, open research issues, and, if possible, demonstrations.


Arun Swaran Jagatheesan ("Arun") is an Adjunct Assistant Researcher (OPS faculty member) at the University of Florida, and a Visiting Scholar at the San Diego Supercomputer Center (SDSC) at the University of California, San Diego. His research interests include Data Grid Management, Peer-to-peer Computing, and Workflow Management Systems. He is the founder and technical lead of the SDSC Matrix Project on Gridflow Management Systems. He is a co-chair of the Grid File System Working Group at the Global Grid Forum, and is involved in research and development of multiple data grid projects at the San Diego Supercomputer Center.