Statistical and Applied Mathematical Sciences Institute
19 T. W. Alexander Drive
P.O. Box 14006
Research Triangle Park, NC 27709-4006
Tel: 919.685.9350 FAX: 919.685.9360
info@samsi.info

 

Technology Transfer Short Course:
Data Mining and Machine Learning

NISS/SAMSI Building

July 25-29, 2005

 

BACKGROUND INFORMATION

 

SAMSI is instituting a new summer activity: technology transfer short courses designed to consolidate results from SAMSI programs (in most cases, those of earlier years) and to make the results available to working professionals in a compact, hands-on format. The first such course is derived from the 2003-04 SAMSI program on Data Mining and Machine Learning (DMML).

 

The DMML technology transfer short course combines theory, software, and applications. The theoretical component will emphasize ideas over rigor; the software component will sample the major techniques now commonly used for visualization, classification, and regression; and the applications component will walk participants through the practical analysis of some famous real-world data sets.

 

Each morning there will be three hours of lecture. Each afternoon will start with a 90-minute computer lab that works through an application using real data and relevant software, followed by a 90-minute lecture by a guest speaker. There will be several breaks during the day.

The course begins with an introductory overview of data mining: its scope, classical approaches, and the heuristics that guided the initial development of theory and methods. Then the course moves towards the treatment of more modern issues such as boosting, overcompleteness, and large-p small-n problems. This leads to a survey of currently popular techniques, including random forests, support vector machines, wavelets, and PAC bounds.

The main focus is regression inference, a central theme of the SAMSI DMML program and a paradigm that informs many data mining applications, but we also discuss clustering, classification, and multidimensional scaling.

The prerequisites for the course are a basic knowledge of applied multivariate inference and a general level of statistical knowledge comparable to a master's degree. Any math will focus upon conveying general insight rather than specific details.

 

COURSE CONTENTS

  1. Background and Overview: Nonparametric Regression, Cross-Validation, the Bootstrap
  2. Key Ideas and Methods: Smoothing, Bias-Variance Tradeoff
  3. Search and Variable Selection: Experimental Design, Gray Codes, Fitness
  4. Nonparametric Regression: Heuristics on Eight Methods
  5. Comparing Methods: Designing Experiments in Data Mining
  6. Local Dimension: How to Pick Problems Wisely
  7. Classification: Boosting, Random Forests, Support Vector Machines
  8. Cluster Analysis: Hierarchical, k-Means, and Mixture Models; SOM
  9. Issues with Bases: Hilbert Space, Shrinkage, Overcompleteness
  10. Wavelets: Introduction, Construction, Examples
  11. Structure Extraction: Regression and Multidimensional Scaling
  12. Vapnik-Chervonenkis Classes and PAC Bounds

 

 

INSTRUCTOR

The principal instructor for the course will be David L. Banks, Professor of the Practice of Statistics and Decision Sciences at Duke University and co-leader of the SAMSI DMML program.

 

TENTATIVE SCHEDULE

Monday, July 25
9:00 AM - 12:00 N    Introduction, Cross-Validation, the Bootstrap, Search Strategies, and Smoothing
1:30 - 3:30 PM       Computer lab on GGobi visualization and smoothing
3:45 - 5:15 PM       Jack Liu, GlaxoSmithKline: "Visualization and Data Mining for Microarrays"

Tuesday, July 26
9:00 AM - 12:00 N    Review and comparison of nonparametric regression methods: AM, GAM, PPR, ACE, AVAS, MARS, CART, neural nets; the backfitting algorithm
1:30 - 3:30 PM       Computer lab on the DRAT package for multivariate nonparametric regression
3:45 - 5:15 PM       J. S. Marron, UNC: "Issues with High Dimension, Low Sample Size Data"

Wednesday, July 27
9:00 AM - 12:00 N    Classification and Clustering: SVMs, random forests, boosting
1:30 - 3:30 PM       Computer lab on classification and boosting
3:45 - 5:15 PM       Feng Liang, Duke: "Model Complexity and Regularization"

Thursday, July 28
9:00 AM - 12:00 N    Bases and Wavelets
1:30 - 3:30 PM       Computer lab on SVMs and random forests
3:45 - 5:15 PM       Merlise Clyde, Duke: "Bayesian Model Averaging"

Friday, July 29
9:00 AM - 12:00 N    PAC Bounds and VC Classes
1:30 - 3:30 PM       Computer lab on wavelets (decimated and nondecimated)
3:45 - 5:15 PM       David Banks, Duke: "Survey of New Ideas in Data Mining"



APPLICATION

Enrollment in the short course is limited to 25. Applications, including requests for financial support, should be submitted as soon as possible.  ONLY ONLINE APPLICATIONS WILL BE ACCEPTED.  The application deadline is July 1, 2005.  In order to ensure your application is correct, we ask that you:

You will be notified as soon as possible (generally within three days) whether your application has been accepted, at which point you will be required to submit payment.

 

ON-LINE APPLICATION

 


© 2005, Statistical and Applied Mathematical Sciences Institute. All rights reserved.