19 T.W. Alexander Drive
P.O. Box 14006
Research Triangle Park, NC 27709-4006
Tel: 919.685.9350
Fax: 919.685.9360

Technology Transfer Short Course

Data Mining and Machine Learning

July 25-29, 2005
at the NISS/SAMSI building

General Information

SAMSI is instituting a new summer activity: technology transfer short courses, designed to consolidate results from SAMSI programs (in most cases, those of earlier years) and to make the results available to working professionals in a compact, hands-on format. The first such course is derived from the 2003-04 SAMSI program on Data Mining and Machine Learning (DMML).

The goals of the DMML technology transfer short course are to:

  • Provide a survey of the theoretical basis for modern data mining

  • Give participants hands-on experience with data mining software

  • Convey insights and strategies for data mining practice

The theoretical component will emphasize ideas over rigor; the software component will sample the major techniques that are now commonly used for visualization, classification, and regression; and the applications component will walk participants through the practical analysis of some famous real-world data sets.

The short course has three hours of lecture each morning. Each afternoon starts with a 90-minute computer lab that works through an application using real data and relevant software, followed by a 90-minute lecture by a guest speaker. There will be several breaks during the day.

The course begins with an introductory overview of data mining: its scope, classical approaches, and the heuristics that guided the initial development of theory and methods. Then the course moves towards the treatment of more modern issues such as boosting, overcompleteness, and large-p small-n problems. This leads to a survey of currently popular techniques, including random forests, support vector machines, wavelets, and PAC bounds.

The main emphasis is on regression inference, a central focus of the SAMSI DMML program and a paradigm that informs many data mining applications, but we also discuss clustering, classification, and multidimensional scaling.

The prerequisites for the course are a basic knowledge of applied multivariate inference and a general level of statistical knowledge comparable to a master's degree. Any math will focus upon conveying general insight rather than specific details.

Course Contents

  1. Background and Overview: Nonparametric Regression, Cross-Validation, the Bootstrap

  2. Key Ideas and Methods: Smoothing, Bias-Variance Tradeoff

  3. Search and Variable Selection: Experimental Design, Gray Codes, Fitness

  4. Nonparametric Regression: Heuristics on Eight Methods

  5. Comparing Methods: Designing Experiments in Data Mining

  6. Local Dimension: How to Pick Problems Wisely

  7. Classification: Boosting, Random Forests, Support Vector Machines

  8. Cluster Analysis: Hierarchical, k-Means, and Mixture Models; SOM

  9. Issues with Bases: Hilbert Space, Shrinkage, Overcompleteness

  10. Wavelets: Introduction, Construction, Examples

  11. Structure Extraction: Regression and Multidimensional Scaling

  12. Vapnik-Chervonenkis Classes and PAC Bounds
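
As a small illustration of how topics 1 and 2 fit together, the sketch below uses leave-one-out cross-validation to pick the bandwidth of a kernel smoother, which is one way of managing the bias-variance tradeoff. It is not taken from the course materials: the Nadaraya-Watson smoother, the simulated sine-curve data, the candidate bandwidths, and all function names are assumptions chosen for illustration.

```python
import math
import random


def kernel_smooth(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate at x0 using a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed to zero; fall back to the overall mean.
        return sum(ys) / len(ys)
    return sum(w * y for w, y in zip(weights, ys)) / total


def loo_cv_error(xs, ys, bandwidth):
    """Mean squared leave-one-out prediction error for one bandwidth."""
    n = len(xs)
    sq_err = 0.0
    for i in range(n):
        # Hold out point i and predict it from the remaining points.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        pred = kernel_smooth(xs[i], train_x, train_y, bandwidth)
        sq_err += (ys[i] - pred) ** 2
    return sq_err / n


def choose_bandwidth(xs, ys, candidates):
    """Return the candidate bandwidth with the smallest LOO-CV error."""
    return min(candidates, key=lambda h: loo_cv_error(xs, ys, h))


def make_data(n=80, noise=0.3, seed=0):
    """Simulated data (an assumption): a noisy sine curve on [0, 2*pi]."""
    rng = random.Random(seed)
    xs = sorted(rng.uniform(0.0, 2.0 * math.pi) for _ in range(n))
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_data()
    candidates = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
    for h in candidates:
        print(f"bandwidth {h:4.2f}  LOO-CV error {loo_cv_error(xs, ys, h):.4f}")
    print("chosen bandwidth:", choose_bandwidth(xs, ys, candidates))
```

Very small bandwidths follow the noise (high variance) and very large ones flatten the curve (high bias), so the cross-validated error is typically smallest at an intermediate value.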

The principal instructor for the course will be David L. Banks, Professor of the Practice of Statistics and Decision Sciences at Duke University and co-leader of the SAMSI DMML program.

Application

REGISTRATION IS NOW CLOSED

Schedule

Monday, July 25, 2005
9:00 a.m.-12:00 noon  Introduction, Cross-Validation, the Bootstrap, Search Strategies, and Smoothing
1:30 p.m.-3:30 p.m.  Computer lab on GGobi visualization and smoothing
3:45 p.m.-5:15 p.m.  Jack Liu (GlaxoSmithKline), "Visualization and Data Mining for Microarrays"

Tuesday, July 26, 2005
9:00 a.m.-12:00 noon  Review and comparison of nonparametric regression methods: AM, GAM, PPR, ACE, AVAS, MARS, CART, neural nets; the backfitting algorithm
1:30 p.m.-3:30 p.m.  Computer lab on the DRAT package for multivariate nonparametric regression
3:45 p.m.-5:15 p.m.  J. S. Marron (UNC-Chapel Hill), "Issues with High Dimensional, Low Sample Size Data"

Wednesday, July 27, 2005
9:00 a.m.-12:00 noon  Classification and Clustering: SVMs, random forests, boosting
1:30 p.m.-3:30 p.m.  Computer lab on classification and boosting
3:45 p.m.-5:15 p.m.  Feng Liang (Duke), "Model Complexity and Regularization"

Thursday, July 28, 2005
9:00 a.m.-12:00 noon  Bases and Wavelets
1:30 p.m.-3:30 p.m.  Computer lab on SVMs and random forests
3:45 p.m.-5:15 p.m.  Merlise Clyde (Duke), "Bayesian Model Averaging"

Friday, July 29, 2005
9:00 a.m.-12:00 noon  PAC Bounds and VC Classes
1:30 p.m.-3:30 p.m.  Computer lab on wavelets (decimated and nondecimated)
3:45 p.m.-5:15 p.m.  David Banks (Duke), "Survey of New Ideas in Data Mining"

Entire site © 2001-2008, Statistical and Applied Mathematical Sciences Institute. All Rights Reserved.