ST 790C-001 - Statistical Machine Learning and Data Mining


Lectures: TH 8:15-10:00am, Withers 135 (lecture length extended, room changed) | Syllabus
Office Hours: TH 10:10-11:10am (or by appointment)
Textbook: The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani, and Friedman (2001).
Reference Books:
  • Pattern Recognition and Neural Networks by B. Ripley (1996)
  • Learning with Kernels by Schölkopf and Smola (2000)
  • The Nature of Statistical Learning Theory by Vapnik (1998)

    Useful Links:
  • Kernel Machines | Tibshirani's Lasso Page | Hastie's Software and Data
  • Local Working Group

    Software:
  • R Manual | LIBSVM

    Datasets:
  • Download Zip Code Data

    Final Project
  • Description


    Paper List for Journal Club:
  • Utility based data mining for time series analysis: cost-sensitive learning for neural network predictors by Crone, Lessmann and Stahlbock (2005), Proceedings of the 1st International Workshop on Utility-Based Data Mining, Conference on Knowledge Discovery in Data. Presenter: Melinda (08/28)
  • Multivariate statistical process control with artificial contrasts by Hwang, Runger and Tuv (2007), IIE Transactions. Presenter: Wook (10/25)
  • Choosing between logistic regression and discriminant analysis by Press and Wilson (1978), JASA. Presenter: Tilda (09/11)
  • Mining e-mail content for author identification forensics by de Vel, Anderson, Corney and Mohay (2001), SIGMOD Record. Presenter: Melinda (09/25)
  • Least Squares SVM Classifiers: a Large Scale Algorithm by Suykens, Lukas, Van Dooren, De Moor and Vandewalle (1999). Presenter: Caroline (10/02)
  • Assessing credit card applications using machine learning by Carter and Catlett (1987), IEEE Expert. Presenter: Xiang Lin (10/18)
  • Recent advances in speech recognition by Furui (1997), Pattern Recognition Letters.
  • Improving k-nearest neighbor density and error estimates by Buturovic (1993), Pattern Recognition.
  • Support vector machine classification and validation of cancer tissue samples using microarray expression data by Furey, Cristianini, Duffy, Bednarski, Schummer and Haussler (2000), Bioinformatics.
  • Drug design by machine learning: SVM for pharmaceutical data analysis by Burbidge, Trotter, Buxton and Holden (2001), Computers and Chemistry.
  • Robustness of Fisher's linear discriminant function under two-component mixed normal models by Ashikaga and Chang (1981), JASA.
  • Forecasting exchange rates using TSMARS by Gooijer, Ray and Horst (1998), Journal of International Money and Finance.


    Course Activities
    Week 1 (August 22-26) Read Chapter 1: Introduction Lecture 1 Notes (08/23)
    Supplementary Reading: Data mining and statistics: what is the connection? by Friedman (1997)
    Week 2 (August 27-Sep 2) Read Chapter 2: Overview of Supervised Learning Lecture 2 Notes (08/28)
    Homework 1 Assigned on 08/28, due on 09/11. Journal Club (08/28) Presenter: Melinda
    Supplementary Reading: An overview of statistical learning theory, Vapnik (1999) Lecture 2 (cont.) (08/30)
    Week 3 (Sep 3 - Sep 9) Read Chapter 3.1-3.3: Linear Models and Multiple Regression Lecture 3 Notes (09/04)
    Read Chapter 4 (4.1-4.3): Linear Discriminant Analysis
    Suppl. Reading: Sparse Principal Component Analysis, Zou, Hastie, and Tibshirani (2005) Lecture 4 Notes (09/06)
    Week 4 (Sep 10 - Sep 16) Read Chapter 4 (4.3, 4.4): QDA and Logistic Regression Lecture 5 Notes (09/11)
    Homework 2 Assigned on 09/11, due on 09/25. Journal Club (09/11) Presenter: Tilda
    Read Chapter 4 (4.5): Separating Hyperplanes Lecture 6 Notes (09/13)
    Suppl. Reading: Flexible Linear Discriminant Analysis by Optimal Scoring, Hastie, Tibshirani, and Buja (1994)
    Week 5 (Sep 17 - Sep 23) Read Chapter 12: Support Vector Machines Lecture 7 Notes (09/18)
    Suppl. Reading: Statistical Properties of Support Vector Machines, Lin (1999)
    Read Chapter 12: Multiclass Support Vector Machines Lecture 8 Notes (09/20)
    Week 6 (Sep 24 - Sep 30) Read Chapter 9: Extension of Support Vector Machines Lecture 8 Notes (cont.) (09/25)
    Homework 3 Assigned on 09/25, due on 10/09. Journal Club (09/25) Presenter: Melinda
    Read Chapter 9 (9.1): Additive Models Lecture 9 Notes (09/27)
    Suppl. Reading: Additive logistic regression: a statistical view of Boosting, by Friedman, Hastie and Tibshirani (2001)
    Week 7 (Oct 1 - Oct 7) Read Chapter 9 (9.2): Tree-based Methods Lecture 10 Notes (10/02)
    Suppl. Reading: Projection pursuit regression, Friedman and Stuetzle (1981) Journal Club (10/02) Presenter: Caroline
    Read Chapter 9.4: Multivariate Adaptive Regression Splines (MARS) Lecture 11 Notes (10/04)
    Suppl. Reading: Multivariate Adaptive Regression Splines (MARS) by Friedman (1990)
    Week 8 (Oct 8 - Oct 14) Read Chapter 8.7: Bootstrap and Bagging Lecture 12 Notes (10/09)
    Homework 4 Assigned on 10/09, due 10/23.
    Suppl. Reading: The self-organizing map, Kohonen (1990)
    No class on 10/11, Fall break
    Week 9 (Oct 15 - Oct 21) Read Chapter 10: Boosting Lecture 13 Notes (10/16)
    Suppl. Reading: Independent component analysis - a new concept?, Comon (1994)
    Read Chapter 13: Prototype Methods and Nearest Neighbors Lecture 13 Notes (continued) (10/18)
    Suppl. Reading: Estimation of Prediction Error, Efron (2004) Journal Club (10/18) Presenter: Xiang
    Final Project Assigned on 10/18, due 12/05.
    Week 10 (Oct 22 - Oct 28) Read Chapter 14 (14.1-14.4): Unsupervised Learning Lecture 14 Notes (10/23)
    Homework 5 Assigned on 10/23, due on 11/30.
    Read Chapter 14 (14.5-14.6): PCA and ICA Lecture 15 Notes (10/25)
    Suppl. Reading: Ridge regression: biased estimation for nonorthogonal problems, by Hoerl and Kennard (1970) Journal Club (10/25) Presenter: Wook
    Week 11 (Oct 29 - Nov 04) Read Chapter 3 (3.3 and 3.4): Penalized Least Squares and Variable Selection Lecture 16 Notes (10/30)
    Suppl. Reading: Regression shrinkage and selection via the lasso, Tibshirani (1996)
    Read Chapter 5: Spline Methods Lecture 17 Notes (11/01)


    Auditing
  • Auditors are expected to attend class regularly and submit homework on the same schedule as the other students. The final grade for auditors (AU or NR) will be based on their final homework average. A homework score of 75 or better is required for an AU.

    Policy on Academic Integrity
  • The University policy on academic integrity is spelled out in Appendix L of the NCSU Code of Student Conduct. For a more thorough elaboration, see the NCSU Office of Student Conduct website. For this course, group work on homework is encouraged. However, copying someone else's work and calling it your own is plagiarism, so the work you turn in should be your own.

    Students with Disabilities
  • Reasonable accommodations will be made for students with verifiable disabilities. In order to take advantage of available accommodations, students must register with Disability Services for Students (DSS), 1900 Student Health Center, CB# 7509, 515-7653.

    Online Class Evaluation
  • Online class evaluations will be available for students to complete during the last two weeks of class (November 26-December 9). All evaluations are confidential; instructors will never know how any one student responded to any question, and students will never know the ratings for any particular instructors. Click Online evaluation. More information at ClassEval.