We will use the following textbooks for this course:

[HTF] The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Springer. 2001. Q325.75.F75 2001 c. 1. Available at http://statweb.stanford.edu/~tibs/ElemStatLearn/.

[BIS] Pattern Recognition and Machine Learning. Christopher M. Bishop. 2009. Q327.B52 2009 c. 1

Other useful references are:

[MRT] Foundations of Machine Learning. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Q325.5 .M64 2012 c. 1

[SSBD] Understanding Machine Learning: From Theory to Algorithms. Shai Shalev-Shwartz and Shai Ben-David. 2014. Q325.5 .S475 2014 c. 1. Available at http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf.

[MUR] Machine Learning: A Probabilistic Perspective. Kevin P. Murphy. 2012. Q325.5 .M87 2012 c. 1

[TM] Machine Learning. Tom M. Mitchell. 1997. Q325.5.M58 1997 c. 1

Schedule (date, topics, readings):

Introduction

T 1/29: Introduction to Machine Learning
Defining learning, why and when we need machine learning, different types of machine learning, recent successes, cool applications
Readings: MUR Chapter 1 (PDF), BIS Chapter 1, SSBD Chapter 1

Th 1/31: Course overview, formal introduction
Supervised learning: task, performance, evaluation; classification, regression, loss function, risk
Readings: MRT Chapter 1, HTF Chapter 1

F 2/1: Recitation: Probability and Statistics
Events, random variables, probabilities, pdf, pmf, cdf, mean, mode, median, variance, multivariate distributions, marginals, conditionals, Bayes' theorem, independence
Readings: Review slides; Aaditya Ramdas' tutorials; math review (BIS Appendix B, MRT Appendices A and C); BIS Chapter 2

T 2/5: Foundations
Bayes optimal rule, Bayes risk, empirical risk minimization (ERM), generalization error, supervised learning: classification/regression, rote learning, lazy learning, model fitting
Readings: SSBD Chapter 2, HTF Sections 2.1-2.3, MUR Sections 1.1-1.2

Th 2/7: Recitation: Linear Algebra
Vector spaces, norms, metric spaces, inner product spaces, the Cauchy-Schwarz inequality, orthonormal bases
Readings: Video lecture, review notes, review slides

F 2/8: Recitation: MLE, MAP, Intro to Python
Parametric distributions, parameter estimation (MLE), MAP estimation; introduction to Python, Jupyter notebooks, NumPy
Readings: BIS Chapter 2; Review Notes, Python Tutorial, Learn Python
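
To make the parameter-estimation part concrete, here is a minimal NumPy sketch (with invented data) of the maximum likelihood estimates for a univariate Gaussian, namely the sample mean and the biased sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)       # synthetic data

# MLE for a univariate Gaussian N(mu, sigma^2)
mu_hat = x.mean()                                   # sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()             # biased (1/n) variance estimate
print(mu_hat, np.sqrt(sigma2_hat))                  # roughly 2.0 and 1.5
```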

T 2/12: Linear Regression
Linear functions, loss function, empirical risk minimization, least squares solution, generalization, error decomposition
Readings: HTF Sections 2.3, 3.2; BIS Section 3.1
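
For concreteness, a minimal NumPy sketch of the least squares solution; the synthetic data and the choice of np.linalg.lstsq (rather than forming the normal equations explicitly) are illustrative, not prescribed by the readings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

Xb = np.hstack([X, np.ones((100, 1))])              # append a bias column
# Minimize ||Xb w - y||^2 (equivalently, solve the normal equations Xb^T Xb w = Xb^T y)
w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w_hat)                                        # approximately [1, -2, 0.5, 0]
```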

Th 2/14: Error analysis, statistical view
Bayes optimal predictor, statistical view, Gaussian model, maximum likelihood estimation (MLE), polynomial regression, general additive regression, overfitting
Readings: HTF Sections 2.4, 2.6, 2.9; BIS Sections 1.1, 1.2, 3.1, 3.2; MUR Sections 7.1-7.3; SSBD Section 9.2

F 2/15: Gradient descent; bias-variance tradeoff
Gradient ascent/descent, step size, convergence, bias of an estimator, consistency, estimation and regression, bias-variance decomposition in regression, Cramér-Rao inequality, Fisher information
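
A minimal sketch of batch gradient descent on the least squares objective, just to make the update rule concrete; the data, step size, and iteration count are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(2)
eta = 0.1                                       # step size (learning rate)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)       # gradient of the mean squared error
    w -= eta * grad
print(w)                                        # approaches [3, -1]
```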


T 2/19: Regularization
Model complexity and overfitting, penalizing model complexity, description length, shrinkage methods, ridge regression, Lasso
Readings: HTF Section 3.3; BIS Sections 1.1, 1.3, 3.1.4; MUR Section 7.5
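
A minimal sketch of ridge regression in closed form; the penalty strength lam and the synthetic data are arbitrary choices for illustration:

```python
import numpy as np

def ridge(X, y, lam):
    """Solve min_w ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=50)
print(ridge(X, y, lam=1.0))     # coefficients shrink toward zero as lam grows
```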

Th 2/21: Classification
Introduction, classification as regression, linear classifiers, risk, conditional risk, logistic regression, MLE, surrogate loss, generalized additive models
Readings: HTF Sections 4.1, 4.4; BIS Sections 1.5, 4.3.2; MUR Sections 8.1-8.3

F 2/22: Recitation: Convex Optimization
Convex sets, convex functions, standard form, Lagrange multipliers, equivalence of the constrained and unconstrained versions of ridge regression and the Lasso
Readings: MRT Appendix B; BIS Appendices D, E

T 2/26: Logistic Regression
Log odds ratio, logistic function, gradient descent, Newton-Raphson
Readings: BIS Section 4.3.4; SSBD Sections 9.3, 14.1; MUR Section 8.5
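
A minimal sketch of logistic regression fit by gradient descent on the average log loss; the labels, step size, and iteration count are invented for illustration (Newton-Raphson would replace this update with a second-order step):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)     # labels in {0, 1}

w = np.zeros(2)
eta = 0.5
for _ in range(1000):
    p = sigmoid(X @ w)                              # predicted P(y = 1 | x)
    grad = X.T @ (p - y) / len(y)                   # gradient of the average log loss
    w -= eta * grad
print(w)                                            # roughly proportional to [1, 0.5]
```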

Th 2/28: Softmax; stochastic gradient descent
Overfitting with logistic regression, MAP estimation, regularization, softmax, stochastic gradient descent (SGD)
Readings: BIS Section 7.1; SSBD Sections 15.1-15.1.1; MUR Section 14.5

F 3/1: Recitation: Matrix Cookbook
Partial derivatives, gradient, linear form, quadratic form, Hessian

T 3/5: Decision Trees I
Partition trees, classification and regression trees (CART), regression tree construction, region splitting
Readings: BIS Section 14.4

Th 3/7: Decision Trees II
Regression tree complexity, regression tree pruning, classification trees, Gini index
Readings: HTF Sections 9.2, 9.5
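
The Gini index is simple enough to state in code; this small sketch (with hypothetical label vectors) computes it for a single node:

```python
import numpy as np

def gini(labels):
    """Gini index of a node: 1 - sum_k p_k^2, where p_k are the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([0, 0, 1, 1])))   # 0.5: maximally impure for two classes
print(gini(np.array([1, 1, 1, 1])))   # 0.0: pure node
```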

F 3/8: Recitation: Practice midterms

T 3/12: MIDTERM REVIEW

Th 3/14: MIDTERM

F 3/15: Discuss MIDTERM solutions

Spring Break (3/18 - 3/22)

T 3/26: Ensemble methods
Combining "weak" classifiers, greedy assembly, the AdaBoost algorithm, exponential loss function, weighted loss, optimizing the weak learner
Readings: BIS Section 14.3; HTF Sections 10.1-10.3

Th 3/28: AdaBoost
AdaBoost derivation, AdaBoost behavior, boosting the margin, boosting decision stumps, boosting and the bias-variance tradeoff, combination of regressors, forward stepwise regression, combining regression trees, random forests, classification with random forests, bagging (bootstrap aggregation)
Readings: HTF Sections 10.4-10.5
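
A compact sketch of the AdaBoost weight and coefficient updates, using depth-1 trees (decision stumps) from scikit-learn as the weak learner; the dataset is synthetic and scikit-learn is only a convenience here, not an assigned tool:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(np.abs(X[:, 0]) < 1.0, 1, -1)            # labels in {-1, +1}

w = np.full(len(y), 1.0 / len(y))                     # example weights
stumps, alphas = [], []
for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)         # weighted training error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # coefficient of this weak learner
    w *= np.exp(-alpha * y * pred)                    # up-weight the mistakes
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```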

F 3/29: Constrained optimization
Constrained optimization problems, Lagrangian, dual function, dual variables, weak duality, strong duality, Slater's condition, KKT conditions, equivalence of constrained and regularized problems
Readings: MUR Section 14.5; Notes on duality; BIS Section 7.1

T 4/2: Support Vector Machines
Optimal separating hyperplane, large margin classifier, margin and regularization, Lagrange multipliers, KKT conditions, max-margin optimization, quadratic programming, support vectors
Readings: HTF Section 4.5; MRT Sections 4.1-4.2

Th 4/4: Support Vector Machines II
Non-separable case, SVMs with slack, loss in SVMs, solving the SVM in the primal, SVM regression
Readings: BIS Section 7.1; SSBD Sections 15.1-15.1.1; MUR Section 14.5

F 4/5: Recitation: HW3

T 4/9: Kernel methods
Solving the SVM in the primal, subgradient, subgradient descent, nonlinear features, feature space, kernel trick, representer theorem, kernel SVM in the primal, Mercer kernels, radial basis functions, kernel SVM, SVM regression, multiclass SVMs
Readings: BIS Sections 6.1, 6.2; MRT Sections 5.1, 5.2, 5.3.1-5.3.2; SSBD Sections 15.2, 15.4-15.5; MUR Sections 14.1-14.2
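
A minimal sketch of the radial basis function (Gaussian) kernel matrix, the basic ingredient of kernelized SVMs and kernel regression; the bandwidth gamma and the random inputs are arbitrary:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x1_i - x2_j||^2)."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2 * X1 @ X2.T)
    return np.exp(-gamma * sq)

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X)
print(K.shape, np.allclose(np.diag(K), 1.0))    # (5, 5) True
```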

Th 4/11: Generative models for classification
Generative models, discriminant functions, likelihood ratio test, Gaussian discriminant analysis, linear and quadratic discriminants, generative models for classification, mixture models
Readings: HTF Sections 4.3, 12.4-12.6

F 4/12: No recitation/class

T 4/16: Mixture models, the EM algorithm
Review of multivariate Gaussians, mixture models, likelihood of mixture models, mixture density estimation, expectation maximization, EM for GMMs, generic EM for mixture models, EM overfitting and regularization
Readings: HTF Section 12.7; BIS Sections 9.2, 9.3
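
A compact sketch of EM for a one-dimensional, two-component Gaussian mixture, showing the alternation of E and M steps; the data, initialization, and fixed iteration count are deliberately simplistic:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

pi = np.array([0.5, 0.5])                     # mixing weights
mu = np.array([-1.0, 1.0])                    # component means
sigma = np.array([1.0, 1.0])                  # component standard deviations
for _ in range(100):
    # E-step: responsibilities r[i, k] = P(component k | x_i)
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and standard deviations
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
print(pi, mu, sigma)                          # roughly [0.6, 0.4], [-2, 3], [1, 0.5]
```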

Th 4/18: Conditional mixture models
Mixture models for regression, mixture of experts model, gating network, conditional mixtures, EM for mixture of experts
Readings: BIS Section 14.5

F 4/19: Perceptron, Neural Networks
Perceptron, perceptron loss, perceptron and neurons, general linear methods and their representation as neural networks, two-layer networks, feed-forward networks, training the networks
Readings: BIS Sections 5.1, 5.2
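
The perceptron update itself is only a few lines; this sketch runs it on invented, linearly separable data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)       # linearly separable labels in {-1, +1}

w, b = np.zeros(2), 0.0
for _ in range(100):                             # passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:               # misclassified (or on the boundary)
            w += yi * xi                         # perceptron update
            b += yi
print(np.mean(np.sign(X @ w + b) == y))          # training accuracy
```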

T 4/23: No class; recitation: HW4

Th 4/25: Deep learning
Training neural networks, backpropagation, MLPs as universal approximators, deep vs. shallow networks, non-convex optimization, classification networks, multi-class and multi-label networks, model complexity, learning rate, momentum, ReLU activation, network ensembles, dropout, multilayer networks, network topology, skip-layer connections
Readings: BIS Sections 5.3-5.5; BIS Section 14.5

F 4/26: Non-parametric methods, evaluating ML
Parametric vs. non-parametric methods, nearest neighbor methods, extensions, bias and variance in nearest neighbor methods, locally weighted regression, classifier evaluation, precision-recall tradeoff, ROC space, significance of results, significance of cross-validation
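
A minimal sketch of a k-nearest-neighbor classifier, the canonical non-parametric method; the data, the value of k, and the use of Euclidean distance are all illustrative choices:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict by majority vote among the k nearest training points (Euclidean distance)."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
print(np.mean(knn_predict(X[:150], y[:150], X[150:]) == y[150:]))   # held-out accuracy
```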

T 4/30: Representation Learning
Unsupervised feature learning, principal component analysis (PCA), power iteration method, kernel PCA, canonical correlation analysis (CCA)
Readings: BIS Sections 12.1, 12.3; HTF Section 14.5
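
A minimal sketch of the power iteration method for the leading principal component; it assumes centered data and a single component, and the synthetic data are only there to make it runnable:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated features
Xc = X - X.mean(axis=0)                                   # center the data
C = Xc.T @ Xc / len(Xc)                                   # sample covariance matrix

v = rng.normal(size=5)
for _ in range(200):                                      # power iteration
    v = C @ v
    v /= np.linalg.norm(v)

# Compare with the top eigenvector from a direct eigendecomposition
w, V = np.linalg.eigh(C)
print(np.abs(v @ V[:, -1]))                               # close to 1: same direction up to sign
```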

Th 5/2: Clustering
Cluster analysis, dissimilarity based on attributes, k-means, soft k-means, hierarchical clustering, spectral clustering
Readings: BIS Section 9.1; HTF Section 14.3
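
A compact sketch of Lloyd's algorithm for k-means, with random initialization from the data and a fixed number of iterations for simplicity; the two-blob dataset is invented:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # random initial centers
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)                                # assignment step
        centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])  # update step
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
centers, assign = kmeans(X, k=2)
print(centers)                                                   # near (-3, -3) and (3, 3)
```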

F 5/3: Learning Theory
Probably approximately correct (PAC) learning, finite hypothesis classes, infinite hypothesis classes, VC dimension, Rademacher complexity; final review
Readings: MRT Chapter 2