Valiant, L. A Theory of the Learnable. Communications of the ACM. Volume 27. pp. 1134--1142. 1984.
 
Servedio, R. On PAC Learning Using Winnow, Perceptron, and a Perceptron-Like Algorithm. Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT). pp. 296--307. 1999.
 
SVMs, SVM Convex Hull Algorithms
Cortes, C. and Vapnik, V. Support-Vector Networks. Machine Learning. Volume 20. pp. 273--297. 1995.
 
Crisp, D. and Burges, C. A Geometric Interpretation of nu-SVM Classifiers. In Advances in Neural Information Processing Systems. Solla, S. A., Leen, T. K., and Müller, K. R. eds. MIT Press. Volume 12. pp. 244--250. 1999.
 
Franc, V. and Hlavac, V. An iterative algorithm learning the maximal margin classifier. Pattern Recognition. Volume 36. pp. 1985--1996. 2003.
 
Mavroforakis, M. E., Sdralis, M., and Theodoridis, S. A Novel SVM Geometric Algorithm based on Reduced Convex Hulls. 18th International Conference on Pattern Recognition. pp. 564--568. 2006.
 
Liu, Zhenbing, Liu, J. G., Pan, Chao, and Wang, Guoyou. A Novel Geometric Approach to Binary Classification Based on Scaled Convex Hulls. IEEE Transactions on Neural Networks. Volume 20. Number 7. pp. 1215--1220. July 2009.
 
Experimental Methods - Confusion Matrix, Cross-Validation, etc.
 
Pearson, K. Mathematical contributions to the theory of evolution. Dulau Co. 1904. Paper credited with introducing the confusion matrix.
 
Larson, S. C. The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology. Volume 22. Issue 1. pp. 45--55. 1931. Paper often cited as introducing cross-validation.
 
Newton Methods and Gradient Descent Results
Newton, I. De analysi per aequationes numero terminorum infinitas. Circulated as a letter to the Royal Society and continental mathematicians in 1669. It was rejected for publication by both Cambridge University Press and the Royal Society, and was eventually published in 1711 by William Jones.
 
Raphson, J. Analysis Aequationum Universalis. Churchill. 1690.
 
Debye, P. Näherungsformeln für die Zylinderfunktionen für große Werte des Arguments und unbeschränkt veränderliche Werte des Index. Mathematische Annalen. Volume 67. Issue 4. pp. 535--558. 1909. First paper to present gradient descent.
 
Robbins, H. and Monro, S. A Stochastic Approximation Method. Annals of Mathematical Statistics. Volume 22. pp. 400--407. 1951. Early paper analysing stochastic gradient descent.
 
Broyden, C. G. The convergence of a class of double-rank minimization algorithms. Journal of the Institute of Mathematics and Its Applications. Volume 6. pp. 76--90. 1970.
 
Fletcher, R. A New Approach to Variable Metric Algorithms. Computer Journal. Volume 13. Issue 3. pp. 317--322. 1970.
 
Goldfarb, D. A Family of Variable Metric Updates Derived by Variational Means. Mathematics of Computation. Volume 24. Issue 109. pp. 23--26. 1970.
 
Shanno, D. F. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation. Volume 24. Issue 111. pp. 647--656. 1970.
 
Nocedal, J. Updating Quasi-Newton Matrices with Limited Storage. Mathematics of Computation. Volume 35. Issue 151. pp. 773--782. 1980.
 
Bertsekas, D. P. and Tsitsiklis, J. N. Gradient Convergence in Gradient Methods with Errors. SIAM Journal on Optimization. Volume 10. Issue 3. pp. 627--642. 2000.
 
Back Propagation, Multi-Layer Learning Results
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature. Volume 323. pp. 533--536. October 1986.
 
Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding Deep Neural Networks with Rectified Linear Units. Electronic Colloquium on Computational Complexity. Revision 1 of Report Number 98. 2017.
 
Regularization
Ulusoy, I. and Bishop, C. M. Generative versus Discriminative Methods for Object Recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Volume 2. pp. 258--265. 2005.
 
Lasserre, J., Bishop, C. M., and Minka, T. P. Principled Hybrids of Generative and Discriminative Models. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Volume 1. pp. 87--94. 2006.
 
Convolutional Neural Networks
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Handwritten digit recognition with a back-propagation network. Proceedings of the 2nd International Conference on Neural Information Processing Systems. pp. 396--404. 1992.
 
Neural Network Applications
Watkins, C. and Dayan, P. Q-Learning. Machine Learning. Volume 8. Issue 3--4. pp. 279--292. 1992.
 
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A Neural Probabilistic Language Model. Journal of Machine Learning Research. Volume 3. pp. 1137--1155. 2003.
 
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with Deep Reinforcement Learning. In Deep Learning, Neural Information Processing Systems Workshop. 2013.