Ying Xiao Dissertation Defense

Monday, August 4, 2014 - 9:30am
Klaus 2100
Title: New Tools for Unsupervised Learning
Advisor: Dr. Santosh Vempala, School of Computer Science
Committee: Dr. Justin Romberg, School of Electrical and Computer Engineering
  Dr. Arkadi Nemirovski, School of Industrial and Systems Engineering
  Dr. Le Song, Computational Science and Engineering
  Dr. Nina Balcan, School of Computer Science, Carnegie Mellon University
Reader: Dr. Nina Balcan, School of Computer Science, Carnegie Mellon University

In an unsupervised learning problem, one is given an unlabelled dataset and hopes to find some hidden structure; the prototypical example is clustering data. Such problems arise naturally in machine learning and statistics, but also in signal processing, theoretical computer science, and any number of quantitative scientific fields. The distinguishing feature of unsupervised learning is that there are no privileged variables or labels which are particularly informative, and thus the greatest challenge is often to distinguish what is relevant from what is irrelevant in any particular dataset or problem.

In the course of this thesis, we study a number of problems spanning the breadth of unsupervised learning. We advance the state of the art for Gaussian mixtures and independent component analysis (where we solve the open problem of underdetermined ICA), and we formulate and solve a feature selection/dimension reduction model. Throughout, our goal is to give finite sample complexity bounds for our algorithms -- essentially the strongest type of quantitative bound one can prove for such algorithms. Some of our algorithmic techniques also turn out to be very efficient in practice.

Our major technical tool is tensor spectral decomposition: tensors are generalisations of matrices, and often allow access to the "fine structure" of data. Thus, they are often the right tools for unravelling the hidden structure in unsupervised learning settings. However, naive generalisations of matrix algorithms to tensors run into NP-hardness results almost immediately; to solve our problems, we are therefore obliged to develop two tensor decompositions (with robust analyses) from scratch. Both decompositions are efficient, and can be viewed as polynomial-time generalisations of PCA to tensors.
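To give a flavour of how tensor decomposition recovers hidden structure where matrix methods cannot, here is a toy sketch of one classical polynomial-time approach: Jennrich's simultaneous-diagonalisation algorithm for a rank-k third-order tensor. This is an illustrative textbook technique, not the specific decompositions developed in the thesis; all names, dimensions, and the random data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3  # ambient dimension and tensor rank (toy sizes)

# Hidden factors: T = sum_r a_r (x) b_r (x) c_r, columns in generic position.
A = rng.standard_normal((n, k))
B = rng.standard_normal((n, k))
C = rng.standard_normal((n, k))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

# Contract the third mode against two random vectors:
#   M1 = A diag(C^T x) B^T,  M2 = A diag(C^T y) B^T.
x, y = rng.standard_normal(n), rng.standard_normal(n)
M1 = np.einsum('ijk,k->ij', T, x)
M2 = np.einsum('ijk,k->ij', T, y)

# M1 pinv(M2) = A diag(ratios) pinv(A): its eigenvectors with nonzero
# eigenvalues are the columns of A, up to scaling and permutation.
eigvals, eigvecs = np.linalg.eig(M1 @ np.linalg.pinv(M2))

# Keep the k eigenvectors with largest |eigenvalue| and compare with A.
V = eigvecs[:, np.argsort(-np.abs(eigvals))[:k]].real
V = V / np.linalg.norm(V, axis=0)
An = A / np.linalg.norm(A, axis=0)
sims = np.abs(An.T @ V)  # cosine similarities: each row should have a ~1 entry
```

The key point, which the thesis generalises and makes robust, is that while general tensor problems are NP-hard, such structured decompositions can be computed in polynomial time using only matrix eigendecompositions.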