Suppose one wishes to accurately recover the set of eigenvalues of the covariance matrix of a distribution, given access to samples from the distribution. To what extent can this set be accurately approximated given an amount of data that is sublinear in the dimension of the space? The naive empirical estimate has poor performance when the number of samples is linear or sub-linear in the dimension of the data. In the "large sample, large dimension" setting, we proposed an efficient and information theoretically near-optimal algorithm to learn the moments of the covariance spectral distribution. Further we show that given the first k moments of a distribution, we can pin down the distribution in Earthmover distance up to an error of O(1/k). These two results combined allow us to efficiently and accurately learn the spectrum of the underlying covariance matrix, yielding a consistent spectrum estimator even with sub-linear number of samples.
Weihao is a PhD student in the Computer Science Department at Stanford. His current research focuses on applying ideas and insights from probability theory and random matrix theory to yield algorithms for accurately inferring information about large/complex distribution, given surprisingly little data.