Abstract: Given society’s increasing reliance on data, collecting data and processing it into useful information is a technical problem of growing importance and, perhaps paradoxically, a critical bottleneck in many data science and machine learning applications. My research focuses on designing algorithms that push the limits of both statistical and computational efficiency. In particular, my work tackles the divide between the theory and practice of data science, which exists even for the most basic statistical problems, including mean and (co)variance estimation. Conventional methods such as the sample mean, while supported by theoretical results under strong assumptions, are often brittle in the presence of extreme data points. To counter such deficiencies, practitioners often resort to ad hoc, unprincipled “outlier removal” heuristics, revealing a marked gap between theory and practice even for these fundamental problems.
In this talk, I will describe my work towards building a new toolbox of optimal statistical primitives that bridges the theory-practice divide. I will specifically highlight three works: A) a statistically optimal and computationally efficient 1-dimensional mean estimator whose estimation error is optimal even in the leading multiplicative constant, under bare-minimum distributional assumptions; B) a rather different but also optimal mean estimator for the “very high-dimensional” regime; and C) a recent result on robustly clustering Gaussian mixtures based on their covariances, even in the presence of adversarial data corruption. To conclude the talk, I will discuss my vision for the new theory and toolbox, which serves as a blueprint for my long-term research.
Bio: Jasper Lee is a postdoctoral research associate at the University of Wisconsin-Madison, mentored by Ilias Diakonikolas in the Department of Computer Sciences, and also affiliated with the Institute for Foundations of Data Science. He completed his PhD at Brown University, advised by Paul Valiant.
His research interests are broadly in the foundations of data science, aiming to design practical, data-efficient, and computationally efficient algorithms for a variety of statistical applications.
His work is partially supported by a Croucher Fellowship for Postdoctoral Research.