Robustly finding the needles in a haystack of high-dimensional data

Eric Chi, Rice University


From the Netflix challenge to microarray assays, high-dimensional, low-sample-size data is in. In the latter case, for example, it is now possible to comprehensively query a patient's genetic profile and transcriptional activity. Patterns in these profiles can help refine subtypes of a disease according to sensitivity to treatment options or identify previously unknown genetic components of a disease's pathogenesis. The immediate statistical challenge is finding those patterns when the number of predictors or model parameters far exceeds the number of samples. To that end, L1-penalized maximum-likelihood model fitting has been very successful at producing “sparse” models through continuous variable selection. Nonetheless, while these penalized likelihood approaches have proved their worth at recovering parsimonious models, noticeably less attention has been given to extending them to handle outliers in high-dimensional data. Outliers can bias many standard likelihood-based procedures. This well-known fact warrants more attention, however, because that bias can materially affect variable selection procedures based on the L1 penalty. In this talk, we discuss the devilry outliers can inflict upon variable selection procedures based on maximizing an L1-penalized likelihood and introduce a robust modification to ward off the bad influence of outliers.

Abstract Author(s): Eric C. Chi