Compression for Prediction: The Deterministic Information Bottleneck

Daniel Strouse, Princeton University


Compression is a ubiquitous task for humans and machines alike. For example, humans must compress the vast stream of ongoing sensory information they receive into small changes in the brain that form memories, and machines must turn the large grids of colored pixels that form pictures into small files we can quickly share on the Web.

All forms of compression involve an implicit decision about what is relevant and what is not. In the example of human memory, our brains deem some details important enough to warrant attention, while others are forgotten. In the example of image compression, the algorithms we use deem some features essential to representing the subject matter well, while others are thrown away.

In these and many other cases, the criterion for “relevance” can be described as information about some other variable of interest. Let’s call X the signal we are compressing, T the compressed version, Y the other variable of interest, and I(T;Y) the “information” that T has about Y (formally, their mutual information). For human memory, X is current or past sensory input and T the brain’s internal representation (e.g. the activity of a neural population, or the strengths of a set of synapses). Though it is an open experimental question, Y might be the future state of the environment, so that I(T;Y) represents the predictive power of the memories formed. For image compression, X is the raw pixel values, T the compressed image parameters (e.g. the coefficients of a series of basis functions), and Y the perceptual experience a human has when looking at the original image. Thus, I(T;Y) represents how much the compressed image “looks like” the original (and I(X;Y)−I(T;Y) is the perceptual information “lost”).
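To make these quantities concrete, here is a minimal Python sketch (not part of the original essay; the function name and the toy distribution are illustrative assumptions) that computes the mutual information between two discrete variables from their joint distribution:

    import numpy as np

    def mutual_information(p_joint):
        # Mutual information in bits between the row and column variables
        # of a joint distribution, given as a 2D array whose entries sum to 1.
        p_row = p_joint.sum(axis=1, keepdims=True)   # marginal of the row variable
        p_col = p_joint.sum(axis=0, keepdims=True)   # marginal of the column variable
        nz = p_joint > 0                             # skip zero entries to avoid log(0)
        return float((p_joint[nz] * np.log2(p_joint[nz] / (p_row * p_col)[nz])).sum())

    # Toy example: T is a binary copy of Y, flipped 20% of the time.
    p_ty = np.array([[0.4, 0.1],
                     [0.1, 0.4]])
    print(mutual_information(p_ty))   # about 0.28 of a possible 1 bit

Quantities of exactly this form, I(T;Y) and I(X;T), appear throughout the discussion below.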

In summary, a good compression algorithm can be described as a tradeoff between the compression of a signal and the selective maintenance of the “relevant” bits that help predict another signal. This problem was formalized as the “information bottleneck” (IB) by Tishby, Pereira, and Bialek. Here we begin with a general introduction to IB. Next, we motivate and introduce an alternative formulation that we call the deterministic information bottleneck (DIB), along with a generalized method that interpolates between the IB and DIB solutions. Lastly, we compare the IB and DIB solutions on synthetic data to illustrate their differences.
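For readers who want the cost functions spelled out, here is a sketch (following the standard IB formulation and the DIB formulation just described; each functional is minimized over encodings q(t|x), with β ≥ 0 setting the tradeoff between compression and prediction):

    \begin{align}
      L_{\mathrm{IB}}[q(t \mid x)]  &= I(X;T) - \beta\, I(T;Y) \\
      L_{\mathrm{DIB}}[q(t \mid x)] &= H(T) - \beta\, I(T;Y) \\
      L_{\alpha}[q(t \mid x)]       &= H(T) - \alpha\, H(T \mid X) - \beta\, I(T;Y)
    \end{align}

Because I(X;T) = H(T) − H(T|X), setting α = 1 in the interpolating family recovers IB, while the α → 0 limit yields DIB; penalizing H(T) rather than I(X;T) is what makes the optimal DIB encodings deterministic, hence the name.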

Abstract Author(s): D.J. Strouse, David Schwab