Auditory Texture Synthesis From Task-optimized Convolutional Neural Networks

Jenelle Feather, Massachusetts Institute of Technology


Textures are signals generated by superpositions of large numbers of similar elements and are distinguished by homogeneity in time or space. As such, textures are believed to be represented with statistics that average information across space or time. Sound textures are generated by rain, fire, swarms of insects, and the like, and have been suggested to be represented in the auditory system with time-averaged statistics. In a typical auditory texture-generation model, a set of time-averaged statistics is measured from a natural sound, and a synthetic sound is produced that matches those statistics. A model that replicates perceptual representations of auditory textures should produce synthetic sounds that are perceived like the natural sounds to which they are matched. The statistics in classic auditory texture models (and in visual texture models) are often chosen ad hoc, and statistics from multiple stages of the model must be included to maximize the perceptual similarity between the synthesized sound and the original signal. We found that auditory textures generated simply from the time-averaged power of the first-layer activations of a task-optimized convolutional neural network replicated the synthesis quality of the best previous auditory texture model. Unlike textures generated from traditional models, the textures from task-optimized filters did not require statistics from earlier stages of the sensory model to be recognizable or realistic. Further, the textures generated from the task-optimized CNN filters were more realistic than textures generated from a hand-engineered spectro-temporal model of primary auditory cortex or from randomly initialized filters. These results demonstrate that better sensory models can be obtained by task-optimizing sensory representations.

Abstract Author(s): Jenelle Feather, Josh H. McDermott
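
The synthesis procedure described in the abstract, measuring time-averaged statistics on a natural sound and adjusting a noise signal until its statistics match, can be sketched in a few lines. The sketch below is illustrative only: the filter bank, kernel size, signal length, and optimizer settings are hypothetical placeholders, not the authors' task-optimized network or their actual synthesis pipeline.

```python
# Minimal sketch of statistics-matching texture synthesis with a 1-D
# convolutional "first layer". All parameters here are assumptions for
# illustration; a real run would load trained (task-optimized) filters.
import torch
import torch.nn.functional as F

def first_layer(x, weights):
    """Apply a bank of 1-D convolutional filters to a waveform [batch, 1, time]."""
    return F.relu(F.conv1d(x, weights, padding=weights.shape[-1] // 2))

def time_averaged_power(activations):
    """Time-averaged power per channel: mean over time of squared activations."""
    return (activations ** 2).mean(dim=-1)

torch.manual_seed(0)
# Hypothetical random filter bank standing in for the task-optimized first layer.
n_filters, kernel_size = 32, 201
weights = torch.randn(n_filters, 1, kernel_size) * 0.01

# "Natural" target sound (noise here, purely for illustration) and its statistics.
target = torch.randn(1, 1, 20000)
with torch.no_grad():
    target_stats = time_averaged_power(first_layer(target, weights))

# Initialize the synthetic signal from noise and match the target statistics
# by gradient descent on the waveform itself.
synth = torch.randn(1, 1, 20000, requires_grad=True)
optimizer = torch.optim.Adam([synth], lr=1e-2)
for step in range(500):
    optimizer.zero_grad()
    stats = time_averaged_power(first_layer(synth, weights))
    loss = F.mse_loss(stats, target_stats)
    loss.backward()
    optimizer.step()
```

In an actual experiment the loss would be computed from the first-layer activations of the trained network and the optimization run until the statistics converge; the fixed number of Adam steps and random filters above simply demonstrate the matching mechanics.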