All Paths Don't Lead to Rome: Interpreting Reasoning Pathways in Language Models

Mansi Sakarvadia, University of Chicago

Answering multi-hop reasoning questions requires retrieving information from diverse sources and synthesizing it coherently. However, existing Large Language Models (LLMs) often struggle to exhibit such reasoning consistently. To address this problem, we propose an approach to pinpoint and rectify multi-hop reasoning failures within the attention heads of LLMs through targeted interventions. This research analyzes the inner workings of the GPT-2 model, examining its activations at each layer when presented with multi-hop prompts compared to their single-hop counterparts (e.g., a multi-hop prompt that refers to "the first president" versus a single-hop counterpart that names "George Washington" directly). By scrutinizing and contrasting the latent vocabulary representations of the outputs of each of GPT-2's attention heads across a range of input prompts, we discern how the reasoning pathways of single-hop and multi-hop prompts differ. Furthermore, we introduce a mechanism that allows users to inject pertinent, prompt-specific memories at chosen locations within an LLM. This strategy enables the LLM to incorporate relevant information during its reasoning process, improving the quality of multi-hop prompt completions. Our findings not only advance the field of interpretability but also open new avenues for research in knowledge retrieval and editing. By reverse engineering the inner workings of LLMs and enabling targeted corrections, we pave the way for future work on improving the transparency and effectiveness of these models on complex reasoning tasks.
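To make the two ideas above concrete (decoding attention-head outputs into vocabulary space, and injecting a prompt-specific memory at a chosen layer), here is a minimal sketch using the Hugging Face `transformers` GPT-2 implementation. It is illustrative only and not the authors' released code: the choice to hook `c_proj` to recover per-head contributions, the pooling of the memory phrase by summing its token embeddings, and the injection position and `scale` value are all assumptions made for the example.

```python
"""Illustrative sketch only (not the authors' released code)."""
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

N_HEAD = model.config.n_head
D_HEAD = model.config.n_embd // N_HEAD


def head_vocab_projections(prompt: str, layer: int, top_k: int = 5):
    """Decode each attention head's output at `layer` into its top-k vocabulary tokens."""
    ids = tok(prompt, return_tensors="pt").input_ids
    cache = {}

    # The input to c_proj is the concatenation of all per-head context vectors,
    # so capturing it lets us split the layer's attention output head by head.
    def grab_input(module, inputs):
        cache["ctx"] = inputs[0].detach()  # shape: (1, seq_len, n_embd)

    attn = model.transformer.h[layer].attn
    handle = attn.c_proj.register_forward_pre_hook(grab_input)
    with torch.no_grad():
        model(ids)
    handle.remove()

    ctx = cache["ctx"][0, -1].view(N_HEAD, D_HEAD)  # last-token context, per head
    W = attn.c_proj.weight                          # Conv1D weight: (n_embd, n_embd)

    tops = []
    for h in range(N_HEAD):
        # Head h's additive contribution to the residual stream at the last token.
        contrib = ctx[h] @ W[h * D_HEAD:(h + 1) * D_HEAD, :]
        # Read it out in vocabulary space via the final layer norm and unembedding.
        logits = model.lm_head(model.transformer.ln_f(contrib))
        tops.append([tok.decode([t]) for t in logits.topk(top_k).indices.tolist()])
    return tops


def inject_memory(prompt: str, memory: str, layer: int, scale: float = 4.0):
    """Add (scaled) embeddings of a memory phrase to one layer's attention output."""
    ids = tok(prompt, return_tensors="pt").input_ids
    mem_ids = tok(memory, return_tensors="pt").input_ids[0]
    # Pooling the memory phrase by summing its token embeddings is a simplification.
    mem_vec = model.transformer.wte.weight[mem_ids].sum(dim=0).detach()

    def add_memory(module, inputs, output):
        hidden = output[0].clone()
        hidden[:, -1, :] += scale * mem_vec  # inject at the final prompt position
        return (hidden,) + tuple(output[1:])

    handle = model.transformer.h[layer].attn.register_forward_hook(add_memory)
    with torch.no_grad():
        next_logits = model(ids).logits[0, -1]
    handle.remove()
    return tok.decode([next_logits.argmax().item()])
```

As a hypothetical usage, `head_vocab_projections("The first president of the United States was", layer=9)` would list the tokens each head at that layer most strongly "writes" toward, and `inject_memory` with the memory phrase "George Washington" would show whether adding that information at the same layer shifts the next-token prediction; the specific layer and prompt here are placeholders, not results from the study.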

Abstract Author(s): Mansi Sakarvadia, Aswathy Ajith, Arham Khan, Nathaniel Hudson, Kyle Chard, Ian Foster