CMSC848R: Selected Topics in Information Processing; Language Model Interpretability
Fall 2025
Tuesdays and Thursdays, 12:30pm to 1:45pm
AJC (Clark Hall) 2132
Office hours:
- Instructor: Sarah Wiegreffe
Pronouns: she/her
Office Hours: 1:45-2:45pm Thurs (right after class, in IRB 4210; starting 09/11)
- Teaching Assistant: Ming Li
Pronouns: he/his
Office Hours: by appointment
Resources:
[Syllabus] [Piazza] [Presentation Signup and Paper Reading List] [Course feedback form]
Course description:
This course focuses on state-of-the-art methods for interpreting language models and understanding their learned behaviors. We will discuss approaches centered on both understanding models’ internal mechanisms/representations and attributing behaviors back to the training data. We will focus on model tendencies including hallucination, factuality, memorization, and explanation/reasoning elicitation. If time allows, we will discuss recent developments in ameliorating learned behaviors, such as model editing, unlearning, and steering.
We will examine these methods, their limitations, and the ongoing efforts to address those limitations. Through paper discussions, you will gain a deeper understanding of the latest developments in the field and contribute to ongoing research in this area.
Schedule
| Date | Notes & Deadlines | Topic |
|---|---|---|
| September 2 (Tues) | Slides | Intro + Logistics |
| September 4 (Thurs) | Deadline to submit prospective (11:59pm); Slides | LM Background |
| September 9 (Tues) | Deadline to sign up for presentation slots (11:59pm); Slides | LM Background continued + Interpretability Overview |
| September 11 (Thurs) | | Behavioral Analysis |
| September 16 (Tues) | | Training Data Attribution-- Overview + Contributive Methods |
| September 18 (Thurs) | | Training Data Attribution-- Corroborative Methods |
| September 23 (Tues) | Deadline to submit Project Group Size Request Form if requesting a project group size of 1 (11:59pm) | Localization of Internal Mechanisms-- Probing |
| September 25 (Thurs) | P0 Due | Localization of Internal Mechanisms-- Causal Attribution and Patching Part 1 |
| September 30 (Tues) | | Localization of Internal Mechanisms-- Causal Attribution and Patching Part 2 |
| October 2 (Thurs) | Logistics Slides | Localization of Internal Mechanisms-- Geometry of Hidden States (specifically, linearity) |
| October 7 (Tues) | | Localization of Internal Mechanisms-- Neuron-level Analysis |
| October 9 (Thurs) | | |
| October 14 (Tues) | No Class (Fall Break) | |
| October 16 (Thurs) | | Localization of Internal Mechanisms-- Superposition & other units of analysis for probing Part 1 (Sparse Autoencoders) |
| October 21 (Tues) | | Localization of Internal Mechanisms-- Superposition & other units of analysis for probing Part 2 (Advancements on Sparse Autoencoders) |
| October 23 (Thurs) | | Localization of Internal Mechanisms-- Circuits |
| October 28 (Tues) | Class on Zoom | Localization of Internal Mechanisms-- Attention Mechanisms |
| October 30 (Thurs) | No class or office hour (Sarah traveling). Instead, a short written assignment on today's readings will be due. | Localization of Internal Mechanisms-- MLPs and Factual Recall |
| November 4 (Tues) | | Localization of Internal Mechanisms/Textual Explanations-- Using LMs to generate textual descriptions of interpretations |
| November 6 (Thurs) | | Textual Explanations-- Faithfulness of Chain of Thought Part 1 |
| November 11 (Tues) | Deadline for Intermediate Project Reports | Textual Explanations-- Faithfulness of Chain of Thought Part 2 |
| November 13 (Thurs) | | Training Dynamics |
| November 18 (Tues) | | Applications/Evaluations-- Updating weights (finetuning + rank reduction) |
| November 20 (Thurs) | | Applications/Evaluations-- Updating weights (unlearning) |
| November 25 (Tues) | | Applications/Evaluations-- Updating representations (steering) Part 1 |
| November 27 (Thurs) | No Class or Office Hour (Thanksgiving Break) | |
| December 2 (Tues) | | Applications/Evaluations-- Updating representations (steering) Part 2 |
| December 4 (Thurs) | No Class or Office Hour (Sarah traveling) | |
| December 9 (Tues) | Retrospective due (11:59pm) | Applications/Evaluations-- Safety |
| December 11 (Thurs) | Last Day of Class | Retrospective/Recap |
| December 19 (Friday) | Deadline (11:59pm) for final project reports (in lieu of final exam) | |