CMSC848R: Selected Topics in Information Processing; Language Model Interpretability

Fall 2025
Tuesdays and Thursdays, 12:30pm to 1:45pm
AJC (Clark Hall) 2132



Sarah Wiegreffe

Instructor

lm-interp@umd.edu (to reach both of us)

Ming Li

Teaching Assistant


Office hours:

  • Instructor: Sarah Wiegreffe
    Pronouns: she/her
    Office Hours: 1:45-2:45pm Thurs (right after class in IRB 4210; starting 09/11)
  • Teaching Assistant: Ming Li
    Pronouns: he/his
    Office Hours: by appointment

Resources:

[Syllabus] [Piazza] [Presentation Signup and Paper Reading List]
[Course feedback form]

Course description:

This course focuses on state-of-the-art methods for interpreting language models and understanding their learned behaviors. We will discuss approaches centered on both understanding models’ internal mechanisms/representations and attributing behaviors back to the training data. We will focus on model tendencies including hallucination, factuality, memorization, and explanation/reasoning elicitation. If time allows, we will discuss recent developments in ameliorating learned behaviors, such as model editing, unlearning, and steering.

We will examine current state-of-the-art methods, their limitations, and ongoing efforts to address those limitations. Through paper discussions, you will gain a deeper understanding of the latest developments in the field and contribute to ongoing research in this area.

Schedule

Date | Notes & Deadlines | Topic
September 2 (Tues) | Slides | Intro + Logistics
September 4 (Thurs) | Deadline to submit prospective (11:59pm); Slides | LM Background
September 9 (Tues) | Deadline to sign up for presentation slots (11:59pm); Slides | LM Background continued + Interpretability Overview
September 11 (Thurs) | | Behavioral Analysis
September 16 (Tues) | | Training Data Attribution-- Overview + Contributive Methods
September 18 (Thurs) | | Training Data Attribution-- Corroborative Methods
September 23 (Tues) | Deadline to submit Project Group Size Request Form if requesting a project group size of 1 (11:59pm) | Localization of Internal Mechanisms-- Probing
September 25 (Thurs) | P0 Due | Localization of Internal Mechanisms-- Causal Attribution and Patching Part 1
September 30 (Tues) | | Localization of Internal Mechanisms-- Causal Attribution and Patching Part 2
October 2 (Thurs) | Logistics Slides | Localization of Internal Mechanisms-- Geometry of Hidden States (specifically, linearity)
October 7 (Tues) | | Localization of Internal Mechanisms-- Neuron-level Analysis
October 9 (Thurs) | |
October 14 (Tues) | No Class (Fall Break) |
October 16 (Thurs) | | Localization of Internal Mechanisms-- Superposition & other units of analysis for probing Part 1 (Sparse Autoencoders)
October 21 (Tues) | | Localization of Internal Mechanisms-- Superposition & other units of analysis for probing Part 2 (Advancements on Sparse Autoencoders)
October 23 (Thurs) | | Localization of Internal Mechanisms-- Circuits
October 28 (Tues) | Class on Zoom | Localization of Internal Mechanisms-- Attention Mechanisms
October 30 (Thurs) | No class or office hour (Sarah traveling). Instead, a short written assignment on today's readings will be due. | Localization of Internal Mechanisms-- MLPs and Factual Recall
November 4 (Tues) | | Localization of Internal Mechanisms/Textual Explanations-- Using LMs to generate textual descriptions of interpretations
November 6 (Thurs) | | Textual Explanations-- Faithfulness of Chain of Thought Part 1
November 11 (Tues) | Deadline for Intermediate Project Reports | Textual Explanations-- Faithfulness of Chain of Thought Part 2
November 13 (Thurs) | | Training Dynamics
November 18 (Tues) | | Applications/Evaluations-- Updating weights (finetuning + rank reduction)
November 20 (Thurs) | | Applications/Evaluations-- Updating weights (unlearning)
November 25 (Tues) | | Applications/Evaluations-- Updating representations (steering) Part 1
November 27 (Thurs) | No Class or Office Hour (Thanksgiving Break) |
December 2 (Tues) | | Applications/Evaluations-- Updating representations (steering) Part 2
December 4 (Thurs) | No Class or Office Hour (Sarah traveling) |
December 9 (Tues) | Retrospective due (11:59pm) | Applications/Evaluations-- Safety
December 11 (Thurs) | Last Day of Class | Retrospective/Recap
December 19 (Friday) | Deadline (11:59pm) for final project reports (in lieu of final exam) |
Note: This schedule is tentative and subject to change as necessary; monitor the course ELMS page for current deadlines. In the unlikely event of a prolonged university closing or an extended absence from the university, the course schedule, deadlines, and assignments will be adjusted based on the duration of the closing and the specific dates missed.