Welcome to the Metrics and Models homepage!
Talk Title: How can we automate interpretability to help people make better and faster decisions?
Abstract: Current interpretability methods have made substantial progress in explaining model internals, but they rarely connect understanding to action. We propose a research agenda for automated interpretability-driven model auditing and control: a system where domain experts can query a model’s behavior, receive explanations grounded in their expertise, and instruct targeted corrections—all without needing to understand how AI systems work internally. This talk, based on my research agenda, covers eight interrelated research questions that form a complete pipeline: from translating queries into testable hypotheses about model internals, to localizing capabilities in specific components, generating human-readable explanations, and performing surgical edits with verified outcomes. Our approach is distinguished by explicit hypothesis generation and testing rather than purely learned mappings from latent space to language, compositional concept graphs that capture how capabilities combine and interact, domain-grounded explanations that enable expert oversight, and human-in-the-loop intervention with predicted side effects. We evaluate the framework on three safety applications: chain-of-thought faithfulness verification, emergent capability prediction during training, and capability composition mapping. The talk will focus on high-level approaches and ideas, aiming for wide reach and accessibility.
Bio: Fazl Barez is a Principal Investigator at the University of Oxford, where he works on AI safety, interpretability, and governance. His work focuses on understanding how AI systems make decisions and ensuring they remain safe and beneficial.