AI SAFETY FUNDAMENTALS
ABOUT THE COURSE
XLab’s AI Safety Fundamentals course is a seven-week reading group that builds a thorough grounding in technical AI safety research, governance, and policy. Students meet for 90 minutes each week over dinner to read and discuss key papers, examining both technical safety challenges and broader policy considerations such as AI governance frameworks and regulatory approaches.
All backgrounds are welcome. Applications remain open until the end of week one each quarter.

Week 1: Scaling and Instrumental Convergence
Explore the implications of increasingly intelligent systems, focusing on scaling laws, superintelligence, and instrumental convergence.
- The case for taking AI seriously as a threat to humanity
We read only through section 5. This Vox article from 2020 ages remarkably well, laying out the key arguments for why we should consider AI a threat to humanity.
- Transformer Language Models (Video) (0:00 – 11:30)
Watch 0:00 – 11:30 for an accessible introduction to scaling laws in language models; a toy curve-fitting sketch follows this week’s readings.
- Optional: The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents
Nick Bostrom’s influential work on power-seeking and instrumental convergence in AI systems.
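To ground the scaling-law discussion, here is a minimal, self-contained sketch (not taken from the video) that fits a power law of the form L(N) = a * N^(-b) + c to hypothetical loss-versus-parameter-count data; all numbers below are invented for illustration.

    # Illustrative only: fit a power law L(N) = a * N**(-b) + c to synthetic
    # loss-versus-parameter-count data, mimicking a scaling-law plot.
    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(n, a, b, c):
        return a * n ** (-b) + c

    rng = np.random.default_rng(0)
    sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # hypothetical model sizes (parameters)
    losses = power_law(sizes, 8.0, 0.07, 1.7) + rng.normal(0, 0.01, sizes.size)

    (a, b, c), _ = curve_fit(power_law, sizes, losses, p0=[5.0, 0.1, 1.0])
    print(f"fitted exponent b ≈ {b:.3f}")
    print(f"extrapolated loss at 1e11 parameters ≈ {power_law(1e11, a, b, c):.3f}")

On a log-log plot such a power law looks roughly like a straight line, which is the characteristic shape of scaling-law figures.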
Week 2: Outer Alignment
Examine the challenges in correctly specifying training goals for AI systems.
- Specification Gaming: How AI Can Turn Your Wishes Against You
A fun video from 2023 that discusses the problem of specification gaming.
- Specification gaming: the flip side of AI ingenuity
A comprehensive overview of outer alignment issues from DeepMind researchers.
- Learning from human preferences
Explore how alignment researchers have attempted to address goal-specification problems by learning from human preferences; a minimal preference-learning sketch follows this list.
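Because the preference-learning reading centers on fitting a reward model to pairwise comparisons, the sketch below shows that core step under deliberately simple assumptions: a linear reward model, synthetic noiseless preferences, and a Bradley-Terry (logistic) loss trained by plain gradient descent. Everything here is hypothetical and much simpler than the full pipeline in the reading.

    # Illustrative sketch: learn a linear reward model from pairwise preferences
    # by minimizing the Bradley-Terry (logistic) loss. Synthetic data only.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 4
    true_w = rng.normal(size=dim)      # the "human" reward, unknown to the learner
    pairs = [(rng.normal(size=dim), rng.normal(size=dim)) for _ in range(500)]
    labels = np.array([1.0 if true_w @ a > true_w @ b else 0.0 for a, b in pairs])

    w = np.zeros(dim)                  # learned reward parameters
    lr = 0.1
    for _ in range(200):
        grad = np.zeros(dim)
        for (a, b), y in zip(pairs, labels):
            p = 1.0 / (1.0 + np.exp(-(w @ (a - b))))   # P(a preferred over b)
            grad += (p - y) * (a - b)
        w -= lr * grad / len(pairs)

    cos = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
    print(f"cosine similarity between learned and true reward: {cos:.2f}")

In the actual pipeline the learned reward model then provides the training signal for reinforcement learning on the policy.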
Week 3: Deception, Inner Alignment & Mechanistic Interpretability
Investigate the concept of mesa-optimizers and the potential for deceptive behavior in AI systems.
- Alignment Faking
Anthropic’s research on alignment faking, where LLMs strategically attempt to preserve their values during training.
- Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases
An informal note by Chris Olah on some intuitions behind mechanistic interpretability and the choice of an interpretable basis; a toy illustration of the basis question follows this list.
- Deceptive Alignment
An in-depth exploration of deceptive alignment and pseudo-alignment, providing insights into inner alignment issues.
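As a warm-up for the basis discussion in Olah’s note, the toy example below shows why the neuron basis need not be the interpretable one: two sparse, hypothetical features are stored along rotated directions, so any single coordinate (a "neuron") mixes both features, while projecting onto the right direction recovers one of them exactly. The setup is invented purely for illustration.

    # Illustrative only: features stored in a rotated basis make individual
    # "neurons" polysemantic, while the right projection recovers each feature.
    import numpy as np

    rng = np.random.default_rng(0)
    features = rng.binomial(1, 0.1, size=(1000, 2)) * rng.random((1000, 2))  # sparse features
    theta = np.pi / 6
    rotate = np.array([[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta),  np.cos(theta)]])
    activations = features @ rotate.T   # what we observe: coordinates in a rotated basis

    def corr(x, y):
        return np.corrcoef(x, y)[0, 1]

    print("neuron 0 vs feature 0:", round(corr(activations[:, 0], features[:, 0]), 2))
    print("neuron 0 vs feature 1:", round(corr(activations[:, 0], features[:, 1]), 2))
    print("feature-0 projection vs feature 0:",
          round(corr(activations @ rotate[:, 0], features[:, 0]), 2))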
Week 4: AI Security
Explore various AI security issues including jailbreaks, adversarial examples, and potential vulnerabilities.
- A Playbook for Securing AI Model Weights
A comprehensive playbook for protecting AI models from theft and misuse.
- Four Fallacies of AI Cybersecurity
An argument that AI cybersecurity must learn from past security lessons, not reinvent them.
- Stealing Part of a Production Language Model (only abstract)
How researchers extract embedding layers from production language models through inexpensive API attacks; a sketch of the linear-algebra observation behind the attack follows this list.
- Optional: Sleight of hand: How China weaponizes software vulnerabilities
China’s new regulations force companies to report software vulnerabilities to government agencies.
- Optional: Ironing Out the Squiggles
A paper review post about adversarial examples, their implications, and potential solutions.
- Optional: SolidGoldMagikarp – tokens that jailbreak LLMs
Explore a famous case of LLM jailbreaking and its implications for AI security.
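For the model-stealing abstract, the simulation below (no real API is queried, and all sizes are invented) illustrates the linear-algebra observation the attack builds on: full-vocabulary logits are a linear image of a hidden state of width h, so a matrix of logit vectors collected from many queries has numerical rank of roughly h, revealing the model’s hidden dimension.

    # Illustrative simulation: the stack of logit vectors has rank ~hidden size.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab, hidden = 5000, 64
    W_out = rng.normal(size=(vocab, hidden))    # unknown output projection

    # Pretend each "API query" returns logits = W_out @ hidden_state.
    logits = np.stack([W_out @ rng.normal(size=hidden) for _ in range(200)])

    singular_values = np.linalg.svd(logits, compute_uv=False)
    estimated_hidden = int(np.sum(singular_values > 1e-6 * singular_values[0]))
    print("estimated hidden dimension:", estimated_hidden)   # expect 64

The paper goes further and recovers the final projection layer itself up to symmetries, but this rank observation is the entry point.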
Week 5: AI Governance
Examine the challenges and approaches to governing AI development and deployment.
- Open Problems in Technical AI Governance
An overview of open problems in technical AI governance, including methods for assessing and enforcing controls on AI systems.
- Certified Safe: A Schematic for Approval Regulation of Frontier AI
A proposal for FDA-style approval regulation for frontier AI systems.
Week 6: Criticisms and Counter-Arguments
Examine critiques of AI safety concerns and alternative perspectives on AI development.
- Will AI kill all of us? | Marc Andreessen and Lex Fridman (00:00 – 10:30)
Listen to 00:00 – 10:30 for a discussion on criticisms of AI safety concerns.
- Terrorism, Tylenol, and dangerous information
A useful reading for understanding infohazards in AI development.
- Against Almost Every Theory of Impact of Interpretability
A critical examination of interpretability approaches in AI alignment.
Week 7: Further Reading and Discussion
Explore various AI alignment approaches and dive deeper into specific areas of interest. Fellows will choose one of the optional readings to focus on for the week.
- Optional: A Brief Introduction to some Approaches to AI Alignment
An overview of various AI alignment approaches, providing a foundation for further exploration.
- Optional: Why Agent Foundations? An Overly Abstract Explanation
A deeper dive into the concept of agent foundations in AI alignment.
- Optional: Goal Misgeneralisation: Why Correct Specifications Aren’t Enough For Correct Goals
An in-depth exploration of inner alignment issues and goal misgeneralization.
- Optional: Toy Models of Superposition
A technical exploration of interpretability in neural networks.
- Optional: Steering Llama-2 with contrastive activation additions
An examination of activation-steering techniques for controlling large language models; a minimal sketch of the contrastive idea follows below.
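For the activation-steering reading, here is a minimal sketch of the contrastive idea using simulated activations (no model is loaded, and the behavior direction is invented): average the activation difference over contrasting prompt pairs to form a steering vector, then add a scaled copy of it to activations at generation time.

    # Illustrative only: contrastive activation addition on simulated activations.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_pairs = 512, 64

    # Hypothetical residual-stream activations at one layer for contrasting
    # prompt pairs (behavior present vs. behavior absent).
    behavior_direction = np.ones(d_model) / np.sqrt(d_model)
    acts_negative = rng.normal(size=(n_pairs, d_model))
    acts_positive = (acts_negative + 3.0 * behavior_direction
                     + 0.1 * rng.normal(size=(n_pairs, d_model)))

    steering_vector = (acts_positive - acts_negative).mean(axis=0)

    # The averaged difference recovers the underlying behavior direction.
    cos = steering_vector @ behavior_direction / np.linalg.norm(steering_vector)
    print(f"cosine(steering vector, behavior direction) = {cos:.2f}")

    def steer(activation, vector, scale=2.0):
        """Add a scaled steering vector to an activation during generation."""
        return activation + scale * vector

    steered = steer(rng.normal(size=d_model), steering_vector)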