- Sparse Autoencoders for a More Interpretable RLHF
  Extending Anthropic's recent monosemanticity results toward a new, more interpretable way to fine-tune.
- Examining Llama 2's Propensity for Following the System Prompt
  Large language models seem to 'forget' their system prompt over the course of a long conversation. Can we measure this effect?