
Research directions Open Phil wants to fund in technical AI safety

The AI Alignment Forum

Motivation: Control evaluations are an attempt to conservatively evaluate the safety of protocols like AI-critiquing-AI (e.g., …). We prefer this definition of success at unlearning over less conservative metrics like those in Lynch et al. because we think this definition more clearly distinguishes unlearning from safety training/robustness.
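The intuition behind the conservative definition is that genuinely unlearned knowledge should stay gone even when an adversary actively tries to recover it (for instance, via light fine-tuning), whereas safety training typically only suppresses outputs and rebounds under such attacks. Below is a minimal sketch of that distinction; it is not Open Phil's or Lynch et al.'s actual evaluation code, and the model, evaluation, and attack helpers are all hypothetical placeholders:

```python
from typing import Any, Callable

def is_conservatively_unlearned(
    model: Any,                              # the post-unlearning model (hypothetical)
    eval_forget_accuracy: Callable[[Any], float],  # accuracy on the forgotten material
    attack: Callable[[Any], Any],            # adversarial re-elicitation, e.g. fine-tuning
    threshold: float = 0.05,                 # illustrative "near chance" cutoff
) -> bool:
    """Pass only if the forgotten knowledge stays gone *after* an attack."""
    if eval_forget_accuracy(model) > threshold:
        return False                         # knowledge still directly accessible
    attacked_model = attack(model)           # adversary tries to recover the knowledge
    # Safety training tends to rebound here; genuine unlearning should not.
    return eval_forget_accuracy(attacked_model) <= threshold
```

Under this definition, a model that merely refuses to answer forget-set queries fails the check as soon as the attack restores its accuracy, which is exactly the gap between unlearning and safety training/robustness that the post highlights.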