
Stanford AI Lab Papers and Talks at NeurIPS 2021

Stanford AI Lab Blog

Kochenderfer. Contact: philhc@stanford.edu. Links: Paper. Keywords: deep learning or neural networks, sparsity and feature selection, variational inference, (application) natural language and text processing.

Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss. Authors: Jeff Z.


Other Papers About the Theory of Reward Learning

The AI Alignment Forum

We also managed to leverage these results to produce a new method for conservative optimisation, which tells you how much (and in what way) you can optimise a proxy reward, based on the quality of that proxy (as measured by a STARC metric), in order to be guaranteed that the true reward doesn't decrease (and thereby prevent the Goodhart drop).
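
A minimal sketch of the shape such a guarantee can take, with illustrative notation not taken from the post (J_true and J_proxy for policy returns, d for a STARC pseudometric, epsilon for the proxy's distance from the true reward, c for an unspecified constant, and suitable normalisation of the rewards assumed throughout): if d(R_proxy, R_true) <= epsilon, a conservative-optimisation rule only accepts a policy change pi -> pi' whose proxy improvement clears a slack proportional to epsilon,

$$ J_{\text{proxy}}(\pi') - J_{\text{proxy}}(\pi) \;\ge\; c\,\varepsilon \quad\Longrightarrow\quad J_{\text{true}}(\pi') \;\ge\; J_{\text{true}}(\pi). $$

Read this only as an illustration of the flavour of the result: the smaller the STARC distance (the better the proxy), the smaller the required margin, and so the further you can safely optimise before risking a Goodhart drop.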


Research directions Open Phil wants to fund in technical AI safety

The AI Alignment Forum

Motivation: Control evaluations are an attempt to conservatively evaluate the safety of protocols like AI-critiquing-AI (e.g., …). We prefer this definition of success at unlearning over less conservative metrics like those in Lynch et al. because we think this definition more clearly distinguishes unlearning from safety training/robustness.