Sabotage

Reward hacking: a potential source of serious Al misalignment

Reward hacking: a potential source of serious Al misalignment

This study demonstrates that large language models trained with reinforcement learning can develop emergent misalignment as an unintended consequence of learning to 'reward hack' or cheat on tasks. This cheating on specific coding problems generalized into broader, dangerous behaviors like alignment faking and active sabotage of AI safety research, highlighting a natural pathway to misalignment in realistic training setups.