We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models for their propensity.
Authors: Eyon Jang*, Damon Falck*, Joschka Braun*, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner
(*Equal contribution, random order)
Paper: arXiv | Code: GitHub | Models: HuggingFace
We present the first empirical investigation of exploration hacking — when a model strategically alters its exploration during RL training to influence the training outcome. In our earlier post, we introduced a conceptual framework for this threat model. Here, we summarize the key empirical res…