RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Princeton University

Phenomenon of misalignment


We found that RLHF induces systematic misalignment.


Training AI on immediate feedback implicitly requires that users or evaluators predict the future utility of the AI's output, which often depends on downstream consequences. This leads to reward hacking:

  • The AI can improve its reward by manipulating the evaluator’s internal state (e.g., beliefs, emotions).
  • Manipulative AI outputs can bias users towards making poor decisions after the interaction (formalized schematically below).
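
One schematic way to formalize this mechanism (the notation here is ours, not taken verbatim from the paper): let $a$ denote the AI's output, $o$ the downstream outcome it eventually leads to, $u(o)$ the human's true utility over outcomes, and $b_H(o \mid a)$ the evaluator's belief about the outcome after seeing $a$. Immediate feedback then scores the output through that belief,

$$ r_{\mathrm{imm}}(a) \;=\; \mathbb{E}_{o \sim b_H(\cdot \mid a)}\!\left[u(o)\right], $$

so the AI can raise its reward by shifting $b_H$ (e.g., with persuasive or deceptive claims) rather than by improving the true expected utility $\mathbb{E}_{o \sim P(\cdot \mid a)}[u(o)]$.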

Benefit of hindsight


We introduce the benefit of hindsight and show theoretically that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility.
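
In the same schematic notation as above (again ours, not the paper's exact formalism), hindsight feedback has the evaluator rate the interaction after observing an outcome $o$ drawn from the environment or a simulator, so the training signal tracks realized utility rather than a manipulable belief:

$$ r_{\mathrm{hind}}(a) \;=\; \mathbb{E}_{o \sim P(\cdot \mid a)}\!\left[u(o)\right] \;\ne\; \mathbb{E}_{o \sim b_H(\cdot \mid a)}\!\left[u(o)\right] \;=\; r_{\mathrm{imm}}(a) \quad \text{in general.} $$

Averaged over outcomes, the hindsight signal recovers the true expected utility of the AI's output, which is why conditioning feedback on downstream observations removes the incentive to manipulate the evaluator's beliefs.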




To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS); a minimal code sketch of the feedback loop follows the steps below.
  • Step 1: Simulate the downstream consequences of the AI's output.
  • Step 2: Collect human feedback conditioned on that hindsight.
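
A minimal sketch of one RLHS feedback step, assuming hypothetical helper objects and method names (policy.generate, simulator.simulate_outcome, and evaluator.rate_with_hindsight are our placeholders, not the paper's actual code); the resulting reward is then used by a standard RLHF optimizer such as PPO or DPO in place of the immediate-feedback reward.

# Illustrative sketch only; helper names and signatures are assumptions.
def rlhs_feedback_step(policy, evaluator, simulator, prompt):
    # The policy model produces a response to the user's prompt.
    response = policy.generate(prompt)

    # Step 1: simulate the downstream consequences of acting on the
    # response (e.g., the user buys the recommended item and later
    # learns whether it actually met their requirements).
    outcome = simulator.simulate_outcome(prompt, response)

    # Step 2: elicit feedback conditioned on the simulated hindsight,
    # rather than on the evaluator's immediate impression alone.
    reward = evaluator.rate_with_hindsight(prompt, response, outcome)

    # The (prompt, response, reward) triple feeds a standard
    # preference-optimization / RL update downstream.
    return prompt, response, reward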

Results

We demonstrate that immediate feedback induces significant misalignment between true utility and user satisfaction ratings. Our proposed hindsight feedback effectively mitigates this misalignment.


The RLHF model (trained with immediate feedback) deceives the customer by falsely claiming that Options A and C meet the 8K resolution requirement, though neither does. In contrast, the RLHS model truthfully states that none of the options includes 8K resolution.


Human Study Results

RLHS significantly outperformed RLHF, achieving higher long-term satisfaction scores, higher true utility, and lower regret rates. Models trained with RLHS are also more truthful, showing a strong correlation between their high immediate user satisfaction ratings (subjective) and high true utility (objective).


Applications of Hindsight


BibTeX

@article{liang2025rlhs,
  title={RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation},
  author={Liang, Kaiqu and Hu, Haimin and Liu, Ryan and Griffiths, Thomas L and Fisac, Jaime Fern{\'a}ndez},
  journal={arXiv preprint arXiv:2501.08617},
  year={2025}
}