- AI can improve its reward by manipulating the evaluator’s internal state (e.g., beliefs, emotions).
- Manipulative AI outputs can bias users towards making poor decisions after the interaction.
We demonstrate significant misalignment between true utility and user satisfaction ratings when models are trained with immediate feedback.
Our proposed hindsight simulation effectively mitigates this misalignment.
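To make the distinction concrete, below is a minimal, self-contained Python sketch contrasting a reward derived from immediate feedback with one derived from a simulated hindsight outcome. The option data, satisfaction logic, and all names are hypothetical illustrations of the general idea, not the paper's implementation.

```python
"""Toy contrast between immediate feedback (RLHF-style) and
hindsight-simulated feedback (RLHS-style). All values and the
satisfaction logic below are hypothetical illustrations."""

from dataclasses import dataclass


@dataclass
class Option:
    name: str
    has_8k: bool  # ground-truth attribute the customer actually cares about


def immediate_reward(claim_has_8k: bool) -> float:
    """Immediate feedback: the customer rates the chat before verifying the
    claim, so a confident (even false) '8K supported' answer scores high."""
    return 1.0 if claim_has_8k else 0.2


def hindsight_reward(claim_has_8k: bool, option: Option) -> float:
    """Hindsight feedback: simulate the customer acting on the claim and
    observing the outcome; the rating then reflects true utility."""
    if claim_has_8k and not option.has_8k:
        return -1.0  # customer was misled; low true utility, high regret
    if claim_has_8k == option.has_8k:
        return 1.0   # truthful answer, whether positive or negative
    return 0.5       # overly cautious but not harmful


if __name__ == "__main__":
    option_a = Option("A", has_8k=False)
    # A deceptive "yes, it has 8K" claim looks good under immediate feedback
    # but is penalized once the simulated hindsight outcome is considered.
    print("immediate (deceptive claim):", immediate_reward(claim_has_8k=True))
    print("hindsight (deceptive claim):", hindsight_reward(True, option_a))
    print("hindsight (truthful claim): ", hindsight_reward(False, option_a))
```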
The RLHF model (trained with immediate feedback) deceives the customer by falsely claiming that Options A and C meet the 8K resolution requirement, though neither does. In contrast, the RLHS model truthfully states that none of the options include 8K resolution.
RLHS significantly outperformed RLHF, achieving higher long-term satisfaction scores, higher true utility, and lower regret rates.
Models trained with RLHS are more truthful, showing a strong correlation between high immediate user satisfaction (subjective) and high true utility (objective).
@article{liang2025rlhs,
title={RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation},
author={Liang, Kaiqu and Hu, Haimin and Liu, Ryan and Griffiths, Thomas L and Fisac, Jaime Fern{\'a}ndez},
journal={arXiv preprint arXiv:2501.08617},
year={2025}
}