Intrinsic Reward Policy Optimization for Sparse-Reward Environments
Published in arXiv preprint arXiv:2601.21391, 2026
Abstract
Key Contributions
- New surrogate policy gradient: We develop a surrogate policy gradient that uses intrinsic rewards to collect diverse experiences while using extrinsic rewards to optimize the policy.
- Formal and empirical analysis: We provide extensive analysis to characterize theoretical benefits and the underlying mechanism of the surrogate gradient.
- Improved performance: We evaluate our algorithm across widely used discrete and continuous environments, from basic dynamics to complex locomotion.
- Extensive ablation: We provide five ablation studies to justify our algorithmic design and its robustness across different intrinsic rewards.
🎥 Comparative Results on Sparse-Reward Environments
Environment: PointMaze-v1



Environment: AntMaze-v3



🛠 Method Overview — IRPO Gradient
At the core of IRPO is a surrogate policy gradient that aggregates learning signals from multiple exploratory policies to directly optimize extrinsic performance.
IRPO Surrogate Gradient
We define the IRPO gradient as:
\[\nabla J_{\mathrm{IRPO}}\big(\theta,\{\tilde{\theta}_k\}_{k=1}^K\big) := \sum_{k=1}^K \omega_k \, \nabla_\theta J_R(\tilde{\theta}_k)\]
where each exploratory policy \(\pi_{\tilde{\theta}_k}\) contributes proportionally to its extrinsic performance.
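As a minimal sketch (not the authors' implementation), the surrogate gradient can be assembled from per-policy extrinsic gradients and performance-based weights; the returns and gradient vectors below are hypothetical placeholders:

```python
import numpy as np

# Hypothetical extrinsic returns J_R(theta_k) and per-policy extrinsic
# policy gradients for K = 3 exploratory policies in a 2-D parameter space.
returns = np.array([1.0, 0.2, -0.5])
grads = np.array([[0.3, -0.1],
                  [0.9,  0.4],
                  [-0.2, 0.7]])

tau = 0.5  # temperature in (0, 1]
# Performance-based softmax weighting: higher extrinsic return -> larger weight.
w = np.exp(returns / tau)
w /= w.sum()

# IRPO surrogate gradient: performance-weighted sum of per-policy gradients.
irpo_grad = w @ grads
print(w, irpo_grad)
```

The weighting here matches the softmax form given in the next subsection; with a small temperature, the update concentrates on the best-performing exploratory policies.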
Performance-Based Weighting
The contribution of each exploratory policy is determined by a softmax weighting:
\[\omega_k := \frac{\exp(J_R(\tilde{\theta}_k)/\tau)}{\sum_{k'=1}^K \exp(J_R(\tilde{\theta}_{k'})/\tau)}, \quad \tau \in (0,1]\]
Backpropagated Policy Gradient
The gradient propagated from each exploratory policy to the base policy is calculated via the chain rule:
\[\nabla_\theta \log \pi_{\tilde{\theta}_k}(a \mid s) := (\nabla_\theta \tilde{\theta}_k)^\top \nabla_{\tilde{\theta}_k} \log \pi_{\tilde{\theta}_k}(a \mid s)\]
Visualizing the IRPO Update Mechanism
We provide a visual illustration of the IRPO update mechanism using a two-dimensional parameter space \(\theta = [\theta_1, \theta_2]^\top \in \mathbb{R}^2\) with strictly concave extrinsic and intrinsic performance objectives. The objectives considered in this analysis are:
Objective Functions: \(J(\theta) := -\Vert \theta \Vert^2_2 \quad \text{(Extrinsic)}, \; \tilde{J}_1(\theta) := -\Vert \theta - [0, -2]^\top \Vert^2_2 \quad \text{(Intrinsic)}.\)
The result of IRPO on the objectives is depicted below.
Figure 2: Empirical analysis of the IRPO update mechanism showing the relationship between intrinsic rewards and policy convergence.
Figure 3: Visual illustration of IRPO’s update mechanism. We use one intrinsic reward and \(N = 5\) exploratory policy updates.
We highlight key observations below.
- Exploratory Policy Updates: The exploratory policy updates (red arrows) are explicitly directed towards the intrinsic performance objective (red star), facilitating the collection of diverse experiences.
- Surrogate Gradient Guidance: The base policy updates—driven by the IRPO gradient—guide the agent toward a region where \(N\) exploratory updates will land precisely at the extrinsic optimum.
- Degree of Exploration: Increasing the number of exploratory policy updates (\(N=2\) vs \(N=5\)) increases the degree of exploration, as indicated by the increased length of the exploratory vectors.
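The observations above can be reproduced numerically on the two quadratic objectives. The sketch below (an illustration under assumed step sizes, not the authors' code) runs \(N\) exploratory gradient-ascent steps on the intrinsic objective, backpropagates the extrinsic gradient through those steps via the chain rule, and ascends the base parameters with the resulting surrogate gradient:

```python
import numpy as np

# Toy 2-D objectives from the text.
c = np.array([0.0, -2.0])              # intrinsic optimum
grad_J = lambda th: -2.0 * th          # extrinsic: J(theta) = -||theta||^2
grad_Ji = lambda th: -2.0 * (th - c)   # intrinsic: J1(theta) = -||theta - c||^2

def explore(theta, N, alpha):
    """N gradient-ascent steps on the intrinsic objective."""
    for _ in range(N):
        theta = theta + alpha * grad_Ji(theta)
    return theta

N, alpha, eta = 5, 0.1, 0.05           # assumed (illustrative) step sizes
theta = np.zeros(2)                    # base-policy parameters
for _ in range(2000):
    theta_k = explore(theta, N, alpha)
    # Chain rule: each intrinsic step scales d(theta_k)/d(theta) by (1 - 2*alpha)
    # for this quadratic toy, so the surrogate gradient is (1-2a)^N * grad_J(theta_k).
    surrogate = (1.0 - 2.0 * alpha) ** N * grad_J(theta_k)
    theta = theta + eta * surrogate    # base-policy ascent step

# The base policy settles at a point whose N-step exploratory rollout
# lands at the extrinsic optimum (the origin).
print(theta, explore(theta, N, alpha))
```

Consistent with the figure, the converged base parameters sit on the side of the extrinsic optimum opposite the intrinsic optimum, and increasing \(N\) pushes them farther out, lengthening the exploratory vectors.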
📖 BibTeX Citation
@article{cho2026intrinsic,
title = {Intrinsic Reward Policy Optimization for Sparse-Reward Environments},
author = {Cho, Minjae and Tran, Huy T.},
journal = {arXiv preprint arXiv:2601.21391},
year = {2026}
}

Recommended citation: Cho, M., & Tran, H. T. (2026). "Intrinsic Reward Policy Optimization for Sparse-Reward Environments." arXiv preprint arXiv:2601.21391.
Download Paper
