Intrinsic Reward Policy Optimization for Sparse-Reward Environments

Published in arXiv preprint arXiv:2601.21391, 2026

Abstract

Exploration is crucial to reinforcement learning, yet sparse rewards render naive exploration strategies ineffective. Intrinsic Reward Policy Optimization (IRPO) leverages multiple intrinsic reward functions via a surrogate policy gradient to directly optimize a policy with respect to an extrinsic reward. Our algorithm thereby sidesteps limitations of prior approaches, such as difficult credit assignment, sample inefficiency, and suboptimal convergence, across discrete and continuous environments.

Key Contributions

  • New surrogate policy gradient: We develop a surrogate gradient that uses intrinsic rewards to collect diverse experiences while using extrinsic rewards to optimize the policy.
  • Formal and empirical analysis: We provide extensive analysis to characterize theoretical benefits and the underlying mechanism of the surrogate gradient.
  • Improved performance: We evaluate our algorithm across widely used discrete and continuous environments, from basic dynamics to complex locomotion.
  • Extensive ablation: We provide five ablation studies to justify our algorithmic design and its robustness across different intrinsic rewards.

🎥 Comparative Results on Sparse-Reward Environments

Environment: PointMaze-v1 [videos: IRPO (Ours), HRL, PPO]

Environment: AntMaze-v3 [videos: IRPO (Ours), HRL, PPO]

🛠 Method Overview — IRPO Gradient

At the core of IRPO is a surrogate policy gradient that aggregates learning signals from multiple exploratory policies to directly optimize extrinsic performance.

IRPO Surrogate Gradient

We define the IRPO gradient as:

\[\nabla J_{\mathrm{IRPO}}\big(\theta,\{\tilde{\theta}_k\}_{k=1}^K\big) := \sum_{k=1}^K \omega_k \, \nabla_\theta J_R(\tilde{\theta}_k)\]

where each exploratory policy \(\pi_{\tilde{\theta}_k}\) contributes proportionally to its extrinsic performance.
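As a sketch, the IRPO gradient is a convex combination of the per-exploratory-policy extrinsic gradients, taken in base-parameter space. The function name and NumPy setup below are illustrative, not from the paper:

```python
import numpy as np

def irpo_gradient(weights, extrinsic_grads):
    """Compute the surrogate gradient sum_k w_k * grad_theta J_R(theta_tilde_k):
    a convex combination of the per-exploratory-policy extrinsic gradients."""
    g = np.zeros_like(extrinsic_grads[0])
    for w, gk in zip(weights, extrinsic_grads):
        g += w * gk
    return g

# Two exploratory policies; the better-performing one (weight 0.75) dominates.
g = irpo_gradient([0.25, 0.75],
                  [np.array([4.0, 0.0]), np.array([0.0, 4.0])])
```

Because the weights sum to one, the aggregate stays within the convex hull of the individual gradients, so no single exploratory policy can pull the base update outside the span of the collected learning signals.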

Performance-Based Weighting

The contribution of each exploratory policy is determined by a softmax weighting:

\[\omega_k := \frac{\exp(J_R(\tilde{\theta}_k)/\tau)}{\sum_{k'=1}^K \exp(J_R(\tilde{\theta}_{k'})/\tau)}, \quad \tau \in (0,1]\]
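A minimal sketch of the weighting, computed with the usual max-shift for numerical stability; the helper name is illustrative:

```python
import numpy as np

def softmax_weights(returns, tau=0.5):
    """Softmax weights w_k proportional to exp(J_R(theta_tilde_k) / tau),
    with tau in (0, 1]. The max is subtracted before exponentiating so that
    large returns do not overflow."""
    z = np.asarray(returns, dtype=float) / tau
    z -= z.max()                 # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

w = softmax_weights([1.0, 2.0, 3.0], tau=1.0)
```

Lowering the temperature sharpens the distribution, concentrating the surrogate gradient on the best-performing exploratory policies.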

Backpropagated Policy Gradient

The gradient propagated from each exploratory policy to the base policy is calculated via the chain rule:

\[\nabla_\theta \log \pi_{\tilde{\theta}_k}(a \mid s) := (\nabla_\theta \tilde{\theta}_k)^\top \nabla_{\tilde{\theta}_k} \log \pi_{\tilde{\theta}_k}(a \mid s)\]
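To illustrate the chain rule concretely, suppose (hypothetically) the exploratory parameters depend linearly on the base parameters, \(\tilde{\theta} = A\theta\), so the Jacobian \(\nabla_\theta \tilde{\theta}\) is simply \(A\); the matrix and function below are assumptions for illustration only:

```python
import numpy as np

# Hypothetical linear dependence of exploratory on base parameters:
# theta_tilde = A @ theta, so the Jacobian d(theta_tilde)/d(theta) is A.
A = np.array([[1.0, 0.5],
              [0.0, 1.0]])

def backprop_grad(grad_wrt_tilde):
    """Chain rule: grad_theta log pi = (Jacobian)^T @ grad_theta_tilde log pi."""
    return A.T @ grad_wrt_tilde

g = backprop_grad(np.array([2.0, 4.0]))
```

In practice the Jacobian-transpose product is what automatic differentiation computes when the exploratory update is kept in the computation graph.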

Visualizing the IRPO Update Mechanism

We provide a visual illustration of the IRPO update mechanism using a two-dimensional parameter space \(\theta = [\theta_1, \theta_2]^\top \in \mathbb{R}^2\) with strictly concave extrinsic and intrinsic performance objectives. The extrinsic and intrinsic performance objectives considered in this analysis are:

Objective Functions: \(J(\theta) := -\Vert \theta \Vert^2_2 \quad \text{(Extrinsic)}, \; \tilde{J}_1(\theta) := -\Vert \theta - [0, -2]^\top \Vert^2_2 \quad \text{(Intrinsic)}.\)
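Since both objectives are quadratic, the whole toy example can be simulated analytically. The sketch below assumes an intrinsic-gradient step size of 0.1 and N = 5 exploratory updates (these values, like the learning rate, are illustrative choices, not from the paper):

```python
import numpy as np

c = np.array([0.0, -2.0])   # intrinsic optimum (the red star)
alpha, N = 0.1, 5           # assumed step size and number of exploratory updates

def exploratory(theta):
    """Run N gradient-ascent steps on the intrinsic objective
    J_tilde(theta) = -||theta - c||^2, whose gradient is -2(theta - c)."""
    for _ in range(N):
        theta = theta + alpha * (-2.0) * (theta - c)
    return theta

def irpo_step(theta, lr=0.05):
    """One base-policy update on J(theta_tilde) = -||theta_tilde||^2,
    differentiated through the exploratory map via the chain rule."""
    rho = (1 - 2 * alpha) ** N                 # d(theta_tilde)/d(theta), scalar here
    grad = rho * (-2.0) * exploratory(theta)   # (grad_theta theta_tilde)^T grad J
    return theta + lr * grad

theta = np.array([2.0, 2.0])
for _ in range(2000):
    theta = irpo_step(theta)
# At convergence, N exploratory updates from theta land at the extrinsic optimum.
```

The converged base parameters sit on the opposite side of the origin from the intrinsic optimum, exactly so that the pull of the exploratory updates carries the policy to the extrinsic optimum.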

The result of IRPO on the objectives is depicted below.

[Figure: IRPO update mechanism]

Figure 2: Empirical analysis of the IRPO update mechanism showing the relationship between intrinsic rewards and policy convergence.

Figure 3: Visual illustration of IRPO’s update mechanism, using one intrinsic reward and N = 5 exploratory policy updates.

We highlight key observations below.

  • Exploratory Policy Updates: The exploratory policy updates (red arrows) are explicitly directed towards the intrinsic performance objective (red star), facilitating the collection of diverse experiences.
  • Surrogate Gradient Guidance: The base policy updates—driven by the IRPO gradient—guide the agent toward a region where \(N\) exploratory updates will land precisely at the extrinsic optimum.
  • Degree of Exploration: Increasing the number of exploratory policy updates (\(N=2\) vs \(N=5\)) increases the degree of exploration, as indicated by the increased length of the exploratory vectors.
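The third observation can be checked directly in the toy setting: the displacement induced by the exploratory updates grows with N. The step size, starting point, and intrinsic optimum below are illustrative assumptions:

```python
import numpy as np

c = np.array([0.0, -2.0])       # assumed intrinsic optimum
alpha = 0.1                     # assumed intrinsic-gradient step size
theta0 = np.array([2.0, 2.0])   # arbitrary starting base parameters

def exploratory_displacement(theta, n):
    """Length of the exploratory vector after n intrinsic-gradient steps
    on J_tilde(theta) = -||theta - c||^2."""
    t = theta.copy()
    for _ in range(n):
        t = t + alpha * (-2.0) * (t - c)
    return np.linalg.norm(t - theta)

d2 = exploratory_displacement(theta0, 2)
d5 = exploratory_displacement(theta0, 5)
```

Here `d5 > d2`: more exploratory updates push the exploratory policy farther from the base policy, matching the longer exploratory vectors seen in Figure 3.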

📖 BibTeX Citation

@article{cho2026intrinsic,
  title   = {Intrinsic Reward Policy Optimization for Sparse-Reward Environments},
  author  = {Cho, Minjae and Tran, Huy T.},
  journal = {arXiv preprint arXiv:2601.21391},
  year    = {2026}
}

Recommended citation: Cho, M., & Tran, H. T. (2026). "Intrinsic Reward Policy Optimization for Sparse-Reward Environments." arXiv preprint arXiv:2601.21391.