Intrinsic Reward Policy Optimization for Sparse-Reward Environments
Published in arXiv preprint arXiv:2601.21391, 2026
Abstract
Key Contributions
- New surrogate policy gradient: We develop a surrogate policy gradient that uses intrinsic rewards to collect diverse experiences while using extrinsic rewards to optimize the policy.
- Formal and empirical analysis: We provide extensive analysis to characterize theoretical benefits and the underlying mechanism of the surrogate gradient.
- Improved performance: We evaluate our algorithm across widely used discrete and continuous environments, from basic dynamics to complex locomotion.
- Extensive ablation: We provide five ablation studies to justify our algorithmic design and its robustness across different intrinsic rewards.
🎥 Comparative Results on Sparse-Reward Environments
Environment: PointMaze-v1



Environment: AntMaze-v3



🛠 Method Overview — IRPO Gradient
At the core of IRPO is a surrogate policy gradient that aggregates learning signals from multiple exploratory policies to directly optimize extrinsic performance.
IRPO Surrogate Gradient
We define the IRPO gradient as:
\[\nabla J_{\mathrm{IRPO}}\big(\theta,\{\tilde{\theta}_k\}_{k=1}^K\big) := \sum_{k=1}^K \omega_k \, \nabla_\theta J_R(\tilde{\theta}_k)\]
where each exploratory policy \(\pi_{\tilde{\theta}_k}\) contributes proportionally to its extrinsic performance.
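As a minimal sketch (not the authors' implementation), the surrogate gradient can be assembled from per-policy extrinsic gradients and performance-based weights; the returns and gradient vectors below are hypothetical placeholders:

```python
import numpy as np

# Hypothetical extrinsic returns J_R(theta_k) and per-policy extrinsic
# policy gradients for K = 3 exploratory policies in a 2-D parameter space.
returns = np.array([1.0, 0.2, -0.5])
grads = np.array([[0.3, -0.1],
                  [0.9,  0.4],
                  [-0.2, 0.7]])

tau = 0.5  # temperature in (0, 1]
# Performance-based softmax weighting: higher extrinsic return -> larger weight.
w = np.exp(returns / tau)
w /= w.sum()

# IRPO surrogate gradient: performance-weighted sum of per-policy gradients.
irpo_grad = w @ grads
print(w, irpo_grad)
```

The weighting here matches the softmax form given in the next subsection; with a small temperature, the update concentrates on the best-performing exploratory policies.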
Performance-Based Weighting
The contribution of each exploratory policy is determined by a softmax weighting:
\[\omega_k := \frac{\exp(J_R(\tilde{\theta}_k)/\tau)}{\sum_{k'=1}^K \exp(J_R(\tilde{\theta}_{k'})/\tau)}, \quad \tau \in (0,1]\]
Backpropagated Policy Gradient
The gradient propagated from each exploratory policy to the base policy is calculated via the chain rule:
\[\nabla_\theta \log \pi_{\tilde{\theta}_k}(a \mid s) := (\nabla_\theta \tilde{\theta}_k)^\top \nabla_{\tilde{\theta}_k} \log \pi_{\tilde{\theta}_k}(a \mid s)\]
Visualizing the IRPO Update Mechanism
We provide a visual illustration of the IRPO update mechanism using a two-dimensional parameter space \(\theta = [\theta_1, \theta_2]^\top \in \mathbb{R}^2\) with strictly concave extrinsic and intrinsic performance objectives. The objectives considered in this analysis are:
Objective Functions: \(J(\theta) := -\Vert \theta \Vert^2_2 \quad \text{(Extrinsic)}, \; \tilde{J}_1(\theta) := -\Vert \theta - [0, -2]^\top \Vert^2_2 \quad \text{(Intrinsic)}.\)
The result of IRPO on the objectives is depicted below.
Figure 2: Empirical analysis of the IRPO update mechanism showing the relationship between intrinsic rewards and policy convergence.
Figure 3: Visual illustration of IRPO’s update mechanism. We use one intrinsic reward and \(N = 5\) exploratory policy updates.
We highlight key observations below.
- Exploratory Policy Updates: The exploratory policy updates (red arrows) are explicitly directed towards the intrinsic performance objective (red star), facilitating the collection of diverse experiences.
- Surrogate Gradient Guidance: The base policy updates—driven by the IRPO gradient—guide the agent toward a region where \(N\) exploratory updates will land precisely at the extrinsic optimum.
- Degree of Exploration: Increasing the number of exploratory policy updates (\(N=2\) vs \(N=5\)) increases the degree of exploration, as indicated by the increased length of the exploratory vectors.
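The observations above can be reproduced numerically on the two quadratic objectives. The sketch below (an illustration under assumed step sizes, not the authors' code) runs \(N\) exploratory gradient-ascent steps on the intrinsic objective, backpropagates the extrinsic gradient through those steps via the chain rule, and ascends the base parameters with the resulting surrogate gradient:

```python
import numpy as np

# Toy 2-D objectives from the text.
c = np.array([0.0, -2.0])              # intrinsic optimum
grad_J = lambda th: -2.0 * th          # extrinsic: J(theta) = -||theta||^2
grad_Ji = lambda th: -2.0 * (th - c)   # intrinsic: J1(theta) = -||theta - c||^2

def explore(theta, N, alpha):
    """N gradient-ascent steps on the intrinsic objective."""
    for _ in range(N):
        theta = theta + alpha * grad_Ji(theta)
    return theta

N, alpha, eta = 5, 0.1, 0.05           # assumed (illustrative) step sizes
theta = np.zeros(2)                    # base-policy parameters
for _ in range(2000):
    theta_k = explore(theta, N, alpha)
    # Chain rule: each intrinsic step scales d(theta_k)/d(theta) by (1 - 2*alpha)
    # for this quadratic toy, so the surrogate gradient is (1-2a)^N * grad_J(theta_k).
    surrogate = (1.0 - 2.0 * alpha) ** N * grad_J(theta_k)
    theta = theta + eta * surrogate    # base-policy ascent step

# The base policy settles at a point whose N-step exploratory rollout
# lands at the extrinsic optimum (the origin).
print(theta, explore(theta, N, alpha))
```

Consistent with the figure, the converged base parameters sit on the side of the extrinsic optimum opposite the intrinsic optimum, and increasing \(N\) pushes them farther out, lengthening the exploratory vectors.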
📖 BibTeX Citation
@article{cho2026intrinsic,
title = {Intrinsic Reward Policy Optimization for Sparse-Reward Environments},
author = {Cho, Minjae and Tran, Huy T.},
journal = {arXiv preprint arXiv:2601.21391},
year = {2026}
}

Recommended citation: Cho, M., & Tran, H. T. (2026). "Intrinsic Reward Policy Optimization for Sparse-Reward Environments." arXiv preprint arXiv:2601.21391.
Download Paper
