Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming

Published at the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24), 2024

Abstract

We propose Meta-Learning via Constrained Policy Optimization (Meta-CPO) to address the challenge of ensuring safety in non-stationary environments by solving constrained problems through the lens of meta-learning (learning-to-learn). We perform successive convex constrained policy updates across multiple tasks with differentiable convex programming, whose end-to-end differentiability we then exploit for the meta-policy update.

Key Contributions

  • Adaptable safety guarantee via RL: We develop a policy optimization framework whose resulting policy achieves an adaptable safety guarantee on unseen tasks.
  • Improved performance: We build an adaptable-safety benchmark and evaluate our algorithm against various baselines.

🎥 Comparative Results on Adaptation to Unseen Tasks

Meta-CPO (Ours)
Meta-TRPO
CPO

Environment: Point-Button-Hazard


🛠 Method Overview — Safe Meta-Policy Optimization

The core of Meta-CPO is a bi-level constrained optimization framework that enables end-to-end meta-training while maintaining safety guarantees on unseen (in-distribution) tasks.

1. Local Task Updates (Inner-Loop)

For each task \(\mathcal{T}_i\) among the \(M\) given tasks \(\{\mathcal{T}_i \}_{i=1}^{M}\), we initialize a local learner \(\phi_i\) with the parameters of the meta-learner \(\theta\). At each local iteration \(k\), we solve the constrained policy optimization (CPO) subproblem via differentiable convex programming:

\[\phi^{k+1}_i = \text{argmax}_{\phi_i} \; g(\phi^k_i, D^{tr}_i)^\top (\phi_i - \phi^k_i)\] \[\qquad\text{s.t. } \frac{1}{2}\|\phi_i - \phi^k_i\|^2_H \leq \delta, \quad b_{\phi_i} + a(\phi^k_i, D^{tr}_i)^\top (\phi_i - \phi^k_i) \leq 0.\]

This ensures that task-specific exploration respects both the trust-region constraint and the safety constraint \(J_C(\pi) \leq h\).
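The subproblem above is a small convex program. The following numpy sketch (not the paper's implementation) illustrates one inner-loop update in the simplified case \(H = I\), with hypothetical gradient estimates `g` (objective) and `a` (constraint); in practice, the full program is solved with a differentiable convex solver so that gradients can later flow through the update.

```python
import numpy as np

def cpo_inner_step(phi, g, a, b, delta):
    """One linearized CPO update with H = I (illustrative sketch).
    Maximizes g^T d subject to 0.5*||d||^2 <= delta and b + a^T d <= 0,
    handling the common case where the safety constraint is inactive;
    otherwise it falls back to the steepest safe direction (a heuristic
    stand-in for the full dual solution of the convex program)."""
    # Trust-region optimum ignoring the safety constraint:
    d = np.sqrt(2.0 * delta) * g / np.linalg.norm(g)
    if b + a @ d <= 0.0:          # safety constraint inactive at the optimum
        return phi + d
    # Recovery step: move along -a to decrease the constraint value (heuristic)
    d_safe = -np.sqrt(2.0 * delta) * a / np.linalg.norm(a)
    return phi + d_safe

phi = np.zeros(3)
g = np.array([1.0, 2.0, -1.0])    # policy-gradient estimate (illustrative)
a = np.array([0.5, -0.2, 0.1])    # constraint-gradient estimate (illustrative)
b, delta = -0.4, 0.01             # constraint slack and trust-region size
phi_new = cpo_inner_step(phi, g, a, b, delta)
```

Note that the trust-region step has \(\frac{1}{2}\|d\|^2 = \delta\) exactly by construction, so the update always saturates the trust region while respecting the linearized safety constraint.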

2. Differentiable Meta-Update (Outer-Loop)

We update the meta-learner \(\theta\) by maximizing the average performance across the \(M\) tasks. A major challenge in meta-learning is differentiating through the local learner updates; to do so, we exploit the computational graph generated by the differentiable constrained optimization in the local task updates and compute the meta-gradient via the chain rule:

\[\frac{dF}{d\theta} = \frac{1}{M} \sum_{i=1}^M \left( \prod_{k=0}^{K-1} \frac{d\text{Alg}_i(\phi^{k+1}_i)}{d\phi^k_i} \right) \cdot g(\phi^K_i, D^{tr}_i).\]

Having differentiated through the local learner updates, we update the meta-learner by aggregating the backpropagated gradients and projecting onto the global safety-satisfaction set \(G(\theta) \leq 0\):

\[\theta' = \text{argmax}_{\theta'} \; \left( \frac{dF}{d\theta} \right)^\top (\theta' - \theta)\] \[\qquad\text{s.t. } \frac{1}{2}\|\theta' - \theta\|^2_H \leq \delta_\theta, \quad b_\theta + \left( \frac{dG}{d\theta} \right)^\top (\theta' - \theta) \leq 0.\]

This mechanism allows the policy to generalize to new (but still in-distribution) tasks with an adaptable safety guarantee.
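The chain-rule meta-gradient can be illustrated on a toy problem where each inner step is plain gradient ascent with an analytic Jacobian. This is a minimal sketch, not the paper's implementation: the quadratic objective `F` and all numerical values are purely illustrative, and the constraint machinery is omitted so the product-of-Jacobians structure stands out.

```python
import numpy as np

# Toy concave task objective F(phi) = -0.5 phi^T A phi + c^T phi.
# Its gradient is g(phi) = c - A phi, and the Jacobian of one
# gradient-ascent inner step phi <- phi + alpha * g(phi) is (I - alpha*A).
A = np.array([[2.0, 0.3], [0.3, 1.0]])
c = np.array([1.0, -0.5])
alpha, K = 0.1, 3

def grad_F(phi):
    return c - A @ phi

def inner_loop(theta):
    """Run K inner updates from meta-parameters theta; return the final
    local parameters and the accumulated product of per-step Jacobians."""
    phi = theta.copy()
    J = np.eye(2)                         # d phi^k / d theta, accumulated
    for _ in range(K):
        phi = phi + alpha * grad_F(phi)
        J = (np.eye(2) - alpha * A) @ J   # chain rule through one step
    return phi, J

theta = np.array([0.5, 0.5])
phi_K, J = inner_loop(theta)
# Meta-gradient: product of inner-step Jacobians times the final gradient,
# mirroring the dF/d(theta) formula above for a single task (M = 1).
meta_grad = J.T @ grad_F(phi_K)

# Finite-difference check of the unrolled objective F(phi^K(theta))
def F_unrolled(th):
    ph, _ = inner_loop(th)
    return -0.5 * ph @ A @ ph + c @ ph

eps = 1e-6
fd = np.array([(F_unrolled(theta + eps * e) - F_unrolled(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
```

The accumulated Jacobian product, applied to the gradient at \(\phi^K\), agrees with a finite-difference derivative of the fully unrolled objective, which is exactly what the differentiable convex layers provide for the real constrained updates.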


📖 BibTeX Citation

@inproceedings{cho2024constrained,
  title={Constrained meta-reinforcement learning for adaptable safety guarantee with differentiable convex programming},
  author={Cho, Minjae and Sun, Chuangchuang},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={19},
  pages={20975--20983},
  year={2024}
}

Recommended citation: Cho, Minjae, and Chuangchuang Sun. "Constrained meta-reinforcement learning for adaptable safety guarantee with differentiable convex programming." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 19. 2024.
Download Paper