Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization

Kun Lei1   Zhengmao He14*   Chenhao Lu2*   Kaizhe Hu12
Yang Gao123   Huazhe Xu123

1Shanghai Qi Zhi Institute   2Tsinghua University, IIIS  
3Shanghai AI Lab   4The Hong Kong University of Science and Technology (Guangzhou)  
*Equal contribution.


Abstract

Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: Can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-O4, which utilizes an on-policy objective for both offline and online learning. Owing to the alignment of objectives in the two phases, the RL agent can transfer between offline and online learning seamlessly. This property enhances the flexibility of the learning paradigm, allowing for arbitrary combinations of pretraining, fine-tuning, offline, and online learning. In the offline phase, specifically, Uni-O4 leverages diverse ensemble policies to address the mismatch between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-O4 achieves multi-step policy improvement safely. We demonstrate that, with the method above, the fusion of these two paradigms yields superior offline initialization as well as stable and rapid online fine-tuning. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations on numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline and offline-to-online fine-tuning settings.

Motivated Example

The objective of offline-to-online RL algorithms is to strike a trade-off between fine-tuning stability and asymptotic optimality. Challenges arise from the inherent conservatism of the offline stage and the difficulties associated with off-policy evaluation during the offline-to-online stage. To provide a clearer picture, we track the average values of the V and Q functions during fine-tuning with three different methods. As depicted in the right figure below, the Q values of the off-policy algorithm (SAC) and the conservative algorithm exhibit instability and slow improvement, respectively.

Hence, we raise the question: Is it possible to avoid introducing the conservatism term during offline training and eliminate the need for off-policy evaluation during offline-to-online fine-tuning? At the heart of this paper is an on-policy optimization method that unifies both offline and online training without extra regularization, which we term Uni-O4. Uni-O4 presents steady and rapid improvement in the V values as the fine-tuning performance progresses, emerging as a favorable choice for achieving both stable and efficient fine-tuning.

Figure (left): normalized returns during online fine-tuning.
Figure (right): average values during online fine-tuning.
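
To make the unified on-policy objective concrete, the sketch below shows a standard PPO-style clipped surrogate loss, the kind of objective such a unified scheme builds on. It is a minimal illustration rather than the paper's implementation: how the advantages are estimated differs between the two phases (from the offline dataset during offline training, and from fresh rollouts during online fine-tuning), but the loss itself can stay the same.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate objective (illustrative sketch).

    The same loss form can be reused in both phases: offline, `advantages`
    are estimated from the dataset against the current behavior policy;
    online, they come from on-policy rollouts of the fine-tuned policy.
    """
    ratio = torch.exp(logp_new - logp_old)                  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # minimize the negative surrogate
```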

Method

Owing to the alignment of objectives in the offline and online phases, Uni-O4 enables flexible combinations of pretraining, fine-tuning, offline learning, and online learning. In this work, we focus on three settings: pure offline, offline-to-online, and online-to-offline-to-online. In the offline-to-online setting, our framework comprises three stages: 1) the supervised learning stage, 2) the multi-step policy improvement stage, and 3) the online fine-tuning stage, as illustrated in the figure below.

Uni-O4 first employs supervised learning to learn the components used to initialize the subsequent phase. In the offline multi-step optimization phase (middle), the policies query AM-Q to determine whether to update their behavior policies after a certain number of training steps. For instance, AM-Q allows \(\pi^{2}\) to update its behavior policy but rejects the others. Subsequently, one policy is selected as the initialization for online fine-tuning. Here, \(OOS_{\mathcal{D}}\) denotes samples that are out of the support of the dataset.

Figure: pipeline of the offline-to-online setting for Uni-O4.
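
As a toy illustration of the middle stage, the sketch below gates each ensemble member's behavior-policy update behind an offline policy evaluation check. The linear "policies", the random candidate updates, and the `ope_value` scorer are placeholders standing in for the learned policies, the on-policy optimization step, and AM-Q, respectively; only the gating and selection logic mirrors the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def ope_value(policy_params, dataset):
    """Hypothetical stand-in for offline policy evaluation (AM-Q in the paper).

    Here it merely scores how well a linear 'policy' reproduces the dataset
    actions; the real estimator is learned from the offline data.
    """
    states, actions = dataset
    predicted = states @ policy_params
    return -np.mean((predicted - actions) ** 2)

# Toy offline dataset: states and the actions the behavior policy logged.
states = rng.normal(size=(256, 4))
actions = states @ np.array([0.5, -0.2, 0.1, 0.3]) + 0.05 * rng.normal(size=256)
dataset = (states, actions)

n_policies, n_rounds = 4, 3
behavior = [rng.normal(scale=0.1, size=4) for _ in range(n_policies)]  # ensemble init

for _ in range(n_rounds):                      # multi-step policy improvement
    for i in range(n_policies):
        # Propose an update (in Uni-O4 this comes from on-policy optimization
        # against the current behavior policy; here it is a random perturbation).
        candidate = behavior[i] + rng.normal(scale=0.1, size=4)
        # Query the OPE check: adopt the candidate as the new behavior policy
        # only if its estimated value improves, otherwise keep the old one.
        if ope_value(candidate, dataset) > ope_value(behavior[i], dataset):
            behavior[i] = candidate

# Select the ensemble member with the best OPE score to initialize online fine-tuning.
best = max(range(n_policies), key=lambda i: ope_value(behavior[i], dataset))
print("policy selected for online fine-tuning:", best)
```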

Real-World Robot Tasks

Here, Uni-O4 showcases its ability to excel in real-world robot applications. Bridging the sim-to-real gap is a widely recognized challenge in robot learning. Previous studies tackled this issue by employing domain randomization, which involves training the agent in multiple randomized environments simultaneously. However, this approach comes with computational overhead and poses challenges when applied to real-world environments that are difficult to model in simulators.

To address this issue, we propose to leverage Uni-O4 in an online-offline-online framework. The agent is initially pretrained in simulators (online) and then fine-tuned on real-world robots (offline, followed by online), as illustrated in the following pipeline. Offline fine-tuning keeps real-world robot learning safe, while the subsequent online stage further improves the policy. This paradigm demonstrates sample-efficient fine-tuning and safe robot learning. Results are presented in the videos below.



Figure: the workflow of the Uni-O4 online-offline-online fine-tuning framework on real-world robots.
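
For readers who prefer code to prose, here is a minimal, purely organizational sketch of that online-offline-online schedule. The `Phase` dataclass and `run` driver are illustrative scaffolding of our own, not part of the released code; the quoted budgets are the ones mentioned alongside the videos below, and `train_phase` stands in for the same Uni-O4-style on-policy update applied in every phase.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Phase:
    name: str
    data_source: str   # where the transitions for this phase come from
    budget: str        # training budget quoted on this page

# The online-offline-online schedule described above.
SCHEDULE: List[Phase] = [
    Phase("online pretraining", "simulator rollouts", "about 10 minutes of training"),
    Phase("offline fine-tuning", "logged real-robot data", "0.18 million env steps collected"),
    Phase("online fine-tuning", "real-robot rollouts", "0.1 million env steps"),
]

def run(schedule: List[Phase], train_phase: Callable[[Phase], None]) -> None:
    """Run the three phases in order; only the data source changes between them."""
    for phase in schedule:
        train_phase(phase)

if __name__ == "__main__":
    run(SCHEDULE, lambda p: print(f"[{p.name}] {p.data_source} ({p.budget})"))
```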


Offline fine-tuned by Uni-O4 (0.18 million env steps collected) vs. simulator-pretrained (10 minutes of training) policy deployment at low speed



Online fine-tuned by Uni-O4 (0.1 million env steps) vs. offline fine-tuned by Uni-O4 policy deployment at high speed



Online fine-tuned by Uni-O4 vs. IQL offline-to-online fine-tuned policy deployment at high speed



Online fine-tuned by Uni-O4 vs. sim2real baseline policy deployment at high speed

Simulated Tasks

For simulated tasks, Uni-O4 combines stability, consistency, and efficiency, outperforming all baseline methods with its unified training scheme. As shown in the figures below, Uni-O4 provides better offline initialization than other baselines and then exhibits stable and rapid performance improvement during online fine-tuning. For more results, please refer to our paper.

Figures: offline initialization and online fine-tuning curves on the simulated benchmark tasks.