Real‑World Reinforcement Learning

Beyond Human Capability: The Robotic Era

Kun Lei
October 2025

Foreword

The wave of embodied intelligence is sweeping the globe, reaching unprecedented visibility across academia and industry. Startups are springing up everywhere, converging on a shared vision—bringing AGI into the physical world. In this “first year of embodiment,” we focus on one core topic: deployment. We also introduce our latest result in real‑robot reinforcement learning—RL‑100.

We see dazzling robot demos every day—parkour, backflips, dancing, boxing. Impressive, yes—but what kind of robots do we actually need? We look forward to the day a robot can do laundry, make breakfast, and act as a reliable daily assistant. To reach that vision, we prioritize deployability metrics: reliability, efficiency, and robustness. No one wants a helper that breaks three plates while doing dishes or needs five hours to cook a meal. These metrics motivate this work and serve as our evaluation yardsticks.

The demands of real-world deployment.

Reinforcement learning has dazzled the world in milestones like AlphaGo and GPT. Yet the refrain—“RL doesn’t work for robotics”—persists. Why? In games or language, data is abundant (massive simulation, web text). In robotics, real‑world data is expensive and sim‑to‑real gaps are severe (dynamics, sensing—vision, touch). A data‑hungry method like RL struggles without careful system design.

Below, we outline the pros/cons of Imitation Learning (IL), Human‑in‑the‑Loop (HITL) augmentation, and Reinforcement Learning (RL)—and how to combine them.

Imitation Learning: Strengths and Ceiling

Learning from strong human priors gets a robot up and running quickly. But high‑quality real data is scarce and expensive to collect.

Thus, pure supervision faces a ceiling of imitation: performance is bounded by the demonstrator’s ability and inherits inefficiencies, biases, and occasional errors.

HITL Augmentation: Gains and Limits

Putting humans in the loop to patch IL is practical: use HITL corrections or a world/reward model to identify weak regions of the policy’s state/trajectory distribution, then target data collection and fine‑tuning there. This expands coverage and improves generalization—essentially coverage plus human correction to compensate for IL’s bottlenecks.
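One concrete way to find those weak regions is to score states by how much an ensemble of policies (or a value/reward-model estimate) disagrees on them, and route the most uncertain ones to human teleoperators. A minimal Python sketch, assuming a list of policies exposing a hypothetical .act(state) method; the threshold is illustrative:

import numpy as np

def flag_weak_states(states, policies, threshold=0.15):
    """Flag states where an ensemble of policies disagrees the most; these are
    candidates for targeted human demonstrations. The .act(state) interface and
    the threshold are illustrative assumptions, not part of any released API."""
    weak = []
    for s in states:
        actions = np.stack([p.act(s) for p in policies])   # (K, action_dim)
        disagreement = float(actions.std(axis=0).mean())   # mean per-dimension std
        if disagreement > threshold:
            weak.append(s)
    return weak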

However, if a policy never experiences failure and relies mainly on external handholding, it won’t learn to avoid bad states. RL treats failure as a crucial learning signal: through interaction, exploration, and correction, the policy learns which behaviors push the system toward undesirable states—and how to act optimally within them. The goal isn’t to “fall into a bad distribution and then fix it,” but to avoid entering and amplifying those bad distributions. Also, IL is bounded by its dataset; truly superhuman robots must go beyond imitation.

Why RL—and What Blocks It

RL provides a complementary path: it optimizes return, not imitation error, and discovers strategies rare or absent in demos. Two blockers in robotics:

  1. Costly real‑world data: starting from scratch is sample‑inefficient.
  2. Sim‑to‑real gaps: differences in dynamics and perception (vision, touch, etc.).

A central question emerges: How do we leverage strong human priors and continually improve via self‑directed exploration? Consider how children learn to walk: guided by parents first, then autonomous practice until mastery across terrains. Likewise, a practical robot system should combine human priors with self‑improvement to reach—and surpass—human‑level reliability, efficiency, and robustness.

Human-like learning paradigm.

RL‑100: Learning Like Humans

We start with human priors—teachers “teach,” but students need self‑practice to generalize. RL‑100 adds real‑world RL post‑training on top of a diffusion‑policy IL backbone. It retains diffusion’s expressivity while using lightly guided exploration to optimize deployment metrics—success rate, efficiency, and robustness. In short: start from human, align with human‑grounded objectives, then go beyond human.

Three stages
  1. IL pretraining: Teleop demos provide a low‑variance, stable starting point—the sponge layer of the cake.
  2. Iterative Offline RL post‑training: Update on a growing buffer of interaction data—the cream layer delivering major gains.
  3. Online RL post‑training: The last mile to remove rare failures—the cherry on top. A small, targeted budget pushes ~95% → 99%+.
RL‑100 stages vs. cake making.
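A minimal skeleton of the three-stage recipe, written in Python with assumed interfaces (fit_bc, offline_rl_update, online_rl_update, and the environment methods are illustrative names, not the released RL-100 API):

def train_rl100(policy, env, demos, n_offline_rounds=3, online_steps=10_000):
    """Three-stage skeleton: IL pretraining, then iterative offline RL,
    then online RL. All method names on `policy` and `env` are assumed."""
    # Stage 1: imitation pretraining on teleoperated demos (the sponge layer).
    policy.fit_bc(demos)

    # Stage 2: iterative offline RL on a growing buffer of interaction data (the cream).
    buffer = list(demos)
    for _ in range(n_offline_rounds):
        buffer += collect_rollouts(env, policy)    # deploy, log successes and failures
        policy.offline_rl_update(buffer)

    # Stage 3: online RL for the last mile, roughly 95% -> 99%+ success (the cherry).
    for _ in range(online_steps):
        transition = env.step(policy.act(env.observe()))
        buffer.append(transition)
        policy.online_rl_update(buffer)
    return policy

def collect_rollouts(env, policy, n_episodes=20):
    """Run the current policy in the real environment and return its transitions."""
    transitions = []
    for _ in range(n_episodes):
        env.reset()
        done = False
        while not done:
            transition = env.step(policy.act(env.observe()))
            transitions.append(transition)
            done = transition.get("done", False)
    return transitions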

RL‑100 Algorithmic Framework

We begin with Diffusion Policy (DP)‑based IL, which models multi‑modal, complex behavior distributions in demos. This is a powerful behavior prior for RL. On top of that, we apply objective‑aligned weighting and fine‑tuning so the policy learns when/where/why to choose better actions rather than replay frequent demo actions.
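One concrete way to implement such objective-aligned weighting is to scale each sample's denoising loss by an exponentiated advantage, in the spirit of advantage-weighted regression; whether RL-100 uses exactly this form is not specified here, and policy.denoising_loss plus the batch fields are assumed interfaces. A PyTorch-style sketch:

import torch

def advantage_weighted_diffusion_loss(policy, batch, beta=1.0, w_max=20.0):
    """Weight each sample's diffusion denoising loss by exp(advantage / beta),
    so higher-return actions are imitated more strongly; weights are clipped
    so a few samples cannot dominate the update."""
    obs, act, adv = batch["obs"], batch["action"], batch["advantage"]
    per_sample = policy.denoising_loss(obs, act, reduction="none")  # shape (B,)
    weights = torch.clamp(torch.exp(adv / beta), max=w_max)
    return (weights.detach() * per_sample).mean()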

Because human data alone can’t cover the full state–action space, we place the policy in the real environment for trial and error and combine human data with policy-generated data.

Online RL is precious and sensitive; we use it for the last mile, polishing ~95% to 99%+. With a unified offline–online objective, the transition is seamless and regression‑free.
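A simple way to realize this mix of human and policy data is a replay buffer that always samples a fixed fraction of demonstrations alongside fresh rollouts; the 25% ratio below is an illustrative assumption, not an RL-100 hyperparameter.

import random

class MixedReplayBuffer:
    """Keeps human demonstrations and policy rollouts side by side and mixes
    them at a fixed ratio when sampling training batches (the ratio is an assumed knob)."""
    def __init__(self, demos, demo_fraction=0.25):
        self.demos = list(demos)
        self.rollouts = []
        self.demo_fraction = demo_fraction

    def add(self, transitions):
        self.rollouts.extend(transitions)

    def sample(self, batch_size):
        n_demo = int(batch_size * self.demo_fraction)
        pool = self.rollouts if self.rollouts else self.demos
        return (random.choices(self.demos, k=n_demo)
                + random.choices(pool, k=batch_size - n_demo))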

RL‑100 training workflow.
RL-100 algorithm pipeline.

Key Modules — Takeaways

Experiments

Main Results

Across 7 real‑robot tasks, RL‑100 achieved 900/900 total successes, including 250 consecutive successes on a single task over more than 2 hours of nonstop operation. Success remains high under physical disturbances and in zero‑shot and few‑shot adaptation settings. An orange‑juicing robot provided 7 hours of continuous service in a shopping mall with zero failures.

Tasks.
7‑hour mall service, zero failures.
Main results.

Robustness Under Human Disturbances

Overall: 95.0% average success across tested scenarios, indicating reliable recovery under unstructured perturbations.

Zero‑Shot Generalization

Average 92.5% success across four change types without retraining.

Few‑Shot Generalization

Average 86.7% with only 1–3 hours of additional training.

Efficiency vs. Baselines & Humans

Summary: Gains from (1) efficient encoding (point cloud > RGB), (2) reward‑driven optimization (γ < 1), (3) one‑step inference. Monotonic trend: DP‑2D → DP3 → RL‑100 DDIM → RL‑100 CM.

Execution efficiency.
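The inference-time gap behind the DDIM vs. CM comparison can be sketched as follows: multi-step DDIM denoising pays one network call per step, while a consistency-distilled head maps noise to an action in a single call. Method names below are assumed for illustration, not the actual RL-100 interface.

def act_ddim(policy, obs, n_steps=10):
    """Multi-step DDIM-style sampling: one denoising network call per step."""
    a = policy.sample_noise()
    for t in reversed(range(n_steps)):
        a = policy.denoise_step(a, obs, t)
    return a

def act_consistency(policy, obs):
    """One-step consistency-model inference: a single network call per action,
    cutting per-action latency relative to multi-step DDIM."""
    return policy.consistency_step(policy.sample_noise(), obs)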

Robot vs. Humans — Bowling

Robot vs. 5 people: 25/25 vs. 14/25.

RL-100 workflow.

Training Efficiency

Training efficiency curve.

Ablations — Takeaways

Ablation results.

Outlook

Next we will stress‑test in more complex, cluttered, partially observable settings—closer to homes and factories: dynamic multi‑object scenes, occlusions, reflective/transparent materials, lighting changes, and non‑fixed layouts. Building on near‑perfect success, long‑horizon stability, and near‑human efficiency, these tests will better reveal deployment limits and failure modes.

Small diffusion policies with modest fine‑tuning already achieve high reliability and efficiency. We plan to extend this post‑training recipe to multi‑task, multi‑robot, multi‑modal vision‑language‑action (VLA) models.

Although the pipeline supports conservative operation and stable fine‑tuning, reset and recovery remain bottlenecks. We will explore autonomous reset mechanisms—learned reset policies, scripted recovery motions, task‑aware fixtures, and failure‑aware action chunking—to reduce human intervention and downtime and to stabilize online improvement, complementing RL‑100.