DL Paper Reading Group (DL輪読会) material
Learning Reward for Robot Skills Using Large Language Models via Self-Alignment
Presenter: Yaonan Zhu, Matsuo-Iwasawa Lab
Paper Information (ICML 2024)
• Title: Learning Reward for Robot Skills Using Large Language Models via Self-Alignment
• Authors: Yuwei Zeng, Yao Mu, Lin Shao
• Affiliations: National University of Singapore, The University of Hong Kong
• Link: https://sites.google.com/view/rewardselfalign
Abstract: We propose a method to learn rewards more efficiently in the absence of humans. Our approach consists of two components: we first use the LLM to propose features and a parameterization of the reward, then update the parameters through an iterative self-alignment process. In particular, the process minimizes the ranking inconsistency between the LLM and the learned reward functions based on the execution feedback. The method was validated on 9 tasks across 2 simulation environments. It demonstrates a consistent improvement in training efficacy and efficiency, while consuming significantly fewer GPT tokens than the alternative mutation-based method.
Introduction
• Reinforcement learning
  – Acquiring complex skills
    • Walking over uneven terrain, dexterous manipulation, etc.
  – Depends on a carefully designed reward function
    • Relies on expert knowledge of the task
    • Followed by non-trivial tuning to both optimize efficacy and prevent the policy from exploiting flaws
• Inverse reinforcement learning
  – Automatically learns the reward function from expert demonstrations
  – Requires gathering expert demonstrations that cover the vast variety and complexity of the state space in order to yield robust control
[Figure: RL vs. IRL]
Introduction
• Large Language Models (LLMs)
  – Trained on extensive human data, LLMs have been shown to embed rich, useful task-related knowledge
  – LLMs have been used to directly propose actions or reward values
• LLMs for reward learning
  – Using an LLM to learn reward functions is still challenging: tasks are sensitive to exact numerical values, while LLMs have limited capacity for precise numerical prediction
[Figure: LLM proposing actions / values]
Proposed method
• Research question
  – Is there a way to learn the reward more efficiently in the absence of humans (by utilizing LLMs)?
• LLM capabilities
  – LLMs have shown promising ability in summarizing and classifying text, which allows them to effectively distinguish different observations presented in textual form
• Proposal
  – Extract ranking signals from the LLM, which can guide reward learning more robustly than direct value prediction of the parameters
Approach
• Overview of the method
  – We learn the reward function using an LLM with a bi-level optimization structure. We first use the LLM to propose features and a parameterization of the reward function. Next, we update the parameters of this proposed reward function through an iterative self-alignment process. In particular, this process minimizes the ranking inconsistency between the LLM and our learned reward functions based on the new observations.
Approach
• First utilize the LLM to break a task down into steps with Dos and Don'ts through Chain of Thought, and propose the initial reward parameterization, in particular the feature selection and template structure (a hypothetical sketch follows this slide)
• Next, iteratively update the parameters of the proposed reward function in a self-alignment process that operates with a double-loop structure
  – The inner loop induces the optimal policy from the current reward function, samples trajectories using this policy, and generates execution descriptions with the proposed reward features
  – The outer loop updates the reward parameters by aligning the ranking the LLM proposes from the execution description feedback with the ranking from the current reward function
  – When no discrepancy exists yet no effective policy is developed, the reward parametrization is also actively adjusted in the direction the LLM's reflection hints at (Liu et al., 2023), and numerically optimized to keep the same ranking self-consistency
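To make "features and parameterization" concrete, here is a minimal hypothetical sketch of what an LLM-proposed reward template for a PickCube-style task might look like: a weighted combination of named features whose weights θ are left as tunable parameters for the later self-alignment update. The feature names and shaping terms are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical LLM-proposed reward template (illustrative only): the LLM names
# the features and the template structure; theta holds the tunable weights.
def proposed_reward(obs, theta):
    # obs: dict of numpy arrays/flags the environment exposes (assumed keys)
    tcp_to_cube = np.linalg.norm(obs["tcp_pos"] - obs["cube_pos"])
    cube_to_goal = np.linalg.norm(obs["cube_pos"] - obs["goal_pos"])
    grasped = float(obs["is_grasped"])

    # Weighted sum of shaped terms; theta = [w_reach, w_grasp, w_place]
    reach_term = -np.tanh(5.0 * tcp_to_cube)    # "Do": move gripper to the cube
    grasp_term = grasped                        # "Do": hold the cube
    place_term = -np.tanh(5.0 * cube_to_goal)   # "Do": bring the cube to the goal
    return theta[0] * reach_term + theta[1] * grasp_term + theta[2] * place_term
```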
Approach
• Bi-level optimization (IRL)
  – The slide shows the general bi-level formulation and its IRL instantiation: an outer problem over the reward parameters, subject to an inner problem that solves for the optimal policy under the current reward
  – This process is similar to IRL's bi-level optimization structure, with one key difference in the outer loop: instead of minimizing differences from expert demonstrations, the method uses rankings from the LLM
  – Since all supervision signals come from the LLM, this is described as the self-alignment reward update
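For reference, the bi-level structure alluded to here can be written as follows. This is a reconstruction of the textbook IRL formulation rather than the slide's exact equations; L denotes some mismatch measure against the expert data D_E.

```latex
% Generic IRL bi-level structure (reconstruction, notation assumed)
\begin{aligned}
\min_{\theta}\;& \mathcal{L}\big(\pi^{*}_{\theta},\, \mathcal{D}_E\big)
  && \text{(outer: match expert demonstrations)}\\
\text{s.t.}\;& \pi^{*}_{\theta} = \arg\max_{\pi}\;
  \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_{t=0}^{T} R_{\theta}(s_t)\Big]
  && \text{(inner: solve RL under } R_{\theta}\text{)}
\end{aligned}
```

In the self-alignment variant, the outer objective replaces the expert demonstrations D_E with ranking agreement against the LLM's preferences over sampled trajectories.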
Contributions
• The paper proposes a framework to learn reward functions with an LLM through an iterative self-alignment process, which periodically updates the reward function to minimize the ranking inconsistency between executions ranked by the LLM and by the current reward function.
• Building on the self-alignment process, it includes active parameter adjustment with LLM heuristics to improve reward saliency, while preventing unintentional flaws by enforcing ranking consistency.
• The paper validates the framework on 9 tasks in 2 simulation environments. It demonstrates a consistent improvement in training efficacy and efficiency while being token-efficient compared to the alternative method.
Problem Definition
• Finite-horizon Markov Decision Process
  – Parameterized by (S, A, R, T), where S, A are the state and action spaces, R: S → ℝ is the reward function, and T is the horizon
  – A policy π is a mapping from states to probabilities over actions, π(a|s)
  – The expected return of the policy is J(π) = E_{τ∼π}[ Σ_{t=0}^{T} R(s_t) ]
  – The expert policy is one that optimizes this return w.r.t. the ground-truth reward R
• In the paper, a partial MDP is given without (i) the reward function R and (ii) any form of expert demonstrations. Instead, we have access to an LLM that can rank a sequence of M trajectories {τ_k}, k = 1, …, M, with decreasing preference, based on their last states
Reward learning from pairwise preference
• Denoting by D the dataset of N pairwise trajectory preferences (τ_i, τ_j) with τ_i ≻ τ_j, we seek the true reward parameter θ that maximizes the posterior P(θ | D) ∝ P(D | θ) P(θ)
• The prior is system-dependent; a common choice without special assumptions is a uniform prior over the domain, U[θ_min, θ_max]
• The pairwise preference likelihood is modeled with the Bradley–Terry model: P(τ_i ≻ τ_j | θ) = exp(R_θ(τ_i)) / ( exp(R_θ(τ_i)) + exp(R_θ(τ_j)) )
• A sampling-based method is used to optimize θ: the parameter θ is optimized to maximize the agreement between the reward model's rankings (based on R_θ) and the LLM's rankings (treated as ground truth)
• How to construct the pairwise dataset D therefore becomes important
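As a rough illustration of this step (a minimal sketch under assumed names, not the paper's implementation), the Bradley–Terry log-likelihood over the preference pairs can be maximized with a simple sampling-based search over θ, here assuming a linear reward on per-step features:

```python
import numpy as np

def traj_return(theta, features):
    # features: (T, d) per-step feature matrix of one trajectory.
    # A linear reward R_theta(s) = theta . phi(s) is an assumption of this sketch.
    return float((features @ theta).sum())

def bt_log_likelihood(theta, pairs):
    # pairs: list of (features_preferred, features_dispreferred)
    ll = 0.0
    for f_win, f_lose in pairs:
        r_win, r_lose = traj_return(theta, f_win), traj_return(theta, f_lose)
        # Bradley-Terry: P(tau_i > tau_j) = exp(r_i) / (exp(r_i) + exp(r_j))
        ll += r_win - np.logaddexp(r_win, r_lose)
    return ll

def fit_reward(pairs, dim, n_samples=2000, theta_min=-1.0, theta_max=1.0, seed=0):
    # Sampling-based search: draw candidates from the uniform prior
    # U[theta_min, theta_max] and keep the one with the highest likelihood.
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(theta_min, theta_max, size=(n_samples, dim))
    scores = [bt_log_likelihood(th, pairs) for th in candidates]
    return candidates[int(np.argmax(scores))]
```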
Main Method
Self-alignment reward update
• Within each iteration, the policy is first updated using RL with the current reward function (lines 2-4)
• Next, M samples are drawn from the updated policy by rolling it out, so that the collected trajectories reflect the current policy behavior (line 5)
• The samples are then aggregated and two ranking sets are retrieved: one from the current reward function R_θ and one from the LLM through the textual feedback (lines 7-8)
• To generate the dataset of pairwise comparisons D = {(τ_i^1, τ_j^1), (τ_i^2, τ_j^2), …}, where τ_i^k ≻ τ_j^k, all inconsistent pairs are first parsed by comparing the two ranking sets
• To resolve the reward inconsistency while also maintaining the consistency already achieved, an equal number of consistent pairs is additionally sampled from the comparison (lines 9-10); a rough sketch of this loop follows
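A rough sketch of one outer-loop iteration described above, reusing the hypothetical helpers from the earlier snippets (traj_return, fit_reward). The callables run_rl, rollout, describe, and llm_rank are assumed interfaces to the RL trainer, the environment, the textual feedback generator, and the LLM ranker; none of this is the paper's code.

```python
import numpy as np

def build_preference_pairs(rank_llm, rank_reward):
    """Compare two rankings (lists of sample indices, best first) and return
    (inconsistent_pairs, consistent_pairs), each a list of (winner, loser)
    pairs according to the LLM ranking."""
    pos_llm = {i: p for p, i in enumerate(rank_llm)}
    pos_rew = {i: p for p, i in enumerate(rank_reward)}
    inconsistent, consistent = [], []
    for a in rank_llm:
        for b in rank_llm:
            if pos_llm[a] < pos_llm[b]:           # LLM prefers a over b
                if pos_rew[a] > pos_rew[b]:       # reward model disagrees
                    inconsistent.append((a, b))
                else:                             # both agree: a over b
                    consistent.append((a, b))
    return inconsistent, consistent

def self_alignment_iteration(theta, policy, run_rl, rollout, describe, llm_rank,
                             n_rollouts=8, rng=None):
    rng = rng or np.random.default_rng(0)
    # 1. Policy improvement under the current reward R_theta (lines 2-4);
    #    the per-step reward is theta . phi(s), as in traj_return.
    policy = run_rl(policy, lambda phi: float(phi @ theta))
    # 2. Roll out M samples from the updated policy (line 5); each rollout
    #    is assumed to return the (T, d) feature matrix of the trajectory.
    trajs = [rollout(policy) for _ in range(n_rollouts)]
    # 3. Two ranking sets: by the current reward and by the LLM (lines 7-8).
    rank_reward = sorted(range(n_rollouts),
                         key=lambda i: traj_return(theta, trajs[i]), reverse=True)
    rank_llm = llm_rank([describe(t) for t in trajs])   # indices, best first
    # 4. All inconsistent pairs plus an equal number of consistent ones (lines 9-10).
    inc, con = build_preference_pairs(rank_llm, rank_reward)
    chosen = (list(rng.choice(len(con), size=min(len(inc), len(con)), replace=False))
              if con else [])
    pairs = [(trajs[w], trajs[l]) for w, l in inc + [con[i] for i in chosen]]
    # 5. Re-fit theta on D by maximizing the Bradley-Terry likelihood.
    theta = fit_reward(pairs, dim=len(theta)) if pairs else theta
    return theta, policy
```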
An example of self-alignment
• Example rankings from the LLM and the reward model
  – Line 9: If the reward model R_θ ranks sample 0 higher than sample 2 (τ0 ≻ τ2), but the LLM ranks sample 2 higher than sample 0 (τ2 ≻ τ0), this pair (τ2, τ0) is included in D_inc
  – Line 10: If both the reward model and the LLM rank sample 5 higher than sample 4 (τ5 ≻ τ4), e.g., because sample 5 has better overall performance (closer to the hole with better alignment), then (τ5, τ4) is part of D_con
  – Acquire D = D_inc + D_con
  – The result is a learned reward shaped by the LLM's preferences
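Plugging this example into the pair-construction sketch above, with made-up rankings over six samples (indices 0-5 are illustrative, not from the paper):

```python
rank_llm    = [5, 2, 4, 0, 1, 3]   # LLM prefers sample 2 over sample 0
rank_reward = [5, 0, 4, 2, 1, 3]   # current R_theta prefers sample 0 over 2

inc, con = build_preference_pairs(rank_llm, rank_reward)
print((2, 0) in inc)   # True: the disagreement goes into D_inc
print((5, 4) in con)   # True: both agree 5 > 4, so (5, 4) is in D_con
```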
Experiments
• Tasks
  – Six evaluation tasks from ManiSkill2: PickCube, PickSingleYCB, PegInsertionSide, OpenCabinetDoor, OpenCabinetDrawer, PushChair
• Baselines
  – Expert-designed oracle rewards from the original environment implementations (Can this pipeline generate effective reward functions that induce optimal policies across varied skill learning?)
  – LLM-proposed reward whose parameterization stays fixed during training (Can the periodic update through self-alignment improve the numerical impreciseness and instability, and thus the efficacy of the reward functions?)
[Figure: the six evaluation tasks from ManiSkill2]
Demonstrations
• Pick YCB Mug
Results
• Success rates vs. exploration steps on 6 ManiSkill tasks with SAC
• The updated reward produces policies with performance similar to those trained with the oracle reward on 5 tasks
• Compared to using a fixed reward function generated by the LLM, the approach consistently improves training, with a faster convergence rate and/or higher final performance
Future perspectives
• Integration with LLM-based robot control
  – Most LLM-based robots use the LLM as a task-planning tool and accomplish each subtask via a predefined skill library (manually configured or human-demonstrated)
  – Skill libraries could be acquired automatically using the outcome of this paper
[Figure: LLM-based human-robot collaboration pipeline (Liu et al. 2024, RA-L): a first-time instruction "Open the oven" is named "open_oven_handle" by the LLM; the basic motion sequence (move_to_position(oven_handle), gripper_control(close), base_cycle_move()) is corrected through a UI and manual teleoperation, stored as dmp_pub(open_oven_handle) in a DMP library, and reused one-shot when the same task is requested again]
Thank You!