
Actor-Critic Reinforcement Learning Methods

Project: Reinforcement Learning AI
Content

Reference video: https://www.bilibili.com/video/BV1Sq4y1q7sw/?spm_id_from=333.337.search-card.all.click&vd_source=36d7ac85fed0652f1ac894d4fa5e20f2

The actor is the athlete and the critic is the referee: two neural networks, both learned from the environment's rewards. Actor-Critic Method = Value Network + Policy Network.

State-value function approximation

State-Value Function

$$ V_\pi(s) = \sum_a \pi(a \mid s) \cdot Q_\pi(s, a) $$

  • $V_\pi(s)$: the state-value function, i.e. the value of state $s$ under policy $\pi$, defined as the expectation of the action-value function $Q_\pi(s, a)$ over the policy.
  • $\sum_a$: in the discrete case, a sum over all actions $a$; in the continuous case, a definite integral.
  • $\pi(a \mid s)$: the policy function, the probability of taking action $a$ in state $s$ (the policy distribution).
  • $Q_\pi(s, a)$: the action-value function, the value of taking action $a$ in state $s$ under policy $\pi$.

When neither the policy nor the action-value function is known, each is approximated by its own neural network, and the actor-critic method learns the two networks simultaneously. A small numeric example of the expectation above follows.
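A minimal numeric sketch of this expectation, using made-up probabilities and action values for three discrete actions (all numbers are illustrative):

```python
import numpy as np

# pi_s[a] = pi(a | s): the policy's probability of each action in state s
# q_s[a]  = Q_pi(s, a): the action value of each action in state s
pi_s = np.array([0.2, 0.5, 0.3])
q_s = np.array([1.0, 3.0, -2.0])

# V_pi(s) = sum_a pi(a | s) * Q_pi(s, a): the expectation of Q under the policy
v_s = float(np.dot(pi_s, q_s))
print(v_s)  # 0.2*1.0 + 0.5*3.0 + 0.3*(-2.0) = 1.1
```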

Policy network (actor)

The policy neural network.

  • Use neural net $\pi(a \mid s;\theta)$ to approximate $\pi(a \mid s)$
  • $\theta$ : trainable parameters of the neural net
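A minimal sketch of such a policy network, assuming PyTorch and a discrete action space; `state_dim`, `num_actions`, and the hidden size are placeholder choices, not values from the source:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Approximates pi(a | s; theta) for a discrete action space."""

    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns the logits into a probability distribution over actions.
        return torch.softmax(self.net(state), dim=-1)
```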

Value network (critic)

  • Use neural net $q(s, a; w)$ to approximate $Q_\pi(s, a)$.
  • $w$: trainable parameters of the neural net. The value network does not control the motion; it only scores actions. The actor is like an athlete and the critic like a referee: the athlete performs actions but does not know how good they are, so the referee has to score them. A sketch of this network follows.
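A matching sketch of the value network $q(s, a; w)$, again assuming PyTorch; here the action is passed in one-hot form and the network returns a single scalar score, which is one common design choice rather than the source's exact architecture:

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Approximates q(s, a; w): scores how good action a is in state s."""

    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_actions, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # a single scalar score for the (s, a) pair
        )

    def forward(self, state: torch.Tensor, action_onehot: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action_onehot], dim=-1))
```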

Training process of DRL Agent


At the beginning of training, we obtain n candidate boxes from the dataset. The boxes are divided into batches, and each batch contains k boxes.

The actor, denoted as f (corresponding to the architecture shown in the figure), is responsible for predicting the box index, orientation, and placement position. The critic, denoted as g, serves as the baseline by providing the correct current and next states, and is also used together with f to compute the contrastive learning loss.

  • g outputs the embedding vector of the pair $(S_t, S_{t+1})$.
  • Each state $S_t$ is represented as a vector, which includes whether the box is placed (0/1), the box dimensions, and its placement position (a hypothetical layout of this vector is sketched after this list).
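A hypothetical per-box state vector, just to make the layout concrete; the field order and dimensionality are assumptions, not taken from the source:

```python
import numpy as np

# [placed_flag, length, width, height, x, y, z]  -- hypothetical layout
# placed_flag: 0/1, whether the box has been placed
# (length, width, height): box dimensions
# (x, y, z): placement position (zeros while the box is unplaced)
box_state = np.array([1.0, 0.4, 0.3, 0.2, 0.0, 0.5, 0.0], dtype=np.float32)
```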

The training process continuously samples from the actor f to generate the box size, orientation, and placement position. Finally, the contrastive learning loss between f and g is computed, and the parameters of both f and g are updated accordingly.
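A minimal sketch of one such training step, assuming the actor `f` returns an embedding of its sampled placement and the critic `g` returns an embedding of the true $(S_t, S_{t+1})$ pair; the InfoNCE-style loss below is an assumption about how the contrastive learning loss is computed, since the source does not spell it out:

```python
import torch
import torch.nn.functional as F

def training_step(actor_f, critic_g, optimizer, states, next_states, temperature=0.1):
    """One hypothetical update on a batch of k boxes."""
    pred_emb = actor_f(states)                  # f: embedding of the sampled box size / orientation / position
    target_emb = critic_g(states, next_states)  # g: embedding of the correct (S_t, S_{t+1}) pair

    # Contrastive loss: each predicted embedding should match its own target
    # embedding (positive pair); the other k-1 targets in the batch are negatives.
    pred_emb = F.normalize(pred_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    logits = pred_emb @ target_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss = F.cross_entropy(logits, labels)

    # A single backward pass updates the parameters of both f and g.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer would be built over `list(actor_f.parameters()) + list(critic_g.parameters())`, so one step updates both networks, matching the "update the parameters of both f and g" step above.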