[ot][ml] Fwd: Learning to Identify Critical States for Reinforcement Learning
I'm just posting this because part of me finds AI algorithms fun; it's nice to enjoy something esoteric and powerful and supported by a current trend. I imagine computer scientists might find such things interesting too, and there's little content on the list. I was thinking about the high utility of exploring states known or guessed to be useful for tasks. The paper prepares this information offline, unsupervised, from video, somehow, likely with limitations.

Via "X (formerly Twitter)":

Learning to Identify Critical States for Reinforcement Learning from Videos
Published on Aug 15
abs: https://arxiv.org/abs/2308.07795
pdf: https://arxiv.org/pdf/2308.07795.pdf
paper page: https://huggingface.co/papers/2308.07795

Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first learn by itself to identify and recognize relevant states/actions/rewards. Without relying on ground-truth annotations, our new method called Deep State Identifier learns to predict returns from episodes encoded as videos. Then it uses a kind of mask-based sensitivity analysis to extract/identify important critical states. Extensive experiments showcase our method's potential for understanding and improving agent behavior. The source code and the generated datasets are available at https://github.com/AI-Initiative-KAUST/VideoRLCS.

Snippets:

… Inspired by the existing evidence that frequently only a few decision points are important in determining the return of an episode [1, 13], and as shown in Fig. 1, we focus on identifying the states underlying these critical decision points. However, the problem of directly inferring critical visual input based on the return is nontrivial [13], and compounded by our lack of explicit access to actions or policies during inference. To overcome these problems, and inspired by the success of data-driven approaches [72, 44, 27], our method learns to infer critical states from historical visual trajectories of agents. …

… Our proposed architecture comprises a return predictor and a critical state detector. The former predicts the return of an agent given a visual trajectory, while the latter learns a soft mask over the visual trajectory where the non-masked frames are sufficient for accurately predicting the return. Our training technique explicitly minimizes the number of critical states to avoid redundant information through a novel loss function. If the predictor can achieve the same performance using a small set of frames, we consider those frames critical. Using a soft mask, we obtain a rank that indicates the importance of states in a trajectory, allowing for the selection of critical states with high scores. During inference, critical states can be directly detected without relying on the existence of a return predictor. …

Karl's Conclusion: Their approach is model-agnostic and assumes an existing reward function. They train a model to identify the minimal data needed to predict the reward; that data reflects the observations found to relate to the reward.
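Since the architecture description above is fairly concrete, here is a minimal sketch of what the two components might look like, assuming per-frame features have already been extracted from the video. The class names, shapes, and layer sizes are my own placeholders, not the authors' code (which is in the linked repo):

# Minimal sketch (not the authors' code): a return predictor that scores a
# video trajectory, and a critical-state detector that emits a soft mask
# over frames. Shapes, layer sizes, and names are my own assumptions.
import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    """Predicts the episode return from a (possibly masked) stack of frame features."""
    def __init__(self, frame_dim=512, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames, mask=None):
        # frames: (B, T, frame_dim) pre-extracted per-frame features
        # mask:   (B, T) soft weights in [0, 1]; 1 = keep the frame
        h = self.encoder(frames)
        if mask is not None:
            h = h * mask.unsqueeze(-1)                  # down-weight non-critical frames
        return self.head(h.mean(dim=1)).squeeze(-1)     # predicted return, shape (B,)

class CriticalStateDetector(nn.Module):
    """Outputs a soft mask over the trajectory; high values mark critical states."""
    def __init__(self, frame_dim=512, hidden=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(frame_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, frames):
        return torch.sigmoid(self.scorer(frames)).squeeze(-1)   # (B, T) in [0, 1]

As the snippet from the paper says, once trained the detector can be used on its own: you score a trajectory, rank frames by mask value, and take the high-scoring ones as critical states, without needing the return predictor at inference time.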
Reading a little further, I note the solution they found has some complexity to it, involving multiple loss functions (importance, compactness, reverse loss) and iterative training sessions where a return predictor is trained alongside the critical state detector. The loss functions are summed and shared (a rough sketch of how they might combine follows below). Pseudocode is in an appendix.

… Our approach outperforms comparable methods for identifying critical states in the analyzed environments. It can also explain the behavioral differences between policies and improve policy performance through rapid credit assignment. Future work will focus on applying this method to hierarchical RL and exploring its potential in more complex domains.
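Returning to the loss functions: here is a rough sketch of how the three terms might combine into one shared objective, again under my own assumptions rather than the paper's exact formulations. The importance term keeps the return predictable from the masked frames, the compactness term keeps the mask sparse, and the reverse term pushes the complementary (masked-out) frames toward being uninformative. It builds on the hypothetical ReturnPredictor and CriticalStateDetector from the earlier snippet; the weights are made up.

# Rough sketch of the summed objective (my paraphrase, not the paper's exact losses).
import torch
import torch.nn.functional as F

def training_step(predictor, detector, frames, true_return,
                  w_importance=1.0, w_compact=0.1, w_reverse=1.0):
    # frames: (B, T, frame_dim) per-frame features; true_return: (B,)
    mask = detector(frames)                     # (B, T) soft mask in [0, 1]

    # Importance: the masked trajectory alone should still predict the return.
    loss_importance = F.mse_loss(predictor(frames, mask), true_return)

    # Compactness: keep the mask sparse so only a few frames count as critical.
    loss_compact = mask.mean()

    # Reverse: the complementary (non-critical) frames should be uninformative;
    # exp(-MSE) shrinks toward zero as that prediction gets worse, which is what we want.
    loss_reverse = torch.exp(-F.mse_loss(predictor(frames, 1.0 - mask), true_return))

    # The losses are summed into one shared objective.
    return (w_importance * loss_importance
            + w_compact * loss_compact
            + w_reverse * loss_reverse)

The iterative training sessions mentioned above presumably amount to alternating which module's parameters are updated against this summed loss; the appendix pseudocode and the repo would be the place to check the exact schedule.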
participants (1)
fuzzyTew