[ot][ml] Fwd: Learning to Identify Critical States for Reinforcement Learning
I'm just posting this because part of me finds AI algorithms fun; it's nice to enjoy something esoteric and powerful and supported by a current trend. I imagine computer scientists might find such things interesting too, and there's little content on the list. I was thinking about the high utility of exploring states known or guessed to be useful for tasks. The paper prepares this information offline, unsupervised, from video, somehow, likely with limitations.

Via "X (formerly Twitter)":

Learning to Identify Critical States for Reinforcement Learning from Videos
Published on Aug 15
abs: https://arxiv.org/abs/2308.07795
pdf: https://arxiv.org/pdf/2308.07795.pdf
paper page: https://huggingface.co/papers/2308.07795

Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first learn by itself to identify and recognize relevant states/actions/rewards. Without relying on ground-truth annotations, our new method called Deep State Identifier learns to predict returns from episodes encoded as videos. Then it uses a kind of mask-based sensitivity analysis to extract/identify important critical states. Extensive experiments showcase our method's potential for understanding and improving agent behavior. The source code and the generated datasets are available at https://github.com/AI-Initiative-KAUST/VideoRLCS.

Snippets:

… Inspired by the existing evidence that frequently only a few decision points are important in determining the return of an episode [1, 13], and as shown in Fig. 1, we focus on identifying the states underlying these critical decision points. However, the problem of directly inferring critical visual input based on the return is nontrivial [13], and compounded by our lack of explicit access to actions or policies during inference. To overcome these problems, and inspired by the success of data-driven approaches [72, 44, 27], our method learns to infer critical states from historical visual trajectories of agents. …

… Our proposed architecture comprises a return predictor and a critical state detector. The former predicts the return of an agent given a visual trajectory, while the latter learns a soft mask over the visual trajectory where the non-masked frames are sufficient for accurately predicting the return. Our training technique explicitly minimizes the number of critical states to avoid redundant information through a novel loss function. If the predictor can achieve the same performance using a small set of frames, we consider those frames critical. Using a soft mask, we obtain a rank that indicates the importance of states in a trajectory, allowing for the selection of critical states with high scores. During inference, critical states can be directly detected without relying on the existence of a return predictor. …

Karl's Conclusion: Their approach is model-agnostic and assumes an existing reward function. They train a model to identify the minimal data needed to predict the reward; that data reflects the observations found to relate to the reward.
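Since the architecture description above is fairly concrete, here is a minimal sketch of what the two components might look like, assuming per-frame features have already been extracted from the video. The class names, shapes, and layer sizes are my own placeholders, not the authors' code (which is in the linked repo):

# Minimal sketch (not the authors' code): a return predictor that scores a
# video trajectory, and a critical-state detector that emits a soft mask
# over frames. Shapes, layer sizes, and names are my own assumptions.
import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    """Predicts the episode return from a (possibly masked) stack of frame features."""
    def __init__(self, frame_dim=512, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames, mask=None):
        # frames: (B, T, frame_dim) pre-extracted per-frame features
        # mask:   (B, T) soft weights in [0, 1]; 1 = keep the frame
        h = self.encoder(frames)
        if mask is not None:
            h = h * mask.unsqueeze(-1)                  # down-weight non-critical frames
        return self.head(h.mean(dim=1)).squeeze(-1)     # predicted return, shape (B,)

class CriticalStateDetector(nn.Module):
    """Outputs a soft mask over the trajectory; high values mark critical states."""
    def __init__(self, frame_dim=512, hidden=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(frame_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, frames):
        return torch.sigmoid(self.scorer(frames)).squeeze(-1)   # (B, T) in [0, 1]

As the snippet from the paper says, once trained the detector can be used on its own: you score a trajectory, rank frames by mask value, and take the high-scoring ones as critical states, without needing the return predictor at inference time.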
Reading a little further, I note the solution they found has some complexity to it, involving multiple loss functions (importance, compactness, reverse loss) and iterative training sessions where a return predictor is trained alongside the critical state detector. The loss functions are summed and shared (a rough sketch of how they might combine follows below). Pseudocode is in an appendix.

… Our approach outperforms comparable methods for identifying critical states in the analyzed environments. It can also explain the behavioral differences between policies and improve policy performance through rapid credit assignment. Future work will focus on applying this method to hierarchical RL and exploring its potential in more complex domains.
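Returning to the loss functions: here is a rough sketch of how the three terms might combine into one shared objective, again under my own assumptions rather than the paper's exact formulations. The importance term keeps the return predictable from the masked frames, the compactness term keeps the mask sparse, and the reverse term pushes the complementary (masked-out) frames toward being uninformative. It builds on the hypothetical ReturnPredictor and CriticalStateDetector from the earlier snippet; the weights are made up.

# Rough sketch of the summed objective (my paraphrase, not the paper's exact losses).
import torch
import torch.nn.functional as F

def training_step(predictor, detector, frames, true_return,
                  w_importance=1.0, w_compact=0.1, w_reverse=1.0):
    # frames: (B, T, frame_dim) per-frame features; true_return: (B,)
    mask = detector(frames)                     # (B, T) soft mask in [0, 1]

    # Importance: the masked trajectory alone should still predict the return.
    loss_importance = F.mse_loss(predictor(frames, mask), true_return)

    # Compactness: keep the mask sparse so only a few frames count as critical.
    loss_compact = mask.mean()

    # Reverse: the complementary (non-critical) frames should be uninformative;
    # exp(-MSE) shrinks toward zero as that prediction gets worse, which is what we want.
    loss_reverse = torch.exp(-F.mse_loss(predictor(frames, 1.0 - mask), true_return))

    # The losses are summed into one shared objective.
    return (w_importance * loss_importance
            + w_compact * loss_compact
            + w_reverse * loss_reverse)

The iterative training sessions mentioned above presumably amount to alternating which module's parameters are updated against this summed loss; the appendix pseudocode and the repo would be the place to check the exact schedule.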
participants (1)
fuzzyTew