Introduction

In the last article, we introduced the basic motivation and building blocks of model learning and state representation learning. We will now look at the main challenges, improvements, and evaluation methods.

Challenges

There are several fundamental issues which need to be addressed when a model is learned, namely stochasticity, uncertainty, partial observability, non-stationarity, and multi-step predictions [1].

Stochasticity

If an agent navigates through an environment (formally, a Markov Decision Process), we can consider two different cases. If the same action taken in the same state always leads to the same next state and reward, we say the environment is deterministic. If the next state and reward can change between repetitions, the environment is stochastic. If we want to learn a model of a stochastic environment, we need descriptive models that approximate the entire distribution over possible next states, or generative models that can generate samples from that distribution. Descriptive models such as tabular models, Gaussian models [2], and Gaussian mixture models are feasible for small state spaces, but for high-dimensional state spaces, deep generative models such as variational inference models [3, 4] or generative adversarial networks are more successful.
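As a minimal sketch (with placeholder names, not taken from the cited works), a simple descriptive model for a low-dimensional stochastic environment could be a network that outputs the mean and standard deviation of a Gaussian over the next state:

```python
import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """Predicts a diagonal Gaussian over the next state from (state, action)."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden_dim, state_dim)
        self.log_std_head = nn.Linear(hidden_dim, state_dim)

    def forward(self, state, action):
        h = self.net(torch.cat([state, action], dim=-1))
        mean = self.mean_head(h)
        std = self.log_std_head(h).clamp(-5, 2).exp()  # keep the std in a sane range
        return torch.distributions.Normal(mean, std)

# Training minimizes the negative log-likelihood of observed transitions:
# loss = -model(s, a).log_prob(s_next).sum(-1).mean()
```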

Uncertainty

In model-based learning, limited data can lead to uncertainty, which can be categorized into two types: epistemic uncertainty and aleatoric uncertainty (also known as stochasticity). Epistemic uncertainty can be decreased by gathering more data, while aleatoric uncertainty cannot be reduced. To ensure the reliability of predictions, we should ideally estimate both. The two main statistical approaches to uncertainty estimation are the frequentist and the Bayesian one. Non-parametric Bayesian methods such as Gaussian Processes [5] have been effective in model-based RL [2, 6], but they are not well suited for high-dimensional state spaces. Recently, Bayesian techniques have been developed for approximating dynamics with neural networks, based on variational dropout [7] and variational inference [8].
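As a rough, simplified sketch in the spirit of dropout-based Bayesian approximations (not the exact method of [7]; all names are placeholders): keep dropout active at prediction time and use the spread of repeated stochastic forward passes as an estimate of epistemic uncertainty.

```python
import torch
import torch.nn as nn

class DropoutDynamicsModel(nn.Module):
    """One-step dynamics model with dropout kept active at prediction time
    (Monte Carlo dropout) to approximate epistemic uncertainty."""

    def __init__(self, state_dim, action_dim, hidden_dim=128, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def predict_with_uncertainty(model, state, action, n_samples=20):
    model.train()  # keep dropout active so repeated passes differ
    samples = torch.stack([model(state, action) for _ in range(n_samples)])
    # Mean prediction plus the variance across passes as an epistemic estimate.
    return samples.mean(dim=0), samples.var(dim=0)
```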

Partial Observability

In MDPs, partial observability happens when the current observation does not supply complete information about the MDP's true state. Unlike stochasticity, which is a fundamental noise in state transitions and cannot be eliminated, partial observability stems from a lack of information in the current observation. It can be partly reduced by including information retrieved from past observations. This can be done in four main ways: windowing, belief states, recurrency, and external memory [1].

In the windowing approach, several of the most recent observations are concatenated and used as the state. This increases the state size by a multiplicative factor, leading to high memory requirements; for high-dimensional observations it may therefore not be feasible, and it cannot benefit from generalization.
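A minimal sketch of windowing with placeholder names: keep the last k observations in a buffer and concatenate them into the state that is handed to the model.

```python
from collections import deque
import numpy as np

class ObservationWindow:
    """Keeps the last k observations and concatenates them into one state."""

    def __init__(self, k, obs_shape):
        self.k = k
        self.buffer = deque([np.zeros(obs_shape) for _ in range(k)], maxlen=k)

    def push(self, obs):
        self.buffer.append(obs)
        # The state grows by a factor of k, which is the memory cost noted above.
        return np.concatenate(list(self.buffer), axis=0)
```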

For belief states, the learned dynamics model consists of an observation model \(p(o_t \mid s_t)\) and a latent transition model \(p(s_{t+1} \mid s_t, a_t)\), similar to state space models [9]. We will consider state space models in more detail in the next article. Planning for such belief state models has been studied extensively, and the principles have also been combined with neural networks for high-dimensional problems.

In the case of recurrent neural networks (RNNs), most notably using long short-term memory (LSTM) [10], the transition parameters are shared across all time steps. This means the model size is independent of the history length, making it well suited for gradient-based training and high-dimensional state spaces [11, 12].
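As a rough sketch (not the architectures of [11, 12]; names are placeholders), a recurrent dynamics model can embed each observation–action pair and roll an LSTM over the sequence; the same weights are applied at every step, so the model size does not depend on how long the history is.

```python
import torch
import torch.nn as nn

class RecurrentDynamicsModel(nn.Module):
    """Rolls an LSTM over (observation, action) pairs; the hidden state
    summarizes the history, so parameters do not grow with its length."""

    def __init__(self, obs_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Linear(obs_dim + action_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, obs_dim)  # predict the next observation

    def forward(self, observations, actions):
        # observations: (batch, T, obs_dim), actions: (batch, T, action_dim)
        x = self.encoder(torch.cat([observations, actions], dim=-1))
        h, _ = self.lstm(x)
        return self.decoder(h)  # prediction of the observation at t+1 for each t
```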

External memory [13] is useful for long-range dependencies, as we do not need to keep propagating information but can recall it once it becomes relevant. Neural Turing Machines (NTMs) [14] have read/write access to external memory and can be trained with gradient descent.

Ignoring partial observability can lead to a complete failure of the solution; it should therefore always be accounted for.

Non-stationarity

Non-stationarity describes the case where the reward and/or transition function change over time. In such situations, the agent's performance can decline if it continues to rely on its prior model without realizing the change. To tackle non-stationarity, a prevalent method is to use multiple models that the agent can alternate between [15]. Various techniques can be employed to detect regime switches, such as observing prediction errors in reward and transition models [16]. Another approach is to meta-learn different policies [17, 18] or the optimizer [19].
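A minimal sketch of the detection idea (not the context-detection algorithm of [16]): smooth the current model's prediction error with an exponential moving average and flag a possible regime switch when it crosses a threshold, at which point the agent could switch to, or instantiate, a different model.

```python
class RegimeSwitchDetector:
    """Flags a possible change in the dynamics when the smoothed prediction
    error of the current model exceeds a threshold."""

    def __init__(self, threshold, smoothing=0.99):
        self.threshold = threshold
        self.smoothing = smoothing
        self.avg_error = 0.0

    def update(self, prediction_error):
        self.avg_error = (self.smoothing * self.avg_error
                          + (1.0 - self.smoothing) * prediction_error)
        return self.avg_error > self.threshold  # True -> consider switching models
```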

Multi-Step Prediction

While one-step models can be utilized for multi-step predictions by repeatedly feeding the prediction back into the model, doing so may lead to errors accumulating and resulting in predictions that deviate from the true dynamics. To overcome this issue, two approaches have been identified: incorporating multi-step prediction losses into the overall training target via different loss functions [20] or learning a unique dynamics model for each n-step prediction [21]. In the end, relying solely on one-step prediction errors might not be a reliable way to gauge model performance in multi-step planning scenarios.
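A sketch of the first approach under simplifying assumptions: a deterministic one-step model model(s, a) that returns the predicted next state is rolled out for H steps, and the squared errors of all intermediate predictions are summed, so early mistakes are penalized over the whole horizon.

```python
import torch

def multi_step_loss(model, states, actions, horizon):
    """states: (batch, T+1, state_dim), actions: (batch, T, action_dim),
    with T >= horizon. Rolls the one-step model forward from t=0 and
    accumulates the squared error against the observed trajectory."""
    s = states[:, 0]
    loss = 0.0
    for t in range(horizon):
        s = model(s, actions[:, t])                     # feed the prediction back in
        loss = loss + ((s - states[:, t + 1]) ** 2).mean()
    return loss / horizon
```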

Improvements

Two ideas to improve the quality of the learned states are, firstly, to incorporate information about the reward and, secondly, to use alternative objective functions that encode different notions of what a good representation should look like.

Rewards

To improve the learned state representation, rewards can optionally be used as additional information. One can learn a model that tries to predict the reward for a given state and action [22, 23]; combined with a forward model or a value prediction, this helps to keep information in the learned representation that is useful for determining the reward. It can also be easier to reconstruct only rewards and/or values instead of full observations. We can also place constraints on the learned representations based on the rewards. A so-called causality prior [24, 25] assumes that if we obtain two different rewards after performing the same action in two states, then those states should be far apart in the state space.
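A minimal sketch of such a reward model with placeholder names: a small head predicts the reward from the learned state and the action, and its squared error is added as an auxiliary term to whatever loss trains the encoder, which pushes the representation to retain reward-relevant information.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Predicts the scalar reward from a learned state and an action."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

# reward_loss = ((reward_head(s_t, a_t) - r_t) ** 2).mean()
# total_loss = representation_loss + reward_loss   # added as an auxiliary term
```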

Alternative Objective Functions

We can also add prior knowledge into the state representation learning process through different cost functions. This can range from laws of physics to things considered common sense [26, 24].

Some examples are the following (a short code sketch of these losses is given after the list):

  • Slowness assumes that interesting features change slowly through time and sudden changes are improbable [27, 28, 29, 30]. For example, we do not expect observed objects in a real-world scenario to randomly teleport to another position from one time step to another. To achieve this, we can minimize the expected squared distance between representations from consecutive time steps:
    $$\mathcal{L}_S = \mathbb{E}\left[\parallel s_{t} - s_{t-1}\parallel^2\right]$$
  • Proportionality [24] hypothesizes that if the same action was performed at two different times \(t_1\) and \(t_2\), the representations must change by an equal amount, so assuming that \(a_{t_1} = a_{t_2}\), we try to minimize the expected squared error between the changes in the representations:
    $$\mathcal{L}_P = \mathbb{E}\left[ \left(\parallel s_{t_1} - s_{t_1-1}\parallel - \parallel s_{t_2} - s_{t_2-1}\parallel\right)^2\right]\hspace{0.5 cm}\mathrm{if} \hspace{3mm}a_{t_1} = a_{t_2}$$
  • Variation [30] posits that random observations of objects vary, and therefore the internal representations ought to vary as well. Since \(e^{-x}\) is \(1\) for \(x = 0\) and goes to \(0\) for \(x \rightarrow \infty\), for two states with indices \(a\) and \(b\) we can use the loss:
    $$\mathcal{L}_V = \mathbb{E}\left[e^{-\parallel s_a - s_b\parallel}\right]$$
  • Dynamic Verification [31] consists of taking a sequence of (state, action, next state) triplets and training a model to detect a corrupted observation \(o\) that has been inserted by swapping it with an observation from a nearby time step.
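As referenced above, here is a small sketch of the slowness, proportionality, and variation losses, assuming batches of learned states and their temporal predecessors as inputs (all names are placeholders):

```python
import torch

def slowness_loss(s_t, s_prev):
    """L_S: consecutive representations should stay close to each other."""
    return ((s_t - s_prev) ** 2).sum(dim=-1).mean()

def proportionality_loss(s_t1, s_t1_prev, s_t2, s_t2_prev):
    """L_P: for two transitions caused by the same action, the magnitudes of
    the representation changes should be equal."""
    d1 = torch.norm(s_t1 - s_t1_prev, dim=-1)
    d2 = torch.norm(s_t2 - s_t2_prev, dim=-1)
    return ((d1 - d2) ** 2).mean()

def variation_loss(s_a, s_b):
    """L_V: representations of different random observations should differ;
    the loss vanishes as the two states move apart."""
    return torch.exp(-torch.norm(s_a - s_b, dim=-1)).mean()
```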

Evaluation

The most common way to evaluate the quality of a learned representation is to examine how well reinforcement learning algorithms perform with it compared to other representations under the same algorithm. As the landscape of such algorithms keeps growing, comparisons between learned state spaces become cumbersome over time, since continuous re-evaluation would be necessary. Moreover, most RL algorithms do not deliver a deterministic result for the same input [26].

Another approach is the disentanglement metric [32, 33] which is a measure of how well a representation separates the underlying factors of variation in the data.

KNN-MSE [25] computes the mean squared error over the k nearest neighbors in the learned state space. The nearest neighbors in the state space can then be compared with the corresponding nearest observation neighbors.

Distortion [34] measures how much the distances between specific points change between the original (observation) space and the projected (representation) space.
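As a rough sketch of one common, scale-invariant reading of distortion (function and variable names are just for illustration): compare all pairwise distances in observation space with those in the learned state space and multiply the worst-case expansion by the worst-case contraction; a value of 1 means distances are preserved up to a global scale.

```python
import numpy as np
from scipy.spatial.distance import pdist

def distortion(observations, states, eps=1e-8):
    """observations: (N, ...) array, states: (N, state_dim) array.
    Returns worst-case expansion times worst-case contraction of pairwise
    distances under the embedding; 1 would mean no distortion at all."""
    d_obs = pdist(observations.reshape(len(observations), -1))
    d_state = pdist(states)
    ratios = d_state / (d_obs + eps)
    return ratios.max() * (1.0 / ratios).max()
```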

Normalization Independent Embedding Quality Assessment [35] is a more complex metric for measuring the quality of embeddings. It examines whether the embedding preserves the global topology and whether it preserves the geometric structure of local neighborhoods.

Summary

This concludes our look at the main challenges of model learning and state representation learning, improvements and evaluations. In the next article we will look at how we can use variational inference to learn states in an unsupervised manner.

References

[1] T. M. Moerland, et al. Model-based Reinforcement Learning: A Survey. 2022

[2] M. P. Deisenroth and C. E. Rasmussen. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. Proceedings of the 28th International Conference on International Conference on Machine Learning. 465–472. 2011

[3] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114. 2013

[4] R. G. Krishnan, U. Shalit and D. Sontag. Deep Kalman Filters. arXiv preprint arXiv:1511.05121. 2015

[5] C. E. Rasmussen and C. K. Williams. Gaussian processes for machine learning. 2006

[6] M. P. Deisenroth, D. Fox and C. E. Rasmussen. Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37. 408-423. 2015

[7] R. McAllister and C. Rasmussen. Improving PILCO with Bayesian Neural Network Dynamics Models. 2016

[8] L. Girin, et al. Dynamical variational autoencoders: A comprehensive review. arXiv preprint arXiv:2008.12595. 2020

[9] K. P. Murphy. Machine learning: a probabilistic perspective. 2012

[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9. 1735-1780. 1997

[11] S. Chiappa, et al. Recurrent Environment Simulators. 2017

[12] D. Ha and J. Schmidhuber. World Models. arXiv: Learning. 2018

[13] L. Peshkin, N. Meuleau and L. P. Kaelbling. Learning Policies with External Memory. 1999

[14] A. Graves, G. Wayne and I. Danihelka. Neural Turing Machines. 2014

[15] K. Doya, et al. Multiple model-based reinforcement learning. Neural Computation, 14. 1347-1369. 2002

[16] B. C. Da Silva, et al. Dealing with non-stationary environments using context detection. Proceedings of the 23rd international conference on Machine learning. 217–224. 2006

[17] C. Finn, P. Abbeel and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv: Learning. 2017

[18] I. Clavera, et al. Model-based reinforcement learning via meta-policy optimization. Conference on Robot Learning. 617–629. 2018

[19] A. Nagabandi, C. Finn and S. Levine. Deep Online Learning via Meta-Learning: Continual Adaptation for Model-Based RL. 2018

[20] D. Hafner, et al. Learning Latent Dynamics for Planning from Pixels. Proceedings of the 36th International Conference on Machine Learning. 2555-2565. 2019

[21] K. Asadi, et al. Towards a Simple Approach to Multi-step Model-based Reinforcement Learning. arXiv: Learning. 2018

[22] J. Munk, J. Kober and R. Babuska. Learning state representation for deep actor-critic control. 2016 IEEE 55th Conference on Decision and Control (CDC). 4667-4673. 2016

[23] J. Oh, S. Singh and H. Lee. Value Prediction Network. 2017

[24] R. Jonschkowski and O. Brock. Learning state representations with robotic priors. Autonomous Robots, 39. 407-428. 2015

[25] T. Lesort, et al. Unsupervised state representation learning with robotic priors: a robustness benchmark. arXiv preprint arXiv:1709.05185. 2017

[26] T. Lesort, et al. State representation learning for control: An overview. Neural Networks, 108. 379–392. 2018

[27] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural computation, 14. 715–770. 2002

[28] P. Song and C. Zhao. Slow Down to Go Better: A Survey on Slow Feature Analysis. IEEE Transactions on Neural Networks and Learning Systems. 1-21. 2022

[29] V. R. Kompella, M. Luciw and J. Schmidhuber. Incremental slow feature analysis: Adaptive low-complexity slow feature updating from high-dimensional input streams. Neural Computation, 24. 2994–3024. 2012

[30] R. Jonschkowski, et al. PVEs: Position-Velocity Encoders for Unsupervised Learning of Structured State Representations. arXiv preprint arXiv:1705.09805. 2017

[31] E. Shelhamer, et al. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307. 2016

[32] I. Higgins, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. 2016

[33] M. Carbonneau, et al. Measuring disentanglement: A review of metrics. IEEE Transactions on Neural Networks and Learning Systems. 2022

[34] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. Proceedings 42nd IEEE Symposium on Foundations of Computer Science. 10–33. 2001

[35] P. Zhang, Y. Ren and B. Zhang. A new embedding quality assessment method for manifold learning. Neurocomputing, 97. 251–266. 2012
