# Simultaneous Estimation of Rewards and Dynamics

## Motivation and Problem Statement

Imitation learning approaches based on Inverse Reinforcement Learning (IRL) estimate a reward function that explains the observed behavior of an expert. Hence, IRL makes it possible to learn behavioral models, which can be used either to teach new tasks or to predict an agent's behavior. Since the agent's behavior originates in its policy, and Markov Decision Process (MDP) policies depend on the stochastic system dynamics as well as on the reward function, the solution of the inverse problem is significantly influenced by both. As a consequence, extending IRL approaches to simultaneously learn the environment's dynamics improves the quality of both the reward and the transition model estimates.

The capabilities of autonomous systems are constantly improving, which allows them to be applied in increasingly complex environments. For example, an autonomous vehicle must be able to correctly predict the behavior of all agents in its immediate environment (e.g., pedestrians, cars, …) to decide what to do next, taking into consideration the influence of its own decisions on future situations. At the same time, we want an autonomous car to behave according to the desires of the consumer. Both the prediction of other agents' behavior and the prediction of the desired behavior are challenges that can be addressed through Learning from Demonstration (LfD) [2].

As model-based IRL approaches [6,9,10] need to repeatedly solve the reinforcement learning problem as part of solving the IRL problem, they typically require an accurate model of the system's dynamics. Most IRL approaches assume that an MDP model, including the dynamics, is either given or can be estimated well enough from demonstrations. However, as the observations are the result of an expert's policy, they only provide demonstrations of desired, goal-directed behavior. As a consequence, a large part of the state and action space is rarely or never observed, and it is therefore often not possible to estimate an accurate transition model directly from expert demonstrations. Model-free approaches [1,5] typically require sufficient samples of the system's dynamics to accurately learn the reward function, access to the environment, or a simulator to generate additional data. However, in many applications these preconditions are not met: realistic simulators often do not exist, nor is it possible to query the environment. As a result, current model-free IRL approaches either do not consider rewards or transitions of unobserved states and actions, or tend to suffer from wrong generalizations due to the use of heuristics.
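To illustrate why the naïve route fails, the following minimal Python sketch computes the maximum-likelihood transition estimate from observed 〈s, a, s′〉 tuples; the function name and the toy numbers are assumptions for illustration. Because goal-directed demonstrations traverse only a narrow corridor of the MDP, most state-action pairs receive no estimate at all:

```python
from collections import Counter, defaultdict

def estimate_transitions(demos, n_states, n_actions):
    """Maximum-likelihood transition estimate from (s, a, s') tuples.

    Returns a dict mapping (s, a) -> {s': probability}, plus the fraction
    of state-action pairs that were observed at least once. Unobserved
    pairs get no estimate at all.
    """
    counts = defaultdict(Counter)
    for s, a, s_next in demos:
        counts[(s, a)][s_next] += 1
    P_hat = {}
    for (s, a), c in counts.items():
        total = sum(c.values())
        P_hat[(s, a)] = {s2: n / total for s2, n in c.items()}
    covered = len(P_hat) / (n_states * n_actions)
    return P_hat, covered

# Goal-directed demonstrations visit only a narrow corridor of the MDP:
demos = [(0, 1, 1), (1, 1, 2), (2, 1, 3), (0, 1, 1), (1, 1, 2)]
P_hat, covered = estimate_transitions(demos, n_states=4, n_actions=3)
print(P_hat[(0, 1)])   # {1: 1.0} -- looks deterministic from only 2 samples
print(covered)         # 0.25 -- just 3 of 12 state-action pairs observed
```

Two noise-free-looking samples yield a deterministic-looking estimate, and three quarters of the state-action space remains entirely unmodeled.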

## Proposed Solution

We argue that simultaneously optimizing the likelihood of the demonstrations with respect to both the reward function and the dynamics of the MDP can improve the accuracy of the estimates and, with it, the resulting policy. Even though many transitions have never been observed, they can to some degree be inferred by taking into account that the data was generated by an expert's policy. Since the expert's policy is the result of both the reward function and the expert's belief about the system's dynamics, the frequency of state-action pairs in the data carries information about the expert's objective (see Figure 1).

This can be exploited to improve the sample efficiency and the accuracy of the estimates of the system's dynamics as well as of the reward function, as both influence the policy. This led us to the formulation of a new problem class [3], which we call Simultaneous Estimation of Rewards and Dynamics (SERD), see Figure 2:

One solution to this type of problem is to find a parameterization of rewards and dynamics that explains the observed behavior in terms of the log-likelihood of the observed demonstrations. Assuming independent trajectories in the demonstration set, the optimization problem can be specified by the equation given in Figure 3.
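The equation in Figure 3 is not reproduced here, but a sketch consistent with the surrounding text can be written as follows, with θ denoting the reward parameters, φ the dynamics parameters, and 𝒟 the demonstration set (notation assumed for illustration):

```latex
(\theta^*, \phi^*)
  = \arg\max_{\theta,\phi} \sum_{\tau \in \mathcal{D}} \log P(\tau \mid \theta, \phi)
  = \arg\max_{\theta,\phi} \sum_{\tau \in \mathcal{D}} \sum_{t=1}^{|\tau|}
    \Big[ \log \pi_{\theta,\phi}(a_t \mid s_t)
        + \log P_{\phi}(s_{t+1} \mid s_t, a_t) \Big]
```

The decomposition into a policy term and a transition term follows from the Markov property and the independence of the trajectories.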

Expert demonstrations consist of tuples 〈s, a, s′〉, which are samples from the environment's dynamics P(s′|s, a), and tuples 〈s, a〉, which are samples from the expert's policy π(a|s). To solve this optimization problem, it is necessary to determine the influence of the reward and the dynamics on the policy. Several policy models that learn from human demonstrations have been proposed. In Herman et al. [3,4] we propose gradient-based approaches that solve this optimization problem for different expert policy models.

## Result

For evaluation purposes it is necessary to provide ground-truth demonstrations of expert behavior from which both the reward function and the environment's dynamics can be learned. While it is possible to use human demonstrations, specifying a reward function and creating toy data allows us to evaluate the performance of the proposed approach exactly. Therefore, a toy example has been designed for grid-based navigation on map segments, as illustrated in Figure 4. The agent is allowed to choose between moving in one of four directions or staying in the current grid cell. It is assumed that different transition models exist on open terrain and in the forest, where the outcome of moving in the desired direction is more random. The reward function indicates that being at the goal is desired and that walking on the street is preferable to walking on grass or in the forest. Figure 5 shows the expected log-likelihood of ground-truth demonstrations under the learned model parameters versus the demonstration set size. We compare our proposed approach to two IRL approaches that use a naïve estimate of the dynamics for learning. It can be seen that SERD is sample-efficient and produces more accurate models, which better explain the observed behavior.
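A terrain-dependent environment of this kind can be sketched in a few lines of Python. The map, the success probabilities, and the reward values below are invented for illustration, and the grid is simplified to a one-dimensional strip with left/right/stay actions rather than the paper's 2-D map with four moves plus staying:

```python
import numpy as np

# Hypothetical terrain strip and terrain-dependent parameters (not the
# paper's actual map, noise levels, or rewards):
terrain = ["street", "grass", "forest", "forest", "street"]
success = {"street": 0.95, "grass": 0.9, "forest": 0.6}   # move succeeds w.p. ...
reward = {"street": -1.0, "grass": -2.0, "forest": -3.0}  # per-step cost by terrain

def build_mdp(terrain, goal):
    """Build P[a, s, s'] and R[s] for a 1-D strip with slip noise."""
    n = len(terrain)
    moves = [-1, +1, 0]                       # left, right, stay
    P = np.zeros((len(moves), n, n))
    for s in range(n):
        p = success[terrain[s]]               # noisier in the forest
        for a, d in enumerate(moves):
            P[a, s, min(max(s + d, 0), n - 1)] += p
            # on a slip, the agent ends up after a uniformly random move
            for d2 in moves:
                P[a, s, min(max(s + d2, 0), n - 1)] += (1 - p) / len(moves)
    R = np.array([10.0 if s == goal else reward[terrain[s]] for s in range(n)])
    return P, R

P, R = build_mdp(terrain, goal=4)
assert np.allclose(P.sum(axis=2), 1.0)        # each row is a proper distribution
```

Sampling expert trajectories from a policy planned in this MDP yields the kind of ground-truth demonstration sets against which learned rewards and dynamics can be scored exactly.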

## References

[1] A. Boularias, J. Kober, and J. Peters, "Relative entropy inverse reinforcement learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS '11), 2011.

[2] D. C. Halbert, "Programming by example," Ph.D. dissertation, 1984.

[3] M. Herman, T. Gindele, J. Wagner, F. Schmitt, and W. Burgard, "Inverse reinforcement learning with simultaneous estimation of rewards and dynamics," in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 102–110.

[4] M. Herman, T. Gindele, J. Wagner, F. Schmitt, and W. Burgard, "Simultaneous estimation of rewards and dynamics from noisy expert demonstrations," in Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2016, pp. 677–682.

[5] E. Klein, M. Geist, B. Piot, and O. Pietquin, "Inverse reinforcement learning through structured classification," in Advances in Neural Information Processing Systems (NIPS '12), 2012.

[6] G. Neu and C. Szepesvári, "Apprenticeship learning using inverse reinforcement learning and gradient methods," in Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI), Vancouver, BC, Canada, 2007, pp. 295–302.

[7] A. Y. Ng and S. J. Russell, "Algorithms for inverse reinforcement learning," in Proceedings of the Seventeenth International Conference on Machine Learning (ICML '00), San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 663–670.

[8] S. Russell, "Learning agents for uncertain environments," in Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT), ACM, 1998, pp. 101–103.

[9] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, "Maximum entropy inverse reinforcement learning," in Proceedings of AAAI 2008, July 2008.

[10] B. D. Ziebart, J. A. Bagnell, and A. K. Dey, "Modeling interaction via the principle of maximum causal entropy," in Proceedings of the International Conference on Machine Learning (ICML), 2010, pp. 1255–1262.