Optimism Makes Algorithm for Reinforcement Learning More Effective

An international team of scientists from Russia, France and Germany (including researchers from the HSE Faculty of Computer Science, the HSE Artificial Intelligence Centre and the Artificial Intelligence Research Institute) has developed a new reinforcement learning algorithm, Bayes-UCBVI. It is the first Bayesian algorithm that both has a mathematical proof of effectiveness and has been successfully tested on Atari games. The result was presented at the ICML 2022 conference.

Reinforcement learning is a type of machine learning. Unlike classical machine learning, its key feature is the constant interaction between the agent (the algorithm) and the environment, which gives feedback in the form of rewards and punishments. The agent's goal is to maximise the rewards received from the environment for ‘correct’ interaction.

The agent does not just try to work out which actions are correct from its current knowledge of the environment; it also explores the environment, looking for new opportunities to get an even bigger reward. This gives rise to the exploration-exploitation dilemma.

Choosing between exploring the environment and exploiting existing knowledge to accumulate rewards is one of the central problems in building effective reinforcement learning algorithms. The Bayes-UCBVI algorithm developed by the researchers operates in the paradigm of optimism, i.e., the agent rechecks the value of actions that it performs rarely.

The principle of optimism results in the agent choosing an action for one of two reasons: either it has rarely tried it, or it is certain that it’s good. This is what ensures the agent's exploration of the environment.
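The two reasons for choosing an action can be illustrated with a small sketch. The code below is not Bayes-UCBVI; it uses the classical UCB1 rule on a two-action bandit problem as a simpler stand-in for the same principle. The function names, the success probabilities and the bonus formula are assumptions made for illustration only. Each action is scored by its average reward so far plus a bonus that shrinks the more often the action has been tried, so a high score means either "rarely tried" or "known to be good".

```python
import math
import random


def ucb_scores(counts, reward_sums, t):
    """Optimistic score per action: average reward plus an exploration bonus.

    This is the classical UCB1 rule, used here only as an illustration of
    optimism; it is not the Bayes-UCBVI algorithm from the article.
    """
    scores = []
    for n, s in zip(counts, reward_sums):
        if n == 0:
            # An action that was never tried looks maximally promising.
            scores.append(math.inf)
        else:
            # The bonus is large for rarely tried actions and shrinks as n grows.
            bonus = math.sqrt(2.0 * math.log(t) / n)
            scores.append(s / n + bonus)
    return scores


def run_bandit(success_probs, steps, seed=0):
    """Repeatedly pick the action with the highest optimistic score.

    success_probs are hypothetical reward probabilities, unknown to the agent.
    Returns how many times each action was chosen.
    """
    rng = random.Random(seed)
    k = len(success_probs)
    counts = [0] * k
    reward_sums = [0.0] * k
    for t in range(1, steps + 1):
        scores = ucb_scores(counts, reward_sums, t)
        action = max(range(k), key=lambda a: scores[a])
        reward = 1.0 if rng.random() < success_probs[action] else 0.0
        counts[action] += 1
        reward_sums[action] += reward
    return counts


if __name__ == "__main__":
    # Action 1 is better on average, but action 0 still gets retried now and then.
    print(run_bandit([0.4, 0.7], steps=2000))
```

In this sketch the weaker action is never abandoned completely: whenever it has gone untried for long enough, its bonus grows until the agent checks it again, while most choices still go to the action that has proved itself.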

Imagine there is a cafe near your house. Every morning, you buy coffee and pastries that you like there. One day, another cafe opens nearby, and you think “what if the coffee and pastries are better there?” The next morning, you face a dilemma: explore a new cafe or go to a tried-and-tested place where you are sure of the result. You decide to explore the new place, and the coffee there turns out to be bland. But you've only tried their coffee once—who knows, maybe it was just a bad pack of coffee beans. According to the principle of optimism, you will give this cafe at least one more chance.

Daniil Tiapkin
Research Assistant, International Laboratory of Stochastic Algorithms and High-Dimensional Inference

The researchers note that despite its theoretical effectiveness, the principle of optimism has been difficult to apply in practical reinforcement learning algorithms that work in complex environments (such as computer games) or control a real robot. The algorithm presented by the researchers makes it possible to bridge this gap between theory and practice. The team of authors first proposed a generalisation of this algorithm and tested it on 57 Atari games.

This is the first algorithm with theoretical and practical significance. The proven results of Bayes-UCBVI play an important role in the development of machine learning; they bring theorists and practitioners together. Using this algorithm in practice will significantly speed up the learning process of artificial intelligence.

Alexey Naumov
Head of the International Laboratory of Stochastic Algorithms and High-Dimensional Inference

Author: Polina Dergacheva, November 10, 2022
