March 23, 2021
Written by: Catrina Hacker
Imagine you sit down at your favorite restaurant, ready to order your go-to meal. You know for sure that you will enjoy it, but as you scan the rest of the menu another dish catches your eye. It sounds delicious, but what if it isn't as good as your usual favorite? This conflict between doing what we know and trying something new is woven through many of the decisions we make each day. Should I re-watch an old TV show or try a new recommendation? Continue with my current hobby or start a new one? Neuroscientists call this dilemma the explore-exploit trade-off. If we only ever do the same things and exploit what we know, we risk never discovering something better, like a meal we would enjoy more than our current favorite. On the other hand, if we only ever explore and try new things, we are likely to sample many options that are much worse than what we already know, limiting how much we enjoy visits to our favorite restaurant. Understanding how our brain navigates this trade-off can help us make better decisions, and we can apply this understanding to develop artificial agents that are capable of learning on their own.
One leading theory of how the brain balances exploration and exploitation centers on a small region in the brainstem called the locus coeruleus (LC)1. The LC is responsible for releasing almost all of the norepinephrine (NE) used throughout the brain. NE is a chemical that can change the way that neurons communicate, so different levels of LC activity shape the release of NE and, in turn, the activity of neurons throughout the brain. In this theory, exploration and exploitation are related to the LC switching between two modes of firing called tonic and phasic: slow, constant activity (tonic) or rapid responses to a stimulus (phasic). In the phasic mode, the LC responds with quick bursts of activity immediately after specific stimuli that are relevant to the current task. This mode is associated with exploitation: animals whose LC is in the phasic mode stay engaged in the current task and perform well. In the tonic mode, the LC is constantly active instead of bursting in response to relevant stimuli. This mode is associated with exploration: animals whose LC is in the tonic mode become distracted and explore other behaviors. For example, if an animal is doing a task that leads to consistent reward, then the LC is most likely in the phasic mode, facilitating exploitation. If the reward becomes inconsistent, then the LC might switch to the tonic mode so that the animal begins to explore other options that could pay off more reliably.
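To make the contrast between the two modes concrete, here is a minimal Python sketch that simulates tonic and phasic firing as Poisson spike trains. Everything in it, from the firing rates to the stimulus times and burst duration, is an illustrative assumption rather than a measured property of the LC:

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.001                         # 1 ms time bins
t = np.arange(0, 5, dt)            # 5 seconds of simulated time
stimulus_times = [1.0, 2.5, 4.0]   # task-relevant stimuli (arbitrary)

# Phasic mode: low baseline rate with a brief burst after each stimulus.
phasic_rate = np.full_like(t, 2.0)             # 2 Hz baseline (illustrative)
for s in stimulus_times:
    burst = (t > s) & (t < s + 0.1)            # 100 ms burst window
    phasic_rate[burst] = 30.0                  # burst rate (illustrative)

# Tonic mode: elevated, constant firing with no stimulus-locked bursts.
tonic_rate = np.full_like(t, 8.0)              # 8 Hz throughout (illustrative)

# Draw Poisson spike trains from each rate profile.
phasic_spikes = rng.random(t.size) < phasic_rate * dt
tonic_spikes = rng.random(t.size) < tonic_rate * dt

print(f"phasic spikes: {phasic_spikes.sum()}, tonic spikes: {tonic_spikes.sum()}")
```

The key difference is that the phasic train is quiet except right after each stimulus, while the tonic train fires steadily regardless of what is happening in the task.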
Two important factors influence LC activity: cost and reward2. If the current action has high cost and low reward, then it makes sense to switch to exploration, whereas if it has high reward and low cost, it makes sense to keep exploiting. Cost is represented in a brain region called the anterior cingulate cortex, which becomes active when subjects experience pain or make errors while completing a task. Reward, on the other hand, is represented within the prefrontal cortex, different regions of which become active when subjects are rewarded. The balance of activation between these two regions is thought to reflect the balance between the cost and reward of the current action, and to influence the LC so that the animal switches between exploration and exploitation accordingly.
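As a rough caricature of that idea (and nothing more), the snippet below reduces the mode decision to a single comparison between recent reward and recent cost. The lc_mode function and its threshold are hypothetical simplifications invented for illustration, not a description of the actual circuit:

```python
def lc_mode(recent_rewards, recent_costs, threshold=0.0):
    """Return 'phasic' (exploit) when recent reward outweighs recent cost,
    and 'tonic' (explore) otherwise. Purely schematic."""
    balance = sum(recent_rewards) - sum(recent_costs)
    return "phasic" if balance > threshold else "tonic"

# High, consistent reward at low cost -> stay engaged and exploit.
print(lc_mode(recent_rewards=[1.0, 1.0, 1.0], recent_costs=[0.2, 0.3]))  # phasic
# Reward dries up while costs mount -> disengage and explore.
print(lc_mode(recent_rewards=[0.0, 0.0, 0.2], recent_costs=[1.0, 1.5]))  # tonic
```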
Animals naturally switch between exploration and exploitation, but to build artificial intelligence capable of solving the same complex tasks, we need to program decisions about exploring or exploiting into the models ourselves. The explore-exploit trade-off is especially relevant to a branch of machine learning called reinforcement learning, which aims to build artificial agents that learn from and adapt to new environments much as humans do. Most real-world problems are so complex that it is impossible to write down an explicit set of instructions telling an agent how to behave in every possible situation. Instead, we would like to build models that learn the way humans do: by exploring an environment to figure out how to complete a task, and then exploiting that strategy as long as it works. Just like humans, if an agent never tries new things, it will never learn the best way to solve the task or adapt to changes in the environment; if it only tries new things, it will never settle on a strategy that completes the task and maximizes reward. To navigate this problem, scientists intentionally have the agent act randomly on a small subset of trials3. The agent exploits its current estimate of the best action on most trials, but on some small percentage of trials it randomly tries something new. This allows the agent to adapt to changes in the environment so that it can perform complex tasks like playing video games and driving cars.
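This strategy is known in the reinforcement learning literature as epsilon-greedy: with probability epsilon the agent explores at random, and otherwise it exploits its best current guess. Here is a minimal Python sketch of an epsilon-greedy agent learning which of three dishes (the "arms" of a bandit problem) pays off best; the payoffs, the noise level, and the value of epsilon are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
true_means = np.array([0.3, 0.5, 0.8])  # hidden average payoff of each dish
estimates = np.zeros(3)                 # the agent's running value estimates
counts = np.zeros(3)                    # how often each dish has been tried
epsilon = 0.1                           # fraction of trials spent exploring

for trial in range(1000):
    if rng.random() < epsilon:
        choice = rng.integers(3)               # explore: pick a dish at random
    else:
        choice = int(np.argmax(estimates))     # exploit: pick the current best guess
    reward = rng.normal(true_means[choice], 0.1)  # noisy payoff
    counts[choice] += 1
    # Incremental average: nudge the estimate toward each new observation.
    estimates[choice] += (reward - estimates[choice]) / counts[choice]

print("learned estimates:", np.round(estimates, 2))  # best dish approaches 0.8
```

Because epsilon stays above zero, the agent keeps occasionally sampling the other dishes, so if the best option changed partway through, its estimates would eventually catch up.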
While these descriptions begin to explain how the brain navigates the explore-exploit trade-off, there is still much left to understand. These mechanisms can explain behavior in certain kinds of environments, but there is evidence that humans sometimes behave in ways that violate the models' predictions. For example, humans sometimes explore alternatives even once they have become very certain of a given outcome4, counter to the model described above, which predicts that when an outcome is highly certain it is best to keep exploiting it. In addition, factors like emotion and attention are not incorporated into this model, even though they may strongly influence our decision to explore or exploit. Future models of the explore-exploit trade-off will need to incorporate more of the complexity of humans and the environments we occupy if they hope to account for behavior in all real-world situations.
So what should you do the next time you find yourself staring at the menu, unsure whether to order an old favorite or try something new? As long as you continue to enjoy your favorite, go ahead and exploit the familiar! But every once in a while, let yourself explore a little, just in case there's something better to be found.
References:
1. Aston-Jones, G. & Cohen, J. D. An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annu. Rev. Neurosci. 28, 403–450 (2005).
2. Cohen, J. D., McClure, S. M. & Yu, A. J. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philos. Trans. R. Soc. B Biol. Sci. 362, 933–942 (2007).
3. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction. (MIT Press, 1998).
4. Montague, P. R., King-Casas, B. & Cohen, J. D. Imaging valuation models in human choice. Annu. Rev. Neurosci. 29, 417–448 (2006).