
Experimenting With Hierarchical Reinforcement Learning for Planning



Utilizing Computational Techniques to Model How Humans Approach Cognitive Tasks

What is HRL?

Suppose Jane, sitting in her Boston apartment, gets a notification from her boss that there is an important business meeting happening in Paris next week. How will she go about planning this trip? Rather than approach goal achievement through a series of detailed actions (“get out of bed”, “go to desk”, “open laptop”, “search flights to Paris”, etc.), Jane starts with actions that will take her generally in the direction of her broader goal and increases the granularity of planning as she approaches this target. She might first look for flights to Paris, then at how to rent a car from the Paris airport, and finally at which combination of streets and turns she needs to take to drive from the airport to the office where the meeting is located. This type of planning requires a hierarchical representation of the world.

A key area where psychology and neuroscience intersect is the formal understanding of human behavior in relation to assigned actions. Specifically, what planning and methodology do human agents employ when faced with accomplishing some task? This is especially interesting in light of the unique ability of humans and animals to adapt to new environments. Previous literature on animal learning suggests this flexibility stems from a hierarchical representation of goals that allows complex tasks to be broken up into low-level subroutines that can be extended across a variety of contexts. This process, known as “chunking”, occurs as part of a neurological transfer from a goal-based structure in the brain focused on the immediate consequences of actions to a habitual system.

From a computational standpoint, such a hierarchical representation allows agents to quickly execute actions in an open loop, reuse familiar action sequences whenever a known problem is encountered, learn faster by tweaking established action sequences to solve problems reminiscent of those seen previously, and plan over extended time horizons, since agents do not need to be concerned with the minuscule tasks associated with goal achievement. For example, the goal of going to the store can be broken down into leaving the house, walking, and entering the store, as opposed to getting up from the bed, moving the left foot forward, the right foot forward, the left foot forward, and so on.

Hierarchical reinforcement learning (HRL) has become the prevailing framework for representing hierarchical learning and planning. Within research on modeling HRL, several ideas have been presented around potential methods of model construction. Simsek et al. present a method based on “betweenness”, a graph centrality concept measuring the shortest paths that pass through each vertex of a graph²⁰. Through the construction of an interaction graph representing potential state transitions, a metric is produced indicative of the cost of each individual path; these weights are then used to divide and group clusters of low-level tasks. Meanwhile, Van Dijk et al. defined subgoals as states in which there is a significant change in the amount of relevant goal information and used this delta value to construct a potential hierarchical structure¹⁹. In addition to these studies exploring the construction of hierarchies, work by Solway et al. looked to evaluate such models and identify the best HRL for a particular behavioral domain; they proposed that the optimal hierarchy is one that facilitates adaptive behavior when presented with a new problem, formally defined via Bayesian model selection¹.
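To make the betweenness idea concrete, here is a minimal sketch using networkx; the toy two-room graph and node names are illustrative and not taken from the cited study.

```python
# Minimal sketch of the betweenness-based subgoal idea (Simsek et al.):
# states that lie on many shortest paths are candidate subgoals / cluster boundaries.
# The graph below is illustrative, not taken from the original study.
import networkx as nx

# Two "rooms" connected by a doorway node "d"
G = nx.Graph()
G.add_edges_from([("a1", "a2"), ("a2", "a3"), ("a3", "a1"),   # room A
                  ("b1", "b2"), ("b2", "b3"), ("b3", "b1"),   # room B
                  ("a3", "d"), ("d", "b1")])                  # doorway

centrality = nx.betweenness_centrality(G)
# The doorway node scores highest, marking it as a natural subgoal boundary
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```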

This article focuses on the addition of reward learning to discrete-time stochastic CRP (Chinese Restaurant Process) clustering. In the context of a graph G, rewards can be interpreted as observable features of vertices. Because people often cluster based on observable features, it is reasonable to model clusters induced by rewards³. Furthermore, we can assume that each state delivers a randomly determined reward, and that the agent’s goal is to maximize total reward⁷. I attempt to understand, through the comparison of human experiment results to computer models, to what degree clusters drive inferences about rewards, and to what degree rewards drive the formation of clusters.
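As a rough illustration of this setup, the sketch below computes a CRP prior over cluster assignments and a simple reward likelihood in which nodes in the same cluster share a mean reward. The Gaussian likelihood and parameter values are assumptions for illustration, not details taken from the article.

```python
# Sketch of a CRP prior over vertex cluster assignments, with rewards treated as
# observable features of vertices. The per-cluster Gaussian reward likelihood is an
# assumption for illustration; the article does not specify the exact likelihood.
import numpy as np

def crp_log_prior(assignments, alpha=1.0):
    """Log probability of a partition under the Chinese Restaurant Process."""
    counts = {}
    logp = 0.0
    for i, z in enumerate(assignments):
        n_seated = i  # customers already seated
        if z in counts:
            logp += np.log(counts[z] / (n_seated + alpha))
        else:
            logp += np.log(alpha / (n_seated + alpha))
        counts[z] = counts.get(z, 0) + 1
    return logp

def reward_log_likelihood(rewards, assignments, sigma=1.0):
    """Assumed Gaussian likelihood (up to a constant): rewards within a cluster
    share a common mean. `rewards` is a NumPy array aligned with `assignments`."""
    assignments = np.array(assignments)
    logp = 0.0
    for k in set(assignments.tolist()):
        r_k = rewards[assignments == k]
        logp += -0.5 * np.sum((r_k - r_k.mean()) ** 2) / sigma ** 2
    return logp
```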

Experiment 1a: Clusters induce rewards (human set up)

The goal of the first experiment was to understand how clusters induce rewards, testing whether graph structure drives cluster formation. It was conducted by asking 32 human subjects to choose the node they would most likely visit next, as specified in the following task scenario. Participants were randomly presented with one of the two graphs below, to ensure that bias of handedness or graph structure was not inadvertently introduced. I predicted that participants would choose the node adjacent to the labeled one that was located in the larger cluster, i.e. the gray node to the left of the blue one in the first case, and the gray node to the right of the blue one in the second case.

I expected most participants to automatically identify the following clusters, with nodes colored in peach and lavender to denote the different clusters, and make a decision about which mine to select with this in mind. It was hypothesized that participants would select a peach-colored node as opposed to a lavender one, since the node labeled 30, a fairly-larger-than-average reward, is in the peach-colored cluster.

Experiment 1b: Clusters induce rewards (modeling expected behavior)

I approximated Bayesian inference over H using Metropolis-within-Gibbs sampling⁸, which updates each component of H by sampling from its posterior conditioned on all other components in a single Metropolis-Hastings step. I employed a Gaussian random walk as the proposal distribution for continuous components, and the conditional Chinese restaurant process (CRP) prior as the proposal distribution for cluster assignments⁹. Essentially, the approach can be interpreted as stochastic hill climbing with respect to a utility function defined by the posterior.
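The sketch below illustrates the cluster-assignment portion of such a sampler, proposing each vertex's cluster from the conditional CRP and accepting or rejecting with a Metropolis-Hastings step. For brevity it omits the continuous components and the proposal-density correction; all names are illustrative.

```python
# Sketch of Metropolis-within-Gibbs over cluster assignments, assuming a
# log_posterior(assignments) function such as the one sketched above.
# This simplified version omits the proposal-density correction term.
import numpy as np

rng = np.random.default_rng(0)

def conditional_crp_proposal(assignments, i, alpha=1.0):
    """Propose a new cluster for vertex i from the CRP conditioned on all others."""
    others = [z for j, z in enumerate(assignments) if j != i]
    labels = sorted(set(others))
    weights = np.array([others.count(k) for k in labels] + [alpha], dtype=float)
    weights /= weights.sum()
    choice = rng.choice(len(weights), p=weights)
    # Last slot corresponds to opening a brand-new cluster
    return labels[choice] if choice < len(labels) else (max(labels, default=-1) + 1)

def metropolis_within_gibbs(assignments, log_posterior, n_steps=1000):
    current = list(assignments)
    current_lp = log_posterior(current)
    for _ in range(n_steps):
        for i in range(len(current)):
            proposal = list(current)
            proposal[i] = conditional_crp_proposal(current, i)
            proposal_lp = log_posterior(proposal)
            # Single-component Metropolis-Hastings accept/reject
            if np.log(rng.random()) < proposal_lp - current_lp:
                current, current_lp = proposal, proposal_lp
    return current
```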

The top three clusterings output by the model were:

Clusterings identified by the static rewards model. All three top results were identical, indicating that the model identified the colored groupings with high confidence.

The results for participants, as well as those for the static rewards model, are visualized in the table and image below. There were 32 participants total in each of the human and simulated groups.

Proportion of human and simulated subjects, out of 32 total, who chose to visit node 2 next. The solid black line indicates the mean and the dotted black lines indicate the 2.5th and 97.5th percentiles.

The listed p-values were calculated via a right-tailed binomial test, where the null hypothesis was a binomial distribution over choosing the left or right gray node with equal probability. The significance level was taken to be α = 0.05, and both the human experimental results and the modeling results were statistically significant.
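For reference, a right-tailed binomial test of this kind can be computed as follows; the counts shown are placeholders, not the actual experimental data.

```python
# Right-tailed binomial test against a 50/50 null over the two gray nodes.
# The counts below are placeholders, not the actual experimental counts.
from scipy.stats import binomtest

n_subjects = 32
n_chose_predicted_node = 24  # placeholder count

result = binomtest(n_chose_predicted_node, n_subjects, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.4f}")  # significant if below alpha = 0.05
```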

Experiment 2a: Rewards induce clusters (human set up)

In the second experiment, the goal was to determine whether rewards induce clusters. That is, I predicted that nodes with the same reward that are positioned adjacent to each other would be clustered together, even if the structure of the graph alone would not induce such clusters. Recall that Solway et al. showed that people prefer paths that cross the fewest hierarchy boundaries¹. Therefore, between two otherwise identical paths from node A to node B, the only reason to prefer one over the other would be that it crosses fewer hierarchy boundaries. One possible counterargument is that people pick the path with higher rewards. However, in my setup rewards are given only in the goal state, not cumulatively over the path taken. Therefore, it is unlikely that people would favor a path because nodes along that path, excluding the goal, have higher rewards.

This experiment was conducted on the web using Amazon Mechanical Turk. Similarly to Experiment 1, participants were given the following context about the task:

As in Experiment 1, they were also randomly given either the configuration shown below or the horizontally-flipped version of the same graph in order to control for potential left-right asymmetry.

The diagram presented to participants
The nodes are numbered for reference. The expected induced clusters are indicated by color

We can refer to the first case (where participants are free to navigate to any mine) as free-choice and the second case (where participants navigate to a specified mine) as fixed-choice. It is also important to note that participants received a monetary reward (one cent per point earned) for each trial to discourage random responses. At each trial, there was a 0.2 probability that the rewards would be changed. New rewards were drawn uniformly at random from the interval [0, 300]. However, the grouping of rewards remained the same across trials. That is, nodes 1, 2, and 3 always had one reward value; nodes 4, 5, and 6 had a different reward value; and nodes 7, 8, 9, and 10 had a third reward value. The first 99 trials allowed the participant to develop a hierarchy of clusters. The final trial, which acted as the test trial, asked each participant to navigate from node 6 to node 1. I predicted that more participants would take the path through node 5, which crosses only one cluster boundary, than through node 7, which crosses two cluster boundaries, in keeping with existing findings¹.
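A minimal sketch of this trial structure, assuming the node groupings described above; the helper names are illustrative.

```python
# Sketch of the trial structure described above: with probability 0.2 the reward values
# are redrawn uniformly from [0, 300], but the grouping of nodes into equal-reward sets
# stays fixed (nodes 1-3, 4-6, and 7-10 each share one value).
import random

reward_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

def draw_rewards():
    """Assign one freshly drawn reward value to every node in each group."""
    rewards = {}
    for group in reward_groups:
        value = random.randint(0, 300)
        for node in group:
            rewards[node] = value
    return rewards

rewards = draw_rewards()
for trial in range(100):
    if random.random() < 0.2:   # 0.2 chance of re-randomizing rewards on each trial
        rewards = draw_rewards()
    # ... present the graph with the current rewards and record the participant's path ...
```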

Experiment 2b: Rewards induce clusters (modeling expected behavior)

I modeled the fixed-choice case, with the assumption that the tasks in all 100 trials were the same as the 100th trial presented to participants (the test trial). First, I assumed static rewards, where the rewards remained constant across all trials. Next, I assumed dynamic rewards, where rewards changed across trials as described above for the human experiment. While in Experiment 1 the participant picks a single node and we want to model what that node is, in Experiment 2 we want to find the full path the participant (or simulated participant) chose to take from the start node to the goal node, and then check the second node in that path. Because of this, in order to compare the model to human data, I used a variant of breadth-first search, hereafter referred to as hierarchical BFS, to predict a path from the start node (node 6) to the goal (node 1).
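One way such a hierarchical BFS might be implemented is sketched below: a breadth-first search over the cluster-level graph, followed by a node-level search restricted to the clusters along that path. This is an interpretation for illustration; the graph representation and helper names are assumptions.

```python
# Sketch of a hierarchical BFS: search first at the level of clusters, then expand
# that route into a node-level path, staying inside the clusters along the route.
from collections import deque

def bfs(adjacency, start, goal, allowed=None):
    """Plain BFS over an adjacency dict, optionally restricted to `allowed` nodes."""
    queue, parents = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in adjacency[node]:
            if nxt not in parents and (allowed is None or nxt in allowed):
                parents[nxt] = node
                queue.append(nxt)
    return None

def hierarchical_bfs(adjacency, clusters, start, goal):
    """BFS over clusters, then BFS over nodes within the clusters on that route."""
    cluster_of = {n: c for c, nodes in clusters.items() for n in nodes}
    # Cluster-level graph: two clusters are adjacent if any of their nodes are.
    cluster_adj = {c: set() for c in clusters}
    for n, neighbors in adjacency.items():
        for m in neighbors:
            if cluster_of[n] != cluster_of[m]:
                cluster_adj[cluster_of[n]].add(cluster_of[m])
    cluster_path = bfs(cluster_adj, cluster_of[start], cluster_of[goal])
    # Node-level search restricted to the clusters along the cluster-level path.
    allowed = {n for c in cluster_path for n in clusters[c]}
    return bfs(adjacency, start, goal, allowed=allowed)
```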

For each subject, I sampled from the posterior using Metropolis-within-Gibbs sampling and chose the most probable hierarchy (i.e. the hierarchy with the highest posterior probability). Then, I used hierarchical BFS to first find a path between clusters and then between the nodes within the clusters. For dynamic rewards, I used online inference. That is, for each simulated participant, I allowed the sampling for each trial to progress for only 10 steps. Then, I saved the hierarchy, and added information about the modified rewards.

Subsequently, I allowed sampling to progress again, starting from the saved hierarchy. As in the human experiment, at the beginning of each trial there was a 0.2 probability that the rewards would be re-randomized to new values, although the rewards were always equal within clusters. The simulated rewards were scaled to be float values in the interval [0, 30], rather than the integers in the interval [0, 300] used for the human experiment. This inference method simulates how human participants might learn cumulatively over the course of many trials. I also assumed, for the purpose of this experiment, that people keep only one hierarchy in mind at a time, rather than updating multiple hierarchies in parallel. Finally, I modified the log posterior to penalize disconnected clusters, because such clusters became much more common under this type of inference.
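The online inference loop could look roughly like the following sketch, reusing the metropolis_within_gibbs and draw_rewards helpers from the earlier sketches; make_log_posterior is an assumed factory that rebuilds the log posterior from the current rewards.

```python
# Sketch of the online inference scheme: each trial runs only a few sampling steps,
# the resulting hierarchy is carried over to the next trial, and rewards are
# re-randomized with probability 0.2. Helper names refer to the earlier sketches
# and are illustrative, not the article's actual implementation.
import random

def online_inference(initial_assignments, make_log_posterior,
                     n_trials=100, steps_per_trial=10):
    assignments = list(initial_assignments)
    # Simulated rewards scaled to floats in [0, 30] rather than integers in [0, 300]
    rewards = {n: v / 10.0 for n, v in draw_rewards().items()}
    for _ in range(n_trials):
        if random.random() < 0.2:  # occasional reward change, matching the human task
            rewards = {n: v / 10.0 for n, v in draw_rewards().items()}
        log_posterior = make_log_posterior(rewards)
        # Resume sampling from the saved hierarchy rather than starting from scratch,
        # mimicking cumulative learning across trials.
        assignments = metropolis_within_gibbs(assignments, log_posterior,
                                              n_steps=steps_per_trial)
    return assignments
```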

In each group (human, and each simulated group), there were 95 total participants. The null hypothesis is that an equal number of participants choose a path through node 5 and through node 7, since in the absence of any other information a participant is equally likely to choose either. The results of the static rewards modeling were statistically significant, with a p-value below α = 0.05, while the human results fell just short of significance. Still, as shown in the graph below, the proportion observed in the human experiment lies at roughly the 90th percentile of a normal distribution centered around 0.5, the expected proportion under the null hypothesis.

Proportion of human and simulated subjects, out of 95 total, whose chosen path’s second node was node 5. The solid black line indicates the expected proportion given the null hypothesis and the dotted black lines indicate the 10th and 90th percentiles.

The dynamic rewards model did not produce results that were statistically significant.

I used 1000 iterations of Metropolis-within-Gibbs sampling to generate each sample, with a burn-in and lag of 1 each. The simulation under static rewards clearly favors paths through node 5, to a level that is statistically significant (its p-value of 0.015 is below α = 0.05). However, since its purpose is to model human behavior, this result is less meaningful in light of the human data not being statistically significant. For dynamic rewards within simulated subjects, in order to mimic the human trials, I ran 100 trials, each with 10 iterations of Metropolis-within-Gibbs to sample from the posterior. Burn-in and lag were again set to 1. Interestingly, the online inference method (dynamic rewards modeling) appears to have modeled the human data better than the static rewards modeling, even though the group of simulated participants under dynamic rewards modeling is farther from the hypothesis (i.e. less frequently chooses the route that traverses the fewest cluster boundaries) than the group simulated under static rewards modeling. 56 human participants and 54 simulated participants under dynamic rewards modeling chose to go through node 5 (a 3.4% difference), compared to 64 simulated participants under static rewards modeling (an 18.5% difference).

Clusterings identified by [1] the static rewards model, [2] the static rewards model with cluster formation between disconnected components penalized, and [3] the dynamic rewards model

Conclusion

Through modeling and experimentation on human participants, I show that an optimal hierarchy depends not just on the graph structure of the environment, but also on the probability distribution of tasks to be completed in that environment, as well as the distribution of rewards. I ran human experiments and built Bayesian models to understand how clusters induce static rewards (Experiment 1) and how both static and dynamic rewards induce clusters (Experiment 2), and found that most results were statistically significant in terms of how closely the models captured human actions.

