
XRL: eXplainable Reinforcement Learning

source link: https://towardsdatascience.com/xrl-explainable-reinforcement-learning-4cd065cdec9a?gi=992c5a26e686

With AI technologies being deployed at scale and operating autonomously, it becomes imperative to instil the property of explainability in AI, so that users can trust these technologies. A user will confidently use a technology only if he/she can trust it, and for the technology to be trusted, it needs to be transparent. This transparency can be achieved if a model is able to provide justifications for its predictions and decision-making. Explainability is even more crucial in the field of Reinforcement Learning, where agents learn by themselves without any human intervention.

The objective of this article is to make readers aware of the XRL techniques currently pursued by different research teams. An important thing to consider regarding XRL is that much of the work in the domain should be done keeping in mind the human side of the equation. As a result, in order to advance XRL (and XAI), an interdisciplinary approach should be undertaken that heeds the needs of a human user who has no particular domain expertise yet is utilizing the AI technology. For this article, it is assumed that the reader has an intermediate level of knowledge of Reinforcement Learning theory as well as a basic understanding of eXplainable AI.

Moving forward, let us first categorize XRL techniques. Similar to XAI methodologies, XRL techniques can be classified based on the scope of the explanation (local vs. global) and the timing at which information is extracted from the XRL technique (intrinsic vs. post-hoc).


Fig 1. Categorization of XRL techniques

Below are some of the potential XRL methodologies developed by research teams showing promising leads.

(Local, Intrinsic) Hierarchical and Interpretable Skill Acquisition in Multi-task Reinforcement Learning

Hierarchical and Interpretable Skill Acquisition in Multi-task Reinforcement Learning, described in [1], extends multi-task RL by introducing modularity in the policy design through a two-layer hierarchical structure. The framework is fundamentally based on the fact that a complex task requires different skills and is composed of several simpler subtasks. [1] trained the model to perform object manipulation tasks in Minecraft, i.e. finding, getting, putting or stacking a block of a certain type. The model utilized the ‘advantage actor-critic’ (A2C) method for policy optimization using off-policy learning. The model possesses a hierarchical structure, as every top-level policy can be decomposed into low-level actions. In the case of Minecraft, the task of stacking a cobblestone block represents a complex task which the agent decomposes into the actions of finding a cobblestone block, getting a cobblestone block and putting down a cobblestone block. The interpretability of the framework comes from the fact that each task (for instance stack cobblestone block) is described by a human instruction, and the trained agent can only access learnt skills through these human descriptions, making the agent’s policies and decisions human-interpretable.
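To make this decomposition concrete, here is a toy sketch in the spirit of the framework (all names, skills and the primitive-action lists are hypothetical placeholders of mine, not taken from [1]):

```python
# Toy two-level hierarchy: a high-level task is unrolled into human-readable
# sub-instructions, each of which maps to primitive actions. Illustrative only.
LOW_LEVEL_SKILLS = {
    "find cobblestone": ["turn", "move", "move"],   # placeholder primitive actions
    "get cobblestone": ["move", "pick_up"],
    "put cobblestone": ["move", "place"],
}

def high_level_policy(task: str) -> list[str]:
    """Map a human-readable task to a sequence of human-readable sub-instructions."""
    if task == "stack cobblestone":
        return ["find cobblestone", "get cobblestone",
                "find cobblestone", "put cobblestone"]
    return [task]  # already a basic skill

def execute(task: str) -> list[str]:
    """Roll out the hierarchy: every sub-instruction expands into primitive actions."""
    actions = []
    for instruction in high_level_policy(task):
        actions.extend(LOW_LEVEL_SKILLS[instruction])
    return actions

print(execute("stack cobblestone"))
```

Because the intermediate level is expressed as human-readable instructions, the rollout trace itself reads as an explanation of what the agent is doing.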


Fig 2. Example of multi-level hierarchical policy for a given task — stacking two blue blocks. Each arrow represents one step generated by a certain policy and the colours of arrows indicate the source policies. Note that at each step, a policy either utters an instruction for the lower-level policy or directly takes an action. Source: Figure 1 in [1]

Additionally, the framework integrates a stochastic temporal grammar (STG) to model temporal relationships and priorities among tasks (for instance, before stacking a cobblestone block on top of another, you must first find a cobblestone block, pick it up and then find another cobblestone block to put the one in hand on top of it). Hence the key idea of the framework is to decompose complex tasks into simpler subtasks. If these simpler subtasks can be solved using policies already learnt or skills already gained by the agent, then no new learning happens; otherwise, a new skill needs to be learnt to perform the novel action. This framework can be represented pictorially as below.


Fig 3. Note that here the instruction Get white is given by a human, making the policy human-interpretable. Source: Figure 2 in [1]

Assumptions

Let G be defined as the task set in which each task g represents a human instruction (for instance ‘Get white’ in the above figure). In [1], a two-word tuple g = ⟨skill, item⟩ is used to describe an object manipulation task in Minecraft. As it is a multi-task framework, the reward differs across tasks, so R(s, g) is used to denote the task-specific reward function, where s = state. Initially, the agent is trained with a terminal policy π₀ for the terminal (basic) tasks G₀. The task set is successively enlarged as the human instructor keeps giving instructions, so that G₀ ⊂ G₁ ⊂ … ⊂ Gₖ, which results in the learning of policies as portrayed in Fig 2. Also note that at stage k, with h = k-1, we define Gₕ as the base task set for Gₖ and πₕ as the base policy for πₖ. Weak supervision from humans, in the form of task instructions, is required for learning the policies; this task augmentation from humans is what enables hierarchical and interpretable skill acquisition in multi-task reinforcement learning.

Hierarchical Policy (note that here h=k-1)

The main idea behind the hierarchical policy detailed in [1] is that the current task set Gₖ can be fragmented into several subtasks, which may already be present in the base task set Gₕ and can therefore be solved using the base policy πₕ. As a result, instead of mapping the current state and human instruction directly to an action as in the flat policy structure of Fig 3(a), the hierarchical policy design reuses the base policies for performing base tasks that appear as subtasks at the current stage. The global policy at stage k, πₖ, consists of four sub-policies: “a base policy for executing previously learned tasks, an instruction policy that manages communication between the global policy and the base policy, an augmented flat policy which allows the global policy to directly execute actions, and a switch policy that decides whether the global policy will primarily rely on the base policy or the augmented flat policy.” — [1]. The instruction policy maps the current state s and g ∈ Gₖ to a task in Gₕ; hence its main function is to tell the base policy πₕ which base task in Gₕ it needs to perform. As mentioned earlier, an instruction g consists of two parts (the skill and the item in the instruction phrase), which are conditionally independent of each other, and the instruction policy therefore factorizes over the two words of the next instruction, as sketched below.
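In approximate notation (symbols mine, following the description above), the factorization reads:

```latex
% Factorization of the instruction policy at stage k (approximate notation):
% the next instruction g' = <u', x'> consists of a skill word u' and an item word x'.
\pi_k^{\mathrm{inst}}\big(g' = \langle u', x' \rangle \mid s, g\big)
  \;=\; \pi_k\big(u' \mid s, g\big)\,\pi_k\big(x' \mid s, g\big)
```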

When the base policy cannot be used to perform a certain task, the augmented flat policy comes into action, mapping state s and task g to an action a so that new skills can be learnt to solve novel tasks in Gₖ. The switch policy plays the role of a mediator, choosing when to use the base policy and when to use the augmented flat policy. The switch policy outputs a binary variable e such that when e=0 the global policy πₖ follows the base policy, and when e=1, πₖ follows the augmented flat policy.

Thus, at each step, the model first samples the binary variable eₜ from the switch policy and a new instruction from the instruction policy, so that the model can then sample actions from the base policy (or directly from the augmented flat policy). The figure below summarizes the process at each step, and a minimal code sketch follows it.


Hierarchical policy: the procedure followed at each step. Source: [1]
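As a rough illustration of this step (the four sub-policies are replaced by trivial stubs of mine; this is not the implementation from [1]):

```python
import random

# Illustrative sketch of one step of the stage-k global policy described above.
def switch_policy(state, task):
    return random.randint(0, 1)          # e = 0 -> rely on base policy, e = 1 -> flat policy

def instruction_policy(state, task):
    return "find cobblestone"            # hypothetical base task g' in G_h

def base_policy(state, sub_task):
    return "move"                        # primitive action from a previously learned skill

def flat_policy(state, task):
    return "place"                       # primitive action chosen directly

def global_policy_step(state, task):
    """One step of the global policy: the switch decides between skill reuse and direct action."""
    e = switch_policy(state, task)
    if e == 0:
        sub_task = instruction_policy(state, task)   # human-readable instruction
        return base_policy(state, sub_task)          # reuse a previously learned skill
    return flat_policy(state, task)                  # act directly to learn something new

print(global_policy_step(state={}, task="stack cobblestone"))
```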

Stochastic Temporal Grammar (STG)

A Stochastic Temporal Grammar is used to capture temporal relations between distinct tasks. For example, to stack an object, we first need to find, pick up and place the object. The STG is used as a prior when modifying the switch and instruction policies of the hierarchical policy, i.e. when sampling eₜ and the next instruction. In [1], the STG at each stage k>0 is defined by: 1) the transition probabilities over consecutive switch/instruction pairs (eₜ, gₜ) given (eₜ₋₁, gₜ₋₁); and 2) the distribution of the initial pair (e₁, g₁).

As a result, incorporating STG into the hierarchical policy, we get refined switch and instruction policies as follows:


Refined switch and instruction policies by including STG into the hierarchical policy. Source: [1]
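Schematically (my notation; assuming the STG enters simply as a multiplicative prior that reweights the learned switch and instruction policies before sampling), the refinement has the form:

```latex
% Schematic only: the STG transition probability rho_k acts as a prior that
% reweights the learned switch and instruction policies.
\tilde{\pi}_k^{\mathrm{sw}}\big(e_t \mid s_t, g\big)
  \;\propto\; \rho_k\big(e_t \mid e_{t-1}, g'_{t-1}\big)\,
              \pi_k^{\mathrm{sw}}\big(e_t \mid s_t, g\big),
\qquad
\tilde{\pi}_k^{\mathrm{inst}}\big(g'_t \mid s_t, g\big)
  \;\propto\; \rho_k\big(g'_t \mid e_{t-1}, g'_{t-1}\big)\,
              \pi_k^{\mathrm{inst}}\big(g'_t \mid s_t, g\big)
```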

The resulting framework exhibited higher learning efficiency, generalized well to new environments and was inherently interpretable, since it relies on weak human supervision in the form of instructions for the agent to learn new skills.

(Local, Post-hoc) Explainable Reinforcement Learning through a Causal Lens

Cognitive science suggests that, in order to understand or interpret a phenomenon, humans build causal models that encode the cause-effect relationships between the events happening around us. While building a causal model, we constantly ask the question why or why not. Continuing with this logic, Explainable Reinforcement Learning through a Causal Lens, detailed in [2], builds a structural causal model for generating causal explanations of the behaviour of model-free reinforcement learning agents through variables of interest. Counterfactual analysis is then carried out on this structural causal model in order to generate explanations. Explainable Reinforcement Learning through a Causal Lens also investigates the comprehension users gain through the explanations, the users’ satisfaction with the explanations, and the trust the explanations induce in the model-free reinforcement learning agent. In Explainable Reinforcement Learning through a Causal Lens, action influence models are introduced for Markov Decision Process (MDP) based RL agents, extending structural causal models (SCMs) with the addition of actions. First, let us learn about SCMs.

Structural Causal Models

Structural Causal Models, or SCMs, were introduced in Halpern and Pearl 2005. SCMs portray the world using exogenous/external and endogenous/internal random variables; some of these variables may be in causal relationships with each other, which are represented using a set of structural equations. Formally, to define an SCM, we first define a signature S, which is a tuple (U, V, R) where U is the set of exogenous variables, V is the set of endogenous variables and R is a function which designates the range of values for every variable Y ∈ U ∪ V.

Formal Definition: A structural causal model is a tuple M = (S, F) where F is the set of structural equations, one function Fₓ for each X ∈ V, with Fₓ giving the value of X in terms of the other variables in the model, i.e. X = Fₓ(U ∪ V \ {X}). Additionally, a context u is defined as a vector containing one value for each exogenous variable in U. A situation is a model-context pair (M, u). Assigning values to the variables of the model according to the structural equations is called an instantiation. An actual cause of an event φ is represented by a set of endogenous variables and their values: had those endogenous variables taken different values, the event φ would not have occurred, hence there is a counterfactual contrast embedded in the actual cause of the event φ.
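As a toy illustration (an example of my own, not from [2]), consider one exogenous and two endogenous variables:

```latex
% Toy SCM (illustrative). Exogenous: U = {season}; endogenous: V = {rain, wet_grass};
% R(season) = {winter, summer}, R(rain) = R(wet_grass) = {0, 1}.
F_{\text{rain}}(\text{season}) = [\text{season} = \text{winter}]
\qquad
F_{\text{wet\_grass}}(\text{rain}) = \text{rain}
% The context season = winter gives the instantiation rain = 1, wet_grass = 1.
% rain = 1 is an actual cause of wet_grass = 1: under the counterfactual value
% rain = 0, the event wet_grass = 1 no longer occurs.
```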

Action Influence Models

The main intent behind action influence models is to facilitate explanations of the agent’s behaviour in terms of how its actions influence the environment. Thus, the idea of SCMs is extended to action influence models by incorporating actions into the causal relationships.

Formal Definition: An action influence model is a tuple (Sₐ, F) where Sₐ is a signature extending the SCM signature with a set of actions A, i.e. Sₐ = (U, V, R, A), and F is the set of structural equations; however, here F contains multiple equations for each X ∈ V, one for each unique action that influences it. Thus, F_{X,a} portrays the causal effect on X when action a is exerted. A set of reward variables Xᵣ ⊆ V is allocated to the sink nodes.


Fig 4. Action Influence graph of a Starcraft II agent. Source: Figure 1 in [2]. Note that here we have considered a finite set of state variables as well as a finite set of actions. There exists a set of structural equations for each state variable (represented as nodes) depending on the unique incoming action (represented as edges). For instance, state variable Aₙ is influenced by state variables S and B when action Aₘ is applied, hence the structural equation F_{Aₙ, Aₘ} (S, B) captures the causal relationship.
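To make the structure concrete, here is a minimal sketch of how such an action influence graph could be stored (variable and action names follow the caption of Fig 4; the edge set shown is partial and partly hypothetical, not the full graph from [2]):

```python
# Minimal, partial encoding of an action influence model (illustrative only).
# Each entry maps (affected_variable, action) -> parent variables, i.e. the
# inputs of the structural equation F_{affected_variable, action}.
ACTION_INFLUENCE_EDGES = {
    ("S", "A_s"): ["W"],         # building supply depots: worker number W influences supply S
    ("A_n", "A_m"): ["S", "B"],  # from the caption: A_n is influenced by S and B under action A_m
    ("D_u", "A_a"): ["A_n"],     # hypothetical edge towards the reward variable D_u
}

# Reward (sink) variables of the graph, per the definition above.
REWARD_VARIABLES = {"D_u", "D_b"}

def parents(variable: str, action: str) -> list[str]:
    """Return the parents of `variable` under `action`, i.e. the arguments of F_{variable, action}."""
    return ACTION_INFLUENCE_EDGES.get((variable, action), [])

print(parents("A_n", "A_m"))   # ['S', 'B']
```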

Generating explanations from Action Influence Model

An explanation consists of an explanandum, the event to be explained, and an explanans, the cause which justifies the occurrence of the event. Explanation generation requires the following steps: 1) defining the action influence model; 2) learning the structural equations during reinforcement learning; and finally, 3) generating explanans for the explanandum. Explainable Reinforcement Learning through a Causal Lens mainly focuses on providing explanations for questions of the form ‘Why action A?’ or ‘Why not action A?’. It also characterizes explanations in the context of reinforcement learning into two types: 1) complete explanations; and 2) minimally complete explanations. Let us now define them formally.

‘Why?’ questions

Complete explanations: A complete explanation for a given action under the actual instantiation M_{V←S} is a tuple (Xᵣ=xᵣ, Xₕ=xₕ, Xᵢ=xᵢ), where Xᵣ is the vector of reward variables reached by moving through the causal chain of the graph to the sink nodes; Xₕ is the vector of variables at the head node of the action; Xᵢ is the vector of intermediate variables on the path from the head to the sink nodes in the action influence graph; and xᵣ, xₕ, xᵢ are the values of the associated variables given by the actual instantiation M_{V←S}.

Hence, according to the above definition, a complete explanation provides the complete causal chain from an action to any future reward variable associated with that action. For instance, the causal chain for action Aₛ is shown darkened in Fig 4. The complete explanation tuple in this case would be ([S=s], [Aₙ=aₙ], [Dᵤ=dᵤ, D_{b}=d_{b}]). A depth-first search can be used to traverse the action influence graph from the head node of the action to all the sink nodes.
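A small sketch of that traversal (the successor edges are a partial, hand-written approximation of the Starcraft graph in Fig 4, not code from [2]):

```python
# Extracting a causal chain by depth-first search (illustrative only).
# CAUSAL_SUCCESSORS maps a variable to the variables it influences downstream.
CAUSAL_SUCCESSORS = {
    "S":   ["A_n"],
    "A_n": ["D_u", "D_b"],
    "D_u": [],            # sink / reward node
    "D_b": [],            # sink / reward node
}

def causal_chain(head: str) -> list[str]:
    """Depth-first traversal from the head node of an action down to the sink (reward) nodes."""
    chain, stack, seen = [], [head], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        chain.append(node)
        stack.extend(CAUSAL_SUCCESSORS.get(node, []))
    return chain

# The head node of action A_s (build supply depots) is S, giving the darkened chain of Fig 4:
print(causal_chain("S"))   # e.g. ['S', 'A_n', 'D_b', 'D_u']
```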

Minimally complete explanations: For larger graphs, a high number of intermediate nodes may cause confusion. In this case, minimally complete explanations come in handy. A minimally complete explanation is defined as a tuple (Xᵣ=xᵣ, Xₕ=xₕ, Xₚ=xₚ), where Xᵣ=xᵣ and Xₕ=xₕ are as in the earlier definition, and Xₚ is the vector of variables that are the immediate predecessors of Xᵣ. Immediate predecessors are chosen because they depict the immediate causes of the reward. Note that the number of intermediate nodes considered in minimally complete explanations depends on the application as well as on the knowledge of the user.

‘Why not?’ questions

Counterfactual explanations can be produced by comparing the causal chain of the action that actually occurred with that of the explanandum (the counterfactual action). First, let us define the counterfactual instantiation, under which we assign the state variables the values for which the counterfactual action would be chosen.

Counterfactual instantiation: A counterfactual instantiation pertaining to a counterfactual action can be obtained from a model M_{Z←S_{z}}, where Z instantiates all the predecessor variables of the counterfactual action as well as the successor variables falling in its causal chain, utilizing the structural equations.

Minimally complete contrastive explanations: Given a minimally complete explanation X=x for the actual action under the actual instantiation M_{V←S} and a minimally complete explanation Y=y for the counterfactual action under the counterfactual instantiation M_{Z←S_{z}}, a minimally complete contrastive explanation is a tuple (Xˡ=xˡ, Yˡ=yˡ, Xᵣ=xᵣ), where Xˡ=xˡ and Yˡ=yˡ are the maximal sets of variable assignments from X=x and Y=y that do not overlap, i.e. (Xˡ=xˡ) ∩ (Yˡ=yˡ) = ∅. We then contrast xˡ and yˡ (the difference condition). Note that here Xᵣ corresponds to the reward nodes for the actual action.

Hence, the minimally complete contrastive explanation extracts the differences between the actual causal chain of the actual action and the counterfactual causal chain of the counterfactual action.

Example

Examine the action influence graph for the Starcraft II agent shown in Fig 4. Using this graph, we will try to answer the questions:

‘Why build supply depots (Aₛ)?’ and ‘Why not build barracks (A_{b})?’

Let m = [W = 12, S = 3, B = 2, Aₙ = 22, Dᵤ = 10, D_{b} = 7] be the actual instantiation and mˡ = [W = 12, S = 1, B = 2, Aₙ = 22, Dᵤ = 10, D_{b} = 7] be the counterfactual instantiation. Implementing the difference condition , we get the minimally complete contrastive explanation as

([S = 3], [S = 1], [Dᵤ = 10, D_{b} = 7]). On comparing [S=3] with [S=1], an explanation can be generated from an NLP template: Because the goal is to increase destroyed units (Dᵤ) and destroyed buildings (D_{b}), it makes sense to build supply depots (Aₛ) to increase the supply number (S). Note that the values of the variables are obtained during instantiation through the learnt structural causal equations.
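The difference condition on this worked example can be sketched in a few lines (the choice of variables to compare is an assumption made to keep the toy example minimal; this is not code from [2]):

```python
# Difference condition on the worked example (illustrative only).
actual         = {"W": 12, "S": 3, "B": 2, "A_n": 22, "D_u": 10, "D_b": 7}
counterfactual = {"W": 12, "S": 1, "B": 2, "A_n": 22, "D_u": 10, "D_b": 7}

# Variables appearing in the two minimally complete explanations; here both reduce
# to the head-node variable S (assumption for this toy example).
explained_vars = ["S"]
reward_vars = ["D_u", "D_b"]

# Keep only the assignments that differ between the two instantiations.
x_l = {v: actual[v] for v in explained_vars if actual[v] != counterfactual[v]}
y_l = {v: counterfactual[v] for v in explained_vars if actual[v] != counterfactual[v]}
x_r = {v: actual[v] for v in reward_vars}

print(x_l, y_l, x_r)   # {'S': 3} {'S': 1} {'D_u': 10, 'D_b': 7}
```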

Learning the structural causal model

Given a directed acyclic graph depicting the causal relationships between variables, the structural equations can be learnt as multivariate regression models during the training of the reinforcement learning agent. By saving eₜ = (sₜ, aₜ, rₜ, s_{t+1}) at each time step into a dataset Dₜ={e₁, …, eₜ}, experience replay can be implemented. Then the subset of structural equations F_{X, A} associated with the executed action is updated using a regression learner. Note that we update the structural equations of only the variables associated with the specified action; for instance, referring to Fig 4, for any experience frame with the action Aₛ, only the equation F_{S, Aₛ}(W) will be updated. Any regression learner can be utilized as the learning model, such as an MLP regressor.
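A minimal sketch of this procedure, assuming the (partial, hypothetical) parent sets from the Fig 4 caption and using a scikit-learn MLP regressor as the learning model (not the code from [2]):

```python
# Learning the structural equations F_{X, a} by regression over replayed experience.
from collections import defaultdict
import numpy as np
from sklearn.neural_network import MLPRegressor

PARENTS = {("S", "A_s"): ["W"], ("A_n", "A_m"): ["S", "B"]}   # (variable, action) -> parents
models = {key: MLPRegressor(hidden_layer_sizes=(16,)) for key in PARENTS}
buffers = defaultdict(list)   # one replay buffer per structural equation

def record(state: dict, action: str, next_state: dict) -> None:
    """Store a training example for every structural equation touched by `action`."""
    for (var, act), parent_vars in PARENTS.items():
        if act == action:
            x = [state[p] for p in parent_vars]
            buffers[(var, act)].append((x, next_state[var]))

def refit() -> None:
    """Periodically refit each regressor on its replay buffer."""
    for key, data in buffers.items():
        if len(data) >= 10:
            X = np.array([x for x, _ in data])
            y = np.array([target for _, target in data])
            models[key].fit(X, y)
```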


Fig 5. Algorithm to learn the structural equations. Source: [2]

For further topics, such as the computational and human evaluation of the technique as well as its results, refer to [2].

(Local, Post-hoc) Distal Explanations for Explainable Reinforcement Learning Agents

[extension to Explainable Reinforcement Learning through a Causal Lens ]

Even though the explanations generated using the action influence model beat the standard state-action model, the structural equations performed poorly in the computational evaluation of task-prediction accuracy. Hence, as a substitute for structural equations in modelling the causal relationships between variables, Distal Explanations for Explainable Reinforcement Learning Agents, detailed in [4], suggests using decision nodes from a decision tree to generate explanations along the causal chain.

Decision trees generating causal explanations

A model for generating distal explanations consists of a decision node from a decision tree portraying the agent’s complete policy, linked with the causal chain from the action influence model of the earlier section. Suppose we have a decision tree model. Then, while training the reinforcement learning agent, we implement experience replay by storing eₜ = (sₜ, aₜ) at each step t into a dataset Dₜ = {e₁,…,eₜ}. Thereafter, we uniformly sample mini-batches from D to train the decision tree with sₜ as input and aₜ as output. A decision tree with no constraints on the number of its nodes would lead to confusing and overwhelming explanations; hence, in [4], the growth of the decision tree is limited by setting the number of leaves equal to the number of possible actions in the agent’s domain. The evaluation in [4] shows that constraining the decision tree hardly affects the task-prediction accuracy in comparison to an unconstrained tree. In order to obtain the decision nodes of the decision tree at a state sₜ, the model traverses the decision tree from the root node until it reaches a leaf node, storing the nodes along its path. For instance, from Fig 6, Aₙ and B are the decision nodes for action Aₛ. Note that the decision nodes represent feature variables of the agent’s state space.
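A small sketch of the decision-tree part, assuming a hypothetical state feature order [W, S, B, Aₙ] and made-up replay data (illustrative only, not code from [4]):

```python
# Leaf-constrained decision tree over (state, action) pairs, plus extraction of the
# decision nodes on the root-to-leaf path for a given state.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["W", "S", "B", "A_n"]                       # hypothetical feature order
states = np.array([[12, 3, 2, 22], [10, 1, 2, 20],
                   [12, 3, 5, 25], [11, 2, 1, 18]])     # made-up replayed states
actions = np.array(["A_s", "A_b", "A_m", "A_s"])        # made-up replayed actions
n_actions = len(set(actions))

# Constrain the tree so the number of leaves equals the number of possible actions.
tree = DecisionTreeClassifier(max_leaf_nodes=n_actions).fit(states, actions)

def decision_nodes(state: np.ndarray) -> list[str]:
    """Return the feature variables tested on the root-to-leaf path for `state`."""
    path_nodes = tree.decision_path(state.reshape(1, -1)).indices
    return [FEATURES[tree.tree_.feature[n]] for n in path_nodes
            if tree.tree_.feature[n] >= 0]              # skip leaf nodes

print(decision_nodes(np.array([12, 3, 2, 22])))
```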

