

Effective data generation for explaining decision services
source link: https://blog.kie.org/2021/10/effective-data-generation-for-explaining-decision-services.html?utm_campaign=Feed%3A+droolsatom+%28Drools+-+Atom%29
When working with decision services, it’s often hard to understand the rationale behind the output for a given prediction. A noteworthy example is a loan-approval decision service that denies a user’s loan request: the user would surely like to know the rationale behind the denial.
Within the TrustyAI initiative we developed an optimized implementation of LIME (Local Interpretable Model-agnostic Explanations) that is well suited for the decision-service scenario (see our preprint). LIME is a widely known algorithm for detecting which input features were most responsible for the output of a classifier, regressor or similar model.
One key aspect of LIME is to meaningfully perturb the original input features in order to generate close yet reasonable copies of the input data to pass to the AI model/decision service. These perturbed copies are used to fit a surrogate (linear) model, whose weights yield the final feature scores.
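The perturb-then-fit loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TrustyAI implementation (which is written in Java): the names `explain_locally`, `solve_linear_system` and `toy_service` are hypothetical, the perturbation is plain Gaussian noise, and the proximity weighting uses a simple exponential kernel.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def explain_locally(predict, original, n_samples=200, scale=1.0, seed=0):
    """Fit a weighted linear surrogate around `original`; return one score per feature."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Assumption: every feature is numeric and perturbed with Gaussian noise.
        perturbed = [v + rng.gauss(0.0, scale) for v in original]
        distance = math.dist(perturbed, original)
        rows.append([1.0] + perturbed)  # leading 1.0 is the intercept column
        targets.append(predict(perturbed))
        # Closer copies of the input get a larger weight in the fit.
        weights.append(math.exp(-distance ** 2 / (2 * scale ** 2)))
    # Weighted normal equations: (X^T W X) beta = X^T W y
    k = len(original) + 1
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
            for i in range(k)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[1:]  # drop the intercept, keep the per-feature weights


def toy_service(features):
    # Hypothetical stand-in for a black-box decision service.
    return 2.0 * features[0] - 1.0 * features[1] + 0.5


scores = explain_locally(toy_service, [3.0, 4.0])
print(scores)
```

Because the toy service is itself linear, the surrogate recovers its coefficients (close to 2.0 and -1.0); with a real black box the scores only describe its behavior near the original input.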
One of the difficulties in extending the original LIME implementation to the decision-service scenario has been dealing with missing training data. LIME relies on the data used to train the model (a classifier, for instance) to decide how to perturb numerical features: it uses the training data points to determine which values a perturbed numerical feature could reasonably take.
Imagine a loan approval request with a number of features, one of which is the number of people in the family of the person making the request. Valid values for such a feature would be 1, 2, 3, 4, etc., not -1, 0.123 or 10000.
In a common machine learning scenario, scanning the existing values of that feature in the training set would make it clear that good values are integers no smaller than 1 and rarely bigger than 20. However, in the decision-service scenario we cannot make any such assumption, as the decision service could be anything (a proper black box), from a rule-based engine to a neural network. Additionally, the training data, even if originally available, might not be available at the time the explanation is requested.
To address this concern, the TrustyAI implementation of LIME originally started with a simple solution: to generate new values for a given numerical feature, sample values from a normal distribution centered on the original feature value. For example, if the original feature value were 3, we could sample points under a bell curve centered on 3 (the graph in the original post marks the original value in red and the samples in white).
Here the sampled values would be 2.7, 3.2 and 3.3. The problem with these samples is that they are not meaningful for "number of people in the family": a family has either 2 or 3 members, not 2.7.
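A minimal Python sketch of this naive perturbation is shown here. The function name `perturb_numeric` and the unit standard deviation are assumptions made for illustration; they are not taken from the TrustyAI code.

```python
import random


def perturb_numeric(value, n_samples=3, std_dev=1.0, seed=42):
    """Draw perturbed copies of `value` from a normal distribution centered on it."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(value, std_dev) for _ in range(n_samples)]


# Perturb a "number of people in the family" value of 3.
samples = perturb_numeric(3)
print(samples)
```

The draws are real numbers, so almost none of them are whole values a family size could actually take.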
While training data might not be available, decision services are typically used over time on many different inputs. We can use those past predictions and a statistical technique called the bootstrap to calculate more accurate parameters for the Gaussian distribution. With the bootstrap you sample (with replacement) many times from the observed data and compute statistics such as the mean and standard deviation over the bootstrapped samples; these can then be used to build a better-suited normal distribution.
This way we obtain more samples; some of them might be meaningful (the red ones in the original post's graph), some might not. This only partially solves the problem, since many (white) points are still unlikely in reality.
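The bootstrap step can be sketched as follows, assuming the past values of the feature are available as a plain list. The function name `bootstrap_gaussian_params`, the example values and the choice of averaging the per-resample statistics are illustrative assumptions, not details confirmed by the TrustyAI implementation.

```python
import random
import statistics


def bootstrap_gaussian_params(past_values, n_resamples=1000, seed=0):
    """Estimate a mean and standard deviation by resampling with replacement."""
    rng = random.Random(seed)
    means, std_devs = [], []
    for _ in range(n_resamples):
        # One bootstrap resample: same size as the data, drawn with replacement.
        resample = rng.choices(past_values, k=len(past_values))
        means.append(statistics.fmean(resample))
        std_devs.append(statistics.pstdev(resample))
    # Assumption: average the per-resample statistics to get the final parameters.
    return statistics.fmean(means), statistics.fmean(std_devs)


# Hypothetical family sizes seen in past requests to the decision service.
past_family_sizes = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
mean, std_dev = bootstrap_gaussian_params(past_family_sizes)

# Sample new candidate values from the data-driven normal distribution.
rng = random.Random(1)
candidates = [rng.gauss(mean, std_dev) for _ in range(10)]
print(mean, std_dev, candidates)
```

The resulting distribution follows the spread of values the service has actually seen, but the candidates are still real numbers and can still be implausible.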
To filter out the bad points, we observe that the decision service is more confident when it makes predictions on "likely" inputs. So, given the samples generated with the bootstrap, we calculate the confidence of the decision service for each of them (regardless of the actual decision output). The confidence expresses how "confident" the service is in its output decision. We then plot the confidence for the inputs containing each of the generated data points.
We can pick an area in the plot where confidence is high (e.g. above the mean confidence value) and keep only those samples whose confidence falls in that area.
This leads to fewer but more likely points, which in turn generate more pertinent perturbations and therefore better explanations for the decision service at hand.
In our example the final list of generated samples still contains a couple of unlikely points (3.9, 4.1) while the other points are likely values for our "number of persons in the family" feature (0, 1, 3, 5, 6).
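The confidence-based filter can be sketched as below. Both `filter_by_confidence` and `toy_confidence` are hypothetical: the toy confidence simply peaks at whole numbers, standing in for whatever confidence score a real decision service reports, and the candidate values are made up so that the outcome mirrors the example above.

```python
import statistics


def filter_by_confidence(samples, confidence_of):
    """Keep only the samples whose confidence is above the mean confidence."""
    scored = [(s, confidence_of(s)) for s in samples]
    threshold = statistics.fmean([c for _, c in scored])
    return [s for s, c in scored if c > threshold]


def toy_confidence(family_size):
    # Hypothetical service confidence: highest at whole numbers,
    # dropping linearly to 0 halfway between two integers.
    return 1.0 - 2.0 * abs(family_size - round(family_size))


candidates = [0.0, 1.0, 2.4, 3.0, 3.5, 3.9, 4.1, 4.6, 5.0, 6.0]
kept = filter_by_confidence(candidates, toy_confidence)
print(kept)
```

With this toy confidence the clearly implausible values (2.4, 3.5, 4.6) are dropped, while near-integer values such as 3.9 and 4.1 survive, which is the same kind of residue described for the real example.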
If you’re curious about the technical details of the implementation you can check the related PR within the kogito-apps repository.