
Decision Trees — Introduction (ID3)


Have you ever wondered about learning from past experiences?

You meet different types of people throughout your life, and after some experience you get an idea of what kind of person you like, right? I mean, after several experiences with many humans, when you meet a new human, most of the time you can tell whether you like them or not. How do you do that? With ‘experience’, right? But you don’t keep all those years of experience at the front of your mind all the time; instead, it feels like some simple and quick decision mechanism is working inside your brain.

So, rather than going deeper into the biology of the brain, let’s try to build a similar mechanism at a simpler level.

Let’s say that after your encounters with several people, you decide you don’t want vampires to be your friends in the future :P

So you made a list of several people you met, their characteristics, and whether they turned out to be a vampire or not. (The “?” in the Shadow attribute is there because you met those people only in dark conditions, so you couldn’t verify whether they cast a shadow.)

[Image: table of people you met, their characteristics, and whether they turned out to be vampires]

After observing this data, we may come up with a naive model such as this tree:

[Image: a naive, hand-built decision tree for the vampire dataset]

Since that tree lets us make a decision, we call it a “Decision Tree”. This tree must be consistent with all the data in the given dataset, and we hope that it will also handle future inputs well.

But how could we come up with such a tree? The tree given above was made just from some casual observations of the data…

It was based on the following observations:

  • All people with a pale complexion are not vampires.
  • All people who have a ruddy complexion and eat garlic are not vampires; if they don’t eat garlic, they are vampires.
  • All people who have an average complexion and either don’t cast a shadow, or whose shadow we couldn’t check, are vampires; if they do cast a shadow, they are not vampires.

But is that the right way to build a decision tree? Is that tree the simplest one we can get from the given dataset?

Such random analysis on a large dataset will not be feasible. We need some systematic approach to attack this problem.

Let’s try to attack this with a greedy approach…

So first, we look at the dataset and decide which attribute we should pick for the root node of the tree…

This is a Boolean classification, so at the end of the decision tree we have 2 possible results (either the person is a vampire or not), and each example input will be classified as either true (a positive example) or false (a negative example).

Here ‘P’ refers to positive, which means the person is a vampire, and ‘N’ refers to negative, which means the person is not a vampire.

We want the attribute that divides the most data into homogeneous sets, meaning sets where only Ps or only Ns exist, because for such a set we can definitely answer whether a person is a vampire or not, and those sets will become the leaf nodes of the tree.

[Image: the subsets of Ps and Ns produced by splitting on each attribute]

Check each attribute and see which one puts the highest number of elements into homogeneous sets. Here we find that the ‘Shadow’ attribute has the highest count of elements in homogeneous sets, so we choose this attribute.
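As a rough sketch of this counting step in code, here is one way to do it. The row-of-dicts representation, the lowercase attribute names, and the toy rows below are illustrative assumptions, not the article’s exact table:

```python
from collections import defaultdict

def homogeneous_count(rows, attribute, target="vampire"):
    """Count how many rows land in homogeneous subsets
    (subsets whose rows all share the same target label)
    when we split `rows` on `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    return sum(len(labels) for labels in groups.values()
               if len(set(labels)) == 1)

# Toy rows, made up purely to show the call (not the article's table).
rows = [
    {"shadow": "yes", "garlic": "yes", "vampire": "N"},
    {"shadow": "yes", "garlic": "no",  "vampire": "N"},
    {"shadow": "no",  "garlic": "no",  "vampire": "P"},
    {"shadow": "?",   "garlic": "yes", "vampire": "P"},
    {"shadow": "?",   "garlic": "no",  "vampire": "N"},
]

for attr in ("shadow", "garlic"):
    print(attr, homogeneous_count(rows, attr))
```

The attribute with the highest count is the one we would pick for the root.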

So far, this is how much of the tree we have…

[Image: partial decision tree with Shadow as the root node]

For Shadow = “yes” and Shadow = “no”, we can already decide whether a person is a vampire or not, but in the case of “?” we don’t know, so we need to decide which attribute divides the data well when Shadow = ‘?’.

So, let’s analyze another attribute while the shadow is unknown…

[Image: the subsets of Ps and Ns for each remaining attribute, restricted to the examples where Shadow = ‘?’]

Here we find that the “Garlic?” attribute divides the maximum number of elements, in fact all of them, into homogeneous sets.

So, our tree now looks like this,

[Image: decision tree with Shadow at the root and Garlic? under the ‘?’ branch]

This tree looks simpler than the one we created by picking random attributes, so we observe that the greedy approach helps us get better results.

But, is that the right way to do so?

No, because if the dataset is large, we will not necessarily end up with attributes that divide the data into homogeneous sets; we may find that for every attribute the number of elements in homogeneous sets is 0.

How should we proceed then?

So now let’s dive into the ID3 algorithm for generating decision trees, which uses the notion of information gain, which is defined in terms of entropy, the fundamental quantity in information theory.

Imagine these 2 splits produced by some attribute…

[Image: two example splits, one with an equal number of Ps and Ns and one with more Ps than Ns]

We observe that the one on the left has an equal number of Ps and Ns, so it doesn’t give us any hint about the decision, but the one on the right has more Ps than Ns, so it may direct us somewhat towards P; of these two, we would pick the right one.

So, instead of scoring both of them 0 right away, let’s go another way. Let’s say a set where the Ps and Ns are in equal numbers has the highest entropy (1), and a set where there are only Ps or only Ns has the lowest entropy (0). We can picture this as a P/(P+N) vs. Entropy graph.

[Image: graph of Entropy vs. P/(P+N), peaking at 1 when P/(P+N) = 0.5 and dropping to 0 at 0 and 1]

So, when P = N, P/(P+N) = 0.5 and Entropy = 1,

and when P = k (some positive integer) and N = 0, Entropy = 0.

That looks like pretty much the right graph to achieve what we want, so is there some mathematical way to get this curve?

Luckily for us, this curve can be achieved by the following equation:

y = -x·log2(x) - (1 - x)·log2(1 - x)

which can be written in P/(P+N) and Entropy form by replacing x = P/(P+N) (so that 1 - x = N/(P+N)) and y = Entropy:

Entropy = -(P/(P+N))·log2(P/(P+N)) - (N/(P+N))·log2(N/(P+N))

where P and N are the counts of Ps and Ns in the set whose entropy we are computing.
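To sanity-check the formula, here is a minimal Python sketch of this entropy function (the function name is my choice; treating a homogeneous set as having entropy 0 follows the standard convention that x·log2(x) goes to 0 as x goes to 0):

```python
import math

def entropy(p, n):
    """Entropy of a set with p positive and n negative examples,
    using Entropy = -x*log2(x) - (1 - x)*log2(1 - x), x = p / (p + n)."""
    if p == 0 or n == 0:      # homogeneous set: entropy is 0 by convention
        return 0.0
    x = p / (p + n)
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

print(entropy(3, 3))  # P = N           -> 1.0
print(entropy(5, 0))  # only positives  -> 0.0
print(entropy(4, 2))  # mixed set       -> ~0.918
```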

We want to find the information gain from an attribute, which is defined as follows.

(IG, the information gain from some attribute A, is the expected reduction in entropy.)

IG(Attribute) = Entropy of the parent set - weighted average of the entropies of the child sets produced by splitting on that attribute

For example,

[Image: example calculation of information gain for one attribute]
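As a sketch of that calculation in code, here is one way to compute information gain directly from (P, N) counts; the `parent_counts`/`child_counts` representation and the example numbers below are illustrative choices, not the counts from the article’s figure:

```python
import math

def entropy(p, n):
    """Binary entropy of a set with p positives and n negatives."""
    if p == 0 or n == 0:
        return 0.0
    x = p / (p + n)
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def information_gain(parent_counts, child_counts):
    """IG = entropy of the parent set minus the weighted average
    of the entropies of the child sets produced by the split."""
    P, N = parent_counts
    total = P + N
    weighted = sum((p + n) / total * entropy(p, n) for p, n in child_counts)
    return entropy(P, N) - weighted

# Illustrative split: a parent with 5 Ps and 3 Ns divided into three
# child sets by some attribute's values.
print(information_gain((5, 3), [(3, 0), (0, 3), (2, 0)]))  # ~0.954
```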

Now that you have the idea of Entropy and Information Gain, let’s build our decision tree again from scratch with this new approach!

[Image: Information Gain computed for each attribute on the full dataset]

We observe here that we get the maximum Information Gain from the Shadow attribute, so we choose it as our root node.

[Image: partial decision tree with Shadow as the root node]

We still need to decide on another attribute for Shadow = ‘?’.

[Image: Information Gain for the remaining attributes on the examples where Shadow = ‘?’]

We get the maximum Information Gain from the Garlic? attribute,

So our tree will look like this,

[Image: final decision tree with Shadow at the root and Garlic? under the ‘?’ branch]

This is exactly the same tree as with the previous approach, because at each step we were lucky enough to find attributes that divide the examples into homogeneous sets. The Information Gain approach, however, is more robust and can be applied to build a decision tree from a large dataset.
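To tie everything together, here is a minimal, generic sketch of the ID3 recursion described above; the data representation (a list of dicts with a target column such as "vampire") and the majority-vote fallback when attributes run out are assumptions of this sketch, not details spelled out in the article:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of a list of class labels, e.g. ['P', 'N', 'P'].
    For two classes this reduces to the P/(P+N) formula above."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        x = count / total
        result -= x * math.log2(x)
    return result

def information_gain(rows, attribute, target):
    """Expected reduction in entropy from splitting `rows` on `attribute`."""
    parent = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

def id3(rows, attributes, target):
    """Return a nested dict {attribute: {value: subtree_or_label}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:          # homogeneous set -> leaf node
        return labels[0]
    if not attributes:                 # nothing left to split on -> majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree
```

Called on the vampire table with its attribute list and the “vampire” column as the target, this would grow the tree by repeatedly picking the attribute with the highest information gain, just as the walkthrough above does by hand.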

