
Random Forests for Complete Beginners


In my opinion, most Machine Learning tutorials aren’t beginner-friendly enough.

Last month, I wrote an introduction to Neural Networks for complete beginners. This post will adopt the same strategy, meaning it again assumes ZERO prior knowledge of machine learning. We’ll learn what Random Forests are and how they work from the ground up.

Ready? Let’s dive in.

1. Decision Trees 🌲

A Random Forest 🌲🌲🌲 is actually just a bunch of Decision Trees 🌲 bundled together (ohhhhh 💡 that’s why it’s called a forest). We need to talk about trees before we can get into forests.

Look at the following dataset:

[Figure: The Dataset]

If I told you that there was a new point with an x coordinate of 1, what color do you think it’d be?

Blue, right?

You just evaluated a decision tree in your head:

decision-tree.svg

That’s a simple decision tree with one decision node that tests x < 2. If the test passes (x < 2), we take the left branch and pick Blue. If the test fails (x ≥ 2), we take the right branch and pick Green.

[Figure: The Dataset, split at x = 2]
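That one-node tree is small enough to write out directly. Here’s a minimal sketch in Python (my own illustration, not code from the post):

```python
def classify(x, y):
    """The one-node decision tree above: a single test on the x coordinate."""
    if x < 2:
        return "blue"   # test passes: take the left branch
    return "green"      # test fails (x >= 2): take the right branch

print(classify(1, 1))   # "blue" -- the new point with an x coordinate of 1 (y chosen arbitrarily)
```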

Decision Trees are often used to answer that kind of question: given a labelled dataset, how should we classify new samples?

Labelled: Our dataset is labelled because each point has a class (color): blue or green.

Classify: To classify a new datapoint is to assign a class (color) to it.

Here’s a dataset that has 3 classes now instead of 2:

[Figure: The Dataset v2]

Our old decision tree doesn’t work so well anymore. Given a new point (x, y):

  • If x ≥ 2, we can still confidently classify it as green.
  • If x < 2, we can’t immediately classify it as blue - it could be red, too.

We need to add another decision node to our decision tree:

decision-tree2.svg

dataset2-split.svg

Pretty simple, right? That’s the basic idea behind decision trees.
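In code, the extra decision node is just a nested test. Here’s a sketch (my own code, not the post’s): the y threshold of 2 comes from section 2.2 below, and which color sits on which side is inferred from the post’s convention that the left branch is the passing branch plus section 2.3’s note that the bottom-left leaf is Blue.

```python
def classify(x, y):
    """The two-node decision tree for the 3-class dataset."""
    if x < 2:
        # Left branch: blues and reds, separated by a second test on y.
        # (Assumption: blues pass the y test, per the leaf layout described in section 2.3.)
        return "blue" if y < 2 else "red"
    return "green"      # x >= 2: always green
```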

2. Training a Decision Tree

Let’s start training a decision tree! We’ll use the 3 class dataset again:

[Figure: The Dataset v2]

2.1 Training a Decision Tree: The Root Node

Our first task is to determine the root decision node in our tree. Which feature (x or y) will it test on, and what will the test threshold be? For example, the root node in our tree from earlier used the x feature with a test threshold of 2:

decision-tree2-root.svg

Intuitively, we want a decision node that makes a “good” split, where “good” can be loosely defined as separating different classes as much as possible. The root node above makes a “good” split: all the greens are on the right, and no greens are on the left.

Thus, our goal is now to pick a root node that gives us the “best” split possible. But how do we quantify how good a split is? It’s complicated. I wrote an entire blog post about one way to do this using a metric called Gini Impurity. ← I recommend reading it right now before you continue - we’ll be using those concepts later in this post.

Welcome back!

Hopefully, you just read my Gini Impurity post. If you didn’t, here’s a very short TL;DR: We can use Gini Impurity to calculate a value called Gini Gain for any split. A better split has higher Gini Gain.
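In code form, that TL;DR looks roughly like this (a sketch of the formulas, not the Gini post’s implementation):

```python
from collections import Counter

def gini_impurity(labels):
    """G = sum over classes of p(i) * (1 - p(i))."""
    n = len(labels)
    return sum(c / n * (1 - c / n) for c in Counter(labels).values())

def gini_gain(parent_labels, branches):
    """Parent impurity minus the size-weighted impurity of each branch."""
    n = len(parent_labels)
    weighted = sum(len(b) / n * gini_impurity(b) for b in branches)
    return gini_impurity(parent_labels) - weighted
```

A better split is simply one with a higher gini_gain.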

Back to the problem of determining our root decision node. Now that we have a way to evaluate splits, all we have to do is find the best split possible! For the sake of simplicity, we’re just going to try every possible split and use the best one (the one with the highest Gini Gain). This is not the fastest way to find the best split, but it is the easiest to understand.

Trying every split means trying

  • Every feature (x or y).
  • All “unique” thresholds. We only need to try thresholds that produce different splits.

For example, here are the thresholds we might select if we wanted to use the x coordinate:

[Figure: x Thresholds]
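The post doesn’t spell out how these candidate thresholds are generated; one common choice (an assumption here, not something stated above) is the midpoint between each pair of adjacent values, since every threshold inside the same gap produces exactly the same split:

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values -- one candidate per gap."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

print(candidate_thresholds([1.0, 1.0, 2.0, 3.0]))  # [1.5, 2.5]
```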

Let’s do an example Gini Gain calculation for the x = 0.4 split.

This split sends the single point with the smallest x coordinate down the left branch, and the other 8 points down the right branch.

First, we calculate the Gini Impurity of the whole dataset:

$$\begin{aligned} G_{initial} &= \sum_{i=1}^3 p(i) * (1 - p(i)) \\ &= 3 * (\frac{1}{3} * \frac{2}{3}) \\ &= \boxed{\frac{2}{3}} \end{aligned}$$

Then, we calculate the Gini Impurities of the two branches:

$$G_{left} = 0 * 1 + 1 * 0 + 0 * 1 = \boxed{0}$$

$$\begin{aligned} G_{right} &= \frac{3}{8} * \frac{5}{8} + \frac{2}{8} * \frac{6}{8} + \frac{3}{8} * \frac{5}{8} \\ &= \boxed{\frac{21}{32}} \end{aligned}$$

Finally, we calculate Gini Gain by subtracting the weighted branch impurities from the original impurity:

$$\begin{aligned} \text{Gain} &= G_{initial} - \frac{1}{9} G_{left} - \frac{8}{9} G_{right} \\ &= \frac{2}{3} - \frac{1}{9} * 0 - \frac{8}{9} * \frac{21}{32} \\ &= \boxed{0.083} \end{aligned}$$

Confused about what just happened? I told you you should’ve read my Gini Impurity post. It’ll explain all of this Gini stuff.
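If you’d rather check the arithmetic than re-read, here’s the same calculation transcribed into Python. The branch counts are read straight off the equations above: 1 point of a single class goes left, and 3 + 2 + 3 points go right.

```python
from fractions import Fraction as F

g_initial = 3 * (F(1, 3) * F(2, 3))                               # 2/3
g_left    = F(0)                                                  # one point, one class: perfectly pure
g_right   = F(3, 8)*F(5, 8) + F(2, 8)*F(6, 8) + F(3, 8)*F(5, 8)   # 21/32
gain      = g_initial - F(1, 9) * g_left - F(8, 9) * g_right      # weights: 1 of 9 points left, 8 of 9 right

print(gain, float(gain))  # 1/12 ≈ 0.083
```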

We can calculate Gini Gain for every possible split in the same way:

| Split   | Gini Gain |
| ------- | --------- |
| x = 0.4 | 0.083     |
| x = 0.8 | 0.048     |
| x = 1.1 | 0.133     |
| x = 1.3 | 0.233     |
| x = 2   | 0.333     |
| x = 2.4 | 0.191     |
| x = 2.8 | 0.083     |
| y = 0.8 | 0.083     |
| y = 1.2 | 0.111     |
| y = 1.8 | 0.233     |
| y = 2.1 | 0.233     |
| y = 2.4 | 0.111     |
| y = 2.7 | 0.048     |
| y = 2.9 | 0.083     |

[Figure: All Thresholds]

After trying all thresholds for both x and y, we’ve found that the x = 2 split has the highest Gini Gain, so we’ll make our root decision node use the x feature with a threshold of 2. Here’s what we’ve got so far:

decision-tree2-build1.svg

Making progress!

2.2 Training a Decision Tree: The Second Node

Time to make our second decision node. Let’s (arbitrarily) go to the left branch. We’re now only using the datapoints that would take the left branch (i.e. the datapoints satisfying x < 2), specifically the 3 blues and 3 reds.

To build our second decision node, we just do the same thing! We try every possible split for the 6 datapoints we have and realize that y = 2 is the best split. We make that into a decision node and now have this:

decision-tree2-build2.svg

Our decision tree is almost done…

2.3 Training a Decision Tree: When to Stop?

Let’s keep it going and try to make a third decision node. We’ll use the right branch from the root node this time. The only datapoints in that branch are the 3 greens.

Again, we try all the possible splits, but they all

  • Are equally good.
  • Have a Gini Gain of 0 (the Gini Impurity was already 0 and can’t go any lower).

It doesn’t make sense to add a decision node here because doing so wouldn’t improve our decision tree. Thus, we’ll make this node a leaf node and slap the Green label on it. This means that we’ll classify any datapoint that reaches this node as Green.

If we continue to the 2 remaining nodes, the same thing will happen: we’ll make the bottom left node our Blue leaf node, and we’ll make the bottom right node our Red leaf node. That brings us to the final result:

decision-tree2.svg

Once all possible branches in our decision tree end in leaf nodes, we’re done. We’ve trained a decision tree!
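To tie the whole training procedure together, here’s a compact sketch of the greedy algorithm we just walked through: try every feature and every “unique” threshold, keep the split with the highest Gini Gain, and recurse until no split improves a branch. This is illustrative code written for this walkthrough, not the post’s implementation.

```python
from collections import Counter

def gini(labels):
    """Gini Impurity: sum of p(i) * (1 - p(i)) over the classes present."""
    n = len(labels)
    return sum(c / n * (1 - c / n) for c in Counter(labels).values())

def best_split(points, labels):
    """Try every feature and every midpoint threshold; return the
    (gain, feature, threshold) with the highest Gini Gain, or None
    if no split has positive gain (e.g. the node is already pure)."""
    n, n_features = len(points), len(points[0])
    g_parent = gini(labels)
    best = None
    for f in range(n_features):
        values = sorted({p[f] for p in points})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for p, lab in zip(points, labels) if p[f] < t]
            right = [lab for p, lab in zip(points, labels) if p[f] >= t]
            gain = g_parent - len(left) / n * gini(left) - len(right) / n * gini(right)
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, f, t)
    return best

def build_tree(points, labels):
    """Grow the tree greedily; make a leaf when no split helps."""
    split = best_split(points, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority (here, only) class
    _, f, t = split
    left = [(p, lab) for p, lab in zip(points, labels) if p[f] < t]
    right = [(p, lab) for p, lab in zip(points, labels) if p[f] >= t]
    return {"feature": f, "threshold": t,
            "left": build_tree(*zip(*left)),
            "right": build_tree(*zip(*right))}

def predict(tree, point):
    """Walk down the tree: left branch if the test passes, right otherwise."""
    while isinstance(tree, dict):
        tree = tree["left"] if point[tree["feature"]] < tree["threshold"] else tree["right"]
    return tree
```

Run on the dataset above, this should reproduce the tree we just built by hand: a root split on x, a second split on y, and three leaves (the exact threshold values may differ slightly, since this sketch always splits at midpoints).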

3. Random Forests 🌲🌳🌲🌳🌲

We’re finally ready to talk about Random Forests. Remember what I said earlier?

A Random Forest is actually just a bunch of Decision Trees bundled together.

That’s true, but it’s a bit of a simplification.

3.1 Bagging

Consider the following algorithm to train a bundle of decision trees given a dataset of n points:

  1. Sample, with replacement, n training examples from the dataset.
  2. Train a decision tree on the n samples.
  3. Repeat t times, for some t.

To make a prediction using this model with t trees, we aggregate the predictions from the individual decision trees and either

  • Take the majority vote if our trees produce class labels (like colors).
  • Take the average if our trees produce numerical values (e.g. when predicting temperature, price, etc).

This technique is called bagging, or bootstrap aggregating. The sample we drew with replacement is known as a bootstrap sample.
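Here’s a minimal sketch of that algorithm, using NumPy arrays for the data and scikit-learn’s DecisionTreeClassifier as a stand-in for the tree we trained by hand (an assumption for convenience; the post itself doesn’t use scikit-learn):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_bagged_trees(X, y, t, seed=0):
    """Train t trees, each on a bootstrap sample of n points drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)                       # bootstrap sample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, x):
    """Aggregate the t predictions by majority vote (use the mean instead for regression)."""
    votes = [tree.predict([x])[0] for tree in trees]
    return Counter(votes).most_common(1)[0][0]
```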

[Figure: Bagged Decision Trees predicting color]

Bagged decision trees are very close to Random Forests - they’re just missing one thing…

3.2 Bagging → Random Forest

Bagged decision trees have only one parameter: t, the number of trees.

Random Forests have a second parameter that controls how many features to try when finding the best split. Our simple dataset for this tutorial only had 2 features (x and y), but most datasets will have far more (hundreds or thousands).

Suppose we had a dataset with p features. Instead of trying all features every time we make a new decision node, we only try a subset of the features, usually of size √p or p/3. We do this primarily to inject randomness that makes individual trees more unique and reduces correlation between trees, which improves the forest’s performance overall. This technique is sometimes referred to as feature bagging.
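In scikit-learn terms (again an aside, not something from the post), feature bagging is just the max_features knob: set it on each tree inside the bagging loop above, or use RandomForestClassifier, which handles both bagging and feature bagging for you.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Bagging -> Random Forest: each split only considers a random subset of ~sqrt(p) features.
tree   = DecisionTreeClassifier(max_features="sqrt")   # drop-in replacement in the bagging loop above
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
# forest.fit(X, y); forest.predict(X_new)
```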

4. Now What?

That’s a beginner’s introduction to Random Forests! A quick recap of what we did:

  • Introduced decision trees , the building blocks of Random Forests.
  • Learned how to train decision trees by iteratively making the best split possible.
  • Defined Gini Impurity, a metric used to quantify how “good” a split is.
  • Saw that a random forest = a bunch of decision trees.
  • Understood how bagging combines predictions from multiple trees.
  • Learned that feature bagging is the difference between bagged decision trees and a random forest.

A few things you could do from here:

That concludes this tutorial. I like writing about Machine Learning (but also other topics), so subscribe if you want to get notified about new posts.

Thanks for reading!

