Summary of Chapter 6, Decision Trees 第六章总结 决策树

What is Decision Tree? How is it different from other models? Let’s understand our first non-linear model, non-parametric model. 何为决策树?跟其他模型区别在哪儿?来了解我们第一个非线性的非参数的模型。

In this chapter summary, I will briefly introduce how Decision Tree in Scikit-Learn works, as well as measurement of impurity. 在这节总结中,我将展示决策树的简单原理。

Table of Content 内容提要

  • CART 分类和回归树
  • Impurity 不纯度
  • Cost function 代价函数
  • Probability 概率
  • Regularization 正则
  • Regression 回归

Like SVMs (introduced in Chapter 5 Summary), Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks. They are very powerful algorithms, capable of fitting complex datasets.

What is a Decision Tree 决策树是什么

What is a decision tree? Decision Trees are tree-like models, where each node contain a subset of the datasets and, if it is a nonleaf node, makes decisions on how the subset can be further split into more branches.
'Decision Tree'
In the chart above, there are 150 samples in the top-level node. These samples are split into two subsets based on whether their pental length<=2.45.
在上图中,初始节点(顶层节点)有150个样本。然后根据pental length<=2.45的标准,分成了两个子集。

Classification and Regression Tree (CART) 分类和回归树

There are many algorithms for Decision Trees. Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have two children. As you can tell from the name, CART can be applied to both classification and regression problems. You can probably tell that there are also algorithms produces more than just two children; one example is ID3.

Impurity 不纯度

At nonleaf nodes, Decision Trees make decision on how to further split the dataset based on certain criteria. How are these criteria decided? Intuitively, in a classification problem, we wish a split could separate instances of different classes as distinct as possible. In the best case, one subset will contain instances from one class, the other subset the other class. In another word, we hope the subset as pure as possible. And the best split should give the least impurity.

Now we need some measurements of impurity. Here are two common ones.

  • gini_impurity 基尼不纯度

  • Entropy

subscript $i$ is for nodes, subscript $k$ for classes. For example, in the decision tree shown just now, at the top-level node, $p_{i,1}=p_{i,2}=p_{i,3}=\frac{50}{150}=0.33$, $G_i = 1-\sum_{k=1}^3p_{i,k}^2=1-3\cdot 0.33\cdot 0.33=0.667$.
其中,角标 $i$ 是节点,角标 $k$ 对应类。在之前的决策树例子中,初始节点中基尼不纯度的运算如上。

It is easy to tell that these two metric changes almost the same way when $p_{i,k}$ changes. So they are almost equivalent.
很容易看出,当$p_{i,k}$ 变化时,这两个标准的值的变化几乎一致。这两种标准基本上是等价的。

A reduction of entropy is often called an information gain.

Cost function for Classification 分类问题的代价函数

To minimize the impurity, we can subsequently define the cost function of each split as 以着最小化不纯度的标准,定义代价函数如下 'Cost function for classification'

Regularization 正则

Now we know how to split at a single node. Iterating this process on all the produced children node generates a decision tree.

When do we stop splitting? If we only stop when there is no more impurity, we are obviously just remembering the mapping, which is an extreme case of overfitting. To regularize, there are various hyperparameters to put an early stop to the decision tree growing. Common hyperparameters are max_depth, min_sample_split, etc. And again, cross-validation is our good friend on checking overfitting.
不过,分割过程该何时停止呢?如果分到分无可分,我们相当于把整个数据集背了下来。这明显是极端过拟合的情况。有很多超参数可以让决策树提前停止迭代。常见的超参数有max_depth, min_sample_split等。当然,别忘了我们的检验过拟合的好朋友cross-validation.

Probability 概率

A Decision Tree can also estimate the probability that an instance belongs to a particular class k: first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in this node.

Regression 回归

We have previously mentioned that CART can also be used for regression problem. The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.

'Cost function for regression'

Discussion 讨论

A single decision tree model suffers a few issues. For example, the split can only be orthogonal to the axes, which makes it suffer from dataset rotation. Besides, Decision Trees are sensitive to small variations in the training data.
How to overcome these issues? You will see it in the next summary, the Ensemble methods. 单一的决策树面临一些问题。比如,分割一定是垂直于坐标轴的,数据集相对于坐标轴的旋转会导致些困扰。决策树也对训练集的改变很敏感。如何解决这些问题呢?敬请期待下篇总结:集成方法。

This article is part of a series of summaries on the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. The summaries are meant to explain machine learning concepts and ideas, instead of covering the maths behind the models.
本文是《Hands-On Machine Learning with Scikit-Learn and TensorFlow》这本书的总结随笔系列的一部分。总结旨在解释机器学习的观念和想法,而不是数学和模型

Written on May 18, 2018