# Summary of Chapter 3, Classification

Will this stock price rise? Is this merchant trustworthy? Let’s take a look at how machine learning approaches classification problems.

This article opens a door to classification problems in machine learning, from the simple binary classification to more complex classification problems.

1. Binary Classification
2. Performance Measures
• Cross-validation
• Confusion Matrix
• Precision, Recall, and F1 score
• Sensitivity, Specificity, and ROC
3. Multiclass Classification
4. Models

### Binary Classification

Anyone who has not heard of the MNIST dataset has probably not spent much time on classification problems. Just as programmers print “hello world” when trying a new language, machine learning practitioners usually use the MNIST dataset when trying a new classification algorithm.
So what is this magical dataset? It is a dataset of handwritten digits from 0 to 9. For example, here is the number 5.

Binary classification, as the name suggests, means classifying instances into two classes: “Yes” or “No”, “Go” or “No Go”, etc. You can probably guess that this type of algorithm is great for decision making or identification. And you are right.
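As a minimal sketch of the idea, the snippet below turns digit recognition into a yes/no question: “is this digit a 5?”. It rests on a few assumptions that are not from the chapter summary itself: scikit-learn's small bundled digits dataset (8x8 images) stands in for MNIST so nothing has to be downloaded, and logistic regression is picked only as one possible binary classifier.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: the small digits dataset shipped with scikit-learn is used
# as a stand-in for MNIST.
X, y = load_digits(return_X_y=True)

# Turn the ten-digit problem into a binary one: "is this digit a 5?"
y_is_5 = (y == 5)

# Hold back 20% of the images to check the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_is_5, test_size=0.2, random_state=42, stratify=y_is_5
)

# Logistic regression is just one possible choice of binary classifier.
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(f"Accuracy on held-out digits: {test_accuracy:.3f}")
```

Note that roughly nine out of ten images are not a 5, so even a model that always answers “No” would look fairly accurate here.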

### Performance Measures

#### Cross-Validation

Cross-validation is one of the most important concepts in machine learning. It is so popular that you see it everywhere, from the simplest models like linear regression to sophisticated ones like deep neural networks. The concept is simple and intuitive. If something works for you, say a time management strategy or an exercise plan, it does not necessarily work for others. Only after a few other people have verified it can you safely say that it is generally effective.
In our case, this “something” is a machine learning model. Here is how you do it. You train a model on part of the training data you have, then apply the trained model to the rest of the training data. After that, you compare the scores on both parts. If the two scores are similarly good, the model is generally effective. The first part of the data is usually called the training set, and the second part is called the validation set. Cross-validation repeats this procedure with several different splits, so that each part of the data takes a turn as the validation set.
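The procedure can be sketched with scikit-learn as follows. This is only an illustration under assumptions of my own: the bundled digits dataset, the “is it a 5?” task, logistic regression as the model, and three folds are all arbitrary choices. The first half does a single training/validation split and compares the two scores; the second half lets `cross_val_score` repeat the split so every part of the data is used for validation once.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Assumption: bundled digits dataset, binary task "is this digit a 5?"
X, y = load_digits(return_X_y=True)
y_is_5 = (y == 5)

# One split: train on one part, validate on the rest, compare the scores.
X_train, X_val, y_train, y_val = train_test_split(
    X, y_is_5, test_size=0.25, random_state=0, stratify=y_is_5
)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
val_score = model.score(X_val, y_val)
print(f"training score:   {train_score:.3f}")
print(f"validation score: {val_score:.3f}")

# Cross-validation: the same idea repeated over 3 different splits (folds).
fold_scores = cross_val_score(
    LogisticRegression(max_iter=5000), X, y_is_5, cv=3, scoring="accuracy"
)
print("score per fold:", fold_scores)
```

A large gap between the training score and the validation score is the warning sign to look for: it suggests the model memorized its training data instead of learning something general.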

#### Confusion matrix, precision, recall, and F1 score

In machine learning, we need to quantify how well a model works. Accuracy is the natural choice, and in practice it is a good benchmark in most cases.

Nevertheless, as I mentioned in the chapter 2 summary, accuracy does not always tell us what we care about. It depends on the nature of the problem. For example, according to MOH Singapore, about 0.7% of the population is infected with HIV. In a project that predicts whether a person is infected with HIV, one can easily get about 99.3% accuracy by stating that nobody is infected.
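The arithmetic behind this example is easy to check. The sketch below uses a hypothetical population of 1,000 people, 7 of whom are infected (0.7%), and a “model” that labels everyone as not infected. The population size and variable names are made up for illustration.

```python
# Hypothetical population: 1,000 people, 0.7% of them infected.
population = 1000
infected = 7

# 1 = infected, 0 = not infected
y_true = [1] * infected + [0] * (population - infected)

# A useless "model" that states nobody is infected.
y_pred = [0] * population

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / population

# How many of the infected people did the model actually find?
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy = {accuracy:.1%}")
print(f"infected people found = {found} of {infected}")
```

The accuracy looks excellent, yet the model does not find a single infected person, which is exactly the case a screening project cares about.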

To discuss alternative performance measures, I first need to introduce the confusion matrix, shown in the picture below.

Here are a few definitions, where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives in the confusion matrix.

• Precision: the ratio of correct predictions among all positive predictions, $TP / (TP + FP)$.
• Recall: the ratio of correctly predicted positives among all actual positives, $TP / (TP + FN)$, also known as sensitivity or the true positive rate.
• $F_1$: a score that combines precision and recall, defined as their harmonic mean $2 \cdot \frac{precision \cdot recall}{precision + recall}$.

Most of the time, a rise in precision leads to a drop in recall. That is why the $F_1$ score is introduced as a combined metric: it helps us find a balance between the two.
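To make these measures concrete, here is a small sketch in plain Python that counts the four cells of a confusion matrix and derives precision, recall, and $F_1$ from them. The function names and the ten labels are invented for illustration; in practice a library such as scikit-learn provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Guard against division by zero when there are no positive
    # predictions or no actual positives.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print("TP, FP, FN, TN =", counts)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

Here the model makes three positive predictions and two of them are right (precision 2/3), but it only finds two of the four actual positives (recall 1/2). The $F_1$ score of 4/7 sits between the two and is pulled toward the lower one.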

#### Sensitivity, Specificity, Receiver operating characteristic (ROC)

Similar to the precision and recall pair, there is another pair of performance measures: sensitivity and specificity.
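As a rough sketch of how this pair works: sensitivity is the true positive rate (the same quantity as recall), and specificity is the true negative rate, i.e. the share of actual negatives that are correctly predicted as negative. An ROC curve plots sensitivity against 1 - specificity (the false positive rate) as the decision threshold is varied. The function name, labels, and scores below are made up for illustration.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs,
    one per decision threshold, going from strict to lenient."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start with a threshold nothing can reach, then try each score.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        pairs = list(zip(y_true, scores))
        tp = sum(1 for t, s in pairs if t == 1 and s >= threshold)
        fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)
        tpr = tp / positives   # sensitivity
        fpr = fp / negatives   # equals 1 - specificity
        points.append((fpr, tpr))
    return points


# Made-up labels and model scores (higher score = more likely positive).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"specificity = {1 - fpr:.2f}, sensitivity = {tpr:.2f}")
```

Lowering the threshold can only raise sensitivity, but it lowers specificity at the same time, which mirrors the trade-off between precision and recall.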

### Multiclass Classification

What if there are more classes than just “Yes” or “No”? Don’t worry: extending binary classification, we have multiclass classification. Some classification models can easily be extended from two classes to multiple classes, like the decision tree. Others are binary by nature, like SVM. In the second case, the multiclass problem is split into several binary classification problems and the results are combined. Again, I will not go deeper than the basic concept.
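One common way to do this splitting is the one-vs-rest strategy: train one binary classifier per class (“is it a 0 or not?”, “is it a 1 or not?”, and so on) and pick the class whose classifier is most confident. The sketch below shows it with scikit-learn's `OneVsRestClassifier` wrapped around an SVM. The strategy, the bundled digits dataset, and the default SVM settings are my own choices for illustration; one-vs-one, which trains a classifier for every pair of classes, is another option.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Assumption: bundled digits dataset with all ten classes (0 to 9).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# One binary SVM per digit; the wrapper combines their outputs.
ovr = OneVsRestClassifier(SVC())
ovr.fit(X_train, y_train)

n_binary_models = len(ovr.estimators_)
multiclass_accuracy = ovr.score(X_test, y_test)
print(f"binary classifiers trained: {n_binary_models}")
print(f"multiclass accuracy: {multiclass_accuracy:.3f}")
```

Ten classes lead to ten binary classifiers under one-vs-rest, which is why the number printed first matches the number of digits.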

### Models

To avoid introducing too many concepts at the same time, I have intentionally avoided mentioning any specific models until now. Here are several common models for classification. You can look them up on your own if you are interested.

• Support Vector Machine (SVM)
• Logistic Regression Classifier
• Nearest Neighbors
• Decision Tree
• Random Forest
• Naive Bayes
• XGBoost (It is more a library than a model, but I feel it is necessary to mention it as it performs extremely well on Kaggle.)
• Neural Network
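If you want to try several of these models side by side, a sketch like the following may help as a starting point. It maps most of the names above to scikit-learn classes and scores each one with 5-fold cross-validation. The iris dataset, the default hyperparameters, and the number of folds are assumptions made only to keep the example short, and XGBoost and neural networks are left out because they need extra setup.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the small iris dataset (3 classes) keeps the run fast.
X, y = load_iris(return_X_y=True)

models = {
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

mean_scores = {}
for name, model in models.items():
    # 5-fold cross-validation; the default score for classifiers is accuracy.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name:<24} mean accuracy = {mean_scores[name]:.3f}")
```

The ranking on such a tiny dataset says little about which model is better in general; the point is only that all of them share the same fit-and-score interface.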

This article is part of a series of summaries of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. The summaries are meant to explain machine learning concepts and ideas, instead of covering the maths and models.

Written on March 25, 2018