Diabetes Classification Using Decision Trees in R


Article Outline

  • What is a decision tree?
  • Why use them?
  • Data Background
  • Descriptive Statistics
  • Decision Tree Training and Evaluation
  • Decision Tree Pruning
  • Hyperparameter Tuning

What is a decision tree?

A decision tree is a flowchart-like representation of a decision process. The classification and regression tree (CART, a.k.a. decision tree) algorithm is usually credited to Breiman et al. (1984), but that work was certainly not the earliest. Wei-Yin Loh of the University of Wisconsin has written about the history of decision trees; you can read it in "Fifty Years of Classification and Regression Trees".

In a decision tree, the top node is called the "root node" and the bottom nodes "terminal nodes" (leaves). The nodes in between are called "internal nodes"; each internal node holds a binary split condition, while each leaf node carries an associated class label.

Photo by Saed Sayad on saedsayad.com

A classification tree uses split conditions to predict a class label from the supplied input variables. The splitting process starts at the root node; at each internal node, the supplied input value is routed recursively to the left or right child according to that node's splitting condition (chosen by Gini impurity or information gain). The process terminates when a leaf (terminal) node is reached.
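To make the split criteria concrete, here is a small, self-contained R sketch (not from the original article) of the two node-impurity measures rpart can use: Gini impurity and entropy, the latter being the basis of information gain.

```r
# Gini impurity of a node: 1 - sum(p_k^2), where p_k are the class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Entropy of a node: -sum(p_k * log2(p_k))
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

gini(c("pos", "pos", "pos", "pos"))    # pure node: 0
gini(c("pos", "pos", "neg", "neg"))    # maximally mixed node: 0.5
entropy(c("pos", "pos", "neg", "neg")) # maximally mixed node: 1
```

At each node, the algorithm chooses the split that maximizes the decrease in impurity from the parent node to the (size-weighted) child nodes.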

Why use them?

A single decision-tree model is easy to build, plot and interpret, which is what makes the algorithm so popular. You can use it for classification as well as regression tasks.

Data Background

In this example, we are going to use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).


This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.


The Pima Indian Diabetes 2 data set is the refined version (all missing values were assigned as NA) of the Pima Indian diabetes data. The data set contains the following independent and dependent variables.


Independent variables (symbol: I)

  • I1: pregnant: Number of times pregnant

  • I2: glucose: Plasma glucose concentration (glucose tolerance test)

  • I3: pressure: Diastolic blood pressure (mm Hg)

  • I4: triceps: Triceps skinfold thickness (mm)

  • I5: insulin: 2-hour serum insulin (mu U/ml)

  • I6: mass: Body mass index (weight in kg / (height in m)²)

  • I7: pedigree: Diabetes pedigree function

  • I8: age: Age (years)

Dependent variable (symbol: D)

  • D1: diabetes: diabetes case (pos/neg)

Aim of the Modelling

  • fitting a decision tree classification model that accurately predicts whether or not the patients in the data set have diabetes
  • pruning the decision tree to reduce overfitting
  • tuning the decision tree's hyperparameters

Loading relevant libraries

The first step of the analysis is to load the relevant libraries.

library(mlbench) # Diabetes dataset
library(rpart) # Decision tree
library(rpart.plot) # Plotting decision tree
library(caret) # Accuracy estimation
library(Metrics) # For different model evaluation metrics

Loading the dataset

The very next step is to load the data into the R environment. Since the data set ships with the mlbench package, it can be loaded with a call to data().

# load the diabetes dataset
data(PimaIndiansDiabetes2)

Data Preprocessing

The next step is to perform exploratory analysis. First, we remove the missing values using the na.omit() function. Then we print the data types using the glimpse() function from the dplyr library. You can see that all the variables except the dependent variable (diabetes: categorical/factor) are of type double.

Diabetes <- na.omit(PimaIndiansDiabetes2) # Data for modeling
dplyr::glimpse(Diabetes)

Train and Test Split

The next step is to split the data set into an 80% train and a 20% test partition. Here, we use the sample() method to draw, with replacement, a partition label (1 or 2) for each observation with probabilities 0.8 and 0.2. Then, based on this index, we split out the train and test data.

set.seed(123)
index <- sample(2, nrow(Diabetes), prob = c(0.8, 0.2), replace = TRUE)
Diabetes_train <- Diabetes[index == 1, ] # Train data
Diabetes_test <- Diabetes[index == 2, ] # Test data

The train data includes 318 observations and the test data 74 observations; both contain 9 variables.

print(dim(Diabetes_train))
print(dim(Diabetes_test))

Model Training

The next step is model training and evaluation of model performance.

Training a Decision Tree

For decision tree training, we will use the rpart() function from the rpart library. Its arguments include the model formula, the data and the method.

formula = diabetes ~ . means diabetes is predicted from all the independent variables (i.e., every variable except diabetes itself).

For a classification task, the method should be specified as "class".

# Train a decision tree model
Diabetes_model <- rpart(formula = diabetes ~ .,
                        data = Diabetes_train,
                        method = "class")

Model Plotting

The main advantage of a tree-based model is that you can plot the tree structure and read the decision mechanism off the plot.

# type = 0: draw a split label at each split and a node label at each leaf
# yesno = 2: label all branches with yes/no
# extra = 0: no extra information
rpart.plot(x = Diabetes_model, yesno = 2, type = 0, extra = 0)

Model Performance Evaluation

The next step is to see how the trained model performs on the test (unseen) data set. To predict the test data classes, we supply the model object, the test data set and type = "class" to the predict() function.

# Class prediction
class_predicted <- predict(object = Diabetes_model,
                           newdata = Diabetes_test,
                           type = "class")

(a) Confusion matrix

To evaluate the test performance we use confusionMatrix() from the caret library. We can observe that out of 74 observations the model predicts 17 incorrectly; it achieves about 77.03% accuracy with a single decision tree.

# Generate a confusion matrix for the test data
confusionMatrix(data = class_predicted,
                reference = Diabetes_test$diabetes)
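As a quick sanity check on the numbers above: with 17 of 74 test observations misclassified, the accuracy works out to (74 - 17) / 74.

```r
# Accuracy recomputed from the confusion-matrix counts reported above
misclassified <- 17
n_test <- 74
(n_test - misclassified) / n_test  # 0.7702703, i.e. about 77.03%
```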

(b) Test accuracy

We can also supply the original test labels and the predicted class labels to the accuracy() function to estimate the model's accuracy.

accuracy(actual = Diabetes_test$diabetes,
         predicted = class_predicted)

Splitting-Criteria-Based Model Comparison

While building the model, the decision tree algorithm uses a splitting criterion. Two criteria are popular in decision trees: one is called "gini" and the other "information gain". Here, we compare performance on the test set after training with each criterion. The splitting criterion is supplied as a list via the parms argument.

# Model training with the Gini-based splitting criterion
Diabetes_model1 <- rpart(formula = diabetes ~ .,
                         data = Diabetes_train,
                         method = "class",
                         parms = list(split = "gini"))

# Model training with the information-gain-based splitting criterion
Diabetes_model2 <- rpart(formula = diabetes ~ .,
                         data = Diabetes_train,
                         method = "class",
                         parms = list(split = "information"))

Model Evaluation on Test Data

After model training, the next step is to predict the class labels of the test data set.

# Class predictions on the test data, Gini-based splitting criterion
pred1 <- predict(object = Diabetes_model1,
                 newdata = Diabetes_test,
                 type = "class")

# Class predictions on the test data, information-gain-based splitting criterion
pred2 <- predict(object = Diabetes_model2,
                 newdata = Diabetes_test,
                 type = "class")

Prediction Accuracy Comparison

Next, we compare the accuracy of the models. Here, we can observe that the "gini"-based splitting criterion provides a more accurate model than "information"-based splitting.

# Compare classification accuracy on the test data
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred1)
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred2)

The initial model (Diabetes_model) and the "gini"-based model (Diabetes_model1) give the same accuracy, since rpart uses "gini" as its default splitting criterion.

Decision Tree Pruning

The plot of the initial model (Diabetes_model) shows that the tree structure is deep and fragile, which reduces easy interpretation in the decision-making process. Thus, here we explore ways to make the tree more interpretable without losing performance. One way of doing this is to prune away the fragile part of the tree, the part that contributes to overfitting.

(a) Plotting error vs the complexity parameter

The decision tree has a parameter called the complexity parameter (cp), which controls the size of the tree. If the cost of adding another split to the tree from the current node exceeds the value of cp, tree building does not continue. We can generate the cp-vs-error plot using the plotcp() function.

# Plot the cost-complexity parameter (CP) table
plotcp(Diabetes_model1)

(b) Generating the complexity parameter table

We can also print the cp table by calling model$cptable. Here, you can observe that the cross-validated error (xerror) reaches its minimum at a CP value of 0.025.

# Print the cost-complexity parameter (CP) table
print(Diabetes_model1$cptable)

(c) Obtaining an optimally pruned model

We can retrieve the optimal CP value by finding the index of the minimum xerror and using it to index into the CP table.

# Retrieve the optimal cp value based on cross-validated error
index <- which.min(Diabetes_model1$cptable[, "xerror"])
cp_optimal <- Diabetes_model1$cptable[index, "CP"]

The next step is to prune the tree with the prune() function, supplying the optimal CP value. If we plot the optimally pruned tree, we can observe that it is now very simple and easy to interpret:

If a person has a glucose level above 128 and an age greater than 25, they are designated diabetes-positive; otherwise negative.

# Prune the tree using the optimal CP value
Diabetes_model1_opt <- prune(tree = Diabetes_model1, cp = cp_optimal)
rpart.plot(x = Diabetes_model1_opt, yesno = 2, type = 0, extra = 0)
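Read directly off the pruned tree, the decision rule above can be sketched as a plain R function (illustrative only, not part of the original code; the glucose and age thresholds are those reported for the pruned tree):

```r
# Hand-coded version of the pruned tree's decision rule (illustrative only;
# thresholds taken from the pruned-tree description above)
predict_diabetes <- function(glucose, age) {
  if (glucose > 128 && age > 25) "pos" else "neg"
}

predict_diabetes(glucose = 140, age = 30) # "pos"
predict_diabetes(glucose = 110, age = 40) # "neg"
```

This is exactly what makes a pruned tree attractive: the whole model fits in a two-line rule a clinician could apply by hand.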

(d) Pruned tree performance

The next step is to check whether the pruned tree retains similar performance or has been compromised. After the performance check, we can see that the pruned tree is as capable as the earlier fragile tree, but now it is simple and easy to interpret.

pred3 <- predict(object = Diabetes_model1_opt,
                 newdata = Diabetes_test,
                 type = "class")
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred3)

Decision Tree Hyperparameter Tuning

Next, we try to increase the performance of the decision tree model by tuning its hyperparameters. rpart() offers several hyperparameters; here we tune two important ones, minsplit and maxdepth.

  • minsplit: the minimum number of observations that must exist in a node for a split to be attempted.

  • maxdepth: the maximum depth of any node of the final tree.

(a) Generating the hyperparameter grid

First, we generate the sequence 1 to 20 for both minsplit and maxdepth. Then we build a grid of all parameter combinations using the expand.grid() function.

#############################
## Hyperparameter Grid Search
#############################

# minsplit: minimum number of observations in a node for a split to be attempted
# maxdepth: maximum depth of any node of the final tree
minsplit <- seq(1, 20, 1)
maxdepth <- seq(1, 20, 1)

# Generate a search grid (20 x 20 = 400 combinations)
hyperparam_grid <- expand.grid(minsplit = minsplit, maxdepth = maxdepth)

(b) Training grid-based models

The next step is to train a model for each hyperparameter combination in the grid. This is done through the following steps:

  • using a for loop to iterate over each row of the grid and supplying its hyperparameters to rpart() for model training
  • storing each model in an initially empty list (diabetes_models)

# Number of potential models in the grid
num_models <- nrow(hyperparam_grid)

# Create an empty list
diabetes_models <- list()

# Loop over the rows of hyperparam_grid to train the grid of models
for (i in 1:num_models) {

  minsplit <- hyperparam_grid$minsplit[i]
  maxdepth <- hyperparam_grid$maxdepth[i]

  # Train a model and store it in the list
  diabetes_models[[i]] <- rpart(formula = diabetes ~ .,
                                data = Diabetes_train,
                                method = "class",
                                minsplit = minsplit,
                                maxdepth = maxdepth)
}

(c) Computing test accuracy

The next step is to evaluate each model on the test data and retrieve the best one. This is done through the following steps:

  • using a for loop to iterate over each model in the list, predict on the test data and compute accuracy
  • storing each accuracy value in an initially empty vector (accuracy_values)

# Number of models in the grid
num_models <- length(diabetes_models)

# Create an empty vector to store accuracy values
accuracy_values <- c()

# Estimate each model's accuracy in a loop
for (i in 1:num_models) {

  # Retrieve model i from the list
  model <- diabetes_models[[i]]

  # Generate predictions on the test data
  pred <- predict(object = model,
                  newdata = Diabetes_test,
                  type = "class")

  # Compute test accuracy and append it to accuracy_values
  accuracy_values[i] <- accuracy(actual = Diabetes_test$diabetes,
                                 predicted = pred)
}

(d) Identifying the best model

The next step is to retrieve the best-performing model (maximum accuracy) and print its hyperparameters via model$control. We can observe that with a minsplit of 17 and a maxdepth of 6 the model provides the most accurate results on the unseen/test data set.

# Identify the model with maximum accuracy
best_model <- diabetes_models[[which.max(accuracy_values)]]

# Print the hyperparameters of the best model
best_model$control

(e) Best model evaluation on test data

After identifying the best-performing model, the next step is to see how accurate it is. With the best hyperparameters, the model achieves an accuracy of 81.08%, which is really great.

# Best model's accuracy on the test data
pred <- predict(object = best_model,
                newdata = Diabetes_test,
                type = "class")
accuracy(actual = Diabetes_test$diabetes,
         predicted = pred)

(f) Best model plot

Now it is time to plot the best model.

rpart.plot(x = best_model, yesno = 2, type = 0, extra = 0)

Even though the plot above is for the best-performing model, it still looks a little fragile. So your next task would be to prune it and see whether you get a more interpretable decision tree.

I hope you learned something new. See you next time!


Note

This article was first published on onezero.blog, a data science, machine learning and research blogging platform maintained by me.

Read more by visiting my personal blog website: https://onezero.blog/

If you learned something new and liked this article, say 👋 / follow me on onezero.blog (my personal blogging website), Twitter, LinkedIn, YouTube and GitHub.

[1] Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A. (1984). Classification and Regression Trees. CRC Press.

[2] Loh, W. (2014). Fifty Years of Classification and Regression Trees.

[3] Newman, C.B.D. & Merz, C. (1998). UCI Repository of Machine Learning Databases. Technical report, University of California, Irvine, Dept. of Information and Computer Sciences.

Originally published at https://onezero.blog on August 2, 2020.

Translated from: https://towardsdatascience.com/diabetes-classification-using-decision-trees-c4fd6dd7241a
