Decision Trees are a popular algorithm in modern machine learning. There are two types of Decision Trees: a Classification Tree, where the response is a categorical label, and a Regression Tree, where the response is numeric. This is an example of a Classification Tree:
library(DiagrammeR)
# Draw the tree with Graphviz; xlabel places the True/False labels on the edges
grViz('
digraph {
  forcelabels = true;
  "Are you happy?" -> "Party"      [xlabel = "True"];
  "Are you happy?" -> "Take a nap" [xlabel = "False"];
}')
A Regression Tree looks similar, except that its leaves ultimately hold numeric predictions:
library(data.tree)
tree <- Node$new("Subject > 100 lbs?")
leaf1 <- tree$AddChild("Subject > 5 ft tall?")
leaf2 <- tree$AddChild("Subject <= 5 ft tall?")
plot(tree)
We will first go over the details of Classification Trees. Note that a Classification Tree can mix numeric and categorical statements, and it can reuse the same numeric variable with different inequalities at different nodes. From now on, we proceed down the left branch when a condition is true and down the right branch otherwise.
Let’s say we have the following data:
smoking <- c("yes", "yes", "yes", "no", "no")
diabetes <- c("yes", "yes", "no", "yes", "no")
age <- c(19, 21, 23, 27, 31)
cancer <- c("yes", "no", "no", "yes", "no")
df <- data.frame(smoking, diabetes, age, cancer)
library(dplyr)
knitr::kable(df, align = "c", caption = "<center>Table 1</center>") %>%
kableExtra::kable_styling(position = "center", full_width = F)
| smoking | diabetes | age | cancer |
|:-------:|:--------:|:---:|:------:|
| yes | yes | 19 | yes |
| yes | yes | 21 | no |
| yes | no | 23 | no |
| no | yes | 27 | yes |
| no | no | 31 | no |
We will now build a classification tree from scratch that predicts whether a patient will get cancer. First, we need to decide which column becomes the root node, the top of the tree. We analyze the predictor columns one by one against the cancer response, starting with smoking:
tree <- Node$new("Smoking?")
leaf1 <- tree$AddChild("Cancer \n yes: 1 \n no: 2")
leaf2 <- tree$AddChild("Cancer \n yes: 1 \n no: 1")
plot(tree)
The way I calculated the yes and no counts is by following the cancer category row by row. Starting with smoking and cancer: the first row has yes for smoking and yes for cancer, so we add 1 to the yes count of the left leaf node. We do the same thing for every row until we reach the last one. Note the distribution of yes and no in both leaf nodes: each leaf contains at least one yes and at least one no. We call such leaves impure.
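In R, these counts are just a cross-tabulation:
# Cross-tabulate smoking against cancer: rows = smoking, columns = cancer
table(df$smoking, df$cancer)
#       no yes
#   no   1   1
#   yes  2   1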
Next, let’s compare diabetes and cancer:
tree <- Node$new("Diabetes?")
leaf1 <- tree$AddChild("Cancer \n yes: 2 \n no: 1")
leaf2 <- tree$AddChild("Cancer \n yes: 0 \n no: 2")
plot(tree)
So this tree is less impure than the first one, as you can see from the right leaf: all of its samples are classified as no. To quantify this impurity, we will use the Gini Impurity. The Gini Impurity for the left leaf of the first tree is: \[1 - P(\text{yes})^2 - P(\text{no})^2 = 1 - (\frac{1}{3})^2 - (\frac{2}{3})^2 = 0.44\] The Gini Impurity for the right leaf of the first tree is: \[1 - (\frac{1}{2})^2 - (\frac{1}{2})^2 = 0.50\] Note that the left leaf holds 3 of the 5 samples and the right leaf holds the other 2. Since the leaves contain different numbers of observations, we take the weighted average of the two Gini Impurities: \[0.44 \cdot \frac{3}{5} + 0.50 \cdot \frac{2}{5} = 0.464\]
Now let’s calculate the Gini Impurity for our second tree. The left leaf is: \[1 - (\frac{2}{3})^2 - (\frac{1}{3})^2 = 0.44\] The right leaf is: \[1 - (\frac{0}{2})^2 - (\frac{2}{2})^2 = 0\]
Finally, the weighted average: \[0.44 \cdot \frac{3}{5} + 0 \cdot \frac{2}{5} = 0.264\] Because 0.264 is smaller than 0.464, the second tree is less impure.
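To double-check these numbers, here is a small R sketch of the same calculations. The helper functions gini and weighted_gini are our own, not from any package; note that the exact values are 0.467 and 0.267, because above we rounded 4/9 to 0.44 before weighting.
# Gini impurity of a vector of class labels: 1 - sum of squared class proportions
gini <- function(y) 1 - sum((table(y) / length(y))^2)

# Weighted Gini impurity of a binary split defined by a logical condition
weighted_gini <- function(cond, y) {
  w <- mean(cond)  # fraction of rows going to the left (TRUE) leaf
  w * gini(y[cond]) + (1 - w) * gini(y[!cond])
}

weighted_gini(df$smoking == "yes", df$cancer)   # 0.467 (0.464 above, after rounding)
weighted_gini(df$diabetes == "yes", df$cancer)  # 0.267 (0.264 above, after rounding)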
Calculating the Gini Impurity for our numeric variable age is slightly different. We first sort the variable in ascending order. Then we compute the mean of every pair of neighboring rows: \[(19+21)/2=20\] \[(21+23)/2=22\] \[(23+27)/2=25\] \[(27+31)/2=29\] Next, we calculate the Gini Impurity for each of these four candidate splits, with the root node being age less than the respective average. We start with age < 20:
tree <- Node$new("Age < 20")
leaf1 <- tree$AddChild("Cancer \n yes: 1 \n no: 0")
leaf2 <- tree$AddChild("Cancer \n yes: 1 \n no: 3")
plot(tree)
The Gini Impurity for the left leaf is 0 and for the right leaf is \(1 - (\frac{1}{4})^2 - (\frac{3}{4})^2 = 0.375\), so the weighted Gini Impurity is \(0 \cdot \frac{1}{5} + 0.375 \cdot \frac{4}{5} = 0.3\).
Likewise, we get 0.47, 0.47, and 0.4 for the remaining three averages.
From these four Gini Impurities we choose the minimum, 0.3, which belongs to age < 20. We are now ready to choose which variable becomes the root. Among 0.464, 0.264, and 0.3, the smallest is 0.264, hence we choose diabetes.
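The same threshold search can be scripted with the helpers sketched earlier:
# Candidate thresholds: midpoints of consecutive sorted ages
ages <- sort(df$age)
thresholds <- (head(ages, -1) + tail(ages, -1)) / 2  # 20 22 25 29

# Weighted Gini impurity of each candidate split "age < t"
sapply(thresholds, function(t) weighted_gini(df$age < t, df$cancer))
# 0.300 0.467 0.467 0.400 -> age < 20 is the best split on age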
Now, based on the bold rows of Table 2 below, the updated tree looks like:
tree <- Node$new("Diabetes")
leaf1 <- tree$AddChild("Cancer \n yes: 2 \n no: 1")
leaf2 <- tree$AddChild("Cancer \n yes: 0 \n no: 2")
plot(tree)
The cancer distribution for ‘diabetes == yes’ (the left leaf) is shown in bold:
knitr::kable(df, align = "c", caption = "<center>Table 2</center>") %>%
  kableExtra::kable_styling(position = "center", full_width = F) %>%
  kableExtra::row_spec(c(1, 2, 4), bold = T)
| smoking | diabetes | age | cancer |
|:-------:|:--------:|:---:|:------:|
| **yes** | **yes** | **19** | **yes** |
| **yes** | **yes** | **21** | **no** |
| yes | no | 23 | no |
| **no** | **yes** | **27** | **yes** |
| no | no | 31 | no |
We don’t have to do anything to the right leaf, as it has zero impurity. The left leaf, however, is still impure, so we check once more whether we can reduce the impurity using the remaining variables, smoking and age. We start with smoking. For the three rows having ‘diabetes == yes’, the weighted Gini Impurity for smoking is \(0.5 \cdot \frac{2}{3} + 0 \cdot \frac{1}{3} = \frac{1}{3}\):
tree <- Node$new("Smoke")
leaf1 <- tree$AddChild("Cancer \n yes: 1 \n no: 1")
leaf2 <- tree$AddChild("Cancer \n yes: 1 \n no: 0")
plot(tree)
Next, for the same three rows, the best split on age (age < 20) also has a weighted Gini Impurity of \(0 \cdot \frac{1}{3} + 0.5 \cdot \frac{2}{3} = \frac{1}{3}\):
tree <- Node$new("Smoke")
leaf1 <- tree$AddChild("Cancer \n yes: 1 \n no: 1")
leaf2 <- tree$AddChild("Cancer \n yes: 1 \n no: 0")
plot(tree)
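Both values can be confirmed with the earlier helpers, restricted to the ‘diabetes == yes’ rows:
sub <- df[df$diabetes == "yes", ]  # rows 1, 2, and 4

weighted_gini(sub$smoking == "yes", sub$cancer)  # 1/3
weighted_gini(sub$age < 20, sub$cancer)          # 1/3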
Since both weighted Gini Impurities are \(1/3\), we can choose either variable; we will pick smoking. Our updated tree is:
root <- Node$new("Diabetes")
leaf1 <- root$AddChild("Smoke")
leaf1_1 <- leaf1$AddChild("Cancer \n yes: 1 \n no: 1")
leaf1_2 <- leaf1$AddChild("Cancer \n yes: 1 \n no: 0")
leaf2 <- root$AddChild("Cancer \n yes: 0 \n no: 2")
plot(root)
One leaf is still impure: the one reached when ‘diabetes == yes’ and ‘smoking == yes’. We focus on those rows, shown in bold in Table 3:
knitr::kable(df, align = "c", caption = "<center>Table 3</center>") %>%
  kableExtra::kable_styling(position = "center", full_width = F) %>%
  kableExtra::row_spec(c(1, 2), bold = T)
| smoking | diabetes | age | cancer |
|:-------:|:--------:|:---:|:------:|
| **yes** | **yes** | **19** | **yes** |
| **yes** | **yes** | **21** | **no** |
| yes | no | 23 | no |
| no | yes | 27 | yes |
| no | no | 31 | no |
and the updated tree:
root <- Node$new("Diabetes")
leaf1 <- root$AddChild("Smoke")
leaf1_1 <- leaf1$AddChild("Age < 20")
leaf1_1_1 <- leaf1_1$AddChild("Cancer \n yes: 1 \n no: 0")
leaf1_1_2 <- leaf1_1$AddChild("Cancer \n yes: 0 \n no: 1")
leaf1_2 <- leaf1$AddChild("Cancer \n yes: 1 \n no: 0")
leaf2 <- root$AddChild("Cancer \n yes: 0 \n no: 2")
plot(root)
Now all of the leaves are pure.
Lastly, each leaf needs a final output. In general, we pick the majority class, that is, the larger of the yes and no counts. Thus, the final tree is:
root <- Node$new("Diabetes")
leaf1 <- root$AddChild("Smoke")
leaf1_1 <- leaf1$AddChild("Age < 20")
leaf1_1_1 <- leaf1_1$AddChild("Cancer")
leaf1_1_2 <- leaf1_1$AddChild("No Cancer")
leaf1_2 <- leaf1$AddChild("Cancer")
leaf2 <- root$AddChild("No Cancer")
plot(root)
We can now predict/classify new data. Suppose we have a new patient:
smoking <- c("no")
diabetes <- c("yes")
age <- c(32)
df_new <- data.frame(smoking, diabetes, age)
knitr::kable(df_new, align = "c", caption = "<center>Table 4</center>") %>%
kableExtra::kable_styling(position = "center", full_width = F)
| smoking | diabetes | age |
|:-------:|:--------:|:---:|
| no | yes | 32 |
Following the tree, diabetes is yes, so we go left; smoking is no, so we reach the Cancer leaf. We classify this patient as getting cancer.
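As a cross-check against an established implementation, we could fit the same data with the rpart package, which also uses Gini splitting for classification by default. With only five rows we have to loosen the default stopping rules, so treat this as a sketch rather than a serious model:
library(rpart)

# Loosen the stopping rules so the tiny 5-row dataset can actually be split
fit <- rpart(cancer ~ smoking + diabetes + age, data = df,
             method = "class",
             control = rpart.control(minsplit = 2, minbucket = 1,
                                     cp = 0, xval = 5))

predict(fit, df_new, type = "class")  # should predict "yes", matching our hand-built tree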
One last note: look how few observations ended up in each leaf. This can be a problem, because the tree we built may be overfit and will not generalize well to incoming data. There are two common solutions:

1. Pruning.
2. Requiring a minimum number of observations per leaf, even if a leaf remains impure. For example, we could have stopped at the left leaf:
root <- Node$new("Diabetes")
leaf1 <- root$AddChild("Cancer \n yes: 1 \n no: 1")
leaf2 <- root$AddChild("Cancer \n yes: 0 \n no: 2")
plot(root)
So the left leaf is classified as cancer (its one-yes, one-no tie can be broken either way). To determine this minimum number, we use cross-validation to select the best value.
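In rpart, for example, this tuning is built in: the cptable stores the cross-validated error for each candidate subtree size, prune() cuts the tree back, and rpart.control(minbucket = ...) enforces the minimum leaf size. A sketch, continuing from the fit object above:
# Cross-validated error (xerror) for each candidate subtree size
printcp(fit)

# Prune back to the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)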