1.0 Introduction

This is a dataset made available by Spain’s Ministry of Education and Science in its KEEL (Knowledge Extraction and Evolutionary Learning) repository. I think it would be terrific if we could leverage machine learning to improve predictions in healthcare. It could potentially spare patients from a lot of anxiety as well as save a lot of money and effort.

2.0 Getting ahold of the data

2.1 Downloading the data

The data can be downloaded here. The data glossary is here.

2.2 Reading the data

Having downloaded the data, we read it into R. We skip the first 26 lines of the file which contain information about the file but no actual data.

thyroid <- read.csv("thyroid.dat", header = FALSE, skip = 26) # skipping the first 26 lines
colnames(thyroid) <- c("age", "sex", "on_thyroxine", "query_on_thyroxine", "antithyroid_medication", "sick", "pregnant", "thyroid_surgery", "I131_treatment", "query_hypothyroid", "query_hyperthyroid", "lithium", "goitre", "tumor", "hypopituitary", "psych", "TSH", "T3", "TT4", "T4U", "FTI","class")
str(thyroid)
'data.frame':   7200 obs. of  22 variables:
 $ age                   : num  0.73 0.24 0.47 0.64 0.23 0.69 0.85 0.48 0.67 0.62 ...
 $ sex                   : int  0 0 0 1 0 1 1 1 0 0 ...
 $ on_thyroxine          : int  1 0 0 0 0 0 0 0 1 0 ...
 $ query_on_thyroxine    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ antithyroid_medication: int  0 0 0 0 0 0 0 0 0 0 ...
 $ sick                  : int  0 0 0 0 0 0 0 0 0 1 ...
 $ pregnant              : int  0 0 0 0 0 0 0 0 0 0 ...
 $ thyroid_surgery       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ I131_treatment        : int  1 0 0 0 0 0 0 0 0 0 ...
 $ query_hypothyroid     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ query_hyperthyroid    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ lithium               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ goitre                : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tumor                 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hypopituitary         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ psych                 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ TSH                   : num  0.0006 0.00025 0.0019 0.0009 0.00025 0.00025 0.00025 0.00208 0.0013 0.011 ...
 $ T3                    : num  0.015 0.03 0.024 0.017 0.026 0.016 0.023 0.02 0.024 0.008 ...
 $ TT4                   : num  0.12 0.143 0.102 0.077 0.139 0.086 0.128 0.086 0.087 0.073 ...
 $ T4U                   : num  0.082 0.133 0.131 0.09 0.09 0.07 0.104 0.078 0.109 0.074 ...
 $ FTI                   : num  0.146 0.108 0.078 0.085 0.153 0.123 0.121 0.11 0.08 0.098 ...
 $ class                 : int  3 3 3 3 3 3 3 3 3 2 ...


There are \(7200\) observations with \(21\) predictors and \(1\) response variable, \(class\).

3.0 Data exploration and munging

3.1 Data exploration

There are \(21\) predictors. \(age\) is given as number between \(0\) and \(1\). \(sex\) is a Boolean variable. It’s not clear whether \(1\) means “male” or “female”. There are many Boolean variables: \(on\_thyroxine\) tells us whether the patient is on Levothyroxine, \(query\_on\_thyroxine\) whether the patient was asked if he was taking Levothyroxine, \(antithyroid\_medication\) whether the patient is on any medication to treat a thyroid condition, and so on. Some that aren’t too obvious, or at least not to me: \(I131\_treatment\) tells us if the patient is under I131 treatment, \(lithium\) if the patient is taking lithium for a different condition, \(goitre\) whether the patient exhibits goiter, and \(hypopituitary\) whether the patient has hypopituitarism. I am not sure what \(psych\) stands for. The variables \(TSH\), \(T3\), \(TT4\), \(T4U\), and \(FTI\) represent the results of various blood tests used to screen for thyroid problems.

The response variable is \(class\), which can take \(1\) of \(3\) values: \(1\) is normal, \(2\) is hyperthyroidism, and \(3\) hypothyroidism. Over \(90\%\) of the observations are classified as hypothyroidism.

3.2 Data munging

We can start by changing \(age\) to numbers we are familiar with, and converting the response to a factor variable. We will also change the response from type integer to factor.

# Changle "class" to factor
thyroid$class <- factor(thyroid$class)
# Multiply "age" by 100
thyroid$age <- 100 * thyroid$age


There is no missing data in the dataset.

3.4 Splitting into training and testing datasets

Let’s split the data into training and testing data sets.

library(caTools)
set.seed(1000) # reproducibility
split <- sample.split(thyroid$class, SplitRatio = 0.7)
thyroid_train <- subset(thyroid, split==TRUE)
thyroid_test <- subset(thyroid, split==FALSE)


4.0 Decision trees

Loading the required libraries

library(rpart)
library(rpart.plot)


Now let’s build the tree and plot it. We will use all the predictors to build the tree. The parameter minbucket allows us to select how the minimum number of observations that are placed in one of the tree’s “leaves” or “buckets”.

# Building a tree with a minimum of 50 observations on each leaf
thyroid_tree <- rpart(class ~ ., data = thyroid_train, control=rpart.control(minbucket=50))
prp(thyroid_tree)


The tree splits on 3 of the variables. If we were to change the minbucket parameter, we would get more or fewer splits. Sticking with this tree for now, let’s get its accuracy on the training set itself.

# Generate predictions on training set
PredictCART_train = predict(thyroid_tree, type = "class")
# Confusion matrix of training set
conf_matrix_train <- table(thyroid_train$class, PredictCART_train)
conf_matrix_train
   PredictCART_train
       1    2    3
  1  116    0    0
  2    0  258    0
  3   10   32 4624


The rows are the ground truth whereas the columns are the predictions generated by the tree. The tree perfectly predicts the normal and hyperthyroidism cases on the training set. We can compute its overall accuracy as the ratio of the correct predictions to the size of the dataset.

\[ Accuracy=\frac{Number\ of\ normal,\ hyperthyroidism,\ and\ hypothyroidism\ cases\ correctly\ predicted}{Total\ number\ of\ cases} \]

The number of the correctly predicted cases is the sum of the diagonals of the matrix.

sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
[1] 0.9916667


So the accuracy is over \(99\%\). What is the performance of this tree on the test set?

PredictCART_test = predict(thyroid_tree, newdata = thyroid_test, type = "class")
# Confusion matrix of test set
conf_matrix_test <- table(thyroid_test$class, PredictCART_test)
conf_matrix_test
   PredictCART_test
       1    2    3
  1   50    0    0
  2    0  110    0
  3    6   18 1976


The accuracy of this tree on the testing set is:

sum(diag(conf_matrix_test)) / sum(conf_matrix_test)
[1] 0.9888889


The performance of the tree on the test set is just as strong as on the training set. Can we do even better?

4.1 Cross-validation

We can use cross-validation to obtain the tree that maximizes accuracy. There is a parameter called cp, or complexity parameter, that allows us, indirectly, to specify how many observations will be placed on each leaf. We can do 10-fold cross-validation for each cp value, compute the average accuracy of each cp value, and then take the best cp value and build the final tree with it.

Let’s load the libraries we will need to perform the cross-validation:

library(caret)
library(e1071)


Now define the parameters of the cross-validation. We will try values of \(cp\) from \(0.01\) to \(0.50\), in steps of \(0.01\).

# Setting cross-validation to be 10-fold
fitControl = trainControl( method = "cv", number = 10 )
# Setting cp to .01, .02, ..., 0.5
cartGrid = expand.grid( .cp = (1:50)*0.01) 


Perform the cross-validation to determine the best cp parameter. The train function will, for each value of \(cp\), perform 10-fold cross-validation, and report the average accuracy of the \(10\) runs for that value of \(cp\). Since there are \(50\) values of \(cp\) and we are performing 10-fold cross-validation for each of them, train() builds \(500\) trees and computes the accuracy of each. So it takes a little while to run.

train(class ~ ., data = thyroid_train, method = "rpart", trControl = fitControl, tuneGrid = cartGrid)
CART 

5040 samples
  21 predictor
   3 classes: '1', '2', '3' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 4538, 4536, 4535, 4535, 4536, 4535, ... 
Resampling results across tuning parameters:

  cp    Accuracy   Kappa    
  0.01  0.9938468  0.9574844
  0.02  0.9942440  0.9604630
  0.03  0.9942440  0.9604630
  0.04  0.9920623  0.9460961
  0.05  0.9916655  0.9435706
  0.06  0.9916655  0.9435706
  0.07  0.9916655  0.9435706
  0.08  0.9916655  0.9435706
  0.09  0.9916655  0.9435706
  0.10  0.9916655  0.9435706
  0.11  0.9916655  0.9435706
  0.12  0.9916655  0.9435706
  0.13  0.9916655  0.9435706
  0.14  0.9916655  0.9435706
  0.15  0.9916655  0.9435706
  0.16  0.9916655  0.9435706
  0.17  0.9916655  0.9435706
  0.18  0.9916655  0.9435706
  0.19  0.9916655  0.9435706
  0.20  0.9916655  0.9435706
  0.21  0.9851229  0.9062083
  0.22  0.9801654  0.8766649
  0.23  0.9757963  0.8504515
  0.24  0.9757963  0.8504515
  0.25  0.9757963  0.8504515
  0.26  0.9757963  0.8504515
  0.27  0.9757963  0.8504515
  0.28  0.9757963  0.8504515
  0.29  0.9757963  0.8504515
  0.30  0.9757963  0.8504515
  0.31  0.9615248  0.7625517
  0.32  0.9527816  0.7061530
  0.33  0.9527816  0.7061530
  0.34  0.9527816  0.7061530
  0.35  0.9527816  0.7061530
  0.36  0.9428676  0.4861434
  0.37  0.9295582  0.1334101
  0.38  0.9257958  0.0000000
  0.39  0.9257958  0.0000000
  0.40  0.9257958  0.0000000
  0.41  0.9257958  0.0000000
  0.42  0.9257958  0.0000000
  0.43  0.9257958  0.0000000
  0.44  0.9257958  0.0000000
  0.45  0.9257958  0.0000000
  0.46  0.9257958  0.0000000
  0.47  0.9257958  0.0000000
  0.48  0.9257958  0.0000000
  0.49  0.9257958  0.0000000
  0.50  0.9257958  0.0000000

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.03. 


The train() function from the caret package selects the best value of cp: \(0.03\). We can then use this value to build the final tree:

thyroid_tree_cv <- rpart(class ~ ., data = thyroid_train, control=rpart.control(cp=0.03))
prp(thyroid_tree_cv)


Now we can compute the accuracy on the training set again.

# Generate predictions on training set using cross-validated tree
PredictCART_train_cv = predict(thyroid_tree_cv, type = "class")
# Confusion matrix of training set
conf_matrix_train_cv <- table(thyroid_train$class, PredictCART_train_cv)
conf_matrix_train_cv
   PredictCART_train_cv
       1    2    3
  1  116    0    0
  2    0  258    0
  3   10   17 4639


The accuracy on the training set is even higher now:

sum(diag(conf_matrix_train_cv)) / sum(conf_matrix_train_cv)
[1] 0.9946429


Finally, what about the accuracy of the tree on the test set?

# Generate predictions on test set using cross-validated set
PredictCART_test_cv = predict(thyroid_tree_cv, newdat = thyroid_test, type = "class")
# Confusion matrix of training set
conf_matrix_test_cv <- table(thyroid_test$class, PredictCART_test_cv)
conf_matrix_test_cv
   PredictCART_test_cv
       1    2    3
  1   50    0    0
  2    0  110    0
  3    6    8 1986


sum(diag(conf_matrix_test_cv)) / sum(conf_matrix_test_cv)
[1] 0.9935185


Out of more than \(2100\) cases in the test set, only \(14\) were misclassified by the model. How big an improvement is this over not bothering to use any model at all and merely predicting each case will be hypothyroidism, i.e., \(class=3\)? The vast majority of the cases are indeed hypothyroidism.

hypothyroidism_num <- nrow(thyroid_test[thyroid_test$class == "3",])
total_cases <- nrow(thyroid_test)
baseline_accuracy <- hypothyroidism_num / total_cases
baseline_accuracy
[1] 0.9259259


The no-model-at-all baseline accuracy of \(92.6\%\) is \(6.8\%\) worse than that of the optimized tree we built.

5.0 References

  1. Bertsimas, D., O’Hair, A. The Analytics Edge. Spring 2014. edX.org.
---
title: "Predicting thyroid diagnoses with decision trees"
output: html_notebook
---

## 1.0 Introduction

This is a dataset made available by Spain's Ministry of Education and Science in its KEEL (Knowledge Extraction and Evolutionary Learning) [repository](http://sci2s.ugr.es/keel/datasets.php). I think it would be terrific if we could leverage machine learning to improve predictions in healthcare. It could potentially spare patients from a lot of anxiety as well as save a lot of money and effort.
<br>

## 2.0 Getting ahold of the data

### 2.1 Downloading the data

The data can be downloaded [here](http://sci2s.ugr.es/keel/dataset/data/classification/thyroid.zip). The data glossary is [here](http://sci2s.ugr.es/keel/dataset/data/classification/thyroid-names.txt). 

### 2.2 Reading the data
Having downloaded the data, we read it into R. We skip the first 26 lines of the file which contain information about the file but no actual data.
```{r}
thyroid <- read.csv("thyroid.dat", header = FALSE, skip = 26) # skipping the first 26 lines
colnames(thyroid) <- c("age", "sex", "on_thyroxine", "query_on_thyroxine", "antithyroid_medication", "sick", "pregnant", "thyroid_surgery", "I131_treatment", "query_hypothyroid", "query_hyperthyroid", "lithium", "goitre", "tumor", "hypopituitary", "psych", "TSH", "T3", "TT4", "T4U", "FTI","class")
str(thyroid)
```
<br>

There are $7200$ observations with $21$ predictors and $1$ response variable, $class$.

## 3.0 Data exploration and munging

### 3.1 Data exploration

There are $21$ predictors. $age$ is given as number between $0$ and $1$. $sex$ is a Boolean variable. It's not clear whether $1$ means "male" or "female". There are many Boolean variables: $on\_thyroxine$ tells us whether the patient is on [Levothyroxine](http://www.webmd.com/drugs/2/drug-13826/l-thyroxine-oral/details), $query\_on\_thyroxine$ whether the patient was asked if he was taking Levothyroxine, $antithyroid\_medication$ whether the patient is on any medication to treat a thyroid condition, and so on. Some that aren't too obvious, or at least not to me: $I131\_treatment$ tells us if the patient is under [I131 treatment](https://www.radiologyinfo.org/en/info.cfm?pg=radioiodine), $lithium$ if the patient is taking [lithium](http://psycheducation.org/treatment/mood-stabilizers/the-big-three-for-bipolar-depression/lithium/lithium-risks/) for a different condition, $goitre$ whether the patient exhibits [goiter](https://en.wikipedia.org/wiki/Goitre), and $hypopituitary$ whether the patient has [hypopituitarism](https://pituitarysociety.org/patient-education/pituitary-disorders/hypopituitarism/what-are-the-symptoms-of-hypopituitarism). I am not sure what $psych$ stands for. The variables $TSH$, $T3$, $TT4$, $T4U$, and $FTI$ represent the results of various blood tests used to screen for thyroid problems.
<br>

The response variable is $class$, which can take $1$ of $3$ values: $1$ is normal, $2$ is [hyperthyroidism](https://en.wikipedia.org/wiki/Hyperthyroidism), and $3$ [hypothyroidism](https://www.thyroid.org/hypothyroidism/). Over $90\%$ of the observations are classified as hypothyroidism.
<br>

### 3.2 Data munging

We can start by changing $age$ to numbers we are familiar with, and converting the response to a factor variable. We will also change the response from type integer to factor.
```{r}
# Changle "class" to factor
thyroid$class <- factor(thyroid$class)
# Multiply "age" by 100
thyroid$age <- 100 * thyroid$age
```
<br>

There is no missing data in the dataset.

### 3.4 Splitting into training and testing datasets

Let's split the data into training and testing data sets.
```{r}
library(caTools)
set.seed(1000) # reproducibility
split <- sample.split(thyroid$class, SplitRatio = 0.7)
thyroid_train <- subset(thyroid, split==TRUE)
thyroid_test <- subset(thyroid, split==FALSE)
```
<br>

## 4.0 Decision trees

Loading the required libraries
```{r}
library(rpart)
library(rpart.plot)
```
<br>

Now let's build the tree and plot it. We will use all the predictors to build the tree. The parameter *minbucket* allows us to select how the minimum number of observations that are placed in one of the tree's "leaves" or "buckets".
```{r}
# Building a tree with a minimum of 50 observations on each leaf
thyroid_tree <- rpart(class ~ ., data = thyroid_train, control=rpart.control(minbucket=50))
prp(thyroid_tree)
```
<br>

The tree splits on 3 of the variables. If we were to change the *minbucket* parameter, we would get more or fewer splits. Sticking with this tree for now, let's get its accuracy on the training set itself.
```{r}
# Generate predictions on training set
PredictCART_train = predict(thyroid_tree, type = "class")
# Confusion matrix of training set
conf_matrix_train <- table(thyroid_train$class, PredictCART_train)
conf_matrix_train
```
<br>

The rows are the ground truth whereas the columns are the predictions generated by the tree. The tree perfectly predicts the normal and hyperthyroidism cases on the training set. We can compute its overall accuracy as the ratio of the correct predictions to the size of the dataset.
<br>

$$
Accuracy=\frac{Number\ of\ normal,\ hyperthyroidism,\ and\ hypothyroidism\ cases\ correctly\ predicted}{Total\ number\ of\ cases}
$$
<br>

The number of the correctly predicted cases is the sum of the diagonals of the matrix.
```{r}
sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
```
<br>

So the accuracy is over $99\%$. What is the performance of this tree on the test set?
```{r}
PredictCART_test = predict(thyroid_tree, newdata = thyroid_test, type = "class")
# Confusion matrix of test set
conf_matrix_test <- table(thyroid_test$class, PredictCART_test)
conf_matrix_test
```
<br>

The accuracy of this tree on the testing set is:
```{r}
sum(diag(conf_matrix_test)) / sum(conf_matrix_test)
```
<br>

The performance of the tree on the test set is just as strong as on the training set. Can we do even better?

### 4.1 Cross-validation

We can use cross-validation to obtain the tree that maximizes accuracy. There is a parameter called *cp*, or complexity parameter, that allows us, indirectly, to specify how many observations will be placed on each leaf. We can do 10-fold cross-validation for each *cp* value, compute the average accuracy of each *cp* value, and then take the best *cp* value and build the final tree with it.
<br>

Let's load the libraries we will need to perform the cross-validation:
```{r}
library(caret)
library(e1071)
```
<br>

Now define the parameters of the cross-validation. We will try values of $cp$ from $0.01$ to $0.50$, in steps of $0.01$.
```{r}
# Setting cross-validation to be 10-fold
fitControl = trainControl( method = "cv", number = 10 )
# Setting cp to .01, .02, ..., 0.5
cartGrid = expand.grid( .cp = (1:50)*0.01) 
```
<br>

Perform the cross-validation to determine the best *cp* parameter. The `train` function will, for each value of $cp$, perform 10-fold cross-validation, and report the average accuracy of the $10$ runs for that value of $cp$. Since there are $50$ values of $cp$ and we are performing 10-fold cross-validation for each of them, `train()` builds $500$ trees and computes the accuracy of each. So it takes a little while to run.
```{r}
train(class ~ ., data = thyroid_train, method = "rpart", trControl = fitControl, tuneGrid = cartGrid)
```
<br>

The `train()` function from the **caret** package selects the best value of *cp*: $0.03$. We can then use this value to build the final tree:
```{r}
thyroid_tree_cv <- rpart(class ~ ., data = thyroid_train, control=rpart.control(cp=0.03))
prp(thyroid_tree_cv)
```
<br>

Now we can compute the accuracy on the training set again.
```{r}
# Generate predictions on training set using cross-validated tree
PredictCART_train_cv = predict(thyroid_tree_cv, type = "class")
# Confusion matrix of training set
conf_matrix_train_cv <- table(thyroid_train$class, PredictCART_train_cv)
conf_matrix_train_cv
```
<br>

The accuracy on the training set is even higher now:
```{r}
sum(diag(conf_matrix_train_cv)) / sum(conf_matrix_train_cv)
```
<br>

Finally, what about the accuracy of the tree on the test set?
```{r}
# Generate predictions on test set using cross-validated set
PredictCART_test_cv = predict(thyroid_tree_cv, newdat = thyroid_test, type = "class")
# Confusion matrix of training set
conf_matrix_test_cv <- table(thyroid_test$class, PredictCART_test_cv)
conf_matrix_test_cv
```
<br>

```{r}
sum(diag(conf_matrix_test_cv)) / sum(conf_matrix_test_cv)
```
<br>

Out of more than $2100$ cases in the test set, only $14$ were misclassified by the model. How big an improvement is this over not bothering to use any model at all and merely predicting each case will be hypothyroidism, i.e., $class=3$? The vast majority of the cases are indeed hypothyroidism.
```{r}
hypothyroidism_num <- nrow(thyroid_test[thyroid_test$class == "3",])
total_cases <- nrow(thyroid_test)
baseline_accuracy <- hypothyroidism_num / total_cases
baseline_accuracy
```
<br>

The no-model-at-all baseline accuracy of $92.6\%$ is $6.8\%$ worse than that of the optimized tree we built.

## 5.0 References

1. Bertsimas, D., O'Hair, A. ***The Analytics Edge***. Spring 2014. [edX.org](www.edX.org).