Disease classification is the process by which a patient's symptoms (and other diagnostics, such as an ECG detecting abnormal heart rhythms) are used to diagnose that patient with a disease.
Another way to think about it: given a patient's set of symptoms (and diagnostics), what is the likely cause of those symptoms? This is commonly referred to as a prognosis.
Using a dataset provided on Kaggle, we can associate a one-hot encoded vector of symptoms and diagnostics with a prognosis (the target class). The one-hot encoded vector is simply a list of booleans, in which the value at each index represents whether a symptom is present or not.
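As a quick illustration of this representation, a single row might look like the sketch below (the symptom names are examples, and only a few of the 132 columns are shown):

```python
import pandas as pd

# A single patient record: each symptom column is a boolean flag,
# and the prognosis column holds the target disease label.
row = pd.DataFrame([{
    "itching": True,
    "skin_rash": True,
    "continuous_sneezing": False,
    "shivering": False,
    # ... remaining symptom columns omitted for brevity ...
    "prognosis": "Fungal infection",
}])

print(row)
```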
There are many methods and metrics used to evaluate machine learning models, the standard ones being accuracy, precision, recall (also known as sensitivity) and F1-score.
In the field of medical science, and more specifically diagnostics, recall (more commonly referred to as hit rate in medicine) is a very important metric. This is because recall measures how often a person with a given disease was correctly identified as having that disease.
In other words, it measures how well a model avoids type II errors (false negatives). In most medical settings, type II errors are far worse than type I errors (false positives), as they mean a patient may not be given treatment for a disease they actually have. A type I error, by contrast, is more benign and more likely to be caught and corrected.
How recall is calculated. Image source: Precision and Recall in Machine Learning.
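For reference, recall is defined in terms of true positives ($TP$) and false negatives ($FN$, the type II errors):

$$\text{recall} = \frac{TP}{TP + FN}$$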
Today’s CAD (Computer Aided Diagnosis) systems have been shown to achieve up to 90% recall. As such, achieving above 90% recall would be considered a success.
Given the constraints of the selected dataset, the following assumptions need to be made:
The dataset used in this project consists of 132 symptom/diagnostic fields and 1 prognosis field (the target class). There are 41 unique diseases stored in the prognosis field. The original dataset was split into a training and a testing CSV, which had 4920 and 42 rows respectively. The training set had perfectly balanced target classes, with 120 rows per prognosis, while the test set had 1 row per prognosis, except for fungal infections which had 2 rows.
The question of how correlated symptoms are is an important one to investigate, as it could potentially impact the performance of our model. For example, the Naive Bayes classifier assumes that features are conditionally independent. Hence, if two symptoms were correlated, it could imply one is dependent on the other, violating this assumption and causing the classifier to perform poorly.
There are many methods for measuring correlation between two boolean variables. While Pearson's correlation is intended for continuous, numeric data, other measures such as Spearman's and Kendall's correlation can work on ordinal variables.
While technically a boolean variable could be considered ordinal (assuming you treat `True` as a value greater than `False`), it doesn't really suit what our symptom variables are measuring.
As such, I chose the Proportion of Agreement as a measure of correlation, as it is simple and effective for boolean variables.
The Proportion of Agreement is the proportion of times two variables (symptoms) share the same value, as shown below:
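In a minimal form, for two boolean symptom vectors $x$ and $y$ over the same $n$ samples:

$$P_{\text{agreement}}(x, y) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[x_i = y_i]$$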
This assumes both variables have the same number of instances/samples.
The results from our initial correlation calculations were that the overwhelming majority of symptoms were highly correlated with other symptoms, as shown by the histogram below.
This was puzzling at first because, from observation, there appeared to be many symptoms that should not be associated with many other symptoms. However, reviewing the ydata-profiling report (a library that automatically generates a data analysis report), it was clear this was because the majority of values for each symptom were `False` (the symptom not being present), which makes sense given that there are 132 possible symptoms and any one prognosis presents with only a few of them.
To get a better representation of how correlated symptom pairs were, I instead calculated the Proportion of Agreement for only positive instances of each symptom pair, meaning that both symptoms would have to be present (both equal to `True`) to be counted as being in agreement.
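Below is a minimal sketch of how this positive-only agreement could be computed with pandas. The file name, and normalising by the total number of rows, are my assumptions rather than details from the original write-up:

```python
import pandas as pd
from itertools import combinations

# Hypothetical file name for the Kaggle training CSV.
df = pd.read_csv("Training.csv")
symptom_cols = [c for c in df.columns if c != "prognosis"]
df[symptom_cols] = df[symptom_cols].astype(bool)

def positive_agreement(data: pd.DataFrame, symptom_a: str, symptom_b: str) -> float:
    """Proportion of rows in which BOTH symptoms are present (True)."""
    return (data[symptom_a] & data[symptom_b]).mean()

# Compute the measure for every pair of symptom columns and show the top pairs.
pairs = pd.Series({
    (a, b): positive_agreement(df, a, b)
    for a, b in combinations(symptom_cols, 2)
})
print(pairs.sort_values(ascending=False).head(10))
```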
The result was that none of the symptoms were strongly correlated with each other.
To verify this, I looked at the symptom pairs that were most correlated (if the measure was working properly, these symptoms should be medically similar). This confirmed the measure was working properly, as the top pairs included fatigue and high fever, as well as vomiting and nausea.
Symptom 1 | Symptom 2 | Correlation
---|---|---
fatigue | high_fever | 0.199949
vomiting | nausea | 0.197917
vomiting | abdominal_pain | 0.174797
loss_of_appetite | yellowing_of_eyes | 0.160315
fatigue | loss_of_appetite | 0.156504
vomiting | loss_of_appetite | 0.155234
yellowish_skin | abdominal_pain | 0.154726
vomiting | fatigue | 0.153963
nausea | loss_of_appetite | 0.136179
fatigue | malaise | 0.135163
The dataset provided required almost no data cleaning, as there were no missing values and all of the values were in a correct, consistent format. The only cleaning step involved:

- Converting the 0/1 symptom values to `False` and `True` (not strictly necessary, but it makes interpreting the models later more intuitive).

The only preprocessing task that was necessary was increasing the size of the test set, as it was less than 1% of the size of the training set. I increased the test set size to 20% of the entire dataset, while also ensuring each target prognosis class had the same number of rows in both the training and testing sets (to ensure there were no imbalances).
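A sketch of how this re-split could look with scikit-learn; the CSV file names and the use of `train_test_split` with stratification are assumptions based on the description above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Combine the original Kaggle training and testing CSVs into one dataset.
train_df = pd.read_csv("Training.csv")
test_df = pd.read_csv("Testing.csv")
full_df = pd.concat([train_df, test_df], ignore_index=True)

X = full_df.drop(columns=["prognosis"])
y = full_df["prognosis"]

# Re-split with a 20% test set, stratified on the prognosis so each
# disease is represented in the same proportion in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```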
Given the simplicity of the features in the dataset, feature engineering wasn’t necessary.
The decision tree model was trained and evaluated using two methods: the first being a basic 80/20 train-test split, and the second being the more conclusive stratified K-fold cross-validation method.
As hinted in the title, the model performed very well out of the box, so I didn’t seek to tune hyperparameters using a validation set.
The only custom parameter set was the random state (using an arbitrary value), to ensure results were repeatable.
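A minimal sketch of that setup, reusing the `X_train`/`X_test` split from the preprocessing sketch above (the random state value is an arbitrary placeholder):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score

# A decision tree with default hyperparameters; only the random state
# is fixed so results are repeatable.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
# Macro-averaged recall across the 41 prognosis classes.
print("Recall:  ", recall_score(y_test, y_pred, average="macro"))
```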
On the train-test split, the decision tree model achieved:
Initially, I thought these results might have been so high because the model was overfitting on a certain portion of the data. To test this hypothesis, I utilised stratified K-fold cross-validation to train and test the decision tree model on different parts of the dataset, to ensure that the model was consistent overall. Furthermore, the "stratified" version of this technique ensures there is an equal number of each target class (each type of disease/prognosis) in each training and testing set, so that there are no imbalances in model training and no biases in evaluation.
Diagram of how K-fold cross-validation works. Image from Ultralytics.
The number of folds I chose was 10 (a rule-of-thumb value I found through research).
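A sketch of what this cross-validation step could look like, assuming the combined `X` and `y` from the preprocessing sketch (scoring on accuracy to match the reported score):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10-fold stratified cross-validation: every fold keeps the same
# per-prognosis class balance in its training and testing portions.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X, y, cv=skf, scoring="accuracy"
)

print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```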
The final cross validation score is the average of the test accuracies from each fold, which for the tree model was: 99.98%.
This degree of performance from a decision tree model, out of the box with no need for hyperparameter tuning or feature engineering, seems strikingly unusual.
After revisiting the original dataset and furthering the initial analysis, it became clear that the dataset lent itself to being easily interpretable/learnable. The primary reason is that there were no overlaps between any prognoses, meaning each prognosis occupied its own isolated region of the feature space. An overlap here would be where a pair of prognoses share an identical symptom vector.
An example of what overlaps between classes would look like on a logistic regression. In our dataset, each side of the line would be a completely distinct colour. Image from Gustavo.
Furthermore, the majority of symptom vectors for all prognoses were duplicates.
These two characteristics of the dataset made the boundaries of each target class easily separable by the tree model, hence the great out-of-the-box performance.
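Both properties can be sanity-checked with a few lines of pandas; this sketch assumes `full_df` is the combined dataset from the earlier preprocessing sketch:

```python
symptom_cols = [c for c in full_df.columns if c != "prognosis"]

# 1. Overlap check: does any identical symptom vector map to more than
#    one prognosis? A count of 0 means the classes never overlap.
labels_per_vector = full_df.groupby(symptom_cols)["prognosis"].nunique()
print("Symptom vectors shared by multiple prognoses:", (labels_per_vector > 1).sum())

# 2. Duplicate check: what fraction of rows repeat an earlier row's
#    symptom vector exactly?
print("Fraction of duplicated symptom vectors:",
      full_df.duplicated(subset=symptom_cols).mean())
```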
https://github.com/LGXprod/Disease-Prediction