Article written by Kihoon Yoon of HPC and AI Innovation Lab in October 2018
In the previous blog, Machine Learning in Genomics #1, we looked at how to decide the optimal amount of training data. In this blog, #2, we look at a real problem in breast cancer diagnosis.
Based on immunohistochemistry (IHC) tests, Triple Negative Breast Cancer (TNBC) is diagnosed by the absence of three proteins: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). TNBC accounts for about 10-20% of breast cancers, and there is intense interest in finding new medications because TNBC is very aggressive (fast growing), tends to strike younger women, metastasizes early, shows a high rate of recurrence in the first 2-3 years, and does not respond to hormonal treatment. The problem is exacerbated by the rate of misdiagnosis of ER/PR/HER2 negatives. A team at Yale Cancer Center has confirmed that 10-20% of breast cancers classified as ER-negative are actually positive. If misdiagnosis of PR and HER2 negatives is also considered, the overall misdiagnosis rate could be even higher.
This is a good example of where Next Generation Sequencing can provide complementary information to IHC tests. However, applying RNA-seq data to this problem is not simple. The underlying complexity of gene regulation prevents using the expression levels of ESR1 (ER coding gene), PGR (PR coding gene), and ERBB2 (HER2 coding gene) directly. Indeed, these three genes produce mRNAs regardless of the type of breast cancer. This is a typical problem with mammalian gene expression and its control mechanisms. Hence, we need to consider why the mRNAs from these three genes are not translated into proteins, in the context of the gene regulatory network.
The data used in this study were collected from The Cancer Genome Atlas (TCGA) site. Normalized RNA-seq data from 1,033 breast cancer patients and their corresponding clinical data were downloaded. These two separate data sets were integrated based on patient bar codes, and TNBC patients were labeled according to their IHC test results. The final data set comprises 116 TNBC and 917 non-TNBC cases.
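The integration and labeling step can be sketched in a few lines of Python. This is only an illustration: the bar codes, the field names (`er_status`, `pr_status`, `her2_status`), and the expression values are hypothetical and do not reflect the actual TCGA file layout.

```python
# Hypothetical IHC field names; the real clinical files use their own schema.
IHC_FIELDS = ("er_status", "pr_status", "her2_status")

def is_tnbc(clinical):
    """A patient is labeled TNBC when all three IHC results are negative."""
    return all(clinical[field] == "Negative" for field in IHC_FIELDS)

def integrate(expression_by_barcode, clinical_by_barcode):
    """Join expression and clinical records on the patient bar code."""
    merged = {}
    for barcode, expression in expression_by_barcode.items():
        clinical = clinical_by_barcode.get(barcode)
        if clinical is None:
            continue  # no clinical record, so no label can be assigned
        merged[barcode] = {
            "expression": expression,
            "label": 1 if is_tnbc(clinical) else 0,  # 1 = TNBC, 0 = Non-TNBC
        }
    return merged

# Made-up example records.
expression = {
    "PATIENT-A": {"ESR1": 10.0, "PGR": 5.0, "ERBB2": 7.5},
    "PATIENT-B": {"ESR1": 12.0, "PGR": 4.0, "ERBB2": 8.0},
    "PATIENT-C": {"ESR1": 9.0, "PGR": 6.0, "ERBB2": 7.0},
}
clinical = {
    "PATIENT-A": {"er_status": "Negative", "pr_status": "Negative", "her2_status": "Negative"},
    "PATIENT-B": {"er_status": "Positive", "pr_status": "Negative", "her2_status": "Negative"},
}
dataset = integrate(expression, clinical)
```

Patients without a matching clinical record are dropped here; whether the original workflow did the same is an assumption.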
Figure 1 Expression of the three genes in TNBC and Non-TNBC cancer patients: the expression levels of these genes in TNBC and Non-TNBC are nearly identical.
Figure 1 shows the expression of the three genes in TNBC and Non-TNBC patients; their mRNA expression levels do not differ between the two groups. Hence, we build Machine Learning (ML) models to identify which other genes are responsible for the absence of these three receptor proteins.
The RNA-seq data consist of expression values for 20,532 genes, normalized by the standard 75th-percentile (upper-quartile) method. Here the dimensionality of the data is very large while the total number of instances (1,033) is quite small, so reducing the number of features will be necessary. We also need to keep in mind that the Non-TNBC/TNBC labels carry an error rate of at least 10-20% because of errors in the IHC tests.
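For readers unfamiliar with upper-quartile normalization, the idea can be sketched as follows. The data here were downloaded already normalized, so this is not the procedure used in the study. It is one common variant, assumed for illustration: each sample is divided by the 75th percentile of its nonzero values and multiplied by a fixed scale factor (the value 1000 and the gene names are assumptions).

```python
import statistics

def upper_quartile_normalize(counts, scale=1000.0):
    """Rescale one sample so the 75th percentile of its nonzero values equals `scale`."""
    nonzero = [value for value in counts.values() if value > 0]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    upper_quartile = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return {gene: scale * value / upper_quartile for gene, value in counts.items()}

# Made-up raw values for a single sample.
sample = {"G1": 0, "G2": 10, "G3": 20, "G4": 30, "G5": 40, "G6": 50}
normalized = upper_quartile_normalize(sample)
```

Because every sample is scaled to the same upper quartile, expression values become comparable across patients with different sequencing depths.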
The ML package chosen for this study is Weka, a collection of machine learning algorithms implemented in Java. Version weka-3-9-2 is used.
Finding a suitable ML algorithm for a particular data set is tedious work. Various ML algorithms were tested on the data; however, only the results from the decision tree (DT) and Bayesian Network models are reported here. The accuracy of each model is measured by 10-fold cross validation.
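Weka handles the cross validation internally, but the mechanics are easy to sketch. The Python below splits the 1,033 instances into 10 folds and holds out one fold at a time. Keeping the class proportions similar in every fold (stratification) is an assumption made here because the classes are so imbalanced; the function names are illustrative.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for held_out, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

# Class sizes from the data set: 116 TNBC (1) and 917 Non-TNBC (0).
labels = [1] * 116 + [0] * 917
folds = stratified_folds(labels)
splits = list(train_test_splits(folds))
```

Each model is trained on nine folds and tested on the remaining one, and the reported accuracy aggregates the predictions over all ten held-out folds.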
The first algorithm chosen from Weka is the J48 pruned DT; J48 is an implementation of the C4.5 algorithm. Although a DT is considered a weak classifier, it has several advantages over other algorithms: it is easy to interpret and explain, it performs implicit variable screening (feature selection), and nonlinear relationships between parameters do not affect tree performance.
The overall accuracy of this model is 89.45%: 924 instances were correctly classified and 109 were incorrectly classified. The detailed accuracy by class is shown in Table 1 and the confusion matrix in Table 2.
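Table 1 Detailed accuracy by class for the J48 decision tree

| TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | Class |
|---|---|---|---|---|---|---|
| 0.949 | 0.534 | 0.933 | 0.949 | 0.941 | 0.440 | 0 - Non-TNBC |
| 0.466 | 0.051 | 0.535 | 0.466 | 0.498 | 0.440 | 1 - TNBC |

Table 2 Confusion matrix for the J48 decision tree

| Classified as Non-TNBC | Classified as TNBC | Actual Class |
|---|---|---|
| 870 (True Positive: TP) | 47 (False Negative: FN) - Type II error | Non-TNBC |
| 62 (False Positive: FP) - Type I error | 54 (True Negative: TN) | TNBC |
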
As shown in Table 1, J48 is strongly biased toward the Non-TNBC class: it achieves higher accuracy in predicting the Non-TNBC class than the TNBC class. This is due to the imbalanced class distribution, 116 TNBC instances versus 917 Non-TNBC instances. Although the overall accuracy is 89.45%, this is far from optimal for predicting the TNBC class. The Type I error in Table 2 counts actual TNBC patients incorrectly classified as Non-TNBC, while the Type II error counts Non-TNBC patients classified as TNBC. In this case, both types of errors are equally problematic.
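The per-class figures in Table 1 follow directly from the four counts in Table 2, treating each class in turn as the positive class. The short Python sketch below recomputes them (the function name is illustrative). It also shows why overall accuracy is misleading here: a trivial model that always predicts Non-TNBC would already score 917/1033, about 88.8%.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Per-class metrics for whichever class is treated as positive."""
    recall = tp / (tp + fn)        # also reported as the TP rate
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "recall": recall,
        "fp_rate": fp_rate,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }

# Counts from Table 2 (J48), read once per class.
non_tnbc = binary_metrics(tp=870, fn=47, fp=62, tn=54)  # Non-TNBC as positive
tnbc = binary_metrics(tp=54, fn=62, fp=47, tn=870)      # TNBC as positive

accuracy = (870 + 54) / 1033       # overall accuracy of the J48 model
majority_baseline = 917 / 1033     # accuracy of always predicting Non-TNBC
```

The TNBC recall of roughly 0.47 means that more than half of the actual TNBC patients are missed, which the 89.45% headline accuracy hides.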
Despite the poor performance, a DT can provide valuable information about which genes are important for distinguishing TNBC from Non-TNBC cases, as shown in Figure 2. To classify an unknown instance, the model starts with the FOXA1 gene. If FOXA1 expression is ≤ 1320.1436, C6orf146 expression is examined next. If C6orf146 is expressed (> 0), any instance that falls into this category is classified as TNBC. When C6orf146 is not expressed, the model moves on to the GNAO1 gene for further evaluation.
Figure 2 The model built by J48 Decision Tree
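The first branches of the tree described above translate into plain conditionals. The Python sketch below covers only those branches; the GNAO1 subtree and the branch for FOXA1 above the threshold are not described in the text, so the function returns `None` there instead of guessing. The function name and input layout are illustrative.

```python
FOXA1_THRESHOLD = 1320.1436  # split value reported for the root node

def classify_top_of_tree(expression):
    """Follow the first two splits of the J48 tree.

    Returns "TNBC" when the described path reaches a leaf, and None when the
    decision depends on parts of the tree that are not covered here.
    """
    if expression["FOXA1"] <= FOXA1_THRESHOLD:
        if expression["C6orf146"] > 0:
            return "TNBC"
        return None  # evaluation continues at the GNAO1 node
    return None      # the FOXA1 > threshold branch is not described

# Made-up expression values for one patient.
prediction = classify_top_of_tree({"FOXA1": 250.0, "C6orf146": 3.2})
```

This readability is the main appeal of a decision tree: each prediction can be traced to a handful of gene-level rules.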
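Table 3 Detailed accuracy by class for the Bayesian Network model

| TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | Class |
|---|---|---|---|---|---|---|
| 0.891 | 0.121 | 0.983 | 0.891 | 0.935 | 0.613 | 0 - Non-TNBC |
| 0.879 | 0.109 | 0.505 | 0.879 | 0.642 | 0.613 | 1 - TNBC |

Table 4 Confusion matrix for the Bayesian Network model

| Classified as Non-TNBC | Classified as TNBC | Actual Class |
|---|---|---|
| 817 (True Positive: TP) | 100 (False Negative: FN) - Type II error | Non-TNBC |
| 14 (False Positive: FP) - Type I error | 102 (True Negative: TN) | TNBC |
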
Figure 3 Pathway analysis on transcription factor network from TRANSFAC®: the arrow indicates the control direction of gene expression between the genes. Three genes on the left side are the genes for three receptors whereas the genes on the right side in green are the genes picked by the DT model.
The quality of this TNBC data set is decent, and the Bayesian Network algorithm shows better TNBC classification accuracy than the other ML algorithms tested. The next step is to remove all uninformative features from the data set, since such features can hurt performance. Also, in nature, not all of the 20,531 genes are expressed or differentially expressed. After this feature selection process, a learning curve and an ROC curve need to be examined to verify the soundness of the model. The final goal of this work is to see whether this entire procedure can be automated and to design a system around this type of data and workflow with the existing Dell EMC Ready Solutions.
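As a rough illustration of the feature removal step, the Python sketch below drops genes whose expression never varies across samples, for example genes that are not expressed at all. The article does not say which feature selection method will be used, so this variance filter is only one simple possibility, and the gene names and values are made up.

```python
import statistics

def drop_uninformative_genes(samples, min_variance=0.0):
    """Keep only genes whose expression varies across the samples."""
    gene_names = list(samples[0])
    kept = [
        gene for gene in gene_names
        if statistics.pvariance([sample[gene] for sample in samples]) > min_variance
    ]
    reduced = [{gene: sample[gene] for gene in kept} for sample in samples]
    return reduced, kept

# Made-up samples: GENE_A is never expressed and GENE_B is constant.
samples = [
    {"GENE_A": 0.0, "GENE_B": 5.0, "GENE_C": 120.0},
    {"GENE_A": 0.0, "GENE_B": 5.0, "GENE_C": 340.0},
    {"GENE_A": 0.0, "GENE_B": 5.0, "GENE_C": 95.0},
]
reduced, kept = drop_uninformative_genes(samples)
```

A filter like this only removes the obviously useless features; selecting genes that are differentially expressed between TNBC and Non-TNBC would need a method that also looks at the class labels.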
Article ID: SLN314227
Last Date Modified: 01/08/2019 04:53 PM