Diagnose Triple Negative Breast Cancer – Machine Learning in Genomics #2


Article written by Kihoon Yoon of HPC and AI Innovation Lab in October 2018



In the previous blog, Machine Learning in Genomics #1, we looked at how to decide the optimal amount of training data. In this blog, #2, we look at a real problem in breast cancer diagnosis.

Based on immunohistochemistry (IHC) tests, Triple Negative Breast Cancer (TNBC) is diagnosed by the lack of three proteins: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). TNBC accounts for about 10-20% of breast cancers, and there is intense interest in finding new medications because TNBC is very aggressive (fast growing), tends to strike younger women, metastasizes early, shows a high rate of recurrence in the first 2-3 years, and does not respond to hormonal treatment. The problem is exacerbated by the rate of misdiagnosis of ER/PR/HER2 negatives. A team at Yale Cancer Center has confirmed that 10-20% of breast cancers classified as ER-negative are actually ER-positive. If misdiagnosis of PR and HER2 negatives is also considered, the overall misdiagnosis rate could be even higher.

This is a good example of how Next Generation Sequencing can provide information complementary to IHC tests. However, applying RNA-seq data to this problem is not simple. The underlying complexity of gene regulation prevents the direct use of gene expression data from the ESR1 (ER coding), PGR (PR coding), and ERBB2 (HER2 coding) genes. Indeed, these three genes produce mRNAs regardless of the type of breast cancer. This is a typical problem in mammalian gene expression and its control mechanisms. Hence, we need to consider, in the context of the gene regulation network, why the mRNAs from these three genes are not translated into proteins.

Data Collection

The data used in this study were collected from The Cancer Genome Atlas (TCGA) site. Normalized RNA-seq data from 1,033 breast cancer patients and their corresponding clinical data were downloaded. These two separate data sets were integrated based on patient barcodes, and TNBC patients were labeled according to their IHC test results. The final data set comprises 116 TNBC and 917 non-TNBC cases.
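The integration and labeling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the barcodes, the field names (`er`, `pr`, `her2`), and the status strings are hypothetical placeholders, and real TCGA clinical files use different column names and additional status values.

```python
def label_tnbc(record):
    """Return 1 (TNBC) only when all three IHC results are negative, else 0."""
    return int(all(record[key] == "Negative" for key in ("er", "pr", "her2")))


def integrate(expression, clinical):
    """Join expression profiles and clinical records on the patient barcode."""
    merged = {}
    for barcode, profile in expression.items():
        record = clinical.get(barcode)
        if record is None:
            continue  # no clinical record for this patient, so no label
        merged[barcode] = {"genes": profile, "tnbc": label_tnbc(record)}
    return merged


# Made-up example data (hypothetical barcodes and values).
expression = {
    "PATIENT-A": {"ESR1": 1200.5, "PGR": 300.2, "ERBB2": 5100.0},
    "PATIENT-B": {"ESR1": 1180.0, "PGR": 310.9, "ERBB2": 4990.3},
    "PATIENT-C": {"ESR1": 1250.1, "PGR": 295.4, "ERBB2": 5050.7},
}
clinical = {
    "PATIENT-A": {"er": "Negative", "pr": "Negative", "her2": "Negative"},
    "PATIENT-B": {"er": "Positive", "pr": "Negative", "her2": "Negative"},
}

merged = integrate(expression, clinical)
```

Only patients present in both data sets survive the join, and a patient is labeled TNBC only when all three receptor tests are negative.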

Figure 1 Expression of the three genes in TNBC and Non-TNBC cancer patients: the expression levels of these genes in TNBC and Non-TNBC are nearly identical.

Figure 1 shows the expression of the three genes in TNBC and Non-TNBC patients; their mRNA expression levels do not differ between the two groups. Hence, we build Machine Learning (ML) models to identify which other genes are responsible for suppressing these three receptor proteins.
The RNA-seq data consist of 20,532 gene expression values, normalized by the standard 75th-percentile (upper-quartile) method. The dimensionality of the data is therefore very large while the total number of instances (1,033) is quite small, so reducing the number of features is necessary. Also, we need to keep in mind that the Non-TNBC/TNBC labels carry an error rate of at least 10-20% due to errors in IHC tests.
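For readers unfamiliar with upper-quartile normalization, the idea can be sketched as below. The TCGA data used here were downloaded already normalized, so this is only an illustration: excluding zero counts before taking the 75th percentile and rescaling to 1000 are assumptions of this sketch, and `upper_quartile_normalize` is a hypothetical helper name.

```python
import statistics


def upper_quartile_normalize(counts, scale=1000.0):
    """Scale one sample so the 75th percentile of its nonzero counts equals `scale`."""
    nonzero = [c for c in counts if c > 0]
    if not nonzero:
        return [0.0 for _ in counts]
    if len(nonzero) == 1:
        q75 = nonzero[0]
    else:
        # quantiles(n=4) returns the three quartile cut points; index 2 is the 75th percentile.
        q75 = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return [c / q75 * scale for c in counts]


raw_counts = [0, 10, 20, 30, 40]  # made-up counts for a single sample
normalized = upper_quartile_normalize(raw_counts)
```

After this scaling, samples sequenced at different depths share the same upper-quartile value, which makes expression values comparable across patients.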

Machine Learning Tools

The ML package chosen for this study is Weka, a collection of machine learning algorithms implemented in Java. Version 3.9.2 (weka-3-9-2) is used.

Building Machine Learning Models

Finding a suitable ML algorithm for a particular data set is tedious work. Various ML algorithms were tested on the data; however, only the results from the decision tree (DT) and Bayesian Network models are reported here. The accuracy of each model is measured by 10-fold cross-validation.
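To make the evaluation scheme concrete, the sketch below splits the 1,033 labels into 10 folds while keeping the TNBC/Non-TNBC ratio roughly constant in each fold. It assumes stratified folds and is not Weka's implementation; `stratified_folds` and the seed are hypothetical choices for illustration.

```python
import random


def stratified_folds(labels, k=10, seed=1):
    """Assign instance indices to k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = [1] * 116 + [0] * 917  # 1 = TNBC, 0 = Non-TNBC, as in this data set
folds = stratified_folds(labels)

for test_indices in folds:
    train_indices = [i for fold in folds if fold is not test_indices for i in fold]
    # A model would be trained on train_indices and scored on test_indices here.
```

Each instance is used for testing exactly once, and the reported accuracy is computed over all ten held-out folds. With only 116 TNBC cases, stratification keeps 11 or 12 of them in every fold.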

Decision Tree

The first algorithm chosen from Weka is the J48 pruned DT; J48 is an implementation of the C4.5 algorithm. Although a DT is considered a weak classifier, it has several advantages over other algorithms: it is easy to interpret and explain, it performs implicit variable screening (feature selection), and nonlinear relationships between parameters do not affect its performance.
The overall accuracy of this model is 89.45%: 924 instances were correctly classified and 109 were incorrectly classified. The detailed accuracy by class is shown in Table 1, and the confusion matrix in Table 2.

Table 1 Detailed accuracy by class
TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | Class
0.949 (1) | 0.534 (2) | 0.933 (3) | 0.949 (4) | 0.941 (5) | 0.440 (6) | 0 - Non-TNBC
0.466 (7) | 0.051 | 0.535 | 0.466 | 0.498 | 0.440 | 1 - TNBC
0.894 | 0.480 | 0.889 | 0.894 | 0.891 | 0.440 | Weighted Average
(1) TP Rate = TP / Number of Non-TNBC instances, with respect to the Non-TNBC class.
(2) FP Rate = FP / Number of TNBC instances, with respect to the Non-TNBC class.
(3) Precision = TP / (TP + FP), with respect to the Non-TNBC class.
(4) Recall = TP / (TP + FN), with respect to the Non-TNBC class.
(5) F-Measure = 2 x (Precision x Recall) / (Precision + Recall), with respect to the Non-TNBC class. This is the harmonic mean of precision and recall. We want both precision and recall to be high.
(6) MCC is the Matthews correlation coefficient. It returns a value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 is no better than random prediction, and -1 indicates total disagreement between prediction and observation.
(7) TP Rate = TN / Number of TNBC instances, with respect to the TNBC class. The numbers in Table 2 need to be rearranged in terms of the TNBC class to calculate the numbers in the second row of Table 1.

Table 2 Confusion Matrix with respect to Non-TNBC class
Classified as Non-TNBC | Classified as TNBC | Actual Class
870 (True Positive: TP) | 47 (False Negative: FN) - Type II error | Non-TNBC
62 (False Positive: FP) - Type I error | 54 (True Negative: TN) | TNBC
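The figures in Table 1 can be reproduced from the confusion matrix in Table 2. The sketch below is an illustrative check rather than part of the Weka workflow; `class_metrics` is a hypothetical helper, and the MCC expression is the standard definition. The TNBC row is obtained by rearranging the same four counts so that TNBC is treated as the positive class.

```python
import math


def class_metrics(tp, fn, fp, tn):
    """Per-class metrics from confusion-matrix counts, with one class taken as positive."""
    tp_rate = tp / (tp + fn)      # identical to recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Counts from Table 2, with Non-TNBC as the positive class.
non_tnbc = class_metrics(tp=870, fn=47, fp=62, tn=54)
# Same counts rearranged with TNBC as the positive class.
tnbc = class_metrics(tp=54, fn=62, fp=47, tn=870)

accuracy = (870 + 54) / (870 + 47 + 62 + 54)

for name, row in (("Non-TNBC", non_tnbc), ("TNBC", tnbc)):
    print(name, {key: round(value, 3) for key, value in row.items()})
```

Rounded to three decimals, both rows agree with Table 1, and the overall accuracy is 924 / 1033, or 89.45%.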


As shown in Table 1, J48 is quite biased toward the Non-TNBC class: it achieves higher accuracy in predicting the Non-TNBC class than the TNBC class. This is due to the imbalanced data distribution, 116 TNBC instances versus 917 Non-TNBC instances. Although the overall accuracy is 89.45%, the performance is far from optimal in terms of predicting the TNBC class, since only 54 of the 116 TNBC instances are correctly identified. Type I error in Table 2 represents actual TNBC patients incorrectly classified as Non-TNBC, while Type II error represents Non-TNBC patients classified as TNBC. In this case, both types of errors are equally problematic.
Despite the poor performance, a DT can provide valuable information about which genes are important for distinguishing TNBC from Non-TNBC cases, as shown in Figure 2. The model starts with the FOXA1 gene when classifying an unknown instance. If the FOXA1 gene expression is ≤ 1320.1436, the C6orf146 gene expression is examined. If C6orf146 is expressed (> 0), any instance that falls into this category is classified as TNBC. When C6orf146 is not expressed, the model moves on to the GNAO1 gene for further evaluation.
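The first two decisions of the tree translate directly into nested conditions. The sketch below encodes only the part of the tree described in the text; the remaining branches appear in Figure 2 and are deliberately left unresolved here. The function name and the returned strings are hypothetical.

```python
def classify_top_of_tree(sample):
    """Apply the first two splits of the J48 tree to one patient's expression values."""
    if sample["FOXA1"] <= 1320.1436:
        if sample["C6orf146"] > 0:
            return "TNBC"
        # C6orf146 is not expressed: the tree continues with GNAO1 (thresholds in Figure 2).
        return "continue at GNAO1"
    # The FOXA1 > 1320.1436 branch is not described in the text.
    return "branch not described in the text"


# Made-up expression values for illustration only.
print(classify_top_of_tree({"FOXA1": 850.0, "C6orf146": 2.3}))
print(classify_top_of_tree({"FOXA1": 850.0, "C6orf146": 0.0}))
print(classify_top_of_tree({"FOXA1": 4200.0, "C6orf146": 0.0}))
```

This is what makes a DT easy to interpret: every prediction can be traced back to a short chain of thresholds on named genes.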

Figure 2 The model built by J48 Decision Tree

A Decision Tree model can be self-explanatory and useful for exploring the data. The genes picked by the model are the most indicative genes for differentiating between Non-TNBC and TNBC. Although the overall accuracy of this model is not high enough for practical use, the model provides insight into which genes could be therapeutic targets for TNBC patient treatment. For instance, SRD5A1 is currently used as a therapeutic target for highly aggressive TNBC. Besides SRD5A1, the FOXA1, GNAO1, GAGE13, SMCP, GALP, OR10H2, C13orf34, HIST1H4L, CHST8, and FABP5 genes are directly associated with either breast cancer or TNBC. However, it is necessary to explore how these genes are associated with switching off the production of the three receptor proteins in TNBC, using other analyses such as Gene Set Enrichment Analysis (GSEA) or pathway analysis.

Bayesian Network

A Bayesian Network is a type of probabilistic graphical model that represents conditional dependencies among a set of features through a directed acyclic graph (DAG). This makes it a better-suited ML model for gene expression data, since there are complex dependencies among genes. Unlike a DT, which evaluates one gene at a time, a Bayesian Network evaluates all the features together by modeling the conditional dependence among them.
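As a toy illustration of how a DAG factorizes the joint probability, consider a network with the class node and two discretized genes, where the second gene depends on both the class and the first gene. Everything in this sketch is hypothetical: the gene names, the structure, and all probabilities are invented for illustration and are not taken from the model built in this study.

```python
# Hypothetical DAG: tnbc -> gene_a, tnbc -> gene_b, gene_a -> gene_b
p_class = {0: 0.89, 1: 0.11}  # invented prior for P(tnbc)

# Invented table for P(gene_a | tnbc)
p_gene_a = {
    0: {"high": 0.8, "low": 0.2},
    1: {"high": 0.1, "low": 0.9},
}

# Invented table for P(gene_b | tnbc, gene_a)
p_gene_b = {
    (0, "high"): {"on": 0.1, "off": 0.9},
    (0, "low"): {"on": 0.2, "off": 0.8},
    (1, "high"): {"on": 0.5, "off": 0.5},
    (1, "low"): {"on": 0.7, "off": 0.3},
}


def posterior(gene_a, gene_b):
    """P(tnbc | gene_a, gene_b) from the factorization P(c) * P(a | c) * P(b | c, a)."""
    joint = {
        c: p_class[c] * p_gene_a[c][gene_a] * p_gene_b[(c, gene_a)][gene_b]
        for c in p_class
    }
    total = sum(joint.values())
    return {c: value / total for c, value in joint.items()}


print(posterior("low", "on"))
print(posterior("high", "off"))
```

The key point is that the contribution of `gene_b` depends on the state of `gene_a`, which a tree that tests one gene at a time cannot express as compactly.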
Compared to the DT result (overall accuracy of 89.45%), the overall accuracy of the Bayesian Network is 88.96%: 919 instances are correctly classified whereas 114 are incorrectly classified. Although the DT gives a slightly better overall accuracy, the Bayesian Network should be considered the better model, since it provides more balanced prediction accuracy than the DT, as shown in Table 3 and Table 4. The TP rates for Non-TNBC and TNBC are similar (0.891 and 0.879), and the Recall for TNBC improves significantly (from 0.466 to 0.879) while the Precision stays at a level similar to that of the DT (0.505 versus 0.535).
For skewed (imbalanced) data sets, the overall accuracy of an ML model is often misleading. For example, if an ML model is configured to classify all instances as Non-TNBC, it still achieves an overall accuracy of 88.77% on the TNBC data used in this study; that is, the model correctly classifies the 917 Non-TNBC instances out of 1,033 and misclassifies all 116 TNBC instances. This behavior is not acceptable in most cases and makes the model completely useless. For many existing ML models, it is extremely hard to minimize both Type I and Type II errors and to improve both Precision and Recall at the same time; usually, they tend to move in opposite directions.
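The arithmetic behind this baseline is short enough to check directly. The sketch below uses only the class counts reported for this data set; the variable names are illustrative.

```python
n_non_tnbc = 917
n_tnbc = 116
total = n_non_tnbc + n_tnbc

# A model that always predicts Non-TNBC gets every Non-TNBC case right
# and every TNBC case wrong.
baseline_accuracy = n_non_tnbc / total
baseline_tnbc_recall = 0 / n_tnbc

print(f"accuracy = {baseline_accuracy:.2%}, TNBC recall = {baseline_tnbc_recall:.2%}")
```

Both models reported here exceed this baseline by less than one percentage point in overall accuracy (89.45% and 88.96% versus 88.77%), which is why the per-class figures matter more than the headline number.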

Table 3 Detailed accuracy by class
TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | Class
0.891 | 0.121 | 0.983 | 0.891 | 0.935 | 0.613 | 0 - Non-TNBC
0.879 | 0.109 | 0.505 | 0.879 | 0.642 | 0.613 | 1 - TNBC
0.890 | 0.119 | 0.929 | 0.890 | 0.902 | 0.613 | Weighted Average

Table 4 Confusion Matrix
Classified as Non-TNBC | Classified as TNBC | Actual Class
817 (True Positive: TP) | 100 (False Negative: FN) - Type II error | Non-TNBC
14 (False Positive: FP) - Type I error | 102 (True Negative: TN) | TNBC
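The Weighted Average row in Table 3 can be reproduced by weighting each class's metric by its number of instances (917 Non-TNBC, 116 TNBC). The sketch below does this for the Bayesian Network counts in Table 4; `per_class` and `weighted_average` are hypothetical helper names, and the check is illustrative rather than part of the Weka output.

```python
def per_class(tp, fn, fp, tn):
    """TP rate, FP rate, precision and F-measure with one class taken as positive."""
    tp_rate = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return {"tp_rate": tp_rate, "fp_rate": fp_rate,
            "precision": precision, "f_measure": f_measure}


def weighted_average(rows, supports):
    """Average each metric over classes, weighted by the number of instances per class."""
    total = sum(supports)
    return {
        key: sum(row[key] * n for row, n in zip(rows, supports)) / total
        for key in rows[0]
    }


# Counts from Table 4.
non_tnbc = per_class(tp=817, fn=100, fp=14, tn=102)   # Non-TNBC as positive class
tnbc = per_class(tp=102, fn=14, fp=100, tn=817)       # TNBC as positive class

average = weighted_average([non_tnbc, tnbc], supports=[917, 116])
print({key: round(value, 3) for key, value in average.items()})
```

Because Non-TNBC cases make up almost 89% of the data, the weighted row is dominated by the Non-TNBC figures; the TNBC precision of 0.505 is barely visible in the weighted precision of 0.929.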

Nonetheless, among all the ML algorithms tested on the TNBC data, the Bayesian Network shows the best performance for both the Non-TNBC and TNBC classes. This is somewhat expected, since the features in the data are intertwined in complex ways.

What is next? Sanity Check - Pathway Analysis

In many cases, ML algorithms pick genes based solely on the supplied data set. How, then, do we ensure that the model is reasonable from a biological standpoint, not just in terms of accuracy? There are several ways to add value to a model built by ML algorithms. Unlike a DT, most other ML algorithms do not make clear how the model is built. However, it is useful to rank the features (genes in this example) with the same algorithm used to build the model. Further, the features can be examined one at a time, by removing each from the model, to check how important it is. Features whose removal does not decrease the accuracy are dropped from the model. This procedure is called feature selection. Once the best set of features is decided, further analysis such as GSEA and/or pathway analysis can be performed with the selected genes.
Here, the result of a pathway analysis is presented for the genes selected by the DT model. The tool used for this analysis is Cytoscape, with transcription factor target data from TRANSFAC® as the pathway data (Figure 3). Nine genes (green diamonds) out of 23 are direct targets of transcription factors that control the expression of these nine genes as well as the ERBB2 (HER2), ESR1 (ER), and PGR (PR) genes. Although there is no evidence that these genes are directly associated with the down-regulation of translation of the three receptors, the nine genes are closely linked, through their expression, to the transcription factors controlling the three receptor genes.
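The feature-removal procedure described above (drop a feature whenever its removal does not decrease accuracy) can be sketched as a simple backward elimination loop. This is a minimal sketch under stated assumptions: `backward_eliminate` is a hypothetical helper, and `toy_accuracy` is an invented stand-in for a real evaluation, which would retrain the model and measure cross-validated accuracy on the remaining genes. `GENE_X` and `GENE_Y` are placeholder names for uninformative genes.

```python
def backward_eliminate(features, evaluate):
    """Drop each feature whose removal does not decrease the evaluation score."""
    selected = list(features)
    baseline = evaluate(selected)
    for feature in list(features):
        if len(selected) == 1:
            break  # keep at least one feature
        trial = [f for f in selected if f != feature]
        score = evaluate(trial)
        if score >= baseline:
            selected, baseline = trial, score
    return selected


# Invented contributions: only two genes carry signal in this toy example.
contribution = {"FOXA1": 0.05, "GNAO1": 0.03}


def toy_accuracy(subset):
    """Stand-in for cross-validated accuracy on the given gene subset."""
    return 0.80 + sum(contribution.get(gene, 0.0) for gene in subset)


selected = backward_eliminate(["FOXA1", "GENE_X", "GNAO1", "GENE_Y"], toy_accuracy)
print(selected)
```

With roughly 20,000 candidate genes, one full evaluation per removed feature is expensive, so in practice a ranking step is usually applied first to shrink the candidate list.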

Due to the limitation of pathway information, the graph shown in Figure 3 cannot be completed with the publicly available pathway data. However, the overall trend in the graph shows that the selected genes respond to changes in the ESR1 (ER) and PGR (PR) receptors, with some feedback loops (edges in green) through CYP26A1 and PPARA (edges in blue). Also, these two receptors relate to the ERBB2 (HER2) receptor through CEBPA, PPARA, and CYP26A1 (edges in purple).

Figure 3 Pathway analysis on the transcription factor network from TRANSFAC®: arrows indicate the direction of gene expression control between genes. The three genes on the left side are the genes for the three receptors, whereas the genes on the right side in green are the genes picked by the DT model.

On-going work

The quality of this TNBC data set is decent, and the Bayesian Network algorithm shows the best TNBC classification accuracy among the ML algorithms tested. The next step is to remove all uninformative features from the data set, since such features can hurt performance. Also, in nature, not all of the 20,531 genes are expressed or differentially expressed. After this feature selection process, a learning curve and a ROC curve need to be examined to ensure the health of the model. The final goal of this work is to see whether this entire procedure can be automated and to design a system around this type of data and workflow with the existing Dell EMC Ready Solutions.


Quick Tips content is self-published by the Dell Support Professionals who resolve issues daily. In order to achieve a speedy publication, Quick Tips may represent only partial solutions or work-arounds that are still in development or pending further proof of successfully resolving an issue. As such Quick Tips have not been reviewed, validated or approved by Dell and should be used with appropriate caution. Dell shall not be liable for any loss, including but not limited to loss of data, loss of profit or loss of revenue, which customers may incur by following any procedure or advice set out in the Quick Tips.

Article ID: SLN314227

Last Date Modified: 01/08/2019 04:53 PM

