Existing algorithms generate one solution for a biomarker detection dataset. This protocol demonstrates the existence of multiple similarly effective solutions and presents a user-friendly software to help biomedical researchers investigate their datasets for the proposed challenge. Computer scientists may also provide this feature in their biomarker detection algorithms.
Biomarker detection is one of the more important biomedical questions for high-throughput 'omics' researchers, and almost all existing biomarker detection algorithms generate one biomarker subset with the optimized performance measurement for a given dataset. However, a recent study demonstrated the existence of multiple biomarker subsets with similarly effective or even identical classification performances. This protocol presents a simple and straightforward methodology for detecting biomarker subsets with binary classification performances, better than a user-defined cutoff. The protocol consists of data preparation and loading, baseline information summarization, parameter tuning, biomarker screening, result visualization and interpretation, biomarker gene annotations, and result and visualization exportation at publication quality. The proposed biomarker screening strategy is intuitive and demonstrates a general rule for developing biomarker detection algorithms. A user-friendly graphical user interface (GUI) was developed using the programming language Python, allowing biomedical researchers to have direct access to their results. The source code and manual of kSolutionVis can be downloaded from http://www.healthinformaticslab.org/supp/resources.php.
Binary classification, one of the most commonly investigated and challenging data mining problems in the biomedical area, is used to build a classification model trained on two groups of samples with the most accurate discrimination power1,2,3,4,5,6,7. However, the big data generated in the biomedical field has the inherent "large p small n" paradigm, with the number of features usually much larger than the number of samples6,8,9. Therefore, biomedical researchers have to reduce the feature dimension before utilizing the classification algorithms to avoid the overfitting problem8,9. Diagnosis biomarkers are defined as a subset of detected features separating patients of a given disease from healthy control samples10,11. Patients are usually defined as the positive samples, and the healthy controls are defined as the negative samples12.
Recent studies have suggested that there exists more than one solution with identical or similarly effective classification performances for a biomedical dataset5. Almost all the feature selection algorithms are deterministic algorithms, producing only one solution for the same dataset. Genetic algorithms may simultaneously generate multiple solutions with similar performances, but they still try to select one solution with the best fitness function as the output for a given dataset13,14.
Feature selection algorithms can be roughly grouped as either filters or wrappers12. A filter algorithm chooses the top-k features ranked by their significant individual association with the binary class labels based on the assumption that features are independent of each other15,16,17. Although this assumption does not hold true for almost all real-world datasets, the heuristic filter rule performs well in many cases, for instance, the mRMR (Minimum Redundancy and Maximum Relevance) algorithm, the Wilcoxon test based feature filtering (WRank) algorithm, and the ROC (Receiver operating characteristic) plot based filtering (ROCRank) algorithm. mRMR, is an efficient filter algorithm because it approximates the combinatorial estimation problem with a series of much smaller problems, comparing to the maximum-dependency feature selection algorithm, each of which only involves two variables, and therefore uses pairwise joint probabilities which are more robust18,19. However, mRMR may underestimate the usefulness of some features as it does not measure the interactions between features which can increase relevancy, and thus misses some feature combinations that are individually useless but are useful only when combined. The WRank algorithm calculates a non-parametric score of how discriminative a feature is between two classes of samples, and is known for its robustness for outliers20,21. Furthermore, the ROCRank algorithm evaluates how significant the Area Under the ROC Curve (AUC) of a particular feature is for the investigated binary classification performance22,23.
On the other hand, a wrapper evaluates the pre-defined classifier's performance of a given feature subset, iteratively generated by a heuristic rule, and creates the feature subset with the best performance measurement24. A wrapper generally outperforms a filter in the classification performance but runs slower25. For example, the Regularized Random Forest (RRF)26,27 algorithm uses a greedy rule, by evaluating the features on a subset of the training data at each random forest node, whose feature importance scores are evaluated by the Gini index. The choice of a new feature will be penalized if its information gain does not improve that of the chosen features. Additionally, the Prediction Analysis for Microarrays (PAM)28,29 algorithm, also a wrapper algorithm, calculates a centroid for each of the class labels, and then selects features to shrink the gene centroids toward the overall class centroid. PAM is robust for outlying features.
Multiple solutions with the top classification performance may be necessary for any given dataset. Firstly, the optimization goal of a deterministic algorithm is defined by a mathematical formula, e.g., minimum error rate30, which is not necessarily ideal for biological samples. Secondly, a dataset may have multiple, significantly different, solutions with similar effective or even identical performances. Almost all existing feature selection algorithms will randomly select one of these solutions as the output31.
This study will introduce an informatics analytic protocol for generating multiple feature selection solutions with similar performances for any given binary classification dataset. Considering that most biomedical researchers are not familiar with informatic techniques or computer coding, a user-friendly graphical user interface (GUI) was developed to facilitate the rapid analysis of biomedical binary classification datasets. The analytic protocol consists of data loading and summarizing, parameter tuning, pipeline execution, and result interpretations. With a simple click, the researcher is able to generate the biomarker subsets and publication-quality visualization plots. The protocol has been tested using the transcriptomes of two binary classification datasets of Acute Lymphoblastic Leukemia (ALL), i.e., ALL1 and ALL212. The datasets of ALL1 and ALL2 were downloaded from the Broad Institute Genome Data Analysis Center, available at http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. ALL1 contains 128 samples with 12,625 features. Of these samples, 95 are B-cell ALL and 33 are T-cell ALL. ALL2 includes 100 samples with 12,625 features as well. Of these samples, there are 65 patients that suffered relapse and 35 patients that did not. ALL1 was an easy binary classification dataset, with a minimum accuracy of four filters and four wrappers being 96.7%, and 6 of the 8 feature selection algorithms achieving 100%12. While ALL2 was a more difficult dataset, with the above 8 feature selection algorithms achieving no better than 83.7% accuracy12. This best accuracy was achieved with 56 features detected by the wrapper algorithm, Correlation-based Feature Selection (CFS).
NOTE: The following protocol describes the details of the informatics analytic procedure and pseudo-codes of the major modules. The automatic analysis system was developed using Python version 3.6.0 and the Python modules pandas, abc, numpy, scipy, sklearn, sys, PyQt5, sys, mRMR, math and matplotlib. The materials used in this study are listed in the Table of Materials.
1. Prepare the Data Matrix and Class Labels
2. Load the Data Matrix and Class Labels
3. Summarize and Display the Baseline Statistics of the Dataset
4. Determine the Class Labels and the Number of Top-ranked Features
5. Tune System Parameters for Different Performances
6. Run the Pipeline and Produce the INTERACTIVE VISUALIZED RESULTS
7. Interpret the 3D Scatter Plots-Visualize and Interpret the Feature Subsets with Similarly Effective Binary Classification Performances Using 3D Scatter Plots
8. Find Gene Annotations and Their Associations with Human Diseases
NOTE: Steps 8 to 10 will illustrate how to annotate a gene from the sequence level of both DNA and protein. Firstly, the gene symbol of each biomarker ID from the above steps will be retrieved from the database DAVID32, and then two representative web servers will be used to analyze this gene symbol from the levels of DNA and protein, respectively. The server GeneCard provides a comprehensive functional annotation of a given gene symbol, and the Online Mendelian Inheritance in Man database (OMIM) provides the most comprehensive curation of disease-gene associations. The server UniProtKB is one of the most comprehensive protein database, and the server Group-based Prediction System (GPS) predicts the signaling phosphorylation’s for a very large list of kinases.
9. Annotate the Encoded Proteins and the Post-Translational Modifications
10. Annotate Protein-Protein Interactions and Their Enriched Functional Modules
11. Export the Generated Biomarker Subsets and the Visualization Plots
The goal of this workflow (Figure 6) is to detect multiple biomarker subsets with similar efficiencies for a binary classification dataset. The whole process is illustrated by two example datasets ALL1 and ALL2 extracted from a recently-published biomarker detection study12,48. A user may install kSolutionVis by following the instructions in the supplementary materials.
Dataset ALL1 profiled 12 625 transcriptomic features of 95 B-cell and 33 T-cell ALL patient blood samples. While dataset ALL2 detected the expression levels of 12 625 transcriptomic features for 65 ALL patients who relapsed after the treatment and 35 ALL patients who did not. For the user's convenience, both transcriptomic datasets and their class labels are provided in version 1.4 of the software. Both datasets are in the subdirectory "data" of the software's source code directory.
The two datasets, ALL1 and ALL2, were formatted as .csv files and loaded into the software using the Load data matrix and Load class labels buttons, as shown in Figure 7A–B. Figure 7A shows that all 128 samples with 12 625 features were loaded, and all 128 samples also have class labels. The final data matrix has 95 negative samples (B-cell ALL) and 33 positive samples (T-cell ALL). Additionally, users may also determine which class label is the positive class label (Figure 7A, bottom). If the class label file defines more than two classes, users may want to choose which two class labels to investigate. Similar operations were also conducted for the difficult dataset ALL2, as shown in Figure 7B.
The value distributions of the features in the data matrix may be investigated by clicking the button Summarize while searching for a user-specific keyword in the feature names, as shown in Figure 8. Figure 8A illustrates the histogram of feature 1012_at in the dataset ALL1. Furthermore, as seen in Figure 8B, the same feature 1012_at has a similar distribution of expression in both datasets. If no keyword was specified by the user, some feature names would be listed to help the users to decide which features to summarize.
The easier dataset ALL1 screened the top 10 ranked features (pTopX) for biomarker subsets with the pMeasurement Acc ≥ 0.90 (pCutoff). After clicking the button Run, the algorithm was executed, and the results as seen in Figure 9A, were illustrated in the bottom part of the software after a few seconds. From this, 120 qualified biomarker subsets were detected and listed in the left table of Figure 9A. ALL1 was an easy-to-discriminate dataset, in that it has 57 triplet biomarker subsets with 100% in Acc. This protocol emphasizes the existence of multiple similarly effective solutions for a binary classification problem. Therefore, the first 3D scatter plot may illustrate more than 10 (parameter piFSNum) biomarker subsets, if they have the classification performance Acc (parameter pMeasurement) ≥ that of the top 10 ranked (parameter piFSNum) biomarker subset. The user may also choose to display fewer biomarker subsets by changing the parameter piCutoff in the parameter box above the table in Figure 9A. The manual tuning of the 3D plots may be found in the section Manual tuning of the 3D dot plots in the supplementary material.
Furthermore, all the results may be exported as external files for further analysis by clicking the button Export the Table under the table or scatter plots, as shown in Figure 9.
The first biomarker subset (38319_at, 38147_at, and 33238_at) for the dataset ALL1 was chosen for functional investigations, as shown in Figure 9A. The search module of ENSEMBL (http://useast.ensembl.org/Multi/Search/New?db=core) annotated these three features as a gene cluster of differentiation 3 delta (CD3D, 38319_at), Signaling Lymphocytic Activation Molecule-Associated gene (SH2D1A, 38147_at) and Lymphocyte Cell-Specific Protein-Tyrosine Kinase (LCK, 33238_at). Furthermore, the gene-disease association database OMIM37,40 suggested that the gene CD3D encodes the delta subunit of the T-cell antigen receptor complex and is involved in the 11q23 translocations frequently observed in acute leukemia in humans49,50. OMIM also suggested that genomic mutations within the gene SH2D1A in the chromosome region of Xq25 may be associated with B-cell leukemia51,52. Additionally, OMIM also highlighted a possible T-cell ALL associated fusion event of the LCK and beta T-cell receptor (TCRB)53. Users may investigate other functional aspects of these biomarkers with their gene symbols, e.g., gene function annotations in Entrez Gene36, protein function annotations in UniProtKB38 or Pfam41, 3D protein structures in PDB/PDB_REDO35, and PTM residues in GPS7,42,43,44. The interacting sub-network (database string47) and enriched functional modules (database David32) may also be screened for these biomarkers as an entirety. Various other databases or web servers may also facilitate the annotations and in silico predictions using the symbols or primary gene/protein sequences of these genes.
As seen in Table 2, the necessity of detecting more than one solution with identical or similarly effective performances is evident, with 57 groups of features with binary classification accuracies of 100% between B-cell and T-cell ALL samples. These particular biomarker subsets were called the perfect solutions. Quite a few biomarkers appeared in these perfect solutions repeatedly, suggesting that they may represent the key differences, on the molecular level, between B- and T-cell ALL. If the biomarker detection algorithm stops at detecting the first perfect solution of three genes CD3D/SH2D1A/LCK, another perfect solution CD74/HLA-DPB1/PRKCQ will be missed. For example, HLA-DPB1 is known to be significantly associated with the pediatric T-cell ALL but not B-cell ALL54.
The three features of the first biomarker subset of ALL2 were chromatin assembly factor 1 subunit B (CHAF1B, 36912_at), exonuclease 1 (EXO1, 36041_at), and signal transducer and activator of transcription 6 (STAT6, 41222_at). CHAF1B was observed to be highly expressed in leukemia cell lines and the antibody against the CHAF1B encoded protein was significantly developed in acute myeloid leukemia (AML) patients55. EXO1 was lost in some cases of acute leukemia56, and upregulated in the leukemia cell line HL-60[R]. It also has been found to negatively regulate the alternative lengthening of telomeres (ALT) pathway, which facilitated the formation of ALT-associated PML (promyelocytic leukemia) bodies (APBs)57. STAT6 was phosphorylated to activate the pro-survival and proliferative signaling pathway in the cases of relapsed AML58. Taken together, the three genes were associated with the development and relapse of leukemia, but no explicit evidence was published on their associations with the ALL relapse. This may represent an interesting topic for further investigation.
The same annotation procedure may be conducted on any biomarker subset for ALL1 and ALL2. The three biomarkers investigated in the above section were not identified as relapse biomarkers in the dataset ALL2, as shown in Figure 9B. This suggests that biomarkers are phenotype-specific, which is another major challenge for biomarker detection, alongside the existence of multiple similarly effective solutions.
Some technical modules were implemented and described here for interested users. The error handling module provides informative messages for the user when errors occur during the execution of the software. The main error messages are listed and explained in "Error messages" in the supplementary material. A parallel calculation of the biomarkers was implemented for computers with more than one CPU core. The detailed improvements to the running time may be found in "Parallel running time" in the supplementary material. The data suggests that the usage of more CPU cores may not improve the running time due to the cost of switching between different CPU cores.
Figure 1: The example dataset extracted from the transcriptome dataset ALL1 has the first six features of the first nine samples of ALL1. The data matrix was formatted in (a) the visualization form, (b) the TAB-delimited text format file, and (c) the comma-delimited text format file. (d) The class label data was formatted in the visualization form. Due to the TAB character is invisible, it is illustrated as [TAB] in (b). The column Platform gives the microarray platform Affy in (b), and is not a required data column. Please click here to view a larger version of this figure.
Figure 2: Graphical user interface of the software. The baseline statistics are summarized in the top left box. Users may search for features of interest and investigate the value distributions in the two top right boxes. All the parameters for biomarker detection procedure may be tuned in the middle horizontal bar. All the biomarker subsets and their corresponding visualized distributions may be found in the bottom part. Please click here to view a larger version of this figure.
Figure 3: Biomarker subsets and their visualizations generated. Users may further refine the table and two 3D scatter plots using the parameters piCutoff and piFSNum. Please click here to view a larger version of this figure.
Figure 4: Gene annotations of the feature IDs detected in this study. Take the three feature IDs 38319_at/38147_at/33238_at of the first biomarker subset of the dataset ALL1. (a) Get the ID conversion module by clicking the link Gene ID Conversion. (b) Input the feature IDs in the red box 1, choose the feature type in the red box 2 (default "AFFYMETRIX_3PRIME_IVT_ID" is correct for this study), choose Gene List in the red box 3, and click Submit List in the red box 4. (c) Get all the functional annotations in this page and click Show Gene List to get the gene symbols of these queried features. (d) Get the gene symbols of the queried feature IDs. Please click here to view a larger version of this figure.
Figure 5: Annotations and enrichment analysis of the detected feature subsets. (a) Gene annotations from Gene Card. (b) OMIM describes the disease associations of each feature/gene. (c) Annotate the protein encoded by the gene of interest in the database UniProtKB. (d) Predict the tyrosine phosphorylation residues in the given protein using the online tool GPS. A red box was added to show the user where to click to input the query data. The primary sequence of the example protein CD3D may be retrieved as the FASTA format from the red box in (c), and input in the query window by click the red box in (d). Please click here to view a larger version of this figure.
Figure 6: Workflow of kSolutionVis. Each module of the software was described in the above protocol. Please click here to view a larger version of this figure.
Figure 7: Baseline statistics of the two representative datasets. The numbers of samples, features and classes in (a) ALL1 and (b) ALL2 are calculated. The file sizes of the data matrix and class labels are also detected. And a new data matrix is extracted from the samples with class labels. Please click here to view a larger version of this figure.
Figure 8: Histogram visualization of the feature 1012_at in the two datasets. Both baseline statistics and histogram were generated for (a) ALL1 and (b) ALL2. Please click here to view a larger version of this figure.
Figure 9: Biomarker subsets and the scatter plots of the two datasets. Users may change the parameters in the second row of parameter boxes to further refine the lists of biomarker subsets and 3D scatter plots for the datasets (a) ALL1 and (b) ALL2. Please click here to view a larger version of this figure.
Web site | Link | Functionality |
GeneCards | http://www.genecards.org/cgi-bin/carddisp.pl?gene=CD3D | Gene annotation |
OMIM | https://omim.org/entry/186790?search=CD3D&highlight=cd3d | Gene-disease association |
UniProtKB | http://www.uniprot.org/uniprot/P04234 | Protein annotation |
GPS | http://gps.biocuckoo.org/ | Protein’s PTM prediction |
String | https://string-db.org/ | Protein-protein interaction |
David | https://david.ncifcrf.gov/ | Gene Set Enrichment Analysis |
Table 1. Websites for annotating and analyzing the detected biomarkers. A list of useful online tools that help annotate the detected biomarkers.
f1 | f2 | f3 | Acc | Symbol1 | Symbol2 | Symbol3 |
38319_at | 38147_at | 33238_at | 1.0000 | CD3D | SH2D1A | LCK |
33238_at | 35016_at | 37039_at | 1.0000 | LCK | CD74 | HLA-DRA |
38147_at | 33238_at | 35016_at | 1.0000 | SH2D1A | LCK | CD74 |
38147_at | 33238_at | 2059_s_at | 1.0000 | SH2D1A | LCK | LCK |
38147_at | 33238_at | 37039_at | 1.0000 | SH2D1A | LCK | HLA-DRA |
38147_at | 33238_at | 38095_i_at | 1.0000 | SH2D1A | LCK | HLA-DPB1 |
38147_at | 33238_at | 33039_at | 1.0000 | SH2D1A | LCK | TRAT1 |
38147_at | 35016_at | 2059_s_at | 1.0000 | SH2D1A | CD74 | LCK |
38147_at | 35016_at | 33039_at | 1.0000 | SH2D1A | CD74 | TRAT1 |
38147_at | 35016_at | 38949_at | 1.0000 | SH2D1A | CD74 | PRKCQ |
38147_at | 2059_s_at | 37039_at | 1.0000 | SH2D1A | LCK | HLA-DRA |
38147_at | 2059_s_at | 38095_i_at | 1.0000 | SH2D1A | LCK | HLA-DPB1 |
38147_at | 37039_at | 33039_at | 1.0000 | SH2D1A | HLA-DRA | TRAT1 |
38147_at | 37039_at | 38949_at | 1.0000 | SH2D1A | HLA-DRA | PRKCQ |
38319_at | 38147_at | 35016_at | 1.0000 | CD3D | SH2D1A | CD74 |
38147_at | 38833_at | 38949_at | 1.0000 | SH2D1A | HLA-DPA1 | PRKCQ |
33238_at | 35016_at | 33039_at | 1.0000 | LCK | CD74 | TRAT1 |
38319_at | 38833_at | 38949_at | 1.0000 | CD3D | HLA-DPA1 | PRKCQ |
33238_at | 35016_at | 38949_at | 1.0000 | LCK | CD74 | PRKCQ |
33238_at | 2059_s_at | 37039_at | 1.0000 | LCK | LCK | HLA-DRA |
33238_at | 37039_at | 38095_i_at | 1.0000 | LCK | HLA-DRA | HLA-DPB1 |
33238_at | 37039_at | 33039_at | 1.0000 | LCK | HLA-DRA | TRAT1 |
33238_at | 37039_at | 38949_at | 1.0000 | LCK | HLA-DRA | PRKCQ |
33238_at | 38095_i_at | 38949_at | 1.0000 | LCK | HLA-DPB1 | PRKCQ |
33238_at | 38833_at | 38949_at | 1.0000 | LCK | HLA-DPA1 | PRKCQ |
33238_at | 33039_at | 38949_at | 1.0000 | LCK | TRAT1 | PRKCQ |
35016_at | 2059_s_at | 33039_at | 1.0000 | CD74 | LCK | TRAT1 |
35016_at | 2059_s_at | 38949_at | 1.0000 | CD74 | LCK | PRKCQ |
35016_at | 38095_i_at | 38949_at | 1.0000 | CD74 | HLA-DPB1 | PRKCQ |
2059_s_at | 37039_at | 33039_at | 1.0000 | LCK | HLA-DRA | TRAT1 |
2059_s_at | 38095_i_at | 38949_at | 1.0000 | LCK | HLA-DPB1 | PRKCQ |
2059_s_at | 38833_at | 38949_at | 1.0000 | LCK | HLA-DPA1 | PRKCQ |
38319_at | 33039_at | 38949_at | 1.0000 | CD3D | TRAT1 | PRKCQ |
38147_at | 38095_i_at | 38949_at | 1.0000 | SH2D1A | HLA-DPB1 | PRKCQ |
38319_at | 33238_at | 38833_at | 1.0000 | CD3D | LCK | HLA-DPA1 |
38319_at | 2059_s_at | 38833_at | 1.0000 | CD3D | LCK | HLA-DPA1 |
38319_at | 33238_at | 33039_at | 1.0000 | CD3D | LCK | TRAT1 |
38319_at | 33238_at | 38095_i_at | 1.0000 | CD3D | LCK | HLA-DPB1 |
38319_at | 33238_at | 37039_at | 1.0000 | CD3D | LCK | HLA-DRA |
38319_at | 35016_at | 38833_at | 1.0000 | CD3D | CD74 | HLA-DPA1 |
38319_at | 33238_at | 2059_s_at | 1.0000 | CD3D | LCK | LCK |
38319_at | 35016_at | 33039_at | 1.0000 | CD3D | CD74 | TRAT1 |
38319_at | 33238_at | 35016_at | 1.0000 | CD3D | LCK | CD74 |
38319_at | 35016_at | 38949_at | 1.0000 | CD3D | CD74 | PRKCQ |
38319_at | 2059_s_at | 37039_at | 1.0000 | CD3D | LCK | HLA-DRA |
38319_at | 38147_at | 38949_at | 1.0000 | CD3D | SH2D1A | PRKCQ |
38319_at | 38147_at | 33039_at | 1.0000 | CD3D | SH2D1A | TRAT1 |
38319_at | 33238_at | 38949_at | 1.0000 | CD3D | LCK | PRKCQ |
38319_at | 2059_s_at | 38095_i_at | 1.0000 | CD3D | LCK | HLA-DPB1 |
38319_at | 38147_at | 38833_at | 1.0000 | CD3D | SH2D1A | HLA-DPA1 |
38319_at | 2059_s_at | 33039_at | 1.0000 | CD3D | LCK | TRAT1 |
38319_at | 38147_at | 38095_i_at | 1.0000 | CD3D | SH2D1A | HLA-DPB1 |
38319_at | 37039_at | 33039_at | 1.0000 | CD3D | HLA-DRA | TRAT1 |
38319_at | 38147_at | 37039_at | 1.0000 | CD3D | SH2D1A | HLA-DRA |
38319_at | 38147_at | 2059_s_at | 1.0000 | CD3D | SH2D1A | LCK |
38319_at | 2059_s_at | 38949_at | 1.0000 | CD3D | LCK | PRKCQ |
38319_at | 35016_at | 2059_s_at | 1.0000 | CD3D | CD74 | LCK |
2059_s_at | 37039_at | 38095_i_at | 0.9922 | LCK | HLA-DRA | HLA-DPB1 |
35016_at | 33039_at | 38949_at | 0.9922 | CD74 | TRAT1 | PRKCQ |
2059_s_at | 37039_at | 38949_at | 0.9922 | LCK | HLA-DRA | PRKCQ |
35016_at | 2059_s_at | 37039_at | 0.9922 | CD74 | LCK | HLA-DRA |
35016_at | 37039_at | 38949_at | 0.9922 | CD74 | HLA-DRA | PRKCQ |
35016_at | 38833_at | 38949_at | 0.9922 | CD74 | HLA-DPA1 | PRKCQ |
2059_s_at | 33039_at | 38949_at | 0.9922 | LCK | TRAT1 | PRKCQ |
37039_at | 38833_at | 38949_at | 0.9922 | HLA-DRA | HLA-DPA1 | PRKCQ |
37039_at | 33039_at | 38949_at | 0.9922 | HLA-DRA | TRAT1 | PRKCQ |
38319_at | 38095_i_at | 38949_at | 0.9922 | CD3D | HLA-DPB1 | PRKCQ |
33238_at | 37039_at | 38833_at | 0.9922 | LCK | HLA-DRA | HLA-DPA1 |
38095_i_at | 33039_at | 38949_at | 0.9922 | HLA-DPB1 | TRAT1 | PRKCQ |
33238_at | 2059_s_at | 38949_at | 0.9922 | LCK | LCK | PRKCQ |
38319_at | 38833_at | 33039_at | 0.9922 | CD3D | HLA-DPA1 | TRAT1 |
38833_at | 33039_at | 38949_at | 0.9922 | HLA-DPA1 | TRAT1 | PRKCQ |
38147_at | 33039_at | 38949_at | 0.9922 | SH2D1A | TRAT1 | PRKCQ |
38319_at | 37039_at | 38833_at | 0.9922 | CD3D | HLA-DRA | HLA-DPA1 |
38147_at | 2059_s_at | 38949_at | 0.9922 | SH2D1A | LCK | PRKCQ |
38147_at | 38095_i_at | 38833_at | 0.9922 | SH2D1A | HLA-DPB1 | HLA-DPA1 |
38147_at | 33238_at | 38949_at | 0.9922 | SH2D1A | LCK | PRKCQ |
38147_at | 2059_s_at | 33039_at | 0.9922 | SH2D1A | LCK | TRAT1 |
38319_at | 37039_at | 38949_at | 0.9922 | CD3D | HLA-DRA | PRKCQ |
38319_at | 38095_i_at | 38833_at | 0.9922 | CD3D | HLA-DPB1 | HLA-DPA1 |
38147_at | 2059_s_at | 38833_at | 0.9922 | SH2D1A | LCK | HLA-DPA1 |
33238_at | 35016_at | 2059_s_at | 0.9922 | LCK | CD74 | LCK |
38319_at | 35016_at | 38095_i_at | 0.9922 | CD3D | CD74 | HLA-DPB1 |
33238_at | 35016_at | 38095_i_at | 0.9922 | LCK | CD74 | HLA-DPB1 |
38319_at | 35016_at | 37039_at | 0.9922 | CD3D | CD74 | HLA-DRA |
38147_at | 33238_at | 38833_at | 0.9922 | SH2D1A | LCK | HLA-DPA1 |
38147_at | 37039_at | 38095_i_at | 0.9844 | SH2D1A | HLA-DRA | HLA-DPB1 |
38147_at | 35016_at | 38833_at | 0.9844 | SH2D1A | CD74 | HLA-DPA1 |
38147_at | 35016_at | 38095_i_at | 0.9844 | SH2D1A | CD74 | HLA-DPB1 |
35016_at | 2059_s_at | 38095_i_at | 0.9844 | CD74 | LCK | HLA-DPB1 |
38147_at | 37039_at | 38833_at | 0.9844 | SH2D1A | HLA-DRA | HLA-DPA1 |
35016_at | 2059_s_at | 38833_at | 0.9844 | CD74 | LCK | HLA-DPA1 |
38319_at | 37039_at | 38095_i_at | 0.9844 | CD3D | HLA-DRA | HLA-DPB1 |
37039_at | 38095_i_at | 38949_at | 0.9844 | HLA-DRA | HLA-DPB1 | PRKCQ |
38147_at | 38833_at | 33039_at | 0.9844 | SH2D1A | HLA-DPA1 | TRAT1 |
38095_i_at | 38833_at | 38949_at | 0.9844 | HLA-DPB1 | HLA-DPA1 | PRKCQ |
33238_at | 35016_at | 38833_at | 0.9844 | LCK | CD74 | HLA-DPA1 |
38319_at | 38095_i_at | 33039_at | 0.9844 | CD3D | HLA-DPB1 | TRAT1 |
2059_s_at | 37039_at | 38833_at | 0.9844 | LCK | HLA-DRA | HLA-DPA1 |
2059_s_at | 38833_at | 33039_at | 0.9766 | LCK | HLA-DPA1 | TRAT1 |
2059_s_at | 38095_i_at | 33039_at | 0.9766 | LCK | HLA-DPB1 | TRAT1 |
2059_s_at | 38095_i_at | 38833_at | 0.9766 | LCK | HLA-DPB1 | HLA-DPA1 |
33238_at | 2059_s_at | 38095_i_at | 0.9766 | LCK | LCK | HLA-DPB1 |
35016_at | 38095_i_at | 33039_at | 0.9766 | CD74 | HLA-DPB1 | TRAT1 |
38147_at | 38095_i_at | 33039_at | 0.9766 | SH2D1A | HLA-DPB1 | TRAT1 |
33238_at | 2059_s_at | 33039_at | 0.9766 | LCK | LCK | TRAT1 |
35016_at | 37039_at | 33039_at | 0.9766 | CD74 | HLA-DRA | TRAT1 |
33238_at | 38095_i_at | 33039_at | 0.9766 | LCK | HLA-DPB1 | TRAT1 |
33238_at | 38833_at | 33039_at | 0.9766 | LCK | HLA-DPA1 | TRAT1 |
35016_at | 38833_at | 33039_at | 0.9766 | CD74 | HLA-DPA1 | TRAT1 |
33238_at | 38095_i_at | 38833_at | 0.9688 | LCK | HLA-DPB1 | HLA-DPA1 |
37039_at | 38833_at | 33039_at | 0.9688 | HLA-DRA | HLA-DPA1 | TRAT1 |
38147_at | 35016_at | 37039_at | 0.9688 | SH2D1A | CD74 | HLA-DRA |
33238_at | 2059_s_at | 38833_at | 0.9688 | LCK | LCK | HLA-DPA1 |
37039_at | 38095_i_at | 33039_at | 0.9688 | HLA-DRA | HLA-DPB1 | TRAT1 |
38095_i_at | 38833_at | 33039_at | 0.9609 | HLA-DPB1 | HLA-DPA1 | TRAT1 |
35016_at | 38095_i_at | 38833_at | 0.9609 | CD74 | HLA-DPB1 | HLA-DPA1 |
37039_at | 38095_i_at | 38833_at | 0.9531 | HLA-DRA | HLA-DPB1 | HLA-DPA1 |
35016_at | 37039_at | 38095_i_at | 0.9531 | CD74 | HLA-DRA | HLA-DPB1 |
35016_at | 37039_at | 38833_at | 0.9531 | CD74 | HLA-DRA | HLA-DPA1 |
Table 2. Annotations of all the features from the dataset ALL1. This is a binary classification dataset between B-cell and T-cell ALL samples. The gene symbols were collected for all the microarray features in the last three columns.
This study presents an easy-to-follow multi-solution biomarker detection and characterization protocol for a user-specified binary classification dataset. The software puts an emphasis on user-friendliness and flexible import/export interfaces for various file formats, allowing a biomedical researcher to investigate their dataset easily using the GUI of the software. This study also highlights the necessity of generating more than one solution with similarly effective modeling performances, previously ignored by many existing biomarker detection algorithms. In the future, newly developed biomarker detection algorithms may include this option by recording all the intermediate biomarker subsets with sufficient modeling performances.
In this protocol, steps 1 and 5 are of most importance, as the software is a fully automatic system that relies on correctly formatted input files. It was found that during our testing step, the mis-match of sample names from data matrix and class labels files may cause errors in the software, where the software will pop out a warning dialog about this error. Therefore, if the user finds no samples were loaded from the data matrix or class label files, the troubleshooting trick is to double-check whether the sample names in the two input files are inconsistent. If no dots were visualized in the 3D scatter plots, this may be due to the parameter pCutoff being higher than the best solution. In this instance, the troubleshooting trick is to lower the cutoff of the classification performance measurement (parameter pCutoff). However, the maximum performance measurement achieved by the biomarker subsets may be still blocked by the cutoff for a difficult dataset. A warning dialog will give this best performance measurement, and the user may choose a smaller cutoff to continue further analysis.
The main limitations of the software are its slow calculation speed and its ability to only focus on, at most, three features. Feature selection is an NP-hard problem, defined as a computational problem whose globally optimal solution cannot be resolved within polynomial time59. The comprehensive biomarker subset screening step consumes a high volume of computational power. The running time complexity of kSolutionVis is O(n3) where n is the parameter pTopX. Additionally, this multiple-biomarker detection algorithm focuses on visualizing the screen of features, therefore confining the number of the features to three or fewer. This limitation may impede some users who may work on difficult problems and wish to find feature subsets consisting of more than three features. However, the software visualizes feature subsets in the 3D space and it's difficult to directly visualize feature subsets in more than three dimensions. In addition, based on the representative results presented above, the multiple feature triplets selected by kSolutionVis is a highly effective method in classification and shows significant results with important biomedical meaning.
The software represents useful complementary software to the existing feature selection algorithms. In the field of biomedicine, feature selection is termed biomarker, with the goal to find a subset of features achieving improved modeling performance60,61,62. The software is a comprehensive screening tool of all the triplet biomarker subsets based on the strategy proposed in a recent study5. The two representative datasets screened by the software's protocol, and their results demonstrate the existences of quite a few solutions with similarly effective or even identical modeling performances. However, heuristic rules63,64,65,66 may be employed to find sub-optimal solutions, but such algorithms have a strong tendency to produce only one solution, ignoring many other solutions with similarly effective or even identical modeling performances. Therefore, the computer power and the lengthy running time of the software are worthwhile to ensure a more comprehensive detection of potential biomarkers in the future.
The representative results were calculated on two transcriptome datasets, however, the software handles input data in various standard file formats and may also be used to analyze other 'omic' datasets, including proteomics and metabolomics. Additionally, parallelization may speed up the calculation of the biomarker detection module in the software. There is some multi-core hardware including GPGPU (General-Purpose Graphical Processing Unite) and Intel Xeon Phi processors available for this purpose. However, these technologies require different coding strategies and will be considered in the next version of the software.
The authors have nothing to disclose.
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040400) and the startup grant from Jilin University. Anonymous reviewers and biomedical testing users were appreciated for their constructive comments on improving the usability and functionality of kSolutionVis.
Hardware | |||
laptop | Lenovo | X1 carbon | Any computer works. Recommended minimum configuration: 1GB extra hard disk space, 1 GB memory, 2.0MHz CPU |
Name | Company | Catalog Number | Comments |
Software | |||
Python 3.0 | WingWare | Wing Personal | Any python programming and running environments support Python version 3.0 or above |