Identification of statistical tools to aid in biomarker analysis

By Avishek Majumder on November 27, 2018

Biomarker discovery starts with preclinical exploratory studies on a small number of samples, which identify promising biomarkers from a pool of diseased and non-diseased groups. A typical biomarker analysis dataset consists of samples divided into two or more classes, each described by several features or variables. Statistical tools therefore play a central role in biomarker discovery and research.

Biomarkers are prioritized on the basis of selection criteria that depend on the discovery platform, the type of biomarker, and the intended usage (Pavlou et al. 2013). A small sample size at the early stage helps to prioritize potential candidates, but it gives rise to two problems:

  1. High probability of getting false positives
  2. High probability of getting a correlation due to chance
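
Both problems can be illustrated with a small simulation. The sketch below is a hypothetical example, not taken from any of the cited studies: it assumes 10 samples (5 diseased, 5 non-diseased) and 1,000 candidate features that are pure noise, and it counts how many of them correlate with the class label at the conventional 5% level. The function names, sample sizes and seed are illustrative assumptions.

```python
import math
import random

N_PER_GROUP = 5   # assumed: 5 diseased and 5 non-diseased samples
T_CRIT_DF8 = 2.306  # two-sided 5% critical value of Student's t, 8 degrees of freedom


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_chance_hits(n_features=1000, seed=0):
    """Count pure-noise features that look 'significant' by chance alone."""
    rng = random.Random(seed)
    labels = [0] * N_PER_GROUP + [1] * N_PER_GROUP
    df = 2 * N_PER_GROUP - 2
    # Convert the t critical value into the equivalent critical correlation.
    r_crit = T_CRIT_DF8 / math.sqrt(T_CRIT_DF8 ** 2 + df)
    hits = 0
    for _ in range(n_features):
        # Each feature is random noise with no relation to the class label.
        feature = [rng.gauss(0.0, 1.0) for _ in labels]
        if abs(pearson_r(feature, labels)) > r_crit:
            hits += 1
    return hits, r_crit


hits, r_crit = count_chance_hits()
print(f"critical |r| = {r_crit:.3f}, noise features passing it: {hits} of 1000")
```

Under these assumptions roughly 5% of the noise features are expected to pass the threshold, even though none of them carries a real signal, and each of those features shows a correlation above 0.63 with the class label purely by chance.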

As a potential candidate progresses from pre-clinical exploratory studies to assay development, the sample size keeps increasing, and so does the need for robust statistical tools. These tools help to quantify the real effect of the biomarker in a clinical environment, where other confounders affect the correlation (Dancey et al. 2010).

Statistical testing in a biomarker analysis

The data in biomarker research can be evaluated by two different strategies, as shown in the table below (Robotti et al. 2014).

| Statistical strategy | Aim | Description | Statistical tests |
|---|---|---|---|
| Classical statistical method | Identifying significant biomarkers using univariate statistical tests | Biomarkers that show statistically significant different behaviour between two classes are identified for every variable. Each biomarker is evaluated independently of the others. | t-test, z-test, ANOVA, MANOVA, Mann-Whitney test |
| Multivariable models | Identifying significant biomarkers by evaluating relationships between multiple biomarkers | Multiple biomarkers are evaluated together, with correlations and interactions (synergy or antagonism) between them, and compared across two or more groups. | Clustering, Principal Component Analysis, Classification and regression trees |

Multivariable models are more reliable because they provide more realistic estimates and decrease the probability of significance due to chance (Robotti et al. 2014). They also have an advantage over univariate analysis, since important biomarkers often show similar or opposite behaviour, especially if they are part of a common biochemical pathway.
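
The point about markers with opposite behaviour can be made concrete with simulated data. The sketch below is a hypothetical illustration, not a method from the cited papers: it assumes two markers that share pathway-level variation, with one rising slightly and the other falling slightly in disease, and compares the standardised group difference (Cohen's d) of each marker alone with that of their difference. The fixed combination `m1 - m2` stands in for what a fitted multivariable model might learn, and all effect sizes are assumptions.

```python
import random
import statistics


def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def simulate(n_per_group=50, effect=0.3, seed=1):
    """Simulate two markers that share pathway-level variation."""
    rng = random.Random(seed)
    data = {"diseased": [], "healthy": []}
    for group, sign in (("diseased", 1.0), ("healthy", 0.0)):
        for _ in range(n_per_group):
            shared = rng.gauss(0.0, 1.0)  # variation common to both markers
            m1 = shared + sign * effect + rng.gauss(0.0, 0.2)  # rises in disease
            m2 = shared - sign * effect + rng.gauss(0.0, 0.2)  # falls in disease
            data[group].append((m1, m2))
    return data


data = simulate()
diseased, healthy = data["diseased"], data["healthy"]

# Univariate view: each marker on its own.
d_marker1 = cohens_d([m1 for m1, _ in diseased], [m1 for m1, _ in healthy])
d_marker2 = cohens_d([m2 for _, m2 in diseased], [m2 for _, m2 in healthy])
# Joint view: the difference cancels the shared variation.
d_combined = cohens_d([m1 - m2 for m1, m2 in diseased],
                      [m1 - m2 for m1, m2 in healthy])

print(f"marker 1 alone: d = {d_marker1:.2f}")
print(f"marker 2 alone: d = {d_marker2:.2f}")
print(f"combined (m1 - m2): d = {d_combined:.2f}")
```

In this setup each marker separates the groups only weakly on its own because the shared variation dominates, whereas the combination removes that shared component and separates the groups much more clearly.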

Based on the type of biomarker, statistical strategies can be classified as follows:

  1. Statistical strategies for genomic and epigenetic biomarkers
  2. Statistical strategies for protein biomarkers

Genomic and epigenetic biomarkers

These biomarkers regulate certain physiological processes in the human body. They vary from person to person, and their expression depends on environmental factors such as nutrition, and on comorbidities. Genomic and epigenomic biomarkers exhibit higher inter- and intrapersonal variability, and with higher variability comes a higher chance of false positives (Robotti et al. 2014). Statistical analysis of the genomic assays is therefore necessary to minimize false positives. Gene set enrichment analysis (GSEA) of significant mutations (MUTs) and copy numbers (CNs) is the most widely employed technique. These metrics are used to identify pathways associated with nucleic-acid regulation of biological processes, cellular metabolism, and signal transduction.

The metrics, on the other hand, colour-code the expression of genes and give a score to each potential biomarker based on its expression; the top 10 scoring GSEA pathways are selected for clinical validation studies (Covell 2015). The colour coding is based on the affinity of the biomarker towards its target: the higher the affinity, the darker the shade. In some cases there may be repulsion between the biomarker and the potential target, and contrasting colours are given to these antagonist biomarkers.
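
To show what a GSEA-style score measures, here is a minimal sketch of the unweighted running-sum enrichment score. It assumes the genes are already ranked by some metric, and it walks down the list, stepping up at each gene in the set and down otherwise. It is a simplification rather than the scoring used by Covell (2015): standard GSEA weights each step by the ranking metric and usually permutes phenotype labels, whereas this sketch uses equal weights and permutes gene-set membership. The gene names are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up for a gene in the set, step down for a gene outside it.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        # Keep the largest deviation from zero, in either direction.
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Two-sided p-value from random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    set_size = sum(gene in gene_set for gene in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, set_size))
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical ranked list (most up-regulated first) and pathway gene set.
ranked = [f"gene{i}" for i in range(1, 21)]
pathway = {"gene1", "gene2", "gene4", "gene5"}
score, p_value = permutation_p_value(ranked, pathway)
print(f"enrichment score = {score:.2f}, permutation p = {p_value:.3f}")
```

A score near +1 means the gene set is concentrated at the top of the ranking, a score near -1 means it is concentrated at the bottom, and the permutation p-value indicates how often a random set of the same size scores at least as extremely.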

Statistical methods in genomic and epigenomic biomarker analysis

The table below lists the different statistical tools used in genomic and epigenomic data analysis (Marcello Manfredi 2013).

| Aim | Research question | Statistical tests |
|---|---|---|
| Understanding the chemistry and structure of the genome | Does this genome exhibit mutations in different populations? | Cluster analysis methods, GSEA |
| Pattern recognition | Does the genome express differently under different environmental conditions? | Principal Component Analysis (PCA) |
| Understanding the expression of the entire genome | Can the survival rate of cancer patients be predicted based on the presence of a certain gene in its sequence? | Cox survival analysis, meta-analysis of microarray data |
| Assessing measurement reproducibility in biomarker analysis | Can the biomarker bring about the same effects in different individuals? | ROC curve and correlation analysis |

GSEA studies are carried out for initial biomarker discovery, along with validation studies for genomic and epigenetic biomarkers. They depend on receiver operating characteristic (ROC) curve analysis of the Cancer Genome Project (CGP) database; the ROC values are then reported together with P values, sensitivity and specificity, as shown in the figure below.

Receiver operating characteristic (ROC) curve based on using the CGP minimal EN GEs to select tumour cells with the most similar GEs (Covell 2015)
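
The quantities reported in an ROC analysis (the curve itself, the area under it, and sensitivity and specificity at a chosen cut-off) can be computed directly from biomarker scores and class labels. The following is a minimal sketch in plain Python, assuming labels coded as 1 for diseased and 0 for healthy and higher scores indicating disease; the scores are made-up values, not data from Covell (2015).

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a sample is called positive
    # when its score is at or above the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when 'score >= threshold' means diseased."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical biomarker scores: three diseased (1) and three healthy (0) samples.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

area = auc(roc_points(scores, labels))
sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
print(f"AUC = {area:.3f}, sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

In this toy example the AUC is 8/9, because the biomarker ranks the diseased sample above the healthy one in 8 of the 9 possible diseased/healthy pairs.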

Statistical strategies for protein biomarkers

Protein biomarkers can be anything including transcripts, proteins, metabolites, and, more recently, non-coding regulatory RNAs. Protein biomarker analysis data are typically sourced from:

  • Antibody microarray,
  • 2D-PAGE (Two-Dimensional Polyacrylamide Gel Electrophoresis),
  • 2D-DIGE (Two-Dimensional Difference Gel Electrophoresis),
  • Mass spectrometry (MS) methods, including:
    • MALDI-TOF,
    • SELDI-TOF and
    • LC-MS (Liquid Chromatography-Mass Spectrometry)

A typical protein biomarker study dataset includes data on multiple variables for every sample, for example peak intensity for MS tests or spot volume for 2D-PAGE or DIGE tests. The raw data need processing for validated quantification before statistical tests can be conducted on them. Raw data comprise a subset of all proteins in a sample that is checked against a database and quantified. The ultimate goal of protein biomarker data analysis is to identify unique biomarkers based on the features that differentiate two or more sample groups (diseased/healthy states or different times/stages of a disease/condition).

Statistical methods for protein biomarker analysis

Based on the aim of the study and the behaviour of the protein biomarker, different statistical tests are used, as shown in the table below (Bantscheff et al. 2007).

| Aim | Research question | Statistical tests |
|---|---|---|
| Differential protein biomarker expression under different conditions | Does a protein behave significantly differently between the two samples? | Multiple hypothesis testing |
| | Does a protein exhibit time-dependent change? | ANOVA |
| | Is the sample a member of a defined class of samples? | Classification methods (Linear Discriminant Analysis, Support Vector Machines) |
| Relationships or interactions between biomarkers | Which proteins behave similarly in an experiment? | Cluster analysis |
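
When every protein is tested for a difference between two samples, the number of tests equals the number of proteins, so the p-values are usually adjusted for multiplicity. One widely used adjustment is the Benjamini-Hochberg procedure, which controls the false discovery rate. It is shown here as an assumed choice, since the source only names "multiple hypothesis testing" in general. The p-values in the example are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and rejection flags, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each adjusted value is the
    # minimum of p * m / rank over the current rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


# Hypothetical raw p-values from per-protein tests between two sample groups.
raw_p = {"protein_A": 0.01, "protein_B": 0.5, "protein_C": 0.03}
adjusted, rejected = benjamini_hochberg(list(raw_p.values()))
for name, q, keep in zip(raw_p, adjusted, rejected):
    print(f"{name}: adjusted p = {q:.3f}, significant at 5% FDR: {keep}")
```

The adjusted values can then be compared with the chosen false discovery rate in place of the raw p-values, which keeps the expected share of false positives among the reported proteins under control.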

Like every other clinical entity, biomarker analysis requires robust statistical support to quantify its effect. The basic statistical principles of managing confounders and hypothesis testing are required in any kind of biomarker study. Biomarker analysis also requires clinical validation, which has not shown much progress in the last decade, largely owing to unclear regulations and regulatory policies that give secondary importance to biomarker analysis in drug development (Cobleigh et al. 2005; Dancey et al. 2010).

Oncology research, on the other hand, has benefited greatly from biomarker discoveries (Jeffrey 2008). Different statistical tools suit different study objectives, so an investigator must choose them wisely to get results closer to the truth. Biomarker analysis has come a long way, and in the years to come we hope to see more biomarkers being clinically validated.

References


  • Bantscheff, M., Schirle, M., Sweetman, G., Rick, J., Kuster, B., 2007. Quantitative Mass Spectrometry in Proteomics: A Critical Review. Anal. Bioanal. Chem. 389, 1017–1031.
  • Cobleigh, M.A., Tabesh, B., Bitterman, P., Baker, J., Cronin, M., Liu, M.-L., Borchik, R., Mosquera, J.-M., Walker, M.G., Shak, S., 2005. Tumor gene expression and prognosis in breast cancer patients with 10 or more positive lymph nodes. Clin. Cancer Res. 11, 8623–8631. doi:10.1158/1078-0432.CCR-05-0735.
  • Covell, D.G., 2015. Data Mining Approaches for Genomic Biomarker Development: Applications Using Drug Screening Data from the Cancer Genome Project and the Cancer Cell Line Encyclopedia. PLOS ONE 10, e0127433. doi:10.1371/journal.pone.0127433.
  • Dancey, J.E., Dobbin, K.K., Groshen, S., Jessup, J.M., Hruszkewycz, A.H., Koehler, M., Parchment, R., Ratain, M.J., Shankar, L.K., Stadler, W.M., True, L.D., Gravell, A., Grever, M.R., 2010. Guidelines for the Development and Incorporation of Biomarker Studies in Early Clinical Trials of Novel Agents. Clin. Cancer Res. 16, 1745–1755. doi:10.1158/1078-0432.CCR-09-2167.
  • Hergenhahn, M., Muhlemann, K., Hollstein, M., Kenzelmann, M., 2003. DNA Microarrays: Perspectives for Hypothesis-Driven Transcriptome Research and for Clinical Applications. Curr. Genomics 4, 543–555. doi:10.2174/1389202033490231.
  • Jeffrey, S.S., 2008. Cancer biomarker profiling with microRNAs. Nat. Biotechnol. 26, 400–401. doi:10.1038/nbt0408-400.
  • Marcello Manfredi, E.R., 2013. Biomarkers Discovery through Multivariate Statistical Methods: A Review of Recently Developed Methods and Applications in Proteomics. J. Proteomics Bioinform. s3. doi:10.4172/jpb.S3-003.
  • Pavlou, M.P., Diamandis, E.P., Blasutig, I.M., 2013. The Long Journey of Cancer Biomarkers from the Bench to the Clinic. Clin. Chem. 59, 147–157. doi:10.1373/clinchem.2012.184614.
  • Robotti, E., Manfredi, M., Marengo, E., 2014. Biomarkers Discovery through Multivariate Statistical Methods: A Review of Recently Developed Methods and Applications in Proteomics. J. Proteomics Bioinform. S3. doi:10.4172/jpb.S3-003.