Identification of statistical tools to aid in biomarker analysis

Biomarker discovery starts with preclinical exploratory studies on a small number of samples, which identify promising biomarkers from a pool of diseased and non-diseased groups. A typical biomarker analysis dataset consists of samples divided into two or more classes, each described by several features or variables. Statistical tools therefore have wide application in, and a strong influence on, biomarker discovery and research.

Biomarkers are prioritised on the basis of selection criteria that depend on the discovery platform, the type of biomarker, and its intended use (Pavlou et al. 2013). A small sample size at the early stage helps to prioritise potential candidates, but it gives rise to two problems:

  1. High probability of getting false positives
  2. High probability of getting a correlation due to chance
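
The two problems above can be illustrated with a small simulation. The sketch below is not taken from the cited studies: it assumes purely random (noise) features and a random outcome, and the function names, the |r| > 0.8 threshold, and the sample sizes of 6 and 60 are illustrative choices only.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_chance_hits(n_samples, n_features=1000, threshold=0.8, seed=0):
    """Count noise features whose |r| with a noise outcome exceeds the threshold.

    Every feature is unrelated to the outcome by construction, so each hit
    is a chance correlation (a false positive).
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        if abs(pearson_r(feature, outcome)) > threshold:
            hits += 1
    return hits


# Hypothetical study sizes: 6 samples versus 60 samples, 1000 candidate features each.
small_study_hits = count_chance_hits(n_samples=6)
large_study_hits = count_chance_hits(n_samples=60)
print("strong chance correlations with 6 samples:", small_study_hits)
print("strong chance correlations with 60 samples:", large_study_hits)
```

With only a handful of samples, some noise features correlate strongly with the outcome purely by chance. The same threshold is very rarely crossed once the sample size grows.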

As a potential candidate progresses from pre-clinical exploratory studies to assay development, the sample size increases and so does the need for robust statistical tools. These tools help to quantify the real effect of the biomarker in a clinical environment, where other confounders affect the correlation (Dancey et al. 2010).

Statistical testing in biomarker analysis

Data in biomarker research can be evaluated by two different strategies, summarised in the table below (Robotti et al. 2014).

Statistical strategy
Classical statistical methods
  • Aim: identifying significant biomarkers using univariate statistical tests
  • Approach: biomarkers that show statistically significant different behaviour between two classes are identified for every variable. Each biomarker is evaluated independently of the others.
  • Methods: t-test, z-test, ANOVA, MANOVA, Mann-Whitney test
Multivariable models
  • Aim: identifying significant biomarkers by evaluating relationships between multiple biomarkers
  • Approach: multiple biomarkers are evaluated together, with correlations and interactions (synergy or antagonism) between them. These are compared for two or more groups.
  • Methods: clustering, Principal Component Analysis, classification and regression trees
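
As an illustration of the univariate strategy, the sketch below applies a Mann-Whitney test to a single biomarker measured in two classes. It is a minimal version that uses the normal approximation without tie or continuity corrections, so it is not a substitute for a statistics package. The function name and the intensity values are hypothetical.

```python
import math


def mann_whitney_u(group_a, group_b):
    """Return (U statistic for group_a, two-sided p-value).

    The p-value uses the normal approximation with no tie correction and
    no continuity correction, so it is only a rough guide for small groups.
    """
    n_a, n_b = len(group_a), len(group_b)
    # U counts the pairs in which the group_a value is larger; ties count as half.
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_two_sided


# Hypothetical intensities of one biomarker in diseased and healthy samples.
diseased = [5.1, 6.3, 5.8, 7.0, 6.6, 5.9]
healthy = [4.2, 4.9, 5.0, 4.4, 5.3, 4.7]
u_stat, p_value = mann_whitney_u(diseased, healthy)
print("U =", u_stat, "p =", round(p_value, 4))
```

In a univariate screen this test would be repeated for every biomarker independently, which is why the number of tests, and with it the number of chance findings, grows with the number of variables.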

Multivariable models are generally considered more reliable because they give more realistic estimates and reduce the probability of finding significance by chance (Robotti et al. 2014). They also have an advantage over univariate analysis because important biomarkers often show similar or opposite behaviour, especially when they are part of a common biochemical pathway.

Based on the type of biomarker, statistical strategies can be classified as follows:

  1. Statistical strategies for genomic and epigenetic biomarkers
  2. Statistical strategies for protein biomarkers

Genomic and epigenetic biomarkers

These biomarkers regulate certain physiological processes in the human body. They vary from person to person, and their expression depends on environmental factors such as nutrition and comorbidities. Genomic and epigenomic biomarkers exhibit high inter- and intra-personal variability, and higher variability brings a higher chance of false positives (Robotti et al. 2014). Statistical analysis of genomic assays is therefore necessary to minimise false positives. Gene set enrichment analysis (GSEA) of significant mutations (MUTs) and copy numbers (CNs) is among the most widely employed techniques. These metrics are used to identify pathways associated with nucleic-acid regulation of biological processes, cellular metabolism, and signal transduction.
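
The core idea behind GSEA can be sketched as a running-sum statistic over a ranked gene list. The version below is a simplified, unweighted form written for illustration. It is not the procedure used in the cited work, it omits the correlation weighting and the permutation-based significance testing of full GSEA, and the gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: unique gene names, ordered from most to least associated
    with the phenotype. gene_set: gene names in the pathway of interest.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up on a pathway gene, step down otherwise.
        if gene in members:
            running += 1.0 / n_hits
        else:
            running -= 1.0 / n_misses
        # Keep the largest deviation from zero seen so far.
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list of ten genes, G1 (most associated) to G10 (least).
ranked = ["G" + str(i) for i in range(1, 11)]
es_top = enrichment_score(ranked, {"G1", "G2", "G3"})
es_bottom = enrichment_score(ranked, {"G8", "G9", "G10"})
es_scattered = enrichment_score(ranked, {"G1", "G5", "G10"})
print(es_top, es_bottom, es_scattered)
```

A gene set concentrated at the top of the ranking gives a score near +1, one concentrated at the bottom gives a score near -1, and a set scattered through the list stays closer to zero.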

Colour-coded metrics, on the other hand, encode gene expression and assign a score to each potential biomarker based on its expression; the top 10 scoring GSEA pathways are selected for clinical validation studies (Covell 2015). The colour coding is based on the affinity of the biomarker for its target: the higher the affinity, the darker the shade. In some cases there may be repulsion between the biomarker and the potential target, so contrasting colours are assigned to these antagonist biomarkers.

Statistical methods in genomic and epigenomic biomarker analysis

The table below lists the statistical tools used in genomic and epigenomic data analysis (Marcello Manfredi, 2013).

Statistical tests
Understanding the chemistry and structure of the genome: does this genome exhibit mutations in different populations?
  • Cluster analysis methods
  • GSEA
Pattern recognition: does the genome express differently under different environmental conditions?
  • Principal Component Analysis (PCA)
Understanding the expression of the entire genome: can the survival rate of a cancer patient be predicted from the presence of a certain gene in the sequence?
  • Cox survival analysis
  • Meta-analysis of microarray data
Assessing measurement reproducibility in biomarker analysis: can the biomarker bring about the same effects in different individuals?
  • ROC curve and correlation analysis

GSEA studies are carried out for initial biomarker discovery and for validation studies of genomic and epigenetic biomarkers. They rely on receiver operating characteristic (ROC) curve analysis of the Cancer Genome Project (CGP) database; the ROC values are then reported with P values, sensitivity, and specificity, as shown in the figure below.

Receiver operating characteristic (ROC) curve based on using the CGP minimal EN GEs to select tumour cells with the most similar GEs (Covell 2015)
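
To make the link between a ROC curve, sensitivity, and specificity concrete, the sketch below computes ROC points and the area under the curve (AUC) for one biomarker score. It is a generic illustration, not the CGP analysis shown in the figure. The function names, scores, and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    labels: 1 for diseased, 0 for non-diseased. A higher score is assumed
    to indicate disease. The curve starts at (0, 0).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        # x is 1 - specificity, y is sensitivity.
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical biomarker scores for four diseased (1) and four healthy (0) samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
print("AUC =", area)
```

An AUC of 0.5 corresponds to a biomarker that separates the groups no better than chance, and an AUC of 1.0 to perfect separation. Each point on the curve is one trade-off between sensitivity and specificity.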


Statistical strategies for protein biomarkers

Biomarkers can be transcripts, proteins, metabolites and, more recently, non-coding regulatory RNAs. Protein biomarker analysis data are typically sourced from:

  • Antibody microarray
  • 2D-PAGE (2 Dimensional-Polyacrylamide Gel Electrophoresis)
  • 2D-DIGE (2 Dimensional Difference Gel Electrophoresis)
  • Mass spectrometry (MS) methods, including:
    • MALDI-TOF
    • SELDI-TOF
    • LC-MS (Liquid Chromatography-Mass Spectrometry)

A typical protein biomarker study dataset includes multiple variables for every sample, for example peak intensity for MS tests or spot volume for 2D-PAGE or DIGE tests. The raw data need processing for validated quantification before statistical tests can be conducted on them: the raw data, which cover a subset of all proteins in a sample, are checked against a database and quantified. The ultimate goal of protein biomarker data analysis is to identify unique biomarkers based on the features that differentiate two or more sample groups (diseased/healthy state, or different times/stages of a disease or condition).

Statistical methods for protein biomarker analysis

Depending on the aim of the study and the behaviour of the protein biomarker, different statistical tests are used, as shown in the table below (Bantscheff et al. 2007).

Statistical tests
Differential protein biomarker expression under different conditions
Does a protein behave significantly differently between the two samples?
  • Multiple hypothesis testing
Does a protein exhibit time-dependent change?
  • ANOVA
Is the sample a member of a defined class of samples?
  • Classification methods (Linear Discriminant Analysis, Support Vector Machines)
Relationships or interactions between biomarkers
Which proteins behave similarly in an experiment?
  • Cluster analysis
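
When many proteins are tested at once, the p-values are usually adjusted for multiple hypothesis testing. The source does not name a specific correction. The sketch below uses the Benjamini-Hochberg procedure as one common choice, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Controls the false discovery rate at level alpha with the
    Benjamini-Hochberg step-up procedure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose sorted p-value is <= k / m * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that rank.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from testing ten proteins between two sample groups.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values)
naive_hits = sum(1 for p in p_values if p < 0.05)
print("significant without correction:", naive_hits)
print("significant after correction:", sum(rejected))
```

In this example, five proteins pass an uncorrected 0.05 cut-off but only the two smallest p-values survive the correction. This shows how the adjustment guards against the false positives discussed earlier.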

Thus, like every other clinical entity, biomarker analysis requires robust statistical support to quantify its effect. The basic statistical principles of managing confounders and hypothesis testing apply to any kind of biomarker study. Biomarker analysis also requires clinical validation, which has not shown much progress in the last decade, largely owing to unclear regulations and regulatory policies that give secondary importance to biomarker analysis in drug development (Cobleigh et al. 2005; Dancey et al. 2010).

Oncology research, on the other hand, has benefited greatly from biomarker discoveries (Jeffrey 2008). Different statistical tools are required to suit the objectives of each study, so an investigator must choose them wisely to get results closer to the truth. Biomarker analysis has come a long way, and in the years to come we hope to see more biomarkers being clinically validated.


References

  • Bantscheff, M., Schirle, M., Sweetman, G., Rick, J., Kuster, B., 2007. Quantitative Mass Spectrometry in Proteomics: A Critical Review. Anal. Bioanal. Chem. 389, 1017–1031.
  • Cobleigh, M.A., Tabesh, B., Bitterman, P., Baker, J., Cronin, M., Liu, M.-L., Borchik, R., Mosquera, J.-M., Walker, M.G., Shak, S., 2005. Tumor gene expression and prognosis in breast cancer patients with 10 or more positive lymph nodes. Clin. Cancer Res. Off. J. Am. Assoc. Cancer Res. 11, 8623–8631. doi:10.1158/1078-0432.CCR-05-0735.
  • Covell, D.G., 2015. Data Mining Approaches for Genomic Biomarker Development: Applications Using Drug Screening Data from the Cancer Genome Project and the Cancer Cell Line Encyclopedia. PLOS ONE 10, e0127433. doi:10.1371/journal.pone.0127433.
  • Dancey, J.E., Dobbin, K.K., Groshen, S., Jessup, J.M., Hruszkewycz, A.H., Koehler, M., Parchment, R., Ratain, M.J., Shankar, L.K., Stadler, W.M., True, L.D., Gravell, A., Grever, M.R., 2010. Guidelines for the Development and Incorporation of Biomarker Studies in Early Clinical Trials of Novel Agents. Clin. Cancer Res. 16, 1745–1755. doi:10.1158/1078-0432.CCR-09-2167.
  • Hergenhahn, M., Muhlemann, K., Hollstein, M., Kenzelmann, M., 2003. DNA Microarrays: Perspectives for Hypothesis-Driven Transcriptome Research and for Clinical Applications. Curr. Genomics 4, 543–555. doi:10.2174/1389202033490231.
  • Jeffrey, S.S., 2008. Cancer biomarker profiling with microRNAs. Nat. Biotechnol. 26, 400–401. doi:10.1038/nbt0408-400.
  • Marcello Manfredi, E.R., 2013. Biomarkers Discovery through Multivariate Statistical Methods: A Review of Recently Developed Methods and Applications in Proteomics. J. Proteomics Bioinform. s3. doi:10.4172/jpb.S3-003.
  • Pavlou, M.P., Diamandis, E.P., Blasutig, I.M., 2013. The Long Journey of Cancer Biomarkers from the Bench to the Clinic. Clin. Chem. 59, 147–157. doi:10.1373/clinchem.2012.184614.
  • Robotti, E., Manfredi, M., Marengo, E., 2014. Biomarkers Discovery through Multivariate Statistical Methods: A Review of Recently Developed Methods and Applications in Proteomics. J. Proteomics Bioinform. S3. doi:10.4172/jpb.S3-003.
Avishek Majumder

Research Analyst at Project Guru
