Identification of statistical tools to aid in biomarker analysis
Biomarker discovery starts with preclinical exploratory studies on a small number of samples, which identify promising biomarkers from a pool of diseased and non-diseased groups. A typical biomarker analysis dataset consists of samples divided into two or more classes, each described by several features or variables. Statistical tools therefore have great influence and wide application in biomarker discovery and research.
Biomarkers are prioritized on the basis of selection criteria that depend on the discovery platform, the type of biomarker and the intended usage (Pavlou et al. 2013). A small sample size at the early stage helps to prioritize potential candidates, but it gives rise to two problems:
 High probability of getting false positives
 High probability of getting a correlation due to chance
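Both problems can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the function names, the significance level of 0.05, the 200 noise features and the sample sizes of 6 and 100 are all hypothetical choices, and the features are pure random noise with no real relationship to the outcome.

```python
import math
import random


def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_samples, n_features, seed=0):
    """Largest |r| between a random outcome and pure-noise features."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson_r(feature, outcome)))
    return best


if __name__ == "__main__":
    # Chance of at least one false positive when testing 20 candidates.
    print(family_wise_error_rate(0.05, 20))
    # Strongest spurious correlation with few versus many samples.
    print(strongest_chance_correlation(6, 200))
    print(strongest_chance_correlation(100, 200))
```

With 20 independent tests at a significance level of 0.05, the chance of at least one false positive is already about 64%. The simulation also shows that, among many unrelated features, the strongest chance correlation is far larger with 6 samples than with 100.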
As the potential candidate progresses from preclinical exploratory studies to assay development, the sample size keeps increasing and so does the need for robust statistical tools. These tools help to quantify the real effect of the biomarker in a clinical environment, where other confounders affect the correlation (Dancey et al. 2010).
Statistical testing in a biomarker analysis
The data in biomarker research can be evaluated by two different strategies, as shown in the table below (Robotti et al., 2014).
| Statistical strategy | Aim | Description | Examples |
| --- | --- | --- | --- |
| Classical statistical methods | Identifying significant biomarkers using univariate statistical tests | For every variable, biomarkers that behave in a statistically significantly different way between two classes are identified. Each biomarker is evaluated independently of the others. | t-test, z-test, ANOVA, MANOVA, Mann-Whitney test |
| Multivariable models | Identifying significant biomarkers by evaluating relationships between multiple biomarkers | Multiple biomarkers are evaluated together, taking into account correlations and interactions (synergy or antagonism) between them, and are compared across two or more groups. | Clustering, principal component analysis, classification and regression trees |
Multivariable models are more reliable because they provide real-life estimates and decrease the probability of significance due to chance (Robotti et al. 2014). They are also advantageous over univariate analysis, since important biomarkers often show similar or opposite behaviour, especially if they are part of a common biochemical pathway.
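The advantage of evaluating biomarkers together can be sketched with simulated data. The example below assumes two hypothetical markers, A and B, that share a large common source of variation and shift in opposite directions in the diseased group. The marker names, shift size and noise levels are illustrative assumptions, and Cohen's d is used here only as a simple effect-size measure, not as a method taken from the cited sources.

```python
import math
import random


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / n_a, sum(group_b) / n_b
    ss_a = sum((v - mean_a) ** 2 for v in group_a)
    ss_b = sum((v - mean_b) ** 2 for v in group_b)
    pooled_sd = math.sqrt((ss_a + ss_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd


def simulate(n_per_group=50, shift=0.3, seed=1):
    """Simulate (marker A, marker B) pairs for diseased and healthy samples."""
    rng = random.Random(seed)
    data = {"diseased": [], "healthy": []}
    for label, sign in (("diseased", 1.0), ("healthy", 0.0)):
        for _ in range(n_per_group):
            shared = rng.gauss(0, 1)  # variation common to both markers
            marker_a = shared + sign * shift + rng.gauss(0, 0.1)
            marker_b = shared - sign * shift + rng.gauss(0, 0.1)
            data[label].append((marker_a, marker_b))
    return data


def effect_sizes(data):
    """Effect size of each marker alone and of the combination A - B."""
    views = {
        "A alone": lambda pair: pair[0],
        "B alone": lambda pair: pair[1],
        "A - B": lambda pair: pair[0] - pair[1],
    }
    return {
        name: cohens_d(
            [view(p) for p in data["diseased"]],
            [view(p) for p in data["healthy"]],
        )
        for name, view in views.items()
    }


if __name__ == "__main__":
    for name, d in effect_sizes(simulate()).items():
        print(name, round(d, 2))
```

Tested one at a time, each marker shows only a small effect because the shared variation masks the shift. The combination A - B cancels the shared variation and separates the groups clearly, which is the kind of relationship a univariate test cannot detect.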
Thus, based on the type of biomarker, statistical strategies can be classified as follows:
 Statistical strategies for genomic and epigenetic biomarkers
 Statistical strategies for protein biomarkers
Genomic and epigenetic biomarkers
These biomarkers regulate certain physiological processes in the human body. They vary from person to person, and their expression depends on environmental factors such as nutrition and comorbidities. Genomic and epigenomic biomarkers exhibit high inter- and intrapersonal variability, and higher variability brings a higher chance of false positives (Robotti et al. 2014). To minimize false positives, statistical analysis of the genomic assays is necessary. Gene set enrichment analysis (GSEA) of significant mutations (MUTs) and copy numbers (CNs) is the most widely employed technique. These metrics are used to identify pathways associated with nucleic-acid regulation of biological processes, cellular metabolism, and signal transduction.
The colour metric, on the other hand, codes the expression of genes and gives each potential biomarker a score based on its expression; the top 10 scoring GSEA pathways are selected for clinical validation studies (Covell 2015). The colour coding is based on the affinity of the biomarker towards its target: the higher the affinity, the darker the shade of colour. In some cases there may be repulsion between the biomarker and the potential target, so contrasting colours are assigned to these antagonist biomarkers.
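The text names GSEA without describing how a pathway is scored. At its core is a running-sum enrichment score computed over a ranked gene list. The sketch below shows a simplified, unweighted variant in which every gene counts equally. The gene names and the function name are hypothetical, and the permutation step normally used to assess significance is omitted.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of a gene set.

    Walk down the ranked list, stepping up at genes in the set and down
    at genes outside it, and return the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, extreme = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


if __name__ == "__main__":
    # Hypothetical genes ranked from most to least associated with disease.
    ranked = ["g%d" % i for i in range(1, 11)]
    print(enrichment_score(ranked, {"g1", "g2", "g3"}))   # set at the top
    print(enrichment_score(ranked, {"g8", "g9", "g10"}))  # set at the bottom
```

A gene set concentrated at the top of the ranking gives a score close to +1, and one concentrated at the bottom gives a score close to -1. A set scattered evenly through the list scores near zero.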
Statistical methods in genomic and epigenomic biomarker analysis
The table below lists the statistical tools used in genomic and epigenomic data analysis (Marcello Manfredi, 2013).
| Goal | Question | Statistical tests |
| --- | --- | --- |
| Understanding the chemistry and structure of the genome | Does this genome exhibit mutations in different populations? | |
| Pattern recognition | Does the genome express differently under different environmental conditions? | |
| Understanding the expression of the entire genome | Can the survival rate of a cancer patient be predicted from the presence of a certain gene in the patient's sequence? | |
| Assessing measurement reproducibility in biomarker analysis | Can the biomarker bring about the same effects in different individuals? | |
GSEA studies are carried out for initial biomarker discovery, along with validation studies, for genomic and epigenetic biomarkers. They depend on receiver operating characteristic (ROC) analysis of the Cancer Genome Project (CGP) database; the ROC values are then reported together with P values, sensitivity and specificity, as shown in the figure below.
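The quantities reported from a ROC analysis can be computed directly from biomarker scores. The sketch below is a minimal illustration with made-up scores and a hypothetical threshold of 0.5. It computes the area under the ROC curve as the probability that a diseased sample scores higher than a non-diseased one, and sensitivity and specificity at a single cut-off.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sensitivity_specificity(pos_scores, neg_scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    true_pos = sum(s >= threshold for s in pos_scores)
    true_neg = sum(s < threshold for s in neg_scores)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)


if __name__ == "__main__":
    # Made-up biomarker scores for diseased and non-diseased samples.
    diseased = [0.9, 0.8, 0.4]
    healthy = [0.7, 0.3, 0.2, 0.1]
    print(roc_auc(diseased, healthy))
    print(sensitivity_specificity(diseased, healthy, threshold=0.5))
```

For these toy scores, 11 of the 12 diseased/healthy pairs are ordered correctly, giving an AUC of about 0.92. At the 0.5 threshold, sensitivity is 2/3 and specificity is 3/4.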
Statistical strategies for protein biomarkers
Biomarkers can be anything, including transcripts, proteins, metabolites and, more recently, non-coding regulatory RNAs. Protein biomarker analysis data are typically sourced from:
 Antibody microarray
 2D-PAGE (2-Dimensional Polyacrylamide Gel Electrophoresis)
 2D-DIGE (2-Dimensional Difference Gel Electrophoresis)
 Mass spectrometry (MS) methods, including:
   MALDI-TOF
   SELDI-TOF
   LC-MS (Liquid Chromatography-Mass Spectrometry)
A typical protein biomarker study dataset includes data on multiple variables for every sample, for example peak intensity for MS tests or spot volume for 2D-PAGE or DIGE tests. The raw data need processing for validated quantification before statistical tests can be conducted on them. The raw data, which cover a subset of all proteins in a sample, are checked against a database and quantified. The ultimate goal of protein biomarker data analysis is to identify unique biomarkers based on the features that differentiate two or more sample groups (diseased/healthy state or different times/stages of a disease or condition).
Statistical methods for protein biomarker analysis
Depending on the aim of the study and the behaviour of the protein biomarker, different statistical tests are used, as shown in the table below (Bantscheff et al., 2007).
| Goal | Question | Statistical tests |
| --- | --- | --- |
| Differential protein biomarker expression under different conditions | Does a protein behave significantly differently between the two samples? | Multiple hypothesis testing |
| | Does a protein exhibit time-dependent change? | ANOVA |
| | Is the sample a member of a defined class of samples? | Classification methods (Linear Discriminant Analysis, Support Vector Machines) |
| Relationships or interactions between biomarkers | Which proteins behave similarly in an experiment? | Cluster analysis |
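The table names multiple hypothesis testing without specifying a procedure. One widely used option, shown here as an assumption rather than as the method of the cited source, is the Benjamini-Hochberg step-up procedure, which controls the false discovery rate when many proteins are tested at once. The p-values below are invented for illustration, and a Bonferroni correction is included for comparison.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k whose sorted p-value satisfies
    p_(k) <= k / m * fdr, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


if __name__ == "__main__":
    # Invented p-values, one per protein.
    p_values = [0.3, 0.028, 0.001, 0.035, 0.025]
    print(benjamini_hochberg(p_values))
    print(bonferroni(p_values))
```

For these five p-values, Benjamini-Hochberg flags four proteins while Bonferroni flags only the one with p = 0.001. This reflects the trade-off between controlling the false discovery rate and controlling the chance of any false positive at all.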
Thus, like every other clinical entity, biomarker analysis requires robust statistical support to quantify its effect. The basic statistical principles of managing confounders and hypothesis testing apply to any kind of biomarker study. Biomarker analysis also requires clinical validation, which has not shown much progress in the last decade, largely owing to unclear regulations and regulatory policies that give secondary importance to biomarker analysis in drug development (Cobleigh et al. 2005; Dancey et al. 2010).
Oncology research, on the other hand, has benefited greatly from biomarker discoveries (Jeffrey 2008). Different statistical tools are required to suit the objectives of each study, so an investigator must choose them wisely to get results closer to the truth. Biomarker analysis has come a long way, and in the years to come we hope to see more biomarkers being clinically validated.
References
 Bantscheff, M., Schirle, M., Sweetman, G., Rick, J., Kuster, B., 2007. Quantitative Mass Spectrometry in Proteomics: A Critical Review. Anal. Bioanal. Chem. 389, 1017–1031.
 Cobleigh, M.A., Tabesh, B., Bitterman, P., Baker, J., Cronin, M., Liu, M.L., Borchik, R., Mosquera, J.M., Walker, M.G., Shak, S., 2005. Tumor gene expression and prognosis in breast cancer patients with 10 or more positive lymph nodes. Clin. Cancer Res. 11, 8623–8631. doi:10.1158/1078-0432.CCR-05-0735.
 Covell, D.G., 2015. Data Mining Approaches for Genomic Biomarker Development: Applications Using Drug Screening Data from the Cancer Genome Project and the Cancer Cell Line Encyclopedia. PLOS ONE 10, e0127433. doi:10.1371/journal.pone.0127433.
 Dancey, J.E., Dobbin, K.K., Groshen, S., Jessup, J.M., Hruszkewycz, A.H., Koehler, M., Parchment, R., Ratain, M.J., Shankar, L.K., Stadler, W.M., True, L.D., Gravell, A., Grever, M.R., 2010. Guidelines for the Development and Incorporation of Biomarker Studies in Early Clinical Trials of Novel Agents. Clin. Cancer Res. 16, 1745–1755. doi:10.1158/1078-0432.CCR-09-2167.
 Hergenhahn, M., Muhlemann, K., Hollstein, M., Kenzelmann, M., 2003. DNA Microarrays: Perspectives for Hypothesis-Driven Transcriptome Research and for Clinical Applications. Curr. Genomics 4, 543–555. doi:10.2174/1389202033490231.
 Jeffrey, S.S., 2008. Cancer biomarker profiling with microRNAs. Nat. Biotechnol. 26, 400–401. doi:10.1038/nbt0408-400.
 Marcello Manfredi, E.R., 2013. Biomarkers Discovery through Multivariate Statistical Methods: A Review of Recently Developed Methods and Applications in Proteomics. J. Proteomics Bioinform. S3. doi:10.4172/jpb.S3-003.
 Pavlou, M.P., Diamandis, E.P., Blasutig, I.M., 2013. The Long Journey of Cancer Biomarkers from the Bench to the Clinic. Clin. Chem. 59, 147–157. doi:10.1373/clinchem.2012.184614.
 Robotti, E., Manfredi, M., Marengo, E., 2014. Biomarkers Discovery through Multivariate Statistical Methods: A Review of Recently Developed Methods and Applications in Proteomics. J. Proteomics Bioinform. S3. doi:10.4172/jpb.S3-003.