The number and size of biological databases have grown rapidly, and so has the need for modern, powerful data analysis tools and techniques. To meet this demand, machine learning has become an indispensable tool in bioinformatics research. Biomarkers act as indicators of a biological state of the body and can help to detect different kinds of diseases.
Machine learning is a process in which computers learn from experience, example and analogy. It employs a wide range of algorithms based on artificial intelligence and statistics. Machine learning techniques such as support vector machines, Markov models, graphical models and neural networks are used to analyze life science data. These techniques can handle the randomness and uncertainty of noisy data and can generalize from it (Zhang and Rajapakse, 2008).
Machine learning techniques in bioinformatics
The main aim of machine learning is to extract useful information from a collection of data by building a good probabilistic model. In practice, this means programming computers to optimize a performance criterion using example data or past experience. With these algorithms, computers can learn from experience with respect to a class of tasks and a performance measure. These algorithms are well suited to the study of molecular biology because they can construct classifiers or hypotheses that explain complex relationships in the data.
In supervised learning, objects are classified using sets of attributes, or features. The classification process assigns objects to classes based on the values of these features. In bioinformatics, the features may be, for example, the expression levels of individual genes in tissue samples, or the presence or absence of a given amino acid at a particular position in a protein sequence. The goal of supervised learning is to design a system that can correctly predict the class membership of new objects based on the available features (Tarca et al., 2007).
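To make the idea of predicting class membership from features concrete, here is a minimal sketch of a supervised classifier. It uses a k-nearest-neighbour majority vote, one of several possible choices. The training data, the two "gene expression" features, the class names and the function name `knn_predict` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical training set: each object is (features, class label).
# The two features stand in for expression levels of two genes.
TRAINING = [
    ((1.0, 1.2), "healthy"),
    ((0.8, 1.0), "healthy"),
    ((1.1, 0.9), "healthy"),
    ((3.0, 3.2), "diseased"),
    ((3.1, 2.9), "diseased"),
    ((2.8, 3.0), "diseased"),
]


def knn_predict(training, features, k=3):
    """Predict the class of `features` by majority vote of its k nearest neighbours."""
    # Sort the labelled objects by Euclidean distance to the new object.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], features))
    # Count the class labels among the k closest objects.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


prediction_a = knn_predict(TRAINING, (0.9, 1.1))
prediction_b = knn_predict(TRAINING, (3.0, 3.0))
print(prediction_a, prediction_b)
```

The labelled examples play the role of past experience, and the new objects are assigned to a class purely from their feature values.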
In unsupervised learning, no predefined classes are available for the objects. The main aim is to explore the data and to find similarities among the objects. In this setting, all data are unlabelled, and the learning process involves defining labels and associating objects with them (Tarca et al., 2007).
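As an illustration of learning without predefined classes, the following sketch groups unlabelled points with a simple k-means procedure. The data points and the function name `k_means` are hypothetical. The sketch also assumes that the first k points serve as the initial cluster centres, which is a simplification; practical implementations usually pick them at random.

```python
import math


def k_means(points, k, iterations=20):
    """Group `points` into k clusters and return (labels, centroids)."""
    # Assumption: use the first k points as initial centroids (deterministic,
    # for illustration only).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical unlabelled data forming two visible groups.
POINTS = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
labels, centroids = k_means(POINTS, k=2)
print(labels)
```

The cluster indices returned here are the "labels" that the algorithm defines for itself; they carry no meaning beyond grouping similar objects together.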
Feature selection and algorithms used for bioinformatics
Feature selection is an important step that precedes the learning algorithm. It removes unimportant and redundant features so that results are more precise. Feature selection methods fall into two broad categories: wrappers and filters. Wrappers use the learning algorithm itself to evaluate the usefulness of features, which in turn helps in formulating algorithms for biomarkers. Filters evaluate features based on general characteristics of the data, independently of any learning algorithm. In bioinformatics, feature selection helps to identify and estimate the sequences relevant to biomarker research. For large databases, filters have proven to be much faster than wrappers.
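A filter can be sketched as a scoring function applied to each feature on its own, without training any classifier. The example below ranks features by how well they separate two classes, using the difference of class means divided by the sum of class standard deviations. This particular score is an assumption made for illustration, not a method taken from the cited sources, and the sample values, labels and the function name `filter_rank` are hypothetical.

```python
from statistics import mean, stdev


def filter_rank(samples, labels, top_n):
    """Return the indices of the top_n features ranked by class separation."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = []
    for j in range(len(samples[0])):
        # Collect the values of feature j for each class.
        a = [row[j] for row, lab in zip(samples, labels) if lab == classes[0]]
        b = [row[j] for row, lab in zip(samples, labels) if lab == classes[1]]
        diff = abs(mean(a) - mean(b))
        spread = stdev(a) + stdev(b)
        if spread > 0:
            score = diff / spread
        else:
            # No within-class variation: perfect separation if the means differ.
            score = float("inf") if diff > 0 else 0.0
        scores.append((score, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_n]]


# Hypothetical data: feature 0 differs between classes, features 1 and 2 do not.
SAMPLES = [
    (1.0, 5.0, 2.0),
    (1.2, 3.0, 2.1),
    (0.9, 4.0, 1.9),
    (3.0, 4.5, 2.0),
    (3.2, 3.5, 2.1),
    (2.9, 4.0, 1.9),
]
LABELS = ["control", "control", "control", "case", "case", "case"]

selected = filter_rank(SAMPLES, LABELS, top_n=1)
print(selected)
```

Because each feature is scored once and no model is trained, the cost grows only with the number of features and samples, which is why filters tend to be faster than wrappers on large datasets.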
Common machine learning algorithms are as follows:
- Linear regression: It is used to estimate real (continuous) values based on one or more continuous variables. The relationship is expressed by the equation Y = a*x + b, where Y is the dependent variable, a is the slope, x is the independent variable and b is the intercept.
- Logistic regression: It is used to predict the probability of an event by fitting data to a logit function.
- Decision tree: It is a supervised learning algorithm widely used for classification. The population is split into two or more homogeneous sets.
- SVM (Support Vector Machine): In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. The classes are then separated by a hyperplane in that space.
- Naive Bayes: It is based on Bayes' theorem with an assumption of independence between predictors, that is, one particular feature is assumed to be unrelated to any other feature.
- KNN (K-Nearest Neighbours): It is used for both classification and regression problems. It stores all available cases and classifies a new case by a majority vote of its k nearest neighbours.
- K-Means: It is an unsupervised algorithm used for clustering.
- Random Forest: It is a collection of decision trees, hence the name forest. Each tree votes for a class, and the forest chooses the classification with the most votes.
- Dimensionality reduction algorithms: They identify the most significant variables in a dataset. They are often used alongside other algorithms such as decision trees, SVM and random forests.
- Gradient boosting (GBM): It is a boosting algorithm used when there is plenty of data and high predictive power is required. Boosting is an ensemble learning approach that combines the predictions of several base estimators in order to improve robustness over a single estimator (Sunil, 2016).
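Two of the algorithms listed above are simple enough to sketch in a few lines. The example below estimates the slope a and intercept b of Y = a*x + b by ordinary least squares, and shows the logistic function that logistic regression uses to turn a real-valued score into a probability. The function names `fit_line` and `logistic` and the data values are hypothetical and serve only as an illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b in Y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


def logistic(z):
    """Map any real value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical data generated from Y = 2*x + 1.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(XS, YS)
print(a, b)
print(logistic(0.0))
```

In logistic regression the input to `logistic` would itself be a linear combination such as a*x + b, so the two pieces fit together naturally.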
Applications of machine learning
Certain molecules can be labelled with suitable isotopes and introduced into a living organism, where they bind to a specific site and reveal the presence of disease at that site. Such molecules help in biomarker detection for early disease identification. Machine learning supports the analysis of genome sequencing datasets, including the annotation of sequence elements and of epigenetic, proteomic or metabolomic data (Libbrecht and Noble, 2015). It can also be applied to spectrometry data from various biological samples and sequences, where it helps to identify proteins that serve as biomarkers associated with diseases and to classify them for various treatment groups and diagnostics. Challenges in such investigations include predicting which proteins are present, protein quantification, planning for the use of machine learning, and small sample sizes (Swan et al., 2013).
In the post-genomic era, machine learning has proved to be an important tool in biomarker research and in the early detection of disease. Supervised and unsupervised machine learning help in the classification and clustering of data, respectively. Machine learning has also helped the field move from human-generated classification to automatically generated databases. Bioinformatics deals with genomics, proteomics, and protein 3D structure.
- Libbrecht, M. W. and Noble, W. S. (2015) ‘Machine learning applications in genetics and genomics’, Nature Reviews Genetics, pp. 321–332. doi: 10.1038/nrg3920.
- Sunil, R. (2016) ‘Essentials of Machine Learning Algorithms (with Python and R Codes)’, 20.08.2015, 20, pp. 1–15. Available at: https://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/.
- Swan, A. L. et al. (2013) ‘Application of Machine Learning to Proteomics Data: Classification and Biomarker Identification in Postgenomics Biology’, OMICS: A Journal of Integrative Biology, 17(12), pp. 595–610. doi: 10.1089/omi.2013.0017.
- Tarca, A. L. et al. (2007) ‘Machine Learning and Its Applications to Biology’, PLoS Computational Biology, 3(6), p. e116. doi: 10.1371/journal.pcbi.0030116.
- Zhang, Y. Q. and Rajapakse, J. C. (2008) Machine Learning in Bioinformatics. doi: 10.1002/9780470397428.