Title: Machine Learning Study of Classifiers Trained with Biophysiochemical Properties of Amino Acids to Predict Fibril Forming Peptide Motifs
Volume: 19
Issue: 9
Author(s): Smitha Sunil Kumaran Nair, N. V. Subba Reddy and K. S. Hareesha
Affiliation:
Keywords:
Amyloid fibrils, ant colony optimization, biophysiochemical properties, decision tree, memetic algorithm, neural
network, principal component analysis, random forest, support vector machine
Abstract: Predicting the short protein fragments capable of forming amyloid-like fibril motifs helps in understanding
the cause of amyloid illnesses and aids the discovery of sequence-targeted anti-aggregation drugs. Because molecular
techniques for identifying such fragments have limitations, computational tools that provide affordable in silico
predictions are highly desirable. In this research article, we study, from a machine learning perspective, the performance
of several machine learning classifiers that use heterogeneous features based on biochemical and biophysical properties
of amino acids to discriminate between amyloidogenic and non-amyloidogenic regions in peptides. Four conventional
machine learning classifiers, namely Support Vector Machine (SVM), neural network, decision tree and random forest, were
trained and tested to find the classifier that best fits the problem domain. Prior to classification, novel implementations
of two biologically inspired feature optimization techniques, one based on evolutionary algorithms and one on methodologies
that mimic social life, together with a multivariate method based on projection, were applied to remove unimportant and
uninformative features. Among the dimensionality reduction algorithms considered in the study, prediction results
show that algorithms based on evolutionary computation are the most effective. Among the classifiers considered, SVM
best suits the problem domain. The best classifier is also compared with an online predictor to demonstrate the
balance maintained between true positive rates and false positive rates in the proposed classifier. This exploratory
study suggests that these methods are promising for amyloidogenicity prediction and may be further extended to
large-scale proteomic studies.
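
The abstract compares classifiers by the balance between true positive rate and false positive rate. As a minimal sketch of how these two rates are computed for a binary predictor, the Python snippet below uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The function name `rates` and the label vectors are hypothetical and purely illustrative; they are not data or code from the study.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels, where 1 = amyloidogenic and 0 = non-amyloidogenic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    # Guard against division by zero when a class is absent.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical labels for eight peptides (illustrative only, not from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tpr, fpr = rates(y_true, y_pred)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

A predictor is usually considered well balanced when it keeps the TPR high without a matching rise in the FPR; reporting both rates side by side makes that trade-off visible when two predictors are compared on the same peptides.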