Title: Predicting Protein Structural Class for Low-Similarity Sequences via Novel Evolutionary Modes of PseAAC and Recursive Feature Elimination
Volume: 14
Issue: 9
Author(s): Liang Kong*, Lingfu Kong, Changwu Wang, Rong Jing and Lichao Zhang
Affiliation:
- College of Information Science and Engineering, Yanshan University, Qinhuangdao, China
Keywords:
Feature selection, position specific score matrix, protein structural class, recursive feature elimination, sequence
similarity, support vector machine.
Abstract: Background and Objective: Protein structural class prediction is a first and key step in protein
structure prediction and has become an active research area in biochemistry and bioinformatics.
An important aspect of this prediction task is finding a good feature representation. Prior work has
demonstrated the effectiveness of PSI-BLAST profile-based feature extraction methods, especially
for low-similarity protein sequences. However, prediction accuracies remain limited, which
highlights the need to further explore the potential of evolutionary information.
Method: In this study, three novel sequence evolutionary modes of pseudo amino acid composition
(PseAAC) are proposed and optimized by a two-stage feature selection process based on a recursive feature
elimination strategy. The selected top-ranking features are then fed into a linear-kernel support
vector machine classifier to predict the protein structural class. To evaluate the performance of the proposed
method, jackknife tests are performed on three widely used low-similarity benchmark datasets
(25PDB, 1189 and 640).
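As a rough illustration of the workflow described in the Method paragraph, the sketch below combines recursive feature elimination with a linear classifier and scores it by a jackknife (leave-one-out) test. It is not the authors' implementation. To stay within the Python standard library, a centroid-difference linear classifier stands in for the linear-kernel support vector machine, the elimination is single-stage rather than two-stage, the task is binary rather than multi-class, and the data are synthetic. All function names (fit_linear, predict, rfe, jackknife_accuracy) are hypothetical.

```python
import random


def fit_linear(X, y):
    """Centroid-difference linear classifier (assumed stand-in for a linear SVM)."""
    n_feat = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    mu_p = [sum(x[j] for x in pos) / len(pos) for j in range(n_feat)]
    mu_n = [sum(x[j] for x in neg) / len(neg) for j in range(n_feat)]
    w = [p - n for p, n in zip(mu_p, mu_n)]
    # Place the decision boundary midway between the two class centroids.
    b = -sum(wj * (p + n) / 2 for wj, p, n in zip(w, mu_p, mu_n))
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


def rfe(X, y, n_keep):
    """Refit repeatedly, dropping the feature with the smallest |weight| each round."""
    kept = list(range(len(X[0])))
    while len(kept) > n_keep:
        w, _ = fit_linear([[x[j] for j in kept] for x in X], y)
        weakest = min(range(len(kept)), key=lambda i: abs(w[i]))
        kept.pop(weakest)
    return kept


def jackknife_accuracy(X, y, n_keep):
    """Hold out each sample once; select features and train on the remaining ones."""
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        kept = rfe(X_tr, y_tr, n_keep)
        model = fit_linear([[x[j] for j in kept] for x in X_tr], y_tr)
        correct += predict(model, [X[i][j] for j in kept]) == y[i]
    return correct / len(X)


# Synthetic data (assumption): features 0 and 1 carry the class signal, 2 to 5 are noise.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(4.0 * t if j < 2 else 0.0, 1.0) for j in range(6)] for t in y]

selected = rfe(X, y, n_keep=2)
accuracy = jackknife_accuracy(X, y, n_keep=2)
print("selected features:", sorted(selected))
print("jackknife accuracy:", accuracy)
```

Feature selection is repeated inside every jackknife fold here, so the held-out sample never influences which features are kept. Whether the study selects features inside or outside the jackknife loop is not stated in the abstract.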
Results: In a comprehensive comparison with current state-of-the-art methods, the proposed method
achieves superior performance. The overall accuracies on the 25PDB, 1189 and 640 datasets are 96.2%,
97.9% and 99.5%, respectively, which are 1.9%, 1.5% and 2.3% higher than those of the previous best-performing method.
Conclusion: The satisfactory prediction accuracies achieved by the proposed method are attributed to
the specially designed sequence evolutionary modes of PseAAC and the effective feature selection strategy,
which cover more discriminative sequence order information. It is anticipated that our method will
also be helpful for other prediction problems in protein research.