Title: Variable Length Character N-Gram Embedding of Protein Sequences for Secondary Structure Prediction
Volume: 28
Issue: 5
Author(s): Ashish Kumar Sharma*, Rajeev Srivastava
Affiliation:
- Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
Keywords:
Proteomics, protein secondary structure, amino acid sequence, character n-gram embedding, deep learning, bidirectional
long short-term memory.
Abstract:
Background: The prediction of a protein's secondary structure from its amino acid sequence
is an essential step towards predicting its 3-D structure. Prediction performance improves
by incorporating homologous multiple sequence alignment information. However, homologous
details are not available for all proteins; it is therefore necessary to predict the protein
secondary structure from single sequences.
Objective and Methods: Protein secondary structure is predicted from the primary sequence using
n-gram word embedding and a deep recurrent neural network. Protein secondary structure depends
on both local and long-range neighboring residues in the primary sequence. In the proposed work,
variable-length character n-gram words capture the local contextual information of amino acid
residues, and each n-gram word is represented by an embedding vector. Further, a bidirectional
long short-term memory (Bi-LSTM) model captures long-range context by extracting information
from both past and future residues in the primary sequence.
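The variable-length character n-gram construction described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation; the window lengths (1 to 3) and the sliding-window scheme are assumptions for illustration:

```python
def char_ngrams(sequence, min_n=1, max_n=3):
    """Slide windows of length min_n..max_n over an amino acid
    sequence and collect the character n-gram 'words'.
    Window lengths here are illustrative, not the paper's setting."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(sequence) - n + 1):
            grams.append(sequence[i:i + n])
    return grams

# Example on a short fragment of a primary sequence:
# unigrams and bigrams of "MKV"
print(char_ngrams("MKV", min_n=1, max_n=2))  # ['M', 'K', 'V', 'MK', 'KV']
```

Each extracted n-gram word would then be mapped to a learned embedding vector before being fed, position by position, into the Bi-LSTM.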
Results: The proposed model is evaluated on three public datasets: ss.txt, RS126, and CASP9. It
achieves Q3 accuracies of 92.57%, 86.48%, and 89.66% on ss.txt, RS126, and CASP9, respectively.
Conclusion: The performance of the proposed model is compared with state-of-the-art methods
reported in the literature. The comparative analysis shows that the proposed model outperforms
these methods.