Title: DeepSSPred: A Deep Learning Based Sulfenylation Site Predictor Via a Novel nSegmented Optimize Federated Feature Encoder
Volume: 28
Issue: 6
Author(s): Zaheer Ullah Khan and Dechang Pi*
Affiliation:
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
Keywords:
S-sulfenylation proteins, cytokine signaling, new feature encoding scheme, nSegmented wrapper feature, 2DCNN,
deep learning.
Abstract:
Background: S-sulfenylation (S-sulphenylation, or sulfenic acid modification) of proteins is a special
kind of post-translational modification that plays an important role in various physiological and
pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. Given
this significance, and to complement existing wet-lab methods, several
computational models have been developed for predicting cysteine sulfenylation sites. However,
the performance of these models has not been satisfactory due to inefficient feature schemes, severe class
imbalance, and the lack of an intelligent learning engine.
Objective: This study aims to establish a strong and novel computational predictor
for discriminating sulfenylation sites from non-sulfenylation sites.
Methods: We report an innovative bioinformatics feature encoding tool, named
DeepSSPred, in which the encoded features are obtained via an nSegmented hybrid feature scheme.
The synthetic minority oversampling technique (SMOTE) was then employed to cope with
the severe imbalance between SC-sites (minority class) and non-SC sites (majority class).
A state-of-the-art 2D convolutional neural network (2D-CNN) was employed under a rigorous 10-fold jackknife
cross-validation technique for model validation and authentication.
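
The oversampling step named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote`, the neighbour count `k`, the seed, and the two-dimensional toy vectors (standing in for encoded SC-site features) are all assumptions made for illustration. It shows the core idea of synthetic minority oversampling, which is to create new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: 4 minority (SC-site) vectors against 10 majority samples.
sc_sites = [[0.1, 0.9], [0.2, 0.8], [0.15, 0.7], [0.3, 0.85]]
n_non_sc = 10
new_points = smote(sc_sites, n_non_sc - len(sc_sites), k=3)
balanced_minority = sc_sites + new_points
```

After this step the minority class has as many samples as the majority class. In a cross-validation setting, oversampling is usually applied only to the training portion of each fold so that synthetic points do not leak into the held-out data.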
Results: Following the proposed framework, a strong discrete representation of the feature space,
the machine learning engine, and an unbiased presentation of the underlying training data yielded an
excellent model that outperforms all existing established studies. The proposed approach is
6% higher in terms of MCC than the first-best existing method. On an independent dataset, the existing first-best
study failed to provide sufficient details. The model obtained an increase of 7.5% in accuracy (ACC),
1.22% in sensitivity (Sn), 12.91% in specificity (Sp), and 13.12% in MCC on the training data, and 12.13% in ACC, 27.25%
in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset, in comparison with the second-best
method. These empirical analyses show the superior performance of the proposed model on
both the training and independent datasets in comparison with existing studies in the literature.
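
For readers unfamiliar with the reported measures, the sketch below computes ACC, Sn, Sp, and MCC from a binary confusion matrix using their standard definitions. The function name `site_metrics` and the confusion-matrix counts are hypothetical and are not taken from the study's results.

```python
import math


def site_metrics(tp, tn, fp, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    sn = tp / (tp + fn)                    # sensitivity (true positive rate)
    sp = tn / (tn + fp)                    # specificity (true negative rate)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts for illustration only.
m = site_metrics(tp=40, tn=45, fp=5, fn=10)
```

MCC uses all four cells of the confusion matrix, which is why it is often preferred over accuracy when the positive and negative classes are heavily imbalanced.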
Conclusion: In this research, we have developed a novel sequence-based automated predictor for
SC-sites, called DeepSSPred. The empirical simulation outcomes on a training dataset and an independent
validation dataset have revealed the efficacy of the proposed model. The good
performance of DeepSSPred is due to several factors, such as the novel discriminative feature encoding
schemes, the SMOTE technique, and the careful construction of the prediction model through the
tuned 2D-CNN classifier. We believe that our work will provide useful insight for the
further prediction of S-sulfenylation characteristics and functionalities. We hope that the developed
predictor will be significantly helpful for the large-scale discrimination of unknown SC-sites in
particular and for designing new pharmaceutical drugs in general.