Title: Predicting Citrullination Sites in Protein Sequences Using mRMR Method and Random Forest Algorithm
Volume: 20
Issue: 2
Author(s): Qing Zhang, Xijun Sun, Kaiyan Feng, ShaoPeng Wang, Yu-Hang Zhang, SiBao Wang, Lin Lu and Yu-Dong Cai*
Affiliation:
- School of Life Sciences, Shanghai University, Shanghai 200444, China
Keywords:
Post-translational modification, citrullination site, maximum relevance minimum redundancy, random forest.
Abstract: Background: As one of the essential post-translational modifications (PTMs), citrullination,
or deimination, of an arginine residue changes the molecular weight and electrostatic
charge of its side chain. Citrullination in protein sequences is
catalyzed by a Ca2+-dependent enzyme family called peptidylarginine deiminases (PADs),
which includes five isotypes: PAD1, 2, 3, 4/5, and 6. Citrullinated proteins participate in many
biological processes; for example, the citrullination of myelin basic protein (MBP) assists the early
development of the central nervous system. However, abnormal modifications of citrullinated proteins
can also lead to severe human diseases, including multiple sclerosis and rheumatoid arthritis.
Objective: It is therefore important to identify citrullination sites in protein
sequences. Information about the location of citrullination sites in protein sequences will be
useful for investigating the molecular functions and disease mechanisms related to citrullinated
proteins.
Materials and Methods: In this study, we investigated peptide segments that contain
citrullination sites at their centers; these segments were numerically encoded from four aspects.
This yielded a training set with 116 positive samples and 232 negative samples. A reliable
feature selection technique, maximum-relevance-minimum-redundancy (mRMR), was then applied
to analyze these features, and four algorithms, namely random forest (RF), Dagging, the nearest
neighbor algorithm (NNA), and support vector machine (SVM), were combined with the incremental
feature selection (IFS) method to extract important features.
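
The mRMR ranking step named above can be illustrated with a minimal sketch. The abstract does not specify the mRMR variant or the feature encoding, so this example assumes the common "difference" form (relevance minus mean redundancy, both measured by mutual information) on discrete features. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def mrmr_rank(columns, labels):
    """Greedy mRMR ordering of feature indices (assumed difference form)."""
    relevance = [mutual_information(col, labels) for col in columns]
    remaining = set(range(len(columns)))
    ranked = []
    while remaining:
        def score(j):
            if not ranked:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[k]) for k in ranked
            ) / len(ranked)
            return relevance[j] - redundancy
        # sorted() makes tie-breaking deterministic (lowest index wins)
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Hypothetical toy data: 8 samples, 3 binary features stored column-wise.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
feat_a     = [1, 1, 1, 1, 0, 0, 0, 1]   # informative
feat_a_dup = list(feat_a)               # exact duplicate of feat_a (fully redundant)
feat_b     = [1, 1, 1, 0, 0, 0, 1, 0]   # weaker, but only mildly redundant with feat_a

toy_ranking = mrmr_rank([feat_a, feat_a_dup, feat_b], toy_labels)
print(toy_ranking)
```

In this toy case the duplicated column is ranked last even though its relevance equals that of the top feature, which is the behavior the minimum-redundancy term is meant to produce.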
Results: Finally, an optimal classifier derived from the RF algorithm was constructed to predict
citrullination sites. The 44 most prominent features were comprehensively analyzed, and their biological
characteristics in citrullination catalysis were revealed.
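
The step from a ranked feature list to an optimal classifier can be sketched with incremental feature selection: nested feature subsets are evaluated one at a time and the best-performing subset is kept. This is only an illustration under stated assumptions. A leave-one-out 1-nearest-neighbor classifier stands in for the RF classifier to keep the example dependency-free, plain accuracy is assumed as the evaluation measure, and the ranking, data, and function names are hypothetical.

```python
def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier
    (squared Euclidean distance). A stand-in for the RF classifier."""
    correct = 0
    for i, query in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(query, rows[j])),
        )
        correct += int(labels[nearest] == labels[i])
    return correct / len(rows)

def incremental_feature_selection(samples, labels, ranked):
    """Evaluate ranked[:1], ranked[:2], ... and return the best subset,
    its accuracy, and the full (subset size, accuracy) curve."""
    curve = []
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        rows = [[s[f] for f in subset] for s in samples]
        curve.append((k, loo_accuracy(rows, labels)))
    # Highest accuracy wins; ties go to the smaller subset.
    best_k, best_acc = max(curve, key=lambda t: (t[1], -t[0]))
    return ranked[:best_k], best_acc, curve

# Hypothetical toy data: features 0 and 1 are jointly informative, feature 2 is noise.
ifs_samples = [
    (4, 8, 0), (6, 9, 20), (9, 7, 0),    # positive class
    (1, 1, 20), (2, 2, 0), (5, 0, 20),   # negative class
]
ifs_labels = [1, 1, 1, 0, 0, 0]
assumed_ranking = [0, 1, 2]  # assumed output of a prior feature-ranking step

selected, best_acc, curve = incremental_feature_selection(
    ifs_samples, ifs_labels, assumed_ranking
)
print(selected, best_acc, curve)
```

On this toy data, accuracy rises when the second feature is added and falls again when the noisy third feature is included, so the two-feature subset is selected.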
Conclusion: We believe that the biological features obtained in this pioneering work
provide useful insights into the formation and function of citrullination, and that the optimal
classifier could be a useful tool for identifying citrullination sites in protein sequences.