In this paper, we propose a strategy to predict subcellular locations of human
proteins using multi-step feature selection. Each protein is first encoded by features
derived from KEGG and GO enrichment scores. After an initial feature reduction, 9958
features remain and are sorted by the Minimum Redundancy Maximum Relevance
(mRMR) method. The sorted features are then filtered by an incremental feature
selection (IFS) procedure to obtain a compact feature set. A random forest
(RF) is used as the prediction model and achieves an overall prediction accuracy of
67.72%, as evaluated by ten-fold cross-validation. The KEGG pathways
and GO terms corresponding to the selected features are analyzed in depth and are
regarded as the most important terms relating to human protein subcellular location.
Keywords: Subcellular location, minimum redundancy maximum relevance,
incremental feature selection, random forest algorithm, ten-fold cross-validation.
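
As an illustration of the IFS step summarized above, the following minimal Python sketch evaluates nested prefixes of a ranked feature list and keeps the best-scoring prefix. It is a sketch under stated assumptions, not the implementation used in this work: the function name incremental_feature_selection, the evaluate callback, and the toy scoring function are hypothetical. In the setting described here, evaluate would be expected to return the ten-fold cross-validated accuracy of a random forest trained on the given features, and the ranking would come from mRMR.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float, List[float]]:
    """Score each prefix of a ranked feature list and return the best prefix.

    ranked_features: features ordered from most to least important
                     (assumed to come from a ranking method such as mRMR).
    evaluate:        hypothetical callback returning a score for a feature
                     subset (for example, cross-validated accuracy).
    """
    if not ranked_features:
        raise ValueError("ranked_features must not be empty")

    curve: List[float] = []  # score of every prefix, useful for plotting
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        curve.append(score)
        # Strict '>' keeps the shortest prefix among ties (more compact set).
        if score > best_score:
            best_k, best_score = k, score
    return list(ranked_features[:best_k]), best_score, curve


def toy_evaluate(subset: Sequence[str]) -> float:
    """Stand-in scorer (assumption): rewards informative features and
    slightly penalizes noise features. Replace with real cross-validation."""
    informative = {"f1", "f2", "f3"}
    hits = sum(1 for name in subset if name in informative)
    return 0.5 + 0.1 * hits - 0.01 * (len(subset) - hits)


ranked = ["f1", "f2", "f3", "n1", "n2"]  # hypothetical ranked feature names
selected, best, curve = incremental_feature_selection(ranked, toy_evaluate)
print(selected, round(best, 2))
```

Passing the scorer as a callback keeps the selection loop independent of any particular classifier or validation scheme, so the same loop could be reused with a different model. Evaluating every prefix of several thousand features is expensive; stepping through prefixes at a coarser interval is a common compromise, although whether that was done here is not stated.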