Sequence classification is one of the most fundamental machine learning tasks in computational biology. With large corpora of annotated sequences now widely available, supervised learning techniques can greatly speed up the identification of new sequences that share a given function or property. Many methods have been proposed over the years, and we hope to provide an introduction to some of the more prominent ones by focussing on protease cleavage prediction: a typical representative of this class of problems. The modes of proteolytic action among cysteine proteases span a broad range of complexity and feature specificity, illustrating the strengths and limitations of the different machine learning techniques applied to them.
This review briefly introduces the particulars of predicting cleavage by calpains and caspases. We then offer some general practical considerations on preparing sequences for use with machine learning algorithms, before covering specific methods. The methods presented range from basic position-based statistical models to more technically advanced methods such as Markov models or kernel-based algorithms, as well as methods with more restricted goals such as decision trees. For each family of algorithms, we introduce example implementations and compare their performance, along with their particular strengths and weaknesses.
With this review, we aim to help readers decide whether to choose an existing method or develop a new one, based on the complexity and specific needs of a given sequence classification problem.