Background: The WBCD dataset is a well-known starting point for most researchers studying breast cancer detection. In this paper, we use it as a benchmark to study machine learning (ML) algorithms, namely Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Naive Bayes (NB), Logistic Regression, Extra Trees, Bagging classifiers with hard and soft voting, ensemble techniques, and Extreme Gradient Boosting classifiers such as XGBoost, along with two deep learning models, with and without regularization.
Objective: The primary objective is to revisit how existing classifiers perform on the WBCD dataset and to propose a method that uses Grid Search and Randomized Search to select the best hyperparameters, applied with and without PCA. We then check whether the WBCD dataset can be classified in less time without compromising accuracy.
Methods: We explore PCA as a feature extraction technique on this dataset and use techniques such as feature scaling, stratified K-fold cross-validation, and K-best feature selection. We implement grid search CV with PCA in the pipeline to tune the hyperparameters of various classifiers and to reduce training and prediction time without compromising accuracy. Finally, this paper compares the accuracy, precision, and recall of various ML techniques on manually selected features, chosen by inspecting the feature importance scores and the correlation matrix.
Results: In our experiment with all features, we obtain an accuracy of 97.9 percent for Extra Trees and for an ensemble of RF, KNN, and Extra Trees with a soft voting strategy. Using feature selection with PCA and grid search, we obtain an accuracy of 99.1 percent with SVM (kernel trick). We also demonstrate that training and prediction time decreases when the hyperparameters of the classifiers are tuned appropriately, which is handled by searching over Grid and Randomized hyperparameter grids.
Conclusion: This paper shows that feature subset selection or feature ranking may be neither the best nor the only approach to apply to the WBCD dataset alongside PCA. In datasets where features are closely correlated, hyperparameter tuning with either Grid or Randomized Search can be combined with PCA to extract the best feature combinations, which are then fed into the classifiers to obtain good accuracy scores in much less time.
Keywords: WBCD, breast cancer detection, PCA, grid search CV, randomized search, cross-validation, regularization, machine learning, deep learning.
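
The pipeline summarized under Methods (feature scaling, PCA, and a classifier tuned by grid search with stratified K-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual code. It assumes scikit-learn, and it assumes that the Wisconsin diagnostic data bundled with scikit-learn (load_breast_cancer) is an acceptable stand-in for WBCD. The parameter grid, the 80/20 split, and the random seeds are illustrative choices, not values taken from the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Wisconsin diagnostic data stands in for WBCD.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling and PCA live inside the pipeline so they are refit on each CV fold,
# which avoids leaking validation data into the preprocessing steps.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("svm", SVC()),
])

# Hypothetical grid: number of principal components plus SVM settings.
param_grid = {
    "pca__n_components": [5, 10, 15],
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 1, 10],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best pipeline
print(best_params, round(test_accuracy, 3))
```

Replacing GridSearchCV with RandomizedSearchCV (passing param_distributions and n_iter instead of param_grid) would give the randomized variant mentioned in the Objective. The accuracy figures reported in Results depend on the authors' own data split and grids, so this sketch is not expected to reproduce them.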