The goal of the project was to develop a machine learning model that predicts whether a breast tumor is benign or malignant.
Two data sets from a Wisconsin hospital, with 570 and 700 cases respectively, were used to address the problem. The first data set described the tumors exclusively at the cellular level: for each image, ten cell features were computed, each summarized by its mean, standard error, and “worst” value (the mean of the three largest values). The second data set provided discrete scores from one to ten for cell attributes and mitosis stage.
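The three summary statistics described for the first data set can be illustrated with a minimal sketch. The function name `summarize_feature`, the sample values, and the choice of the standard error of the mean (sample standard deviation divided by the square root of the number of cells) are assumptions made for illustration; the source does not state how the standard error was computed.

```python
import math
import statistics


def summarize_feature(values):
    """Summarize one cell feature measured on every cell of a single image.

    Hypothetical helper: returns the mean, the standard error of the mean
    (an assumed definition), and the "worst" value, i.e. the mean of the
    three largest measurements.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three measurements per image")
    mean = statistics.fmean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_error = statistics.stdev(values) / math.sqrt(n)
    # "Worst": mean of the three largest values.
    worst = statistics.fmean(sorted(values, reverse=True)[:3])
    return {"mean": mean, "se": std_error, "worst": worst}


# Made-up measurements of one feature for five cells in one image.
radii = [10.0, 12.0, 14.0, 16.0, 18.0]
summary = summarize_feature(radii)
print(summary)
```

Applying such a function to each of the ten features yields the thirty columns (ten means, ten standard errors, ten “worst” values) per image.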
Both data sets contained a heterogeneous distribution of benign and malignant cases.
Because the output classes were known, we applied supervised learning algorithms such as random forest and logistic regression. Accuracy was chosen as the metric for comparing the performance of the algorithms. These steps were carried out in Dataiku, a data science and machine learning platform.
The random forest approach achieved an accuracy of 99.7% and 99.3% on the first and second data set, respectively.
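The comparison of the two algorithms by accuracy can be sketched outside Dataiku as follows. This is not the project's actual pipeline: it assumes scikit-learn as a stand-in, uses the public Wisconsin diagnostic data bundled with that library as a proxy for the first data set, and picks the train/test split, the number of trees, and the feature scaling arbitrarily. The printed accuracies will therefore differ from the figures reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Public Wisconsin diagnostic data (assumed stand-in for the first data set).
X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set; split size and seed are arbitrary choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # Logistic regression is scaled first so the solver converges reliably.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {accuracies[name]:.3f}")
```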
Using the random forest method, we obtained the importance of the variables in both data sets. For the first data set, the “worst” values (the means of the three largest values) were critical for making predictions. Notably, the mitosis stage in the second data set played no role in the prediction.
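Variable importance from a random forest can be read out as in the sketch below. It again assumes scikit-learn and its bundled Wisconsin diagnostic data rather than the Dataiku setup used in the project, and it reports impurity-based importances, which may not be the measure Dataiku computed. The ranking it prints is therefore only illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Public Wisconsin diagnostic data (assumed stand-in for the first data set).
data = load_breast_cancer()

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# Impurity-based importances; they are normalized to sum to 1.
ranked = sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

# Show the five most important variables.
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```

A variable that plays no role in the prediction would appear at the bottom of such a ranking with an importance close to zero.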
The logistic regression approach achieved an accuracy of 99.6% in both cases. The confusion matrices based on the optimized F1 score were stored. A false prediction of a benign tumor in place of a malignant one was penalized more heavily.
The threshold of each confusion matrix is the predicted value beyond which a prediction was considered positive; the thresholds were set to 0.475 and 0.25.
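The effect of the threshold on the confusion matrix and the F1 score can be shown with a small sketch. The labels and predicted probabilities are made up, the function names are hypothetical, and treating "malignant" as the positive class is an assumption; only the two threshold values come from the text above.

```python
def confusion_matrix(y_true, y_prob, threshold):
    """Return (TP, FP, FN, TN), calling a case positive when prob > threshold."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up example: 1 = malignant (assumed positive class), 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.6, 0.3, 0.2, 0.4, 0.1, 0.8, 0.55]

results = {}
for threshold in (0.475, 0.25):
    tp, fp, fn, tn = confusion_matrix(y_true, y_prob, threshold)
    results[threshold] = (tp, fp, fn, tn, f1_score(tp, fp, fn))
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"F1={results[threshold][4]:.2f}")
```

In this toy example, lowering the threshold from 0.475 to 0.25 removes the one missed malignant case at the cost of one additional false alarm, which is the trade-off that a heavier penalty on false benign predictions favors.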