Abstract:
Breast cancer is the second most common type of cancer among women. The World Health
Organisation reports over 2.3 million new cases and 600,000 deaths worldwide each year, and
approximately 3,000 people develop the disease each year in Sri Lanka. These figures highlight
the importance of highly sensitive diagnostic methods for improving early detection and
survival. This study proposes a hybrid method that combines Shannon entropy-based attribute
selection with a Support Vector Machine (SVM) classifier for the detection of breast cancer.
The aim is to
enhance classification performance while preserving the comprehensibility of original clinical
features, thereby providing a robust tool that can be applied in real-world medical settings. Unlike
dimensionality reduction techniques such as Principal Component Analysis (PCA), which
transform the original features into new ones, the proposed entropy-based approach retains
the original features, so that model output is easier to interpret and more clinically
relevant. Two benchmark datasets were used to
evaluate the methodology: the Wisconsin Diagnostic Breast Cancer (WDBC) dataset with 30
continuous features, and the Wisconsin Breast Cancer (WBC) dataset with 9 discrete features.
Dataset-specific preprocessing techniques were applied. Shannon entropy was used to calculate
the information gain of every feature, and the most informative features were selected by
visual inspection of bar plots and cumulative curves. Thirteen features were selected for the
WDBC dataset and five for the WBC dataset. The selected features were used to train and test the SVM
classifiers. The resulting models achieved accuracies of 94.17% on the WDBC dataset and
96.10% on the WBC dataset. Precision, recall, and F1-scores for the benign and malignant
classes demonstrated strong classification performance. The proposed method is an
accurate, explainable, and computationally lightweight technique for the diagnosis
of breast cancer. Moreover, the Shannon entropy-based method is applicable to both continuous
and discrete datasets.
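
To make the feature-scoring step concrete, the following is a minimal sketch of Shannon entropy-based information gain for discrete features, written in Python with the standard library only. It is an illustration under stated assumptions, not the implementation used in the study: the function names (shannon_entropy, information_gain, rank_features) and the toy data are hypothetical, and the handling of continuous features such as those in WDBC (for example, binning before scoring) is not specified in this abstract. The sketch only ranks features; in the study, the cut-off was chosen visually from bar plots and cumulative curves.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy H(Y), in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """H(Y) minus the weighted entropy of Y within each distinct feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(
        (len(group) / total) * shannon_entropy(group) for group in groups.values()
    )
    return shannon_entropy(labels) - conditional


def rank_features(columns, labels):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy data invented for illustration only (not taken from WBC or WDBC).
labels = ["B", "B", "B", "M", "M", "M"]
columns = {
    "feature_a": [1, 1, 1, 5, 5, 5],  # separates the two classes perfectly
    "feature_b": [1, 2, 1, 2, 1, 2],  # carries very little class information
}

ranking = rank_features(columns, labels)
for name, gain in ranking:
    print(f"{name}: {gain:.3f} bits")
```

Because the score is computed on the original columns, the ranked names can be read directly as clinical features, which is the interpretability property contrasted with PCA above. The retained columns would then be passed to an SVM classifier for training and testing.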