Identification of Important Factors in the Diagnosis of Breast Cancer Cells Using Machine Learning Models and Principal Component Analysis


  • Suejit Pechprasarn College of Biomedical Engineering, Rangsit University, Pathum Thani, 12000, Thailand
  • Ohmthong Wattanapermpool Satriwithaya School, Wat Bowon Niwet, Phra Nakhon, Bangkok, 10200, Thailand
  • Maninya Warunlawan Satriwithaya School, Wat Bowon Niwet, Phra Nakhon, Bangkok, 10200, Thailand
  • Pornchaya Homsud Satriwithaya School, Wat Bowon Niwet, Phra Nakhon, Bangkok, 10200, Thailand
  • Thumpussorn Akarajarasroj Satriwithaya School, Wat Bowon Niwet, Phra Nakhon, Bangkok, 10200, Thailand



breast cancer classification, classification of malignant and benign cells, machine learning, principal component analysis, complexity reduced model, intelligent diagnostic software


Breast cancer (BC) is a growing and widespread cause of morbidity and mortality worldwide. This study uses a publicly available clinical dataset of 699 patients from the University of Wisconsin with 9 variables: (1) clump thickness, (2) uniformity of cell size, (3) uniformity of cell shape, (4) marginal adhesion, (5) single epithelial cell size, (6) bare nuclei, (7) bland chromatin, (8) normal nucleoli, and (9) mitoses. This dataset has been used in many previous studies to pinpoint critical factors in patient diagnosis. Here, we first examine the data to check that it is unbiased and accurate. We then apply principal component analysis (PCA) and machine learning models to identify the factors that matter most in diagnosing a tumor as malignant or benign. We compare the classification accuracy of different machine learning models, including decision tree, linear discriminant, quadratic discriminant, logistic regression, naive Bayes, support vector machine (SVM), K-nearest neighbor (KNN), ensemble, neural network, and kernel. The most accurate models are medium Gaussian SVM, coarse Gaussian SVM, and cosine KNN, with an accuracy of 96.5%. PCA is then performed to identify the crucial components and build an accurate model with fewer parameters. After this reduction, the medium Gaussian SVM has the highest cross-validation classification accuracy, 96.98%, and requires only three predictors: normal nucleoli, bare nuclei, and uniformity of cell size.
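The PCA step described above can be illustrated with a minimal sketch. This is not the authors' code: it uses a small synthetic table in place of the Wisconsin dataset, the function names (standardize, covariance, first_component) are hypothetical, and power iteration is used here only as a simple way to find the first principal component. Variables with the largest absolute loadings contribute most to that component.

```python
import math
import random


def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iters=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigval = sum(v[i] * cv[i] for i in range(d))
    return eigval, v


# Synthetic stand-in data (an assumption, not the clinical dataset):
# features 0 and 1 share a latent factor, feature 2 is independent noise.
random.seed(0)
data = []
for _ in range(200):
    z = random.gauss(0, 1)
    data.append([
        z + 0.3 * random.gauss(0, 1),
        z + 0.3 * random.gauss(0, 1),
        random.gauss(0, 1),
    ])

cov = covariance(standardize(data))
eigval, loadings = first_component(cov)

# For standardized data the total variance equals the number of features.
explained = eigval / len(loadings)

ranking = sorted(range(len(loadings)), key=lambda j: -abs(loadings[j]))
print("loadings:", [round(x, 3) for x in loadings])
print("explained variance ratio:", round(explained, 3))
print("features ranked by |loading|:", ranking)
```

In this toy setup the two correlated features dominate the first component, which is the kind of signal used to select a reduced set of predictors before refitting a classifier.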





How to Cite

Pechprasarn, S., Wattanapermpool, O., Warunlawan, M., Homsud, P., & Akarajarasroj, T. (2023). Identification of Important Factors in the Diagnosis of Breast Cancer Cells Using Machine Learning Models and Principal Component Analysis. Journal of Current Science and Technology, 13(3), 642–656.