Identification of Important Factors in the Diagnosis of Breast Cancer Cells Using Machine Learning Models and Principal Component Analysis

Authors

  • Suejit Pechprasarn, College of Biomedical Engineering, Rangsit University, Pathum Thani, 12000, Thailand
  • Ohmthong Wattanapermpool, Satriwithaya School, Wat Bowon Niwet, Phra Nakhon, Bangkok, 10200, Thailand
  • Maninya Warunlawan, Satriwithaya School, Wat Bowon Niwet, Phra Nakhon, Bangkok, 10200, Thailand
  • Pornchaya Homsud, Satriwithaya School, Wat Bowon Niwet, Phra Nakhon, Bangkok, 10200, Thailand
  • Thumpussorn Akarajarasroj, Satriwithaya School, Wat Bowon Niwet, Phra Nakhon, Bangkok, 10200, Thailand

DOI:

https://doi.org/10.59796/jcst.V13N3.2023.700

Keywords:

breast cancer classification, classification of malignant and benign cells, machine learning, principal component analysis, complexity reduced model, intelligent diagnostic software

Abstract

Breast cancer (BC) is recognized as a widespread disease with a significant and growing impact on morbidity and mortality worldwide. This study uses a publicly available clinical dataset of 699 patients from the University of Wisconsin with nine variables: (1) clump thickness, (2) uniformity of cell size, (3) uniformity of cell shape, (4) marginal adhesion, (5) single epithelial cell size, (6) bare nuclei, (7) bland chromatin, (8) normal nucleoli, and (9) mitoses. This dataset has been used in many previous studies to pinpoint critical factors in patient diagnosis. Here, we first examine the data to verify that it is unbiased and accurate. We then apply principal component analysis (PCA) and machine learning models to identify the key factors in diagnosing a tumor as malignant or benign. We compare the classification accuracy of different machine learning models, including tree, linear discriminant, quadratic discriminant, logistic regression, naive Bayes, support vector machine (SVM), K-nearest neighbor (KNN), ensemble, neural network, and kernel models. The models that achieve the highest accuracy are the medium Gaussian SVM, coarse Gaussian SVM, and cosine KNN, with an accuracy of 96.5%. PCA is then performed to identify the crucial components and build an accurate model with fewer parameters. The medium Gaussian SVM has the highest cross-validation classification accuracy, 96.98%, and requires only three predictors: normal nucleoli, bare nuclei, and uniformity of cell size.
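
The PCA step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the study's data: the nine feature names come from the abstract, while the synthetic scores, the helper names make_synthetic_data and pca, and the choice to rank features by their loading on the first component are assumptions made purely for illustration.

```python
import numpy as np

# Feature names are taken from the abstract; all data below are synthetic.
FEATURES = [
    "clump thickness", "uniformity of cell size", "uniformity of cell shape",
    "marginal adhesion", "single epithelial cell size", "bare nuclei",
    "bland chromatin", "normal nucleoli", "mitoses",
]


def make_synthetic_data(n_samples=200, seed=0):
    """Hypothetical stand-in for the real dataset: integer scores from 1 to 10."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)           # 0 = benign, 1 = malignant
    centers = np.where(labels[:, None] == 1, 7.0, 2.0)    # class-dependent mean score
    scores = centers + rng.normal(0.0, 1.5, size=(n_samples, len(FEATURES)))
    return np.clip(np.rint(scores), 1, 10), labels


def pca(X):
    """PCA on standardized features via eigendecomposition of the correlation matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)                         # equals the correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(corr)                # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                      # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()                    # explained variance ratio
    return explained, eigvecs                              # columns of eigvecs are components


X, y = make_synthetic_data()
explained, components = pca(X)

# Rank features by the magnitude of their loading on the first component.
ranking = sorted(zip(FEATURES, np.abs(components[:, 0])), key=lambda t: -t[1])

print("explained variance ratio:", np.round(explained, 3))
for name, loading in ranking[:3]:
    print(f"{name}: |loading| = {loading:.3f}")
```

Because the synthetic features are generated identically, the ranking printed here carries no clinical meaning; on real data, the loadings would indicate which variables contribute most to each component, and a reduced set of predictors could then be passed to a cross-validated classifier.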

Published

2023-08-30

How to Cite

Pechprasarn, S., Wattanapermpool, O., Warunlawan, M., Homsud, P., & Akarajarasroj, T. (2023). Identification of Important Factors in the Diagnosis of Breast Cancer Cells Using Machine Learning Models and Principal Component Analysis. Journal of Current Science and Technology, 13(3), 642–656. https://doi.org/10.59796/jcst.V13N3.2023.700