Mitigating Data Imbalance for Robust Diabetes Diagnosis Using Machine Learning and Explainable Artificial Intelligence

Authors

  • Poorani K, Department of Computer Applications, School of Computing, Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India.
  • Balakannan S P, Department of Information Technology, School of Computing, Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India.
  • Karuppasamy M, Department of Computer Science and Engineering, RajaRajeswari College of Engineering, Bengaluru, Karnataka, India.

DOI:

https://doi.org/10.59796/jcst.V15N3.2025.111

Keywords:

diabetes prediction, explainable AI, machine learning, mutual information, random forest, SMOTE-ENN, stacking

Abstract

Diabetes prevalence is rising worldwide, and the disease is associated with a high mortality rate. Early diagnosis can significantly reduce the risk of complications and save lives. This study proposes an efficient ensemble model for diabetes diagnosis using Machine Learning (ML). To address class imbalance in the dataset, a hybrid sampling technique is implemented that combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN). This combined method, referred to as SMOTE-ENN, enhances the model’s ability to accurately predict diabetes outcomes by generating synthetic samples and removing noisy instances. Before any ML model is trained, the data must be prepared and a suitable model identified. Preprocessing techniques such as normalization and missing-value imputation are essential in preparing data for an ML model; the proposed approach applies mean-value imputation and encoding. Feature selection based on mutual information is carried out to select important variables. The PIMA Indian Diabetic dataset is balanced using SMOTE-ENN, implemented in Python with the imbalanced-learn library, which improves model performance. The dataset is split into training and test sets at an 80:20 ratio. Various ML models are then evaluated, ranging from single classifiers to ensemble models. With SMOTE-ENN and ensemble ML models, the results show that stacking achieves high accuracy (98.9%), precision (97.6%), recall (99.5%), and F1-score (98%). Explainable Artificial Intelligence is also used to interpret the results, by means of Local Interpretable Model-Agnostic Explanations (LIME). The proposed approach combines feature selection, data imbalance handling, and ensemble techniques to improve performance. Stacking with the proposed approach performs better than the state-of-the-art algorithms it is compared against.
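
The resampling idea described in the abstract (generate synthetic minority samples, then remove noisy instances) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation and not the imbalanced-learn code; in practice that library provides the combined sampler as `imblearn.combine.SMOTEENN`. The function names, the toy data, the choice of k = 3, and the majority-vote editing rule are all assumptions made for illustration.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(idx, X, k, pool=None):
    """Indices of the k nearest neighbours of X[idx] within pool, excluding idx."""
    pool = range(len(X)) if pool is None else pool
    others = [j for j in pool if j != idx]
    return sorted(others, key=lambda j: sq_dist(X[idx], X[j]))[:k]


def smote(X, y, k=3, seed=0):
    """Oversample the minority class until both classes have the same size.

    Assumes binary labels and at least two minority samples.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_new = max(counts.values()) - counts[minority]
    min_idx = [i for i, label in enumerate(y) if label == minority]
    X_out, y_out = [list(row) for row in X], list(y)
    for _ in range(n_new):
        i = rng.choice(min_idx)                           # a minority sample
        j = rng.choice(nearest(i, X, k, pool=min_idx))    # one of its minority neighbours
        gap = rng.random()                                # position along the segment i -> j
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """Drop samples whose label disagrees with the majority of their k neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, k=3, seed=0):
    """SMOTE oversampling followed by ENN cleaning (illustrative only)."""
    X_res, y_res = smote(X, y, k=k, seed=seed)
    return enn(X_res, y_res, k=k)


# Hypothetical toy data: 8 majority samples (label 0) and 3 minority samples (label 1).
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.4], [0.4, 0.1],
         [0.2, 0.3], [0.3, 0.0], [2.0, 2.0], [2.1, 2.2], [2.2, 2.0]]
y_toy = [0] * 8 + [1] * 3

if __name__ == "__main__":
    X_res, y_res = smote_enn(X_toy, y_toy)
    print(sorted(Counter(y_res).items()))
```

In a workflow like the one the abstract describes, resampling of this kind is normally applied to the training portion only, so that the held-out 20% test split contains no synthetic samples; whether the study followed that order is not stated in the abstract.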

Published

2025-06-15

How to Cite

Poorani, K., Balakannan, S. P., & Karuppasamy, M. (2025). Mitigating Data Imbalance for Robust Diabetes Diagnosis Using Machine Learning and Explainable Artificial Intelligence. Journal of Current Science and Technology, 15(3), 111. https://doi.org/10.59796/jcst.V15N3.2025.111