Mitigating Data Imbalance for Robust Diabetes Diagnosis Using Machine Learning and Explainable Artificial Intelligence
DOI:
https://doi.org/10.59796/jcst.V15N3.2025.111
Keywords:
diabetes prediction, explainable AI, machine learning, mutual information, random forest, SMOTE-ENN, stacking
Abstract
Diabetes prevalence is rising worldwide, and the disease is associated with a high mortality rate. Early diagnosis can significantly reduce the risk of complications and save lives. This study proposes an efficient ensemble model for diabetes diagnosis using Machine Learning (ML). To address class imbalance in the dataset, a hybrid sampling technique is implemented that combines the Synthetic Minority Over-sampling Technique (SMOTE) with Edited Nearest Neighbors (ENN). This combined method, referred to as SMOTE-ENN, enhances the model’s ability to predict diabetes outcomes accurately by generating synthetic minority-class samples and removing noisy instances. Data preprocessing, such as normalization and the filling of missing values, is essential in preparing data for an ML model; the proposed approach applies mean value imputation and encoding. Feature selection with mutual information is then carried out to select the important variables. The PIMA Indian Diabetic dataset is balanced using SMOTE-ENN as implemented in the Python imbalanced-learn library, which improves model performance. The dataset is split at an 80:20 ratio (train:test). Various ML models are then evaluated, ranging from single classifiers to ensemble models. With the proposed approach, SMOTE-ENN combined with ensemble ML models, stacking achieves high accuracy (98.9%), precision (97.6%), recall (99.5%), and F1-score (98%). Explainable Artificial Intelligence is also used to interpret the results with the help of Local Interpretable Model-Agnostic Explanations (LIME). The proposed approach combines feature selection, data imbalance handling, and ensemble techniques to improve performance. Stacking with the proposed approach outperforms state-of-the-art algorithms.
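The resampling idea behind SMOTE-ENN can be illustrated with a minimal sketch. This is not the authors' code: the study uses the imbalanced-learn library on the PIMA dataset, whereas the example below uses only the Python standard library and a synthetic two-dimensional toy dataset. The function names (`nearest`, `smote`, `enn`), the neighbour count `k=3`, the random seeds, and the majority-vote cleaning rule are all assumptions made for illustration, and binary labels are assumed. Library implementations may use different defaults.

```python
import math
import random
from collections import Counter


def nearest(point, pool, k):
    """Return indices of the k points in `pool` closest to `point` (Euclidean)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote(X, y, minority, k=3, seed=0):
    """Oversample the minority class by interpolating between minority neighbours.

    Assumes binary labels and at least two minority samples (illustrative only).
    """
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    # Number of synthetic samples needed to match the majority class size.
    deficit = (len(y) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(deficit, 0)):
        base = rng.choice(minority_pts)
        # Ask for k + 1 neighbours because the closest point to `base` is itself.
        candidates = nearest(base, minority_pts, k + 1)[1:]
        neighbour = minority_pts[rng.choice(candidates)]
        gap = rng.random()  # position along the segment base -> neighbour
        X_new.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
        y_new.append(minority)
    return X_new, y_new


def enn(X, y, k=3):
    """Drop samples whose label disagrees with the majority of their k neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        others = [j for j in nearest(x, X, k + 1) if j != i][:k]
        votes = Counter(y[j] for j in others)
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy imbalanced data: 40 majority points around (0, 0), 10 minority around (3, 3).
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]
X = majority + minority
y = [0] * 40 + [1] * 10

X_bal, y_bal = smote(X, y, minority=1)   # step 1: balance the classes
X_clean, y_clean = enn(X_bal, y_bal)     # step 2: remove noisy instances

print("original:  ", Counter(y))
print("after SMOTE:", Counter(y_bal))
print("after ENN:  ", Counter(y_clean))
```

In this sketch the oversampling step brings the minority class up to the size of the majority class, and the cleaning step then discards points from either class that sit among neighbours of the other class. In a real pipeline the resampling would normally be applied to the training split only, so that the test split remains untouched.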
License
Copyright (c) 2025 Journal of Current Science and Technology

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.