Effect of Resampling Techniques on Machine Learning Models for Classifying Road Accident Severity in Thailand

Teerawat Simmachan; Pichit Boonkrong

doi:10.59796/jcst.V15N2.2025.99

Authors

Teerawat Simmachan, Department of Mathematics and Statistics, Faculty of Science and Technology, Thammasat University, Pathum Thani 12120, Thailand; Thammasat University Research Unit in Statistical Theory and Applications, Thammasat University, Pathum Thani 12120, Thailand. ORCID: https://orcid.org/0000-0002-0210-3623
Pichit Boonkrong, College of Biomedical Engineering, Rangsit University, Pathum Thani 12000, Thailand. ORCID: https://orcid.org/0000-0001-5105-0460

Keywords:

gradient boosting, imbalanced data, KNN, over-sampling, random forest, road safety, SDGs 3

Abstract

Road traffic accidents (RTAs) pose a significant global challenge, particularly in Thailand. This study investigates the impact of resampling techniques on machine learning (ML) models for classifying road accident severity in Thailand, using data on 31,817 road traffic accidents recorded between January 1, 2021, and December 31, 2022. The primary challenge addressed is class imbalance: fatal accidents represent a small fraction of the dataset. Three popular ML models, Random Forest (RF), K-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGB), were evaluated under four sampling schemes: the original imbalanced data (IB) as a baseline, Under-sampling (US), Over-sampling (OS), and Combined Sampling (CS). Pairing each model with each scheme produced 12 ML models, whose performance was evaluated under three train/test split ratios: 70/30, 80/20, and 90/10. Compared with the IB baseline, the US, OS, and CS techniques all significantly improved model performance, particularly in terms of F1 score, G-mean, and balanced accuracy. For each model, the best-performing combinations were RF-CS, KNN-OS, and XGB-CS. Although its evaluation metrics improved over the imbalanced scheme, KNN remained weaker than RF and XGB at detecting fatal accidents, and it continued to struggle with the class imbalance even after resampling. These findings suggest that choosing an appropriate resampling technique is crucial for enhancing model performance in classifying accident severity.
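The four sampling schemes and the three metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `resample` and `imbalance_metrics` are hypothetical, random duplication and random removal stand in for whatever OS and US procedures the study used, and the "meet in the middle" target for CS is an assumption. The example data (10 fatal cases out of 100) is made up. In practice, resampling is normally applied to the training split only, so that the test split keeps the original class ratio.

```python
import random
from collections import Counter
from math import sqrt


def resample(labels, scheme, seed=0):
    """Return row indices after resampling a list of class labels.

    scheme: "IB" (no resampling), "US", "OS", or "CS".
    """
    if scheme == "IB":
        return list(range(len(labels)))
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    sizes = [len(idx) for idx in by_class.values()]
    if scheme == "US":
        target = min(sizes)  # shrink every class to the minority size
    elif scheme == "OS":
        target = max(sizes)  # grow every class to the majority size
    elif scheme == "CS":
        target = sum(sizes) // len(sizes)  # assumption: meet at the mean size
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    out = []
    for idx in by_class.values():
        if len(idx) >= target:
            # Drop rows at random (sampling without replacement).
            out.extend(rng.sample(idx, target))
        else:
            # Keep every original row and add random duplicates.
            out.extend(idx + rng.choices(idx, k=target - len(idx)))
    return out


def imbalance_metrics(y_true, y_pred, positive=1):
    """F1, G-mean, and balanced accuracy for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    return {
        "f1": 2 * precision * sensitivity / denom if denom else 0.0,
        "g_mean": sqrt(sensitivity * specificity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Made-up labels: 1 = fatal (minority), 0 = non-fatal (majority).
labels = [1] * 10 + [0] * 90
for scheme in ("IB", "US", "OS", "CS"):
    counts = Counter(labels[i] for i in resample(labels, scheme))
    print(scheme, dict(counts))

# A classifier that always predicts "non-fatal" is 90% accurate here,
# yet the imbalance-aware metrics expose that it finds no fatal cases.
print(imbalance_metrics(labels, [0] * len(labels)))
```

The last call shows why these metrics suit imbalanced data: an always-majority predictor scores 0 on F1 and G-mean and only 0.5 on balanced accuracy, despite its high plain accuracy.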

