Comparative Analysis of CNN Architectures for Thai Accent Classification: VFNet vs. Sequential CNN


Thiraphat Soebklin
Suradet Jitprapaikulsarn

Abstract

This research presents a comparative study of Convolutional Neural Network (CNN) architectures for Thai accent classification. It contrasts a parallel architecture based on VFNet [1], which applies filters of several sizes to the input simultaneously, with a sequential architecture that stacks different kernel sizes across successive layers (e.g., 3→5→7). The input features are Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the Thai Dialect Corpus [2]. Experimental results show that both model families achieve comparable accuracy and F1-scores. However, further analysis reveals that sequential models such as 5→5→5 and 7→5→3 attain lower parameter counts and cross-entropy loss than the VFNet-based parallel architecture. A detailed 2D receptive field (RF) analysis also indicates that architectures with moderate RF sizes tend to classify better than those with very small or excessively large RFs. These findings emphasize the practical advantages of well-structured sequential CNNs for real-world deployment under computational and memory constraints.
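To make the architectural contrast concrete, the following PyTorch sketch shows a VFNet-style parallel block (multi-size filters applied to the same MFCC map, outputs concatenated) against a sequential stack of one kernel size per layer, plus the standard stride-aware receptive-field recurrence used in RF analysis. This is a minimal illustration, not the authors' exact models; the channel widths, padding scheme, and input shape are assumptions.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """VFNet-style block: 3x3, 5x5, and 7x7 filters run simultaneously
    on the same input; branch outputs are concatenated channel-wise."""
    def __init__(self, in_ch=1, ch=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, ch, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])

    def forward(self, x):  # x: (batch, 1, n_mfcc, frames)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

class SequentialBlock(nn.Module):
    """Sequential alternative: one kernel size per layer, e.g. 3 -> 5 -> 7."""
    def __init__(self, in_ch=1, ch=16, kernels=(3, 5, 7)):
        super().__init__()
        layers, c = [], in_ch
        for k in kernels:
            layers += [nn.Conv2d(c, ch, kernel_size=k, padding=k // 2), nn.ReLU()]
            c = ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def receptive_field(kernels, strides=None):
    """RF of stacked convolutions: rf = 1 + sum_i (k_i - 1) * prod(s_1..s_{i-1})."""
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

x = torch.randn(8, 1, 13, 100)            # assumed 13-coefficient MFCC maps
print(ParallelBlock()(x).shape)           # torch.Size([8, 48, 13, 100])
print(SequentialBlock()(x).shape)         # torch.Size([8, 16, 13, 100])
print(receptive_field([3, 5, 7]))         # 13
```

Note that with unit strides the 3→5→7, 5→5→5, and 7→5→3 stacks all reach the same nominal RF of 13, so the differences the abstract reports must come from how pooling, stride, and channel-width choices interact with kernel ordering, which is exactly what a 2D RF analysis examines.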

Article Details

How to Cite
T. Soebklin and S. Jitprapaikulsarn, “Comparative Analysis of CNN Architectures for Thai Accent Classification: VFNet vs. Sequential CNN,” TEEJ, vol. 5, no. 3, pp. 22–28, Nov. 2025.
Section
Research article
Author Biography

Suradet Jitprapaikulsarn, Department of Electrical & Computer Engineering, Faculty of Engineering, Naresuan University, Thailand.

Dr. Suradet Jitprapaikulsarn is a lecturer in the Department of Electrical and Computer Engineering, Naresuan University. His areas of expertise include Mathematical Programming, Software Engineering, Cybersecurity, and Machine Learning.

References

[1] A. Ahmed, P. Tangri, A. Panda, D. Ramani, and S. Karmakar, “VFNet: A Convolutional Architecture for Accent Classification,” in Proc. IEEE 16th India Council Int. Conf. (INDICON), 2019, pp. 1–4.

[2] A. Suwanbandit, B. Naowarat, O. Sangpetch, and E. Chuangsuwanich, “Thai Dialect Corpus and Transfer-based Curriculum Learning for Dialect ASR,” in Proc. Interspeech 2023, Dublin, Ireland, Aug. 20–24, 2023, pp. 4069–4073, doi: 10.21437/Interspeech.2023-1828.

[3] K. J. Piczak, “Environmental sound classification with convolutional neural networks,” in Proc. IEEE Int. Workshop Machine Learning for Signal Processing (MLSP), 2015, pp. 1–6.

[4] Z. Ren, Q. Kong, K. Qian, M. D. Plumbley, and B. W. Schuller, “Attention-based convolutional neural networks for acoustic scene classification,” in Proc. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE2018), 2018, pp. 39–43.

[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.

[6] Stanford CS231n, “Convolutional Neural Networks for Visual Recognition.” [Online]. Available: http://cs231n.stanford.edu [Accessed: Jun. 19, 2025].