Comparative Analysis of CNN Architectures for Thai Accent Classification: VFNet vs. Sequential CNN


Thiraphat Soebklin
Suradet Jitprapaikulsarn

Abstract

This research presents a comparative study of Convolutional Neural Network (CNN) architectures for Thai accent classification. It contrasts a parallel architecture based on VFNet [1], which applies filters of several sizes to the input simultaneously, with a sequential architecture that stacks different kernel sizes across successive layers (e.g., 3→5→7). The input features are Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the Thai Dialect Corpus [2]. Experimental results show that both model families achieve comparable accuracy and F1-scores. However, further analysis reveals that sequential models such as 5→5→5 and 7→5→3 attain lower parameter counts and cross-entropy loss than the VFNet-based parallel architecture. A detailed 2D receptive field (RF) analysis also indicates that architectures with moderate RF sizes tend to classify better than those with very small or excessively large RFs. These findings emphasize the practical advantages of well-structured sequential CNNs for real-world deployment under computational and memory constraints.
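To make the architectural contrast concrete, the following PyTorch sketch shows a VFNet-style parallel block (multi-size filters applied to the same MFCC map, outputs concatenated) against a sequential stack of one kernel size per layer, plus the standard stride-aware receptive-field recurrence used in RF analysis. This is a minimal illustration, not the authors' exact models; the channel widths, padding scheme, and input shape are assumptions.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """VFNet-style block: 3x3, 5x5, and 7x7 filters run simultaneously
    on the same input; branch outputs are concatenated channel-wise."""
    def __init__(self, in_ch=1, ch=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, ch, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])

    def forward(self, x):  # x: (batch, 1, n_mfcc, frames)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

class SequentialBlock(nn.Module):
    """Sequential alternative: one kernel size per layer, e.g. 3 -> 5 -> 7."""
    def __init__(self, in_ch=1, ch=16, kernels=(3, 5, 7)):
        super().__init__()
        layers, c = [], in_ch
        for k in kernels:
            layers += [nn.Conv2d(c, ch, kernel_size=k, padding=k // 2), nn.ReLU()]
            c = ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def receptive_field(kernels, strides=None):
    """RF of stacked convolutions: rf = 1 + sum_i (k_i - 1) * prod(s_1..s_{i-1})."""
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

x = torch.randn(8, 1, 13, 100)            # assumed 13-coefficient MFCC maps
print(ParallelBlock()(x).shape)           # torch.Size([8, 48, 13, 100])
print(SequentialBlock()(x).shape)         # torch.Size([8, 16, 13, 100])
print(receptive_field([3, 5, 7]))         # 13
```

Note that with unit strides the 3→5→7, 5→5→5, and 7→5→3 stacks all reach the same nominal RF of 13, so the differences the abstract reports must come from how pooling, stride, and channel-width choices interact with kernel ordering, which is exactly what a 2D RF analysis examines.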

Article Details

How to Cite
T. Soebklin and S. Jitprapaikulsarn, “Comparative Analysis of CNN Architectures for Thai Accent Classification: VFNet vs. Sequential CNN,” TEEJ, vol. 5, no. 3, pp. 22–28, Nov. 2025.
Section
Research article
Author Biography

Suradet Jitprapaikulsarn, Department of Electrical & Computer Engineering, Faculty of Engineering, Naresuan University, Thailand.

Dr. Suradet Jitprapaikulsarn is a lecturer in the Department of Electrical and Computer Engineering, Naresuan University. His areas of expertise include Mathematical Programming, Software Engineering, Cybersecurity, and Machine Learning.

References

[1] A. Ahmed, P. Tangri, A. Panda, D. Ramani, and S. Karmakar, “VFNet: A Convolutional Architecture for Accent Classification,” in Proc. IEEE 16th India Council Int. Conf. (INDICON), 2019, pp. 1–4.

[2] A. Suwanbandit, B. Naowarat, O. Sangpetch, and E. Chuangsuwanich, “Thai Dialect Corpus and Transfer-based Curriculum Learning for Dialect ASR,” in Proc. Interspeech 2023, Dublin, Ireland, Aug. 20–24, 2023, pp. 4069–4073, doi: 10.21437/Interspeech.2023-1828.

[3] K. J. Piczak, “Environmental sound classification with convolutional neural networks,” in Proc. IEEE Int. Workshop Machine Learning for Signal Processing (MLSP), 2015, pp. 1–6.

[4] Z. Ren, Q. Kong, K. Qian, M. D. Plumbley, and B. W. Schuller, “Attention-based convolutional neural networks for acoustic scene classification,” in Proc. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE2018), 2018, pp. 39–43.

[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.

[6] Stanford CS231n, “Convolutional Neural Networks for Visual Recognition.” [Online]. Available: http://cs231n.stanford.edu [Accessed: Jun. 19, 2025].