A comparative study of clustering techniques for non-segmented language documents

Authors

  • Todsanai Chumwatana, College of Information and Communication Technology, Rangsit University, Patumthani 12000, Thailand

Keywords:

document clustering, k-means, non-segmented languages, self-organizing map

Abstract

Document clustering has become an important area of study due to the rapid increase in the number of electronic documents. It can be used to group and categorize documents and to provide a useful summary of the categories for browsing. Many clustering techniques have been developed for documents in both segmented languages, such as English, and non-segmented languages, such as some Asian languages. However, document clustering can be complicated for many Asian languages such as Chinese, Japanese, Korean and Thai, because these languages are written without explicit word boundary delimiters such as white space. The aim of this paper is to provide a comprehensive comparative study of non-segmented document clustering using the self-organizing map (SOM) and k-means, two classic and well-known methods in text clustering. To illustrate the two methods, the paper presents experimental and comparative studies on clustering non-segmented documents with SOM and k-means. Keyword extraction is first applied to find keywords and their numbers of occurrences. These occurrence counts are then used as input to the clustering process. The experimental results show that the k-means technique is simple and has low computational cost. SOM is relatively complex, but its clustering results are more visual and easier to comprehend. Consequently, k-means has become a well-known text clustering method used in many fields due to its straightforwardness, while SOM performs well at detecting noisy documents, making it more suitable for applications such as navigation of document collections and multi-document summarization.
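The two-stage pipeline described in the abstract (extract features and count their occurrences, then cluster the counts) can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: overlapping character bigrams stand in for the paper's keyword extraction step, the vectors are length-normalized, and a plain k-means loop with caller-chosen initial centroids is used. The function names (`char_ngrams`, `vectorize`, `kmeans`) and the four sample sentences are hypothetical and serve only as illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Count overlapping character n-grams; no word segmentation is needed."""
    text = "".join(text.split())  # drop any incidental white space
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def vectorize(docs, n=2):
    """Turn each document into a length-normalized n-gram count vector."""
    counts = [char_ngrams(d, n) for d in docs]
    vocab = sorted(set().union(*counts))  # shared feature space
    vectors = []
    for c in counts:
        v = [float(c[g]) for g in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, iters=20):
    """Plain k-means; initial centroids are the documents at init_indices."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample documents: two Thai and two Chinese sentences.
docs = [
    "ภาษาไทยไม่มีช่องว่าง",
    "ภาษาไทยเขียนติดกัน",
    "中文文本没有空格",
    "中文文本需要分词",
]
labels = kmeans(vectorize(docs), init_indices=[0, 2])
```

In this toy setting the two Thai sentences share one label and the two Chinese sentences share the other, because documents in different scripts have no bigrams in common. A SOM would consume the same count vectors but map them onto a grid of units, which is what gives it the visual quality the abstract mentions.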

Published

2023-02-18

How to Cite

Todsanai Chumwatana. (2023). A comparative study of clustering techniques for non-segmented language documents. Journal of Current Science and Technology, 7(1), 11–22. Retrieved from https://ph04.tci-thaijo.org/index.php/JCST/article/view/520

Section

Research Article