Text-to-face Synthesis for Criminal Sketch

Authors

  • Mathee Prasertkijaphan, Graduate School of Applied Statistics, National Institute of Development Administration, Bangkok, Thailand
  • Peerapat Tungpaiboon, Graduate School of Applied Statistics, National Institute of Development Administration, Bangkok, Thailand
  • Taitip Suphasiriwattana, Graduate School of Applied Statistics, National Institute of Development Administration, Bangkok, Thailand
  • Duangthida Sae-Tae, Graduate School of Applied Statistics, National Institute of Development Administration, Bangkok, Thailand
  • Wichit Chamnannawa, Graduate School of Applied Statistics, National Institute of Development Administration, Bangkok, Thailand
  • Nattaphon Assavahem, Graduate School of Applied Statistics, National Institute of Development Administration, Bangkok, Thailand
  • Thitirat Siriborvornratanakul, Graduate School of Applied Statistics, National Institute of Development Administration, Bangkok, Thailand

Keywords:

Deep Learning, Text-to-face Synthesis, StyleGAN3, Contrastive Language-Image Pre-training

Abstract

Background and Objectives: Transforming text into images is gaining widespread popularity because it offers benefits in many fields, including investigations that require sketching suspects and searches for missing persons. However, most research on text-to-image synthesis focuses on simple or common objects such as flowers or birds. Studies on human faces remain relatively rare, partly because facial databases are incomplete: databases such as Labeled Faces in the Wild and MegaFace contain only images, without descriptive text. The present research aimed to develop an automatic suspect face sketching system that uses only textual descriptions of physical characteristics as input.

Methodology: The present research used the CelebA-HQ database, which contains both images and textual descriptions. A BERT sentence encoder and the CLIP text encoder were employed to encode the text, tokenizing each description and converting it into a numerical embedding. These embeddings were tested on four models: StyleGAN3+CLIP, StyleGAN3+CLIP (Fine-Tuning), DC+CLIP (Fine-Tuning), and DF+CLIP (Fine-Tuning).
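
To illustrate the text-encoding step, the sketch below converts a description into a CLIP text embedding. It assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the paper's fine-tuned CLIP and BERT sentence encoder would be substituted in practice.

    import torch
    from transformers import CLIPTokenizer, CLIPTextModelWithProjection

    # Public CLIP checkpoint (an assumption; the study fine-tunes CLIP on CelebA-HQ).
    checkpoint = "openai/clip-vit-base-patch32"
    tokenizer = CLIPTokenizer.from_pretrained(checkpoint)
    text_encoder = CLIPTextModelWithProjection.from_pretrained(checkpoint).eval()

    # A hypothetical witness description of the kind found in CelebA-HQ captions.
    description = "A young woman with wavy blond hair, arched eyebrows, and a pointy nose."
    tokens = tokenizer(description, padding=True, truncation=True, return_tensors="pt")

    # Tokens -> 512-dimensional embedding used to condition the generator.
    with torch.no_grad():
        embedding = text_encoder(**tokens).text_embeds
    print(embedding.shape)  # torch.Size([1, 512])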

Main Results: In an experimental comparison of the four models using the Fréchet Inception Distance (FID), the model using StyleGAN3 as the generator performed best at producing high-quality images that matched the descriptions. The best results were obtained when StyleGAN3 was paired with a CLIP model that had been fine-tuned on the CelebA-HQ dataset.
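
For reference, FID (Heusel et al., 2017) measures the distance between two Gaussians fitted to Inception-v3 features of real and generated images; lower is better. Below is a minimal NumPy/SciPy sketch of the closed-form distance, assuming the feature means and covariances have already been extracted.

    import numpy as np
    from scipy import linalg

    def frechet_inception_distance(mu_r, sigma_r, mu_g, sigma_g):
        """||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2*(sigma_r sigma_g)^(1/2))."""
        diff = mu_r - mu_g
        covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
        if np.iscomplexobj(covmean):   # discard tiny imaginary parts
            covmean = covmean.real     # introduced by numerical error
        return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

    # Toy usage with random statistics; real code extracts them with Inception-v3.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(100, 8))
    mu, sigma = feats.mean(axis=0), np.cov(feats, rowvar=False)
    print(frechet_inception_distance(mu, sigma, mu, sigma))  # ~0 for identical stats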

Conclusions: To facilitate and improve the efficiency of sketching suspect faces during investigations, the present research developed a system that generates suspect face sketches based solely on textual descriptions of a suspect's physical features. The experimental results, obtained by combining Generative Adversarial Networks (GANs) with the CLIP model, showed that the combination of StyleGAN3 and CLIP produced the highest-quality face sketches that most accurately matched the descriptions.
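
As a rough illustration of how a GAN generator and CLIP can be combined, the sketch below optimizes a latent vector so that the generated image matches a description, in the spirit of StyleCLIP (Patashnik et al., 2021). DummyGenerator is a hypothetical stand-in for a pretrained StyleGAN3 synthesis network, which would be loaded separately in practice; this is not the authors' exact pipeline.

    import torch
    import torch.nn.functional as F
    from transformers import CLIPModel, CLIPTokenizer

    class DummyGenerator(torch.nn.Module):
        """Placeholder mapping a 512-d latent to a 64x64 RGB image in [-1, 1]."""
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(512, 3 * 64 * 64)

        def forward(self, z):
            return torch.tanh(self.fc(z)).view(-1, 3, 64, 64)

    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    clip_model.requires_grad_(False)   # CLIP stays frozen; only the latent moves
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    generator = DummyGenerator()

    tokens = tokenizer(["a man with a beard and bushy eyebrows"], return_tensors="pt")
    with torch.no_grad():
        text_features = clip_model.get_text_features(**tokens)

    # CLIP's image normalization constants.
    mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
    std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

    latent = torch.randn(1, 512, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=0.05)

    for step in range(200):
        image = generator(latent)                                  # in [-1, 1]
        image = F.interpolate(image, size=224, mode="bilinear",
                              align_corners=False)                 # CLIP input size
        image = ((image + 1) / 2 - mean) / std                     # CLIP-normalize
        image_features = clip_model.get_image_features(pixel_values=image)
        # Move the latent toward higher image-text cosine similarity.
        loss = 1 - F.cosine_similarity(image_features, text_features).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()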

Practical Application: The present research can be developed further into a system that generates suspect face sketches from textual descriptions to aid investigations. It can also be applied in similar contexts, such as designing new model faces for advertising media.

References

Jalan, H.J., Maurya, G., Corda, C., Dsouza, S. and Panchal, D., 2020, “Suspect Face Generation,” Proceedings of the 2020 3rd International Conference on Communication System, Computing and IT Applications (CSCITA), 3-4 April 2020, Mumbai, India, pp. 73-78.

Kotian, C., Lokhande, S., Jain, M. and Pavate, A., 2020, “D2F: Description to Face Synthesis Using GAN,” Proceedings of the International Conference on Recent Advances in Computational Techniques (IC-RACT), 27-28 March 2020, Mumbai, India, 8 p.

Nair, K.R., Sam, S.S., Praveena, K.P., Juju, K. and Cherian, S., 2021, “Transfer Learning with Deep Convolutional Neural Networks in Forensic Face Sketch Recognition,” Proceedings of the International Conference on IoT Based Control Networks & Intelligent Systems (ICICNIS 2021), 28-29 June 2021, Kerala, India, pp. 1-5.

Xu, J., Xue, X., Wu, Y. and Mao, X., 2020, “Matching a Composite Sketch to a Photographed Face using Fused HOG and Deep Feature Models,” The Visual Computer, 37 (4), pp. 765-776.

Wadhawan, R., Drall, T., Singh, S. and Chakraverty, S., 2020, “Multi-Attributed and Structured Text-to-Face Synthesis,” IEEE International Conference on Technology, Engineering, Management for Societal Impact using Marketing, Entrepreneurship and Talent (TEMSMET), 10 December 2020, Bengaluru, India, pp. 1-7.

Sun, J., Li, Q., Wang, W., Zhao, J. and Sun, Z., 2021, “Multi-caption Text-to-Face Synthesis: Dataset and Algorithm,” Proceedings of the 29th ACM International Conference on Multimedia, Association for Computing Machinery, 20-24 October 2021, New York, USA, pp. 2290–2298.

Oza, M., Chanda, S. and Doermann, D., 2022, “Semantic Text-to-Face GAN-ST2F,” arXiv preprint arXiv:2107.10756. https://doi.org/10.48550/arXiv.2107.10756

Wang, T., Zhang, T. and Lovell, B., 2021, “Faces à la Carte: Text-to-Face Generation via Attribute Disentanglement,” Winter Conference on Applications of Computer Vision (WACV), 5-9 January 2021, Virtually, pp. 3380-3388.

Sun, J., Deng, Q., Li, Q., Sun, M., Ren, M. and Sun, Z., 2022, “AnyFace: Free-style Text-to-Face Synthesis and Manipulation,” arXiv preprint arXiv:2203.15334v1. https://doi.org/10.48550/arXiv.2203.15334

Li, Z., Min, M.R., Li, K. and Xu, C., 2022, “StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis,” Conference on Computer Vision and Pattern Recognition (CVPR), 19-24 June 2022, New Orleans, Louisiana, USA, pp. 18197-18207.

Hermosilla, G., Tapia, D.H., Allende-Cid, H., Castro, G.F. and Vera, E., 2021, “Thermal Face Generation Using StyleGAN,” IEEE Access, 9, pp. 80511-80523.

Tao, M., Tang, H., Wu, F., Jing, X., Bao, B. and Xu, C., 2022, “DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis,” Conference on Computer Vision and Pattern Recognition (CVPR), 19-24 June 2022, New Orleans, Louisiana, USA, pp. 16515-16525.

Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D. and Lischinski, D., 2021, “StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery,” Proceedings of the International Conference on Computer Vision (ICCV), 11-17 October 2021, Virtually, pp. 2085-2094.

Deorukhkar, K., Kadamala, K. and Menezes, E., 2022, “FGTD: Face Generation from Textual Description,” Inventive Communication and Computational Technologies, Lecture Notes in Networks and Systems, 311, pp. 547-562.

Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J. and Aila, T., 2021, “Alias-Free Generative Adversarial Networks,” Conference on Neural Information Processing Systems (NeurIPS 2021), 6-14 December 2021, Virtually, 12 p.

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G. and Sutskever, I., 2021, “Learning Transferable Visual Models From Natural Language Supervision,” Proceedings of the 38th International Conference on Machine Learning (PMLR), 18-24 July 2021, Virtually, 16 p.

Devlin, J., Chang, M., Lee, K. and Toutanova, K., 2019, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2-7 June 2019, Minneapolis, Minnesota, USA, pp. 4171-4186.

Cui, J., Jin, L., Kuang, H., Xu, Q. and Schwertfeger, S., 2021, “Underwater Depth Estimation for Spherical Images,” Journal of Robotics, 2021, 12 p.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. and Hochreiter, S., 2017, “GANs Trained by a Two Time-scale Update Rule Converge to a Local Nash Equilibrium,” Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), 4-9 December 2017, Long Beach, CA, USA, pp. 6629–6640.

Radford, A., Metz, L. and Chintala, S., 2016, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” International Conference on Learning Representations (ICLR), 2-4 May 2016, San Juan, Puerto Rico.

Published

2024-06-30

How to Cite

Prasertkijaphan, M., Tungpaiboon, P., Suphasiriwattana, T., Sae-Tae, D., Chamnannawa, W., Assavahem, N., & Siriborvornratanakul, T. (2024). Text-to-face Synthesis for Criminal Sketch. Science and Engineering Connect, 47(2), 99–117. Retrieved from https://ph04.tci-thaijo.org/index.php/SEC/article/view/7725

Section

Research Article