Text-to-face Synthesis for Criminal Sketch
Keywords:
Deep Learning, Text-to-face Synthesis, StyleGAN3, Contrastive Language-Image Pre-training

Abstract
Background and Objectives: Text-to-image synthesis is gaining widespread attention because it offers benefits in many fields, including criminal investigations that require sketches of suspects and searches for missing persons. However, most text-to-image research focuses on simple or common objects such as flowers and birds. Studies on human faces remain relatively rare, partly because facial databases are incomplete: databases such as Labeled Faces in the Wild and MegaFace contain only images, without descriptive text. The present research aimed to develop an automatic suspect face sketching system that uses only textual descriptions of physical characteristics as input.
Methodology: The present research used the CelebA-HQ database, which contains both images and textual descriptions. A BERT sentence encoder and the CLIP text encoder were employed to encode the text, tokenizing each description and converting it into numerical vectors. These vectors were then evaluated with four models: StyleGAN3+CLIP, StyleGAN3+CLIP (fine-tuned), DC+CLIP (fine-tuned), and DF+CLIP (fine-tuned).
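As a concrete illustration of the text-encoding step, the minimal sketch below converts a hypothetical physical description into a CLIP text embedding using the Hugging Face transformers library. The checkpoint name, the example sentence, and the omission of the BERT branch are assumptions for illustration only; the abstract does not specify the exact encoders or fine-tuned weights used.

```python
# Hedged sketch: encoding a suspect description into a numerical vector with a
# CLIP text encoder. The checkpoint below is a public one and may differ from
# the (fine-tuned) encoder used in the paper.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical description of physical characteristics.
description = "The man has an oval face, bushy eyebrows, a beard, and wavy black hair."

# Tokenize the description (text -> vocabulary indices) ...
inputs = tokenizer(description, padding=True, truncation=True, return_tensors="pt")

# ... and encode it into a fixed-length embedding (tokens -> numerical vector).
with torch.no_grad():
    text_embedding = text_encoder(**inputs).text_embeds  # shape: (1, 512)

print(text_embedding.shape)
```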
Main Results: An experimental comparison of the four models using the Fréchet Inception Distance (FID) showed that the models using StyleGAN3 as the generator performed best, producing high-quality images that matched the descriptions. The best results were obtained when StyleGAN3 was paired with a CLIP model that had been further fine-tuned on the CelebA-HQ dataset.
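For reference, FID compares the Gaussian statistics of Inception-v3 features extracted from real and generated images. The sketch below shows only the standard distance formula; the feature-extraction step (computing the mean vector and covariance matrix of the activations for each image set) is assumed to be done elsewhere.

```python
# Hedged sketch: Fréchet Inception Distance (FID) between real and generated
# face images, computed from Inception feature statistics.
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_real, sigma_real, mu_gen, sigma_gen):
    """FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^(1/2))."""
    diff = mu_real - mu_gen
    covmean = sqrtm(sigma_real @ sigma_gen)
    # Numerical error can introduce a tiny imaginary component; discard it.
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_real + sigma_gen - 2.0 * covmean))

# Usage: mu_* and sigma_* are the mean vector and covariance matrix of
# Inception features (e.g., 2048-dimensional activations) for each image set.
```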
Conclusions: To make suspect face sketching in investigations faster and more efficient, the present research developed a system that generates suspect face sketches solely from textual descriptions of the suspect's physical features. Experiments combining Generative Adversarial Networks (GANs) with the CLIP model showed that the combination of StyleGAN3 and CLIP produced the highest-quality and most accurate face sketches from the descriptions.
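The abstract does not describe how StyleGAN3 and CLIP are coupled; a common approach (used, for example, by StyleCLIP) is to optimize a StyleGAN latent code under a CLIP similarity loss. The sketch below illustrates that general idea only; `generator`, `clip_model`, and `clip_preprocess_fn` are assumed, pre-loaded components rather than the paper's actual implementation.

```python
# Hedged sketch of CLIP-guided latent optimization (in the spirit of StyleCLIP):
# a StyleGAN3 latent code is optimized so that the generated face matches the
# text embedding under CLIP. All model objects are assumed to be pre-loaded.
import torch
import torch.nn.functional as F

def synthesize_from_text(generator, clip_model, clip_preprocess_fn, text_tokens,
                         steps=200, lr=0.05, device="cuda"):
    # Encode the description once and freeze it.
    text_features = F.normalize(clip_model.encode_text(text_tokens.to(device)), dim=-1).detach()

    # Start from a random latent code and optimize it directly.
    latent = torch.randn(1, generator.z_dim, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=lr)

    for _ in range(steps):
        image = generator(latent, None)                 # (1, 3, H, W) in [-1, 1]
        clip_input = clip_preprocess_fn(image)          # resize/normalize for CLIP
        image_features = F.normalize(clip_model.encode_image(clip_input), dim=-1)

        loss = 1.0 - (image_features * text_features).sum()  # cosine-distance loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return generator(latent, None).detach()
```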
Practical Application: The present research can be further developed into a system that generates suspect face sketches from textual descriptions to aid in investigations. It can also be applied in similar contexts, such as designing new model faces for advertising media.