Ara-RATGAN for Arabic Text to Image Synthesis

Document Type : Original Article

Authors

1 146 Madinat Nasr, Madinat Al-Tawfeq, Cairo, Egypt

2 Ain Shams University

3 Prof., Scientific Computing Department, Faculty of Computers and Information Sciences, Ain Shams University, Cairo, Egypt

Abstract

Current text-to-image systems have demonstrated outstanding performance in tasks requiring the automated synthesis of realistic images from text descriptions. Previous approaches typically employ multiple separate fusion blocks to adaptively fuse text information into the generation process, which increases the difficulty of training because the isolated blocks may conflict with one another. To address these concerns, we present Arabic Recurrent Affine Transformation (Ara-RATGAN), a novel framework that integrates AraBERT -- a pretrained Arabic BERT trained on billions of Arabic words to produce robust Arabic sentence embeddings -- with Recurrent Affine Transformation (RAT) to generate high-quality images from Arabic-language text descriptions. Furthermore, a spatial attention model is used in the discriminator to promote semantic coherence between text and synthesized images: it identifies the image regions that correspond to the text and directs the generator to produce visual content that better matches the Arabic descriptions. We conducted extensive experiments on an Arabic CUB dataset translated from English to Arabic, which show the superior performance of our proposed model compared to previous Arabic text-to-image models. Our approach addresses two key challenges: (1) Text-Image Fusion: unlike traditional methods that use isolated fusion blocks, we employ RAT to model long-term dependencies across layers, ensuring globally consistent text conditioning. (2) Semantic Alignment: a spatial attention mechanism in the discriminator enhances the semantic coherence between the synthesized images and the Arabic text.
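To make the fusion mechanism concrete, below is a minimal sketch of a RAT-style fusion block in PyTorch. The module names, the hidden size, and the use of an LSTMCell to carry conditioning state between generator stages are illustrative assumptions, not the authors' exact implementation; the sketch only shows the core idea that affine parameters at each layer are driven by a shared recurrent state rather than by isolated per-block networks.

import torch
import torch.nn as nn

class RATBlock(nn.Module):
    """Illustrative channel-wise affine fusion block whose scale and shift
    are predicted from a recurrent hidden state, so text conditioning is
    shared across generator layers instead of computed in isolation."""
    def __init__(self, text_dim: int, hidden_dim: int, num_channels: int):
        super().__init__()
        # Recurrent cell links this fusion block to the previous one.
        self.rnn = nn.LSTMCell(text_dim, hidden_dim)
        # Hidden state predicts per-channel scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(hidden_dim, num_channels)
        self.to_beta = nn.Linear(hidden_dim, num_channels)

    def forward(self, feat, text_emb, state=None):
        # feat: (B, C, H, W) image features; text_emb: (B, text_dim)
        # sentence embedding (e.g., from AraBERT); state: previous (h, c).
        h, c = self.rnn(text_emb, state)                # update shared state
        gamma = self.to_gamma(h).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(h).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        return feat * (1 + gamma) + beta, (h, c)        # affine modulation

In use, several such blocks would be chained through the generator, with each block receiving the (h, c) state emitted by the previous one, so the recurrent connection, rather than independent fusion networks, keeps the text conditioning globally consistent across layers.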

Keywords