## Overview
AraCLIP extends the CLIP architecture to Arabic image retrieval tasks. It uses knowledge distillation
to transfer cross-modal knowledge from English to Arabic, improving its ability to understand Arabic text
and retrieve relevant images.
The model outperforms state-of-the-art multilingual models by approximately 10% across
evaluation metrics including Recall@1, Recall@5, and mean reciprocal rank (MRR).
## Technical Approach
### Architecture
AraCLIP uses a dual-encoder architecture with:
- Text Encoder: an AraBERT-based model fine-tuned for Arabic text understanding
- Image Encoder: a Vision Transformer (ViT) pre-trained on ImageNet
- Projection Heads: linear layers that map both encoders' outputs into a shared embedding space
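The projection-head idea above can be sketched in a few lines. This is not the paper's implementation; it is a minimal NumPy sketch with random stand-ins for the AraBERT and ViT features, and the dimensions (768, 1024, 512) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dims: encoder outputs and the shared embedding space
TEXT_DIM, IMAGE_DIM, EMBED_DIM = 768, 1024, 512

# Projection heads: one linear map per modality into the shared space
W_text = rng.normal(scale=0.02, size=(TEXT_DIM, EMBED_DIM))
W_image = rng.normal(scale=0.02, size=(IMAGE_DIM, EMBED_DIM))

def project(features, W):
    """Project encoder features into the shared space and L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Stand-ins for AraBERT text features and ViT image features
text_feats = rng.normal(size=(4, TEXT_DIM))    # batch of 4 captions
image_feats = rng.normal(size=(4, IMAGE_DIM))  # batch of 4 images

text_emb = project(text_feats, W_text)
image_emb = project(image_feats, W_image)

# Cosine similarity matrix: entry (i, j) scores caption i against image j
similarity = text_emb @ image_emb.T
print(similarity.shape)  # (4, 4)
```

Because both embeddings are L2-normalized, the dot product is cosine similarity, which is what retrieval ranks on.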
### Training Strategy
The model is trained using a two-stage approach:
- Stage 1: Knowledge distillation from English CLIP using parallel Arabic-English datasets
- Stage 2: Fine-tuning on Arabic-specific image-caption pairs with contrastive learning
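The two training objectives can be sketched as loss functions. This is an assumption-laden NumPy sketch, not the paper's code: Stage 1 is shown as a simple MSE distillation loss on parallel captions, and Stage 2 as a symmetric CLIP-style InfoNCE contrastive loss:

```python
import numpy as np

def distillation_loss(student_emb, teacher_emb):
    """Stage 1 (sketch): pull the Arabic student's embedding of a caption
    toward the frozen English CLIP teacher's embedding of its parallel
    English translation."""
    return np.mean((student_emb - teacher_emb) ** 2)

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Stage 2 (sketch): symmetric InfoNCE over a batch of matched
    (caption, image) pairs; pair i is the positive on the diagonal."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature
    n = logits.shape[0]

    def nll(l):
        # Numerically stable log-softmax; positives sit on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (nll(logits) + nll(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 512))
print(distillation_loss(emb, emb))  # 0.0 when student matches teacher exactly
```

The distillation loss goes to zero as the Arabic encoder reproduces the teacher's embeddings; the contrastive loss then specializes the shared space on Arabic image-caption pairs.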
```python
# Example usage; the image path is illustrative
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("Arabic-Clip/AraCLIP")
processor = CLIPProcessor.from_pretrained("Arabic-Clip/AraCLIP")

# Arabic text query: "A cat sitting on a table"
text = "قطة جالسة على طاولة"
image = Image.open("cat.jpg")  # any local image file

inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image)  # image-text similarity score
```
## Results
AraCLIP was evaluated on multiple Arabic image-text retrieval benchmarks:
| Model           | Recall@1 | Recall@5 | MRR   |
|-----------------|----------|----------|-------|
| mCLIP           | 45.2%    | 68.3%    | 0.542 |
| AltCLIP         | 48.7%    | 71.5%    | 0.567 |
| AraCLIP (Ours)  | 55.3%    | 78.9%    | 0.631 |
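The metrics in the table can be computed from a query-image similarity matrix. A minimal NumPy sketch, assuming the ground-truth image for caption *i* is image *i* (the standard retrieval-benchmark convention):

```python
import numpy as np

def retrieval_metrics(similarity, ks=(1, 5)):
    """Compute Recall@K and MRR for text-to-image retrieval.

    similarity[i, j] scores query caption i against image j; the
    ground-truth match for caption i is image i (the diagonal).
    """
    n = similarity.shape[0]
    # Rank images per query from most to least similar; rank 1 = best
    order = np.argsort(-similarity, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    metrics = {f"recall@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["mrr"] = float(np.mean(1.0 / ranks))
    return metrics

# Toy similarity matrix where each caption scores its own image highest
sim = np.eye(3) + 0.1
print(retrieval_metrics(sim))  # recall@1 = 1.0, mrr = 1.0
```

Recall@K is the fraction of queries whose correct image appears in the top K results; MRR averages the reciprocal rank of the correct image over all queries.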
## Citation
If you use AraCLIP in your research, please cite:
```bibtex
@inproceedings{albarham2024araclip,
  title={AraCLIP: Cross-Lingual Arabic Image Retrieval},
  author={Albarham, Mohammad and Others},
  booktitle={Proceedings of ArabicNLP 2024},
  year={2024},
  url={https://aclanthology.org/2024.arabicnlp-1.9/}
}
```