## Overview
AraCLIP extends the CLIP architecture to Arabic image retrieval tasks. It uses knowledge distillation
to transfer cross-modal knowledge from English to Arabic, improving its ability to understand Arabic text
and retrieve relevant images.
The model outperforms state-of-the-art multilingual models by approximately 10% across
evaluation metrics including Recall@1, Recall@5, and mean reciprocal rank (MRR).
## Technical Approach
### Architecture
AraCLIP uses a dual-encoder architecture with:
- Text Encoder: an AraBERT-based model fine-tuned for Arabic text understanding
- Image Encoder: a Vision Transformer (ViT) pre-trained on ImageNet
- Projection Heads: linear layers that map both encoders' outputs into a shared embedding space
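The projection-head idea above can be sketched in a few lines. This is not the paper's implementation; it is a minimal NumPy sketch with random stand-ins for the AraBERT and ViT features, and the dimensions (768, 1024, 512) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dims: encoder outputs and the shared embedding space
TEXT_DIM, IMAGE_DIM, EMBED_DIM = 768, 1024, 512

# Projection heads: one linear map per modality into the shared space
W_text = rng.normal(scale=0.02, size=(TEXT_DIM, EMBED_DIM))
W_image = rng.normal(scale=0.02, size=(IMAGE_DIM, EMBED_DIM))

def project(features, W):
    """Project encoder features into the shared space and L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Stand-ins for AraBERT text features and ViT image features
text_feats = rng.normal(size=(4, TEXT_DIM))    # batch of 4 captions
image_feats = rng.normal(size=(4, IMAGE_DIM))  # batch of 4 images

text_emb = project(text_feats, W_text)
image_emb = project(image_feats, W_image)

# Cosine similarity matrix: entry (i, j) scores caption i against image j
similarity = text_emb @ image_emb.T
print(similarity.shape)  # (4, 4)
```

Because both embeddings are L2-normalized, the dot product is cosine similarity, which is what retrieval ranks on.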
### Training Strategy
The model is trained using a two-stage approach:
- Stage 1: Knowledge distillation from English CLIP using parallel Arabic-English datasets
- Stage 2: Fine-tuning on Arabic-specific image-caption pairs with contrastive learning
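The two training objectives can be sketched as loss functions. This is an assumption-laden NumPy sketch, not the paper's code: Stage 1 is shown as a simple MSE distillation loss on parallel captions, and Stage 2 as a symmetric CLIP-style InfoNCE contrastive loss:

```python
import numpy as np

def distillation_loss(student_emb, teacher_emb):
    """Stage 1 (sketch): pull the Arabic student's embedding of a caption
    toward the frozen English CLIP teacher's embedding of its parallel
    English translation."""
    return np.mean((student_emb - teacher_emb) ** 2)

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Stage 2 (sketch): symmetric InfoNCE over a batch of matched
    (caption, image) pairs; pair i is the positive on the diagonal."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature
    n = logits.shape[0]

    def nll(l):
        # Numerically stable log-softmax; positives sit on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (nll(logits) + nll(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 512))
print(distillation_loss(emb, emb))  # 0.0 when student matches teacher exactly
```

The distillation loss goes to zero as the Arabic encoder reproduces the teacher's embeddings; the contrastive loss then specializes the shared space on Arabic image-caption pairs.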
```python
# Example usage; the image path is illustrative
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("Arabic-Clip/AraCLIP")
processor = CLIPProcessor.from_pretrained("Arabic-Clip/AraCLIP")

# Arabic text query: "A cat sitting on a table"
text = "قطة جالسة على طاولة"
image = Image.open("cat.jpg")  # any local image file

inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image)  # image-text similarity score
```
## Results
AraCLIP was evaluated on multiple Arabic image-text retrieval benchmarks:
| Model           | Recall@1 | Recall@5 | MRR   |
|-----------------|----------|----------|-------|
| mCLIP           | 45.2%    | 68.3%    | 0.542 |
| AltCLIP         | 48.7%    | 71.5%    | 0.567 |
| AraCLIP (Ours)  | 55.3%    | 78.9%    | 0.631 |
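The metrics in the table can be computed from a query-image similarity matrix. A minimal NumPy sketch, assuming the ground-truth image for caption *i* is image *i* (the standard retrieval-benchmark convention):

```python
import numpy as np

def retrieval_metrics(similarity, ks=(1, 5)):
    """Compute Recall@K and MRR for text-to-image retrieval.

    similarity[i, j] scores query caption i against image j; the
    ground-truth match for caption i is image i (the diagonal).
    """
    n = similarity.shape[0]
    # Rank images per query from most to least similar; rank 1 = best
    order = np.argsort(-similarity, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    metrics = {f"recall@{k}": float(np.mean(ranks <= k)) for k in ks}
    metrics["mrr"] = float(np.mean(1.0 / ranks))
    return metrics

# Toy similarity matrix where each caption scores its own image highest
sim = np.eye(3) + 0.1
print(retrieval_metrics(sim))  # recall@1 = 1.0, mrr = 1.0
```

Recall@K is the fraction of queries whose correct image appears in the top K results; MRR averages the reciprocal rank of the correct image over all queries.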
## Citation
If you use AraCLIP in your research, please cite:
```bibtex
@inproceedings{albarham2024araclip,
  title={AraCLIP: Cross-Lingual Arabic Image Retrieval},
  author={Albarham, Mohammad and Others},
  booktitle={Proceedings of ArabicNLP 2024},
  year={2024},
  url={https://aclanthology.org/2024.arabicnlp-1.9/}
}
```