AraCLIP: Arabic Image Retrieval

Cross-lingual learning model for effective Arabic text-to-image retrieval using knowledge distillation from CLIP

AraCLIP Architecture
Year: 2024
Tech: PyTorch, CLIP, Transformers
Category: Computer Vision, NLP, Arabic

Overview

AraCLIP extends the CLIP architecture for Arabic image retrieval tasks. It leverages Knowledge Distillation to transfer cross-modal knowledge from English to Arabic, enhancing its ability to understand Arabic text and retrieve relevant images.

The model outperforms state-of-the-art multilingual models by approximately 10% across evaluation metrics including Recall@1, Recall@5, and mean reciprocal rank (MRR).
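These retrieval metrics have simple definitions. A minimal plain-Python sketch — the `ranks` values below are hypothetical illustrations, not AraCLIP results:

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose correct image appears in the top-k results."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank of the first correct result over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical 1-indexed ranks of the correct image for five queries
ranks = [1, 3, 2, 1, 6]
print(recall_at_k(ranks, 1))        # 0.4
print(recall_at_k(ranks, 5))        # 0.8
print(mean_reciprocal_rank(ranks))  # (1 + 1/3 + 1/2 + 1 + 1/6) / 5 = 0.6
```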

Key Features

Knowledge Distillation

Transfers cross-modal knowledge from English CLIP to Arabic, preserving semantic relationships.

Arabic-First Design

Specifically optimized for Arabic text understanding, not just multilingual adaptation.

Superior Performance

Approximately 10% improvement over state-of-the-art multilingual vision-language models.

Open Source

Model weights and training code available on HuggingFace and GitHub.

Technical Approach

Architecture

AraCLIP uses a dual-encoder architecture with:

  • Text Encoder: AraBERT-based model fine-tuned for Arabic text understanding
  • Image Encoder: Vision Transformer (ViT) pre-trained on ImageNet
  • Projection Heads: Linear layers mapping both encoders to shared embedding space

Training Strategy

The model is trained using a two-stage approach:

  1. Stage 1: Knowledge distillation from English CLIP using parallel Arabic-English datasets
  2. Stage 2: Fine-tuning on Arabic-specific image-caption pairs with contrastive learning
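The two stages correspond to two loss functions. A minimal numpy sketch, under the assumption that Stage 1 uses a mean-squared distillation loss against the frozen English teacher's embeddings and Stage 2 uses a symmetric InfoNCE contrastive loss — the paper's exact loss formulations may differ:

```python
import numpy as np

def distillation_loss(student_emb, teacher_emb):
    """Stage 1: pull the Arabic student's embedding of an Arabic caption
    toward the English teacher's embedding of the parallel English caption."""
    return np.mean((student_emb - teacher_emb) ** 2)

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Stage 2: symmetric InfoNCE over a batch of matched (text, image) pairs."""
    logits = text_emb @ image_emb.T / temperature
    labels = np.arange(len(logits))  # i-th text matches i-th image

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the text-to-image and image-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy batch of unit-normalized embeddings
rng = np.random.default_rng(1)
text_emb = rng.standard_normal((3, 8))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb = rng.standard_normal((3, 8))
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
```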
# Example usage
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("Arabic-Clip/AraCLIP")
processor = CLIPProcessor.from_pretrained("Arabic-Clip/AraCLIP")

image = Image.open("cat.jpg")  # any candidate image to score

# Arabic text query: "A cat sitting on a table"
text = "قطة جالسة على طاولة"
inputs = processor(text=text, images=image, return_tensors="pt")
outputs = model(**inputs)
similarity = outputs.logits_per_image  # image-text similarity score

Results

AraCLIP was evaluated on multiple Arabic image-text retrieval benchmarks:

Model           | Recall@1 | Recall@5 | MRR
mCLIP           | 45.2%    | 68.3%    | 0.542
AltCLIP         | 48.7%    | 71.5%    | 0.567
AraCLIP (Ours)  | 55.3%    | 78.9%    | 0.631

Citation

If you use AraCLIP in your research, please cite:

@inproceedings{albarham2024araclip,
  title     = {AraCLIP: Cross-Lingual Arabic Image Retrieval},
  author    = {Albarham, Mohammad and Others},
  booktitle = {Proceedings of ArabicNLP 2024},
  year      = {2024},
  url       = {https://aclanthology.org/2024.arabicnlp-1.9/}
}