Retrieval-enriched zero-shot image classification in low-resource domains

University of Trento · University of Pisa · Fondazione Bruno Kessler · Apple Inc.

EMNLP 2024

Abstract

Low-resource domains, characterized by scarce data and annotations, present significant challenges for language and visual understanding tasks, with the latter remaining largely under-explored in the literature. Recent advancements in Vision-Language Models (VLMs) have shown promising results in high-resource domains but fall short on low-resource concepts that are under-represented (e.g., only a handful of images per category) in the pre-training set. We tackle the challenging task of zero-shot low-resource image classification from a novel perspective: by leveraging a retrieval-based strategy, we address it in a training-free fashion. Specifically, our method, named CoRE (Combination of Retrieval Enrichment), enriches the representation of both query images and class prototypes by retrieving relevant textual information from large web-crawled databases. This retrieval-based enrichment significantly boosts classification performance by incorporating broader contextual information relevant to each class. We validate our method on a newly established benchmark covering diverse low-resource domains, including medical imaging, rare plants, and circuits. Our experiments demonstrate that CoRE outperforms existing state-of-the-art methods that rely on synthetic data generation and model fine-tuning.

Method

CoRE

Our CoRE enriches both the image embedding z_q and the class prompts p with captions retrieved from a large-scale web-crawled database D. We weight the retrieved captions T with their similarity scores S_T, which we skew with controllable temperatures τ_i2t and τ_t2t. By combining the embeddings of the retrieved captions with the original representations W and z_q through the coefficients α and β, we obtain enriched representations W⁺ and z_q⁺, which we employ for zero-shot classification.
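
The sketch below illustrates the enrichment step described above. It is a minimal, simplified reading of the caption, not the official implementation: it assumes L2-normalized CLIP-style embeddings, top-k cosine retrieval over pre-computed caption embeddings, and a softmax-weighted average as the combination; the function names, the value of k, and the default temperatures are illustrative assumptions.

    # Minimal sketch of the CoRE enrichment step (assumptions noted above).
    import torch
    import torch.nn.functional as F

    def enrich(query, caption_bank, tau, k=16):
        """Retrieve top-k captions for `query` and return their softmax-weighted mean.

        query:        (d,) normalized embedding (image z_q or one class prompt)
        caption_bank: (N, d) normalized embeddings of the web-crawled captions D
        tau:          temperature skewing the similarity scores S_T
        """
        sims = caption_bank @ query                 # cosine similarities to D
        top_sims, top_idx = sims.topk(k)            # top-k retrieved captions T
        weights = F.softmax(top_sims / tau, dim=0)  # temperature-skewed weights
        return weights @ caption_bank[top_idx]      # weighted caption embedding

    def core_zero_shot(z_q, W, caption_bank, alpha=0.5, beta=0.5,
                       tau_i2t=0.07, tau_t2t=0.07):
        """Classify one image with retrieval-enriched query and class prototypes."""
        # Enrich the query image embedding via image-to-text retrieval.
        z_q_plus = F.normalize(
            alpha * z_q + (1 - alpha) * enrich(z_q, caption_bank, tau_i2t), dim=0)
        # Enrich each class prompt embedding via text-to-text retrieval.
        W_plus = torch.stack([
            F.normalize(beta * w + (1 - beta) * enrich(w, caption_bank, tau_t2t), dim=0)
            for w in W])
        # Standard zero-shot rule: pick the class with the highest cosine similarity.
        return (W_plus @ z_q_plus).argmax().item()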

BibTeX

@inproceedings{dallasen2024core,
  title={Retrieval-enriched zero-shot image classification in low-resource domains},
  author={Nicola Dall'Asen and Yiming Wang and Enrico Fini and Elisa Ricci},
  year={2024},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)}
}