Maybe you are looking for CroQS 🐊

Cross-modal Query Suggestion for Text-to-Image Retrieval

ECIR 2025 Oral

ISTI-CNR · University of Pisa

Abstract

Cross-modal query suggestion architecture

Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions.

In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of “Maybe you are looking for”. To facilitate the evaluation and development of methods, we present a comprehensive benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores.

Our experiments reveal that, although still rather far from human performance, both LLM-based and captioning-based methods achieve competitive results on CroQS, improving recall on cluster specificity by more than 122% and representativeness mAP by more than 23% with respect to the initial query.
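To make the evaluation protocol concrete, below is a minimal sketch of how representativeness and query similarity can be scored over precomputed, L2-normalized CLIP embeddings. It is an illustration under these assumptions, not the benchmark's official evaluation code; all function names are hypothetical.

```python
import numpy as np

def average_precision(ranked_relevance: np.ndarray) -> float:
    """AP of a binary relevance vector, ordered by retrieval rank."""
    hits = np.flatnonzero(ranked_relevance)
    if hits.size == 0:
        return 0.0
    # precision at each relevant rank: (#relevant seen so far) / (rank)
    precisions = (np.arange(hits.size) + 1) / (hits + 1)
    return float(precisions.mean())

def representativeness_ap(suggestion_emb: np.ndarray,
                          image_embs: np.ndarray,
                          cluster_mask: np.ndarray) -> float:
    """Rank the whole collection by similarity to the suggested query
    and measure how early the target cluster's images appear."""
    scores = image_embs @ suggestion_emb   # cosine sims (embeddings pre-normalized)
    order = np.argsort(-scores)            # best match first
    return average_precision(cluster_mask[order])

def query_similarity(q0_emb: np.ndarray, suggestion_emb: np.ndarray) -> float:
    """Cosine similarity between the initial and the suggested query."""
    return float(q0_emb @ suggestion_emb)
```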

Methods

We adapted existing methods from related fields into a set of baselines for the cross-modal query suggestion task.

Captioning-derived Methods

Captioning image group prototype

We adapted models that generate a caption from a point in a semantic embedding space (such as CLIP's), exploiting the spatial properties of such spaces. By feeding these methods a prototype point of an image group, we use the caption decoded from that point as the suggested query for the group.
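A minimal sketch of this procedure, assuming L2-normalized CLIP image embeddings and an embedding-space captioner (e.g., a DeCap- or ClipCap-style decoder), abstracted here as a callable since no specific model is pinned down:

```python
from typing import Callable
import numpy as np

def prototype_caption(
    cluster_embs: np.ndarray,                             # (n_images, d) CLIP embeddings of the group
    caption_from_embedding: Callable[[np.ndarray], str],  # embedding-space captioner (assumed available)
) -> str:
    """Suggest a query for an image group by captioning its prototype point."""
    prototype = cluster_embs.mean(axis=0)        # centroid of the group in CLIP space
    prototype /= np.linalg.norm(prototype)       # re-normalize onto the unit hypersphere
    return caption_from_embedding(prototype)     # decode the prototype point into text
```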

GroupCap: LLM-based approach


An LLM receives the captions of the most representative images in the group and builds a query suggestion from them together with the initial query q₀.
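A minimal sketch of the GroupCap pipeline, with the LLM abstracted as a plain text-in/text-out callable; the prompt wording below is illustrative, not the one used in the paper:

```python
from typing import Callable, Sequence

def groupcap_suggestion(
    q0: str,                      # initial query
    captions: Sequence[str],      # captions of the most representative images
    llm: Callable[[str], str],    # any text-completion function (assumed available)
) -> str:
    """Ask an LLM to turn per-image captions plus q0 into one suggested query."""
    caption_list = "\n".join(f"- {c}" for c in captions)
    prompt = (
        f"A user searched an image collection with the query: '{q0}'.\n"
        f"One visually consistent group of results is described by these captions:\n"
        f"{caption_list}\n"
        "Suggest a short modified query, close to the original, "
        "that targets exactly this group of images."
    )
    return llm(prompt).strip()
```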

Results

Dataset samples, with suggestions generated by a set of methods

Results table over the CroQS dataset for the best-performing baseline methods

Radar chart of macro-averaged scores over CroQS

Captioning-derived methods are stronger on the Cluster Specificity property, measured by Recall on Closed Set, while the LLM-based method is more balanced across the metrics and achieves human-level similarity to the initial query.
Radar chart comparison of the best baseline methods for cross-modal query suggestion

Acknowledgements

This work has received financial support from the project FAIR – Future Artificial Intelligence Research – Spoke 1 (PNRR M4C2 Inv. 1.3 PE00000013), funded by the European Union – Next Generation EU.

This work has received financial support from the European Union – Next Generation EU, Mission 4 Component 1, CUP B53D23026090001 (a MUltimedia platform for Content Enrichment and Search in audiovisual archives – MUCES, PRIN 2022 PNRR P2022BW7CW).

This work has received financial support from the Spoke "FutureHPC & BigData" of the ICSC – Centro Nazionale di Ricerca in High-Performance Computing, Big Data and Quantum Computing, funded by the Italian Government.

This work has received financial support from the FoReLab and CrossLab projects (Departments of Excellence) and from the NEREO PRIN project (Research Grant no. 2022AEFHAZ), funded by the Italian Ministry of University and Research (MUR).