One Patch to Caption Them All

A Unified Zero-Shot Captioning Framework

¹ISTI CNR  ²University of Pisa
* Equal contribution
[Figure: Patch-ioner architecture]

Abstract

Zero-shot captioners are recently proposed models that exploit common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they decode a text-aligned image feature into natural language, but they limit their scope to global representations and whole-image captions.

We present Patch-ioner, a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need for region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense captioning, region-set captioning, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation.
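To make the patch-centric idea concrete, below is a minimal sketch of such a pipeline, assuming a DINO-like backbone that yields one text-aligned embedding per patch and a decoder trained only on text. The function names (`embed_patches`, `decode_caption`, `caption_region`) and the mean-pooling aggregation are illustrative placeholders, not the released API.

```python
import numpy as np

# --- Illustrative placeholders, not the released Patch-ioner API ---

def embed_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Return one text-aligned embedding per ViT patch, shape (num_patches, dim).
    In the real framework this role is played by a dense backbone such as DINO,
    projected into the decoder's space; here we return dummy features."""
    h, w, _ = image.shape
    num_patches = (h // patch_size) * (w // patch_size)
    rng = np.random.default_rng(0)
    return rng.standard_normal((num_patches, 768)).astype(np.float32)

def decode_caption(region_embedding: np.ndarray) -> str:
    """Stand-in for a decoder trained only on text (e.g., a DeCap-style decoder)."""
    return f"<caption decoded from a {region_embedding.shape[-1]}-d embedding>"

def caption_region(patch_feats: np.ndarray, patch_mask: np.ndarray) -> str:
    """Aggregate the selected patches (mean pooling here) and decode one caption."""
    region_emb = patch_feats[patch_mask].mean(axis=0)
    region_emb /= np.linalg.norm(region_emb) + 1e-8  # normalize before decoding
    return decode_caption(region_emb)

# Any region -- a single patch, a trace, a box, a set of boxes, or the whole
# image -- reduces to a boolean mask over the patch grid.
image = np.zeros((224, 224, 3), dtype=np.float32)
feats = embed_patches(image)                 # (196, 768) for a 14x14 patch grid
mask = np.zeros(feats.shape[0], dtype=bool)
mask[:14] = True                             # e.g., the top row of patches
print(caption_region(feats, mask))
```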

Tasks

The local-understanding capabilities of our model enable it to solve many captioning tasks:

Patch Captioning


Patch Captioning is the task of generating a caption for each individual patch produced by the visual backbone for a given image.
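In terms of the sketch above, this is the degenerate case in which every patch is its own one-element region (reusing the hypothetical `feats` and `caption_region` from the earlier sketch):

```python
import numpy as np

# Per-patch captioning: one caption per patch of the grid.
patch_captions = []
for idx in range(feats.shape[0]):            # `feats` from the earlier sketch
    mask = np.zeros(feats.shape[0], dtype=bool)
    mask[idx] = True                         # select a single patch
    patch_captions.append(caption_region(feats, mask))
```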

Trace Captioning

[Figure: Trace Captioning example on skyscrapers in the background of an image depicting giraffes]

We define Trace Captioning as generating a caption for the region of an image specified by a mouse trace. This task is particularly useful for obtaining localized descriptions of images, for instance to support the exploration of image content by visually impaired users.
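A hedged sketch of how a trace could reduce to a patch selection: the trace points, assumed normalized to [0, 1], are snapped to the cells of the patch grid, and the resulting mask is aggregated and decoded as in the earlier sketch. `trace_to_patch_mask` is a hypothetical helper, not the released implementation.

```python
import numpy as np

def trace_to_patch_mask(trace_xy: np.ndarray, grid: int = 14) -> np.ndarray:
    """Hypothetical helper: snap normalized (x, y) trace points in [0, 1]
    to a grid x grid patch layout and return a flat boolean mask."""
    cols = np.clip((trace_xy[:, 0] * grid).astype(int), 0, grid - 1)
    rows = np.clip((trace_xy[:, 1] * grid).astype(int), 0, grid - 1)
    mask = np.zeros(grid * grid, dtype=bool)
    mask[rows * grid + cols] = True
    return mask

# A short diagonal trace selects the patches it passes over.
trace = np.array([[0.10, 0.10], [0.18, 0.14], [0.26, 0.20], [0.34, 0.27]])
trace_mask = trace_to_patch_mask(trace)
# caption = caption_region(feats, trace_mask)   # aggregate and decode as before
```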

Dense Captioning

[Figure: Dense Captioning example on a train station clock]

Dense Captioning requires locating salient regions in an image and generating their descriptions. We focus on captioning already-defined boxes.
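As an illustrative sketch, a bounding box can be rasterized onto the patch grid and then handled exactly like any other patch mask; `box_to_patch_mask` and the normalized box format are assumptions made for illustration.

```python
import numpy as np

def box_to_patch_mask(box, grid: int = 14) -> np.ndarray:
    """Hypothetical helper: rasterize a normalized box (x0, y0, x1, y1)
    onto the patch grid and return a flat boolean mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((grid, grid), dtype=bool)
    r0, r1 = int(y0 * grid), int(np.ceil(y1 * grid))
    c0, c1 = int(x0 * grid), int(np.ceil(x1 * grid))
    mask[r0:r1, c0:c1] = True
    return mask.reshape(-1)

# One caption per given box, e.g. the annotated boxes of a dense-captioning benchmark.
boxes = [(0.05, 0.10, 0.40, 0.55), (0.50, 0.20, 0.95, 0.80)]
# captions = [caption_region(feats, box_to_patch_mask(b)) for b in boxes]
```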

Region-Set Captioning

[Figure: Region-Set Captioning example with two people and a stroller]

Region-Set Captioning consists of generating a single caption for multiple regions within an image, where each region is specified by a distinct bounding box.
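Under the same assumptions as the sketches above, region-set captioning simply takes the union of the individual box masks before a single decode (reusing the hypothetical `box_to_patch_mask` and `caption_region` helpers):

```python
import numpy as np

# Region-set captioning: OR the masks of all boxes together, then decode once.
region_set = [(0.05, 0.10, 0.35, 0.60), (0.55, 0.15, 0.90, 0.65)]
union_mask = np.zeros(14 * 14, dtype=bool)
for box in region_set:
    union_mask |= box_to_patch_mask(box)
# caption = caption_region(feats, union_mask)   # one caption for the whole set
```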

Image Captioning

[Figure: Image Captioning example of a surfer in the sea]

Image Captioning involves generating a single caption that describes the entire image. To achieve this, we derive a global representation by aggregating the feature embeddings of all patches in the image.
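In the terms of the earlier sketch, this is simply the limit case in which the mask selects every patch:

```python
import numpy as np

# Whole-image captioning: the mask covers all patches, so the global
# representation is the mean of all patch embeddings.
full_mask = np.ones(feats.shape[0], dtype=bool)
# caption = caption_region(feats, full_mask)
```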

Qualitative Results

We report some predictions of our model and compare them with baselines, from the finest-grained task (top) to the coarsest (bottom). For trace captioning examples, the trace time is color-coded from start (red) to end (yellow). DeCap = DeCap applied to the whole image. DeCap P = DeCap applied to the same aggregation of patches used by our method. DeCap C = DeCap applied to the cropped box. ZeroCap = ZeroCap applied to the whole image. CLOSE = CLOSE applied to the whole image. GT = ground-truth caption.

Additional Trace Captioning examples are available here.

Comparison with SOTA

We compare Patch-ioner with state-of-the-art zero-shot captioners and region-supervised backbones (AlphaCLIP, RegionCLIP) on trace, dense, region-set, and image captioning. Unlike these models, Patch-ioner is trained without region-level annotations. Our framework excels in fine-grained regional tasks, extends seamlessly to context-aware region-set captioning, and remains competitive on whole-image captioning.

[Figure: Quantitative results table across tasks]
Comparison of Patch-ioner (Talk2DINO, T2D) with zero-shot and region-supervised captioners. Patch-ioner consistently outperforms whole-image and region-level baselines on local, fine-grained tasks, while achieving strong results on whole-image captioning. Metrics: CIDEr (C), RefPAC (P), mean average precision (mAP), CLIP-Score (CLIP-S).

In trace and dense captioning, Patch-ioner surpasses whole-image and crop-based models. On region-set captioning, patch aggregation yields coherent captions that even outperform those of region-supervised backbones. For whole-image captioning, our results remain competitive with the strongest dedicated models while prioritizing semantic quality.

Acknowledgements

This work has received financial support from the project FAIR - Future Artificial Intelligence Research - Spoke 1 (PNRR M4C2 Inv. 1.3 PE00000013), funded by the European Union - Next Generation EU.

This work has received financial support from the European Union - Next Generation EU, Mission 4 Component 1, CUP B53D23026090001 (MUCES - a MUltimedia platform for Content Enrichment and Search in audiovisual archives, PRIN 2022 PNRR P2022BW7CW).