One Patch to Caption Them All

A Unified Zero-Shot Captioning Model

ISTI-CNR · University of Pisa
Patch-ioner architecture

Abstract

We introduce Patch-ioner, a unified zero-shot captioning model that shifts from an image-centric to a patch-centric paradigm, enabling caption generation at arbitrary spatial granularity without region-level supervision. Instead of relying on full-image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images.

Leveraging language-aligned dense visual representations, we provide a flexible framework for solving various captioning tasks in a zero-shot manner. Experiments demonstrate state-of-the-art or competitive performance in zero-shot dense captioning, region-set captioning, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation.
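
To make the idea concrete, here is a minimal Python sketch of the patch-centric pipeline, assuming a language-aligned dense encoder and an embedding-conditioned text decoder. The names `backbone`, `decoder`, and `caption_region` are hypothetical illustrations, not our released API, and mean-pooling is just one plausible aggregation.

    import torch

    def caption_region(image, patch_mask, backbone, decoder):
        """Caption an arbitrary region given as a boolean mask over patches."""
        feats = backbone(image)           # (num_patches, dim) language-aligned embeddings
        region = feats[patch_mask]        # keep only the patches in the region
        pooled = region.mean(dim=0)       # aggregate patch features into one embedding
        pooled = pooled / pooled.norm()   # re-normalize into the text-aligned space
        return decoder.generate(pooled)   # decode the embedding into a caption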

Tasks

The local-understanding capabilities of our model enable it to solve many captioning tasks:

Patch Captioning


Patch Captioning is the task of generating a caption for each image patch produced by the visual backbone.
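
As a sketch, per-patch captioning decodes every patch embedding independently, reusing the hypothetical `backbone` and `decoder` from the pipeline sketch above:

    def caption_all_patches(image, backbone, decoder):
        feats = backbone(image)                           # (num_patches, dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # normalize each patch embedding
        return [decoder.generate(f) for f in feats]       # one caption per patch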

Trace Captioning

Trace Captioning example: skyscrapers in the background of an image depicting giraffes

We define Trace Captioning as generating a caption for a region of an image specified by a mouse trace. This task is particularly useful for obtaining localized descriptions of images, for example to help visually impaired users explore image content.
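
One simple way to realize this, sketched under the assumption of a ViT-style square patch grid (the actual trace handling may differ), is to mark every patch the trace passes through and caption the resulting set with `caption_region` from the sketch above:

    import torch

    def trace_to_mask(trace_xy, img_size=224, grid=14):
        """trace_xy: (N, 2) tensor of pixel coordinates visited by the trace."""
        patch = img_size // grid
        cols = (trace_xy[:, 0] // patch).long().clamp(0, grid - 1)
        rows = (trace_xy[:, 1] // patch).long().clamp(0, grid - 1)
        mask = torch.zeros(grid * grid, dtype=torch.bool)
        mask[rows * grid + cols] = True   # mark every patch touched by the trace
        return mask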

Dense Captioning

Dense Captioning example image: train station clock

Dense Captioning requires locating salient regions in an image and generating a description for each. We focus on captioning regions given as pre-defined bounding boxes.
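
Since the boxes are already given, the task reduces to selecting the patches a box covers. A sketch, again assuming a square ViT patch grid and hypothetical names:

    import torch

    def box_to_mask(box, img_size=224, grid=14):
        """box: (x0, y0, x1, y1) in pixels; selects patches whose centers fall inside."""
        patch = img_size / grid
        centers = (torch.arange(grid) + 0.5) * patch
        xs, ys = torch.meshgrid(centers, centers, indexing="xy")
        inside = (xs >= box[0]) & (xs < box[2]) & (ys >= box[1]) & (ys < box[3])
        return inside.flatten()           # boolean mask over the flattened patch grid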

Region-Set Captioning

Region-set Captioning: example image two people with a stroller

Region-Set Captioning consists of generating a single caption for multiple regions of an image, where each region is specified by a distinct bounding box.
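
In terms of the earlier sketches, region-set captioning amounts to taking the union of the per-box patch masks and decoding once (names are hypothetical, as before):

    import torch

    def caption_region_set(image, boxes, backbone, decoder, grid=14):
        mask = torch.zeros(grid * grid, dtype=torch.bool)
        for box in boxes:
            mask |= box_to_mask(box, grid=grid)   # union of all region masks
        return caption_region(image, mask, backbone, decoder)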

Image Captioning

Image Captioning: surfer in the sea example

Image Captioning involves generating a single caption that describes the entire image. To achieve this, we derive a global representation by aggregating the feature embeddings of all patches in the image.
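
In the same sketch, whole-image captioning is simply the degenerate case where the mask selects every patch:

    import torch

    def caption_image(image, backbone, decoder, grid=14):
        mask = torch.ones(grid * grid, dtype=torch.bool)   # select all patches
        return caption_region(image, mask, backbone, decoder)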

Qualitative Results

We report sample predictions of our model against the baselines, ordered from the finest-grained (top) to the coarsest-grained (bottom) task. For trace captioning examples, trace time is color-coded from start (red) to end (yellow). DeCap = DeCap applied to the whole image. DeCap P = DeCap applied to the same aggregation of patches used by our method. DeCap C = DeCap applied to the cropped box. ZeroCap = ZeroCap applied to the whole image. CLOSE = CLOSE applied to the whole image. GT = ground-truth caption.

More trace captioning examples are available here.

Quantitative Results

Radar charts on the assessed tasks comparing our Patch-ioner with the best baseline of each task

In fine-grained tasks like trace and dense captioning, Patch-ioner excels by providing localized descriptions that outperform baselines, which struggle with fine-grained details. For context-aware tasks like region-set and image captioning, our model performs well, with improvements in semantic metrics. In standard image captioning, our method achieves strong results in semantic quality, although it diverges slightly in syntactic metrics, indicating a focus on meaning rather than phrasing.

Radar chart comparison of our method versus baselines on various captioning tasks

Acknowledgements

FAIR Project Logo This work received financial support from the project FAIR - Future Artificial Intelligence Research - Spoke 1 (PNRR M4C2 Inv. 1.3 PE00000013), funded by the European Union - Next Generation EU.

MUCES Project Logo This work received financial support from the European Union - Next Generation EU, Mission 4 Component 1, CUP B53D23026090001 (a MUltimedia platform for Content Enrichment and Search in audiovisual archives - MUCES, PRIN 2022 PNRR P2022BW7CW).