Patch Captioning

Patch Captioning is the task of generating a caption for each image patch produced by the visual backbone.
A Unified Zero-Shot Captioning Framework
Zero-shot captioners are recently proposed models that exploit common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they textually decode a text-aligned image feature, but they limit their scope to global representations and whole-image captions.
We present Patch-ioner, a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need for region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work within our proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense captioning, region-set captioning, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation.
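The patch-centric pipeline can be illustrated with a short sketch: extract dense patch features with a vision backbone, project them into a text-decodable space, mean-pool the patches covering a region, and decode the pooled embedding into text. The names below (PatchBackbone, TextAlignedProjector, greedy_decode) are hypothetical placeholders introduced for illustration; in particular, the decoder stands in for a DeCap-style latent decoder trained on text only, and this is not the released Patch-ioner implementation.

```python
# A minimal sketch of the patch-centric captioning pipeline, assuming stand-in
# modules: PatchBackbone mimics a dense ViT (e.g., DINO), TextAlignedProjector
# maps patch features to the text-decodable space, and greedy_decode is a
# placeholder for a text-only-trained latent decoder.
import torch
import torch.nn as nn


class PatchBackbone(nn.Module):
    """Stand-in for a dense ViT backbone: image -> patch tokens."""

    def __init__(self, patch: int = 16, dim: int = 768):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> patch tokens: (B, N, D) with N = (H/p) * (W/p)
        feats = self.embed(images)               # (B, D, H/p, W/p)
        return feats.flatten(2).transpose(1, 2)  # (B, N, D)


class TextAlignedProjector(nn.Module):
    """Hypothetical projection of patch features into the text-decodable space."""

    def __init__(self, dim_in: int = 768, dim_out: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_feats)


def caption_region(patch_feats: torch.Tensor, region_mask: torch.Tensor, decode) -> str:
    """Aggregate the selected patches and decode the pooled feature into text.

    patch_feats: (N, D) text-aligned patch embeddings of one image.
    region_mask: (N,) boolean mask selecting the patches to describe
                 (a single patch, a box, a trace, or all patches).
    decode:      callable mapping a (D,) embedding to a caption string.
    """
    pooled = patch_feats[region_mask].mean(dim=0)  # patch-wise aggregation
    pooled = pooled / pooled.norm()                # keep the feature on the unit sphere
    return decode(pooled)


if __name__ == "__main__":
    backbone, projector = PatchBackbone(), TextAlignedProjector()
    image = torch.randn(1, 3, 224, 224)
    patches = projector(backbone(image))[0]        # (196, 512) for a 14x14 grid
    mask = torch.zeros(patches.shape[0], dtype=torch.bool)
    mask[:28] = True                               # e.g., the top two rows of patches
    # Placeholder decoder: a real system would decode the pooled embedding into text.
    greedy_decode = lambda z: f"<caption for a {int(mask.sum())}-patch region>"
    print(caption_region(patches, mask, greedy_decode))
```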
The local-understanding capabilities of our model enable it to solve several captioning tasks:
Patch Captioning is the task of generating a caption for each image patch produced by the visual backbone.
We define Trace Captioning as the task of generating a caption for a region of an image specified by a mouse trace. This task is particularly useful for obtaining localized descriptions of images, for example to convey image content to visually impaired users.
Dense Captioning requires locating salient regions in an image and generating their descriptions. We focus on captioning already defined boxes.
Region-Set Captioning consists of generating a single caption for multiple regions within an image, where each region is specified by a distinct bounding box.
Image Captioning involves generating a single caption that describes the entire image. To achieve this, we derive a global representation by aggregating the feature embeddings of all patches within the image. How each of these region specifications reduces to a selection of patches is sketched below.
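All of these tasks share the same primitive: deciding which patches to aggregate. The sketch below, under an assumed patch-grid geometry and with helper names we introduce for illustration (boxes_to_mask, trace_to_mask, image_mask), shows how a bounding box, a set of boxes, a mouse trace, or the whole image can each be turned into a boolean mask over the patch grid that the aggregation step then mean-pools. It is an illustration, not the paper's exact implementation.

```python
# Illustrative reduction of region specifications to patch-selection masks,
# assuming a ViT-style grid of (H/p) x (W/p) patches. The resulting flat mask
# can be passed to the caption_region sketch shown earlier.
import torch


def boxes_to_mask(boxes, grid_h: int, grid_w: int, patch: int = 16) -> torch.Tensor:
    """Mark every patch whose cell overlaps any (x1, y1, x2, y2) box in pixels."""
    mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    for x1, y1, x2, y2 in boxes:
        r1, r2 = int(y1 // patch), min(grid_h - 1, int(y2 // patch))
        c1, c2 = int(x1 // patch), min(grid_w - 1, int(x2 // patch))
        mask[r1:r2 + 1, c1:c2 + 1] = True
    return mask.flatten()  # (grid_h * grid_w,)


def trace_to_mask(trace_points, grid_h: int, grid_w: int, patch: int = 16) -> torch.Tensor:
    """Mark every patch touched by a mouse trace given as (x, y) pixel points."""
    mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    for x, y in trace_points:
        r = min(grid_h - 1, int(y // patch))
        c = min(grid_w - 1, int(x // patch))
        mask[r, c] = True
    return mask.flatten()


def image_mask(grid_h: int, grid_w: int) -> torch.Tensor:
    """Whole-image captioning: aggregate every patch."""
    return torch.ones(grid_h * grid_w, dtype=torch.bool)


if __name__ == "__main__":
    gh = gw = 224 // 16  # 14x14 patch grid for a 224px image
    print(boxes_to_mask([(10, 10, 80, 120)], gh, gw).sum().item(), "patches in one box")
    print(trace_to_mask([(30, 40), (60, 90), (100, 150)], gh, gw).sum().item(), "patches on the trace")
    print(image_mask(gh, gw).sum().item(), "patches for the full image")
```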
We report example predictions of our model and compare them with baselines, from the finest-grained (top) to the coarsest (bottom) task. For the trace captioning examples, trace time is color-coded from start (red) to end (yellow). DeCap = DeCap applied to the whole image. DeCap P = DeCap applied to the same aggregation of patches used by our method. DeCap C = DeCap applied to the cropped box. ZeroCap = ZeroCap applied to the whole image. CLOSE = CLOSE applied to the whole image. GT = ground-truth caption.
Other Trace Captioning Examples
We compare Patch-ioner with state-of-the-art zero-shot captioners and region-supervised backbones (AlphaCLIP, RegionCLIP) on trace, dense, region-set, and image captioning. Unlike these models, Patch-ioner is trained without region-level annotations. Our framework excels in fine-grained regional tasks, extends seamlessly to context-aware region-set captioning, and remains competitive on whole-image captioning.
In trace and dense captioning, Patch-ioner surpasses whole-image and crop-based models. On region-set captioning, patch aggregation yields coherent captions that even outperform region-supervised backbones. For whole-image captioning, our results remain competitive with the strongest dedicated models while prioritizing semantic quality.
This work has received financial support from the project FAIR – Future Artificial Intelligence Research – Spoke 1 (PNRR M4C2 Inv. 1.3 PE00000013), funded by the European Union – Next Generation EU.
This work has received financial support from the European Union – Next Generation EU, Mission 4 Component 1, CUP B53D23026090001 (MUCES – a MUltimedia platform for Content Enrichment and Search in audiovisual archives, PRIN 2022 PNRR P2022BW7CW).