One Patch to Caption Them All

A Unified Zero-Shot Captioning Framework

¹ISTI CNR  ²University of Pisa
* Equal contribution
[Figure: Patch-ioner architecture]

Abstract

Zero-shot captioners are recently proposed models that exploit common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they decode a text-aligned image feature into natural language, but they limit their scope to global representations and whole-image captions.

We present Patch-ioner, a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need for region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense captioning, region-set captioning, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation.
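To make the patch-centric idea concrete, below is a minimal sketch of such a pipeline, assuming a DINO-like backbone that yields one text-aligned embedding per patch and a decoder trained only on text. The function names (`embed_patches`, `decode_caption`, `caption_region`) and the mean-pooling aggregation are illustrative placeholders, not the released API.

```python
import numpy as np

# --- Illustrative placeholders, not the released Patch-ioner API ---

def embed_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Return one text-aligned embedding per ViT patch, shape (num_patches, dim).
    In the real framework this role is played by a dense backbone such as DINO,
    projected into the decoder's space; here we return dummy features."""
    h, w, _ = image.shape
    num_patches = (h // patch_size) * (w // patch_size)
    rng = np.random.default_rng(0)
    return rng.standard_normal((num_patches, 768)).astype(np.float32)

def decode_caption(region_embedding: np.ndarray) -> str:
    """Stand-in for a decoder trained only on text (e.g., a DeCap-style decoder)."""
    return f"<caption decoded from a {region_embedding.shape[-1]}-d embedding>"

def caption_region(patch_feats: np.ndarray, patch_mask: np.ndarray) -> str:
    """Aggregate the selected patches (mean pooling here) and decode one caption."""
    region_emb = patch_feats[patch_mask].mean(axis=0)
    region_emb /= np.linalg.norm(region_emb) + 1e-8  # normalize before decoding
    return decode_caption(region_emb)

# Any region -- a single patch, a trace, a box, a set of boxes, or the whole
# image -- reduces to a boolean mask over the patch grid.
image = np.zeros((224, 224, 3), dtype=np.float32)
feats = embed_patches(image)                 # (196, 768) for a 14x14 patch grid
mask = np.zeros(feats.shape[0], dtype=bool)
mask[:14] = True                             # e.g., the top row of patches
print(caption_region(feats, mask))
```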

Tasks

The local-understanding capabilities of our model enable it to solve many captioning tasks:

Patch Captioning


Patch Captioning is the task of generating a caption for each individual patch produced by the visual backbone for a given image.
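In terms of the sketch above, this is the degenerate case in which every patch is its own one-element region (reusing the hypothetical `feats` and `caption_region` from the earlier sketch):

```python
import numpy as np

# Per-patch captioning: one caption per patch of the grid.
patch_captions = []
for idx in range(feats.shape[0]):            # `feats` from the earlier sketch
    mask = np.zeros(feats.shape[0], dtype=bool)
    mask[idx] = True                         # select a single patch
    patch_captions.append(caption_region(feats, mask))
```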

Trace Captioning

[Figure: Trace Captioning example on skyscrapers in the background of an image depicting giraffes]

We define Trace Captioning as generating a caption for the region of an image specified by a mouse trace. This task is particularly useful for obtaining localized descriptions of images, for instance to support the exploration of image content by visually impaired users.
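A hedged sketch of how a trace could reduce to a patch selection: the trace points, assumed normalized to [0, 1], are snapped to the cells of the patch grid, and the resulting mask is aggregated and decoded as in the earlier sketch. `trace_to_patch_mask` is a hypothetical helper, not the released implementation.

```python
import numpy as np

def trace_to_patch_mask(trace_xy: np.ndarray, grid: int = 14) -> np.ndarray:
    """Hypothetical helper: snap normalized (x, y) trace points in [0, 1]
    to a grid x grid patch layout and return a flat boolean mask."""
    cols = np.clip((trace_xy[:, 0] * grid).astype(int), 0, grid - 1)
    rows = np.clip((trace_xy[:, 1] * grid).astype(int), 0, grid - 1)
    mask = np.zeros(grid * grid, dtype=bool)
    mask[rows * grid + cols] = True
    return mask

# A short diagonal trace selects the patches it passes over.
trace = np.array([[0.10, 0.10], [0.18, 0.14], [0.26, 0.20], [0.34, 0.27]])
trace_mask = trace_to_patch_mask(trace)
# caption = caption_region(feats, trace_mask)   # aggregate and decode as before
```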

Dense Captioning

[Figure: Dense Captioning example on a train station clock]

Dense Captioning requires locating salient regions in an image and generating their descriptions. We focus on captioning already-defined boxes.
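As an illustrative sketch, a bounding box can be rasterized onto the patch grid and then handled exactly like any other patch mask; `box_to_patch_mask` and the normalized box format are assumptions made for illustration.

```python
import numpy as np

def box_to_patch_mask(box, grid: int = 14) -> np.ndarray:
    """Hypothetical helper: rasterize a normalized box (x0, y0, x1, y1)
    onto the patch grid and return a flat boolean mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((grid, grid), dtype=bool)
    r0, r1 = int(y0 * grid), int(np.ceil(y1 * grid))
    c0, c1 = int(x0 * grid), int(np.ceil(x1 * grid))
    mask[r0:r1, c0:c1] = True
    return mask.reshape(-1)

# One caption per given box, e.g. the annotated boxes of a dense-captioning benchmark.
boxes = [(0.05, 0.10, 0.40, 0.55), (0.50, 0.20, 0.95, 0.80)]
# captions = [caption_region(feats, box_to_patch_mask(b)) for b in boxes]
```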

Region-Set Captioning

[Figure: Region-Set Captioning example with two people and a stroller]

Region-Set Captioning consists of generating a single caption for multiple regions within an image, where each region is specified by a distinct bounding box.
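Under the same assumptions as the sketches above, region-set captioning simply takes the union of the individual box masks before a single decode (reusing the hypothetical `box_to_patch_mask` and `caption_region` helpers):

```python
import numpy as np

# Region-set captioning: OR the masks of all boxes together, then decode once.
region_set = [(0.05, 0.10, 0.35, 0.60), (0.55, 0.15, 0.90, 0.65)]
union_mask = np.zeros(14 * 14, dtype=bool)
for box in region_set:
    union_mask |= box_to_patch_mask(box)
# caption = caption_region(feats, union_mask)   # one caption for the whole set
```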

Image Captioning

[Figure: Image Captioning example of a surfer in the sea]

Image Captioning involves generating a single caption that describes the entire image. To achieve this, we derive a global representation by aggregating the feature embeddings of all patches in the image.
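In the terms of the earlier sketch, this is simply the limit case in which the mask selects every patch:

```python
import numpy as np

# Whole-image captioning: the mask covers all patches, so the global
# representation is the mean of all patch embeddings.
full_mask = np.ones(feats.shape[0], dtype=bool)
# caption = caption_region(feats, full_mask)
```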

Qualitative Results

We report some predictions of our model and compare them with baselines, from the finest-grained task (top) to the coarsest (bottom). For trace captioning examples, the trace time is color-coded from start (red) to end (yellow). DeCap = DeCap applied to the whole image. DeCap P = DeCap applied to the same aggregation of patches used by our method. DeCap C = DeCap applied to the cropped box. ZeroCap = ZeroCap applied to the whole image. CLOSE = CLOSE applied to the whole image. GT = ground-truth caption.

Additional Trace Captioning examples are available here.

Comparison with SOTA

We compare Patch-ioner with state-of-the-art zero-shot captioners and region-supervised backbones (AlphaCLIP, RegionCLIP) on trace, dense, region-set, and image captioning. Unlike these models, Patch-ioner is trained without region-level annotations. Our framework excels in fine-grained regional tasks, extends seamlessly to context-aware region-set captioning, and remains competitive on whole-image captioning.

[Figure: Quantitative results table across tasks]
Comparison of Patch-ioner (Talk2DINO, T2D) with zero-shot and region-supervised captioners. Patch-ioner consistently outperforms whole-image and region-level baselines on local, fine-grained tasks, while achieving strong results on whole-image captioning. Metrics: CIDEr (C), RefPAC (P), mean average precision (mAP), CLIP-Score (CLIP-S).

In trace and dense captioning, Patch-ioner surpasses whole-image and crop-based models. On region-set captioning, patch aggregation yields coherent captions that even outperform those of region-supervised backbones. For whole-image captioning, our results remain competitive with the strongest dedicated models while prioritizing semantic quality.

Acknowledgements

This work has received financial support from the project FAIR - Future Artificial Intelligence Research - Spoke 1 (PNRR M4C2 Inv. 1.3 PE00000013), funded by the European Union - Next Generation EU.

This work has received financial support from the European Union - Next Generation EU, Mission 4 Component 1, CUP B53D23026090001 (MUCES - a MUltimedia platform for Content Enrichment and Search in audiovisual archives, PRIN 2022 PNRR P2022BW7CW).