Classifier-Guided Captioning Across Modalities

Ariel Shaulov, Tal Shaharabany, Eitan Shaar, Gal Chechik, Lior Wolf

Abstract

Most current captioning systems rely on language models trained on data from specific settings, such as image captions collected via Amazon Mechanical Turk, which limits their ability to generalize to other modalities and contexts. This limitation hinders performance in tasks like audio or video captioning, where different semantic cues are needed. Addressing this challenge is crucial for building more adaptable and versatile captioning frameworks applicable across diverse real-world settings. Our classifier-guided framework improves captioning quality across modalities and achieves state-of-the-art results.
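
To make the title concrete, here is a minimal sketch of one way a text classifier can guide a frozen captioning language model: the model samples several candidate captions and the classifier selects the one it scores as most audible. The checkpoints ("gpt2" and "audibility-classifier") are hypothetical placeholders, and reranking full candidates is an illustrative simplification rather than the exact guidance mechanism described in the paper.

import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical checkpoints -- placeholders, not the authors' released models.
gen_tok = AutoTokenizer.from_pretrained("gpt2")
generator = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
clf_tok = AutoTokenizer.from_pretrained("audibility-classifier")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "audibility-classifier").to(device)


def guided_caption(prompt: str, num_candidates: int = 8) -> str:
    """Sample candidate captions and return the one the classifier
    scores as most audible (class index 1 assumed to mean 'audible')."""
    inputs = gen_tok(prompt, return_tensors="pt").to(device)
    outputs = generator.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=30,
        num_return_sequences=num_candidates,
        pad_token_id=gen_tok.eos_token_id,
    )
    candidates = [gen_tok.decode(o, skip_special_tokens=True) for o in outputs]

    # Score every candidate with the audibility classifier and keep the best.
    enc = clf_tok(candidates, return_tensors="pt", padding=True,
                  truncation=True).to(device)
    with torch.no_grad():
        audible_prob = classifier(**enc).logits.softmax(dim=-1)[:, 1]
    return candidates[int(audible_prob.argmax())]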

Audibility Dataset

Examples from our dataset, highlighting audible and non-audible instances (a toy classifier sketch follows the table):

Audible                                             | Not Audible
----------------------------------------------------|--------------------------------------------------
The barking of a dog in excitement.                 | Magnets attract metals.
Ringing phone awaits an answer.                     | Ice covers the lake in winter.
Birds chirping in early morning.                    | An umbrella stands closed by the door.
The buzz of a drone flying overhead.                | Icebergs float on water.
Jingling coins are counted or played with.          | A statue in a park.
The swishing sound of a washing machine.            | Rusting car sits in the yard.
Whips cracked in the rodeo.                         | Resolved issue is fixed.
The meow of a cat.                                  | Soccer balls are stored in a mesh bag.
The school bell rings, signaling the end of class.  | Skimmed milk has less fat.
Raindrops tapping on rooftops.                      | A pair of hiking boots rests next to a backpack.
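
As a rough illustration of how these labels might be used, the toy sketch below fits a binary audibility classifier on a few of the sentences above. TF-IDF features with logistic regression are an illustrative choice here, not necessarily the classifier architecture used in the actual framework.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A few examples taken from the table above.
audible = [
    "The barking of a dog in excitement.",
    "Ringing phone awaits an answer.",
    "Birds chirping in early morning.",
    "The buzz of a drone flying overhead.",
    "Raindrops tapping on rooftops.",
]
not_audible = [
    "Magnets attract metals.",
    "Ice covers the lake in winter.",
    "An umbrella stands closed by the door.",
    "Icebergs float on water.",
    "A statue in a park.",
]

texts = audible + not_audible
labels = [1] * len(audible) + [0] * len(not_audible)  # 1 = audible, 0 = not

# TF-IDF + logistic regression: a deliberately simple stand-in classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["The meow of a cat.", "Skimmed milk has less fat."]))

With only a handful of training sentences this is purely a toy, but it mirrors the binary audible/non-audible supervision the dataset provides.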

BibTeX

If you find this project useful for your research, please cite:

@article{shaulov2025classifier,
  title={Classifier-Guided Captioning Across Modalities},
  author={Shaulov, Ariel and Shaharabany, Tal and Shaar, Eitan and Chechik, Gal and Wolf, Lior},
  journal={arXiv preprint arXiv:2501.03183},
  year={2025}
}