MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models Paper • 2408.02718 • Published Aug 5, 2024
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework Paper • 2510.02898 • Published Oct 3, 2025
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation Paper • 2411.19331 • Published Nov 28, 2024