Why Caption Evaluation Matters
Automated image captioning has come a long way, but evaluation hasn't kept up. Standard metrics such as BLEU and METEOR were designed for reference-based machine translation, and CIDEr, though built for captioning, still reduces quality to n-gram overlap with reference captions; none of them was designed to capture the nuanced quality dimensions of modern generative captioners. More recent metrics like CLIPScore and RefCLIPScore improve on image-text alignment but still miss fine-grained issues such as hallucination, style appropriateness, and factual grounding.
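To make the alignment point concrete, here is a minimal sketch of reference-free CLIPScore as described by Hessel et al. (2021): 2.5 times the cosine similarity between the CLIP image and caption embeddings, clipped at zero. This assumes the Hugging Face transformers library and the common "openai/clip-vit-base-patch32" checkpoint; it is an illustration of the idea, not the authors' reference implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with the same API would work.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, caption: str) -> float:
    """Reference-free CLIPScore: 2.5 * max(cos(image_emb, caption_emb), 0)."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # L2-normalise, take cosine similarity, clip at 0, rescale by 2.5.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1).item()
    return 2.5 * max(cos, 0.0)
```

The score rewards captions whose embedding points in the same direction as the image embedding, which is exactly why it can stay high for a fluent caption that hallucinates a plausible but absent object.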
CapsBench: A Closer Look
Playground v3 (PGV3) introduced CapsBench, one of the most comprehensive efforts to benchmark caption evaluation metrics. It tests evaluators across multiple dimensions, including object hallucination, attribute accuracy, and relationship correctness. I've been reproducing CapsBench to understand its strengths and, more importantly, where it breaks down.
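For the reproduction, I keep per-dimension bookkeeping so that failure modes don't get averaged away. The sketch below is my own harness scaffolding, not part of CapsBench itself: the dimension names mirror the ones listed above, and the field names and the notion of "metric prefers the good caption" are hypothetical simplifications.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CaseResult:
    # e.g. "object_hallucination", "attribute_accuracy", "relationship_correctness"
    dimension: str
    # Did the metric under test rank the correct caption above the corrupted one?
    metric_prefers_good: bool

def per_dimension_accuracy(results: list[CaseResult]) -> dict[str, float]:
    """Fraction of cases where the metric agrees with the gold label, per dimension."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.dimension] += 1
        hits[r.dimension] += int(r.metric_prefers_good)
    return {dim: hits[dim] / totals[dim] for dim in totals}
```

Breaking results out this way is what makes it possible to say not just "metric X scores 0.8 overall" but "metric X is nearly blind to relationship errors", which is where the interesting failures tend to hide.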