<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Benchmarking on</title><link>https://c-allergic.github.io/tags/benchmarking/</link><description>Recent content in Benchmarking on</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 11 May 2026 00:00:00 +0800</lastBuildDate><atom:link href="https://c-allergic.github.io/tags/benchmarking/index.xml" rel="self" type="application/rss+xml"/><item><title>Reproducing CapsBench: What Makes a Good Caption Evaluator?</title><link>https://c-allergic.github.io/blog/reproducing-capsbench/</link><pubDate>Mon, 11 May 2026 00:00:00 +0800</pubDate><guid>https://c-allergic.github.io/blog/reproducing-capsbench/</guid><description>&lt;h2 id="why-caption-evaluation-matters"&gt;Why Caption Evaluation Matters&lt;/h2&gt;
&lt;p&gt;Automated image captioning has come a long way, but evaluation hasn&amp;rsquo;t kept up. BLEU and METEOR were borrowed from machine translation, and even the captioning-specific CIDEr still scores n-gram overlap against reference captions, so none of them capture the nuanced quality dimensions of modern generative captioners. More recent metrics like CLIPScore and RefCLIPScore measure image-text alignment directly, but they still miss fine-grained issues such as hallucination, style appropriateness, and factual grounding.&lt;/p&gt;
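&lt;p&gt;To make the contrast concrete, here is a minimal sketch of a CLIPScore-style computation, assuming the openai/clip-vit-base-patch32 checkpoint from Hugging Face &lt;code&gt;transformers&lt;/code&gt;; the 2.5 scaling factor follows the original CLIPScore paper, and the rest is an illustration rather than the reference implementation.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(image: Image.Image, caption: str) -&gt; float:
    # Embed the image and the caption in the joint CLIP space.
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # image_embeds / text_embeds are the projected, L2-normalized CLIP embeddings.
    cos = (out.image_embeds * out.text_embeds).sum(dim=-1).item()
    # CLIPScore rescales the clipped cosine similarity by w = 2.5.
    return 2.5 * max(cos, 0.0)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference-free CLIPScore uses only the image and the candidate caption; RefCLIPScore additionally folds in similarity to reference captions.&lt;/p&gt;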
&lt;h2 id="capsbench-a-closer-look"&gt;CapsBench: A Closer Look&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/abs/2409.10695"&gt;PGV3&lt;/a&gt;&amp;rsquo;s &lt;strong&gt;CapsBench&lt;/strong&gt; is one of the most comprehensive efforts to benchmark caption evaluation metrics. It tests evaluators across multiple dimensions: object hallucination, attribute accuracy, relationship correctness, and more. I&amp;rsquo;ve been reproducing CapsBench to understand its strengths and — more importantly — where it breaks down.&lt;/p&gt;</description></item></channel></rss>
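&lt;p&gt;For a sense of what the reproduction involves, below is a toy sketch of per-dimension scoring. The &lt;code&gt;Question&lt;/code&gt; fields and dimension labels are placeholders I chose for illustration; they are not CapsBench&amp;rsquo;s actual schema.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Question:
    dimension: str   # e.g. "object", "attribute", "relationship" (illustrative labels)
    expected: str    # ground-truth answer for the image
    predicted: str   # answer derived from the candidate caption alone

def per_dimension_accuracy(questions):
    # Tally correctness separately per dimension so weak spots show up on their own.
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        total[q.dimension] += 1
        correct[q.dimension] += int(q.predicted.strip().lower() == q.expected.strip().lower())
    return {dim: correct[dim] / total[dim] for dim in total}
&lt;/code&gt;&lt;/pre&gt;
</description></item></channel></rss>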