# Awesome Generation Model Evaluation

A curated survey of image generation quality assessment — 2025 & 2026

## Core Benchmarks

| Benchmark | Venue | Scale | Key Insight | Code |
|---|---|---|---|---|
| T2I-CoReBench | ICLR 2026 | 1,080 prompts, 13.5K items | 12-dim taxonomy (composition + reasoning); supersedes GenEval | GitHub |
| ImagenWorld | ICLR 2026 | 3.6K conditions, 20K annotations | 6 tasks × 6 domains; VLM auto-eval Kendall τ = 0.79 | GitHub |
| CoBench | CVPR 2026 | 319K generated images | Unified semantic-spatial evaluation for layout-guided diffusion | GitHub |
| UEval | 2026.01 | 1,000 expert questions, 10.4K rubrics | 8 real-world scenarios; GPT-5-Thinking scores only 66.4/100 | |
| SciGenBench | arXiv 2026.01 | | Scientific correctness via information utility & logical validity | |
| ScImage | ICLR 2025 | 7 models, 11 scientist evaluators | All models struggle on complex scientific prompts | GitHub |
| T2I-CompBench++ | TPAMI 2025 | 8,000 prompts, 8 sub-categories | Adds 3D-spatial & numeracy dimensions | GitHub |
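
ImagenWorld reports agreement between its VLM auto-evaluator and human annotators as a Kendall rank correlation (τ = 0.79). As a quick illustration of what that number measures, here is a minimal sketch with SciPy; the scores below are made up for illustration, not ImagenWorld data:

```python
from scipy.stats import kendalltau

# Hypothetical quality scores for five generated images:
# one set from human annotators, one from a VLM auto-evaluator.
human_scores = [5, 3, 4, 1, 2]
vlm_scores = [4.8, 3.5, 3.1, 1.2, 2.0]

# Kendall's tau counts concordant vs. discordant item pairs:
# tau = (C - D) / (n * (n - 1) / 2). Here 9 of 10 pairs agree in order.
tau, p_value = kendalltau(human_scores, vlm_scores)
print(f"Kendall tau = {tau:.2f}")  # one swapped pair out of 10 -> tau = 0.80
```

A τ of 0.79 therefore means the auto-evaluator orders roughly nine out of every ten image pairs the same way humans do.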

## Metrics & Evaluation Frameworks

| Paper | Venue | Contribution | Code |
|---|---|---|---|
| cFreD | WACV 2026 | Conditional Fréchet Distance — unified quality + alignment score | GitHub |
| Beyond Text-Image Alignment | ICCV 2025 | ICT Score + HP Score; reward models penalize high-aesthetic images | |
| Guidance Matters (GA-Eval) | ICLR 2026 | Reveals CFG bias; guidance-aware fair comparison framework | |
| Color Fidelity Benchmark | arXiv 2026.03 | 1.3M+ CFD dataset + CFM metric; vivid-color bias quantified | GitHub |
| MMHM | arXiv 2026.03 | MinMax Harmonic Mean: FID + IS + CLIP Score + Pick Score composite | |
| Multimodal Benchmarking | arXiv 2025.05 | Weighted Score + CLIP + LPIPS + FID suite on DeepFashion-MM | |
| SVGauge | ICIAP 2025 | SigLIP + BLIP-2 + SBERT metric; FID/LPIPS/CLIPScore fail on SVG | |
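
cFreD extends the Fréchet distance underlying FID to condition on the text prompt. The shared unconditional core compares two Gaussians fitted to feature embeddings of real vs. generated images. A minimal sketch of that core in plain NumPy/SciPy (this is the textbook formula, not the cFreD implementation):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # sqrtm may return tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical distributions score 0; shifting the mean by a unit vector adds 1.
mu, sigma = np.zeros(2), np.eye(2)
print(frechet_distance(mu, sigma, mu, sigma))                         # ~0.0
print(frechet_distance(mu, sigma, mu + np.array([1.0, 0.0]), sigma))  # ~1.0
```

In FID the statistics (mu, sigma) come from Inception features pooled over all images; cFreD's contribution is computing them conditionally on the prompt, so the single score reflects both image quality and text alignment.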

## VLM / LLM-based Automated Evaluation

| Paper | Venue | Contribution | Code |
|---|---|---|---|
| MJ-Bench | NeurIPS 2025 | 6-perspective multimodal judge benchmark; GPT-4o best overall | GitHub |
| VisualQuality-R1 | NeurIPS 2025 | RL2R + GRPO + Thurstone ranking; SOTA on 8 IQA datasets | GitHub |
| EvoQuality | ICLR 2026 | Self-supervised VLM IQA; +31.8% zero-shot PLCC | |
| VLM-as-Judge + Specialist | WACV 2026 | ICL + CoT fine-tuning improves VLM judging alignment by 13% (Adobe) | |
| TIQA / ANTIQA | 2026.03 | Text rendering quality in generated images; +14% human-rated improvement | GitHub |
| DIQ-H | arXiv 2025.12 | Degraded Image Quality → Hallucination; VIR improves VLM accuracy from 72% to 83% | |
| VLM-RobustBench | arXiv 2026.03 | 49 augmentations × 133 corrupted settings; VLMs spatially fragile | |
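
VisualQuality-R1 aggregates pairwise preferences into global quality scores via Thurstone-style ranking. A minimal sketch of the classic Case V idea (the win-probability matrix below is made up for illustration and is not VisualQuality-R1's pipeline):

```python
import numpy as np
from scipy.stats import norm

def thurstone_case_v(win_prob):
    """Thurstone Case V: the latent score gap between items i and j is
    Phi^{-1}(P(i beats j)). Averaging that probit over all opponents
    places every item on one common scale (scores sum to zero)."""
    p = np.clip(win_prob, 1e-6, 1.0 - 1e-6)  # avoid infinite probits at 0 or 1
    return norm.ppf(p).mean(axis=1)

# Made-up pairwise win rates among three models (row beats column).
wins = np.array([
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.8],
    [0.1, 0.2, 0.5],
])
scores = thurstone_case_v(wins)
print(scores)  # descending: model 0 > model 1 > model 2
```

The appeal of this style of aggregation is that pairwise "which image is better?" judgments, which VLM judges answer far more reliably than absolute ratings, still yield a calibrated global ranking.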

## Domain-Specific Evaluation

| Paper | Domain | Key Finding |
|---|---|---|
| Artistic Image Review (KSII) | Artistic Images | Hybrid framework coupling automated metrics with domain-expert assessment |
| Text-in-Image Benchmark | Text Rendering | Applied Sciences (MDPI); models struggle with structural precision & domain accuracy |
| SVGauge | SVG Generation | FID / LPIPS / CLIPScore fail on vector graphics |
| Fairness, Diversity & Reliability | Social Impact | Embedding-space perturbation framework; AI Review 2026 |

## Key Trends (2025–2026)

## Code Availability

| Paper | Repository | Status |
|---|---|---|
| T2I-CoReBench | KwaiVGI/T2I-CoReBench | |
| ImagenWorld | TIGER-AI-Lab/ImagenWorld | |
| CoBench | lparolari/cobench | |
| ScImage | leixin-zhang/Scimage | |
| T2I-CompBench++ | Karine-Huang/T2I-CompBench | |
| cFreD | JaywonKoo17/cFreD | |
| Color Fidelity (CFM) | ZhengyaoFang/CFM | |
| MJ-Bench | aimi-lab/MJ-Bench | |
| VisualQuality-R1 | TianheWu/VisualQuality-R1 | |
| TIQA / ANTIQA | koltsov-cmc/antiqa | |
| UEval | | |
| SciGenBench | | |
| Beyond Text-Image Alignment | | |
| Guidance Matters (GA-Eval) | | |
| MMHM | | |
| Multimodal Benchmarking | | |
| SVGauge | | |
| EvoQuality | | |
| VLM-as-Judge + Specialist | | |
| DIQ-H | | |
| VLM-RobustBench | | |

## Recommended Reading (Top 5)

1. **T2I-CoReBench** (ICLR 2026) — New-generation benchmark defining a 12-dimension evaluation taxonomy spanning composition and reasoning.
2. **ImagenWorld** (ICLR 2026) — Largest-scale explainable human evaluation; upper-bound analysis of VLM auto-evaluation.
3. **Guidance Matters (GA-Eval)** (ICLR 2026) — Exposes fundamental flaws in current evaluation paradigms: CFG bias and over-saturation.
4. **cFreD** (WACV 2026) — Theoretical framework unifying image quality and text alignment into a single score.
5. **EvoQuality** (ICLR 2026) — Self-supervised IQA paradigm: state-of-the-art without any ground-truth labels.