# Awesome Generation Model Evaluation

A curated survey of image generation quality assessment — 2025 & 2026

## Core Benchmarks

| Benchmark | Venue | Scale | Key Insight | Code |
|---|---|---|---|---|
| T2I-CoReBench | ICLR 2026 | 1,080 prompts, 13.5K items | 12-dim taxonomy (composition + reasoning); supersedes GenEval | GitHub |
| ImagenWorld | ICLR 2026 | 3.6K conditions, 20K annotations | 6 tasks × 6 domains; VLM auto-eval Kendall τ = 0.79 | GitHub |
| CoBench | CVPR 2026 | 319K generated images | Unified semantic-spatial evaluation for layout-guided diffusion | GitHub |
| UEval | 2026.01 | 1,000 expert questions, 10.4K rubrics | 8 real-world scenarios; GPT-5-Thinking scores only 66.4/100 | |
| SciGenBench | arXiv 2026.01 | | Scientific correctness via information utility & logical validity | |
| ScImage | ICLR 2025 | 7 models, 11 scientist evaluators | All models struggle on complex scientific prompts | GitHub |
| T2I-CompBench++ | TPAMI 2025 | 8,000 prompts, 8 sub-categories | Adds 3D-spatial & numeracy dimensions | GitHub |
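
ImagenWorld reports agreement between its VLM auto-evaluator and human annotators as a Kendall rank correlation (τ = 0.79). As a quick illustration of what that number measures, here is a minimal sketch with SciPy; the scores below are made up for illustration, not ImagenWorld data:

```python
from scipy.stats import kendalltau

# Hypothetical quality scores for five generated images:
# one set from human annotators, one from a VLM auto-evaluator.
human_scores = [5, 3, 4, 1, 2]
vlm_scores = [4.8, 3.5, 3.1, 1.2, 2.0]

# Kendall's tau counts concordant vs. discordant item pairs:
# tau = (C - D) / (n * (n - 1) / 2). Here 9 of 10 pairs agree in order.
tau, p_value = kendalltau(human_scores, vlm_scores)
print(f"Kendall tau = {tau:.2f}")  # one swapped pair out of 10 -> tau = 0.80
```

A τ of 0.79 therefore means the auto-evaluator orders roughly nine out of every ten image pairs the same way humans do.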

## Metrics & Evaluation Frameworks

| Paper | Venue | Contribution | Code |
|---|---|---|---|
| cFreD | WACV 2026 | Conditional Fréchet Distance — unified quality + alignment score | GitHub |
| Beyond Text-Image Alignment | ICCV 2025 | ICT Score + HP Score; reward models penalize high-aesthetic images | |
| Guidance Matters (GA-Eval) | ICLR 2026 | Reveals CFG bias; guidance-aware fair comparison framework | |
| Color Fidelity Benchmark | arXiv 2026.03 | 1.3M+ CFD dataset + CFM metric; vivid-color bias quantified | GitHub |
| MMHM | arXiv 2026.03 | MinMax Harmonic Mean: FID + IS + CLIP Score + Pick Score composite | |
| Multimodal Benchmarking | arXiv 2025.05 | Weighted Score + CLIP + LPIPS + FID suite on DeepFashion-MM | |
| SVGauge | ICIAP 2025 | SigLIP + BLIP-2 + SBERT metric; FID/LPIPS/CLIPScore fail on SVG | |
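
cFreD extends the Fréchet distance underlying FID to condition on the text prompt. The shared unconditional core compares two Gaussians fitted to feature embeddings of real vs. generated images. A minimal sketch of that core in plain NumPy/SciPy (this is the textbook formula, not the cFreD implementation):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # sqrtm may return tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical distributions score 0; shifting the mean by a unit vector adds 1.
mu, sigma = np.zeros(2), np.eye(2)
print(frechet_distance(mu, sigma, mu, sigma))                         # ~0.0
print(frechet_distance(mu, sigma, mu + np.array([1.0, 0.0]), sigma))  # ~1.0
```

In FID the statistics (mu, sigma) come from Inception features pooled over all images; cFreD's contribution is computing them conditionally on the prompt, so the single score reflects both image quality and text alignment.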

## VLM / LLM-based Automated Evaluation

| Paper | Venue | Contribution | Code |
|---|---|---|---|
| MJ-Bench | NeurIPS 2025 | 6-perspective multimodal judge benchmark; GPT-4o best overall | GitHub |
| VisualQuality-R1 | NeurIPS 2025 | RL2R + GRPO + Thurstone ranking; SOTA on 8 IQA datasets | GitHub |
| EvoQuality | ICLR 2026 | Self-supervised VLM IQA; +31.8% zero-shot PLCC | |
| VLM-as-Judge + Specialist | WACV 2026 | ICL + CoT fine-tuning improves VLM judging alignment by 13% (Adobe) | |
| TIQA / ANTIQA | 2026.03 | Text rendering quality in generated images; +14% human-rated improvement | GitHub |
| DIQ-H | arXiv 2025.12 | Degraded Image Quality → Hallucination; VIR improves VLM accuracy from 72% to 83% | |
| VLM-RobustBench | arXiv 2026.03 | 49 augmentations × 133 corrupted settings; VLMs spatially fragile | |
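
VisualQuality-R1 aggregates pairwise preferences into global quality scores via Thurstone-style ranking. A minimal sketch of the classic Case V idea (the win-probability matrix below is made up for illustration and is not VisualQuality-R1's pipeline):

```python
import numpy as np
from scipy.stats import norm

def thurstone_case_v(win_prob):
    """Thurstone Case V: the latent score gap between items i and j is
    Phi^{-1}(P(i beats j)). Averaging that probit over all opponents
    places every item on one common scale (scores sum to zero)."""
    p = np.clip(win_prob, 1e-6, 1.0 - 1e-6)  # avoid infinite probits at 0 or 1
    return norm.ppf(p).mean(axis=1)

# Made-up pairwise win rates among three models (row beats column).
wins = np.array([
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.8],
    [0.1, 0.2, 0.5],
])
scores = thurstone_case_v(wins)
print(scores)  # descending: model 0 > model 1 > model 2
```

The appeal of this style of aggregation is that pairwise "which image is better?" judgments, which VLM judges answer far more reliably than absolute ratings, still yield a calibrated global ranking.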

## Domain-Specific Evaluation

| Paper | Domain | Key Finding |
|---|---|---|
| Artistic Image Review (KSII) | Artistic Images | Hybrid framework coupling automated metrics with domain-expert assessment |
| Text-in-Image Benchmark | Text Rendering | Applied Sciences (MDPI); models struggle with structural precision & domain accuracy |
| SVGauge | SVG Generation | FID / LPIPS / CLIPScore fail on vector graphics |
| Fairness, Diversity & Reliability | Social Impact | Embedding-space perturbation framework; AI Review 2026 |

## Key Trends (2025–2026)

## Code Availability

| Paper | Repository | Status |
|---|---|---|
| T2I-CoReBench | KwaiVGI/T2I-CoReBench | |
| ImagenWorld | TIGER-AI-Lab/ImagenWorld | |
| CoBench | lparolari/cobench | |
| ScImage | leixin-zhang/Scimage | |
| T2I-CompBench++ | Karine-Huang/T2I-CompBench | |
| cFreD | JaywonKoo17/cFreD | |
| Color Fidelity (CFM) | ZhengyaoFang/CFM | |
| MJ-Bench | aimi-lab/MJ-Bench | |
| VisualQuality-R1 | TianheWu/VisualQuality-R1 | |
| TIQA / ANTIQA | koltsov-cmc/antiqa | |
| UEval | | |
| SciGenBench | | |
| Beyond Text-Image Alignment | | |
| Guidance Matters (GA-Eval) | | |
| MMHM | | |
| Multimodal Benchmarking | | |
| SVGauge | | |
| EvoQuality | | |
| VLM-as-Judge + Specialist | | |
| DIQ-H | | |
| VLM-RobustBench | | |

## Recommended Reading (Top 5)

1. **T2I-CoReBench** (ICLR 2026) — New-generation benchmark defining a 12-dimension evaluation taxonomy spanning composition and reasoning.
2. **ImagenWorld** (ICLR 2026) — Largest-scale explainable human evaluation; upper-bound analysis of VLM auto-evaluation.
3. **Guidance Matters (GA-Eval)** (ICLR 2026) — Exposes fundamental flaws in current evaluation paradigms: CFG bias and over-saturation.
4. **cFreD** (WACV 2026) — Theoretical framework unifying image quality and text alignment into a single score.
5. **EvoQuality** (ICLR 2026) — Self-supervised IQA paradigm: state-of-the-art without any ground-truth labels.