# A curated survey of image generation quality assessment, 2025–2026

## Benchmarks

| Benchmark | Venue | Scale | Key Insight | Code |
|---|---|---|---|---|
| T2I-CoReBench | ICLR 2026 | 1,080 prompts, 13.5K items | 12-dim taxonomy (composition + reasoning); supersedes GenEval | GitHub |
| ImagenWorld | ICLR 2026 | 3.6K conditions, 20K annotations | 6 tasks × 6 domains; VLM auto-eval reaches Kendall's τ = 0.79 against human ratings (sketch after this table) | GitHub |
| CoBench | CVPR 2026 | 319K generated images | Unified semantic-spatial evaluation for layout-guided diffusion | GitHub |
| UEval | 2026.01 | 1,000 expert questions, 10.4K rubrics | 8 real-world scenarios; GPT-5-Thinking only 66.4/100 | — |
| SciGenBench | arXiv 2026.01 | — | Scientific correctness via information utility & logical validity | — |
| ScImage | ICLR 2025 | 7 models, 11 scientist evaluators | All models struggle on complex scientific prompts | GitHub |
| T2I-CompBench++ | TPAMI 2025 | 8,000 prompts, 8 sub-categories | Adds 3D-spatial & numeracy dimensions | GitHub |
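
Benchmarks such as ImagenWorld validate their VLM auto-evaluators by rank-correlating automatic scores against human annotations. Below is a minimal sketch of that meta-evaluation step, with purely hypothetical score arrays standing in for real annotation data:

```python
# Sketch: rank-correlate a VLM judge's scores with human ratings.
# ImagenWorld reports Kendall's tau = 0.79 for its auto-evaluator;
# the arrays here are illustrative placeholders, not real data.
from scipy.stats import kendalltau

# Per-image scores from a VLM judge and from human annotators
# (same images, same order).
vlm_scores = [4.5, 3.0, 2.0, 4.0, 1.5, 3.5]
human_scores = [5.0, 3.0, 2.5, 4.0, 1.0, 3.0]

tau, p_value = kendalltau(vlm_scores, human_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```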

## Metrics

| Paper | Venue | Contribution | Code |
|---|---|---|---|
| cFreD | WACV 2026 | Conditional Fréchet Distance: a unified quality + alignment score (sketch after this table) | GitHub |
| Beyond Text-Image Alignment | ICCV 2025 | ICT Score + HP Score; reward models penalize high-aesthetic images | — |
| Guidance Matters (GA-Eval) | ICLR 2026 | Reveals CFG bias; guidance-aware fair comparison framework | — |
| Color Fidelity Benchmark | arXiv 2026.03 | 1.3M+ CFD dataset + CFM metric; vivid-color bias quantified | GitHub |
| MMHM | arXiv 2026.03 | MinMax Harmonic Mean: composite of FID, IS, CLIP Score, and PickScore (sketch after this table) | — |
| Multimodal Benchmarking | arXiv 2025.05 | Weighted Score + CLIP + LPIPS + FID suite on DeepFashion-MM | — |
| SVGauge | ICIAP 2025 | SigLIP + BLIP-2 + SBERT metric; FID/LPIPS/CLIPScore fail on SVG | — |
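
For context on the cFreD row above: cFreD builds on the classical Fréchet distance between Gaussians fitted to image features, the same quantity FID is built on, and additionally conditions those statistics on the prompt. A minimal sketch of the unconditional distance follows; feature extraction (e.g., Inception or CLIP embeddings) is assumed to have happened upstream, and the random inputs are placeholders:

```python
# Sketch: Frechet distance between Gaussians fitted to two feature sets.
# FID applies this to Inception features; cFreD conditions the statistics
# on the prompt (see the linked repo for the actual conditional variant).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """d^2 = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Illustrative call on random features; real use needs thousands of embeddings.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 64)), rng.normal(size=(256, 64))))
```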
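
The MMHM row names its ingredients but not the normalization. The sketch below assumes "MinMax" means min-max scaling of each metric to [0, 1], inverting lower-is-better metrics such as FID, followed by a harmonic mean; the paper's exact construction may differ, and every number is illustrative:

```python
# Hypothetical MMHM-style composite: min-max normalize each metric, then
# take the harmonic mean so that one weak component drags the score down.
from statistics import harmonic_mean

def min_max(value, lo, hi, lower_is_better=False):
    """Scale a raw metric into [0, 1] given bounds observed across models."""
    scaled = (value - lo) / (hi - lo)
    return 1.0 - scaled if lower_is_better else scaled

# Illustrative raw scores for one model, with assumed cross-model bounds.
components = [
    min_max(18.0, lo=10.0, hi=60.0, lower_is_better=True),  # FID
    min_max(9.2, lo=5.0, hi=12.0),                          # Inception Score
    min_max(0.31, lo=0.20, hi=0.36),                        # CLIP Score
    min_max(21.5, lo=18.0, hi=23.0),                        # PickScore
]
print(f"MMHM-style composite: {harmonic_mean(components):.3f}")
```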

## VLM judges & learned IQA

| Paper | Venue | Contribution | Code |
|---|---|---|---|
| MJ-Bench | NeurIPS 2025 | 6-perspective multimodal judge benchmark; GPT-4o best overall | GitHub |
| VisualQuality-R1 | NeurIPS 2025 | RL2R + GRPO + Thurstone ranking (sketch after this table); SOTA on 8 IQA datasets | GitHub |
| EvoQuality | ICLR 2026 | Self-supervised VLM IQA; +31.8% zero-shot PLCC | — |
| VLM-as-Judge + Specialist | WACV 2026 | ICL + CoT fine-tuning improves VLM judging alignment by 13% (Adobe) | — |
| TIQA / ANTIQA | 2026.03 | Text rendering quality in generated images; +14% human-rated improvement | GitHub |
| DIQ-H | arXiv 2025.12 | Degraded image quality → hallucination; VIR improves VLM performance from 72% to 83% | — |
| VLM-RobustBench | arXiv 2026.03 | 49 augmentations × 133 corrupted settings; VLMs spatially fragile | — |
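
The Thurstone ranking in VisualQuality-R1 rests on a standard comparison model: each image receives a latent quality with mean mu and standard deviation sigma, and the probability that one image is preferred over another is a Gaussian CDF of the standardized mean difference. A minimal sketch of that preference probability (the paper's exact loss and parameterization may differ):

```python
# Sketch: Thurstone-model preference probability between two images whose
# latent qualities are Gaussian with predicted means and standard deviations.
from math import sqrt
from scipy.stats import norm

def thurstone_prob(mu_a, sigma_a, mu_b, sigma_b):
    """P(quality_A > quality_B) = Phi((mu_a - mu_b) / sqrt(s_a^2 + s_b^2))."""
    return norm.cdf((mu_a - mu_b) / sqrt(sigma_a**2 + sigma_b**2))

# Illustrative latent scores predicted for two generated images.
print(f"P(A preferred over B) = {thurstone_prob(3.8, 0.5, 3.2, 0.5):.3f}")
```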

## Domain-specific evaluation

| Paper | Domain | Key Finding |
|---|---|---|
| Artistic Image Review (KSII) | Artistic Image | Hybrid framework coupling automated metrics with domain-expert assessment |
| Text-in-Image Benchmark (Applied Sciences, MDPI) | Text Rendering | Models struggle with structural precision & domain accuracy |
| SVGauge | SVG Generation | FID / LPIPS / CLIPScore fail on vector graphics |
| Fairness, Diversity & Reliability (AI Review 2026) | Social Impact | Embedding-space perturbation framework |

## Code availability

| Paper | Repository | Status |
|---|---|---|
| T2I-CoReBench | KwaiVGI/T2I-CoReBench | ✓ |
| ImagenWorld | TIGER-AI-Lab/ImagenWorld | ✓ |
| CoBench | lparolari/cobench | ✓ |
| ScImage | leixin-zhang/Scimage | ✓ |
| T2I-CompBench++ | Karine-Huang/T2I-CompBench | ✓ |
| cFreD | JaywonKoo17/cFreD | ✓ |
| Color Fidelity (CFM) | ZhengyaoFang/CFM | ✓ |
| MJ-Bench | aimi-lab/MJ-Bench | ✓ |
| VisualQuality-R1 | TianheWu/VisualQuality-R1 | ✓ |
| TIQA / ANTIQA | koltsov-cmc/antiqa | ✓ |
| UEval | — | ✗ |
| SciGenBench | — | ✗ |
| Beyond Text-Image Alignment | — | ✗ |
| Guidance Matters (GA-Eval) | — | ✗ |
| MMHM | — | ✗ |
| Multimodal Benchmarking | — | ✗ |
| SVGauge | — | ✗ |
| EvoQuality | — | ✗ |
| VLM-as-Judge + Specialist | — | ✗ |
| DIQ-H | — | ✗ |
| VLM-RobustBench | — | ✗ |