Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models
Abstract
Generative image models produce striking visuals yet often misrepresent culture. Prior work has probed cultural dimensions primarily in text-to-image (T2I) systems, leaving image-to-image (I2I) editors largely underexamined. We close this gap with a unified evaluation spanning six countries, an 8-category/36-subcategory schema, and era-aware prompts, auditing both T2I generation and I2I editing under a standardized, reproducible protocol that yields comparable model-level diagnostics.
Using open models with fixed configurations, we derive comparable diagnostics across countries, eras, and categories for both T2I and I2I. Our evaluation combines standard automatic measures, a culture-aware metric that integrates retrieval-augmented VQA with curated knowledge, and expert human judgments collected on a web platform from country-native reviewers. To enable downstream analyses without re-running compute-intensive pipelines, we release the complete image corpus from both studies alongside prompts and settings.
Our study reveals three recurring findings. First, under country-agnostic prompts, models default to Global-North, modern-leaning depictions and flatten cross-country distinctions, reducing separability between culturally distinct neighbors despite fixed schema and era controls. Second, iterative I2I editing erodes cultural fidelity even when conventional metrics remain flat or improve; by contrast, expert ratings and our culture-aware metric both register this degradation. Third, I2I models tend to apply superficial cues (palette shifts, generic props) rather than context- and era-consistent changes, frequently retaining source identity for Global-South targets and drifting toward non-photorealistic styles; attribute-addition trials further expose weak text rendering and brittle handling of fine, culture-specific details. Taken together, these results indicate that culture-sensitive edits remain unreliable in current systems. By standardizing prompts, settings, metrics, and human evaluation protocols—and releasing all images and configurations—we offer a reproducible, culture-centered pipeline for diagnosing and tracking progress in generative image research. Project page: https://seochan99.github.io/ECB/
A schema spanning six countries, eight categories, and three era-aware prompt modes feeds unified T2I and I2I protocols.
Iterative edits keep CLIPScore flat while human ratings expose rapid cultural degradation.
Attribute addition exposes text-rendering failures and added accessories that break realism.
Cross-country restyling leans on palette swaps and leaves Global-South identities unchanged.
Research Contributions
Era-aware Benchmarking Suite
Six countries, eight categories, and 36 subcategories with traditional, modern, and era-agnostic prompts expose temporal blindspots in generative priors.
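A minimal sketch of how such a schema expands into prompts. The country and category lists below are illustrative placeholders, not the released 6-country / 8-category / 36-subcategory grid, and the template wording is an assumption:

```python
# Illustrative excerpt of the era-aware prompt schema. The names below are
# examples only; the released benchmark defines the full grid.
COUNTRIES = ["South Korea", "Nigeria", "United States"]        # 6 in the benchmark
CATEGORIES = {"Food": ["street food", "festive meal"],         # 8 categories,
              "Clothing": ["wedding attire", "daily wear"]}    # 36 subcategories
ERA_TEMPLATES = {
    "traditional": "A photo of traditional {sub} in {country}.",
    "modern": "A photo of modern {sub} in {country}.",
    "agnostic": "A photo of {sub} in {country}.",
}

def build_prompts():
    """Enumerate one prompt per (country, subcategory, era) cell."""
    for country in COUNTRIES:
        for subs in CATEGORIES.values():
            for sub in subs:
                for era, tmpl in ERA_TEMPLATES.items():
                    yield era, tmpl.format(sub=sub, country=country)

for era, prompt in list(build_prompts())[:3]:
    print(era, "->", prompt)
```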
Integrated T2I × I2I Protocols
Base T2I generations seed three complementary editing studies—multi-loop, attribute addition, cross-country restyle—capturing how bias propagates across pipelines.
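A control-flow sketch of this seeding protocol. `t2i`, `i2i_edit`, and the instruction strings are hypothetical stand-ins for the released model configs; only the structure mirrors the three studies:

```python
# Minimal driver sketch for the T2I -> I2I protocol. One base generation
# seeds all three editing studies described above.
def run_studies(prompt, country, t2i, i2i_edit, n_loops=5):
    base = t2i(prompt)  # base T2I generation

    # Study 1: multi-loop editing, re-applying a culture-refinement instruction.
    loop_imgs, img = [base], base
    for _ in range(n_loops):
        img = i2i_edit(img, f"make this more culturally faithful to {country}")
        loop_imgs.append(img)

    # Study 2: attribute addition, inserting a fine-grained cultural attribute.
    attr_img = i2i_edit(base, f"add a traditional {country} signboard with legible text")

    # Study 3: cross-country restyle, transferring the scene to the target country.
    restyle_img = i2i_edit(base, f"restyle this scene as if set in {country}")

    return loop_imgs, attr_img, restyle_img
```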
Culture-aware Evaluation Signal
A retrieval-augmented VQA metric surfaces cultural drift that CLIPScore and aesthetic scores overlook, aligning tightly with native reviewer decisions.
Open Audit Stack
We release generations, prompts, execution configs, and the survey platform—ready to fork into future fairness checkpoints and longitudinal monitoring.
Observed Failure Modes
- Global-North defaults: Country-agnostic prompts converge to U.S.-centric, modern aesthetics even with controlled schema and era cues.
- Metric-human gap: CLIPScore and aesthetic metrics remain stable while the culture-aware signal and native raters flag rapid semantic drift (see the probe sketch after this list).
- Shortcut editing: Iterative edits substitute palette shifts and flags for genuine cultural attributes, often retaining the source identity for Global-South targets.
- Demographic skew: Gender-neutral occupation prompts still yield predominantly male, light-skinned depictions, highlighting systematic data imbalance.
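To make the metric-human gap concrete, here is a minimal CLIPScore probe over edit iterations. It uses the openai/clip-vit-base-patch32 checkpoint as an illustrative scorer; the paper's exact metric configuration lives in the released settings:

```python
# Score each edit iteration with CLIPScore and check whether the curve stays
# flat even as cultural fidelity visibly degrades.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, text: str) -> float:
    """CLIPScore (Hessel et al., 2021): 2.5 * max(cos(E_img, E_txt), 0)."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(cos, 0.0)

# Example: score the images from edit iterations 0..5 against one prompt.
# scores = [clip_score(img, "a traditional Korean wedding") for img in loop_imgs]
```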
Multi-loop Editing Drift
Five successive instructions should converge toward culturally faithful depictions. Instead, models collapse to a homogenized wedding aesthetic irrespective of the target locale.
Traditional attire dissolves into Global-North gowns, regional symbols fade to generic décor, and palette tweaks masquerade as meaningful edits.
Rows correspond to countries, columns to edit iterations 0→5. Automatic metrics stay optimistic while cultural fidelity visibly deteriorates.
Iterative edits flatten regional diversity: prompts request country-specific weddings, but outputs converge toward the same Global-North style.
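A minimal sketch of the multi-loop protocol, using InstructPix2Pix from diffusers as an illustrative open editor; the benchmarked models, instructions, and sampling settings are those in the released configs, and the input filename is a placeholder:

```python
# Re-apply one editing instruction five times and save every iteration,
# mirroring the 0 -> 5 edit-loop grids shown above.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("base_wedding_korea.png").convert("RGB")  # iteration 0 (T2I seed)
instruction = "Make this wedding scene culturally faithful to South Korea."

for step in range(5):  # iterations 1..5
    image = pipe(instruction, image=image,
                 num_inference_steps=20, image_guidance_scale=1.5).images[0]
    image.save(f"edit_loop_{step + 1}.png")
```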
Culture-aware Metric Matches Native Reviewers
Our retrieval-augmented VQA metric agrees with country-native reviewers on 74% of best selections and 84% of worst selections, capturing cultural erosion that generic scores overlook.
We ground each evaluation in Wikipedia-derived context retrieved via FAISS and answer culture-grounded questions with Qwen2 models, enabling scalable yet culturally aware diagnostics.
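A simplified sketch of this scoring step, assuming a prebuilt list of Wikipedia passages per country. `vqa_answer` is a hypothetical stand-in for the Qwen2-based QA call, and the binary yes/no scoring is a simplification of the released rubric:

```python
# Retrieval-augmented VQA scoring: retrieve cultural context with FAISS,
# then ask a culture-grounded question about the image.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(passages):
    """Index normalized passage embeddings for inner-product (cosine) search."""
    emb = encoder.encode(passages, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(np.asarray(emb, dtype=np.float32))
    return index

def retrieve(index, passages, query, k=3):
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype=np.float32), k)
    return [passages[i] for i in ids[0]]

def culture_score(image, question, index, passages, vqa_answer):
    """Condition the VQA model on retrieved context and score its answer."""
    context = " ".join(retrieve(index, passages, question))
    answer = vqa_answer(image, f"Context: {context}\nQuestion: {question}")
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0
```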
Expert-in-the-loop Evaluation Platform
Native raters evaluate image quality and cultural representation for each editing loop using the ECB Human Survey platform, ensuring judgments reflect emic expertise.
The tool supports IRB documentation, reviewer dashboards, and best/worst selection workflows, making it easier to repeat cultural audits in new domains.
Beyond Culture: Occupational Bias
Even with gender-neutral prompts, occupation generations skew male for leadership roles and predominantly show light skin tones, while caregiving roles skew female.
These findings reinforce that cultural fidelity and demographic fairness must be evaluated together when deploying generative models.
Read the Paper
Dive into the full methodology, extended analyses, and deployment guidance.
- Full benchmark specification with prompt schema, sampling controls, and evaluation circuits.
- Expanded qualitative audits, survey instrumentation, and reproducible model configurations.
- High-resolution figures, tables, and appendices ready for downstream analysis.
BibTeX
@article{seo2025exposing,
  title={Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models},
  author={Seo, Huichan and Choi, Sieun and Hong, Minki and Zhou, Yi and Kim, Junseo and Ismaila, Lukman and Etori, Naome and Agarwal, Mehul and Liu, Zhixuan and Kim, Jihie and Oh, Jean},
  journal={arXiv preprint arXiv:2510.20042},
  year={2025},
  url={https://arxiv.org/abs/2510.20042}
}