Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models

Carnegie Mellon University · Dongguk University · Delft University of Technology · Johns Hopkins University School of Medicine · University of Minnesota Twin Cities
*Equal contribution
Representative cultural blindspots in text-to-image generations across six countries.

Country-agnostic prompts quickly default to Global-North aesthetics; mis-styled Indian weddings, flattened East Asian ceremony cues, wildlife stereotypes, and misplaced rituals motivate our cultural audit.

Abstract

Generative image models produce striking visuals yet often misrepresent culture. Prior work has probed cultural dimensions primarily in text-to-image (T2I) systems, leaving image-to-image (I2I) editors largely underexamined. We close this gap with a unified evaluation spanning six countries, an 8-category/36-subcategory schema, and era-aware prompts, auditing both T2I generation and I2I editing under a standardized, reproducible protocol that yields comparable model-level diagnostics.

Using open models with fixed configurations, we derive comparable diagnostics across countries, eras, and categories for both T2I and I2I. Our evaluation combines standard automatic measures, a culture-aware metric that integrates retrieval-augmented VQA with curated knowledge, and expert human judgments collected on a web platform from country-native reviewers. To enable downstream analyses without re-running compute-intensive pipelines, we release the complete image corpus from both studies alongside prompts and settings.

Our study reveals three recurring findings. First, under country-agnostic prompts, models default to Global-North, modern-leaning depictions and flatten cross-country distinctions, reducing separability between culturally distinct neighbors despite fixed schema and era controls. Second, iterative I2I editing erodes cultural fidelity even when conventional metrics remain flat or improve; by contrast, expert ratings and our culture-aware metric both register this degradation. Third, I2I models tend to apply superficial cues (palette shifts, generic props) rather than context- and era-consistent changes, frequently retaining source identity for Global-South targets and drifting toward non-photorealistic styles; attribute-addition trials further expose weak text rendering and brittle handling of fine, culture-specific details. Taken together, these results indicate that culture-sensitive edits remain unreliable in current systems. By standardizing prompts, settings, metrics, and human evaluation protocols—and releasing all images and configurations—we offer a reproducible, culture-centered pipeline for diagnosing and tracking progress in generative image research. Project page: https://seochan99.github.io/ECB/

Research Contributions

Era-aware Benchmarking Suite

Six countries, eight categories, and 36 subcategories with traditional, modern, and era-agnostic prompts expose temporal blindspots in generative priors.
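
A minimal sketch of how such an era-aware prompt grid can be enumerated, assuming a single shared template; the country, category, and subcategory entries below are illustrative placeholders, not the benchmark's exact schema.

from itertools import product

# Placeholder lists: the benchmark covers six countries, eight categories,
# and 36 subcategories; the names here are illustrative only.
COUNTRIES = ["India", "Korea", "Nigeria"]
CATEGORIES = {"celebrations": ["wedding", "harvest festival"]}
ERAS = ["traditional", "modern", None]  # None = era-agnostic

TEMPLATE = "A photo of a {era}{subcategory} in {country}"

def build_prompts():
    """Yield (country, category, subcategory, era, prompt) rows for the grid."""
    for country, (category, subs) in product(COUNTRIES, CATEGORIES.items()):
        for subcategory, era in product(subs, ERAS):
            era_text = f"{era} " if era else ""
            yield country, category, subcategory, era, TEMPLATE.format(
                era=era_text, subcategory=subcategory, country=country)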

Integrated T2I ✕ I2I Protocols

Base T2I generations seed three complementary editing studies—multi-loop, attribute addition, cross-country restyle—capturing how bias propagates across pipelines.
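
A minimal sketch of the multi-loop protocol under stated assumptions: a diffusers img2img pipeline stands in for the editors we study, and the checkpoint, strength, and seed are illustrative defaults rather than the released configurations.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Illustrative open checkpoint; the audited models and fixed settings
# are listed in the released execution configs.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def multi_loop_edit(seed_image: Image.Image, instructions: list[str],
                    strength: float = 0.6, seed: int = 0) -> list[Image.Image]:
    """Apply successive edit instructions and keep every intermediate image."""
    generator = torch.Generator("cuda").manual_seed(seed)
    outputs, current = [], seed_image
    for prompt in instructions:
        current = pipe(prompt=prompt, image=current, strength=strength,
                       generator=generator).images[0]
        outputs.append(current)
    return outputs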

Culture-aware Evaluation Signal

A retrieval-augmented VQA metric surfaces cultural drift that CLIPScore and aesthetic scores overlook, aligning tightly with native reviewer decisions.

Open Audit Stack

We release generations, prompts, execution configs, and the survey platform—ready to fork into future fairness checkpoints and longitudinal monitoring.

Observed Failure Modes

  • Global-North defaults: Country-agnostic prompts converge to U.S.-centric, modern aesthetics even with controlled schema and era cues.
  • Metric-human gap: CLIPScore and aesthetic metrics remain stable while the culture-aware signal and native raters flag rapid semantic drift (see the sketch after this list).
  • Shortcut editing: Iterative edits substitute palette shifts and flags for genuine cultural attributes, often retaining the source identity for Global-South targets.
  • Demographic skew: Gender-neutral occupation prompts still surface male dominance and light skin tones, highlighting systematic data imbalance.
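
The sketch below illustrates how the metric side of this gap can be tracked, assuming torchmetrics' CLIPScore and placeholder image paths; the culture-aware metric and human ratings come from the separate pipelines described later on this page.

from PIL import Image
from torchvision.transforms.functional import pil_to_tensor
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def clipscore_trajectory(image_paths: list[str], prompt: str) -> list[float]:
    """CLIPScore per edit iteration; a flat curve here can coexist with the
    cultural drift that native raters report."""
    scores = []
    for path in image_paths:
        image = pil_to_tensor(Image.open(path).convert("RGB"))  # uint8, CHW
        scores.append(float(metric(image, prompt)))
    return scores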

Multi-loop Editing Drift

Five successive instructions should converge toward culturally faithful depictions. Instead, models collapse to a homogenized wedding aesthetic irrespective of the target locale.

Traditional attire dissolves into Global-North gowns, regional symbols fade to generic décor, and palette tweaks masquerade as meaningful edits.

Rows correspond to countries, columns to edit iterations 0→5. Automatic metrics stay optimistic while cultural fidelity visibly deteriorates.

Multi-loop edit progression across six countries showing cultural drift toward similar wedding scenes.

Iterative edits flatten regional diversity: prompts request country-specific weddings, but outputs converge toward the same Global-North style.

Culture-aware Metric Matches Native Reviewers

Our retrieval-augmented VQA metric agrees with country-native reviewers on 74% of best selections and 84% of worst selections, capturing cultural erosion that generic scores overlook.

We ground each evaluation in Wikipedia-derived context retrieved via FAISS and answer culture-grounded questions with Qwen2 models, enabling scalable yet culturally aware diagnostics.
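
A minimal sketch of the retrieval and scoring steps, assuming pre-collected Wikipedia-derived facts; the embedding model and the answer_fn hook (for example, a wrapper around a Qwen2-VL checkpoint) are illustrative assumptions, not the released pipeline.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative encoder choice; any sentence-level embedder would do here.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def build_index(facts: list[str]) -> faiss.IndexFlatIP:
    """Embed cultural facts and index them for cosine (inner-product) search."""
    vectors = encoder.encode(facts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(np.asarray(vectors, dtype="float32"))
    return index

def retrieve(index: faiss.IndexFlatIP, facts: list[str],
             query: str, k: int = 5) -> list[str]:
    """Return the k facts most relevant to the generation prompt."""
    q = encoder.encode([query], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q, k)
    return [facts[i] for i in ids[0]]

def culture_aware_score(image_path: str, context: list[str],
                        questions: list[str], answer_fn) -> float:
    """Fraction of culture-grounded yes/no questions the VQA model affirms.
    answer_fn(image_path, question, context) -> str is a hypothetical hook
    wrapping any VQA model (e.g. a Qwen2-VL checkpoint)."""
    votes = [answer_fn(image_path, q, context).strip().lower().startswith("yes")
             for q in questions]
    return sum(votes) / max(len(votes), 1)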

Heatmap showing agreement between the culture-aware metric and human best and worst selections across countries.
Screenshots of the ECB Human Survey platform used by native expert reviewers.

Expert-in-the-loop Evaluation Platform

Native raters evaluate image quality and cultural representation for each editing loop using the ECB Human Survey platform, ensuring judgments reflect emic expertise.

The tool supports IRB documentation, reviewer dashboards, and best/worst selection workflows, making it easier to repeat cultural audits in new domains.

Beyond Culture: Occupational Bias

Even with gender-neutral prompts, occupation generations skew male for leadership roles and predominantly show light skin tones, while caregiving roles skew female.

These findings reinforce that cultural fidelity and demographic fairness must be evaluated together when deploying generative models.

Examples of gender and skin-tone skew in occupation prompts produced by generative models.

Read the Paper

Dive into the full methodology, extended analyses, and deployment guidance.

  • Full benchmark specification with prompt schema, sampling controls, and evaluation circuits.
  • Expanded qualitative audits, survey instrumentation, and reproducible model configurations.
  • High-resolution figures, tables, and appendices ready for downstream analysis.


BibTeX

@article{seo2025exposing,
  title={Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models},
  author={Seo, Huichan and Choi, Sieun and Hong, Minki and Zhou, Yi and Kim, Junseo and Ismaila, Lukman and Etori, Naome and Agarwal, Mehul and Liu, Zhixuan and Kim, Jihie and Oh, Jean},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2510.20042}
}