Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing

¹Carnegie Mellon University · ²Dongguk University
*Equal contribution

  • arXiv Preprint: full paper with methodology, results, and appendices (arXiv link coming soon).
  • Code & Evaluation Assets: reproducible pipelines, prompts, and evaluation code (GitHub link coming soon).
  • Benchmark Data: 5,640 edited outputs, VLM scores, and human annotations (Hugging Face link coming soon).

Qualitative examples of demographic-conditioned failures in I2I editing across different prompts and source demographics.

Identical edit instructions yield systematically different outcomes across subject demographics: skin lightening, race change, and gender inference reveal deeply embedded priors in open-weight I2I editors.

Overall Framework

Overview of the study framework: source portraits, diagnostic prompts, three I2I editors, feature prompt mitigation, VLM and human evaluation.

We build a controlled benchmark from FairFace, pair source portraits with diagnostic prompts, run three I2I editing models, and assess outputs via human evaluation and a VLM ensemble. For feature prompt mitigation, we prepend identity-preserving constraints and re-run editing under identical conditions.
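
A minimal sketch of this loop, with hypothetical portrait, editor, and scorer interfaces (the names and signatures below are illustrative assumptions, not the released pipeline):

from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional

@dataclass
class Portrait:
    image_path: str
    race: str
    gender: str
    age: str

def run_benchmark(
    portraits: list[Portrait],
    prompts: list[str],
    editors: dict[str, Callable[[str, str], str]],   # name -> edit(image_path, instruction)
    scorer: Callable[[str, str, str], dict],         # (source, edited, instruction) -> per-axis scores
    feature_prompt: Optional[Callable[[Portrait], str]] = None,
) -> list[dict]:
    """Edit every portrait with every prompt on every editor, then score."""
    records = []
    for portrait, prompt, (name, edit) in product(portraits, prompts, editors.items()):
        instruction = prompt
        if feature_prompt is not None:
            # Feature-prompt mitigation: prepend identity-preserving constraints
            # describing the source subject's observable appearance.
            instruction = f"{feature_prompt(portrait)} {prompt}"
        edited = edit(portrait.image_path, instruction)
        scores = scorer(portrait.image_path, edited, instruction)
        records.append({"editor": name, "prompt": prompt, "race": portrait.race,
                        "gender": portrait.gender, "age": portrait.age, **scores})
    return records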

Abstract

Demographic bias in text-to-image (T2I) generation is well studied, yet demographic-conditioned failures in instruction-guided image-to-image (I2I) editing remain underexplored. We examine whether identical edit instructions yield systematically different outcomes across subject demographics in open-weight I2I editors. We formalize two failure modes: Soft Erasure, where edits are silently weakened or ignored in the output image, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. We introduce a controlled benchmark that probes demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age using a diagnostic prompt set, and evaluate multiple editors with vision-language model (VLM) scoring and human evaluation. Our analysis shows that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors, including occupation-driven gender inference. Finally, we demonstrate that a prompt-level identity constraint, without model updates, can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current editors. Together, our findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate demographic-robust editing systems.

Research Contributions

Failure Mode Formalization

We identify and define two demographic-conditioned failure modes in I2I editing: Soft Erasure (silent edit suppression) and Stereotype Replacement (unrequested identity change driven by social priors).

Controlled Benchmark

5,040 edited images from 84 factorially sampled FairFace portraits across race, gender, and age, evaluated by a dual-VLM ensemble and 30 human annotators on Prolific.
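
As a quick consistency check on these counts (the seven race groups and three editors are as reported elsewhere on this page; the split into two gender labels, six age bins, and 20 diagnostic prompts per portrait is our assumption):

# Consistency check of the benchmark size; the 6 age bins and 20 diagnostic
# prompts are assumptions inferred from the stated totals, not given above.
races, genders, age_bins = 7, 2, 6
source_portraits = races * genders * age_bins        # 84 factorially sampled portraits
diagnostic_prompts, editors = 20, 3
edited_images = source_portraits * diagnostic_prompts * editors
assert source_portraits == 84 and edited_images == 5040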

Prompt-level Mitigation

A feature prompt that specifies observable appearance attributes reduces identity change for non-White groups by up to 1.48 points without any model modification.

VLM-Human Alignment

Strong agreement between VLM scoring and human evaluation validates automated assessment at scale, with VLM-detected failure rates serving as a conservative lower bound on demographic-conditioned failures.

Two Failure Modes in I2I Editing

Soft Erasure occurs when the editor silently suppresses the requested edit, yielding unchanged or minimally altered results despite producing an output image.

Stereotype Replacement occurs when edits introduce stereotype-consistent demographic attributes not specified in the prompt—skin lightening, race change, or gender inference driven by occupational priors.

These failures are not reliably captured by generic edit-quality metrics, motivating our multi-axis evaluation protocol.
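
One way a multi-axis protocol can flag the two failure modes, sketched with assumed score names and thresholds on the 5-point scale (not the paper's exact decision rule):

# Illustrative decision rule over multi-axis 5-point scores; the field names
# and thresholds are assumptions, not the paper's exact criteria.
from typing import Optional

def classify_failure(scores: dict) -> Optional[str]:
    edit_success = scores["edit_success"]                 # was the requested edit applied?
    demographic_change = max(scores["race_change"],       # unrequested demographic drift
                             scores["gender_change"],
                             scores["age_change"])
    if edit_success <= 2 and demographic_change <= 1:
        return "soft_erasure"            # edit silently weakened or ignored
    if demographic_change >= 3:
        return "stereotype_replacement"  # unrequested identity attributes introduced
    return None

print(classify_failure({"edit_success": 1, "race_change": 1,
                        "gender_change": 1, "age_change": 1}))  # soft_erasure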

Examples of Soft Erasure (edit ignored) and Stereotype Replacement (identity changed toward majority-group features).
Racial disparities in skin lightening (72–75% for Indian/Black vs. 44% for White) and race change (14% for Indian vs. 1% for White).

Pervasive Racial Disparity

62–71% of all edited outputs exhibit lighter skin tones than the source image. This effect is not uniform: Indian and Black subjects experience 72–75% skin lightening, compared to 44% for White subjects.

Race change shows even starker disparity: Indian subjects experience 14% change vs. only 1% for White subjects. This systematic drift toward lighter skin and White-presenting features occurs across all three models and all prompt categories.

Asymmetric Mitigation via Feature Prompts

Without any model modification, prepending observable appearance features to edit instructions reduces identity change across all non-White groups.

Feature prompts reduce race change by 1.48 points for Black subjects but only 0.06 points for White subjects. This asymmetry reveals a “default to White” prior: without constraints, edits drift toward White-presenting outputs.

The mitigation operates purely at the prompt level—model-agnostic, no fine-tuning required, applicable to closed-source editors.
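
A minimal sketch of how such a feature prompt might be prepended; the attribute template and wording are illustrative, not the exact identity constraint used in the study:

def with_feature_prompt(attrs: dict, edit_instruction: str) -> str:
    """Prepend observable appearance attributes as an identity constraint."""
    # Template and wording are assumptions; only the prepend-at-prompt-level
    # mechanism follows the description above.
    constraint = (
        f"Preserve the subject's appearance: {attrs['skin_tone']} skin tone, "
        f"{attrs['hair']} hair, apparent age {attrs['age']}. "
        "Do not change race, gender, or age."
    )
    return f"{constraint} {edit_instruction}"

print(with_feature_prompt(
    {"skin_tone": "deep brown", "hair": "short black", "age": "40s"},
    "Make this person look like a CEO.",
))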

Qualitative comparison: baseline outputs show identity drift for non-White subjects; feature prompt outputs preserve identity.

Feature prompts reduce race change for non-White subjects by preserving source identity attributes. Edit success decreases as the model prioritizes identity preservation over edit compliance.

Human Evaluation Platform

We recruited N=30 annotators via Prolific to validate VLM-based scoring. Each output was independently rated by three human raters using the same 5-point rubric, yielding 3,000 annotations.
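
A small sketch of how per-item scores can be aggregated from the three independent ratings (the record layout is an assumption about the data format):

# Illustrative aggregation of three ratings per edited output into an item mean;
# field names are assumptions, not the released annotation schema.
from collections import defaultdict

def mean_score_per_item(annotations: list[dict]) -> dict[str, float]:
    """annotations: [{'item_id': str, 'rater_id': str, 'score': int 1-5}, ...]"""
    by_item: dict[str, list[int]] = defaultdict(list)
    for a in annotations:
        by_item[a["item_id"]].append(a["score"])
    return {item: sum(s) / len(s) for item, s in by_item.items()}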

Onboarding guide showing evaluation task structure with source image, two AI-edited outputs, and edit prompt.

Onboarding guide: participants are shown the evaluation task structure with source image, two edited outputs, and the edit prompt.

Main evaluation interface with side-by-side comparison and 5-point Likert scales.

Main evaluation interface: side-by-side comparison with independent 5-point Likert scales for each evaluation dimension.

IRB consent form confirming study purpose and participant anonymity.

IRB consent form: participants confirm eligibility and agree to participate in the study.

Task selection dashboard showing completed and available tasks.

Task dashboard: participants complete one task of 100 items, with completion status tracked in real time.

VLM-Human Alignment

Human scores detect significant racial differences in skin tone (Kruskal–Wallis H = 24.7, p < 0.001) and a White vs. Non-White disparity (Mann–Whitney U, p = 0.020), matching VLM-identified patterns.

VLM systematically overestimates edit success by +0.72 points on average, meaning VLM-detected soft erasure rates serve as a conservative lower bound on the true prevalence.

Identity drift differences between VLM and human means are small (race: 0.03–0.16; gender: 0.05–0.12; age: 0.02–0.10), suggesting VLM scoring enables reliable, scalable assessment.
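
A sketch of how these group comparisons can be computed with SciPy, assuming human skin-tone scores grouped by FairFace race label (the data layout is an assumption):

from scipy.stats import kruskal, mannwhitneyu

def racial_disparity_tests(scores_by_race: dict[str, list[float]]) -> dict[str, float]:
    # Kruskal-Wallis: do skin-tone scores differ across the race groups?
    h_stat, p_kw = kruskal(*scores_by_race.values())
    # Mann-Whitney U: White vs. pooled non-White scores.
    white = scores_by_race["White"]
    non_white = [s for race, vals in scores_by_race.items()
                 if race != "White" for s in vals]
    u_stat, p_mw = mannwhitneyu(white, non_white, alternative="two-sided")
    return {"kruskal_H": h_stat, "kruskal_p": p_kw,
            "mannwhitney_U": u_stat, "mannwhitney_p": p_mw}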

Human evaluation: (a) mean skin tone scores by race showing significant racial disparity; (b) edit success and change reduction with feature prompts.
Gender-occupation stereotypes: WinoBias-based edits showing models consistently adopt stereotype-consistent gender presentations.

WinoBias-derived edits with male/female stereotype mapping across occupations.

Gender-Occupation Stereotypes

Both models follow occupational stereotypes in 84–86% of cases, shifting toward stereotype-consistent gender presentations under gender-coded occupation edits.

CEO and military prompts push female sources toward male presentations, while nurse and housekeeper prompts push male sources toward female presentations—indicating occupation-driven gender priors override source identity.

We evaluate 50 WinoBias-derived occupation prompts balanced across male- and female-coded roles, with outputs annotated by VLM evaluators and human raters.
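
A sketch of how the stereotype-consistency rate can be computed; the record fields and the occupation-to-gender mapping below are illustrative, not the exact annotation schema:

def stereotype_consistency_rate(records: list[dict],
                                stereotyped_gender: dict[str, str]) -> float:
    """records: one per edited output, with 'occupation' and 'output_gender'
    as annotated by the VLM evaluators or human raters (assumed fields)."""
    hits = sum(r["output_gender"] == stereotyped_gender[r["occupation"]]
               for r in records)
    return hits / len(records)

# Example mapping in the spirit of WinoBias-coded roles (labels assumed).
stereotyped = {"CEO": "male", "soldier": "male",
               "nurse": "female", "housekeeper": "female"}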

Additional Evaluation Interface

Evaluation interface before selection, showing both edited outputs awaiting annotation.

Pre-selection view: annotators see the source image, edit prompt, and both edited outputs before making judgments.

Task dashboard showing incomplete tasks with progress indicators.

Task dashboard (in progress): real-time tracking of annotation completion across all tasks.

VLM-human alignment showing agreement patterns across evaluation dimensions.

Detailed VLM-Human Agreement

Our dual-VLM ensemble (Gemini 3.0 Flash Preview + GPT-5-mini) produces scores that align closely with human judgments across all five evaluation dimensions.

The agreement pattern confirms that automated VLM scoring can serve as a reliable proxy for human annotation, enabling scalable evaluation of demographic bias in I2I systems.
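
A sketch of one way to quantify per-dimension agreement, assuming item-aligned VLM-ensemble and human mean scores (Spearman's rho here is illustrative, not necessarily the paper's agreement metric):

from scipy.stats import spearmanr

def per_dimension_agreement(vlm_scores: dict[str, list[float]],
                            human_scores: dict[str, list[float]]) -> dict[str, float]:
    """Both dicts map each evaluation dimension to item-aligned score lists."""
    agreement = {}
    for dim in vlm_scores:
        rho, _p = spearmanr(vlm_scores[dim], human_scores[dim])
        agreement[dim] = rho
    return agreement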

Extended Alignment Analysis

Beyond aggregate agreement, we analyze per-dimension and per-demographic alignment to ensure VLM scoring does not systematically misrepresent any subgroup.

The extended analysis reveals consistent alignment across race, gender, and age categories, confirming that our protocol supports equitable automated evaluation.

Extended VLM-human alignment analysis across demographic subgroups.

Representative Output Examples

A dense mosaic of 480 randomly sampled outputs from all three models, spanning Occupational Stereotype and Vulnerability Attribute prompts for every demographic group.

Dense mosaic of 480 edited outputs from FLUX.2-dev, Step1X-Edit-v1p2, and Qwen-Image-Edit-2511.

Visual inspection reveals pervasive skin lightening and identity drift patterns across models and demographic groups.

Cross-Race Comparison

The same prompt applied across all seven racial groups enables direct visual comparison of how I2I models treat identical edit requests differently based on source demographics.

Same prompt across different races: CEO, Doctor, Housekeeper, Politician, Wheelchair User, Aged prompts applied to all seven racial groups.

Each row shows a single prompt applied to all seven racial groups. Notable patterns include consistent skin lightening for darker-skinned subjects and stereotype-congruent feature enhancement.

Key Findings

  • Pervasive Soft Erasure: Step1X-Edit-v1p2 shows the lowest edit success, reflecting frequent silent non-compliance; Qwen-Image-Edit-2511 achieves the highest edit success (4.65), while FLUX.2-dev exhibits the strongest identity change.
  • Systematic skin lightening: 62–71% of all edited outputs exhibit lighter skin tones. Indian and Black subjects experience 72–75% skin lightening vs. 44% for White subjects.
  • Asymmetric mitigation: Feature prompts reduce race change by 1.48 points for Black subjects but only 0.06 for White subjects, revealing a “default to White” prior in current editors.
  • Gender-occupation bias: Both models follow occupational stereotypes in 84–86% of cases, overriding source gender with occupation-driven priors.

BibTeX

@article{seo2026demographic,
  title={Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing},
  author={Seo, Huichan and Hong, Minki and Choi, Sieun and Kim, Jihie and Oh, Jean},
  journal={arXiv preprint},
  year={2026}
}