Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing
Overall Framework
We build a controlled benchmark from FairFace, pair source portraits with diagnostic prompts, run three I2I editing models, and assess outputs via human evaluation and a VLM ensemble. For feature prompt mitigation, we prepend identity-preserving constraints and re-run editing under identical conditions.
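As a minimal sketch of the factorial design, assuming the seven FairFace race categories, two genders, and six age bins (the exact bins are an assumption; 7 × 2 × 6 = 84 source cells, consistent with 84 portraits × roughly 20 prompts × 3 editors = 5,040 edited images):

from itertools import product

# FairFace race and gender categories; the six age bins are an illustrative
# assumption, not necessarily the bins used in the benchmark.
RACES = ["White", "Black", "Indian", "East Asian",
         "Southeast Asian", "Middle Eastern", "Latino_Hispanic"]
GENDERS = ["Male", "Female"]
AGE_BINS = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69"]

# One source portrait per demographic cell -> 7 * 2 * 6 = 84 cells.
cells = list(product(RACES, GENDERS, AGE_BINS))
assert len(cells) == 84

# Pairing each cell with ~20 diagnostic prompts and 3 editors yields
# 84 * 20 * 3 = 5,040 edited images to score.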
Abstract
Demographic bias in text-to-image (T2I) generation is well studied, yet demographic-conditioned failures in instruction-guided image-to-image (I2I) editing remain underexplored. We examine whether identical edit instructions yield systematically different outcomes across subject demographics in open-weight I2I editors. We formalize two failure modes: Soft Erasure, where edits are silently weakened or ignored in the output image, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. We introduce a controlled benchmark that probes demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age using a diagnostic prompt set, and evaluate multiple editors with vision-language model (VLM) scoring and human evaluation. Our analysis shows that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors, including occupation-driven gender inference. Finally, we demonstrate that a prompt-level identity constraint, without model updates, can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current editors. Together, our findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate demographic-robust editing systems.
End-to-end evaluation pipeline: FairFace source portraits paired with diagnostic prompts, edited by three open-weight I2I editors, and assessed via VLM ensemble and human evaluation.
Qualitative comparison of baseline and ours. Feature prompts reduce race change for non-White subjects by preserving source identity attributes.
WinoBias-derived occupation edits reveal 84–86% stereotype adherence: models override source gender with occupation-driven priors.
Cross-race comparison: identical prompts produce consistent skin lightening for darker-skinned subjects and stereotype-congruent feature enhancement.
Research Contributions
Failure Mode Formalization
We identify and define two demographic-conditioned failure modes in I2I editing: Soft Erasure (silent edit suppression) and Stereotype Replacement (unrequested identity change driven by social priors).
Controlled Benchmark
5,040 edited images from 84 factorially sampled FairFace portraits across race, gender, and age, evaluated by a dual-VLM ensemble and 30 human annotators on Prolific.
Prompt-level Mitigation
A feature prompt that specifies observable appearance attributes reduces identity change for non-White groups by up to 1.48 points without any model modification.
VLM-Human Alignment
Strong agreement between VLM scoring and human evaluation supports automated assessment as a scalable proxy whose detected failure rates serve as a conservative lower bound on demographic-conditioned failures.
Two Failure Modes in I2I Editing
Soft Erasure occurs when the editor silently suppresses the requested edit, yielding unchanged or minimally altered results despite producing an output image.
Stereotype Replacement occurs when edits introduce stereotype-consistent demographic attributes not specified in the prompt—skin lightening, race change, or gender inference driven by occupational priors.
These failures are not reliably captured by generic edit-quality metrics, motivating our multi-axis evaluation protocol.
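One way to operationalize both failure modes from per-output rubric scores is sketched below; the five axes follow the evaluation rubric described later, but the dimension names and threshold values are illustrative assumptions rather than the protocol's exact cutoffs.

def flag_failures(scores: dict) -> dict:
    """Flag demographic-conditioned failures from 1-5 rubric scores.

    Assumed keys: 'edit_success' plus identity-change scores for 'race',
    'gender', 'age', and 'skin_tone' (higher = more change). Thresholds
    are illustrative, not the paper's exact cutoffs.
    """
    soft_erasure = scores["edit_success"] <= 2          # edit silently weakened or ignored
    stereotype_replacement = any(
        scores[axis] >= 4                               # large unrequested identity change
        for axis in ("race", "gender", "age", "skin_tone")
    )
    return {"soft_erasure": soft_erasure,
            "stereotype_replacement": stereotype_replacement}

# Example: the edit barely applied, but skin tone shifted strongly.
print(flag_failures({"edit_success": 2, "race": 3, "gender": 1,
                     "age": 2, "skin_tone": 4}))
# {'soft_erasure': True, 'stereotype_replacement': True}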
Pervasive Racial Disparity
62–71% of all edited outputs exhibit lighter skin tones than the source image. This effect is not uniform: Indian and Black subjects experience 72–75% skin lightening, compared to 44% for White subjects.
Race change shows even starker disparity: Indian subjects experience 14% change vs. only 1% for White subjects. This systematic drift toward lighter skin and White-presenting features occurs across all three models and all prompt categories.
Asymmetric Mitigation via Feature Prompts
Without any model modification, prepending observable appearance features to edit instructions reduces identity change across all non-White groups.
Feature prompts reduce race change by 1.48 points for Black subjects but only 0.06 points for White subjects. This asymmetry reveals a “default to White” prior: without constraints, edits drift toward White-presenting outputs.
The mitigation operates purely at the prompt level—model-agnostic, no fine-tuning required, applicable to closed-source editors.
Feature prompts reduce race change for non-White subjects by preserving source identity attributes. Edit success decreases as the model prioritizes identity preservation over edit compliance.
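A minimal sketch of the prompt-level mitigation follows; the constraint template and attribute fields are hypothetical, not the benchmark's exact wording.

def with_feature_prompt(edit_instruction: str, attrs: dict) -> str:
    """Prepend observable appearance attributes to an edit instruction.

    `attrs` describes the source portrait (skin tone, hair, apparent age);
    the template wording is an illustrative assumption.
    """
    constraint = (
        "Preserve the person's identity: keep their "
        f"{attrs['skin_tone']} skin tone, {attrs['hair']} hair, "
        f"and apparent age in their {attrs['age_range']}. "
    )
    return constraint + edit_instruction

print(with_feature_prompt(
    "Make this person look like a CEO.",
    {"skin_tone": "dark brown", "hair": "short black", "age_range": "30s"},
))

Because the constraint is plain text prepended to the instruction, the same wrapper applies to closed-source editors that expose only a prompt interface.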
Human Evaluation Platform
We recruited N=30 annotators via Prolific to validate VLM-based scoring. Each output was independently rated by three human raters using the same 5-point rubric, yielding 3,000 annotations.
Onboarding guide: participants are shown the evaluation task structure with source image, two edited outputs, and the edit prompt.
Main evaluation interface: side-by-side comparison with independent 5-point Likert scales for each evaluation dimension.
IRB consent form: participants confirm eligibility and agree to participate in the study.
Task dashboard: participants complete one task of 100 items, with completion status tracked in real time.
VLM-Human Alignment
Human scores detect significant racial differences in skin tone (Kruskal–Wallis H = 24.7, p < 0.001) and a White vs. Non-White disparity (Mann–Whitney U, p = 0.020), matching VLM-identified patterns.
The VLM ensemble systematically overestimates edit success by +0.72 points on average; because it under-flags suppressed edits, VLM-detected soft erasure rates serve as a conservative lower bound on the true prevalence.
Identity drift differences between VLM and human means are small (race: 0.03–0.16; gender: 0.05–0.12; age: 0.02–0.10), suggesting VLM scoring enables reliable, scalable assessment.
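The group-difference tests reported above can be run on per-output human skin-tone scores with standard nonparametric tests; the score lists below are small synthetic placeholders, not the study's data.

from scipy.stats import kruskal, mannwhitneyu

# Synthetic per-output skin-tone-change scores grouped by source race
# (illustrative values only; the real data come from the annotation study).
scores_by_race = {
    "White":  [1, 2, 1, 2, 1, 2],
    "Black":  [3, 4, 3, 4, 4, 3],
    "Indian": [4, 3, 4, 3, 4, 4],
}

# Kruskal-Wallis: do skin-tone scores differ across racial groups?
H, p_kw = kruskal(*scores_by_race.values())

# Mann-Whitney U: White vs. pooled non-White comparison.
non_white = [s for race, xs in scores_by_race.items() if race != "White" for s in xs]
U, p_mw = mannwhitneyu(scores_by_race["White"], non_white, alternative="two-sided")

print(f"Kruskal-Wallis H={H:.2f}, p={p_kw:.4f}")
print(f"Mann-Whitney U={U:.1f}, p={p_mw:.4f}")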
WinoBias-derived edits with male/female stereotype mapping across occupations.
Gender-Occupation Stereotypes
Both models follow occupational stereotypes in 84–86% of cases, shifting toward stereotype-consistent gender presentations under gender-coded occupation edits.
CEO and military prompts push female sources toward male presentations, while nurse and housekeeper prompts push male sources toward female presentations—indicating occupation-driven gender priors override source identity.
We evaluate 50 WinoBias-derived occupation prompts balanced across male- and female-coded roles, with outputs annotated by VLM evaluators and human raters.
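Stereotype adherence can be computed as the fraction of counterstereotypical sources whose output gender flips to the occupation's stereotyped gender; the occupation mapping, the restriction to counterstereotypical sources, and the toy records below are assumptions about the metric, not the paper's exact definition.

# Illustrative WinoBias-style occupation -> stereotyped-gender mapping
# (a small excerpt; the benchmark uses 50 such prompts).
STEREOTYPE = {"CEO": "male", "soldier": "male",
              "nurse": "female", "housekeeper": "female"}

def adherence_rate(records):
    """Fraction of edits whose output gender matches the occupational stereotype.

    Each record holds 'occupation', 'source_gender', and 'output_gender' as
    judged by the VLM/human annotators. Restricting to sources whose gender
    conflicts with the stereotype is an assumption about the metric.
    """
    relevant = [r for r in records
                if r["source_gender"] != STEREOTYPE[r["occupation"]]]
    hits = sum(r["output_gender"] == STEREOTYPE[r["occupation"]] for r in relevant)
    return hits / len(relevant) if relevant else float("nan")

records = [
    {"occupation": "CEO", "source_gender": "female", "output_gender": "male"},
    {"occupation": "nurse", "source_gender": "male", "output_gender": "female"},
    {"occupation": "soldier", "source_gender": "female", "output_gender": "female"},
]
print(f"stereotype adherence: {adherence_rate(records):.0%}")  # 67% on this toy sample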
Additional Evaluation Interface
Pre-selection view: annotators see the source image, edit prompt, and both edited outputs before making judgments.
Task dashboard (in progress): real-time tracking of annotation completion across all tasks.
Detailed VLM-Human Agreement
Our dual-VLM ensemble (Gemini 3.0 Flash Preview + GPT-5-mini) produces scores that align closely with human judgments across all five evaluation dimensions.
The agreement pattern confirms that automated VLM scoring can serve as a reliable proxy for human annotation, enabling scalable evaluation of demographic bias in I2I systems.
Extended Alignment Analysis
Beyond aggregate agreement, we analyze per-dimension and per-demographic alignment to ensure VLM scoring does not systematically misrepresent any subgroup.
The extended analysis reveals consistent alignment across race, gender, and age categories, confirming that our protocol supports equitable automated evaluation.
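A sketch of the per-subgroup alignment check, assuming a tidy table of paired scores (column names and toy values are placeholders):

import pandas as pd

# Toy table of per-output scores; in practice each row pairs the VLM-ensemble
# score with the mean of three human ratings for one evaluation dimension.
df = pd.DataFrame({
    "race":      ["White", "White", "Black", "Black", "Indian", "Indian"],
    "dimension": ["edit_success"] * 6,
    "vlm":       [4.8, 4.5, 4.6, 4.7, 4.4, 4.6],
    "human":     [4.1, 3.9, 3.8, 4.0, 3.7, 3.9],
})

# Mean VLM-human gap per dimension and demographic group; a consistently
# positive gap on edit success matches the "conservative lower bound"
# reading of VLM-detected soft erasure.
gaps = (df.assign(gap=df["vlm"] - df["human"])
          .groupby(["dimension", "race"])["gap"]
          .agg(["mean", "count"]))
print(gaps)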
Representative Output Examples
A dense mosaic of 480 randomly sampled outputs from all three models, spanning Occupational Stereotype and Vulnerability Attribute prompts and all demographic groups.
Visual inspection reveals pervasive skin lightening and identity drift patterns across models and demographic groups.
Cross-Race Comparison
The same prompt applied across all seven racial groups enables direct visual comparison of how I2I models treat identical edit requests differently based on source demographics.
Each row shows a single prompt applied to all seven racial groups. Notable patterns include consistent skin lightening for darker-skinned subjects and stereotype-congruent feature enhancement.
Key Findings
- Pervasive Soft Erasure: Step1X-Edit-v1p2 shows the lowest edit success, reflecting frequent silent non-compliance; Qwen-Image-Edit-2511 achieves the highest edit success (4.65), while FLUX.2-dev exhibits the strongest identity change.
- Systematic skin lightening: 62–71% of all edited outputs exhibit lighter skin tones. Indian and Black subjects experience 72–75% skin lightening vs. 44% for White subjects.
- Asymmetric mitigation: Feature prompts reduce race change by 1.48 points for Black subjects but only 0.06 for White subjects, revealing a “default to White” prior in current editors.
- Gender-occupation bias: Both models follow occupational stereotypes in 84–86% of cases, overriding source gender with occupation-driven priors.
BibTeX
@article{seo2026demographic,
  title={Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing},
  author={Seo, Huichan and Hong, Minki and Choi, Sieun and Kim, Jihie and Oh, Jean},
  journal={arXiv preprint},
  year={2026}
}