
MedEBench

Revisiting Text-Instructed Image Editing on Medical Domain

Minghao Liu   Zhitao He   Zhiyuan Fan   Qingyun Wang   Yi R. (May) Fung

HKUST   |   William & Mary

arXiv  |  Code  |  HF Dataset

Abstract

Text-guided image editing has seen rapid progress in natural image domains, but its adaptation to medical imaging remains limited and lacks standardized evaluation. Clinically, such editing holds promise for simulating surgical outcomes, creating personalized teaching materials, and enhancing patient communication. To bridge this gap, we introduce MedEBench, a comprehensive benchmark for evaluating text-guided medical image editing. It consists of 1,182 clinically sourced image-prompt triplets spanning 70 tasks across 13 anatomical regions. MedEBench offers three key contributions: (1) a clinically relevant evaluation framework covering Editing Accuracy, Contextual Preservation, and Visual Quality, supported by detailed descriptions of expected change and ROI (Region of Interest) masks; (2) a systematic comparison of seven state-of-the-art models, revealing common failure patterns; and (3) a failure analysis protocol based on attention grounding, using IoU (Intersection over Union) between attention maps and ROIs to identify mislocalization. MedEBench provides a solid foundation for developing and evaluating reliable, clinically meaningful medical image editing systems.


MedEBench Overview

MedEBench consists of 1,182 samples across 13 categories. Each sample includes an input image, a reference image, an editing prompt, an ROI mask, and a detailed description of expected change.
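For concreteness, one benchmark sample can be thought of as a record like the sketch below. The field names are illustrative assumptions, not the dataset's actual schema; see the HF Dataset for the exact fields.

    # Illustrative structure of a single MedEBench sample.
    # Field names are hypothetical; consult the dataset card for the real schema.
    sample = {
        "input_image": "images/0001_input.png",          # pre-edit clinical image
        "reference_image": "images/0001_reference.png",  # ground-truth post-edit image
        "prompt": "Remove the diminutive polyp.",        # text editing instruction
        "roi_mask": "masks/0001_roi.png",                # binary mask of the region to edit
        "expected_change": "The small polyp is removed while the surrounding "
                           "tissue remains unchanged.",  # detailed description of expected change
        "anatomical_region": "gastrointestinal tract",   # one of 13 regions
        "task": "polyp removal",                         # one of 70 tasks
    }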

Organ distribution and dataset properties

📄 Full Task Table (PDF)

Evaluation Metrics

Contextual Preservation: We evaluate whether the model preserves regions unrelated to the intended edit by computing the Structural Similarity Index (SSIM) between the previous image \( I_{\text{prev}} \) and the edited image \( I_{\text{edit}} \), excluding the region-of-interest (ROI) mask \( \mathcal{R} \). The contextual SSIM is defined as:

\( \text{SSIM}_{\text{context}} = \text{SSIM}(I_{\text{prev}} \mid \bar{\mathcal{R}}, \; I_{\text{edit}} \mid \bar{\mathcal{R}}) \)

This metric captures how well the model maintains anatomical consistency outside the edited region.
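One straightforward way to realize this metric is to compute the per-pixel SSIM map and average it over the complement of the ROI. The sketch below, using scikit-image, follows that approach; the paper's exact implementation may differ in windowing and boundary handling.

    import numpy as np
    from skimage.metrics import structural_similarity

    def contextual_ssim(prev_img, edit_img, roi_mask):
        """SSIM between the previous and edited images, restricted to pixels
        outside the ROI. Images are 2D float arrays in [0, 1]; roi_mask is a
        boolean array that is True inside the region of interest."""
        # Full per-pixel SSIM map (window-centred values).
        _, ssim_map = structural_similarity(
            prev_img, edit_img, data_range=1.0, full=True
        )
        context = ~roi_mask                     # complement of the ROI
        return float(ssim_map[context].mean())  # average SSIM outside the ROI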

Editing Accuracy and Visual Quality: We use GPT-4o, a multimodal language model with visual reasoning capabilities, to evaluate both metrics. For each sample, GPT-4o receives the description of expected change, the previous image, the edited image, and the ground-truth image, and follows a structured two-step protocol, illustrated in the evaluation pipeline below.

Detailed Evaluation Pipeline
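As a rough illustration of how such a judge can be queried, the sketch below sends the description of change and the three images to GPT-4o through the OpenAI chat completions API. The prompt wording, step phrasing, and output handling are assumptions for illustration, not the benchmark's exact protocol.

    import base64
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def to_data_url(path):
        with open(path, "rb") as f:
            return "data:image/png;base64," + base64.b64encode(f.read()).decode()

    def score_edit(change_description, prev_path, edit_path, gt_path):
        """Ask GPT-4o for Editing Accuracy and Visual Quality scores in [0, 1]."""
        instruction = (
            "Step 1: describe what actually changed between the previous and "
            "edited images. Step 2: compare against the expected change and the "
            "ground-truth image, then output Editing Accuracy and Visual Quality, "
            "each as a number in [0, 1].\n"
            f"Expected change: {change_description}"
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {"type": "image_url", "image_url": {"url": to_data_url(prev_path)}},
                    {"type": "image_url", "image_url": {"url": to_data_url(edit_path)}},
                    {"type": "image_url", "image_url": {"url": to_data_url(gt_path)}},
                ],
            }],
        )
        return response.choices[0].message.content  # scores are parsed from this reply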

Experiments

Comparison Table

Visual Comparison of Editing Results

| Editing prompt | Gemini2 | SeedX | Imagic | IP2P | InstructDiff. | PaintByInpaint | ICEdit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Remove the diminutive polyp. | 0.8/0.8/0.9 | 0.1/1.0/1.0 | 0.9/0.9/0.9 | 0.6/0.7/1.0 | 0.0/0.3/0.9 | 0.2/0.2/0.8 | 0.9/0.9/1.0 |
| Remove the dental black stains. | 0.9/0.8/0.8 | 0.7/0.5/0.9 | 0.9/0.8/0.7 | 0.5/0.5/0.9 | 0.5/0.7/0.7 | 0.2/0.1/0.6 | 0.0/0.5/0.8 |
| Remove intestinal polyps. | 0.8/0.8/0.8 | 0.2/0.3/0.2 | 0.2/0.7/0.5 | 0.3/0.7/0.8 | 0.1/0.5/0.7 | 0.0/0.3/0.3 | 0.2/0.6/0.7 |
| Fix damaged front teeth. | 0.9/0.8/0.9 | 0.0/0.5/0.8 | 0.7/0.6/0.5 | 0.0/0.0/0.7 | 0.2/0.5/0.6 | 0.9/0.9/0.7 | 0.0/0.4/0.9 |

Each row shows the previous and ground-truth images followed by outputs from the seven models; the scores for each output denote EA (Editing Accuracy), VQ (Visual Quality), and CP (Contextual Preservation, computed as masked SSIM), all in the [0, 1] range.

Learning Paradigms Comparison


Comparison of Fine-tuning and In-context Learning

Failure Analysis via Attention

  • We analyze InstructPix2Pix to understand its high context preservation but low editing accuracy.
  • Its cross-attention maps link prompt tokens (e.g., "lip", "ear") to image regions.
  • For each key visual-concept token \( t_c \):
    • Extract cross-attention maps over all diffusion steps.
    • Average them:
      \( \bar{a}_{t_c} = \frac{1}{S} \sum_{s=1}^{S} a^{(s)}_{t_c} \)
    • Reshape to the spatial grid (yielding \( \bar{A}_{t_c} \)), min-max normalize, and binarize via a scaled Otsu threshold:
      \( A = \left( \frac{\bar{A}_{t_c} - \min}{\max - \min} > \alpha \cdot \tau_{\text{Otsu}} \right) \)
      with \( \alpha = 1.3 \).
  • Compare with ROI mask \(M\) using Intersection-over-Union (IoU):
    \( \text{IoU} = \frac{|A \cap M|}{|A \cup M|} \)
  • Interpretation:
    • High IoU: Localizes correctly but edits conservatively (e.g., samples 5–6).
    • Low IoU: Fails to localize/edit (e.g., samples 1–4).
  • A higher IoU indicates stronger spatial grounding of visual concepts; a minimal computation sketch follows the figure below.

Illustration of attention-based failure analysis
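A minimal sketch of the grounding computation above, assuming the per-step cross-attention maps for a token have already been extracted from the model (hooking the diffusion UNet is not shown):

    import numpy as np
    from skimage.filters import threshold_otsu
    from skimage.transform import resize

    def attention_iou(attn_maps, roi_mask, alpha=1.3):
        """IoU between a token's binarized attention map and the ROI mask.

        attn_maps: array of shape (S, h, w) holding the token's cross-attention
        map at each of the S diffusion steps; roi_mask: boolean ROI mask at
        full image resolution."""
        # Average the per-step maps (the mean map over S diffusion steps).
        mean_map = attn_maps.mean(axis=0)
        # Upsample to the ROI resolution and min-max normalize to [0, 1].
        mean_map = resize(mean_map, roi_mask.shape, order=1, anti_aliasing=True)
        mean_map = (mean_map - mean_map.min()) / (mean_map.max() - mean_map.min() + 1e-8)
        # Binarize with the scaled Otsu threshold (alpha = 1.3).
        attn_mask = mean_map > alpha * threshold_otsu(mean_map)
        # Intersection-over-Union with the ROI mask.
        inter = np.logical_and(attn_mask, roi_mask).sum()
        union = np.logical_or(attn_mask, roi_mask).sum()
        return float(inter / union) if union > 0 else 0.0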

Key Insights and Takeaways

Large multimodal models outperform open-source alternatives on complex medical edits. Gemini 2 Flash consistently leads in Editing Accuracy (EA), Visual Quality (VQ), and FID, with the largest gaps in internal organ tasks (e.g., gastrointestinal tract, spine, teeth), where fine-grained structures and spatial reasoning are critical. Open-source models like ICEdit remain competitive but lag in high-precision scenarios.

Region-specific challenges persist. Models struggle with repetitive or occluded anatomy (e.g., hands, spine, teeth), where structures are small or overlapped, making precise localization and editing difficult. In contrast, tasks involving superficial structures (e.g., skin blemishes, nose shape) are more reliably handled, showing higher EA and VQ scores.

Fine-tuning aids domain adaptation; in-context learning shows limited generalization. Fine-tuning models like InstructPix2Pix on medical image-editing triplets improves performance, even with limited data. However, in-context learning with large multimodal models like Gemini struggles to generalize to fine-grained, pixel-level medical editing tasks. Increasing the number of demonstration samples in the prompt does not improve accuracy and often reduces contextual consistency, highlighting the challenges of applying in-context learning to clinical scenarios.

Attention maps provide diagnostic insights. By comparing attention heatmaps and Region-of-Interest (ROI) masks, we find two key failure patterns: (1) High IoU but low EA: The model attends to the correct region but fails to apply the desired edit, indicating conservative or incomplete changes (e.g., concept removal tasks like removing moles). (2) Low IoU: The model attends to irrelevant regions, failing both localization and editing, which often occurs in addition or reconstruction tasks (e.g., repairing a nose or adding missing teeth). These insights guide model improvement for medical editing tasks.

Citation

If you find this work useful, please cite it as follows:

    @misc{liu2025medebenchrevisitingtextinstructedimage, 
      title={MedEBench: Revisiting Text-instructed Image Editing on Medical Domain}, 
      author={Minghao Liu and Zhitao He and Zhiyuan Fan and Qingyun Wang and Yi R. Fung},
      year={2025},
      eprint={2506.01921},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.01921}
    }