RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing
This model is part of the work presented in the paper RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing.
Demo video of RePlan:
Model Summary
This model is the Planner module of the RePlan framework, designed for complex instruction-based image editing. It is a fine-tuned version of Qwen2.5-VL-7B, trained with GRPO on only ~1k samples and without any paired images.
Given an input image and a natural language editing instruction, this model performs Chain-of-Thought (CoT) reasoning to decompose the task. It outputs structured guidance containing:
- Reasoning: Analysis of the image and instruction.
- Global Edits: Instructions for the entire image (if necessary).
- Regional Edits: Precise bounding boxes (`bbox_2d`) and specific prompts (`hint`) for local regions.
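For illustration, a single entry in the regional-edit list pairs a bounding box with a localized prompt. This is a minimal sketch with values borrowed from the worked example at the end of this card; the exact output format is specified under Usage below.
regional_edit = {
    "bbox_2d": [224, 372, 263, 431],  # [x1, y1, x2, y2] of the target region
    "hint": "Replace this red cup with a small potted plant",
}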
Paper Introduction
Paper Title: RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing
Existing instruction-based image editing models often struggle with Instruction-Visual Complexity (IV-Complexity): scenarios involving cluttered visuals, ambiguous instructions, or the need for multi-step reasoning.
RePlan introduces a "Plan-then-Execute" strategy:
- Plan: This VLM planner analyzes the scene, grounds the instruction to specific pixels, and generates a precise editing plan.
- Execute: A diffusion model (equipped with a Training-Free Attention Region Injection mechanism) applies the edits based on the planner's guidance.
Experiments show that RePlan significantly outperforms baselines in visual reasoning and background consistency.
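As a mental model, the two stages can be sketched as below. The function names are hypothetical stand-ins, not APIs of this checkpoint or the RePlan codebase: the planner call corresponds to the inference example later in this card, and the executor is a separate diffusion model.
from typing import TypedDict

class RegionEdit(TypedDict):
    bbox_2d: list[int]  # [x1, y1, x2, y2] bounding box of the target region
    hint: str           # concise, region-specific edit prompt

def plan(image_path: str, instruction: str) -> tuple[str, list[RegionEdit]]:
    """Stage 1 (this model): CoT reasoning -> global prompt + regional edits."""
    raise NotImplementedError  # hypothetical stand-in; see the inference example below

def execute(image_path: str, global_prompt: str, regions: list[RegionEdit]):
    """Stage 2 (diffusion executor): applies the plan; each region's hint is
    injected into attention over its bounding box (training-free)."""
    raise NotImplementedError  # hypothetical stand-in; not part of this model card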
Usage
System Prompt
Crucial: To get the correct XML/JSON structured output, you MUST use the following System Prompt.
You are an expert AI image editing assistant. Your task is to carefully analyze a user's editing instruction and input image, reason step by step, and then decompose the necessary actions into global and local edits.
### Rules
1. **Global Edits:** Affect the entire image's style, lighting, color grading, or overall composition. Global edits should only be derived if they are essential for achieving the user's core instruction.
2. **Local Edits:** Target specific objects or areas. These instructions go into the `<region>` tag.
3. **Hint Quality:** The `hint` text MUST be a concise, visually descriptive instruction for its specific region. It should clearly state the expected visual outcome.
4. **Strict Separation:** Instructions for local edits in `<region>` MUST NOT be duplicated in the `<gen_image>` prompt.
5. **Edge Case - No Global Edits:** If no global edits are necessary to achieve the user's goal, `<gen_image>` MUST be the placeholder 'keep remaining part of image unchanged.'
6. **Edge Case - No Local Edits:** If no local edits are needed, `<region>` must be an empty list `[]`.
### Output Format
Your entire output must follow this format, with no text outside the tags.
<think>Reasoning process</think><gen_image>Global edit instruction</gen_image><region>[{"bbox_2d": [10,150,150,210], "point_2d": [30,175], "hint": "change the color of this one apple to blue"}, {"bbox_2d": [150,50,200,150], "point_2d": [175,75], "hint": "keep this one apple unchanged"}]</region>
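The planner returns this format as a single string. Below is a minimal parsing sketch, written here for convenience and not taken from the RePlan codebase, that splits the response into its three parts and decodes the region list, assuming the model follows the format above.
import json
import re

def parse_plan(response: str) -> dict:
    """Split planner output into reasoning, global prompt, and regional edits."""
    def _extract(tag: str) -> str:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        return match.group(1).strip() if match else ""

    region_payload = _extract("region")
    return {
        "think": _extract("think"),          # chain-of-thought reasoning
        "gen_image": _extract("gen_image"),  # global edit prompt (or placeholder)
        "region": json.loads(region_payload) if region_payload else [],  # bbox/hint dicts
    }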
Inference Code Example
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
# 1. Load Model
model_path = "path/to/your/model" # Replace with huggingface repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
# 2. Define System Prompt (See section above)
SYSTEM_PROMPT = "..." # Paste the full system prompt here
# 3. Prepare Inputs
image_path = "./example_image.jpg"
user_instruction = "Replace the cup that has been used and left on the desk with a small potted plant."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": user_instruction},
        ],
    },
]
# 4. Processing
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)
# 5. Generate
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text_list = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
# Output is the raw XML string containing <think>, <gen_image>, and <region>
response_text = output_text_list[0] if isinstance(output_text_list, list) else output_text_list
print(response_text)
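# Optional sanity check (not part of the original example): parse the response and
# draw the planned regions on the input image with Pillow. This assumes the returned
# bbox_2d values are absolute pixel coordinates in the input image; if the processor
# resized the image, the boxes may need rescaling first.
from PIL import Image, ImageDraw

plan_dict = parse_plan(response_text)  # helper sketched in the System Prompt section above
print("Global edit:", plan_dict["gen_image"])

vis = Image.open(image_path).convert("RGB")
draw = ImageDraw.Draw(vis)
for edit in plan_dict["region"]:
    x1, y1, x2, y2 = edit["bbox_2d"]
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(0, y1 - 12)), edit["hint"], fill="red")
vis.save("planned_regions.jpg")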
Input / Output Examples
Example 1: Complex Reasoning
Input Image:
Instruction: "Replace the cup that has been used and left on the desk with a small potted plant."
Output Format Example:
<think>
The user wants to replace a "used" cup. I see two vessels: a red coffee cup and a glass of water. The red cup appears empty and discarded, fitting the "used and left" description. The glass of water is still potentially in use. Therefore, I should target the red cup for replacement and explicitly keep the glass unchanged to avoid confusion.
</think><gen_image>keep remaining part of image unchanged.</gen_image><region>[{"bbox_2d": [224, 372, 263, 431], "hint": "Replace this red cup with a small potted plant"}, {"bbox_2d": [175, 329, 220, 388], "hint": "Keep this glass unchanged"}]</region>