RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing
This model is part of the work presented in the paper RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing.
Demo video of RePlan:
Model Summary
This model is the Planner module of the RePlan framework, designed for complex instruction-based image editing. It is a fine-tuned version of Qwen2.5-VL-7B, trained with GRPO on only ~1k samples and without any paired images.
Given an input image and a natural language editing instruction, this model performs Chain-of-Thought (CoT) reasoning to decompose the task. It outputs structured guidance containing:
- Reasoning: Analysis of the image and instruction.
- Global Edits: Instructions for the entire image (if necessary).
- Regional Edits: Precise bounding boxes (`bbox_2d`) and specific prompts (`hint`) for local regions.
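For illustration, a single entry in the regional-edit list pairs a bounding box with a localized prompt. This is a minimal sketch with values borrowed from the worked example at the end of this card; the exact output format is specified under Usage below.
regional_edit = {
    "bbox_2d": [224, 372, 263, 431],  # [x1, y1, x2, y2] of the target region
    "hint": "Replace this red cup with a small potted plant",
}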
Paper Introduction
Paper Title: RePlan: Reasoning-Guided Region Planning for Complex Instruction-Based Image Editing
Existing instruction-based image editing models often struggle with Instruction-Visual Complexity (IV-Complexity): scenarios involving cluttered visuals, ambiguous instructions, or the need for multi-step reasoning.
RePlan introduces a "Plan-then-Execute" strategy:
- Plan: This VLM planner analyzes the scene, grounds the instruction to specific pixels, and generates a precise editing plan.
- Execute: A diffusion model (equipped with a Training-Free Attention Region Injection mechanism) applies the edits based on the planner's guidance.
Experiments show that RePlan significantly outperforms baselines in visual reasoning and background consistency.
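As a mental model, the two stages can be sketched as below. The function names are hypothetical stand-ins, not APIs of this checkpoint or the RePlan codebase: the planner call corresponds to the inference example later in this card, and the executor is a separate diffusion model.
from typing import TypedDict

class RegionEdit(TypedDict):
    bbox_2d: list[int]  # [x1, y1, x2, y2] bounding box of the target region
    hint: str           # concise, region-specific edit prompt

def plan(image_path: str, instruction: str) -> tuple[str, list[RegionEdit]]:
    """Stage 1 (this model): CoT reasoning -> global prompt + regional edits."""
    raise NotImplementedError  # hypothetical stand-in; see the inference example below

def execute(image_path: str, global_prompt: str, regions: list[RegionEdit]):
    """Stage 2 (diffusion executor): applies the plan; each region's hint is
    injected into attention over its bounding box (training-free)."""
    raise NotImplementedError  # hypothetical stand-in; not part of this model card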
Usage
System Prompt
Crucial: To get the correct XML/JSON structured output, you MUST use the following System Prompt.
You are an expert AI image editing assistant. Your task is to carefully analyze a user's editing instruction and input image, reason step by step, and then decompose the necessary actions into global and local edits.
### Rules
1. **Global Edits:** Affect the entire image's style, lighting, color grading, or overall composition. Global edits should only be derived if they are essential for achieving the user's core instruction.
2. **Local Edits:** Target specific objects or areas. These instructions go into the `<region>` tag.
3. **Hint Quality:** The `hint` text MUST be a concise, visually descriptive instruction for its specific region. It should clearly state the expected visual outcome.
4. **Strict Separation:** Instructions for local edits in `<region>` MUST NOT be duplicated in the `<gen_image>` prompt.
5. **Edge Case - No Global Edits:** If no global edits are necessary to achieve the user's goal, `<gen_image>` MUST be the placeholder 'keep remaining part of image unchanged.'
6. **Edge Case - No Local Edits:** If no local edits are needed, `<region>` must be an empty list `[]`.
### Output Format
Your entire output must follow this format, with no text outside the tags.
<think>Reasoning process</think><gen_image>Global edit instruction</gen_image><region>[{"bbox_2d": [10,150,150,210], "point_2d": [30,175], "hint": "change the color of this one apple to blue"}, {"bbox_2d": [150,50,200,150], "point_2d": [175,75], "hint": "keep this one apple unchanged"}]</region>
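The planner returns this format as a single string. Below is a minimal parsing sketch, written here for convenience and not taken from the RePlan codebase, that splits the response into its three parts and decodes the region list, assuming the model follows the format above.
import json
import re

def parse_plan(response: str) -> dict:
    """Split planner output into reasoning, global prompt, and regional edits."""
    def _extract(tag: str) -> str:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        return match.group(1).strip() if match else ""

    region_payload = _extract("region")
    return {
        "think": _extract("think"),          # chain-of-thought reasoning
        "gen_image": _extract("gen_image"),  # global edit prompt (or placeholder)
        "region": json.loads(region_payload) if region_payload else [],  # bbox/hint dicts
    }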
Inference Code Example
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
# 1. Load Model
model_path = "path/to/your/model" # Replace with huggingface repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
# 2. Define System Prompt (See section above)
SYSTEM_PROMPT = "..." # Paste the full system prompt here
# 3. Prepare Inputs
image_path = "./example_image.jpg"
user_instruction = "Replace the cup that has been used and left on the desk with a small potted plant."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": user_instruction},
        ],
    },
]
# 4. Processing
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)
# 5. Generate
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text_list = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
# Output is the raw XML string containing <think>, <gen_image>, and <region>
response_text = output_text_list[0] if isinstance(output_text_list, list) else output_text_list
print(response_text)
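# Optional sanity check (not part of the original example): parse the response and
# draw the planned regions on the input image with Pillow. This assumes the returned
# bbox_2d values are absolute pixel coordinates in the input image; if the processor
# resized the image, the boxes may need rescaling first.
from PIL import Image, ImageDraw

plan_dict = parse_plan(response_text)  # helper sketched in the System Prompt section above
print("Global edit:", plan_dict["gen_image"])

vis = Image.open(image_path).convert("RGB")
draw = ImageDraw.Draw(vis)
for edit in plan_dict["region"]:
    x1, y1, x2, y2 = edit["bbox_2d"]
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(0, y1 - 12)), edit["hint"], fill="red")
vis.save("planned_regions.jpg")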
Input / Output Examples
Example 1: Complex Reasoning
Input Image:
Instruction: "Replace the cup that has been used and left on the desk with a small potted plant."
Output Format Example:
<think>
The user wants to replace a "used" cup. I see two vessels: a red coffee cup and a glass of water. The red cup appears empty and discarded, fitting the "used and left" description. The glass of water is still potentially in use. Therefore, I should target the red cup for replacement and explicitly keep the glass unchanged to avoid confusion.
</think><gen_image>keep remaining part of image unchanged.</gen_image><region>[{"bbox_2d": [224, 372, 263, 431], "hint": "Replace this red cup with a small potted plant"}, {"bbox_2d": [175, 329, 220, 388], "hint": "Keep this glass unchanged"}]</region>