feat: add multi-modal attachment propagation to all worker agents thr… #1196
bittoby wants to merge 13 commits into eigent-ai:main
Conversation
@Wendong-Fan Could you please review this PR? Thanks
Thanks for @bittoby's contribution! Could @nitpicker55555 and @Zephyroam help check this?
@Zephyroam @nitpicker55555 I would appreciate your feedback. Please review the PR! Thanks
nitpicker55555 left a comment
Obviously, you didn’t test your code. You didn’t register the ImageAnalysisToolkit for the agent—so how is the agent supposed to gain image-reading capability?
Additionally, this design is overly complicated. Why not simply modify the prompt of the decompose agent so that, when breaking down the task, it passes the image location to the corresponding agent? That way, we would only need to adjust the prompt and register the ImageAnalysisToolkit for the agent.
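For readers following along, the registration step nitpicker55555 is describing can be sketched in miniature. The stubs below are hypothetical stand-ins, not the actual eigent/CAMEL classes; only the toolkit name `ImageAnalysisToolkit` comes from the thread:

```python
# Hypothetical sketch: an agent only gains image-reading capability if the
# toolkit's functions are actually registered as tools on that agent.

class ImageAnalysisToolkit:
    """Stand-in for the real toolkit; exposes one callable tool."""

    def image_to_text(self, image_path: str) -> str:
        # A real implementation would send the image to a vision model.
        return f"description of {image_path}"

    def get_tools(self):
        return [self.image_to_text]


class WorkerAgent:
    """Minimal agent stub that records which tools it was given."""

    def __init__(self, name: str, tools=None):
        self.name = name
        self.tools = {t.__name__: t for t in (tools or [])}

    def can(self, tool_name: str) -> bool:
        return tool_name in self.tools


# Without registration the agent has no way to read an image; with it,
# the capability is discoverable by name.
developer = WorkerAgent("developer", tools=ImageAnalysisToolkit().get_tools())
print(developer.can("image_to_text"))  # True
```

This is the gap being pointed out: passing image paths around is useless unless some registered tool can actually open them.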
@nitpicker55555 Obviously, I tested it before pushing the PR. When I attach a test image and use “As a developer, analyze this screenshot and write code” as the prompt, the developer agent generates HTML that matches the screenshot.
You're right - I haven't added
@bittoby Thank you for the explanation, but I still don’t understand: without providing a dedicated tool, how can the model read an image from a passed-in image URL instead of a base64-encoded input? Could you point out which part of your PR implements this functionality?
Right now it only passes the image file paths along as additional_info. The current setup depends on the model’s built-in vision (which needs base64 data or a URL in the actual message). My PR doesn’t do that conversion or attach the images to the message - it just sends the paths as metadata. I misunderstood; it’s clear to me now, and I will update the PR. For this to really work, we need to either:
Which option do you recommend?
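The base64 path mentioned above is mechanical: read the file, encode it, and embed it in the message content as a data URL. Here is a stdlib-only sketch; the `attach_image` helper and the message shape are illustrative, not the project's actual API:

```python
import base64
import mimetypes


def image_path_to_data_url(path: str) -> str:
    """Encode a local image file as a data: URL that a vision model can consume."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def attach_image(message: dict, path: str) -> dict:
    """Hypothetical helper: put the encoded image into the message itself,
    rather than leaving only a bare path in metadata (the step the PR
    initially skipped)."""
    message.setdefault("images", []).append(image_path_to_data_url(path))
    return message
```

Whether the conversion happens at message-build time (as here) or inside a tool the agent calls is exactly the design choice being debated in this thread.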
My idea is to tweak the decompose agent’s prompt. The benefits are: the change is relatively small, since we’d only need to update the decompose agent’s prompt to pass along the image path. It would also allow the agent to read other image files, not just images uploaded by the user in the prompt, and it could support users providing an image path instead of the image itself. I also looked into how Claude Code does image reading, and it works in a similar way, via a dedicated tool.
What do you think @Wendong-Fan @bittoby
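The prompt-only variant proposed here could be as small as one extra instruction appended to the decompose agent's template. The wording and helper below are invented for illustration; the real prompt lives in backend/app/agent/prompt.py:

```python
# Hypothetical addition to the decompose agent's system prompt.
IMAGE_PATH_INSTRUCTION = (
    "If the user's request includes image attachments, include each "
    "image's file path verbatim in the subtask assigned to the agent "
    "that needs it, so that agent can read the image with its toolkit."
)


def build_decompose_prompt(base_prompt: str, attachments: list) -> str:
    """Append the instruction and the concrete paths only when images exist."""
    if not attachments:
        return base_prompt
    paths = "\n".join(f"- {p}" for p in attachments)
    return (
        f"{base_prompt}\n\n{IMAGE_PATH_INSTRUCTION}\n\n"
        f"Attached images:\n{paths}"
    )
```

The appeal of this route is that the only "wiring" is text: no new tool surface, just the planner being told to forward paths.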
@nitpicker55555 Thanks for your explanation! Concerns:
My approach (like Claude Code) would be to make a small, separate tool:
But it requires more code and a bit more wiring. I agree with your approach for now since it's faster to ship. We can refactor to a separate ReadImageTool later if we run into reusability issues. @Wendong-Fan, what's your preference?
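The separate-tool route compared to Claude Code above would reduce to a single function the framework wraps as a tool. The name `read_image` and the return shape are hypothetical, chosen only to show the interface:

```python
import base64
import os


def read_image(image_path: str) -> dict:
    """Hypothetical ReadImageTool body: validate the path and return the
    image bytes base64-encoded, so the calling agent's vision model can
    consume them. This mirrors the tool-based file reading discussed in
    the thread."""
    if not os.path.isfile(image_path):
        return {"error": f"file not found: {image_path}"}
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    return {"path": image_path, "base64": data}
```

The trade-off named in the comment is visible even at this size: the tool is reusable by any agent, but it is one more function to register, document, and test.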
@bittoby What do you mean by “document agent read image for embedding”?
I mean that here “embedding” = putting the image into a document, not analyzing it.
@bittoby It seems you haven’t reviewed this tool at all. I strongly recommend checking the codebase before continuing the discussion. |
Okay. @nitpicker55555 I’ll update the PR to follow your approach. |
Force-pushed from 1e21ba1 to 805ed97
Force-pushed from 1e21ba1 to 53d8830
…ocument agents by integrating ImageAnalysisToolkit with proper agent registration and explicit priority instructions in system prompts
Force-pushed from 698b49b to dba3f58
…lti-modal-worker-agents
@nitpicker55555 I updated the PR to follow your feedback. Please review again.
@Wendong-Fan @nitpicker55555 I would appreciate your feedback.
@Zephyroam @bytecii Would you review my PR, too? Thank you!
@bittoby I don’t think you followed my suggestion. I said we need to modify the ImageAnalysisToolkit to give the agent the ability to read images directly, but you didn’t make that change. If you are going to do this, you can directly remove the current functionality of ImageAnalysisToolkit and change it so that the worker can only read images by providing an input path. The prompt you added is also redundant; we only need one sentence: “You can use ImageAnalysisToolkit to read an image at a given path.” And based on our updated guideline, https://github.com/eigent-ai/eigent/blob/main/CONTRIBUTING.md, please upload your test screenshot to prove your change works.
…implify agent prompts
…lti-modal-worker-agents
test.webm
@nitpicker55555 Please check this test video. My changes work well.
…ker agents
ScreenshotToolkit already provides read_image capability via the agent's own vision model, making ImageAnalysisToolkit redundant. Add ScreenshotToolkit to browser and document agents for image reading support, and revert all ImageAnalysisToolkit additions from worker agents.
I can see the problem now: the developer agent already has the Screenshot toolkit for image reading, so we can use this tool directly. Check this: bittoby#1, and test the browser agent/document agent to see if they can read images. You can achieve this by modifying the generated task plan to: "use browser agent/document agent to check the xxx image file content"
browser.webm
The browser agent works, too.
…lti-modal-worker-agents
@nitpicker55555 I merged your changes. Now all agents handle image reading well. I think it's okay to merge this PR. @Wendong-Fan
from app.utils.listen.toolkit_listen import auto_listen_toolkit
You do not need to modify this file anymore since it is not used.
Others LGTM
…lti-modal-worker-agents
@nitpicker55555 I reverted it.
nitpicker55555 left a comment
Thanks @bittoby! @Wendong-Fan, can you take a look at this?
@Wendong-Fan Would you review this PR and merge it if there are no problems?
Enable Image Analysis for All Worker Agents

Problem

Only the Multi-Modal Agent could analyze images. Worker agents (Developer, Browser, Document) couldn't see or process image attachments, causing tasks to fail.

Solution

- Add ImageAnalysisToolkit to the Developer, Browser, and Document agents
- Register it via the toolkits_to_register_agent parameter

Changes

Agent Factories:
- backend/app/agent/factory/developer.py
- backend/app/agent/factory/browser.py
- backend/app/agent/factory/document.py

System Prompts:
- backend/app/agent/prompt.py

Impact

All worker agents can now see and analyze image attachments.

Testing

✅ Developer Agent: Extracts code from screenshots
✅ Browser Agent: Analyzes webpage screenshots
✅ Document Agent: Ready for document image analysis

Type

Closes #956