Description:
Problem:
The application's multimodal functionality fails to work with images captured via screenshot tools (e.g., Windows Snipping Tool, Win+Shift+S). While a user can paste the screenshot image into other applications like Paint or Discord, Blink does not detect the image data on the clipboard. The feature only works if the user manually saves the screenshot as a file and then copies the file itself.
Steps to Reproduce:
- Use the Windows Snipping Tool (Win+Shift+S) to capture a portion of the screen. A notification confirms the snip has been copied to the clipboard.
- Verify the image is on the clipboard by pasting it into an application like MS Paint.
- In a text editor, type and select an instruction (e.g., "What is in this image?").
- Press the multimodal hotkey (Ctrl+Alt+/).
Expected Behavior:
Blink should detect the raw image data on the clipboard, combine it with the selected text prompt, and send both to the configured multimodal LLM for processing.
Actual Behavior:
The query either fails silently or the application shows an error notification such as "Unsupported clipboard content" or "Clipboard is empty." The multimodal feature is not triggered.
Root Cause Analysis:
This is not a bug in the file-handling logic, but a missing feature in the clipboard inspection module.
- When a user copies a file, the clipboard receives data in a special format (CF_HDROP on Windows) which contains a list of file paths. Our current clipboard_manager.py is correctly designed to handle this.
- When a user takes a screenshot, the clipboard receives raw image data, typically in a bitmap format (CF_DIB). Our clipboard_manager.py currently has no logic to detect or handle this data type. It only looks for file paths or plain text.
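To confirm the diagnosis on a given machine, it can help to list which formats are actually on the clipboard after a snip versus after copying a file. The following is a minimal sketch, not existing Blink code: classify_formats and list_clipboard_formats are hypothetical helper names, the numeric constants are the standard Windows clipboard format IDs, and the enumeration helper assumes pywin32 is installed.

```python
# Standard Windows clipboard format IDs (values from winuser.h).
CF_TEXT = 1
CF_BITMAP = 2
CF_DIB = 8
CF_UNICODETEXT = 13
CF_HDROP = 15
CF_DIBV5 = 17

IMAGE_FORMATS = {CF_BITMAP, CF_DIB, CF_DIBV5}
TEXT_FORMATS = {CF_TEXT, CF_UNICODETEXT}


def classify_formats(format_ids):
    """Classify a collection of clipboard format IDs (hypothetical helper)."""
    ids = set(format_ids)
    if ids & IMAGE_FORMATS:
        return "image"
    if CF_HDROP in ids:
        return "files"
    if ids & TEXT_FORMATS:
        return "text"
    return "unsupported"


def list_clipboard_formats():
    """Return the format IDs currently on the clipboard (Windows only)."""
    # Imported lazily so this module still loads where pywin32 is absent.
    import win32clipboard

    formats = []
    win32clipboard.OpenClipboard()
    try:
        # EnumClipboardFormats(0) starts the enumeration; 0 signals the end.
        fmt = win32clipboard.EnumClipboardFormats(0)
        while fmt:
            formats.append(fmt)
            fmt = win32clipboard.EnumClipboardFormats(fmt)
    finally:
        win32clipboard.CloseClipboard()
    return formats
```

On Windows, calling classify_formats(list_clipboard_formats()) right after a snip and again after copying a file in Explorer should show which branch the current clipboard_manager.py never reaches.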
Proposed Solution:
The clipboard_manager.py and the hotkey_manager.py workflow must be enhanced to handle raw image data.
- Add Dependency: The Pillow library is required for this. Add Pillow to requirements.txt.
- Refactor clipboard_manager.py:
  - Modify the function that inspects the clipboard (e.g., get_clipboard_contents).
  - It must now check for clipboard content in a specific order of priority:
    - Check for an Image First: Use PIL.ImageGrab.grabclipboard() to attempt to get image data. If it returns a valid Image object, the clipboard contains an image.
    - Then Check for Files: If no image is found, proceed with the existing win32clipboard logic to check for file paths (CF_HDROP).
    - Finally, Check for Text: If neither of the above is found, fall back to checking for plain text.
  - The function should return a structured object that can handle different data types, for example: {"type": "image_data", "content": <Pillow Image Object>} or {"type": "file_list", "content": ["C:\\path..."]}.
- Refactor hotkey_manager.py:
  - The process_clipboard_context() method must be updated to handle the new image_data type from the clipboard manager.
  - If it receives image_data, it needs to:
    a. Convert the Pillow Image object into an in-memory byte stream (e.g., using io.BytesIO).
    b. Base64 encode this byte stream.
    c. Pass this Base64 string to the llm_interface as part of the multimodal payload, just as it would for an image read from a file.
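The two refactors above could look roughly like the sketch below. It is an illustration under assumptions, not the actual Blink code: the reader callables (grab_image, read_files, read_text), the "text" and "empty" result types, and the image_to_base64 helper name are hypothetical, and PNG is an assumed encoding choice. One detail worth handling: on Windows, PIL.ImageGrab.grabclipboard() can return a list of file paths instead of an Image, so the image check should not treat every non-None result as an image.

```python
import base64
import io


def get_clipboard_contents(grab_image=None, read_files=None, read_text=None):
    """Inspect the clipboard in priority order: image, then files, then text.

    The reader callables are hypothetical injection points so the priority
    logic can be exercised without a real clipboard.
    """
    if grab_image is None:
        # Lazy import: Pillow is only needed when reading the real clipboard.
        from PIL import ImageGrab
        grab_image = ImageGrab.grabclipboard

    grabbed = grab_image()
    if isinstance(grabbed, list):
        # grabclipboard() returned file paths rather than image data.
        if grabbed:
            return {"type": "file_list", "content": grabbed}
    elif grabbed is not None:
        return {"type": "image_data", "content": grabbed}

    if read_files is not None:
        # Stand-in for the existing win32clipboard CF_HDROP logic.
        paths = read_files()
        if paths:
            return {"type": "file_list", "content": list(paths)}

    if read_text is not None:
        text = read_text()
        if text:
            return {"type": "text", "content": text}

    return {"type": "empty", "content": None}


def image_to_base64(image, image_format="PNG"):
    """Serialize a Pillow image to a Base64 string without touching disk."""
    buffer = io.BytesIO()
    image.save(buffer, format=image_format)
    return base64.b64encode(buffer.getvalue()).decode("ascii")
```

Inside process_clipboard_context(), a result with type image_data would then be passed through image_to_base64() and handed to llm_interface the same way an image read from a file is today.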
Acceptance Criteria:
- After taking a screenshot with Win+Shift+S, using the multimodal hotkey successfully sends the image to the LLM.