[Bug] Multimodal query fails to recognize image data from screenshots on the clipboard #5

@JStaRFilms

Description:

Problem:
The application's multimodal functionality fails to work with images captured via screenshot tools (e.g., Windows Snipping Tool, Win+Shift+S). While a user can paste the screenshot image into other applications like Paint or Discord, Blink does not detect the image data on the clipboard. The feature only works if the user manually saves the screenshot as a file and then copies the file itself.

Steps to Reproduce:

  1. Use the Windows Snipping Tool (Win+Shift+S) to capture a portion of the screen. A notification confirms the snip has been copied to the clipboard.
  2. Verify the image is on the clipboard by pasting it into an application like MS Paint.
  3. In a text editor, type and select an instruction (e.g., "What is in this image?").
  4. Press the multimodal hotkey (Ctrl+Alt+/).

Expected Behavior:
Blink should detect the raw image data on the clipboard, combine it with the selected text prompt, and send both to the configured multimodal LLM for processing.

Actual Behavior:
The query either fails silently or the application shows an error notification such as "Unsupported clipboard content" or "Clipboard is empty." The multimodal feature is not triggered.

Root Cause Analysis:
This is not a bug in the file-handling logic, but a missing feature in the clipboard inspection module.

  • When a user copies a file, the clipboard receives data in a special format (CF_HDROP on Windows) which contains a list of file paths. Our current clipboard_manager.py is correctly designed to handle this.
  • When a user takes a screenshot, the clipboard receives raw image data, typically in a bitmap format (CF_DIB). Our clipboard_manager.py currently has no logic to detect or handle this data type. It only looks for file paths or plain text.
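
To illustrate the difference between the two cases, here is a minimal sketch based on the standard Windows clipboard format IDs. It is not Blink's actual code: `classify_formats` is a hypothetical helper, and in practice the list of available formats would have to come from enumerating the clipboard (for example with pywin32's `win32clipboard.EnumClipboardFormats`), which is assumed rather than shown here.

```python
# Standard Windows clipboard format IDs (values from winuser.h).
CF_TEXT = 1
CF_BITMAP = 2
CF_DIB = 8
CF_UNICODETEXT = 13
CF_HDROP = 15
CF_DIBV5 = 17

IMAGE_FORMATS = {CF_BITMAP, CF_DIB, CF_DIBV5}
TEXT_FORMATS = {CF_TEXT, CF_UNICODETEXT}


def classify_formats(available):
    """Hypothetical helper: name the kind of content a set of format IDs implies."""
    available = set(available)
    if available & IMAGE_FORMATS:
        return "image_data"
    if CF_HDROP in available:
        return "file_list"
    if available & TEXT_FORMATS:
        return "text"
    return "unsupported"


# A screenshot typically offers bitmap formats and no CF_HDROP, so a check
# that only looks for file paths or plain text finds nothing it can use.
screenshot_formats = [CF_BITMAP, CF_DIB, CF_DIBV5]
copied_file_formats = [CF_HDROP]

print(classify_formats(screenshot_formats))
print(classify_formats(copied_file_formats))
```

The type names `image_data` and `file_list` follow the structured return values proposed in this issue; `text` and `unsupported` are placeholder names.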

Proposed Solution:
The clipboard_manager.py and the hotkey_manager.py workflow must be enhanced to handle raw image data.

  1. Add Dependency: The Pillow library is required for this. Add Pillow to requirements.txt.

  2. Refactor clipboard_manager.py:

    • Modify the function that inspects the clipboard (e.g., get_clipboard_contents).
    • It must now check for clipboard content in a specific order of priority:
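      1. Check for an Image First: Use PIL.ImageGrab.grabclipboard() to attempt to get image data. On Windows this call can return an Image object, a list of file names, or None, so only treat the clipboard as containing an image if the result is actually an Image object.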
      2. Then Check for Files: If no image is found, proceed with the existing win32clipboard logic to check for file paths (CF_HDROP).
      3. Finally, Check for Text: If neither of the above is found, fall back to checking for plain text.
    • The function should return a structured object that can handle different data types, for example: {"type": "image_data", "content": <Pillow Image Object>} or {"type": "file_list", "content": ["C:\\path..."]}.

  3. Refactor hotkey_manager.py:

    • The process_clipboard_context() method must be updated to handle the new image_data type from the clipboard manager.
    • If it receives image_data, it needs to:
      a. Convert the Pillow Image object into an in-memory byte stream by saving it in an encoded format such as PNG to a buffer (e.g., using io.BytesIO).
      b. Base64 encode this byte stream.
      c. Pass this Base64 string to the llm_interface as part of the multimodal payload, just as it would for an image read from a file.
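
The two refactors above could be sketched roughly as follows. This is a sketch under stated assumptions, not the project's implementation. `read_files` and `read_text` are hypothetical stand-ins for the existing win32clipboard logic, the `text` and `empty` type names are placeholders, and `looks_like_image` uses duck typing so the module loads without Pillow installed. A stricter check would be `isinstance(obj, PIL.Image.Image)`.

```python
import base64
import io


def grab_image():
    """Ask Pillow for clipboard image data; imported lazily on first use."""
    from PIL import ImageGrab
    return ImageGrab.grabclipboard()


def looks_like_image(obj):
    # grabclipboard() may also return a list of file names or None,
    # so only objects with a save() method are treated as images.
    return callable(getattr(obj, "save", None))


def get_clipboard_contents(grab=grab_image,
                           read_files=lambda: [],
                           read_text=lambda: ""):
    """Check for an image first, then files, then plain text.

    read_files and read_text are placeholders (assumed names) for the
    existing CF_HDROP and text handling in clipboard_manager.py.
    """
    grabbed = grab()
    if looks_like_image(grabbed):
        return {"type": "image_data", "content": grabbed}
    files = read_files()
    if files:
        return {"type": "file_list", "content": list(files)}
    text = read_text()
    if text:
        return {"type": "text", "content": text}
    return {"type": "empty", "content": None}


def image_to_base64(image, fmt="PNG"):
    """Save the image to an in-memory buffer and Base64-encode the bytes."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)
    return base64.b64encode(buffer.getvalue()).decode("ascii")
```

In process_clipboard_context(), a result with type image_data would then be passed through image_to_base64(result["content"]) before being handed to llm_interface, mirroring what already happens for an image read from a file. PNG is used here because it is lossless. Any format the target LLM accepts would work.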

Acceptance Criteria:

  • After taking a screenshot with Win+Shift+S, using the multimodal hotkey successfully sends the image to the LLM.
  • Copying an image file from Windows Explorer still works as expected (no regressions).
  • Copying plain text and using the clipboard context feature still works as expected.
  • When the clipboard offers both image data and file paths, the system prioritizes the image data.
