Authors: Conor Hayes, Kyuwon Weon, Amber Handal, Tianhao Zhang
PenPal uses a vision-guided robotic system to detect a whiteboard in the environment, read handwritten questions from the board using the Gemini vision-language model, generate concise answers, and physically write responses back onto the board using a Franka Emika arm.
Demo videos: `trim.3598FA08-6654-48A9-8AEE-79FFAE86DF47.MOV`, `Rviz.mp4`
The system integrates:
- Computer vision (AprilTag-based pose estimation, OpenCV preprocessing)
- Vision-language models (Gemini VLM for OCR + question answering)
- Robot motion planning (MoveIt Cartesian path planning)
- Frame-consistent spatial reasoning (TF trees and rigid-body transforms)
All perception, reasoning, and motion are performed dynamically at runtime, allowing the robot to adapt to changes in board position and orientation.
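The frame-consistent spatial reasoning above comes down to composing rigid-body transforms: the board pose detected in the camera frame must be expressed in the robot's base frame before the arm can write on it. A minimal numpy-only sketch of that composition; the camera extrinsics, board pose, and board point below are made-up illustrative values, not PenPal's actual calibration:

```python
import numpy as np

def quat_to_matrix(q):
    """Rotation matrix from a unit quaternion in (x, y, z, w) order."""
    x, y, z, w = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
        [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
        [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
    ])

def pose_to_transform(xyz, quat_xyzw):
    """Build a 4x4 homogeneous transform from a position + quaternion pose."""
    T = np.eye(4)
    T[:3, :3] = quat_to_matrix(quat_xyzw)
    T[:3, 3] = xyz
    return T

s = np.sqrt(0.5)
# Camera pose in the robot base frame (illustrative extrinsic calibration:
# camera rotated 90 degrees about x and offset from the base).
T_base_cam = pose_to_transform([0.5, 0.0, 0.6], [s, 0.0, 0.0, s])
# Board pose in the camera frame, as an AprilTag detection might report it.
T_cam_board = pose_to_transform([0.0, 0.0, 0.8], [0.0, 0.0, 0.0, 1.0])

# Compose to get the board pose in the base frame, then map a point on the
# writable area (given in board coordinates) into the base frame.
T_base_board = T_base_cam @ T_cam_board
point_board = np.array([0.1, 0.05, 0.0, 1.0])   # homogeneous coordinates
point_base = T_base_board @ point_board
```

The same pattern generalizes to every pen-tip waypoint: plan in board coordinates, then left-multiply by the base-to-board transform.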
- Prerequisites
- Ubuntu 24.04
- Python 3.10+
- ROS 2 (tested with Kilted Kaiju)
- MoveIt 2
- Google Gemini API key
- Franka ROS stack
- Realsense D435i
- Run Setup
```shell
# run the following from your ROS 2 workspace where
# this repo is in the src folder:

# install dependencies
rosdep install --from-paths src --ignore-src --rosdistro kilted

# WARNING - the below worked fine on our computers, BUT
# there are warnings not to do this on Ubuntu. Move forward
# at your own risk...
pip install --break-system-packages google-genai torch

# set the Google API key in your shell:
export GOOGLE_API_KEY='[INSERT YOUR API KEY HERE]'

# build the workspace
cd ~/ws
colcon build --symlink-install
source install/setup.bash

# launch PenPal!
ros2 launch penpal penpal.launch.py
```
- Commanding Penpal
- `penpal.py`: Top-level orchestrator + FSM. Watches for board visibility, triggers OCR/VLM via a Trigger service client, and sends generated text to the writing stack (planner + controller). Provides `wake`, `sleep`, and `grab_pen` services and the `write_message` action. Subscribes to `board_info`.
- `board_detector.py`: Real board detector. Consumes AprilTag detections and produces `penpal_interfaces/BoardInfo` (`board_info`) containing board pose, dimensions, writable area, and detection metadata (e.g., tag count, sequence number).
- `ocr_node.py`: Real OCR + QA node. Provides `read_and_answer_board` (`example_interfaces/Trigger`), which captures/uses the latest image and returns a JSON payload containing a transcription + a concise answer (and raw debug fields).
- `mock_board_detector.py`: Publishes a fixed/synthetic `board_info` for development when the camera/AprilTags aren't running (useful for testing planning/writing).
- `mock_ocr_node.py`: Mock Trigger-service implementation of OCR/QA. Always returns a static JSON payload for end-to-end testing of PenPal without VLM dependencies.
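Because `read_and_answer_board` packs its result into the Trigger response as a JSON string, a client's main job is parsing that payload. A minimal sketch in plain Python; the exact field names (`transcription`, `answer`) are assumptions based on the description above, not the node's confirmed schema:

```python
import json

def parse_board_response(message: str) -> tuple[str, str]:
    """Extract the transcription and concise answer from the JSON payload
    carried in the Trigger response's string field."""
    payload = json.loads(message)
    return payload['transcription'], payload['answer']

# Example payload shaped like what the mock OCR node might return.
raw = json.dumps({
    'transcription': 'What is the capital of France?',
    'answer': 'Paris',
})
transcription, answer = parse_board_response(raw)
```

In the real system this string would come from an `rclpy` Trigger service client rather than a local variable; the parsing is the same either way.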
penpal.launch.py
Launches the full system (PenPal + MoveIt RViz + optionally vision, or mocks).
Launch arguments:
- `vision` (default: `true`): set to `mock` to start `mock_ocr_node` + `mock_board_detector` instead (no real vision)
- `controller` (default: `moveit`)
penpal_vision.launch.py
Launches the vision system (Realsense + AprilTags + board detection + OCR).
Launch arguments:
- `gemini_api_key` (default: `$GOOGLE_API_KEY`)
- `run_rviz` (default: `true`)
moveit_ctl.launch.py
Integration-test launch file for the MoveIt-based controller.
- Significantly multithreaded code in the penpal node; many actions need to be done in parallel
- Integrating many distinct functions into one architecture
- Management of many frames, complex trajectories, and the transforms between them.
- Most of these transforms were handled manually (using `numpy` and `scipy`) rather than via the TF tree, due to the sheer amount of information to handle.
- Use of the joint trajectory controller limited many control options. While `setCollisionThreshold` was used, the upper threshold was set to a wider range than needed for controlling pen-tip force alone, to account for the torque and force the robot experiences as it accelerates toward the board.
- Using a Cartesian impedance controller would be a great next step to improve PenPal.
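Trajectory generation for writing is where the frame and waypoint bookkeeping mentioned above shows up most. One recurring step is turning 2-D strokes in the board plane into 3-D Cartesian waypoints with pen lifts between strokes. A numpy-only sketch under assumed conventions (board-frame z pointing out of the board, a `lift` retract distance); this is illustrative, not PenPal's actual planner:

```python
import numpy as np

def strokes_to_waypoints(strokes, lift=0.02):
    """Flatten 2-D board-plane strokes into 3-D waypoints with pen lifts."""
    waypoints = []
    for stroke in strokes:
        x0, y0 = stroke[0]
        waypoints.append([x0, y0, lift])      # approach above the stroke start
        for x, y in stroke:
            waypoints.append([x, y, 0.0])     # pen on the board surface
        xn, yn = stroke[-1]
        waypoints.append([xn, yn, lift])      # retract after the stroke
    return np.array(waypoints)

# Two short horizontal strokes; each gains an approach and a retract point,
# so 2 strokes of 2 points each become 8 waypoints.
wps = strokes_to_waypoints([[(0.0, 0.0), (0.05, 0.0)],
                            [(0.0, 0.02), (0.05, 0.02)]])
```

Each board-frame waypoint would then be transformed into the base frame and fed to the Cartesian path planner.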
This repo contains a pre-commit hook that performs lint checks before you're allowed to commit code, and also auto-fixes some of those errors for you (e.g., replacing double quotes with single quotes). This is so we don't have to go back and fight with ament_lint for a million years like we did in the previous project.
It uses the Python `pre-commit` framework plus the `ruff` formatter/linter to do this; see the docs for pre-commit and ruff.
In order to set it up, do the following:
```shell
# install the pre-commit program
sudo apt install pre-commit
# install the pre-commit hooks into the repo (as configured in .pre-commit-config.yaml)
pre-commit install
```
All done! Now every time you commit, it will run lint checks and apply some autoformatting to make sure we stick to the ROS 2 style guidelines (mostly; it doesn't do or check everything for us).