This project provides Blind and Low Vision users with an AAII (Accessible Artificial Intelligence Implementation) that is available to them wherever they may be. Once all components are installed, most of the project has minimal dependence on a stable internet connection if the user works solely on their workstation (PC/laptop). To access it from anywhere, a WhatsApp connection needs to be set up. Both the server and the client web application must be running for the system to work; start the server first. You can access the client by opening `client/screen_wss.html` in your browser.
If you want to use an already running server:
- Obtain the server WebSocket URL (format: `wss://[server-address]/ws`)
- Open the WhatsAI Web Client by opening `client/screen_wss.html` in your browser
- Paste the server URL into the "Server URL" text field
- Click "Select Screen to Share" and choose your desired screen
- Click "Start Streaming" to begin sharing your screen with the server
- Select a processor from the dropdown menu to analyze your screen
The system includes several processors, ordered by complexity:
1. Basic Processor (pass-through)
   - Description: A simple pass-through processor that returns the original image without modifications (a minimal sketch of such a processor follows this list)
   - Dependencies: None
   - Use Case: Testing connectivity and stream quality
2. YOLO11 object detection and segmentation
   - Description: Performs real-time object detection and segmentation using YOLO11
   - Dependencies: None
   - Use Case: Identifying and locating objects on your screen
   - Reference: Based on Ultralytics YOLO - https://github.com/ultralytics/ultralytics
3. Florence-2 captioning and OCR
   - Description: Generates detailed region-based captions and performs OCR using Florence-2
   - Dependencies: None
   - Use Case: Understanding text and visual content on screen
   - Reference: Microsoft Florence-2 - https://huggingface.co/microsoft/Florence-2-large
4. CamIO object recognition
   - Description: Provides specialized object recognition for curated items
   - Dependencies: None
   - Use Case: Learning about specific pre-trained objects
   - Reference: Based on Simple CamIO - https://github.com/Coughlan-Lab/simple_camio
5. Finger counting with MediaPipe
   - Description: Detects hands and counts raised fingers using MediaPipe
   - Dependencies: Basic Processor (ID: 0)
   - Use Case: Gesture recognition and finger counting
   - Reference: Adapted from Finger Counter using MediaPipe - https://github.com/HarshitDolu/Finger-Counter-using-mediapipe
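For orientation, here is a minimal sketch of what the pass-through (Basic) processor does: it hands the captured frame back unchanged. The class name and `process()` signature are illustrative assumptions for this README, not the repository's actual processor interface.

```python
import numpy as np


class BasicProcessor:
    """Illustrative pass-through processor (assumed interface, not the repo's actual API)."""

    def process(self, frame: np.ndarray) -> np.ndarray:
        # No analysis and no modification: the original image is returned,
        # which makes this useful for testing connectivity and stream quality.
        return frame


if __name__ == "__main__":
    # Feed a dummy 720p RGB frame through the processor.
    dummy_frame = np.zeros((720, 1280, 3), dtype=np.uint8)
    assert BasicProcessor().process(dummy_frame) is dummy_frame
```

The heavier processors in the list apply a model (YOLO11, Florence-2, MediaPipe) to each frame instead of returning it untouched.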
To run the server locally, you will need:
- WSL2 on Windows or a Linux system
- Docker Desktop with WSL2 integration enabled
- A Gemini API key
1. Clone the repository:
   `git clone --single-branch --branch workshop https://github.com/Znasif/HackTemplate.git`
   `cd HackTemplate`
2. Create and configure the `.env` file (a sketch of how the key is read appears after these steps):
   `GEMINI_API_KEY="your-gemini-api-key"`
3. Build the Docker container:
   `docker-compose build`
4. Start the server:
   `docker-compose up`
5. Access the server:
   - Local access: `ws://localhost:8000/ws`
   - Remote access: Use localtunnel (https://github.com/localtunnel/localtunnel) to expose the port
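As noted in step 2, the server needs the `GEMINI_API_KEY` from the `.env` file. The snippet below is a minimal sketch of how such a key is commonly read with python-dotenv; the actual loading code inside this repository may differ.

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read key/value pairs from a local .env file into the process environment.
load_dotenv()

gemini_api_key = os.environ.get("GEMINI_API_KEY")
if not gemini_api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; check your .env file.")

print(f"Gemini API key loaded ({len(gemini_api_key)} characters).")
```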
To deploy the server on RunPod instead, you will need:
- A Docker Hub account
- A RunPod account with credits
1. Tag and push your local Docker image to Docker Hub:
   `docker tag whatsai-server:latest yourusername/whatsai:latest`
   `docker push yourusername/whatsai:latest`
2. Create a new pod on RunPod:
   - Container Image: `yourusername/whatsai:latest`
   - Container Start Command: `bash -c "cd /app && /app/start_server.sh"`
   - Container Disk: 50 GB
   - Volume Disk: 20 GB (optional, for persistent storage)
   - Volume Mount Path: `/workspace`
   - Expose HTTP Ports: `8000`
   - GPU Selection: RTX 4000 Ada or similar
3. Add environment variables in RunPod:
   - GEMINI_API_KEY: your-api-key
   - PYTHONPATH: /app
4. Deploy the pod and obtain your URL:
   - Format: `wss://[pod-id]-8000.proxy.runpod.net/ws`
The system supports real-time audio streaming for voice-based interactions:
- Direct Audio: Screen reader and system audio output work automatically
- Remote Audio via Start Audio Button: To dictate a processor selection, press the "Start Audio" button, say which processor to start, and then press "Stop Audio" to initiate it.
- Virtual Audio Cable: For WhatsApp audio streaming:
  - Install VB-Audio Virtual Cable from https://vb-audio.com/Cable/
  - Set VB-Audio Virtual Cable Output as the default system audio output
  - In WhatsApp calls, set the audio input to VB-Audio Virtual Cable Input
The server exposes two endpoints:
- WebSocket: `ws://[server]/ws` - Main streaming endpoint
- HTTP GET: `http://[server]:8000/processors` - List available processors
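As a quick illustration of the streaming endpoint, the snippet below simply opens the WebSocket from Python. It assumes the `websockets` package; the actual messages exchanged afterwards (screen frames and processor output) follow the protocol implemented by `client/screen_wss.html` and the server, which is not documented here.

```python
import asyncio

import websockets  # pip install websockets

SERVER = "localhost:8000"  # replace with your server address, e.g. a RunPod URL


async def main() -> None:
    # Open the main streaming endpoint; a successful handshake confirms
    # the server is reachable and the /ws path is correct.
    async with websockets.connect(f"ws://{SERVER}/ws"):
        print(f"Connected to ws://{SERVER}/ws")


asyncio.run(main())
```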
If the client cannot connect:
- Verify that the server URL includes `/ws` at the end
- Check that the server is running: `curl http://[server]:8000/processors`
- Ensure your firewall allows WebSocket connections
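The first two checks can also be scripted. The sketch below is a small diagnostic assuming the `requests` package and the endpoint layout described in this README; adapt the host for remote or RunPod deployments.

```python
import requests  # pip install requests


def check_server(ws_url: str, host: str = "localhost") -> None:
    """Mirror the first two troubleshooting checks above."""
    # 1. The streaming URL must end with /ws.
    if not ws_url.endswith("/ws"):
        print("Warning: the server URL should end with /ws")
    # 2. Equivalent of `curl http://[server]:8000/processors`.
    url = f"http://{host}:8000/processors"
    try:
        response = requests.get(url, timeout=5)
        print(f"{url} -> HTTP {response.status_code}")
    except requests.RequestException as exc:
        print(f"Could not reach {url}: {exc}")


check_server("ws://localhost:8000/ws")
```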
For the best streaming performance:
- Use a wired internet connection
- Close unnecessary applications to reduce screen capture overhead
- Select specific application windows instead of the full screen when possible
Click on the following image, which will take you to a playlist of demonstration videos:
