Convert training videos into professional Standard Operating Procedure (SOP) manuals automatically using AI.
- ⚡ 15x Faster - FFmpeg-powered frame extraction
- 🎯 Better Accuracy - Timestamped audio transcription
- ✅ Complete Procedures - Includes reassembly and verification steps
- 📊 Timing Display - See performance breakdown for each phase
- 🧹 Auto Cleanup - Automatic frame cleanup after generation
This tool uses multimodal AI (Gemini 1.5 Flash) and Whisper to watch industrial/manufacturing training videos and generate step-by-step instruction manuals with screenshots.
- 🎥 FFmpeg Video Processing: Extracts key frames 15x faster than OpenCV-based extraction
- 🎙️ Timestamped Audio: High-quality speech-to-text with precise timestamps using Whisper AI
- 🤖 AI Analysis: Uses Gemini 1.5 Flash to understand and document complete procedures
- 📄 Professional PDFs: Creates polished SOP manuals with images and clear instructions
- ⚡ Fast Processing: 4-minute video → Complete SOP in ~2 minutes
- 🔒 Safety Notes: Automatically identifies safety considerations
- ✅ Complete Procedures: Includes disassembly, repair, reassembly, and verification steps
- 🧹 Auto Cleanup: Automatically removes temporary frames after generation
4-minute video (1920x1080):
- Audio Transcription: ~30s
- Frame Extraction: ~8s (15x faster with FFmpeg!)
- AI Analysis: ~75s
- PDF Generation: ~5s
- Total: ~2 minutes ⚡
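A per-phase breakdown like the one above could be collected with a small timing helper. The sketch below is only an illustration of the idea, not the tool's actual code; `phase_timer` and `format_timings` are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def phase_timer(name, timings):
    """Store the wall-clock duration of one phase in `timings` (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the phase raises, so partial runs still report timings.
        timings[name] = time.perf_counter() - start


def format_timings(timings):
    """Render one line per phase (insertion order) followed by a total."""
    lines = [f"- {name}: {seconds:.1f}s" for name, seconds in timings.items()]
    lines.append(f"- Total: {sum(timings.values()):.1f}s")
    return "\n".join(lines)
```

Each phase would be wrapped as `with phase_timer("Frame Extraction", timings): ...`, and `format_timings(timings)` printed once the PDF is written.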
- Python 3.8+
- FFmpeg (see FFMPEG_SETUP.md)
- Google Gemini API key
- Groq API key for Whisper transcription
- Clone or download this repository
- Create a virtual environment (recommended):
python -m venv myvenv
.\myvenv\Scripts\activate # Windows
source myvenv/bin/activate # Linux/Mac
- Install dependencies:
pip install -r requirements.txt
- Install FFmpeg (for fast frame extraction):
  - Windows: choco install ffmpeg (or see FFMPEG_SETUP.md)
  - Verify: ffmpeg -version
- Set up your API keys:
  - Copy .env.example to .env
  - Add your API keys:
GOOGLE_API_KEY=your_google_gemini_api_key_here
GROQ_API_KEY=your_groq_api_key_here
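A quick way to confirm that both keys are visible to Python is sketched below. It assumes the variables have already been loaded into the environment (for example by python-dotenv's `load_dotenv()`, which is listed in the dependencies); `missing_api_keys` is a hypothetical helper, not part of the repository.

```python
import os

# Key names taken from the .env template described in this README.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "GROQ_API_KEY")


def missing_api_keys(env=None):
    """Return the required key names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

An empty list means both keys are set; otherwise the returned names show which entries in .env still need a value.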
python main.py path/to/video.mp4
This will:
- Extract audio and create timestamped transcript
- Extract key frames (fast with FFmpeg!)
- Analyze with AI to generate complete procedure
- Generate professional PDF
- Automatically clean up temporary frames
python main.py video.mp4 \
  --output my_sop.pdf \
  --context "Car Tire Repair and Replacement" \
  --company "Shezan Car Garage"

| Option | Description | Default |
|---|---|---|
| video | Path to input video file | (required) |
| -o, --output | Output PDF filename | output_sop.pdf |
| -c, --context | Task context for better analysis | Auto-detected |
| --company | Company name for PDF header | "Your Company" |
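The options in the table map naturally onto `argparse`. The following is a minimal sketch of a parser with the same flags and defaults; how main.py actually defines its arguments is not shown in this README, so `build_parser` is an assumed name and the use of `None` to mean "auto-detect" is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Convert a training video into an SOP manual (PDF).",
    )
    parser.add_argument("video", help="Path to input video file")
    parser.add_argument("-o", "--output", default="output_sop.pdf",
                        help="Output PDF filename")
    # Assumption: None signals that the context should be auto-detected.
    parser.add_argument("-c", "--context", default=None,
                        help="Task context for better analysis")
    parser.add_argument("--company", default="Your Company",
                        help="Company name for PDF header")
    return parser
```

Calling `build_parser().parse_args(["video.mp4"])` yields the defaults from the table, and any flag passed on the command line overrides its default.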
Video Input → Frame Extraction → AI Analysis → PDF Generation
- Extracts frames at 1-2 second intervals
- Resizes images for optimal AI processing
- Maintains timestamp information
- Sends frames/video to Gemini 1.5 Flash
- Uses specialized prompt for SOP generation
- Returns structured JSON with steps and timestamps
- Creates professional document layout
- Embeds images at relevant steps
- Includes safety notes and table of contents
Here's what the generated SOP looks like:
Input: 4-minute training video
Output: Professional 18-page SOP manual
Processing Time: 2 minutes
Professional cover page with title, company name, and date
Automatically generated table of contents with safety considerations
Each step includes clear instructions, timestamp reference, and corresponding image from the video
Includes reassembly and verification steps for complete procedures
- ✅ Cover Page - Professional title page with company branding
- ✅ Table of Contents - Easy navigation to all sections
- ✅ Safety Section - Automatically identified safety considerations
- ✅ Step-by-Step Instructions - Clear, actionable steps with:
- Numbered steps in logical order
- Timestamp references from video
- High-quality images showing each action
- Reasoning/tips for each step
- ✅ Complete Procedures - Includes:
- Disassembly steps
- Repair/maintenance actions
- Reassembly in correct order
- Final verification and testing
Video-to-SOP Generator/
├── main.py # Main application
├── video_processor.py # Frame extraction (FFmpeg)
├── sop_analyzer.py # AI analysis (Gemini)
├── whisper_transcription.py # Audio transcription (Whisper)
├── pdf_generator.py # PDF creation
├── requirements.txt # Dependencies
├── .env.example # API key template
├── Example_output/ # Sample output PDFs (18 pages)
└── README.md # This file
Video Input (.mp4/.webm) → Audio Transcription (timestamped text) → Frame Extraction (key frames) → AI Analysis (complete SOP) → PDF Generation (professional PDF) → Cleanup (auto-delete temp files)
Audio Transcription:
- Extracts audio from video using FFmpeg
- Transcribes with Whisper Large V3 via Groq
- Generates timestamped segments, e.g. [15.3s - 18.7s]: spoken text
- Provides context for better frame-to-instruction matching
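The audio transcription stage can be sketched in two small pieces: an FFmpeg command that strips the audio track, and a formatter that produces the `[start - end]: text` lines. Both functions are hypothetical, and the segment shape (dicts with `start`, `end`, `text`) is an assumption based on Whisper's verbose JSON output rather than something this README specifies.

```python
def build_audio_command(video_path, audio_path):
    """Return an FFmpeg argument list that extracts mono 16 kHz audio."""
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", video_path,
        "-vn",               # drop the video stream
        "-ac", "1",          # mono
        "-ar", "16000",      # 16 kHz sample rate
        audio_path,
    ]


def format_segment(start, end, text):
    """Format one transcript segment as '[15.3s - 18.7s]: spoken text'."""
    return f"[{start:.1f}s - {end:.1f}s]: {text.strip()}"


def format_transcript(segments):
    """Join segments (assumed dicts with 'start', 'end', 'text') into one string."""
    return "\n".join(
        format_segment(s["start"], s["end"], s["text"]) for s in segments
    )
```

The command list can be passed to `subprocess.run(..., check=True)`; the formatted transcript is what would be handed to the analysis step alongside the frames.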
Frame Extraction:
- Uses FFmpeg for fast extraction (15x faster than OpenCV)
- Extracts frames at specified intervals (default: 2 seconds)
- Resizes images for optimal AI processing
- Maintains timestamp information for correlation
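For frame extraction, one way to get a frame every N seconds at a fixed width is FFmpeg's `fps` and `scale` filters. The sketch below only builds the command and maps frame numbers back to approximate timestamps; `build_frame_command` and `frame_timestamp` are hypothetical names, and the real video_processor.py may work differently.

```python
from pathlib import Path


def build_frame_command(video_path, out_dir, interval_seconds=2, resize_width=512):
    """Return an FFmpeg argument list that writes one JPEG per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    pattern = str(Path(out_dir) / "frame_%04d.jpg")
    # fps=1/N keeps one frame every N seconds; scale=W:-2 preserves the
    # aspect ratio while rounding the height to an even number.
    filters = f"fps=1/{interval_seconds},scale={resize_width}:-2"
    return ["ffmpeg", "-y", "-i", str(video_path), "-vf", filters,
            "-q:v", "2", pattern]


def frame_timestamp(frame_number, interval_seconds=2):
    """Approximate source time (seconds) of a 1-based frame number."""
    return (frame_number - 1) * interval_seconds
```

The timestamp is approximate because the `fps` filter picks the nearest source frame, which is usually close enough for matching frames to transcript segments.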
AI Analysis:
- Sends frames and timestamped transcript to Gemini 1.5 Flash
- Uses enhanced prompt for complete procedures
- Cross-references audio timestamps with frame timestamps
- Returns structured JSON with steps, safety notes, and reasoning
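Cross-referencing audio and frame timestamps can be illustrated without calling the model at all. The sketch below assumes the response is JSON with a `steps` list whose entries carry a `timestamp` in seconds, and that transcript segments are dicts with `start`, `end`, and `text`; both shapes and all function names are assumptions made for illustration.

```python
import json


def transcript_for_frame(frame_time, segments, window=1.0):
    """Return the text spoken within `window` seconds of a frame."""
    lo, hi = frame_time - window, frame_time + window
    return " ".join(
        s["text"].strip() for s in segments
        if s["start"] <= hi and s["end"] >= lo  # interval overlap test
    )


def nearest_frame(step_time, frame_times):
    """Pick the extracted frame timestamp closest to a step's timestamp."""
    return min(frame_times, key=lambda t: abs(t - step_time))


def attach_frames(raw_json, frame_times):
    """Parse the (assumed) JSON response and add a 'frame_time' to each step."""
    data = json.loads(raw_json)
    steps = []
    for step in data.get("steps", []):
        enriched = dict(step)
        enriched["frame_time"] = nearest_frame(step["timestamp"], frame_times)
        steps.append(enriched)
    return steps
```

With frames at 14, 16, and 18 seconds, a step stamped 16.2 seconds is paired with the 16-second frame, and `transcript_for_frame(16, segments)` returns whatever was said around that moment.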
PDF Generation:
- Creates professional document layout
- Embeds images at relevant steps
- Includes safety notes and table of contents
- Professional formatting with headers and page numbers
Cleanup:
- Deletes temporary extracted frames
- Keeps only the final PDF
- Prevents old/new frame mixing on next run
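The cleanup step amounts to removing the directory that held the extracted frames. A minimal sketch follows, assuming frames live in a dedicated temporary directory; `make_frames_dir` and `cleanup_frames` are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


def make_frames_dir(prefix="sop_frames_"):
    """Create a fresh temporary directory for this run's frames."""
    return Path(tempfile.mkdtemp(prefix=prefix))


def cleanup_frames(frames_dir):
    """Delete the frames directory; return True if anything was removed."""
    path = Path(frames_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True
```

Using a new directory per run and deleting it afterwards is one way to guarantee that frames from an earlier video never mix with the current one.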
The generated PDF includes:
- Title Page: Task name, description, document info
- Table of Contents: Quick navigation
- Safety Section: Important safety considerations
- Procedure Steps: Step-by-step instructions with:
- Clear numbered steps
- Action-oriented instructions
- Screenshot at each step
- Timestamp reference
- Additional notes/reasoning
Edit video_processor.py:
extractor = VideoFrameExtractor(
    interval_seconds=2,  # Extract 1 frame every 2 seconds
    resize_width=512     # Resize width (maintains aspect ratio)
)
Edit sop_analyzer.py:
generation_config={
    "temperature": 0.4,        # Lower = more consistent
    "max_output_tokens": 8192  # Maximum response length
}
- Make sure you created a .env file (not .env.example)
- Verify the API key is valid
- Install OpenCV:
pip install opencv-python
- Check video format (MP4, MOV supported)
- Ensure video file is not corrupted
- Try with a shorter video first
- Install ReportLab:
pip install reportlab
- Check disk space for output file
- Manufacturing companies
- Industrial training departments
- Safety compliance teams
- Equipment vendors
- Consulting firms
- Per-video pricing: $50-200 per video
- SaaS subscription: $99-499/month
- Enterprise license: Custom pricing
- API access: Pay per API call
- Saves 10+ hours per manual
- Ensures consistency
- Easy updates when procedures change
- Reduces training time
- Improves compliance
- Video quality affects AI accuracy
- Works best with clear, well-lit videos
- Requires stable camera angle
- English language optimized (can be adapted)
- Processing time depends on video length
- Web interface (Flask/Django)
- Multi-language support
- Video quality validation
- Custom branding options
- Step editing interface
- Voice narration in video
- Multiple video formats
- Batch processing
- opencv-python: Video frame extraction
- google-generativeai: Gemini AI API
- reportlab: PDF generation
- Pillow: Image processing
- python-dotenv: Environment configuration
This project is for educational and commercial use.
For questions or issues, please check:
- This README
- Code comments in source files
- API documentation
Built with:
- Google Gemini 1.5 Flash
- OpenCV
- ReportLab
Made for industrial training excellence 🏭





