📝 Description
Migrate the AI model inference pipeline to use GPU acceleration so that model inference runs significantly faster and more efficiently. This includes enabling CUDA/cuDNN support where applicable and switching to GPU-enabled builds of the inference libraries.
💡 Rationale
Current CPU-based inference is slow for larger inputs and high-throughput workloads. GPU acceleration will reduce latency and increase throughput for AI-driven features in FireForm, improving user experience and enabling more complex models to be used in production.
🛠️ Proposed Solution
A brief plan to implement GPU acceleration:
- … src/ … to select GPU when available and fall back to CPU.
✅ Acceptance Criteria
- … docs/ … with setup and testing steps for GPU.
📌 Additional Context
- Consider a compatibility matrix of supported CUDA/cuDNN and PyTorch/TensorRT versions.
- If the project supports multiple model backends (Mistral/Ollama/etc.), document which backends can use GPU acceleration and any extra setup required.
- This change may require updates to deployment documentation and CI to support GPU testing, or clear instructions for maintainers on how to validate changes.
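The plan item about selecting the GPU when available and falling back to the CPU could look like the sketch below. It assumes a PyTorch-based backend. `select_device` and the `FIREFORM_DEVICE` environment variable are hypothetical names and are not existing FireForm code. PyTorch is treated as optional so the helper also works on CPU-only installs.

```python
import logging
import os
from typing import Optional

try:
    import torch  # assumption: a PyTorch-based inference backend
except ImportError:  # CPU-only environments without PyTorch installed
    torch = None

logger = logging.getLogger(__name__)


def select_device(preferred: Optional[str] = None) -> str:
    """Return "cuda" when a usable CUDA GPU exists, otherwise "cpu".

    `preferred` (or the hypothetical FIREFORM_DEVICE environment variable)
    may be "auto", "cuda", or "cpu"; "cpu" always forces CPU inference.
    """
    choice = (preferred or os.environ.get("FIREFORM_DEVICE", "auto")).lower()
    if choice not in {"auto", "cuda", "cpu"}:
        raise ValueError(f"unknown device preference: {choice!r}")
    if choice == "cpu":
        return "cpu"
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    if choice == "cuda":
        # A GPU was explicitly requested but cannot be used: degrade, don't crash.
        logger.warning("CUDA requested but unavailable; falling back to CPU")
    return "cpu"
```

In a PyTorch backend, the returned string can be passed to `model.to(...)` and `tensor.to(...)`. Backends that manage the GPU themselves would need their own configuration instead.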
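A small report of the installed CUDA, cuDNN, PyTorch and TensorRT versions could help fill in the compatibility matrix. It could be attached to bug reports or printed in CI logs. This is a sketch that assumes a PyTorch-based backend. `gpu_stack_report` is a hypothetical helper, and any component that is missing is reported as `None`.

```python
from typing import Dict, Optional

try:
    import torch  # assumption: a PyTorch-based inference backend
except ImportError:
    torch = None

try:
    import tensorrt  # optional; only relevant if a TensorRT backend is adopted
except ImportError:
    tensorrt = None


def gpu_stack_report() -> Dict[str, Optional[str]]:
    """Collect the component versions a CUDA/cuDNN/PyTorch/TensorRT
    compatibility matrix would track. None means the component is not
    installed or not usable in this environment."""
    report: Dict[str, Optional[str]] = {
        "torch": None,
        "cuda": None,
        "cudnn": None,
        "tensorrt": None,
        "gpu": None,
    }
    if tensorrt is not None:
        report["tensorrt"] = str(tensorrt.__version__)
    if torch is None:
        return report
    report["torch"] = str(torch.__version__)
    # torch.version.cuda is None for CPU-only PyTorch builds.
    report["cuda"] = torch.version.cuda
    if torch.backends.cudnn.is_available():
        cudnn_version = torch.backends.cudnn.version()
        report["cudnn"] = str(cudnn_version) if cudnn_version is not None else None
    if torch.cuda.is_available():
        # Name of the first visible GPU, useful for matrix rows per device.
        report["gpu"] = torch.cuda.get_device_name(0)
    return report
```

cuDNN appears as the integer returned by `torch.backends.cudnn.version()`. Matrix rows should record it in that form or convert it consistently.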
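For the CI point, one option is to mark GPU tests so they run on GPU-equipped machines and are reported as skipped everywhere else. CPU-only runners then stay green, and maintainers still have a check they can run locally. The sketch below uses the standard-library `unittest`. The test class and its matrix-product check are hypothetical stand-ins for a real model forward pass, and a PyTorch-based backend is assumed.

```python
import unittest

try:
    import torch  # assumption: a PyTorch-based inference backend
except ImportError:
    torch = None


def cuda_available() -> bool:
    """True only when PyTorch is installed and can see a CUDA device."""
    return torch is not None and torch.cuda.is_available()


class GpuInferenceSmokeTest(unittest.TestCase):
    """Hypothetical smoke test: runs on GPU machines, skips elsewhere."""

    @unittest.skipUnless(cuda_available(), "requires a CUDA-capable GPU")
    def test_gpu_matches_cpu(self) -> None:
        # A small matrix product stands in for a model forward pass.
        torch.manual_seed(0)
        x = torch.randn(64, 64)
        expected = x @ x
        actual = (x.cuda() @ x.cuda()).cpu()
        # Loose tolerances allow for GPU kernels summing in a different order.
        self.assertTrue(torch.allclose(actual, expected, rtol=1e-3, atol=1e-3))
```

On a GPU machine, maintainers could run it with `python -m unittest <module>` to validate changes. On a runner without a GPU, the test is reported as skipped rather than failed.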