Lightweight FastAPI runtime for serving TensorRT-based FLAN text generation and MiniLM embeddings through OpenAI-compatible HTTP endpoints.
This repo is focused on inference only. It expects prebuilt TensorRT engine artifacts and tokenizer files to be available on disk or mounted into the container.
- FLAN text generation over `/v1/completions`
- Chat-style generation over `/v1/chat/completions`
- MiniLM embeddings over `/v1/embeddings`
- Health check endpoint at `/health`
- Environment-variable driven runtime configuration
- Docker image for CUDA/TensorRT deployment

- `runtime/`: FastAPI app, route definitions, config, and TensorRT runners
- `start_inference_server.sh`: container entrypoint
- `tensorRT.dockerfile`: runtime image definition
- `requirements-trt.txt`: Python dependencies

- `FLAN_ENGINE_DIR` should point to the directory containing the FLAN encoder/decoder engine outputs.
- `MINILM_ENGINE_DIR` should point to the directory containing the MiniLM embedding engine output.
- `MINILM_TOKENIZER_DIR` is optional. If unset, the runtime derives a tokenizer location from the MiniLM engine path.

The runners auto-discover .engine, .plan, or .trt files below the configured directories.
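
The discovery step described above can be sketched roughly as follows. This is an illustration only, not the runners' actual code: the function name `discover_engines` and the choice to return every match sorted by path are assumptions, and the real runners may pick or order files differently.

```python
from pathlib import Path

# Suffixes named in this README; how the real runners prioritize them is not documented here.
ENGINE_SUFFIXES = (".engine", ".plan", ".trt")


def discover_engines(engine_dir):
    """Return every engine-like file below engine_dir, sorted by path.

    Hypothetical helper: returns an empty list if the directory does not exist.
    """
    root = Path(engine_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in ENGINE_SUFFIXES
    )


if __name__ == "__main__":
    # Print whatever would be picked up from the example path used in this README.
    for path in discover_engines("/models/flan"):
        print(path)
```

Running a check like this against the mounted directories before starting the server is a quick way to confirm that the engine files are where the runtime expects them.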

    pip install -r requirements-trt.txt
    export FLAN_ENGINE_DIR=/models/flan
    export MINILM_ENGINE_DIR=/models/minilm
    # export MINILM_TOKENIZER_DIR=/models/minilm-tokenizer
    # export API_KEY=change-me
    python -m runtime.server

The API will be available at http://localhost:8000, with interactive docs at http://localhost:8000/docs.

| Endpoint | Purpose |
|---|---|
| `GET /health` | Health and model readiness |
| `POST /v1/completions` | FLAN text generation |
| `POST /v1/chat/completions` | Chat-style generation |
| `POST /v1/embeddings` | MiniLM embeddings |

| Variable | Purpose | Default |
|---|---|---|
| `FLAN_ENGINE_DIR` | Directory containing FLAN TensorRT engines | `/models/trt_engines` |
| `FLAN_ENGINE_NAME` | Base name used by the FLAN engine loader | `enc_dec` |
| `MODEL_NAME` | Model identifier used for metadata/logging | `google/flan-t5-base` |
| `MODEL_HF_DIR` | Local directory containing FLAN tokenizer/config assets | `/app/models/flan-t5-base` |
| `MINILM_ENGINE_DIR` | Directory containing MiniLM TensorRT engine files | `/app/models/minilm` |
| `MINILM_TOKENIZER_DIR` | Optional tokenizer override for MiniLM | unset |
| `MINILM_MAX_LENGTH` | Maximum MiniLM input length | `256` |
| `API_KEY` | Optional bearer token required for `/v1/*` routes | unset |
| `SERVER_HOST` | Bind address for the FastAPI app | `0.0.0.0` |
| `SERVER_PORT` | Port for the FastAPI app | `8000` |
| `SERVER_WORKERS` | Uvicorn worker count | `1` |
| `LOG_LEVEL` | Uvicorn log level | `info` |

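
As a rough illustration of how these variables and defaults might be resolved, here is a minimal sketch. The `RuntimeSettings` class and `load_settings` function are hypothetical names, not the repo's actual config module, and treating an empty string as "unset" for the optional variables is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeSettings:  # hypothetical container, not the repo's real config class
    flan_engine_dir: str
    flan_engine_name: str
    model_name: str
    model_hf_dir: str
    minilm_engine_dir: str
    minilm_tokenizer_dir: Optional[str]
    minilm_max_length: int
    api_key: Optional[str]
    server_host: str
    server_port: int
    server_workers: int
    log_level: str


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), using the table's defaults."""
    env = os.environ if env is None else env
    return RuntimeSettings(
        flan_engine_dir=env.get("FLAN_ENGINE_DIR", "/models/trt_engines"),
        flan_engine_name=env.get("FLAN_ENGINE_NAME", "enc_dec"),
        model_name=env.get("MODEL_NAME", "google/flan-t5-base"),
        model_hf_dir=env.get("MODEL_HF_DIR", "/app/models/flan-t5-base"),
        minilm_engine_dir=env.get("MINILM_ENGINE_DIR", "/app/models/minilm"),
        # Assumption: an empty value counts as unset for the optional variables.
        minilm_tokenizer_dir=env.get("MINILM_TOKENIZER_DIR") or None,
        minilm_max_length=int(env.get("MINILM_MAX_LENGTH", "256")),
        api_key=env.get("API_KEY") or None,
        server_host=env.get("SERVER_HOST", "0.0.0.0"),
        server_port=int(env.get("SERVER_PORT", "8000")),
        server_workers=int(env.get("SERVER_WORKERS", "1")),
        log_level=env.get("LOG_LEVEL", "info"),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes it easy to check an intended configuration without exporting anything in the shell.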

    curl -X POST http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"prompt":"Hello, TensorRT!","max_tokens":32}'

    curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"Summarize TensorRT in one sentence."}]}'

    curl -X POST http://localhost:8000/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{"input":["sample text","second input"]}'

If `API_KEY` is set, include `Authorization: Bearer <token>` in each request.

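
The same requests can be issued from Python. The sketch below uses only the standard library and mirrors the curl examples above; `build_request` and `post_json` are hypothetical helper names, and the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request


def build_request(base_url, path, payload, api_key=None):
    """Build a JSON POST request, adding a bearer token only when api_key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def post_json(base_url, path, payload, api_key=None, timeout=30.0):
    """Send the request and return the decoded JSON body."""
    req = build_request(base_url, path, payload, api_key=api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running server; set api_key to the API_KEY value if one is configured.
    result = post_json(
        "http://localhost:8000",
        "/v1/embeddings",
        {"input": ["sample text", "second input"]},
        api_key=None,
    )
    print(result)
```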
tensorRT.dockerfile builds a runtime image that layers this app on top of NVIDIA CUDA/TensorRT components copied from an NGC TensorRT-LLM release image.

    docker build -f tensorRT.dockerfile -t local/tensorrt-helpers .
    docker run --rm --gpus all \
      -p 8000:8000 \
      -e FLAN_ENGINE_DIR=/models/flan \
      -e MINILM_ENGINE_DIR=/models/minilm \
      -v /path/to/engines:/models \
      local/tensorrt-helpers

- This repository does not build TensorRT engines; it serves prebuilt engine artifacts.
- The CUDA, TensorRT, and Python package versions must still be compatible with your target environment.