Str0nkz/tensorRT-helpers

TensorRT Helpers

Lightweight FastAPI runtime for serving TensorRT-based FLAN text generation and MiniLM embeddings through OpenAI-compatible HTTP endpoints.

This repo is focused on inference only. It expects prebuilt TensorRT engine artifacts and tokenizer files to be available on disk or mounted into the container.

Features

  • FLAN text generation over /v1/completions
  • Chat-style generation over /v1/chat/completions
  • MiniLM embeddings over /v1/embeddings
  • Health check endpoint at /health
  • Environment-variable driven runtime configuration
  • Docker image for CUDA/TensorRT deployment

Repository Layout

  • runtime/: FastAPI app, route definitions, config, and TensorRT runners
  • start_inference_server.sh: container entrypoint
  • tensorRT.dockerfile: runtime image definition
  • requirements-trt.txt: Python dependencies

Engine Layout

  • FLAN_ENGINE_DIR should point to the directory containing the FLAN encoder/decoder engine outputs.
  • MINILM_ENGINE_DIR should point to the directory containing the MiniLM embedding engine output.
  • MINILM_TOKENIZER_DIR is optional. If unset, the runtime derives a tokenizer location from the MiniLM engine path.

The runners auto-discover .engine, .plan, or .trt files below the configured directories.
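The discovery step described above can be pictured roughly as follows. This is only a sketch under stated assumptions: `find_engine_files` and `ENGINE_SUFFIXES` are hypothetical names chosen for illustration, and the actual runners in runtime/ may apply a different search order or filtering.

```python
from pathlib import Path

# Extensions the README says the runners look for.
ENGINE_SUFFIXES = {".engine", ".plan", ".trt"}


def find_engine_files(engine_dir):
    """Return engine-like files anywhere below engine_dir (hypothetical helper).

    The result is sorted so that repeated calls give a stable order.
    A missing or non-directory path yields an empty list.
    """
    root = Path(engine_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in ENGINE_SUFFIXES
    )
```

Calling `find_engine_files("/models/flan")` before starting the server is a quick way to check that a mounted volume actually contains engine files.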

Quick Start

pip install -r requirements-trt.txt

export FLAN_ENGINE_DIR=/models/flan
export MINILM_ENGINE_DIR=/models/minilm
# export MINILM_TOKENIZER_DIR=/models/minilm-tokenizer
# export API_KEY=change-me

python -m runtime.server

The API will be available at http://localhost:8000, with interactive docs at http://localhost:8000/docs.

Endpoints

| Endpoint | Purpose |
| --- | --- |
| GET /health | Health and model readiness |
| POST /v1/completions | FLAN text generation |
| POST /v1/chat/completions | Chat-style generation |
| POST /v1/embeddings | MiniLM embeddings |

Configuration

| Variable | Purpose | Default |
| --- | --- | --- |
| FLAN_ENGINE_DIR | Directory containing FLAN TensorRT engines | /models/trt_engines |
| FLAN_ENGINE_NAME | Base name used by the FLAN engine loader | enc_dec |
| MODEL_NAME | Model identifier used for metadata/logging | google/flan-t5-base |
| MODEL_HF_DIR | Local directory containing FLAN tokenizer/config assets | /app/models/flan-t5-base |
| MINILM_ENGINE_DIR | Directory containing MiniLM TensorRT engine files | /app/models/minilm |
| MINILM_TOKENIZER_DIR | Optional tokenizer override for MiniLM | unset |
| MINILM_MAX_LENGTH | Maximum MiniLM input length | 256 |
| API_KEY | Optional bearer token required for /v1/* routes | unset |
| SERVER_HOST | Bind address for the FastAPI app | 0.0.0.0 |
| SERVER_PORT | Port for the FastAPI app | 8000 |
| SERVER_WORKERS | Uvicorn worker count | 1 |
| LOG_LEVEL | Uvicorn log level | info |
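As a rough illustration of environment-driven configuration, the sketch below reads a subset of these variables and falls back to the defaults listed in the table. The `Settings` class and `load_settings` function are hypothetical names, not the repo's actual config module, which may parse or validate values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the documented variables."""
    flan_engine_dir: str
    minilm_engine_dir: str
    minilm_tokenizer_dir: Optional[str]
    minilm_max_length: int
    api_key: Optional[str]
    server_host: str
    server_port: int
    server_workers: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from env, using the defaults from the table above."""
    return Settings(
        flan_engine_dir=env.get("FLAN_ENGINE_DIR", "/models/trt_engines"),
        minilm_engine_dir=env.get("MINILM_ENGINE_DIR", "/app/models/minilm"),
        # Empty strings are treated the same as unset (an assumption).
        minilm_tokenizer_dir=env.get("MINILM_TOKENIZER_DIR") or None,
        minilm_max_length=int(env.get("MINILM_MAX_LENGTH", "256")),
        api_key=env.get("API_KEY") or None,
        server_host=env.get("SERVER_HOST", "0.0.0.0"),
        server_port=int(env.get("SERVER_PORT", "8000")),
        server_workers=int(env.get("SERVER_WORKERS", "1")),
        log_level=env.get("LOG_LEVEL", "info"),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to inspect, for example `load_settings({"SERVER_PORT": "9000"})`.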

Example Requests

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello, TensorRT!","max_tokens":32}'

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Summarize TensorRT in one sentence."}]}'

curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":["sample text","second input"]}'

If API_KEY is set, include Authorization: Bearer <token> in each request.
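For callers using Python instead of curl, the same requests can be assembled with the standard library. The sketch below is illustrative only: `build_request` is a hypothetical helper, and the payloads are the ones from the curl examples above. It attaches the bearer header only when a key is supplied.

```python
import json
import urllib.request


def build_request(base_url, path, payload, api_key=None):
    """Create a JSON POST request, adding a bearer token if api_key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Same body as the embeddings curl example; the key value is a placeholder.
embeddings_request = build_request(
    "http://localhost:8000",
    "/v1/embeddings",
    {"input": ["sample text", "second input"]},
    api_key="change-me",
)
```

With the server running, the request can be sent with `urllib.request.urlopen(embeddings_request)` and the response body decoded with `json.load`.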

Docker

tensorRT.dockerfile builds a runtime image that layers this app on top of NVIDIA CUDA/TensorRT components copied from an NGC TensorRT-LLM release image.

docker build -f tensorRT.dockerfile -t local/tensorrt-helpers .

docker run --rm --gpus all \
  -p 8000:8000 \
  -e FLAN_ENGINE_DIR=/models/flan \
  -e MINILM_ENGINE_DIR=/models/minilm \
  -v /path/to/engines:/models \
  local/tensorrt-helpers

Notes

  • This repository does not build TensorRT engines; it serves prebuilt engine artifacts.
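  • The CUDA, TensorRT, and Python package versions in the runtime must be compatible with your target environment. TensorRT engines are generally tied to the TensorRT version and GPU architecture they were built with, so build them for the environment that will serve them.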
