TensorRT Helpers

Lightweight FastAPI runtime for serving TensorRT-based FLAN text generation and MiniLM embeddings through OpenAI-compatible HTTP endpoints.

This repo is focused on inference only. It expects prebuilt TensorRT engine artifacts and tokenizer files to be available on disk or mounted into the container.

Features

FLAN text generation over /v1/completions
Chat-style generation over /v1/chat/completions
MiniLM embeddings over /v1/embeddings
Health check endpoint at /health
Environment-variable-driven runtime configuration
Docker image for CUDA/TensorRT deployment

Repository Layout

runtime/: FastAPI app, route definitions, config, and TensorRT runners
start_inference_server.sh: container entrypoint
tensorRT.dockerfile: runtime image definition
requirements-trt.txt: Python dependencies

Engine Layout

FLAN_ENGINE_DIR should point to the directory containing the FLAN encoder/decoder engine outputs.
MINILM_ENGINE_DIR should point to the directory containing the MiniLM embedding engine output.
MINILM_TOKENIZER_DIR is optional. If unset, the runtime derives a tokenizer location from the MiniLM engine path.

The runners auto-discover .engine, .plan, or .trt files below the configured directories.
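The discovery step described above can be sketched roughly as follows. This is an illustrative approximation, not the code in runtime/: the function name `discover_engines` and the sorted-by-path ordering are assumptions, and the real runners may apply extra rules (for example, matching on FLAN_ENGINE_NAME).

```python
from pathlib import Path

# File extensions the README says the runners look for.
ENGINE_SUFFIXES = {".engine", ".plan", ".trt"}


def discover_engines(root):
    """Return every candidate engine file at or below `root`, sorted by path.

    Hypothetical helper: the name and the ordering are assumptions made
    for illustration only.
    """
    root = Path(root)
    if not root.is_dir():
        # A missing or unmounted directory yields no candidates.
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in ENGINE_SUFFIXES
    )
```

Running `discover_engines("/models/flan")` before starting the server is a quick way to confirm that a volume mount actually exposes the engine files you expect.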

Quick Start

pip install -r requirements-trt.txt

export FLAN_ENGINE_DIR=/models/flan
export MINILM_ENGINE_DIR=/models/minilm
# export MINILM_TOKENIZER_DIR=/models/minilm-tokenizer
# export API_KEY=change-me

python -m runtime.server

The API will be available at http://localhost:8000, with interactive docs at http://localhost:8000/docs.

Endpoints

| Endpoint | Purpose |
| --- | --- |
| `GET /health` | Health and model readiness |
| `POST /v1/completions` | FLAN text generation |
| `POST /v1/chat/completions` | Chat-style generation |
| `POST /v1/embeddings` | MiniLM embeddings |
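Because engines can take a while to load, a client may want to poll `GET /health` before sending traffic. The sketch below assumes only that the endpoint answers HTTP 200 once the server is ready; the response body is not documented here, so it is not inspected. The `wait_until_ready` name and the `opener` parameter (there to make the function testable without a network) are hypothetical.

```python
import time
import urllib.request


def wait_until_ready(url="http://localhost:8000/health", attempts=30,
                     delay=1.0, opener=urllib.request.urlopen):
    """Poll `url` until it returns HTTP 200 or `attempts` run out.

    Assumption: a 200 status means the server is up. Connection errors
    and non-2xx responses are treated as "not ready yet".
    """
    for attempt in range(attempts):
        if attempt:
            time.sleep(delay)  # Wait between attempts, not before the first.
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError and HTTPError are both subclasses of OSError.
            pass
    return False
```

A deployment script could call `wait_until_ready()` right after `docker run` and abort if it returns `False`.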

Configuration

| Variable | Purpose | Default |
| --- | --- | --- |
| `FLAN_ENGINE_DIR` | Directory containing FLAN TensorRT engines | `/models/trt_engines` |
| `FLAN_ENGINE_NAME` | Base name used by the FLAN engine loader | `enc_dec` |
| `MODEL_NAME` | Model identifier used for metadata/logging | `google/flan-t5-base` |
| `MODEL_HF_DIR` | Local directory containing FLAN tokenizer/config assets | `/app/models/flan-t5-base` |
| `MINILM_ENGINE_DIR` | Directory containing MiniLM TensorRT engine files | `/app/models/minilm` |
| `MINILM_TOKENIZER_DIR` | Optional tokenizer override for MiniLM | unset |
| `MINILM_MAX_LENGTH` | Maximum MiniLM input length | `256` |
| `API_KEY` | Optional bearer token required for `/v1/*` routes | unset |
| `SERVER_HOST` | Bind address for the FastAPI app | `0.0.0.0` |
| `SERVER_PORT` | Port for the FastAPI app | `8000` |
| `SERVER_WORKERS` | Uvicorn worker count | `1` |
| `LOG_LEVEL` | Uvicorn log level | `info` |
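To make the defaults concrete, here is a minimal sketch of how a subset of these variables could be read into a typed settings object. It is not the repository's actual config module: the `Settings` class, its field names, and `load_settings` are hypothetical, and only the variable names and default values are taken from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the documented variables."""
    flan_engine_dir: str
    minilm_engine_dir: str
    minilm_tokenizer_dir: Optional[str]
    minilm_max_length: int
    api_key: Optional[str]
    host: str
    port: int
    workers: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, falling back to the documented defaults."""
    return Settings(
        flan_engine_dir=env.get("FLAN_ENGINE_DIR", "/models/trt_engines"),
        minilm_engine_dir=env.get("MINILM_ENGINE_DIR", "/app/models/minilm"),
        # Assumption: an empty string is treated the same as unset.
        minilm_tokenizer_dir=env.get("MINILM_TOKENIZER_DIR") or None,
        minilm_max_length=int(env.get("MINILM_MAX_LENGTH", "256")),
        api_key=env.get("API_KEY") or None,
        host=env.get("SERVER_HOST", "0.0.0.0"),
        port=int(env.get("SERVER_PORT", "8000")),
        workers=int(env.get("SERVER_WORKERS", "1")),
        log_level=env.get("LOG_LEVEL", "info"),
    )
```

Passing a plain dict, as in `load_settings({"SERVER_PORT": "9000"})`, makes it easy to check an override without touching the process environment.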

Example Requests

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello, TensorRT!","max_tokens":32}'

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Summarize TensorRT in one sentence."}]}'

curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":["sample text","second input"]}'

If API_KEY is set, include an Authorization: Bearer <token> header in each request to the /v1/* routes.
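The same requests can be issued from Python using only the standard library. The sketch below mirrors the curl examples and adds the bearer header when a key is supplied. The helper names `build_request` and `post_json` are hypothetical, and the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request


def build_request(base_url, path, payload, api_key=None):
    """Build a JSON POST request, adding a bearer token if one is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def post_json(base_url, path, payload, api_key=None, timeout=30):
    """Send the request and return the decoded JSON response body."""
    req = build_request(base_url, path, payload, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `post_json("http://localhost:8000", "/v1/embeddings", {"input": ["sample text"]}, api_key="change-me")` sends the embeddings request with the token that the server was started with.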

Docker

tensorRT.dockerfile builds a runtime image that layers this app on top of NVIDIA CUDA/TensorRT components copied from an NGC TensorRT-LLM release image.

docker build -f tensorRT.dockerfile -t local/tensorrt-helpers .

docker run --rm --gpus all \
  -p 8000:8000 \
  -e FLAN_ENGINE_DIR=/models/flan \
  -e MINILM_ENGINE_DIR=/models/minilm \
  -v /path/to/engines:/models \
  local/tensorrt-helpers

Notes

This repository does not build TensorRT engines; it serves prebuilt engine artifacts.
The CUDA, TensorRT, and Python package versions must be compatible with your target environment.
