A fast Rust server for generating text and image embeddings using nomic-embed-text-v1.5 and nomic-embed-vision-v1.5 models via ONNX Runtime.
```sh
# Download model files
make model-txt model-img

# Build
make build

# Run
make run

# Test
curl -X POST localhost:8080/embed \
  -H 'content-type: application/json' \
  -d '{"input": "Hello world"}'
```

Interactive documentation is available at /docs (Swagger UI).
Returns health status and model availability.
Response:

```json
{
  "status": "OK",
  "text_model": true,
  "vision_model": true
}
```

Returns server information, including model paths and configuration.
Response:

```json
{
  "averaging": "geometric",
  "load_cuda": false,
  "txt_model": "models/txt/model_quantized.onnx",
  "tokenizer": "models/txt/tokenizer.json",
  "img_model": "models/img/model_quantized.onnx"
}
```

Generate an embedding for a single text.
Request:

```json
{"input": "Hello world", "dim": 768}
```

- `input` (required): Text to embed
- `dim` (optional): Embedding dimension (1-768). Defaults to 768. Supports Matryoshka embeddings; use smaller dimensions for faster similarity search.
Response:

```json
{
  "embedding": [0.123, 0.456, ...],
  "tokens": 4,
  "time_ms": 12.34
}
```

Example with reduced dimension:

```json
{"input": "Hello world", "dim": 128}
```

Returns a 128-dimensional embedding (faster similarity search, slightly lower quality).
Generate embeddings for multiple texts.

Request:

```json
{"inputs": ["Hello world", "Goodbye world"], "dim": 8}
```

- `inputs` (required): List of texts to embed
- `dim` (optional): Embedding dimension (1-768). Defaults to 768. Applied to all embeddings in the batch.
Response:

```json
{
  "embeddings": [[0.123, 0.456, ...], [0.789, -0.123, ...]],
  "tokens": [4, 5],
  "time_ms": 45.67
}
```

Generate an embedding for a single image.
Request:

```json
{
  "content": "https://example.com/image.jpg",
  "dim": 768
}
```

- `content` (required): Image as URL, data URL (`data:image/jpeg;base64,...`), or raw base64
- `dim` (optional): Embedding dimension (1-768). Defaults to 768.
Response:

```json
{
  "embedding": [0.123, 0.456, ...],
  "time_ms": 89.12
}
```

Limits: maximum 20MB per image (compressed). Supports JPEG, PNG, GIF, WebP, BMP, TIFF.
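Building the data URL form of `content` is a one-liner; a minimal Python sketch (the image bytes here are placeholders — in practice you would read them from a file):

```python
import base64

def to_data_url(image_bytes, mime="image/jpeg"):
    """Encode raw image bytes as a data URL for the `content` field."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Placeholder bytes standing in for real JPEG data.
payload = {"content": to_data_url(b"\xff\xd8\xff\xe0fake-jpeg-bytes"), "dim": 768}
print(payload["content"][:30])
```

Raw base64 (without the `data:` prefix) is also accepted, so the wrapper is only needed when you want the MIME type to travel with the payload.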
Generate embeddings for multiple images.

Request:

```json
{
  "contents": [
    "https://example.com/cat.jpg",
    "data:image/png;base64,iVBORw0KGgo..."
  ],
  "dim": 768
}
```

Response:
```json
{
  "embeddings": [[0.123, ...], [0.789, ...]],
  "time_ms": 178.45
}
```

Extract image statistics, including EXIF metadata and color analysis.
Request:

```json
{
  "content": "https://example.com/image.jpg",
  "averaging_method": "geometric"
}
```

- `content` (required): Image as URL, data URL (`data:image/jpeg;base64,...`), or raw base64
- `averaging_method` (optional): Color averaging method, `arithmetic` or `geometric` (defaults to the `AVERAGING` env var, or `geometric`)
Response:

```json
{
  "exif_data": {
    "Make": "Canon",
    "Model": "EOS 5D",
    "DateTime": "2024:01:01 12:00:00"
  },
  "color_data": {
    "avg_color": {
      "rgb": [0.5, 0.3, 0.2],
      "hex": "#804d33",
      "method": "geometric"
    },
    "dominant_color": {
      "rgb": [0.6, 0.4, 0.3],
      "hex": "#99664d"
    }
  },
  "time_ms": 12.34
}
```

Limits: maximum 20MB per image (compressed). Supports JPEG, PNG, GIF, WebP, BMP, TIFF.
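To illustrate the two `averaging_method` options, here is a toy Python sketch of per-channel arithmetic vs geometric means (illustrative only, not the server's Rust implementation; real pixel decoding, gamma handling, and dominant-color extraction are omitted):

```python
import math

def average_color(pixels, method="geometric"):
    """Average a list of (r, g, b) tuples with channel values in [0, 1].

    arithmetic: plain per-channel mean.
    geometric:  exp of the mean log per channel (an epsilon guards log(0));
                it is pulled toward darker values by near-zero channels.
    """
    eps = 1e-6
    channels = list(zip(*pixels))  # [(r, r, ...), (g, g, ...), (b, b, ...)]
    if method == "arithmetic":
        return tuple(sum(c) / len(c) for c in channels)
    return tuple(math.exp(sum(math.log(v + eps) for v in c) / len(c)) for c in channels)

def to_hex(rgb):
    """Convert a unit-range (r, g, b) tuple to a #rrggbb string."""
    return "#" + "".join(f"{round(v * 255):02x}" for v in rgb)

pixels = [(0.9, 0.1, 0.1), (0.1, 0.9, 0.1)]  # a red and a green pixel
print(to_hex(average_color(pixels, "arithmetic")))
print(to_hex(average_color(pixels, "geometric")))
```

The geometric mean is often preferred for color because it is less dominated by a few very bright pixels, which matches the server's `geometric` default.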
Swagger UI documentation page.
OpenAPI 3.1.0 schema.
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | Server port |
| `MODEL` | `models/txt/model_quantized.onnx` | Path to text ONNX model (fallback for `TXT_MODEL`) |
| `TXT_MODEL` | `models/txt/model_quantized.onnx` | Path to text ONNX model |
| `TOKENIZER` | `models/txt/tokenizer.json` | Path to tokenizer |
| `IMG_MODEL` | `models/img/model_quantized.onnx` | Path to vision ONNX model |
| `USE_GPU` | `false` | Enable GPU inference (`1` or `true`) |
| `AVERAGING` | `geometric` | Default averaging method for image color statistics: `arithmetic` or `geometric` |
| `DISABLE_CORS` | `false` | Disable CORS entirely (`1` or `true`) |
| `CORS_ORIGINS` | (see below) | Comma-separated list of allowed origins |
By default, the server allows requests from localhost only (for local development):

- http://localhost:3000
- http://localhost:8080
- http://127.0.0.1:3000
- http://127.0.0.1:8080
Production deployment: set allowed origins explicitly.

```sh
# Allow specific origins (comma-separated, no wildcards)
CORS_ORIGINS="https://example.com,https://app.example.com,https://api.example.com" ./nomic-serve

# Docker
docker run -p 8080:8080 \
  -e CORS_ORIGINS="https://example.com,https://app.example.com" \
  mindthemath/nomic-text-v1.5-rs:latest-cpu
```

Security notes:
- Origins must be explicitly listed (no wildcard support)
- If `CORS_ORIGINS` is set but contains only invalid values, the server falls back to the localhost defaults (never permissive)
- Invalid origins are dropped and logged
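The fallback behavior can be sketched in a few lines (a Python stand-in for the Rust logic; the `allowed_origins` function and its validation rules are illustrative, not the actual implementation):

```python
from urllib.parse import urlparse

DEFAULT_CORS_ORIGINS = [
    "http://localhost:3000", "http://localhost:8080",
    "http://127.0.0.1:3000", "http://127.0.0.1:8080",
]

def allowed_origins(env_value):
    """Parse a comma-separated origin list. Entries that don't look like
    bare scheme://host origins are dropped; an empty result falls back to
    the localhost defaults (never a permissive wildcard)."""
    if not env_value:
        return DEFAULT_CORS_ORIGINS
    valid = []
    for origin in (o.strip() for o in env_value.split(",")):
        parsed = urlparse(origin)
        if parsed.scheme in ("http", "https") and parsed.netloc and not parsed.path:
            valid.append(origin)
    return valid or DEFAULT_CORS_ORIGINS

print(allowed_origins("https://example.com,not-a-url"))
```

The key property is that a misconfigured `CORS_ORIGINS` degrades to the restrictive defaults rather than to "allow everything".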
Disable CORS entirely (allows all origins; use only for internal APIs):

```sh
DISABLE_CORS=1 ./nomic-serve
```

To modify the default allowed origins, edit `DEFAULT_CORS_ORIGINS` in `src/main.rs`.
The nomic-embed-text-v1.5 model supports Matryoshka embeddings: variable-dimension embeddings that maintain quality at reduced dimensions. Use the `dim` parameter to truncate embeddings for faster similarity search.
Benefits:
- Faster similarity search: Smaller vectors = faster distance calculations
- Reduced storage: Store fewer dimensions per embedding
- Quality preserved: Lower dimensions maintain high quality for most use cases
Recommended dimensions:
- `768` (default): Full quality, best for fine-grained tasks
- `512`: ~99% quality, good balance
- `256`: ~95% quality, faster search
- `128`: ~90% quality, very fast search
- `64`: ~85% quality, fastest search
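Conceptually, shrinking a Matryoshka embedding is just truncation plus re-normalization; a toy sketch (the server's `dim` parameter does this for you, and the 4-d vector below is made up):

```python
import math

def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and re-normalize to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

full = [0.6, 0.8, 0.0, 0.0]   # toy "full" embedding
small = truncate_embedding(full, 2)
print(small)
```

Because the leading dimensions carry most of the signal, cosine similarities computed on the truncated vectors closely track those on the full vectors.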
Example:
```sh
# Full dimension (default)
curl -X POST localhost:8080/embed \
  -H 'content-type: application/json' \
  -d '{"input": "Hello world"}'

# Reduced dimension for faster search
curl -X POST localhost:8080/embed \
  -H 'content-type: application/json' \
  -d '{"input": "Hello world", "dim": 128}'
```

Important: all embeddings in a batch use the same `dim` value. For consistent similarity search, always use the same `dim` for every embedding you compare.
This server processes each text individually rather than batching multiple texts into a single inference call. This is a deliberate design choice required by the quantized nomic-embed-text-v1.5 ONNX model.
We discovered that the quantized text model exhibits severe cross-sample interference during batched inference: when multiple texts are processed together, each text's embedding is affected by the other texts in the batch.
Empirical evidence (see `test_batch_interference.py` and `test_text_batch_fp32.py`):
| Model | Batch composition | Max embedding difference |
|---|---|---|
| Quantized | Same text × 2 (no padding) | 0.000000 ✓ identical |
| Quantized | Text A + Text B (same token count, no padding) | 0.539796 ✗ severe |
| Quantized | Text A + Text C (different token count, with padding) | 0.570585 ✗ severe |
| FP32 | Any batch composition | 0.000000 ✓ perfect |
| PyTorch/Transformers | Any batch composition | 0.000000 ✓ perfect |
Key findings:
- Quantization is the cause — FP32 model batches perfectly (0.000000 diff)
- Text model is more sensitive — Quantized text shows ~0.5 diff vs vision's ~0.02 diff
- PyTorch/Transformers batches perfectly — Confirms it's an ONNX quantization issue, not model architecture
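The "max embedding difference" column is simply the largest per-dimension gap between the two runs; sketched below (the vectors are made up for illustration):

```python
def max_abs_diff(a, b):
    """Largest per-dimension absolute difference between two embeddings."""
    return max(abs(x - y) for x, y in zip(a, b))

solo    = [0.123, 0.456, -0.200]   # hypothetical: text embedded alone
batched = [0.089, 0.512, -0.214]   # hypothetical: same text embedded in a batch
print(max_abs_diff(solo, batched))
```

For a deterministic model the two runs should be bit-identical (difference 0.0), which is exactly what the FP32 and PyTorch rows show.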
The quantized (INT8) text model uses dynamic quantization that computes quantization parameters across the batch:
- Per-batch quantization scales: Parameters computed from batch statistics
- Asymmetric quantization: Zero-point calculations vary with batch composition
- Dequantization precision: Rounding errors accumulate differently in batches
The text model's architecture (BERT-based with mean pooling) appears more sensitive to these quantization artifacts than the vision model (ViT with CLS token extraction).
According to the Nomic Embed Technical Report:
- Base architecture: BERT-based encoder with 137M parameters
- Context length: 8,192 tokens (extended from standard 512)
- Training: Multi-stage contrastive learning with 235M text pairs
- Features: Matryoshka Representation Learning for variable-dimension embeddings
Note: The model architecture itself supports batching (proven by FP32 and PyTorch implementations). The interference is specific to ONNX quantized models.
For most embedding use cases (similarity search, clustering, RAG), embeddings need to be deterministic — the same text should always produce the same embedding. Cross-sample interference violates this:
```
embed("hello") alone                 → [0.123, 0.456, ...]
embed("hello") batched with "world"  → [0.089, 0.512, ...]  # Different!
```
This could cause:
- Inconsistent search results depending on what else was in the batch
- Non-reproducible experiments
- Subtle bugs that are hard to diagnose
By processing each text individually (batch_size=1) with the quantized model, we guarantee:
- Deterministic results: Same text → same embedding, always
- No cross-sample interference: Each text processed in isolation
- Correctness over speed: Throughput is lower, but results are reliable
If you need batching for text embeddings:

- Use the FP32 model (`model.onnx` instead of `model_quantized.onnx`): FP32 batches perfectly (0.000000 difference, cosine similarity = 1.0)
- Trade-off: larger model size (~375MB vs ~131MB), but batching works

Current implementation: the quantized model is used by default, so sequential processing is required.
Since batching isn't viable for correctness, here are alternatives for handling high request volumes:
Run multiple server instances behind a load balancer:

```sh
# Instance 1
PORT=8080 ./target/release/nomic-serve &

# Instance 2
PORT=8081 ./target/release/nomic-serve &

# Instance 3
PORT=8082 ./target/release/nomic-serve &
```

Use nginx, HAProxy, or a cloud load balancer to distribute requests.
For CPU-bound workloads, run N instances, where N = number of CPU cores:

```sh
for port in $(seq 8080 8087); do
  PORT=$port ./target/release/nomic-serve &
done
```

The server already uses Tokio for async I/O. Multiple concurrent requests are handled efficiently; they just can't share a single inference call.
If you have repeated texts, cache embeddings:
- In-memory cache (Redis, memcached)
- Persistent cache (database, vector store)
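A cache only needs to key on the text and the requested dimension; a toy in-process LRU sketch (a stand-in for Redis or memcached, with `fake_embed` standing in for a POST to `/embed`):

```python
from collections import OrderedDict

class EmbeddingCache:
    """Tiny LRU cache keyed by (text, dim)."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._store = OrderedDict()

    def get_or_compute(self, text, dim, compute):
        key = (text, dim)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        value = compute(text, dim)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        return value

calls = []
def fake_embed(text, dim):   # hypothetical stand-in for the HTTP call
    calls.append(text)
    return [0.0] * dim

cache = EmbeddingCache()
cache.get_or_compute("hello", 128, fake_embed)
cache.get_or_compute("hello", 128, fake_embed)  # served from cache
print(len(calls))
```

Including `dim` in the key matters: the same text at `dim=128` and `dim=768` are different vectors and must not collide.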
For high-volume batch jobs:
[Requests] → [Queue (Redis/RabbitMQ)] → [Worker Pool] → [Results]
Workers process texts sequentially but in parallel across the pool.
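The queue pattern can be sketched with stdlib primitives (a toy in-process version; in production the queue would be Redis/RabbitMQ and each worker would call the HTTP API, here replaced by a plain function):

```python
import queue
import threading

def embed_worker(tasks, results, embed_fn):
    """Drain tasks one at a time (batch_size=1) until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        idx, text = item
        results[idx] = embed_fn(text)

def embed_all(texts, embed_fn, workers=4):
    """Fan texts out to a worker pool; each worker embeds sequentially."""
    tasks = queue.Queue()
    results = {}
    threads = [threading.Thread(target=embed_worker, args=(tasks, results, embed_fn))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for i, text in enumerate(texts):
        tasks.put((i, text))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return [results[i] for i in range(len(texts))]

out = embed_all(["a", "bb", "ccc"], lambda s: [float(len(s))])
print(out)
```

Each text is still embedded in isolation, so determinism is preserved; only the dispatch is parallel.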
```
target/release/nomic-serve        # 38MB binary (includes CPU + GPU support + image stats)
models/txt/model_quantized.onnx   # 131MB text model
models/txt/tokenizer.json         # 695KB tokenizer
models/img/model_quantized.onnx   # 93MB vision model
```

Total: ~263MB
Note: The binary includes both CPU and GPU support. GPU code adds ~2MB but is only loaded when USE_GPU=1 is set. For CPU-only deployments, you can build without the cuda feature to save 2MB (remove "cuda" from Cargo.toml features).
CPU inference: Standard C libraries (glibc, libstdc++). No GPU drivers or CUDA needed.
GPU inference: NVIDIA CUDA drivers and CUDA toolkit. The binary is built with CUDA support enabled by default (can be disabled by removing "cuda" feature from Cargo.toml).
Multi-stage Dockerfile included. Build and push images:

```sh
# Build both CPU and GPU images
make docker-build

# Build a specific image
make docker-build-cpu   # CPU-only (debian:bookworm-slim)
make docker-build-gpu   # GPU/CUDA (nvidia/cuda:12.1.0-runtime)

# Push to DockerHub (requires docker login)
make docker-push
```

Images:

- `mindthemath/nomic-text-v1.5-rs:latest-cpu`: CPU-only deployment
- `mindthemath/nomic-text-v1.5-rs:latest-gpu`: GPU/CUDA deployment

Usage:

```sh
# CPU
docker run -p 8080:8080 --dns 1.1.1.1 --dns 1.0.0.1 mindthemath/nomic-text-v1.5-rs:latest-cpu

# GPU (requires nvidia-docker)
docker run --gpus all -p 8080:8080 --dns 1.1.1.1 --dns 1.0.0.1 mindthemath/nomic-text-v1.5-rs:latest-gpu
```

Note: the `--dns` flags are recommended for the image embedding endpoints (`/img/embed`, `/img/batch`) to ensure fast DNS resolution. Cloudflare DNS (1.1.1.1) is used for privacy and performance. Without DNS configuration, image URL fetching may be slow (10+ seconds) due to Docker's default DNS configuration.
Image size (as shown by `docker images`):

- CPU image: ~358MB
  - Binary: 38MB
  - Text model files (`model_quantized.onnx` + `tokenizer.json`): 132MB
  - Vision model (`model_quantized.onnx`): 93MB
  - Base image (`debian:bookworm-slim`): 74.8MB
  - Runtime dependencies (ca-certificates, libssl3, dumb-init): 9.2MB
  - Layer compression overhead: ~12MB
- GPU image: ~2.7GB
  - Binary: 38MB
  - Text model files (`model_quantized.onnx` + `tokenizer.json`): 132MB
  - Vision model (`model_quantized.onnx`): 93MB
  - ONNX Runtime CUDA provider libraries: 196MB
  - CUDA runtime base image (`nvidia/cuda:12.1.0-runtime-ubuntu22.04`): 2.23GB
  - Runtime dependencies: 8MB
  - Layer compression overhead: ~100MB
  - Note: the `-runtime` variant is required (not `-base`), as it includes the CUDA runtime libraries needed by ONNX Runtime's CUDA execution provider. The CUDA providers add significant size but are required for GPU inference.

Note: the CPU Docker image includes both text and vision models for full multimodal support. The GPU image is significantly larger due to the CUDA runtime base image (~2.23GB) required for GPU inference.
GitHub Actions: Automatically builds and pushes images on tag releases (e.g., v1.0.0).
```ini
[Unit]
Description=Nomic Embedding Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/nomic-serve
ExecStart=/opt/nomic-serve/nomic-serve
Environment=PORT=8080
Restart=always

[Install]
WantedBy=multi-user.target
```

```sh
make fmt        # Format code
make build      # Build release binary
make run        # Run server
make test       # Test single embedding
make test-list  # Test multiple embeddings
make health     # Health check
```

Compare different ONNX model quantizations against the fp32 baseline (full precision, unquantized):
```sh
# Download all model variants
make models-all

# Run comparison tests (starts multiple servers, runs tests, cleans up)
make test-models
```

This will:

- Start a server for each model variant on a different port (8080-8083):
  - Port 8080: `model.onnx` (fp32, baseline)
  - Port 8081: `model_quantized.onnx` (quantized)
  - Port 8082: `model_q4f16.onnx` (4-bit quantized)
  - Port 8083: `model_fp16.onnx` (half precision)
- Run the same test texts through each model
- Compare embeddings using:
  - Cosine similarity: 1.0 = identical, 0.95+ = very similar, <0.9 = different
  - L2 distance: lower is better (0 = identical)
  - Max/mean absolute differences: per-dimension differences
  - Latency: speed comparison
Interpreting Results:
- Cosine similarity > 0.95: Embeddings are very similar, quantization quality is good
- Cosine similarity 0.9-0.95: Moderate differences, may affect fine-grained tasks
- Cosine similarity < 0.9: Significant differences, may not be suitable for production
- Speedup > 1.0: Faster than baseline (good!)
- Speedup < 1.0: Slower than baseline (quantization overhead)
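The two similarity metrics are standard; a minimal Python version (with toy 2-d vectors standing in for real 768-d embeddings):

```python
import math

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def l2_distance(a, b):
    """Euclidean distance between the two vectors (0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

baseline  = [0.6, 0.8]   # toy fp32 embedding
quantized = [0.8, 0.6]   # toy quantized embedding
print(cosine_similarity(baseline, quantized), l2_distance(baseline, quantized))
```

Since the served embeddings are unit-normalized, cosine similarity and L2 distance carry the same information (L2² = 2 − 2·cos), but both are reported for convenience.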
Quantization benefits apply primarily on GPU/TPU, not CPU:

- On GPU: quantized models (INT8, INT4) can be 2-4x faster due to:
  - Specialized tensor cores (e.g., NVIDIA's INT8 cores)
  - Reduced memory bandwidth (smaller model size)
  - Optimized quantized kernels
- On CPU: quantized models are often slower because:
  - CPUs lack specialized INT4/INT8 instructions
  - Dequantization overhead negates the benefits
  - The ONNX Runtime CPU execution provider may not optimize quantized ops
  - Memory bandwidth is rarely the bottleneck on CPU
Recommendation for CPU inference: Use fp32 (model.onnx) for best performance. The quantized models are smaller on disk but don't provide speedups on CPU.
Note: If model_quantized.onnx shows identical results to fp32 (cosine 1.0), it may be using a lossless quantization scheme or the ONNX Runtime is automatically dequantizing to fp32 for CPU execution.
Requirements: Python with the `requests` library (`pip install requests` or `pip install -r scripts/requirements.txt`).
The comparison script handles server lifecycle automatically - starts servers, waits for readiness, runs tests, and cleans up on exit (including Ctrl-C).
To test model variants on GPU (NVIDIA CUDA):

```sh
# Test all models on GPU
make test-models-gpu
```

GPU requirements:

- NVIDIA GPU with CUDA support
- CUDA drivers installed (`nvidia-smi` should work)
- CUDA toolkit (for building the ort crate with the CUDA feature, already enabled)
- ONNX Runtime CUDA provider library: the ort crate needs `libonnxruntime_providers_shared.so` to be available. If you see "Failed to load library libonnxruntime_providers_shared.so", the server will fall back to CPU automatically.
Note: If CUDA libraries aren't found, the server will automatically fall back to CPU execution. Check server logs (/tmp/nomic-serve-*.log) to see which execution provider is actually being used.
Note: On GPU, quantized models (especially INT8/INT4) should show significant speedups (2-4x) compared to CPU results. The GPU comparison will help you choose the best quantization for your GPU setup.
GPU Architecture Notes:
- RTX 30-series (Ampere): INT8 quantization typically performs best (2-3x speedup). fp16 may not show speedups due to conversion overhead or lack of optimized kernels.
- RTX 40-series / L40S (Ada Lovelace): Better fp16/BF16 support with 4th-gen Tensor Cores. fp16 should show better performance than on Ampere.
- A100/H100: Excellent fp16/BF16 performance, often matching or exceeding INT8 for many workloads.
Running the server with GPU:

```sh
# The script automatically sets LD_LIBRARY_PATH, but for manual runs:
ORT_LIB_DIR=$(readlink -f target/release/deps/libonnxruntime_providers_shared.so 2>/dev/null | xargs dirname)
LD_LIBRARY_PATH="$ORT_LIB_DIR:$LD_LIBRARY_PATH" USE_GPU=1 ./target/release/nomic-serve
```

Note: the `make test-models-gpu` script automatically finds and sets `LD_LIBRARY_PATH` to point to the ONNX Runtime CUDA provider library. If you run the server manually with `USE_GPU=1`, you may need to set `LD_LIBRARY_PATH` yourself.
Text Model:
- Model: nomic-ai/nomic-embed-text-v1.5
- Embedding dimension: 768
- Max sequence length: 8,192 tokens
- Pooling: Mean pooling over non-padding tokens
Vision Model:
- Model: nomic-ai/nomic-embed-vision-v1.5
- Embedding dimension: 768
- Input size: 224×224 (auto-resized)
- Pooling: CLS token extraction
Both models share the same embedding space via contrastive training, enabling direct comparison of text and image embeddings.
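The two pooling strategies listed above can be sketched with toy token vectors (illustrative only, not real model internals):

```python
def mean_pool(token_embeddings, attention_mask):
    """Mean pooling over non-padding tokens (text model strategy)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [v / count for v in total]

def cls_pool(token_embeddings):
    """CLS-token extraction: the first token's vector (vision model strategy)."""
    return token_embeddings[0]

tokens = [[1.0, 0.0], [0.0, 1.0], [9.9, 9.9]]  # last entry is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # padding excluded from the mean
print(cls_pool(tokens))
```

Note how the padding vector is excluded from the mean entirely; including it would corrupt the text embedding, which is why the attention mask matters.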
Model license: Apache 2.0
- Nomic Embed Technical Report (2024)
- Nomic Embed v1 Blog Post
- HuggingFace Text Model
- HuggingFace Vision Model
MIT