Unified serving layer for non-text foundation models.
vLLM solved inference for text LLMs by defining a standard compute contract and optimizing behind it. The same problem exists for every other class of foundation model — time series, tabular, molecular, geospatial, diffusion, audio — and nobody has solved it. Sheaf is that solution.
Each model type gets a typed request/response contract. Batching, caching, and scheduling are optimized per model type. Ray Serve is the substrate. Feast is a first-class input primitive.
In mathematics, a sheaf tracks locally-defined data that glues consistently across a space. Each model type defines its own local contract; Sheaf ensures they cohere into a unified serving layer.
```bash
pip install sheaf-serve                        # core only
pip install "sheaf-serve[time-series]"         # + Chronos2 / TimesFM / Moirai
pip install "sheaf-serve[tabular]"             # + TabPFN
pip install "sheaf-serve[molecular]"           # + ESM-3 (Python 3.12+)
pip install "sheaf-serve[genomics]"            # + Nucleotide Transformer
pip install "sheaf-serve[small-molecule]"      # + MolFormer
pip install "sheaf-serve[materials]"           # + MACE-MP
pip install "sheaf-serve[audio]"               # + Whisper / faster-whisper
pip install "sheaf-serve[audio-generation]"    # + MusicGen
pip install "sheaf-serve[tts]"                 # + Bark
pip install "sheaf-serve[vision]"              # + DINOv2 / OpenCLIP / SAM2 / Depth Anything / DETR
pip install "sheaf-serve[earth-observation]"   # + Prithvi
pip install "sheaf-serve[weather]"             # + GraphCast
pip install "sheaf-serve[feast]"               # + Feast feature store integration
pip install "sheaf-serve[modal]"               # + Modal serverless deployment
pip install "sheaf-serve[batch]"               # + offline batch inference (Ray Data)
pip install "sheaf-serve[all]"                 # everything
```

Direct backend inference:
```python
from sheaf.api.time_series import Frequency, OutputMode, TimeSeriesRequest
from sheaf.backends.chronos import Chronos2Backend

backend = Chronos2Backend(model_id="amazon/chronos-bolt-tiny", device_map="cpu")
backend.load()

req = TimeSeriesRequest(
    model_name="chronos-bolt-tiny",
    history=[312, 298, 275, 260, 255, 263, 285, 320,
             368, 402, 421, 435, 442, 438, 430, 425],
    horizon=12,
    frequency=Frequency.HOURLY,
    output_mode=OutputMode.QUANTILES,
    quantile_levels=[0.1, 0.5, 0.9],
)

response = backend.predict(req)
# response.mean, response.quantiles
```

Ray Serve (production, autoscaling):
```python
from sheaf import ModelServer
from sheaf.spec import ModelSpec, ResourceConfig
from sheaf.api.base import ModelType

server = ModelServer(models=[
    ModelSpec(
        name="chronos",
        model_type=ModelType.TIME_SERIES,
        backend="chronos2",
        backend_kwargs={"model_id": "amazon/chronos-bolt-small"},
        resources=ResourceConfig(num_gpus=1),
    ),
])
server.run()  # POST /chronos/predict, GET /chronos/health
```

Feast feature store (resolve features at request time):
```python
# ModelSpec wires Feast — no history needed in the request
spec = ModelSpec(
    name="chronos",
    model_type=ModelType.TIME_SERIES,
    backend="chronos2",
    feast_repo_path="/feast/feature_repo",
)
```

The client sends a `feature_ref` instead of raw history:

```json
{
  "model_type": "time_series",
  "model_name": "chronos",
  "feature_ref": {
    "feature_view": "asset_prices",
    "feature_name": "close_history_30d",
    "entity_key": "ticker",
    "entity_value": "AAPL"
  },
  "horizon": 7,
  "frequency": "1d"
}
```

Modal (serverless, zero-infra):
```python
from sheaf import ModalServer

server = ModalServer(models=[spec], app_name="my-sheaf", gpu="A10G")
app = server.app  # modal deploy my_server.py
```

Docker:
```dockerfile
FROM ghcr.io/korbonits/sheaf-serve:v0.9.0
RUN pip install --no-cache-dir 'sheaf-serve[time-series]==0.9.0'
COPY server.py .
CMD ["python", "server.py"]
```

The base image is sheaf-serve core only; extend with the backend extras you need. See examples/docker/ for a worked example with a runnable server.py.
Kubernetes (KubeRay):
examples/k8s/ ships a RayService manifest that deploys the same ModelSpec shape via the KubeRay operator. sheaf.build_app(spec) returns the Ray Serve Application directly, so it slots into KubeRay's serveConfigV2.applications[].import_path:
```python
# app.py — referenced by the manifest as `import_path: app:app`
from sheaf import build_app
from sheaf.spec import ModelSpec

spec = ModelSpec(name="chronos", ...)
app = build_app(spec)
```

Typed Python client:
```python
from sheaf.client import SheafClient
from sheaf.api.time_series import Frequency, TimeSeriesRequest

with SheafClient(base_url="http://localhost:8000") as client:
    resp = client.predict(
        "chronos",
        TimeSeriesRequest(
            model_name="chronos",
            history=[1.0, 2.0, 3.0, 4.0, 5.0],
            horizon=3,
            frequency=Frequency.HOURLY,
        ),
    )
    # resp is a typed TimeSeriesResponse — same Pydantic class the server returned
    print(resp.mean)
```

`AsyncSheafClient` is the async mirror; `client.stream(deployment, request)` yields SSE events for streaming backends like FLUX.
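A minimal async sketch of the same call, assuming `AsyncSheafClient` is an async context manager whose `predict()` and `stream()` mirror the sync client's signatures; the commented stream loop at the end is illustrative, not confirmed API:

```python
# Hedged sketch: assumes AsyncSheafClient supports `async with` and awaits
# predict()/stream() with the same arguments as SheafClient.
import asyncio

from sheaf.client import AsyncSheafClient
from sheaf.api.time_series import Frequency, TimeSeriesRequest


async def main() -> None:
    async with AsyncSheafClient(base_url="http://localhost:8000") as client:
        resp = await client.predict(
            "chronos",
            TimeSeriesRequest(
                model_name="chronos",
                history=[1.0, 2.0, 3.0, 4.0, 5.0],
                horizon=3,
                frequency=Frequency.HOURLY,
            ),
        )
        print(resp.mean)

        # For streaming backends (e.g. FLUX), per-step SSE events would be
        # consumed as an async generator:
        # async for event in client.stream("flux", diffusion_request):
        #     print(event)


asyncio.run(main())
```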
See examples/ for time series comparison, tabular, audio, vision, and the Feast feature store quickstart.
| Type | Status | Backends |
|---|---|---|
| Time series | ✅ v0.1 | Chronos2, Chronos-Bolt, TimesFM, Moirai |
| Tabular | ✅ v0.1 | TabPFN v2 |
| Audio transcription | ✅ v0.3 | Whisper, faster-whisper |
| Audio generation | ✅ v0.3 | MusicGen |
| Text-to-speech | ✅ v0.3 | Bark |
| Vision embeddings | ✅ v0.3 | OpenCLIP, DINOv2 |
| Segmentation | ✅ v0.3 | SAM2 |
| Depth estimation | ✅ v0.3 | Depth Anything v2 |
| Object detection | ✅ v0.3 | DETR / RT-DETR |
| Protein / molecular | ✅ v0.3 | ESM-3 (Python 3.12+) |
| Genomics | ✅ v0.3 | Nucleotide Transformer |
| Small molecule | ✅ v0.3 | MolFormer-XL |
| Materials science | ✅ v0.3 | MACE-MP-0 |
| Earth observation | ✅ v0.3 | Prithvi (IBM/NASA) |
| Weather forecasting | ✅ v0.3 | GraphCast |
| Cross-modal embeddings | ✅ v0.3 | ImageBind (text, vision, audio, depth, thermal) |
| Feast feature store | ✅ v0.3 | Any Feast online store (SQLite, Redis, DynamoDB, …) |
| Modal serverless | ✅ v0.3 | ModalServer — zero-infra GPU deployment |
| Diffusion / image gen | ✅ v0.4 | FLUX (schnell, dev) |
| Video understanding | ✅ v0.4 | VideoMAE, TimeSformer |
| LiDAR / 3D point cloud | ✅ v0.5 | PointNet (pure PyTorch; embed + ModelNet40 classify) |
| Pose estimation | ✅ v0.5 | ViTPose (COCO 17-keypoint, optional person bboxes) |
| Optical flow | ✅ v0.5 | RAFT (raft_large / raft_small via torchvision) |
| Multimodal generation | ✅ v0.5 | SDXL img2img + inpainting |
| Speech synthesis | ✅ v0.5 | Kokoro (voice + speed per request) |
| Offline batch inference | ✅ v0.6 | BatchRunner (Ray Data; tasks + actor-pool modes) |
| Async-job worker | ✅ v0.7 | SheafWorker (Redis Streams; pluggable queue/result ABCs) |
| LoRA adapter multiplexing | ✅ v0.8 | FLUX, SDXL via ModelSpec.lora (local paths + HF Hub sources) |
v0.2 — serving layer (complete)
- Ray Serve integration tested end-to-end
- Async `predict()` handlers
- HTTP API with proper request validation (422 on bad input)
- Health check and readiness probe endpoints
- Batching scheduler (`BatchPolicy` wired into `@serve.batch` per deployment)
- Error handling at the service boundary (backend exceptions → structured HTTP 500)
- Model hot-swap without restart (`ModelServer.update()`; sketched below)
- Container-friendly auth for TabPFN v2 (`TABPFN_TOKEN` env var)
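A hedged hot-swap sketch, reusing the `server` from the Ray Serve example above: `ModelServer.update(spec)` is the call named in this list and in the v0.8 notes, but whether it accepts a single `ModelSpec` as shown (rather than a full model list) is an assumption:

```python
# Hedged sketch: ModelServer.update(spec) is named in the roadmap; passing a
# single ModelSpec here (rather than a list) is an assumption.
# `server` is the ModelServer started in the Ray Serve example above.
from sheaf.spec import ModelSpec
from sheaf.api.base import ModelType

new_spec = ModelSpec(
    name="chronos",
    model_type=ModelType.TIME_SERIES,
    backend="chronos2",
    backend_kwargs={"model_id": "amazon/chronos-bolt-base"},  # swap to a larger checkpoint
)
server.update(new_spec)  # redeploys "chronos" without restarting the Ray Serve app
```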
v0.3 — model types + integrations (complete)
- ESM-3 protein embeddings
- Nucleotide Transformer genomics embeddings
- MolFormer-XL small molecule embeddings
- MACE-MP-0 materials (energy, forces, stress)
- Whisper / faster-whisper audio transcription
- MusicGen audio generation
- Bark text-to-speech
- OpenCLIP image/text embeddings
- DINOv2 image embeddings
- SAM2 segmentation
- Depth Anything v2 depth estimation
- DETR / RT-DETR object detection
- Prithvi earth observation embeddings
- GraphCast weather forecasting
- ImageBind cross-modal embeddings (text, vision, audio, depth, thermal)
- Feast feature store integration (`feature_ref` in requests, `FeastResolver`, `feast_repo_path` on `ModelSpec`)
- Modal serverless deployment (`ModalServer` — zero-infra alternative to Ray Serve)
v0.4 — generation + video (complete)
- FLUX diffusion / image generation
- VideoMAE / TimeSformer video understanding
v0.5 — observability + new modalities
Ops / DX:
- PyPI publish (v0.4.0)
- Prometheus metrics endpoint per deployment
- Structured logging with request IDs end-to-end
- OpenTelemetry traces through the request path
Serving / infra:
- Streaming responses (`POST /{name}/stream` → SSE; FLUX emits per-step progress events)
- Request caching (`CacheConfig` on `ModelSpec` — in-process LRU, optional TTL; sketched below)
- `bucket_by` batching — group requests by field value before `@serve.batch` (sketched below)
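A hedged configuration sketch for the two items above. `CacheConfig` and `bucket_by` are named in this roadmap, but the keyword names used here (`cache=`, `max_entries`, `ttl_seconds`, `batch=`, and `bucket_by` as a `BatchPolicy` field) and the import locations are assumptions:

```python
# Illustrative only: CacheConfig and bucket_by come from the v0.5 notes; the
# cache=/batch= keywords, max_entries/ttl_seconds fields, and import paths are
# assumed, not confirmed API.
from sheaf.spec import BatchPolicy, CacheConfig, ModelSpec
from sheaf.api.base import ModelType

spec = ModelSpec(
    name="chronos",
    model_type=ModelType.TIME_SERIES,
    backend="chronos2",
    cache=CacheConfig(max_entries=1024, ttl_seconds=300),  # in-process LRU with TTL
    batch=BatchPolicy(bucket_by="frequency"),  # group requests by field value before @serve.batch
)
```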
New model types:
- LiDAR / 3D point cloud (PointNet — pure-PyTorch, no torch-geometric; embed + ModelNet40 classify; install with `pip install 'sheaf-serve[lidar]'`)
- Pose estimation (ViTPose — COCO 17-keypoint skeleton, optional person bboxes; install with `pip install 'sheaf-serve[pose]'`)
- Optical flow (RAFT — raft_large/raft_small via torchvision; (H, W, 2) float32 flow field; install with `pip install 'sheaf-serve[optical-flow]'`)
- Multimodal generation — text+image-conditioned (SDXL img2img + inpainting; install with `pip install 'sheaf-serve[multimodal-generation]'`)
- Speech synthesis with fine-grained control (Kokoro — voice + speed per request; install with `pip install 'sheaf-serve[kokoro]'`)
v0.6 — offline batch inference (complete)
- `BatchRunner` — same backend, same typed contract, offline batch mode; Ray Data `map_batches` substrate, stateless tasks with a worker-local backend cache so `load()` fires once per worker (not once per batch); install with `pip install 'sheaf-serve[batch]'` (sketched below)
- `BatchSpec` — mirrors `ModelSpec` for backend selection; `JsonlSource` / `JsonlSink` in v1; new sources/sinks (S3, Parquet, Delta) slot in as additional `BatchSource` / `BatchSink` subclasses without changing the runner API
- Actor-pool execution mode for warm loads on expensive backends (FLUX, GraphCast, SDXL) — opt-in via `BatchSpec.compute="actors"` + `num_actors=N`; `load()` runs once per actor at `__init__` and persists for the actor's lifetime (#13)
- Resumable checkpointing across process restarts (#12)
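A hedged offline-batch sketch: `BatchRunner`, `BatchSpec`, `JsonlSource` / `JsonlSink`, `compute="actors"`, and `num_actors` all come from the list above; the `sheaf.batch` module path and the `run(source=..., sink=...)` call shape are assumptions:

```python
# Hedged sketch: class names and compute="actors"/num_actors come from the v0.6
# notes; the module path and run() signature are assumptions.
from sheaf.batch import BatchRunner, BatchSpec, JsonlSink, JsonlSource

spec = BatchSpec(
    backend="chronos2",
    backend_kwargs={"model_id": "amazon/chronos-bolt-small"},
    compute="actors",  # actor-pool mode: load() runs once per actor and stays warm
    num_actors=4,
)

runner = BatchRunner(spec)
runner.run(
    source=JsonlSource("requests.jsonl"),   # one typed request per line
    sink=JsonlSink("predictions.jsonl"),
)
```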
v0.7 — async-job queue (complete)
- `SheafWorker` — queue-consumer pattern for long-running inference; v1 ships Redis Streams + consumer groups (horizontal scaling), pluggable `JobQueue` / `ResultStore` ABCs for SQS / Kafka follow-ups; install with `pip install 'sheaf-serve[worker]'` (sketched below)
- Job lifecycle: enqueue → processing → result / dead-letter; at-least-once delivery via XACK-after-persist; per-job webhook on completion (best-effort POST)
- Priority lanes + per-tenant fair queuing
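A hedged worker sketch: `SheafWorker` over Redis Streams comes from the list above; the `sheaf.worker` module path, the constructor arguments, and the `run()` loop are assumptions:

```python
# Hedged sketch: SheafWorker + Redis Streams come from the v0.7 notes; the
# module path, constructor arguments, and run() are assumptions.
from sheaf.worker import SheafWorker
from sheaf.spec import ModelSpec
from sheaf.api.base import ModelType

spec = ModelSpec(
    name="chronos",
    model_type=ModelType.TIME_SERIES,
    backend="chronos2",
)

worker = SheafWorker(
    models=[spec],                         # same ModelSpec shape as online serving
    redis_url="redis://localhost:6379/0",  # Redis Streams + consumer groups
    stream="sheaf-jobs",
)
worker.run()  # enqueue → processing → result / dead-letter; XACK after persisting
```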
v0.8 — LoRA adapter multiplexing (complete)
- `ModelSpec.lora = LoRAConfig(adapters={...}, default="...")` — declare a per-deployment adapter registry; one GPU deployment serves many fine-tunes (sketched below)
- Per-request adapter selection via `DiffusionRequest.adapters` / `MultimodalGenerationRequest.adapters` (with optional `adapter_weights` for fusion)
- First targets: FLUX (FLUX.1-schnell + FLUX.1-dev), SDXL (img2img + inpaint)
- Local paths and HF Hub sources both supported (`hf:org/repo[:weight_file]` convention)
- Bucket-by-resolved-adapter inside Ray Serve batch windows: `set_active_adapters` is called exactly once per homogeneous sub-batch
- Hot-add adapters at runtime without `ModelServer.update(spec)` (deferred — adds VRAM-eviction / index-sync surface area)
- Expose `enable_sequential_cpu_offload` on `FluxBackend` so FLUX + LoRA fits on 16-24 GB GPUs (currently only `enable_model_cpu_offload`, which leaves ~22 GB resident — the Modal LoRA quickstart needs an A100 today; this would unlock A10G)
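A hedged LoRA multiplexing sketch built from the field names above (`ModelSpec.lora`, `LoRAConfig(adapters=..., default=...)`, `DiffusionRequest.adapters`, the `hf:` source convention); the import paths, the `ModelType.DIFFUSION` member name, the `prompt` field, and the adapter repos are assumptions or placeholders:

```python
# Hedged sketch from the v0.8 field names; import paths, ModelType.DIFFUSION,
# the prompt field, and the adapter repos below are assumptions/placeholders.
from sheaf.spec import LoRAConfig, ModelSpec
from sheaf.api.base import ModelType
from sheaf.api.diffusion import DiffusionRequest

spec = ModelSpec(
    name="flux",
    model_type=ModelType.DIFFUSION,
    backend="flux",
    backend_kwargs={"model_id": "black-forest-labs/FLUX.1-schnell"},
    lora=LoRAConfig(
        adapters={
            "watercolor": "hf:acme/flux-watercolor-lora",  # HF Hub source (placeholder repo)
            "lineart": "/adapters/flux-lineart",           # local path
        },
        default="watercolor",
    ),
)

# Per-request adapter selection; adapter_weights can optionally fuse several adapters.
req = DiffusionRequest(
    model_name="flux",
    prompt="a lighthouse at dusk, ink lineart",
    adapters=["lineart"],
)
```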
v0.9 — typed Python client (complete)
Ships as sheaf.client inside sheaf-serve (not a separate sheaf-client PyPI package — schemas stay in one tree, no codegen, no drift). Splittable into its own package later if external client contributors arrive or install footprint becomes a real cost.
- `SheafClient` (sync) + `AsyncSheafClient` (async, `httpx`-backed); typed `predict(deployment, request) -> response` against the discriminated `AnyResponse` union
- `health()` / `ready()` helpers; structured exceptions (`ValidationError` for 422, `ServerError` for 5xx, `ClientError` for transport / decode failures)
- SSE streaming via the `client.stream(deployment, request)` async generator
- `RetryConfig` with exponential backoff (sketched below): configurable status codes, connection-error retry toggle, and a `max_attempts` cap. Streams bypass retry by design (re-running yields interleaved progress events).
- Server-side `request_id` (the UUID minted on the request) is attached to every raised `SheafError` subclass so callers can log-correlate without holding the original request object.
- OpenAPI export via `python -m sheaf.openapi --specs my_module:specs > openapi.json` (or `sheaf.openapi.generate(specs)` programmatically) — backends are not loaded during generation, so it runs without a GPU.
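A hedged retry and error-handling sketch: `RetryConfig`, `max_attempts`, `SheafError`, and `request_id` come from the list above; the `retry=` keyword and `SheafError`'s import path are assumptions:

```python
# Hedged sketch: RetryConfig, max_attempts, SheafError, and request_id come from
# the v0.9 notes; the retry= keyword and SheafError's import path are assumed.
from sheaf.client import RetryConfig, SheafClient, SheafError
from sheaf.api.time_series import Frequency, TimeSeriesRequest

client = SheafClient(
    base_url="http://localhost:8000",
    retry=RetryConfig(max_attempts=5),  # exponential backoff between attempts
)

req = TimeSeriesRequest(
    model_name="chronos",
    history=[1.0, 2.0, 3.0],
    horizon=2,
    frequency=Frequency.HOURLY,
)

try:
    resp = client.predict("chronos", req)
except SheafError as err:
    print(err.request_id)  # server-minted UUID, for log correlation
```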
v0.10 — container + Kubernetes deployment
Today sheaf ships three deployment paths: ModelServer (a local Ray cluster you bring), ModalServer (Modal serverless), and BatchRunner / SheafWorker (offline / async). Production K8s clusters running their own Ray are common and have no first-class story yet — every team rolls their own image.
- Reference `Dockerfile` (multi-stage, uv-based; CPU base + CUDA variant) so teams aren't building this from scratch. Pinned to a sheaf release; rebuilt on tag.
- `examples/k8s/` with a `RayService` manifest — KubeRay's canonical Ray-on-K8s shape — and a short `README.md` covering prereqs (KubeRay operator installed), `kubectl apply`, and a port-forward smoke test.
- GitHub Actions workflow that builds + pushes the Dockerfile to `ghcr.io/korbonits/sheaf-serve:vX.Y.Z` on `v*` tag push, mirroring the PyPI publish flow.
```
┌─────────────────────────────────────────┐
│ API Layer                               │  typed contracts per model type
│   TimeSeriesRequest  TabularRequest ... │
├─────────────────────────────────────────┤
│ Scheduling Layer                        │  model-type-aware batching
│   BatchPolicy  RequestQueue             │
├─────────────────────────────────────────┤
│ Backend Layer                           │  pluggable execution + Ray Serve
│   ModelBackend  CacheManager  Feast     │
└─────────────────────────────────────────┘
```
Adding a new backend takes one class:
```python
from sheaf.backends.base import ModelBackend
from sheaf.registry import register_backend


@register_backend("my-model")
class MyModelBackend(ModelBackend):
    def load(self) -> None:
        self._model = load_my_model()

    def predict(self, request):
        ...

    @property
    def model_type(self):
        return "time_series"
```

Issues and PRs welcome. See CONTRIBUTING.md for development setup.
Apache 2.0