Merged
32 changes: 32 additions & 0 deletions docs/tutorials/foundation-model-timeseries.md
@@ -283,3 +283,35 @@ model = TimeSeriesFoundationModel(
)
endpoint = model.deploy(inference_mode="serverless")
```

## Choosing an instance type

Each inference mode has a different sweet spot. The defaults work for most users; read on if you want to optimize for cost or throughput, or if the default GPU instance type isn't available in your account or region.

### Real-time endpoints

Pick a GPU instance if you need low latency. Pick a CPU instance if cost matters more or if no GPU instance is available in your chosen region.

| Instance | Cost\* | Throughput\* | Notes |
|---|---|---|---|
| `ml.c6i.2xlarge` | 1× | 1× | CPU fallback. Cheap, broadly available. |
| `ml.g4dn.xlarge` | ~2× | ~5× | Budget GPU. |
| `ml.g5.xlarge` (default) | ~3× | ~10× | Lowest latency. |

\*Rough ratios relative to `ml.c6i.2xlarge`. Throughput measured on `chronos-2` with a 100-item × 2000-step context, 64-step horizon. Actual numbers depend on payload, region, and pricing.
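
To compare the options on cost efficiency, the ratios in the table can be combined into a rough throughput-per-cost figure. The sketch below is illustrative only: it reuses the approximate ratios from the table as-is, and the `RATIOS` dictionary and `throughput_per_cost` helper are names made up for this example, not part of any library.

```python
# Approximate ratios copied from the table above, relative to ml.c6i.2xlarge.
# These are rough figures, not measured prices or benchmarks.
RATIOS = {
    "ml.c6i.2xlarge": {"cost": 1.0, "throughput": 1.0},
    "ml.g4dn.xlarge": {"cost": 2.0, "throughput": 5.0},
    "ml.g5.xlarge": {"cost": 3.0, "throughput": 10.0},
}


def throughput_per_cost(instance_type: str) -> float:
    """Return relative throughput divided by relative cost (hypothetical helper)."""
    ratio = RATIOS[instance_type]
    return ratio["throughput"] / ratio["cost"]


for name in RATIOS:
    print(f"{name}: {throughput_per_cost(name):.2f}x throughput per unit of cost")
```

Under these rough ratios the GPU instances look more cost-efficient when the endpoint is kept busy. A mostly idle endpoint still pays the higher hourly rate, so the CPU fallback can be cheaper overall at low traffic.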

**A few rules of thumb:**

- **A single GPU is enough.** Going beyond `xlarge` on a GPU instance gives little benefit because `chronos-2` doesn't saturate one GPU at typical batch sizes.
- **Newer GPU generations are faster** (`g6.xlarge`, `g6e.xlarge`); availability varies by region.
- **For CPU, scaling vCPUs helps.** `c6i.2xlarge → 4xlarge → 8xlarge` roughly halves latency at each step.
- **Slow deploy?** If a deploy takes much longer than ~6 min for GPU or ~4 min for CPU, the region is likely out of capacity for that instance type — try a different instance type or region.
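
The "try a different instance type" advice can be sketched as a small fallback loop. This is a minimal sketch under stated assumptions, not the library's own mechanism: `deploy_with_fallback` is a hypothetical helper, and it takes the deploy call as a plain callable so that it makes no claims about the real `deploy` signature. The demo uses a fake deploy function that simulates a capacity failure.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def deploy_with_fallback(
    deploy_fn: Callable[[str], T],
    instance_types: Iterable[str],
) -> Tuple[str, T]:
    """Try each instance type in order; return the first one that deploys."""
    last_error: Optional[Exception] = None
    for instance_type in instance_types:
        try:
            return instance_type, deploy_fn(instance_type)
        except Exception as err:  # e.g. a capacity or quota error
            print(f"{instance_type} failed: {err}")
            last_error = err
    raise RuntimeError("no candidate instance type could be deployed") from last_error


def fake_deploy(instance_type: str) -> str:
    """Stand-in for a real deploy call; pretends ml.g5.xlarge has no capacity."""
    if instance_type == "ml.g5.xlarge":
        raise RuntimeError("insufficient capacity (simulated)")
    return f"endpoint-on-{instance_type}"


chosen, endpoint = deploy_with_fallback(
    fake_deploy, ["ml.g5.xlarge", "ml.g4dn.xlarge", "ml.c6i.2xlarge"]
)
print(chosen, endpoint)
```

With a real model, `deploy_fn` would wrap the actual deploy call. Whether that call accepts an instance-type argument, and under what name, should be checked against the library's documentation. Note that a deploy that is merely slow does not raise an error, so a real version would also need a timeout.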

### Batch prediction

Batch jobs are not latency-sensitive, so default to CPU instances such as `ml.m5.2xlarge` (current default) or `ml.c6i.2xlarge`. Consider GPU only for very large datasets (>10M rows). Typical job startup overhead is ~4 min on CPU.

### Serverless endpoints

You don't choose an instance type — SageMaker manages CPU sizing automatically. Cold starts are typically ~30s. GPU is not available for serverless inference.

2 changes: 1 addition & 1 deletion docs/tutorials/predictor-tabular.md
@@ -50,7 +50,7 @@ Once a predictor is trained, you can get predictions in two ways:
- **Real-time inference**: deploy the predictor as a long-running SageMaker endpoint and send requests to it. Best when you need low-latency predictions on demand — e.g. behind a user-facing service.
- **Batch inference**: launch a one-off SageMaker job that scores a dataset and writes the results to S3. Best for offline scoring of larger datasets — compute spins up, runs, and shuts down automatically, so you only pay for what you use.

- A rough guideline: if you need predictions less often than once an hour and can tolerate ~10 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
+ A rough guideline: if you need predictions less often than once an hour and can tolerate ~4 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
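
The updated guideline can be written as a simple check. This is only a sketch of the rule of thumb as stated: `prefer_batch` and its parameters are hypothetical names, and the 4-minute default mirrors the approximate spin-up figure in the text.

```python
def prefer_batch(
    hours_between_predictions: float,
    tolerable_wait_minutes: float,
    spin_up_minutes: float = 4.0,  # approximate figure from the guideline
) -> bool:
    """Return True when the rough guideline favors batch inference."""
    less_often_than_hourly = hours_between_predictions > 1.0
    can_wait_for_spin_up = tolerable_wait_minutes >= spin_up_minutes
    return less_often_than_hourly and can_wait_for_spin_up


print(prefer_batch(hours_between_predictions=24, tolerable_wait_minutes=15))
print(prefer_batch(hours_between_predictions=0.1, tolerable_wait_minutes=15))
```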

### Real-time inference

2 changes: 1 addition & 1 deletion docs/tutorials/predictor-timeseries.md
@@ -73,7 +73,7 @@ Once a predictor is trained, you can get predictions in two ways:
- **Real-time inference**: deploy the predictor as a long-running SageMaker endpoint and send requests to it. Best when you need low-latency forecasts on demand — e.g. behind a user-facing service.
- **Batch inference**: launch a one-off SageMaker job that scores a dataset and writes the results to S3. Best for offline forecasting on larger datasets — compute spins up, runs, and shuts down automatically, so you only pay for what you use.

- A rough guideline: if you need predictions less often than once an hour and can tolerate ~10 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
+ A rough guideline: if you need predictions less often than once an hour and can tolerate ~4 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.

### Real-time inference
