diff --git a/docs/tutorials/foundation-model-timeseries.md b/docs/tutorials/foundation-model-timeseries.md
index 0e73871..62a6d44 100644
--- a/docs/tutorials/foundation-model-timeseries.md
+++ b/docs/tutorials/foundation-model-timeseries.md
@@ -283,3 +283,35 @@ model = TimeSeriesFoundationModel(
 )
 endpoint = model.deploy(inference_mode="serverless")
 ```
+
+## Choosing an instance type
+
+Each inference mode has different sweet spots. Defaults work for most users — read on if you want to optimize for cost or throughput, or the default GPU instance type isn't available in your account or region.
+
+### Real-time endpoints
+
+Pick a GPU if you need low latency, a CPU if cost matters more or a GPU instance isn't available in the chosen region.
+
+| Instance | Cost\* | Throughput\* | Notes |
+|---|---|---|---|
+| `ml.c6i.2xlarge` | 1× | 1× | CPU fallback. Cheap, broadly available. |
+| `ml.g4dn.xlarge` | ~2× | ~5× | Budget GPU. |
+| `ml.g5.xlarge` (default) | ~3× | ~10× | Lowest latency. |
+
+\*Rough ratios relative to `ml.c6i.2xlarge`. Throughput measured on `chronos-2` with a 100-item × 2000-step context, 64-step horizon. Actual numbers depend on payload, region, and pricing.
+
+**A few rules of thumb:**
+
+- **Single-GPU is enough.** Going beyond `xlarge` on a GPU instance gives little benefit — chronos-2 doesn't saturate one GPU at typical batch sizes.
+- **Newer GPU generations are faster** (`g6.xlarge`, `g6e.xlarge`); availability varies by region.
+- **For CPU, scaling vCPUs helps.** `c6i.2xlarge → 4xlarge → 8xlarge` roughly halves latency at each step.
+- **Slow deploy?** If a deploy takes much longer than ~6 min for GPU or ~4 min for CPU, the region is likely out of capacity for that instance type — try a different instance type or region.
+
+### Batch prediction
+
+Batch jobs don't care about latency, so default to CPU instances such as `ml.m5.2xlarge` (current default) or `ml.c6i.2xlarge`. Consider GPU only for very large datasets (>10M rows). Typical job startup overhead is ~4 min on CPU.
+
+### Serverless endpoints
+
+You don't choose an instance type — SageMaker manages CPU sizing automatically. Cold starts are typically ~30s. GPU is not available for serverless inference.
+
diff --git a/docs/tutorials/predictor-tabular.md b/docs/tutorials/predictor-tabular.md
index 7a1af0e..bb3bb79 100644
--- a/docs/tutorials/predictor-tabular.md
+++ b/docs/tutorials/predictor-tabular.md
@@ -50,7 +50,7 @@ Once a predictor is trained, you can get predictions in two ways:
 - **Real-time inference**: deploy the predictor as a long-running SageMaker endpoint and send requests to it. Best when you need low-latency predictions on demand — e.g. behind a user-facing service.
 - **Batch inference**: launch a one-off SageMaker job that scores a dataset and writes the results to S3. Best for offline scoring of larger datasets — compute spins up, runs, and shuts down automatically, so you only pay for what you use.
 
-A rough guideline: if you need predictions less often than once an hour and can tolerate ~10 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
+A rough guideline: if you need predictions less often than once an hour and can tolerate ~4 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
 
 ### Real-time inference
 
diff --git a/docs/tutorials/predictor-timeseries.md b/docs/tutorials/predictor-timeseries.md
index 453733f..6cdc465 100644
--- a/docs/tutorials/predictor-timeseries.md
+++ b/docs/tutorials/predictor-timeseries.md
@@ -73,7 +73,7 @@ Once a predictor is trained, you can get predictions in two ways:
 - **Real-time inference**: deploy the predictor as a long-running SageMaker endpoint and send requests to it. Best when you need low-latency forecasts on demand — e.g. behind a user-facing service.
 - **Batch inference**: launch a one-off SageMaker job that scores a dataset and writes the results to S3. Best for offline forecasting on larger datasets — compute spins up, runs, and shuts down automatically, so you only pay for what you use.
 
-A rough guideline: if you need predictions less often than once an hour and can tolerate ~10 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
+A rough guideline: if you need predictions less often than once an hour and can tolerate ~4 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
 
 ### Real-time inference
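
The cost and throughput ratios in the real-time endpoint table added to `foundation-model-timeseries.md` can be combined into a rough cost per unit of throughput. The sketch below is illustrative only: it uses just the approximate ratios from that table, and the names `RATIOS`, `relative_cost_per_throughput`, and `rank_by_cost_efficiency` are hypothetical helpers, not part of any library.

```python
from __future__ import annotations

# Approximate ratios copied from the real-time endpoint table, all relative
# to ml.c6i.2xlarge. They are rough guides, not measurements to rely on.
RATIOS = {
    "ml.c6i.2xlarge": {"cost": 1.0, "throughput": 1.0},
    "ml.g4dn.xlarge": {"cost": 2.0, "throughput": 5.0},
    "ml.g5.xlarge": {"cost": 3.0, "throughput": 10.0},
}


def relative_cost_per_throughput(instance: str) -> float:
    """Return relative cost divided by relative throughput (lower is better)."""
    ratio = RATIOS[instance]
    return ratio["cost"] / ratio["throughput"]


def rank_by_cost_efficiency() -> list[str]:
    """Return instance names sorted from cheapest to priciest per unit of throughput."""
    return sorted(RATIOS, key=relative_cost_per_throughput)


if __name__ == "__main__":
    for name in rank_by_cost_efficiency():
        print(f"{name}: {relative_cost_per_throughput(name):.2f}")
```

Under these ratios the GPU instances come out cheaper per unit of throughput than the CPU fallback. That ranking only matters when the endpoint is busy enough to use the extra throughput; a lightly loaded endpoint is better judged by the plain cost column.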