autogluon · shchur · Jun 4, 2026 · Jun 4, 2026
diff --git a/docs/tutorials/foundation-model-timeseries.md b/docs/tutorials/foundation-model-timeseries.md
@@ -283,3 +283,35 @@ model = TimeSeriesFoundationModel(
 )
 endpoint = model.deploy(inference_mode="serverless")
 ```
+
+## Choosing an instance type
+
+Each inference mode has different sweet spots. The defaults work for most users; read on if you want to optimize for cost or throughput, or if the default GPU instance type isn't available in your account or region.
+
+### Real-time endpoints
+
+Pick a GPU instance if you need low latency; pick a CPU instance if cost matters more or no GPU instance is available in your region.
+
+| Instance | Cost\* | Throughput\* | Notes |
+|---|---|---|---|
+| `ml.c6i.2xlarge` | 1× | 1× | CPU fallback. Cheap, broadly available. |
+| `ml.g4dn.xlarge` | ~2× | ~5× | Budget GPU. |
+| `ml.g5.xlarge` (default) | ~3× | ~10× | Lowest latency of the three. |
+
+\*Rough ratios relative to `ml.c6i.2xlarge`. Throughput measured on `chronos-2` with a 100-item × 2000-step context, 64-step horizon. Actual numbers depend on payload, region, and pricing.
+
+**A few rules of thumb:**
+
+- **Single-GPU is enough.** Going beyond `xlarge` on a GPU instance gives little benefit, because `chronos-2` doesn't saturate one GPU at typical batch sizes.
+- **Newer GPU generations are faster** (`ml.g6.xlarge`, `ml.g6e.xlarge`); availability varies by region.
+- **For CPU, scaling vCPUs helps.** Each step from `ml.c6i.2xlarge` to `ml.c6i.4xlarge` to `ml.c6i.8xlarge` roughly halves latency.
+- **Slow deploy?** If a deploy takes much longer than ~6 min for GPU or ~4 min for CPU, the region is likely out of capacity for that instance type — try a different instance type or region.
+
+### Batch prediction
+
+Batch jobs aren't latency-sensitive, so default to CPU instances such as `ml.m5.2xlarge` (current default) or `ml.c6i.2xlarge`. Consider GPU only for very large datasets (>10M rows). Typical job startup overhead is ~4 min on CPU.
+
+### Serverless endpoints
+
+You don't choose an instance type — SageMaker manages CPU sizing automatically. Cold starts are typically ~30s. GPU is not available for serverless inference.
+
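Note on the real-time endpoint table above: the cost and throughput columns are easier to compare when combined into a single "cost per unit of throughput" figure. The following is a minimal sketch that takes the table's rough ratios at face value. The helper names (`cost_per_throughput`, `cheapest_per_forecast`) are hypothetical and are not part of AutoGluon or SageMaker; the numbers are the approximate ratios from the table, not measured prices.

```python
# Rough ratios copied from the real-time endpoint table
# (both columns are relative to ml.c6i.2xlarge; assumed, not measured here).
INSTANCES = {
    "ml.c6i.2xlarge": {"cost": 1.0, "throughput": 1.0},
    "ml.g4dn.xlarge": {"cost": 2.0, "throughput": 5.0},
    "ml.g5.xlarge": {"cost": 3.0, "throughput": 10.0},
}


def cost_per_throughput(name):
    """Relative cost of one unit of throughput (lower means cheaper per forecast)."""
    spec = INSTANCES[name]
    return spec["cost"] / spec["throughput"]


def cheapest_per_forecast(available):
    """Hypothetical helper: among the available instance types that appear in
    the table, return the one with the lowest cost per unit of throughput."""
    candidates = [name for name in available if name in INSTANCES]
    if not candidates:
        raise ValueError("none of the requested instance types are in the table")
    return min(candidates, key=cost_per_throughput)


for name in INSTANCES:
    print(f"{name}: {cost_per_throughput(name):.2f}")
# ml.c6i.2xlarge: 1.00
# ml.g4dn.xlarge: 0.40
# ml.g5.xlarge: 0.30
```

Under these ratios the default `ml.g5.xlarge` is also the cheapest per forecast, but only when the endpoint is kept busy. A real-time endpoint is billed while idle, so at low request rates the instance with the lowest hourly cost can still be the cheaper choice overall.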
diff --git a/docs/tutorials/predictor-tabular.md b/docs/tutorials/predictor-tabular.md
@@ -50,7 +50,7 @@ Once a predictor is trained, you can get predictions in two ways:
 - **Real-time inference**: deploy the predictor as a long-running SageMaker endpoint and send requests to it. Best when you need low-latency predictions on demand — e.g. behind a user-facing service.
 - **Batch inference**: launch a one-off SageMaker job that scores a dataset and writes the results to S3. Best for offline scoring of larger datasets — compute spins up, runs, and shuts down automatically, so you only pay for what you use.
 
-A rough guideline: if you need predictions less often than once an hour and can tolerate ~10 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
+A rough guideline: if you need predictions less often than once an hour and can tolerate ~4 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
 
 ### Real-time inference
 

diff --git a/docs/tutorials/predictor-timeseries.md b/docs/tutorials/predictor-timeseries.md
@@ -73,7 +73,7 @@ Once a predictor is trained, you can get predictions in two ways:
 - **Real-time inference**: deploy the predictor as a long-running SageMaker endpoint and send requests to it. Best when you need low-latency forecasts on demand — e.g. behind a user-facing service.
 - **Batch inference**: launch a one-off SageMaker job that scores a dataset and writes the results to S3. Best for offline forecasting on larger datasets — compute spins up, runs, and shuts down automatically, so you only pay for what you use.
 
-A rough guideline: if you need predictions less often than once an hour and can tolerate ~10 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
+A rough guideline: if you need predictions less often than once an hour and can tolerate ~4 minutes of compute spin-up, batch inference is usually cheaper and easier to operate.
 
 ### Real-time inference