
Describe the bug
I have a model that takes fairly long to load (~40 s). When I run constant traffic against the endpoint and then scale out by adding another instance, I see a short spike of errors. From the log timestamps I could conclude that these errors occurred before model loading had completed on the new instance.
I found that on startup of the model_server there is a fixed 1 s wait period (https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L266), after which the code only checks whether a matching process exists and returns it, without verifying that the model has actually finished loading.
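A minimal sketch of what an explicit readiness check could look like instead of the fixed sleep, assuming the default MMS management port (8081) and the DescribeModel response format in which each worker reports a "READY" status (both are assumptions about the deployment, not something the toolkit exposes today):

```python
import time
import urllib.error
import urllib.request

# Assumption: MMS management API on its default port; the port and the
# per-worker '"status": "READY"' response format may differ in your setup.
MANAGEMENT_URL = "http://localhost:8081"

def wait_for_model_ready(model_name, timeout=300.0, poll_interval=2.0):
    """Block until the model's workers report READY, or raise on timeout."""
    deadline = time.monotonic() + timeout
    url = f"{MANAGEMENT_URL}/models/{model_name}"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = resp.read().decode("utf-8")
                if '"status": "READY"' in body:
                    return
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; keep polling
        time.sleep(poll_interval)
    raise TimeoutError(f"model {model_name!r} not ready after {timeout} s")
```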
To reproduce
- Have a running endpoint with a model that takes some time to initialize.
- Run constant traffic against the endpoint (see the load-generation sketch below).
- Scale out by adding another instance.
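For reference, a minimal load-generation sketch that surfaces the error spike during the scaling event (assuming boto3; the endpoint name and payload below are placeholders for your setup):

```python
import time
import boto3
from botocore.exceptions import ClientError

# Assumptions: placeholder endpoint name and a JSON-accepting model.
ENDPOINT_NAME = "my-endpoint"
PAYLOAD = b'{"inputs": "test"}'

runtime = boto3.client("sagemaker-runtime")
errors = 0
for _ in range(1000):
    try:
        runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/json",
            Body=PAYLOAD,
        )
    except ClientError as e:
        errors += 1
        # Timestamped errors let you correlate the spike with the scaling event.
        print(time.strftime("%H:%M:%S"), e.response["Error"]["Code"])
    time.sleep(0.1)
print(f"total errors: {errors}")
```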
Expected behavior
No error spikes on scaling events; the new instance should not receive traffic until the model is fully loaded.
System information
- I am using a custom Docker image (CPU) and have noticed this issue across multiple frameworks.
Additional context
Is there a parameter to control this initial model-loading wait time that I might have missed?