Skip to content

Conversation

@aviruthen
Copy link
Collaborator

Issue #, if available:

Description of changes:

Introducing aws_batch into PySDK V3, placed in sagemaker.train.
Supports queueing in front of training jobs with interfaces TrainingQueue and TrainingQueuedJob.
Includes unit tests, integration tests, and an example notebook

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aviruthen
Copy link
Collaborator Author

Unit and integ tests are passing, have to rerun for recent merge
Screenshot 2025-12-12 at 11 16 58 AM

"## Create Sample Resources\n",
"The diagram belows shows the Batch resources we'll create for this example.\n",
"\n",
"![The Resources to Create](batch_getting_started_resources.png \"Example Job Queue and Service Environment Resources\")\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're missing the png here from this PR

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, added the png back

" time.sleep(5)\n",
"\n",
" # Print training job logs\n",
" # job.get_estimator().logs()\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this Estimator comment reference

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops good catch!

# Step 3: Create ModelTrainer
model_trainer = ModelTrainer(**init_params)

# Step 4: Set _latest_training_job (key insight!)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: key insight?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this was a note for myself (the ModelTrainer parameter that I could attach the training job to)!

Copy link
Collaborator

@davlind-amzn davlind-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran some testing of these changes on my side, everything looks good to me!

return self.boto_session.resource("iam").Role(role).arn


def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to check on what this utility is used for?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to maintain parity with the Estimator::logs method, which tails the logs being emitted from an active training job. Example of usage here, down under the "Monitor Job Status" section: https://github.com/aws/amazon-sagemaker-examples/blob/default/%20%20%20%20%20%20build_and_train_models/sm-training-queues/sm-training-queues_getting_started_with_estimator.ipynb

Check out this line in the example notebook in this PR: model_trainer.sagemaker_session.logs_for_job(model_trainer._latest_training_job.training_job_name, wait=True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seem like we are replacing:
v2 logic: job.get_estimator().logs()

v3 logic: model_trainer.sagemaker_session.logs_for_job(model_trainer._latest_training_job.training_job_name, wait=True)

The V3 experience looks pretty bad to me and since this is not an existing v2 parity issue - Can we think about tie-ing the get logs functionality to training queue job or model trainer directly? We can also think through on this and not make a decision on this for 1st release. Lets use utility method within example notebooks (replicate _logs_for_job funtionality within notebook as a standalone utility)? This would not break customers using the notebook and we can make a right offering for logs exposure and update the notebook.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced with Mufi. Resolution: let's make logs_for_job a notebook method so that we are not introducing a new external method (and we can take more time on this for the future). We are okay with allowing the user to get the training job name from _lastest_training_job since this is an internal parameter (not an internal class or method). Will be implementing this

import boto3


def get_batch_boto_client(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it internal?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used in the notebook, should stay external

]


def get_training_job_name_from_training_job_arn(training_job_arn: str) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made internal!

"jinja2>=3.0,<4.0",
"sagemaker-mlflow>=0.0.1,<1.0.0",
"mlflow>=3.0.0,<4.0.0",
"nest_asyncio>=1.5.0",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a required dependency? how much of additional size implications are added to the sagemaker-train package?

Copy link
Collaborator Author

@aviruthen aviruthen Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used in the result() method for TrainingQueuedJob. It's a really small package; however, I'm aligned with removing it from pyproject.toml and having it be a dependency users can pip install (we have something similar where there are many different ML frameworks for inference but we don't require all of them in the pyproject.toml file since users can pick and choose which ML frameworks they want to use)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed nest_asyncio as a dependency in pyproject.toml

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced with David. Adding back in dependency, it is a small package and should be available for users when they install sagemaker-train.

@aviruthen aviruthen merged commit f272de0 into aws:master Dec 19, 2025
8 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants