SageMaker
Amazon SageMaker uses containers for training jobs, processing jobs, batch transforms, and real-time inference. Use a custom Docker image when the built-in SageMaker images do not have the operating system packages, runtime, or serving stack that the workload needs.
The normal path is:
Build a training or inference image with Docker.
Push the image to ECR.
Point a SageMaker job or model at the ECR image URI.
Write outputs to the SageMaker paths that the platform mounts inside the container.
Training containers
For training jobs, SageMaker mounts input channels under /opt/ml/input/data and expects model artifacts under /opt/ml/model. A simple image can use the Python process as its entrypoint.
# syntax=docker/dockerfile:1
FROM python:3-slim

ENV PYTHONUNBUFFERED=1
WORKDIR /opt/program
COPY requirements.txt train.py ./
# BuildKit cache mount keeps pip downloads across builds
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

ENTRYPOINT ["python", "train.py"]
The training script should read inputs and write the trained model.
from pathlib import Path
import json

# SageMaker mounts the "training" channel here and uploads whatever is written to /opt/ml/model.
input_dir = Path("/opt/ml/input/data/training")
model_dir = Path("/opt/ml/model")
model_dir.mkdir(parents=True, exist_ok=True)

num_files = sum(1 for p in input_dir.glob("**/*") if p.is_file())
(model_dir / "model.json").write_text(json.dumps({"training_files": num_files}))
Build and push the image to ECR.
export AWS_REGION=us-east-1
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export IMAGE_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/sm-train:latest"

aws ecr get-login-password --region "${AWS_REGION}" \
  | docker login --username AWS --password-stdin "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

docker buildx build --platform linux/amd64 -t "${IMAGE_URI}" --push .
Run a training job with the SageMaker Python SDK. When the job completes, SageMaker packages the contents of /opt/ml/model into model.tar.gz and uploads it under the output_path prefix.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,          # ECR URI of the training image built above
    role=role_arn,                # IAM execution role for the training job
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/sagemaker/output",
)

estimator.fit({"training": "s3://my-bucket/sagemaker/training"})
Inference containers
For real-time inference, SageMaker starts the container and sends traffic to port 8080. A custom serving container should expose health and invocation routes.
GET /ping returns a successful health response when the model is ready.
POST /invocations accepts inference requests and returns predictions.
Use a web framework such as FastAPI, Flask, or a model server that already implements the SageMaker inference contract.
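As an illustration, here is a minimal FastAPI sketch of the two routes. The model file name (model.json) and the echo-style prediction carry over from the training example above and are placeholders, not a required format.

```python
import json
from pathlib import Path

from fastapi import FastAPI, Request, Response

app = FastAPI()

# SageMaker extracts model.tar.gz into /opt/ml/model before starting the container.
MODEL_DIR = Path("/opt/ml/model")
model_file = MODEL_DIR / "model.json"
model = json.loads(model_file.read_text()) if model_file.exists() else None


@app.get("/ping")
def ping() -> Response:
    # Return 200 only when the model is loaded; SageMaker uses this as the health check.
    return Response(status_code=200 if model is not None else 503)


@app.post("/invocations")
async def invocations(request: Request) -> Response:
    # Placeholder handler: parse the request body and return a prediction payload.
    payload = await request.json()
    result = {"model": model, "received": payload}
    return Response(content=json.dumps(result), media_type="application/json")
```

The container's entrypoint would then start the server on port 8080, for example with uvicorn serve:app --host 0.0.0.0 --port 8080 (assuming the file is named serve.py).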
Container rules
Keep training and inference images separate unless the same runtime is truly needed.
Put large model artifacts in S3, not in the image.
Use ECR repository scanning and immutable release tags.
Run as a non-root user when the framework allows it.
Use /opt/ml/model for trained artifacts and /opt/ml/output for job output.
Keep AWS credentials out of images. SageMaker injects IAM permissions through the job role.
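To illustrate the last two rules, here is a hedged sketch of wiring a custom serving image to a real-time endpoint with the SageMaker Python SDK: the model artifacts come from S3 via model_data, and permissions come from the execution role rather than credentials baked into the image. The image URI, repository name, and S3 path are placeholders.

```python
from sagemaker.model import Model

model = Model(
    # Placeholder ECR URI for a custom serving image
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/sm-serve:latest",
    # Placeholder S3 location of the model.tar.gz produced by a training job
    model_data="s3://my-bucket/sagemaker/output/my-job/output/model.tar.gz",
    role=role_arn,  # execution role, as in the training example; no credentials in the image
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
```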