SageMaker

Amazon SageMaker uses containers for training jobs, processing jobs, batch transforms, and real-time inference. Use a custom Docker image when the built-in SageMaker images do not have the operating system packages, runtime, or serving stack that the workload needs.

The normal path is:

  • Build a training or inference image with Docker.

  • Push the image to ECR.

  • Point a SageMaker job or model at the ECR image URI.

  • Write outputs to the SageMaker paths that the platform mounts inside the container.

Training containers

For training jobs, SageMaker mounts input channels under /opt/ml/input/data and expects model artifacts under /opt/ml/model. A simple image can set the training script itself as the container entrypoint.

# syntax=docker/dockerfile:1
FROM python:3-slim

ENV PYTHONUNBUFFERED=1
WORKDIR /opt/program

# Install dependencies first so this layer is reused when only train.py changes.
COPY requirements.txt ./
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

COPY train.py ./

ENTRYPOINT ["python", "train.py"]

The training script should read inputs and write the trained model.

from pathlib import Path
import json

# SageMaker mounts the "training" input channel and the model output directory
# at these fixed paths inside the container.
input_dir = Path("/opt/ml/input/data/training")
model_dir = Path("/opt/ml/model")
model_dir.mkdir(parents=True, exist_ok=True)

# Placeholder "training": count the input files and persist that as the model artifact.
num_files = sum(1 for path in input_dir.rglob("*") if path.is_file())
(model_dir / "model.json").write_text(json.dumps({"training_files": num_files}))

Build and push the image to ECR.

export AWS_REGION=us-east-1
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export IMAGE_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/sm-train:latest"

aws ecr get-login-password --region "${AWS_REGION}" \
    | docker login --username AWS --password-stdin "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

docker buildx build --platform linux/amd64 -t "${IMAGE_URI}" --push .

Run a training job with the SageMaker Python SDK.

from sagemaker.estimator import Estimator

# image_uri is the ECR URI pushed above; role_arn is a SageMaker execution
# role with access to the image and the S3 bucket.
estimator = Estimator(
    image_uri=image_uri,
    role=role_arn,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/sagemaker/output",
)

# The dictionary key becomes the input channel name, mounted at
# /opt/ml/input/data/training inside the container.
estimator.fit({"training": "s3://my-bucket/sagemaker/training"})

Inference containers

For real-time inference, SageMaker starts the container and sends traffic to port 8080. A custom serving container should expose health and invocation routes.

  • GET /ping returns a successful health response when the model is ready.

  • POST /invocations accepts inference requests and returns predictions.

Use a web framework such as Flask or FastAPI, or a model server that already implements the SageMaker inference contract.
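As a minimal sketch of that contract, assuming Flask and the model.json artifact produced by the training example above, a serve.py could implement both routes; the file name and the echo-style prediction are placeholders, not a real model.

import json
from pathlib import Path

from flask import Flask, Response, jsonify, request

MODEL_PATH = Path("/opt/ml/model/model.json")

app = Flask(__name__)
# Load the artifact once at startup; /ping reports failure if it is missing.
model = json.loads(MODEL_PATH.read_text()) if MODEL_PATH.exists() else None


@app.route("/ping", methods=["GET"])
def ping():
    # SageMaker treats any 2xx response as healthy.
    return Response(status=200 if model is not None else 503)


@app.route("/invocations", methods=["POST"])
def invocations():
    # Placeholder prediction: echo the parsed request alongside the loaded model.
    payload = request.get_json(force=True)
    return jsonify({"model": model, "input": payload})


if __name__ == "__main__":
    # SageMaker routes inference traffic to port 8080 inside the container.
    app.run(host="0.0.0.0", port=8080)

In the inference image, this process (or a production server such as gunicorn running it) must listen on port 8080 and be started when the container launches.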

Container rules

  • Keep training and inference images separate unless the same runtime is truly needed.

  • Put large model artifacts in S3, not in the image.

  • Use ECR repository scanning and immutable release tags, as in the sketch after this list.

  • Run as a non-root user when the framework allows it.

  • Use /opt/ml/model for trained artifacts and /opt/ml/output for job output.

  • Keep AWS credentials out of images. SageMaker injects IAM permissions through the job role.
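
The scanning and tag-immutability rule can be applied per repository through the ECR API. The boto3 sketch below reuses the sm-train repository name from the build step; the region is an assumption.

import boto3

ecr = boto3.client("ecr", region_name="us-east-1")

# Scan images automatically when they are pushed.
ecr.put_image_scanning_configuration(
    repositoryName="sm-train",
    imageScanningConfiguration={"scanOnPush": True},
)

# Reject pushes that would overwrite an existing tag.
ecr.put_image_tag_mutability(
    repositoryName="sm-train",
    imageTagMutability="IMMUTABLE",
)

With immutability enabled, each release needs a fresh tag rather than reusing a tag such as latest.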

References