From Laptop to Large-Scale
An interactive guide to the official Kubeflow Trainer integration with PyTorch. Learn how to take your models from local development to scalable, distributed training on Kubernetes.
Core Concepts
Click on each term to learn what it does in this workflow.
3-Step Setup Process
Install Local Kubernetes
Use `kind` (Kubernetes in Docker) to create a local cluster for development.
kind create cluster
Install Trainer Controller
Apply the controller to your cluster to manage jobs.
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"
Install Trainer SDK
Get the Python SDK to create and manage jobs.
pip install kubeflow-trainer
Interactive Code Walkthrough
This process involves two main files. Hover over the highlighted sections in the code to see what they do.
1. The Training Function
This standard PyTorch training function is packaged into a container image and executed on each worker node.
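Below is a minimal sketch of what such a function might look like: a small Fashion-MNIST classifier trained with DistributedDataParallel. The function name, model, dataset, and hyperparameters are illustrative rather than taken from the original walkthrough, and it assumes `torchvision` is available in the worker image. The imports live inside the function because function-based trainers typically ship the function body to the workers, so it has to be self-contained.

```python
def train_fashion_mnist():
    # Self-contained on purpose: function-based trainers typically send this
    # body to each worker, so all imports happen inside it.
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler
    from torchvision import datasets, transforms

    # torchrun (launched for us on each worker) sets MASTER_ADDR, RANK, WORLD_SIZE, ...
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU workers
    rank = dist.get_rank()

    # Illustrative model; swap in your own architecture.
    model = DDP(nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)))

    dataset = datasets.FashionMNIST("/tmp/data", train=True, download=True, transform=transforms.ToTensor())
    sampler = DistributedSampler(dataset)  # each worker sees a distinct shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch}: last-batch loss {loss.item():.4f}")

    dist.destroy_process_group()
```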
2. The Launcher Script
This script uses the Kubeflow Trainer SDK to define, run, and monitor the distributed job.
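A minimal launcher sketch is shown below. The import path, method, and parameter names (`TrainerClient`, `train`, `num_workers`, `resources_per_worker`) are assumptions that follow this guide's own terminology; the Trainer SDK's API surface has changed across releases, so check the reference docs for the version you installed. `train_fashion_mnist` is the illustrative function from step 1.

```python
# Launcher sketch: names follow this guide's terminology and may differ in
# your installed SDK release -- treat them as placeholders and verify them.
from kubeflow.trainer import TrainerClient

client = TrainerClient()

# Define and submit the distributed job: the SDK packages the training
# function from step 1 and fans it out across the requested workers.
job_name = client.train(
    func=train_fashion_mnist,                            # function from step 1
    num_workers=2,                                       # worker pods to launch
    resources_per_worker={"cpu": "2", "memory": "4Gi"},  # per-worker requests
)

# Monitor the job: wait for completion, then print the worker logs
# (method names are illustrative; see your SDK version's docs).
client.wait_for_job_status(job_name)
print(client.get_job_logs(job_name))
```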
Run Simulation
Click the button below to simulate running the launcher script. You'll see the logs appear and the training loss plotted on the chart.
Click "Run Training Simulation" to see the output...
Training Loss per Epoch
Practical Next Steps
Using GPUs
To leverage GPUs, ensure your Kubernetes nodes are GPU-enabled and add `"gpu": "1"` to the `resources_per_worker` dictionary in your launcher script.
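For example, the per-worker resource request from the launcher sketch above would become something like this (key names follow this guide; the values are illustrative):

```python
# One GPU per worker, alongside CPU and memory requests (illustrative values).
resources_per_worker = {
    "cpu": "2",
    "memory": "4Gi",
    "gpu": "1",
}
```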
Custom Docker Images
For real projects, you'll need to package your code and dependencies into a custom Docker image, push it to a registry, and reference it as the `base_image`.
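In the launcher sketch above, that might look like the following; the registry path is hypothetical and the `base_image` parameter name comes from this guide, so confirm it against your installed SDK version.

```python
job_name = client.train(
    func=train_fashion_mnist,
    num_workers=2,
    resources_per_worker={"cpu": "2", "memory": "4Gi", "gpu": "1"},
    # Hypothetical image reference -- build and push your own, then point here.
    base_image="registry.example.com/my-team/fashion-mnist-train:1.0",
)
```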
Advanced Strategies
Explore deeper integrations with strategies like Fully Sharded Data Parallel (FSDP) or libraries like DeepSpeed for training even larger models efficiently.
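As a taste of what that looks like inside the training function, here is a minimal FSDP sketch using PyTorch's built-in wrapper. It assumes GPU workers and the same torchrun-provided environment as the earlier example, and the model is a stand-in.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Same rendezvous environment as the DDP example, but on GPU workers.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Stand-in model: FSDP shards its parameters and gradients across workers,
# so each GPU only holds a slice of the full model state.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
model = FSDP(model)
# ...the rest of the training loop is unchanged from the DDP version.
```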