Learn to scale deep learning models on supercomputers with this hands-on course. Gain the skills needed for efficient training and explore parallelization techniques.
This course introduces the essentials of running deep learning models on supercomputers and scaling them effectively. It provides the foundational skills for efficient training of large models. The course comes in a condensed two-day version and an extended five-day version; the latter additionally covers various parallelization techniques.
All days alternate between theoretical input and hands-on exercises, during which the instructors are available for quick feedback and advice.
Learning goals
By the end of the course, you will be able to:
Short version (2 days)
Day 1: Supercomputer Access Basics
- Understand what a supercomputer is.
- Configure SSH keys.
- Set up VS Code.
- Use the supercomputer's software packages.
- Run your first job on the supercomputer (a tiny example script follows this list).
- Bonus: Blablador.
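As a taste of the hands-on part, here is a tiny sanity-check script that could serve as a first job payload (purely illustrative, not official course material); it confirms that PyTorch sees the GPUs your batch job was allocated:

```python
# check_gpus.py -- illustrative "first job" payload: confirm that PyTorch
# detects the GPUs allocated to your batch job.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
print(f"Visible GPUs:    {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```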
Day 2: Distributed Data Parallel (DDP)
- Learn good practices before starting a training run.
- Know where to store your data and how to load it.
- Run your first PyTorch code on the supercomputer.
- Understand what distributed training is.
- Understand DDP.
- Transform your code into a distributed version with DDP (a minimal sketch follows this list).
- Use TensorBoard on the supercomputer.
- Check GPU usage with llview.
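To give a flavor of the DDP transformation, here is a minimal sketch, assuming the script is launched with torchrun (one process per GPU); the model and data are placeholders:

```python
# Minimal DDP sketch (illustrative). Launch with, e.g.:
#   torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets LOCAL_RANK; each process drives exactly one GPU.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda()        # placeholder model
model = DDP(model, device_ids=[local_rank])  # wraps the model for gradient all-reduce

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 10).cuda(), torch.randn(32, 1).cuda()  # placeholder batch
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()   # DDP synchronizes gradients across ranks here
optimizer.step()

dist.destroy_process_group()
```

In a real training script you would also use a DistributedSampler so that each rank sees a different shard of the dataset.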
Extended version (5 days)
Days 1–2: Supercomputer Access Basics and Distributed Data Parallel (DDP), identical to the short version above.
Day 3: Tensor Parallelism (TP)
- Know what TP is.
- Parallelize your code with TP (a conceptual sketch follows this list).
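To illustrate the idea behind TP (a conceptual, forward-only sketch; the class name is made up and this is not the framework API used in the course), a linear layer can be split column-wise across ranks:

```python
# Conceptual, forward-only sketch of a column-parallel linear layer.
# Assumes torch.distributed is already initialized (e.g. via torchrun).
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Each rank holds a vertical slice of the weight; outputs are gathered."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        self.local_out = out_features // world_size
        # Each rank stores only its own slice of the full weight matrix.
        self.weight = nn.Parameter(torch.randn(self.local_out, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_y = nn.functional.linear(x, self.weight)  # (..., local_out)
        parts = [torch.empty_like(local_y) for _ in range(dist.get_world_size())]
        dist.all_gather(parts, local_y)   # forward-only: plain all_gather
        return torch.cat(parts, dim=-1)   # full output on every rank
```

Note that plain dist.all_gather does not propagate gradients; real tensor-parallel layers use differentiable collectives.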
Day 4: Pipeline Parallelism (PP)
- Know what PP is.
- Parallelize your code with PP (a sketch follows this list).
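For intuition, a naive pipeline can be sketched as two stages on two GPUs with the batch split into micro-batches (forward pass only, illustrative; device names and sizes are made up):

```python
# Naive two-stage pipeline sketch (forward pass only, illustrative).
import torch
import torch.nn as nn

# Each stage lives on its own GPU.
stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:1")

def pipelined_forward(x: torch.Tensor, n_micro: int = 4) -> torch.Tensor:
    # Splitting the batch into micro-batches is what lets both stages
    # work concurrently under a real pipeline schedule.
    outputs = []
    for micro in x.chunk(n_micro):
        h = stage0(micro.to("cuda:0"))
        outputs.append(stage1(h.to("cuda:1")))
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(64, 512))
```

A production pipeline additionally overlaps the stages and handles the backward pass, which is what dedicated PP schedules provide.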
Day 5: Fully Sharded Data Parallel (FSDP)
- Understand FSDP.
- Distribute your code with FSDP (a minimal sketch follows this list).
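The gist of FSDP in a minimal sketch, assuming a process group initialized via torchrun and a placeholder model:

```python
# Minimal FSDP sketch (illustrative). Launch with torchrun, one process per GPU.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(   # placeholder model
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).cuda()
model = FSDP(model)  # parameters, gradients and optimizer state are sharded across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```

Unlike DDP, which replicates the full model on every GPU, FSDP shards it, so much larger models fit in memory.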
Course dates
Register now:
- March 06–07, 2025
- July 07–08, 2025
- October 16–17, 2025
- December 08–12, 2025
For more information on how to register, please follow the links on the course dates. Please note: The links to the courses happening in the second half of 2025 will be published in summer.
Prerequisites
To participate in this course, you need knowledge of:
- Python as taught in the courses
- PyTorch
- Machine Learning, see the course
- Deep Learning as taught in the course
Target group
This course addresses anyone who wants to learn how to scale their models (students, researchers, employees, …).
This course is free of charge.