Learn to scale deep learning models on supercomputers with this hands-on course. Gain the skills needed for efficient training and explore parallelization techniques.
This course introduces the essentials of running deep learning models on supercomputers and scaling them effectively. It provides the foundational skills for efficient training of large models. The course comes in a condensed two-day version and an extended five-day version; the latter additionally covers various parallelization techniques.
All days alternate between theoretical input and hands-on exercises, during which the instructors are available for quick feedback and advice.
Learning goals
By the end of the course, you will be able to:
Short version (2 days)
Day 1: Supercomputer Access Basics
- Understand what a supercomputer is.
- Configure SSH keys.
- Set up VS Code.
- Use the supercomputer's software packages.
- Run your first job on the supercomputer (a tiny example script follows this list).
- Bonus: Blablador.
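As a taste of the hands-on part, here is a tiny sanity-check script that could serve as a first job payload (purely illustrative, not official course material); it confirms that PyTorch sees the GPUs your batch job was allocated:

```python
# check_gpus.py -- illustrative "first job" payload: confirm that PyTorch
# detects the GPUs allocated to your batch job.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
print(f"Visible GPUs:    {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```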
Day 2: Distributed Data Parallel (DDP)
- Learn good practices before starting a training run.
- Know where to store your data and how to load it.
- Run your first PyTorch code on the supercomputer.
- Understand what distributed training is.
- Understand DDP.
- Transform your code into a distributed version with DDP (a minimal sketch follows this list).
- Use TensorBoard on the supercomputer.
- Check GPU usage with llview.
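To give a flavor of the DDP transformation, here is a minimal sketch, assuming the script is launched with torchrun (one process per GPU); the model and data are placeholders:

```python
# Minimal DDP sketch (illustrative). Launch with, e.g.:
#   torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets LOCAL_RANK; each process drives exactly one GPU.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda()        # placeholder model
model = DDP(model, device_ids=[local_rank])  # wraps the model for gradient all-reduce

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 10).cuda(), torch.randn(32, 1).cuda()  # placeholder batch
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()   # DDP synchronizes gradients across ranks here
optimizer.step()

dist.destroy_process_group()
```

In a real training script you would also use a DistributedSampler so that each rank sees a different shard of the dataset.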
Extended version (5 days)
Days 1–2: Supercomputer Access Basics and Distributed Data Parallel (DDP), identical to the short version above.
Day 3: Tensor Parallelism (TP)
- Know what TP is.
- Parallelize your code with TP (a conceptual sketch follows this list).
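To illustrate the idea behind TP (a conceptual, forward-only sketch; the class name is made up and this is not the framework API used in the course), a linear layer can be split column-wise across ranks:

```python
# Conceptual, forward-only sketch of a column-parallel linear layer.
# Assumes torch.distributed is already initialized (e.g. via torchrun).
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Each rank holds a vertical slice of the weight; outputs are gathered."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        self.local_out = out_features // world_size
        # Each rank stores only its own slice of the full weight matrix.
        self.weight = nn.Parameter(torch.randn(self.local_out, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_y = nn.functional.linear(x, self.weight)  # (..., local_out)
        parts = [torch.empty_like(local_y) for _ in range(dist.get_world_size())]
        dist.all_gather(parts, local_y)   # forward-only: plain all_gather
        return torch.cat(parts, dim=-1)   # full output on every rank
```

Note that plain dist.all_gather does not propagate gradients; real tensor-parallel layers use differentiable collectives.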
Day 4: Pipeline Parallelism (PP)
- Know what PP is.
- Parallelize your code with PP (a sketch follows this list).
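For intuition, a naive pipeline can be sketched as two stages on two GPUs with the batch split into micro-batches (forward pass only, illustrative; device names and sizes are made up):

```python
# Naive two-stage pipeline sketch (forward pass only, illustrative).
import torch
import torch.nn as nn

# Each stage lives on its own GPU.
stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:1")

def pipelined_forward(x: torch.Tensor, n_micro: int = 4) -> torch.Tensor:
    # Splitting the batch into micro-batches is what lets both stages
    # work concurrently under a real pipeline schedule.
    outputs = []
    for micro in x.chunk(n_micro):
        h = stage0(micro.to("cuda:0"))
        outputs.append(stage1(h.to("cuda:1")))
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(64, 512))
```

A production pipeline additionally overlaps the stages and handles the backward pass, which is what dedicated PP schedules provide.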
Day 5: Fully Sharded Data Parallel (FSDP)
- Understand FSDP.
- Distribute your code with FSDP (a minimal sketch follows this list).
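The gist of FSDP in a minimal sketch, assuming a process group initialized via torchrun and a placeholder model:

```python
# Minimal FSDP sketch (illustrative). Launch with torchrun, one process per GPU.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(   # placeholder model
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).cuda()
model = FSDP(model)  # parameters, gradients and optimizer state are sharded across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```

Unlike DDP, which replicates the full model on every GPU, FSDP shards it, so much larger models fit in memory.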
Course dates
Register now:
- March 06–07, 2025
- July 07–08, 2025
- October 16–17, 2025
- December 08–12, 2025
For more information on how to register, please follow the links on the course dates. Please note: The links to the courses happening in the second half of 2025 will be published in summer.
Prerequisites
To participate in this course, you need knowledge of:
- Python as taught in the courses
- PyTorch
- Machine Learning, see the course
- Deep Learning as taught in the course
Target group
This course addresses anyone who wants to learn how to scale their models (students, researchers, employees, …).
This course is free of charge.