
AI Parallelization Methods on Supercomputers

Learn to scale deep learning models on supercomputers with this hands-on course. Gain the skills needed for efficient training and explore parallelization techniques.

This course introduces the essentials for running deep learning models on supercomputers and scaling them effectively. It provides the foundational skills needed for efficient training of large models. The course is offered in two versions: a two-day version and a five-day version. In the latter, you will additionally learn how to use various parallelization techniques.

Each day alternates between theoretical input and hands-on exercises, during which the instructors are available for quick feedback and advice.

Learning goals

By the end of the course, you will be able to:

Short version (2 days)

Day 1: Supercomputer Access Basics

  • Understand what a supercomputer is.
  • Configure SSH keys.
  • Set up VS Code.
  • Use the supercomputer's software packages.
  • Run your first job on the supercomputer (see the sketch after this list).
  • Bonus: Blablador.
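
As a first taste of the hands-on part, the job you run on Day 1 can be as small as a script that reports what the allocated node offers. The sketch below is a minimal, hypothetical example, assuming PyTorch is available in the loaded software environment; the actual exercise material is provided in the course. On most clusters, such a script is submitted through the batch system rather than run on the login node.

    # hello_supercomputer.py -- a minimal first script to submit as a batch job.
    # Assumes PyTorch is available in the loaded software environment.
    import socket
    import torch

    print(f"Hello from node {socket.gethostname()}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"Visible GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")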

Day 2: Distributed Data Parallel (DDP)

  • Apply good practices before starting a training run.
  • Decide where to store your data and how to load it.
  • Run your first PyTorch code on the supercomputer.
  • Understand what distributed training is.
  • Understand DDP.
  • Transform your code into a distributed version with DDP (see the sketch after this list).
  • Use TensorBoard on the supercomputer.
  • Check GPU usage with llview.
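
To give an idea of what the DDP transformation on Day 2 amounts to, here is a minimal sketch of the standard PyTorch pattern: initialize a process group, wrap the model in DistributedDataParallel, and shard the data with DistributedSampler. It assumes one process per GPU and a launcher (e.g., torchrun) that sets the usual LOCAL_RANK environment variable; the model, data, and hyperparameters are placeholders.

    # ddp_minimal.py -- sketch of a training loop converted to DDP.
    # Launch with, e.g.: torchrun --nproc_per_node=4 ddp_minimal.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    def main():
        dist.init_process_group(backend="nccl")   # one process per GPU
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(32, 2).cuda()
        # DDP all-reduces gradients across processes during backward().
        model = DDP(model, device_ids=[local_rank])

        data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
        # DistributedSampler gives each rank a distinct shard of the dataset.
        sampler = DistributedSampler(data)
        loader = DataLoader(data, batch_size=64, sampler=sampler)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = torch.nn.CrossEntropyLoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle across ranks each epoch
            for x, y in loader:
                x, y = x.cuda(), y.cuda()
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()  # gradients synchronized here
                optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()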

Extended version (5 days)

Days 1 and 2: Supercomputer Access Basics and Distributed Data Parallel (DDP)

The first two days are identical to Days 1 and 2 of the short version above.

Day 3: Tensor Parallelism (TP)

  • Know what TP is.
  • Parallelize your code with TP (see the sketch after this list).
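
To preview the core idea of TP: each rank stores only a slice of a layer's weight matrix, and a collective operation reassembles the full result. The following is a simplified, hand-rolled sketch of a column-parallel linear layer (production codes typically rely on a library such as Megatron-LM); it assumes a process group is already initialized as in the DDP example, and it illustrates the forward pass only, since training additionally needs an autograd-aware all-gather.

    import torch
    import torch.distributed as dist

    class ColumnParallelLinear(torch.nn.Module):
        """Linear layer whose output features are split across ranks (sketch)."""

        def __init__(self, in_features, out_features):
            super().__init__()
            world_size = dist.get_world_size()
            assert out_features % world_size == 0
            # Each rank holds only its column shard of the weight matrix.
            self.shard = torch.nn.Linear(in_features, out_features // world_size)

        def forward(self, x):
            # Every rank computes its slice of the output...
            local_out = self.shard(x)
            # ...and an all-gather reassembles the full output on every rank.
            pieces = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
            dist.all_gather(pieces, local_out)
            return torch.cat(pieces, dim=-1)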

Day 4: Pipeline Parallelism (PP)

  • Know what PP is.
  • Parallelize your code with PP (see the sketch after this list).
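
To preview the mechanics of PP: each rank owns a contiguous group of layers and passes activations to the next rank. Below is a bare-bones, hypothetical sketch of a forward pass split across two ranks, using CPU tensors and point-to-point sends (real pipeline schedules also split each batch into micro-batches so that all stages stay busy); it assumes an initialized process group with the gloo backend.

    import torch
    import torch.distributed as dist

    def pipeline_forward(rank, batch_size=8):
        """Forward pass split across two pipeline stages (sketch)."""
        if rank == 0:
            # Stage 0 owns the first half of the model and the input batch.
            stage = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU())
            activation = stage(torch.randn(batch_size, 32))
            dist.send(activation, dst=1)   # hand activations downstream
        elif rank == 1:
            # Stage 1 owns the second half and waits for the activations.
            stage = torch.nn.Linear(64, 2)
            activation = torch.empty(batch_size, 64)
            dist.recv(activation, src=0)
            return stage(activation)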

Day 5: Fully Sharded Data Parallel (FSDP)

  • Understand FSDP.
  • Distribute your code with FSDP (see the sketch after this list).
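
Conceptually, FSDP keeps only a shard of each parameter (plus its gradient and optimizer state) on every rank and gathers full parameters just in time for each forward and backward pass. In PyTorch, the wrapping itself is short; the sketch below reuses the process-group setup from the DDP example and uses a placeholder model.

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # materializing full parameters only while a wrapped module computes.
    model = FSDP(model)

    # From here, the training loop looks the same as in the DDP sketch.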

Course dates

Register now:

For more information on how to register, please follow the links on the course dates. Please note: the registration links for courses taking place in the second half of 2025 will be published in summer.

Prerequisites

To participate in this course, you need knowledge of

Target group

This course addresses anyone who wants to learn how to scale their models (students, researchers, employees, …).

This course is free of charge.
