This talk covers best practices and techniques for scaling machine learning workloads to build large-scale models using PyTorch. We share our experience of using PyTorch to train 175-billion- and 1-trillion-parameter models, the different training paradigms involved, and techniques for profiling and troubleshooting that will help you jumpstart your efforts in this space.
Jump to:
00:00 Introduction
00:44 Why is large model training needed?
00:59 Scaling creates training and model efficiency
01:13 Larger models = more efficient, less training, less data
01:24 Larger models can learn with few-shot learning
02:19 Democratizing large-scale language models with OPT-175B
02:51 Challenges of large model training
03:25 What is PyTorch Distributed?
04:20 Features Overview
06:00 DistributedDataParallel
06:53 FullyShardedDataParallel
08:44 FSDP Auto wrapping
09:22 FSDP Auto wrapping example
09:38 FSDP CPU Offload, Backward Prefetch policies
09:46 FSDP Mixed Precision control
09:53 Pipeline
11:06 Example Auto Partitioning
12:26 Pipeline + DDP (PDP)
13:44 Memory Saving Features
13:52 Activation Checkpointing
14:20 Activation Offloading
15:01 Activation Checkpointing & Offloading
15:45 Parameter Offloading
16:15 Memory Saving Feature & Training Paradigms
18:11 Experiments & Insights
18:16 Model Implementation
18:50 Scaling Efficiency: Varying # GPUs
20:57 Scaling Efficiency: Varying World Size
22:07 Scaling Efficiency: Varying Batch Size
23:50 Model Scale Limit
24:55 Impact of Network Bandwidth
27:08 Best Practices
28:20 Best Practices FSDP
29:01 Profiling & Troubleshooting
29:08 Profiling & Troubleshooting for Large Scale Model Training
30:35 Uber Prof (Experimental) Profiling & Troubleshooting tool
32:09 Demonstration
34:15 Combining DCGM + Profiling
35:36 Profiling for Large Scale Model Training
36:04 NVIDIA Nsight Multi-Node, Multi-GPU Profiling
36:47 PyTorch Profiler Distributed Training Profiling (single-node, multi-GPU)
37:04 Try it now
37:24 Resources
37:30 Closing Notes
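The chapters on DistributedDataParallel and FullyShardedDataParallel (06:00–09:46) walk through FSDP's auto wrapping, CPU offload, backward prefetch, and mixed precision controls. As a rough companion, here is a minimal sketch (not code from the talk) of how those knobs appear on the FSDP constructor in recent PyTorch releases; the toy model, sizes, and learning rate are placeholder assumptions, and it assumes a multi-GPU node launched with torchrun.

```python
# Hedged sketch of FSDP with auto wrapping, CPU offload, backward prefetch,
# and mixed precision. Assumes: torchrun --nproc_per_node=<num_gpus> train_fsdp.py
# on a CUDA-enabled PyTorch build; the model and sizes are placeholders, not the talk's model.
import functools
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    BackwardPrefetch,
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a large transformer stack.
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
    ).cuda()

    fsdp_model = FSDP(
        model,
        # Auto wrapping: shard submodules above ~100k parameters as separate FSDP units.
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=100_000
        ),
        # CPU offload: keep sharded parameters on CPU between uses (trades speed for memory).
        cpu_offload=CPUOffload(offload_params=True),
        # Backward prefetch: fetch the next unit's parameters during the backward pass.
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        # Mixed precision: bf16 parameters, gradient reductions, and buffers.
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
    )

    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
    for _ in range(10):
        loss = fsdp_model(torch.randn(8, 1024, device="cuda")).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```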
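The memory-saving chapters (13:44–16:15) discuss activation checkpointing and offloading. Below is a minimal sketch (not from the talk) of plain activation checkpointing with torch.utils.checkpoint, where each block's activations are recomputed during the backward pass instead of being cached; the Block module and sizes are illustrative assumptions.

```python
# Hedged sketch of activation checkpointing: forward activations inside each
# checkpointed block are discarded and recomputed in backward, trading extra
# compute for lower peak memory. Model and dimensions are placeholders.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ff(x)


class CheckpointedStack(nn.Module):
    def __init__(self, num_blocks: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(Block() for _ in range(num_blocks))

    def forward(self, x):
        for block in self.blocks:
            # Recompute this block's activations in backward instead of caching them.
            x = checkpoint(block, x, use_reentrant=False)
        return x


device = "cuda" if torch.cuda.is_available() else "cpu"
model = CheckpointedStack().to(device)
out = model(torch.randn(4, 1024, device=device, requires_grad=True))
out.sum().backward()
```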
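For the profiling chapters (29:01–37:04), the following is a minimal sketch (not from the talk) of profiling a DistributedDataParallel training loop with torch.profiler on a single node with multiple GPUs; the model, profiling schedule, and log directory are placeholder assumptions.

```python
# Hedged sketch of single-node, multi-GPU profiling of a DDP step with torch.profiler.
# Assumes: torchrun --nproc_per_node=<num_gpus> profile_ddp.py on a CUDA-enabled build;
# traces land in ./log/ddp for viewing with the TensorBoard profiler plugin.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)


def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        # Skip 1 step, warm up 1 step, then record 3 active steps.
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler("./log/ddp"),
        with_stack=True,
    ) as prof:
        for _ in range(6):
            loss = model(torch.randn(32, 1024, device="cuda")).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()  # advance the profiler schedule each iteration

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```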
Microsoft Build 2022