In this deep dive video, we zoom in on model distillation, an advanced technique for building high-performance small language models at a reasonable cost. First, we explain what model distillation is. Then, we introduce two popular distillation strategies, logits distillation and hidden states distillation, and study in detail how they work and how they're implemented in Arcee's open-source DistillKit library. Finally, we look at two Arcee models built with distillation, Arcee SuperNova 70B and Arcee SuperNova Medius 14B.
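If you'd like a quick preview of the two strategies before watching, here is a minimal PyTorch sketch of the corresponding losses: a temperature-scaled KL divergence between teacher and student logits, and an MSE between teacher hidden states and projected student hidden states. This illustrates the general technique under assumed conventions, not DistillKit's actual implementation; all function and variable names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def logits_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Rescaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def hidden_states_distillation_loss(student_hidden, teacher_hidden, projector):
    """MSE between teacher hidden states and projected student hidden states."""
    # The projector maps the student's hidden size to the teacher's,
    # since the two models usually have different widths.
    return F.mse_loss(projector(student_hidden), teacher_hidden)

if __name__ == "__main__":
    batch, seq, vocab = 2, 8, 32000
    s_logits = torch.randn(batch, seq, vocab)
    t_logits = torch.randn(batch, seq, vocab)
    print(logits_distillation_loss(s_logits, t_logits))

    s_dim, t_dim = 2048, 4096  # assumed student/teacher hidden sizes
    projector = nn.Linear(s_dim, t_dim)
    s_hidden = torch.randn(batch, seq, s_dim)
    t_hidden = torch.randn(batch, seq, t_dim)
    print(hidden_states_distillation_loss(s_hidden, t_hidden, projector))
```

In practice, either distillation loss is typically blended with the standard cross-entropy loss on the ground-truth labels, with a weighting factor that balances the teacher's signal against the training data.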
Note: my calculation at 18:45 is wrong. It's 2.3 tera-tokens, not 2.3 peta-tokens. Sorry about that 🤡
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. You can become a channel member and enjoy exclusive perks: details at [ Link ]
You can also follow me on Medium at [ Link ] or Substack at [ Link ]. ⭐️⭐️⭐️
* Slides: [ Link ]
* DistillKit: [ Link ]
00:00 Introduction
00:30 What is model distillation?
04:55 Model distillation with DistillKit
11:20 Logits distillation
20:10 Logits distillation with DistillKit
26:10 Hidden states distillation
31:35 Hidden states distillation with DistillKit
36:00 Pros and cons
40:32 Distillation example: Arcee SuperNova 70B
42:50 Distillation example: Arcee SuperNova Medius 14B
44:40 Conclusion