How can you speed up your LLM inference time?
In this video, we'll optimize token generation time for our Falcon 7B model fine-tuned with QLoRA. We'll compare 4-bit and 8-bit model loading, try torch.compile(), and look into batch inference for faster predictions (rough code sketches below).
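
To give a taste of what's covered, here is a minimal sketch of 4-bit loading with transformers and bitsandbytes, roughly the setup the video walks through. Assumptions: the model name is the public Falcon 7B base, so you'd point it at your own fine-tuned checkpoint instead.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "tiiuae/falcon-7b"  # public base model; swap in your fine-tuned checkpoint

# NF4 quantization with bf16 compute: the usual QLoRA-style loading config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Falcon ships custom modeling code
)
# For 8-bit loading, use BitsAndBytesConfig(load_in_8bit=True) instead.
# Optionally, on PyTorch 2.x: model = torch.compile(model) — speedups vary by setup.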
Discord: [ Link ]
Prepare for the Machine Learning interview: [ Link ]
Subscribe: [ Link ]
Lit-Parrot: [ Link ]
00:00 - Introduction
01:05 - Text Tutorial on MLExpert.io
01:26 - Google Colab Setup
03:58 - Training Config Baseline
07:06 - Loading in 4 Bit
08:26 - Loading in 8 Bit
09:40 - torch.compile()
10:25 - Batch Inference
12:00 - Lit-Parrot
16:57 - Conclusion
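
And a hedged sketch of the batch inference idea from the 10:25 chapter: tokenize several prompts at once with left padding so each sequence generates from its true last token, then decode the whole batch. The prompts here are made up for illustration; it reuses the model and tokenizer loaded above.

tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token  # Falcon defines no pad token by default

prompts = [
    "Explain QLoRA in one sentence.",       # illustrative prompts,
    "Why does 4-bit loading reduce memory?",  # not from the video
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        pad_token_id=tokenizer.eos_token_id,
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)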
Turtle image by stockgiu
#chatgpt #gpt4 #llms #artificialintelligence #promptengineering #chatbot #transformers #python #pytorch