LLM inference is not a typical deep learning model deployment, and it is far from trivial to manage at scale for performance and cost. Sizing a production-grade LLM deployment requires understanding the model(s), the compute hardware, quantization and parallelization methods, KV Cache budgets, input and output token length predictions, model adapter management, and much more.
If you want to deeply understand these topics and their effects on LLM inference cost and performance you will enjoy this talk.
This talk will cover the following topics:
Why LLM inference is different from standard deep learning inference
Current and future NVIDIA GPU overview - which GPU(s) for which models and why
Understanding the importance of building inference engines
Deep recap on the attention mechanism along with different types of popular attention mechanisms used in production
Deep dive on KV Cache and managing KV Cache budgets to increase throughput per model deployment
Parallelism (reducing latency) - mainly tensor parallelism but data, sequence, pipeline and expert parallelism will be highlighted
Quantization methods on weights, activations, KV Cache to reduce engine sizes for more effective GPU utilization
Increasing throughput with in-flight batching and other techniques
Detailed performance analysis of LLM deployments covering time to first token (TTFT), inter-token latency, LLM deployment characterization, and other metrics that can help reduce deployment costs
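To make the KV Cache budgeting topic above concrete, here is a back-of-envelope sketch of how cache memory per token translates into a maximum number of cacheable tokens (and hence batch size × sequence length) on a GPU. The model config (32 layers, 32 KV heads, head dimension 128, FP16 cache) and the 24 GiB budget are illustrative assumptions, not figures from the talk.

```python
# Back-of-envelope KV cache sizing. Assumed, Llama-7B-like config:
# 32 layers, 32 KV heads, head_dim 128, FP16 (2-byte) cache entries.

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(32, 32, 128)   # 524288 B = 0.5 MiB/token
budget_gib = 24                                     # illustrative: memory left after weights
max_cached_tokens = budget_gib * 2**30 // per_token # tokens the cache can hold

print(f"{per_token} bytes/token, {max_cached_tokens} tokens in budget")
```

Dividing `max_cached_tokens` by the expected input-plus-output sequence length gives a rough ceiling on concurrent requests, which is exactly the lever that quantizing the KV Cache or using fewer KV heads (e.g. grouped-query attention) moves.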
The main inference engine referenced in the talk is TRT-LLM, served with the open-source inference server NVIDIA Triton.
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at [ Link ] & join us at the AI Engineer World's Fair in 2025! Get your tickets today at [ Link ]
About Mark
Dr. Mark Moyou is a Senior Data Scientist at NVIDIA on the Retail team, focused on enabling scalable machine learning for the nation's top retailers. Before NVIDIA, he was a Data Science Manager in the Professional Services division at Lucidworks, an enterprise search and recommendations company. Prior to Lucidworks, he was a founding Data Scientist at Alstom Transportation, where he applied data science to the US railroad industry. Mark holds a PhD and MSc in Systems Engineering and a BSc in Chemical Engineering. On the side, Mark is the host of The AI Portfolio Podcast, The Caribbean Tech Pioneers, and the Progress Guaranteed Podcast, and Director of the Southern Data Science Conference in Atlanta.