LLM inference is not a typical deep learning model deployment, and it is far from trivial to manage at scale for performance and cost. Sizing a production-grade LLM deployment requires understanding the model(s), the compute hardware, quantization and parallelization methods, KV Cache budgets, input and output token length predictions, model adapter management, and much more.
If you want to deeply understand these topics and their effects on LLM inference cost and performance you will enjoy this talk.
This talk will cover the following topics:
Why LLM inference is different from standard deep learning inference
Current and future NVIDIA GPU overview - which GPU(s) for which models and why
Understanding the importance of building inference engines
Deep recap on the attention mechanism along with different types of popular attention mechanisms used in production
Deep dive on KV Cache and managing KV Cache budgets to increase throughput per model deployment
Parallelism (reducing latency) - mainly tensor parallelism but data, sequence, pipeline and expert parallelism will be highlighted
Quantization methods on weights, activations, KV Cache to reduce engine sizes for more effective GPU utilization
Increasing throughput with in-flight batching and other techniques
Detailed performance analysis of LLM deployments covering time to first token (TTFT), inter-token latency, LLM deployment characterization, and other metrics that can help reduce deployment costs
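To make the KV Cache budgeting topic above concrete, here is a back-of-envelope sketch of how cache memory per token translates into a maximum number of cacheable tokens (and hence batch size × sequence length) on a GPU. The model config (32 layers, 32 KV heads, head dimension 128, FP16 cache) and the 24 GiB budget are illustrative assumptions, not figures from the talk.

```python
# Back-of-envelope KV cache sizing. Assumed, Llama-7B-like config:
# 32 layers, 32 KV heads, head_dim 128, FP16 (2-byte) cache entries.

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(32, 32, 128)   # 524288 B = 0.5 MiB/token
budget_gib = 24                                     # illustrative: memory left after weights
max_cached_tokens = budget_gib * 2**30 // per_token # tokens the cache can hold

print(f"{per_token} bytes/token, {max_cached_tokens} tokens in budget")
```

Dividing `max_cached_tokens` by the expected input-plus-output sequence length gives a rough ceiling on concurrent requests, which is exactly the lever that quantizing the KV Cache or using fewer KV heads (e.g. grouped-query attention) moves.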
The main inference engine referenced in the talk is TRT-LLM, served with the open-source inference server NVIDIA Triton.
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at [ Link ] & join us at the AI Engineer World's Fair in 2025! Get your tickets today at [ Link ]
About Mark
Dr. Mark Moyou is a Senior Data Scientist at NVIDIA on the Retail team, focused on enabling scalable machine learning for the nation's top retailers. Before NVIDIA, he was a Data Science Manager in the Professional Services division at Lucidworks, an enterprise search and recommendations company. Prior to Lucidworks, he was a founding Data Scientist at Alstom Transportation, where he applied data science to the US railroad industry. Mark holds a PhD and MSc in Systems Engineering and a BSc in Chemical Engineering. On the side, Mark is the host of The AI Portfolio Podcast, The Caribbean Tech Pioneers, and the Progress Guaranteed Podcast, and Director of the Southern Data Science Conference in Atlanta.