Channel: Neural Magic

vLLM Office Hours - Disaggregated Prefill and KV Cache Storage in vLLM - November 14, 2024

vLLM Office Hours - SOTA Tool-Calling Implementation in vLLM - November 7, 2024

vLLM Office Hours - Deep Dive into Mistral on vLLM - October 17, 2024

vLLM Office Hours - Speculative Decoding in vLLM - October 3, 2024

vLLM Office Hours - Advanced Techniques for Maximizing vLLM Performance - September 19, 2024

vLLM Office Hours - Using NVIDIA CUTLASS for High-Performance Inference - September 05, 2024

vLLM Office Hours - vLLM on AMD GPUs and Google TPUs - August 21, 2024

vLLM Office Hours - Multimodal Models in vLLM with Roblox - August 8, 2024

vLLM Office Hours - Model Quantization for Efficient vLLM Inference - July 25, 2024

Deploy LLMs More Efficiently with vLLM and Neural Magic

vLLM Office Hours - FP8 Quantization Deep Dive - July 9, 2024

vLLM Office Hours - June 20, 2024

vLLM and Neural Magic Office Hours - June 5, 2024

Are MLOps disappearing?

5x Faster YOLOv8 on CPUs

Deploy Fast and Accurate YOLOv8 Object Detection Models on CPUs You Already Have

Unlock Faster and More Efficient LLMs with SparseGPT

Pruning and Quantizing ML Models With One Shot Without Retraining

Sparse Transferring Hugging Face Models With SparseML

Apply Second-Order Pruning Algorithms for SOTA Model Compression

Use Sparse Transfer Learning to Create Sparse Models Fine-Tuned to Your Datasets

Intro to SparseML

Accelerate Image Segmentation Tasks With Sparsity and the DeepSparse Runtime

Accelerate Image Classification Tasks With Sparsity and the DeepSparse Runtime

Accelerate Object Detection Tasks With Sparsity and the DeepSparse Runtime

Intro to DeepSparse Runtime

Intro to Deep Learning Model Sparsification

Intro to SparseZoo

Intro to Neural Magic & Software-Delivered AI

Accelerate NLP Tasks With Sparsity and the DeepSparse Runtime

Sparse Training of Neural Networks Using AC/DC

How Well Do Sparse Models Transfer?

How to Achieve the Fastest CPU Inference Performance for Object Detection YOLO Models

Workshop: How to Optimize Deep Learning Models for Production

State-of-the-Art NLP Compression Research in Action: Understanding Crypto Sentiment

How to Compress Your BERT NLP Models For Very Efficient Inference

How to Compress Your NLP Models for Efficient Inference

3.5x Faster NLP BERT Using a Sparsity-Aware Inference Engine on AMD Milan-X

Deep Sparse Platform Demo: Build and Deploy Accurate Deep Learning Models Faster

How to Sparsify BERT for Better CPU Performance & Smaller File Size

Faster & More Accurate BERT Models on CPUs

Sparsifying YOLOv5 for 10x Better Performance, 12x Smaller File Size, and Cheaper Deployment

YOLOv5 on CPUs: Sparsifying to Achieve GPU-Level Performance and Tiny Footprint

YOLOv3 on the Edge: DeepSparse Engine vs. PyTorch

Using Sparsification Recipes with PyTorch

Introducing the Deep Sparse Platform: Sparsify Deep Learning Models to Run on CPUs on GPU Speeds.

Tissue vs. Silicon: The Future of Deep Learning Hardware

How Neural Magic Works: Easily Deliver GPU-class DL Performance on CPUs

Pruning for Success

Who is Neural Magic?

Neural Magic Demo: Lower Costs for Deep Learning Deployments

Pruning Deep Learning Models for Success in Production

Big Brain Burnout: What's wrong with AI computing?

Neural Magic Demo
