As providers of an end-to-end MLOps platform, we find that autoscaling ML inference is a frequent customer ask. Recently, serverless computing has been touted as the panacea for elastic compute that can provide flexibility and lower operating costs. However, for ML, the need to precisely define hardware configurations and long warm-up times of certain ML models exacerbate the limitations of serverless. To provide the best solution to our customers, we have run extensive benchmarking experiments comparing the performance of serverless and traditional computing for inference workloads running on Kubernetes (with KubeFlow and with the ModelDB MLOps Toolkit). Our experiments have spanned a variety of model types, data modalities, hardware, and workloads. In this talk, we present the results from our benchmarking study and provide a guide to architect your own k8s-based ML inference system.
Ещё видео!