Title: Unleashing the Power of LLMs for Visual Understanding
Abstract:
Recent work has adapted Large Language Models (LLMs) to various visual tasks, such as captioning and answering questions about images or short videos. The resulting multimodal LLMs combine visual understanding with the powerful reasoning and "common sense" capabilities of LLMs. However, multimodal LLMs still struggle with some fundamental visual tasks, such as image classification and understanding long videos.
This talk will cover two recent papers addressing these limitations. CLAMP proposes a parameter-efficient fine-tuning approach for LLMs with a contrastive image-caption matching objective, enabling LLMs to achieve good zero-shot image classification performance: it outperforms state-of-the-art multimodal LLMs by 13% and slightly surpasses contrastive learning with a custom text model. VideoMosaic introduces learnable spatiotemporal queries that adapt pretrained video LLMs (vLLMs) to generalize to much longer videos. The approach incorporates a global-local video Qformer with two new modules that leverage global video context to compute contextual tokens for understanding both short and long video segments. Trained on HowTo100M, VideoMosaic outperforms state-of-the-art large models by 3-6% on zero-shot long-video understanding benchmarks and also improves the vLLM's performance on short-term action recognition. These findings demonstrate the potential of adapting LLMs to new visual understanding tasks and expanding their capabilities.
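For readers less familiar with the underlying techniques, the two sketches below illustrate the general flavor of the methods mentioned above. They are minimal, hedged examples under stated assumptions, not reproductions of the papers' implementations. The first shows a standard symmetric contrastive (InfoNCE) image-caption matching loss of the kind a CLAMP-style fine-tuning objective builds on; the embedding dimension, temperature, and the omission of any parameter-efficient adapter layers are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_matching_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    image_emb, text_emb: (batch, dim) tensors; row i of each is a matched pair.
    Illustrative only: the image encoder, LLM text pooling, and CLAMP's actual
    parameter-efficient adapters are not shown here.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)            # and each caption to its image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings standing in for the image and text towers:
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_matching_loss(img, txt).item())
```

The second sketch shows learnable query tokens cross-attending to frame features, a common Qformer-style pattern for summarizing variable-length video into a fixed number of tokens; the module name, sizes, and single-layer design are assumptions, and VideoMosaic's global-local Qformer and contextual-token modules are not reproduced here.

```python
import torch
import torch.nn as nn

class QueryTokenAggregator(nn.Module):
    """Minimal sketch: learnable queries cross-attend to frame features and
    return a fixed-size token summary regardless of video length."""
    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames * tokens_per_frame, dim)
        q = self.queries.unsqueeze(0).expand(frame_features.size(0), -1, -1)
        out, _ = self.cross_attn(q, frame_features, frame_features)
        return out  # (batch, num_queries, dim)

# Toy usage: summarize 256 frame tokens into 32 query tokens.
agg = QueryTokenAggregator()
video_tokens = torch.randn(2, 256, 768)
print(agg(video_tokens).shape)  # torch.Size([2, 32, 768])
```

Because the query tokens yield a fixed-size summary regardless of input length, this pattern is a natural hook for extending a short-video vLLM to much longer inputs, which is the setting VideoMosaic targets.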
Bio: Kate Saenko is a computer scientist, AI researcher at Meta and professor at Boston University. She has made notable contributions to the field of artificial intelligence, particularly in the areas of computer vision and machine learning. Her work has helped advance the state-of-the-art in developing more adaptive, generalizable and multimodal AI systems.