Abstract:
Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. In this talk, I will cover three of our team's most recent works on vision transformers. First, I will present our latest work, Focal Transformer, which proposes a focal attention mechanism to efficiently capture both short- and long-range visual dependencies for better visual understanding. Then, I will discuss EsViT, which shows how advanced vision transformer architectures can empower self-supervised learning. Finally, if time allows, I will also introduce one of our first multi-scale vision transformer architectures -- Vision Longformer, which uses sparse local self-attention for high-resolution dense prediction. With these techniques, we have achieved state-of-the-art performance on various standard benchmarks, including object detection, semantic segmentation, and self-supervised image classification. These promising results demonstrate the great potential of vision transformers as generic backbones for a variety of vision tasks and beyond.
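As a rough illustration of the focal-attention idea (a minimal sketch, not the talk's or the paper's actual code), the snippet below has each query token attend to fine-grained tokens in its local window plus average-pooled summaries of the whole sequence. The single-head, projection-free form and the window/pool sizes are simplifying assumptions.

# A minimal, hypothetical sketch of focal attention: each query attends to
# fine-grained tokens in its local window plus average-pooled summaries of
# distant regions. Single head, no projections -- a simplification, not the
# Focal Transformer implementation.
import torch

def focal_attention(x, window=4, pool=4):
    # x: (B, L, C) visual tokens; assume L is divisible by window and pool
    B, L, C = x.shape
    # Fine-grained keys/values: tokens grouped into non-overlapping local windows
    fine = x.view(B, L // window, window, C)
    # Coarse keys/values: average-pool the sequence to summarize distant regions
    coarse = x.view(B, L // pool, pool, C).mean(dim=2)            # (B, L/pool, C)
    coarse = coarse.unsqueeze(1).expand(-1, L // window, -1, -1)  # shared by all windows
    kv = torch.cat([fine, coarse], dim=2)    # (B, L/window, window + L/pool, C)
    q = x.view(B, L // window, window, C)
    attn = torch.softmax(q @ kv.transpose(-2, -1) / C ** 0.5, dim=-1)
    return (attn @ kv).reshape(B, L, C)

# Usage: y = focal_attention(torch.randn(2, 64, 32))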
Bio:
Jianwei Yang is currently a Senior Researcher in the Deep Learning Group at MSR Redmond, directed by Dr. Jianfeng Gao. Prior to joining MSR, he completed his Ph.D. at Georgia Institute of Technology, supervised by Prof. Devi Parikh. His research interests span computer vision, vision-language, and robot learning. More specifically, his research focuses on structured visual understanding at different levels and on leveraging it for intelligent interaction with humans through language and with the environment through embodiment. Most recently, he has done a number of works on vision transformers addressing various core vision problems. For more information, please refer to his homepage: jwyang.github.io.