The visual world around us is highly structured. As 2D projections of that world, images are structured as well. An image usually contains a background and some foreground objects (e.g., kites and birds in the sky, sheep and cows on the grass). Moreover, objects usually interact with each other in predictable ways (e.g., mugs sit on tables, keyboards sit below computer monitors, the sky is in the background). This structure manifests itself in the visual data that captures the world around us. In this talk, I will discuss how to leverage this structure in the visual world for visual understanding and for interactions with language and the environment. Specifically, I will present: 1) how to learn to prune dense graphs and perform relational modeling for scene graph generation; 2) how to leverage structure in images for more grounded caption generation and for question generation that actively acquires information from humans; 3) how to learn a moving strategy for an embodied visual system in a 3D environment, achieving better visual perception through actions. Finally, I will briefly describe my ongoing and future work, which aims to connect vision, language, and environment for better visual understanding and interaction.
Talk slides: [ Link ]
Learn more about this and other talks at Microsoft Research: [ Link ]