Multimodal AI: LLMs that can see (and hear)