Get exclusive access to AI resources and project ideas: [ Link ]
Multimodal (Large) Language Models expand an LLM's text-only capabilities to include other modalities. Here are three ways to do this.
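One of the three paths covered in the video (Path 2: LLM + adapters, 7:20) attaches a pretrained vision encoder to a text-only LLM through a small trainable projection, in the spirit of visual instruction tuning [3]. Below is a minimal PyTorch sketch of that idea; the dimensions, module names, and random stand-in tensors are illustrative assumptions, not the exact architecture discussed in the video.

```python
# Illustrative sketch of the "LLM + adapter" path: a frozen vision encoder
# produces patch embeddings, and a small trainable projection maps them into
# the LLM's token-embedding space so they can be prepended to the text tokens.
# All dimensions and names below are assumptions for illustration only.
import torch
import torch.nn as nn

VISION_DIM = 768   # e.g. CLIP ViT patch-embedding size (assumed)
LLM_DIM = 4096     # e.g. LLM hidden / token-embedding size (assumed)

class VisionAdapter(nn.Module):
    """Trainable projection bridging image features to the LLM."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen encoder
        return self.proj(image_features)  # (batch, num_patches, llm_dim)

# Stand-ins for the frozen encoder output and the embedded text prompt
image_features = torch.randn(1, 196, VISION_DIM)  # 14x14 patches (assumed)
text_embeddings = torch.randn(1, 32, LLM_DIM)     # 32 prompt tokens (assumed)

adapter = VisionAdapter(VISION_DIM, LLM_DIM)
visual_tokens = adapter(image_features)

# The LLM then consumes the visual and text tokens as one sequence
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 228, 4096])
```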
Resources:
📰 Blog: [ Link ]
▶️ LLM Playlist: [ Link ]
💻 GitHub Repo: [ Link ]
References:
[1] Multimodal Machine Learning: [ Link ]
[2] A Survey on Multimodal Large Language Models: [ Link ]
[3] Visual Instruction Tuning: [ Link ]
[4] GPT-4o System Card: [ Link ]
[5] Janus: [ Link ]
[6] Learning Transferable Visual Models From Natural Language Supervision: [ Link ]
[7] Flamingo: [ Link ]
[8] Mini-Omni2: [ Link ]
[9] Emu3: [ Link ]
[10] Chameleon: [ Link ]
--
Homepage: [ Link ]
Introduction - 0:00
Multimodal LLMs - 1:49
Path 1: LLM + Tools - 4:24
Path 2: LLM + Adapters - 7:20
Path 3: Unified Models - 11:19
Example: LLaMA 3.2 for Vision Tasks (Ollama) - 13:24
What's next? - 19:58
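A minimal sketch of the local vision demo from the "Example: LLaMA 3.2 for Vision Tasks (Ollama)" chapter above, assuming Ollama and its Python client are installed and the llama3.2-vision model has already been pulled (ollama pull llama3.2-vision); the image path photo.jpg is a hypothetical placeholder.

```python
# Hedged sketch: query a local LLaMA 3.2 Vision model through Ollama's Python client.
# Assumes `pip install ollama` and `ollama pull llama3.2-vision` have been run;
# "photo.jpg" is a hypothetical local image path.
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["photo.jpg"],  # path to a local image file
        }
    ],
)

print(response["message"]["content"])
```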
Multimodal AI: LLMs that can see (and hear)
Tags
AI, Multimodal AI, Multimodal LLM, Multimodal Large Language Model, what is multimodal ai, multimodal rag, multimodal large model, multimodality, ollama, LLaMA, LLaMA 3.2, LLaMA 3.2 Vision, GPT-4o, local LLM, local llm model, ollama on mac, ollama on mac m1, local llm mac, adapter, fine-tuning, encoder, decoder, machine learning, ML, multimodal ML, multimodal deep learning, python, explained, tutorial, lecture, simply explained, introduction, Gemini, Janus, Flamingo, Emu3