Hey everyone! Thank you so much for watching the 110th episode of the Weaviate Podcast! Today we are diving into Snowflake's Arctic Embed model series and their newly released Arctic Embed 2.0, an open-source model that adds support for multilingual text embeddings. The podcast covers the origin of Arctic Embed, pre-training embedding models, Matryoshka Representation Learning (MRL), fine-tuning embedding models, synthetic query generation, hard negative mining, and where single-vector embedding models fit alongside multi-vector ColBERT, SPLADE, and re-rankers.
Links:
Snowflake's Arctic Embed 2.0 Goes Multilingual: [ Link ]
arXiv Paper: [ Link ]
Arctic Embedding Model Series on HuggingFace: [ Link ]
Arctic Embed M v1.5: [ Link ]
Initial Launch of Arctic Embed: [ Link ]
arXiv Paper: [ Link ]
Embedding and Clustering your Data can improve Contrastive Pretraining: [ Link ]
Google Gecko Embeddings Technical Report: [ Link ]
Introducing Weaviate Embeddings: [ Link ]
Embeddings on Weaviate Workbench: [ Link ]
Rescoring from Disk in Weaviate: [ Link ]
Chapters
0:00 Welcome Luke, Puxuan, and Charles!
1:08 The Origin of Arctic Embed
4:45 Moving up the Stack at Weaviate
7:08 MTEB Benchmark
12:08 Pre-Training Embedding Models
19:15 Cost of Embedding Model Pre-Training
25:42 Pre-Training LLMs vs. Embedding Models
28:45 Matryoshka Representation Learning
37:50 Resolutions of Embeddings
44:50 Source Stratification
50:10 Hard Negative Mining
1:04:55 Synthetic Query Generation
1:14:45 Multilingual Text Embeddings
1:24:50 Thoughts on ColBERT, SPLADE, and Rerankers