In this video, I put several cutting-edge language models through their paces using a custom-built ARC (Abstraction and Reasoning Corpus) challenge interface inspired by François Chollet’s work. I tested o1, Sonnet 3.5, Llama 3.3, DeepSeek V2.5, QwQ Preview, and Qwen 2.5 (72B) against 10 unique ARC challenges to see how they handle complex reasoning tasks.
I also discuss how we might define artificial general intelligence (AGI) and the subtle differences between these models’ reasoning capabilities. Join me as I explore the current state of AI, share insights from these tests, and reflect on what it might mean to inch closer to true AGI.
If you’d like to try the ARC challenge setup yourself, I’ve made the web UI available for everyone interested in experimenting. Share your thoughts on the models’ performance and the future of AGI in the comments!
ARC Creator:
[ Link ]
ARC Questions/Tasks:
[ Link ]