How weak is OpenAI o1?
OpenAI o1 results on ARC-AGI Pub
ARC Prize Testing and Analysis of OpenAI’s New o1 Model
In the past 24 hours, OpenAI released two new models: o1-preview and o1-mini. These models have been specially trained to simulate reasoning: before producing a final answer, they spend extra time generating and refining reasoning tokens.
Hundreds of people have asked how o1 performs on the ARC Prize. So we tested it with the same baseline testing harness we used to evaluate Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5. The results are as follows:
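For context on what such a harness measures: ARC-AGI tasks are JSON objects of colored grids, and scoring is an exact match on the full predicted output grid. Below is a minimal sketch under that assumption; the toy task and the placeholder `solver` (which just mirrors each row) are illustrative inventions, not the actual baseline harness or a real model call.

```python
# A toy task in the public ARC-AGI JSON format: grids are lists of lists of ints
# (cell values 0-9 are colors). Real tasks come from the ARC-AGI-Pub task files.
TASK = {
    "train": [{"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]}],
    "test": [{"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}],
}

def solver(task, grid):
    # Placeholder solver: mirror each row. A real harness would instead prompt
    # a model (e.g. o1-preview) with the train pairs and parse its predicted grid.
    return [list(reversed(row)) for row in grid]

def score(task, solve):
    # ARC scoring is all-or-nothing: the predicted grid must match exactly.
    correct = sum(
        solve(task, pair["input"]) == pair["output"] for pair in task["test"]
    )
    return correct / len(task["test"])

print(score(TASK, solver))  # → 1.0 on this toy task
```

A model's overall ARC-AGI score is just this exact-match rate averaged over the full task set, which is why partial progress on a grid earns no credit.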
Is o1 a new paradigm toward AGI? Will it scale up? o1’s merely average score on ARC-AGI stands in sharp contrast to its impressive results on IOI, AIME, and many other benchmarks. How can this be explained?