It’s getting harder to tell which company is winning the AI race, Hugging Face co-founder says


- Hugging Face's Thomas Wolf says it is getting harder to tell which AI model is the best as traditional AI benchmarks become saturated. Wolf said the AI industry can instead rely on two newer benchmarking approaches: agentic evaluation and use-case-specific benchmarks.
Thomas Wolf, co-founder and chief scientist at Hugging Face, thinks we may need new ways to measure AI models.
Wolf told the audience at Brainstorm AI in London that as AI models get more advanced, it is becoming difficult to say which one performs best.
“It's hard to say what is the best model,” he said, citing the only nominal differences between recent releases from OpenAI and Google. “They all seem, actually, very close.”
“The world of benchmarks has changed a lot. We used to have these very academic benchmarks that mostly measured a model's knowledge. I think the most popular is MMLU (Massive Multitask Language Understanding), which is a set of graduate- or PhD-level questions the model has to answer,” he said. “These benchmarks are almost all saturated now.”
Over the past year, a growing chorus of voices from academia, industry, and policy has argued that common AI benchmarks such as MMLU, GLUE, and HellaSwag have reached saturation, can be gamed, and no longer reflect real-world utility.
In a paper published in February, titled “Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation,” researchers at the European Commission's Joint Research Centre found “systemic flaws in current benchmarking practices,” including misaligned incentives, construct validity failures, gamed results, and data contamination.
Wolf said the AI industry should rely on two main types of benchmarks going into 2025: agentic evaluations, where LLMs are expected to complete tasks, and benchmarks tailored to each specific use case for the models.
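Wolf did not spell out how agentic evaluation is scored, but the underlying idea, judging a model by whether its actions complete a task rather than by a multiple-choice answer, can be sketched roughly as below. Everything here, including the function names, is a hypothetical illustration rather than any specific benchmark's API.

```python
# Minimal sketch of an agentic evaluation: the model is judged on whether its
# actions reach a goal state, not on picking the right multiple-choice answer.
# All names here are hypothetical placeholders.

from typing import Callable


def run_agentic_task(agent: Callable[[str], str], task: str,
                     check_goal: Callable[[str], bool], max_steps: int = 5) -> bool:
    """Let the agent act step by step and report whether it reached the goal."""
    observation = task
    for _ in range(max_steps):
        action = agent(observation)          # e.g. a tool call or an edit produced by the LLM
        if check_goal(action):                # success criterion defined by the task, not the model
            return True
        observation = f"{task}\nPrevious attempt: {action}"  # feed the outcome back in
    return False


# Toy example: the "goal" is producing an answer containing "42".
toy_agent = lambda obs: "the answer is 42"
print(run_agentic_task(toy_agent, "Compute 6 * 7", lambda a: "42" in a))  # True
```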
Hugging Face is already working on the latter.
The company's new program, “YourBench,” aims to help users determine which model to use for a specific task. Users feed a few documents into the program, which then automatically generates a benchmark specific to that type of work; users can run that benchmark against different models to see which one performs best for their use case.
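The article does not describe YourBench's interface, but the workflow it describes (feed in documents, auto-generate questions from them, then score several models on those questions) can be illustrated roughly as follows. The function names and the crude scoring here are hypothetical placeholders, not the actual YourBench API.

```python
# Minimal sketch of a document-driven, use-case-specific benchmark, as described
# in the article. All function names are hypothetical placeholders.

from typing import Callable


def generate_questions_from_docs(docs: list[str]) -> list[dict]:
    """Hypothetical step: turn the user's documents into question/reference pairs.
    In practice this would itself be done by an LLM prompted over each document."""
    return [{"question": f"Summarize: {d[:50]}", "reference": d} for d in docs]


def score_answer(answer: str, reference: str) -> float:
    """Hypothetical scoring step: crude word-overlap score in [0, 1]."""
    ans, ref = set(answer.lower().split()), set(reference.lower().split())
    return len(ans & ref) / max(len(ref), 1)


def benchmark(models: dict[str, Callable[[str], str]], docs: list[str]) -> dict[str, float]:
    """Run every candidate model over the auto-generated benchmark and
    return an average score per model."""
    items = generate_questions_from_docs(docs)
    results = {}
    for name, ask_model in models.items():
        scores = [score_answer(ask_model(item["question"]), item["reference"]) for item in items]
        results[name] = sum(scores) / len(scores)
    return results


if __name__ == "__main__":
    docs = ["The refund policy allows returns within 30 days of purchase."]
    models = {
        "model_a": lambda q: "Returns are allowed within 30 days of purchase.",
        "model_b": lambda q: "I do not know.",
    }
    print(benchmark(models, docs))  # model_a should score higher than model_b
```

In a real setting the question generation and scoring would typically be handled by an LLM judge or a task-specific metric; the point is only that the test items come from the user's own documents rather than a fixed academic question set.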
“Just because these models all score the same on the same academic benchmarks, that doesn't really mean they are the same,” Wolf said.
Open source's 'ChatGPT moment'
Founded by Wolf, Clément Delangue, and Julien Chaumond in 2016, Hugging Face has long been a champion of open-source AI.
Often referred to as the GitHub of machine learning, the company provides an open-source platform that lets developers, researchers, and businesses build, share, and deploy machine-learning models, datasets, and applications. Users can also browse models and datasets uploaded by others.
Wolf told the Brainstorm AI audience that Hugging Face's “business model is actually aligned with open source” and that the company's goal is “to have the maximum number of people participating in this kind of open community and sharing models.”
Wolf predicted that open-source AI will continue to develop, especially after DeepSeek's success earlier this year.
After its launch earlier this year, the Chinese-made AI model DeepSeek R1 sent shockwaves through the AI world when testers found it matching or even outperforming American closed-source AI models.
Wolf said DeepSeek was a “ChatGPT moment” for open-source AI.
“Just as ChatGPT was the moment the world discovered AI, DeepSeek was the moment the world discovered that there is this kind of open-source community,” he said.
This story was originally featured on Fortune.com