ML Scientist

Connecting Scholars with the Latest Academic News and Career Paths

Featured News

Collaborative Benchmark for Large Language Models (LLMs)

A new benchmark dataset for Large Language Models (LLMs) is being proposed to identify areas for improvement and advance the field toward reliable LLMs.

The benchmark is intended to complement existing, highly challenging benchmarks. The goal is to collect relatively simple questions on which current models still fail, making it easier to pinpoint areas for improvement.

The benchmark will contain only questions that current models cannot solve; as questions become solvable, they will be removed. The project aims to advance the field toward reliable LLMs that fail only on unreasonable questions.
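To make the retain-or-remove rule concrete, here is a minimal Python sketch of that curation loop. It assumes a "model" is simply a callable from question text to an answer string and that grading is an exact string match; both are illustrative assumptions, not the project's actual evaluation setup.

```python
from typing import Callable

# Illustrative assumption: a model is any callable mapping a question
# string to an answer string (in practice, a wrapper around an LLM API).
Model = Callable[[str], str]

def is_solved(question: dict, models: list[Model]) -> bool:
    """A question counts as solved once any current model answers it correctly.
    Exact string match stands in for real grading here."""
    return any(m(question["text"]).strip() == question["answer"] for m in models)

def refresh_benchmark(questions: list[dict], models: list[Model]) -> list[dict]:
    """Keep only the questions that every evaluated model still fails."""
    return [q for q in questions if not is_solved(q, models)]

# Toy usage: a 'model' that only answers one arithmetic question.
if __name__ == "__main__":
    toy_model: Model = lambda text: "4" if text == "What is 2 + 2?" else "unsure"
    questions = [
        {"text": "What is 2 + 2?", "answer": "4"},           # solvable -> removed
        {"text": "A hard logic puzzle...", "answer": "..."},  # unsolved -> retained
    ]
    print(refresh_benchmark(questions, [toy_model]))
```

Re-running a refresh step like this as new models appear would keep the dataset focused on what remains unsolved.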

Collaborators are sought to contribute questions that have tripped up the models they've tested; submissions can be made through a Google Form. Questions submitted before May 1st will be considered for the benchmark, and contributors may be invited as co-authors.

Two examples of questions that current models cannot solve are provided: a logic puzzle and a math problem.

Tags: Large Language Models, LLM benchmark, collaborative project, machine learning, natural language processing, AI research