
BabyLM Challenge 2024: A Call to Optimize Pretraining with Data Limitations

The BabyLM Challenge 2024 is a shared task that encourages researchers to optimize pretraining under data limitations inspired by human language development. By formulating an exciting open problem and building a community around it, the challenge aims to democratize research on pretraining, which is typically thought to be practical only for large industry groups.

The task has three fixed-data tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words. The third is the multimodal track, whose training set consists of 50M words of paired text-image data and 50M words of text-only data. There are also two tracks without fixed datasets: the ‘bring-your-data’ track and the paper-only track. The challenge focuses on small-scale pretraining as a sandbox for developing novel techniques that improve data efficiency and enhance current approaches to modeling low-resource languages.

The challenge has several key dates: training data is released on March 30, 2024, and papers are due on September 20, 2024. For more information, visit the BabyLM website or consult the extended call for papers.
