BabyLM Challenge 2024: A Call for Small-Scale Pretraining Research
The BabyLM Challenge 2024 invites researchers to optimize pretraining under data limitations inspired by human development. The goal is to democratize research on pretraining and to encourage innovation in the choice of data, its domain, and even its modality. The challenge has three fixed-data tracks. Two of them restrict the training data to pre-released datasets of 10M and 100M words. The third is the multimodal track, whose training set consists of 50M words of paired text-image data and 50M words of text-only data. Two further tracks have no fixed dataset: the ‘bring-your-data’ track and the paper-only track. The challenge will release a shared evaluation pipeline that evaluates submissions on a variety of benchmarks and tasks. Submissions are due on September 13, 2024, and paper submissions are due on September 20, 2024. For more information, visit the BabyLM website at https://babylm.github.io/ or consult the extended call for papers at https://arxiv.org/abs/2404.06214.
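As a rough illustration of the word budgets named above, the following sketch checks whether a text corpus fits within 10M or 100M words. The whitespace-based word count, the track labels, and the function names are assumptions made for this sketch; they are not the challenge's official counting method or tooling.

```python
from typing import Iterable, Mapping

# Word budgets taken from the call text; the keys are hypothetical labels.
TRACK_BUDGETS = {
    "10M": 10_000_000,
    "100M": 100_000_000,
}


def count_words(lines: Iterable[str]) -> int:
    """Count whitespace-delimited tokens (an assumed definition of 'word')."""
    return sum(len(line.split()) for line in lines)


def within_budget(
    lines: Iterable[str],
    track: str,
    budgets: Mapping[str, int] = TRACK_BUDGETS,
) -> bool:
    """Return True if the corpus has no more words than the track's budget."""
    return count_words(lines) <= budgets[track]
```

For a corpus stored on disk, the open file object can be passed directly, for example `within_budget(open("corpus.txt", encoding="utf-8"), "10M")`, where `corpus.txt` is a placeholder path. Lines are read one at a time, so the whole corpus does not need to fit in memory.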