ML Scientist

Connecting Scholars with the Latest Academic News and Career Paths

Featured News

HPLT v3.0 Multilingual Dataset Released

HPLT v3.0, a multilingual dataset of 29 billion documents and 112 trillion characters, has been released with significant improvements.

Version 3.0 of the HPLT multilingual dataset is now available. It contains 29 billion documents and 112 trillion characters across 198 language-script combinations.
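To put the headline figures in perspective, the short Python sketch below derives two rough averages from them. The function and variable names are illustrative only, and the results are plain means: the actual distribution across languages is not given here and is likely far from uniform.

```python
# Back-of-envelope averages from the headline figures quoted above.
DOCUMENTS = 29e9             # 29 billion documents
CHARACTERS = 112e12          # 112 trillion characters
LANGUAGE_SCRIPT_PAIRS = 198  # language-script combinations


def averages(documents, characters, pairs):
    """Return (mean characters per document, mean documents per pair)."""
    return characters / documents, documents / pairs


chars_per_doc, docs_per_pair = averages(DOCUMENTS, CHARACTERS, LANGUAGE_SCRIPT_PAIRS)
print(f"~{chars_per_doc:,.0f} characters per document")
print(f"~{docs_per_pair / 1e6:,.0f} million documents per language-script combination")
```

This works out to roughly 3,862 characters per document and about 146 million documents per language-script combination.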

Key improvements include:

  • Unique content increased to 73% on average, up from 52%.
  • Richer and more robust data, thanks to better text extraction and language identification.
  • Increased variety and representativeness of natural web content.
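The announcement does not say how the share of unique content is measured. As one possible reading, the Python sketch below computes the fraction of documents that remain after exact deduplication. The function name `unique_share`, the whitespace and case normalisation, and the sample texts are all assumptions made for illustration, not the HPLT project's actual method.

```python
import hashlib


def unique_share(documents):
    """Fraction of documents that survive exact deduplication.

    Assumption: documents count as duplicates when they match exactly
    after lowercasing and collapsing whitespace. This is an illustrative
    metric, not necessarily the one used for HPLT v3.0.
    """
    if not documents:
        return 0.0
    seen = set()
    for text in documents:
        normalised = " ".join(text.lower().split())
        seen.add(hashlib.sha256(normalised.encode("utf-8")).hexdigest())
    return len(seen) / len(documents)


sample = [
    "Hello world.",
    "hello   world.",      # duplicate after normalisation
    "A different page.",
    "Yet another page.",
]
print(f"{unique_share(sample):.0%}")  # prints 75%
```

A production pipeline would more likely use near-duplicate detection at the document or segment level, so the figures quoted above should not be expected to match this simple measure.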

The dataset is well suited to training LLMs and machine translation systems, especially for low- and medium-resource languages.

For more information and to explore the data, visit https://hplt-project.org/datasets/v3.0.

Tags: HPLT v3.0, multilingual dataset, large-scale corpora, LLMs, machine translation, natural language processing, NLP