HPLT v3.0 Multilingual Dataset Released

October 7, 2025

HPLT v3.0, the latest version of the HPLT multilingual dataset, has been released. It contains 29 billion documents and 112 trillion characters across 198 language-script combinations.

Key improvements include:

The share of unique content rose to 73% on average, up from 52%.
Better text extraction and language identification have improved the substance and robustness of the data.
The data is more varied and more representative of natural web content.

The dataset is well suited to training LLMs and machine translation systems, especially for low- and medium-resource languages.

For more information and to explore the data, visit https://hplt-project.org/datasets/v3.0.

Tags: HPLT v3.0, multilingual dataset, large-scale corpora, LLMs, machine translation, natural language processing, NLP
