ML Scientist

Connecting Scholars with the Latest Academic News and Career Paths


CLASSLA-web 2.0: Expanded Web Corpora for South Slavic Languages

CLASSLA-web 2.0 has been released with approximately 38 million texts and 17 billion words in seven South Slavic languages, supporting linguistic research and NLP tasks.

The second version of the South Slavic CLASSLA-web corpora has been released. It contains approximately 38 million texts and 17 billion words collected from the web in 2024, and covers seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian.

  • The texts are linguistically annotated and automatically classified by genre.
  • They are enriched with topic labels.
  • The collection is significantly expanded compared to CLASSLA-web 1.0.

For more information, visit the CLASSLA-web website. The corpora can be browsed via CLARIN.SI concordancers or downloaded under a CC0 license from the CLARIN.SI repository.

Tags: CLASSLA-web 2.0, South Slavic languages, web corpora, linguistic research, natural language processing, corpus linguistics, lexicography