New Genre-Enriched Web Corpora and Multilingual Genre Classifier Available
New genre-enriched web corpora and a multilingual text genre classifier are now available in the CLARIN.SI and Hugging Face repositories. To test your own systems on automatic genre identification, use the AGILE-Automatic-Genre-Identification-Benchmark.
New genre-enriched web corpora for 13 European languages and a multilingual text genre classifier have been made available in the CLARIN.SI and Hugging Face repositories. The MaCoCu web corpora comprise 67 million texts and 28.5 billion words, and each text is automatically annotated with a genre label. The multilingual text genre classifier can be applied to any of the 100 languages covered by the XLM-RoBERTa model. Additionally, a benchmark for continuous evaluation of systems on automatic genre identification has been set up.
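For readers who want to try the classifier on their own texts, the following is a minimal sketch rather than official usage. It assumes the model is loaded through the Hugging Face `transformers` text-classification pipeline. The model identifier in the comment is an assumption that should be checked against the model's Hugging Face page, and the helper names `label_genres` and `load_hf_classifier` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A classifier here is any callable that maps a list of texts to a list of
# {"label": ..., "score": ...} dicts, which is the shape returned by a
# Hugging Face text-classification pipeline.
Classifier = Callable[[List[str]], List[Dict]]


def label_genres(texts: Iterable[str], classify: Classifier) -> List[Tuple[str, str, float]]:
    """Pair each text with its predicted genre label and confidence score."""
    batch = list(texts)
    if not batch:
        return []
    predictions = classify(batch)
    return [(text, pred["label"], pred["score"]) for text, pred in zip(batch, predictions)]


def load_hf_classifier(model_id: str) -> Classifier:
    """Wrap a Hugging Face pipeline as a Classifier (hypothetical helper).

    Requires the `transformers` package and downloads the model on first use.
    """
    from transformers import pipeline  # imported lazily so the helper above has no hard dependency

    pipe = pipeline("text-classification", model=model_id)
    # Truncate long web texts to the model's maximum input length.
    return lambda batch: pipe(batch, truncation=True)


# Example usage (assumed model identifier; verify it before running):
# classify = load_hf_classifier("classla/xlm-roberta-base-multilingual-text-genre-classifier")
# for text, label, score in label_genres(["Vlada je danes sprejela nov zakon."], classify):
#     print(label, score, text)
```

Keeping the classifier behind a plain callable means the same `label_genres` helper can be reused with any other system, for instance when comparing outputs against the benchmark.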
Tags: genre-enriched web corpora, multilingual genre classifier, CLARIN.SI, Hugging Face, MaCoCu web corpora, XLM-RoBERTa model, automatic genre annotation, text genre classifier, AGILE-Automatic-Genre-Identification-Benchmark