GUM Corpus V11 Released: Enhanced Annotations and New Documents
Version 11 of the GUM corpus adds new documents and annotations, covering 24 genres with multiple layers of annotation.
Version 11.0.0 of the Georgetown University Multilayer corpus (GUM) has been released, with a range of new documents and annotations.
New in this version: the merger of GUM with the out-of-domain test set GENTLE, additional documents that bring the total to 268,208 tokens, five different summaries per document, and graded salience scores for each entity in every document.
GUM is an open-source corpus of richly annotated English texts from 24 genres, including academic writing, biographies, courtroom transcripts, and more.
The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.
This version contains roughly 281 documents annotated with multiple POS tags, lemmas, morphological segmentation, sentence segmentation, and more.
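To give a feel for what token-level layers such as POS tags and lemmas look like in practice, here is a minimal Python sketch. It assumes the data is read in the tab-separated CoNLL-U layout used by Universal Dependencies; the sample sentence, the `FIELDS` list, and the `read_tokens` helper are hypothetical illustrations and are not taken from GUM or its tooling.

```python
# Hypothetical sample in CoNLL-U layout (10 tab-separated columns per token).
# This sentence is invented for illustration and is not from GUM.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
)

# Standard CoNLL-U column names, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_tokens(conllu_text):
    """Return one dict per token line, skipping comments and blank lines."""
    tokens = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # sentence-level comment or sentence boundary
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # not a well-formed token line
        tokens.append(dict(zip(FIELDS, cols)))
    return tokens


tokens = read_tokens(SAMPLE)
for tok in tokens:
    # Surface form, lemma, and two POS layers side by side.
    print(tok["form"], tok["lemma"], tok["upos"], tok["xpos"])
```

Each token carries several annotations at once (here a lemma and two POS tags), which is the sense in which a corpus of this kind is "multilayer". Real files may also contain multiword-token or empty-node lines, which this sketch passes through as ordinary rows.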
For more information and to search or download the corpus online, visit the corpus website.
Tags: GUM Corpus, Georgetown University Multilayer corpus, natural language processing, corpus linguistics, text annotation, machine learning