Nos_CorpusNOS-GL: Galician Macrocorpus for LLM training

de-Dios-Flores, Iria; Paniagua Suárez, Silvia; Bardanca, Daniel; Gamallo, Pablo; García, Marcos; Ramom Pichel Campos, José; Carbajal Pérez, Cristina; Moscoso Sánchez, Antonio; Francisco Marini, Jose Javier; Canosa Pérez, Cristian

doi:10.5281/ZENODO.10687642

de-Dios-Flores, Iria ¹
Paniagua Suárez, Silvia
Bardanca, Daniel
Gamallo, Pablo
García, Marcos
Ramom Pichel Campos, José
Carbajal Pérez, Cristina
Moscoso Sánchez, Antonio
Francisco Marini, Jose Javier
Canosa Pérez, Cristian

1 Universidade de Santiago de Compostela

Universidade de Santiago de Compostela

Santiago de Compostela, Spain

ROR https://ror.org/030eybx10

Publisher: Zenodo

Publication year: 2024

Type: Dataset

CC BY 4.0

DOI: 10.5281/ZENODO.10687642 (open access via publisher)

Abstract

CorpusNÓS is a massive Galician corpus of 2.1B words, devised primarily for training large language models. Its sources are varied and cover a relatively wide range of genres. The corpus is structured as follows:

Subcorpus: Data obtained via transfer agreement

Genre                   No. of tokens   No. of documents
Books                       7,255,784                104
Research articles           2,665,351                664
Press                     124,253,084            224,419
Governmental              245,897,880            654,505
Web contents               15,946,686             44,165
Encyclopedic                4,799,214             47,396
Subtotal                  400,817,999            971,253

Subcorpus: Public data

Genre                   No. of tokens   No. of documents
Press and blogs           153,497,883            665,265
Encyclopedic               57,164,848            184,628
Web crawls              1,384,015,664          3,366,449
Translation corpora       133,726,004          4,745,799
Subtotal                1,728,404,399          8,777,514

Total                   2,129,222,398          9,748,767

Following this structure, the corpus contains one folder for each of the two subcorpora, and each of these contains a folder for every genre. Files are in plain text format (*.txt), and the individual documents inside each file are separated by two line breaks.

Note: Some of the files referenced may be missing from this version of the corpus because of pending transfer agreements. They will be included in a future version as soon as they are available for publication.

Note: The following subcorpora keep their original licenses, as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).

For more details, please refer to our paper, CorpusNÓS: A massive Galician corpus for training large language models.

If you use this data in your work, please cite:

de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese - ACL Anthology (Volume 1), 593-599.
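The abstract states that files are plain text (*.txt) and that documents inside each file are separated by two line breaks. A minimal Python sketch of reading the corpus under that description might look like the following. It assumes that "two line breaks" means a blank line between documents, that files are UTF-8 encoded, and that the corpus has been extracted to a local directory; the function names and the `root` path are hypothetical and do not come from the dataset itself.

```python
import re
from pathlib import Path
from typing import Iterator

# Assumption: documents are separated by two (or more) consecutive line
# breaks, i.e. at least one blank line. Windows line endings are tolerated.
_DOC_SEPARATOR = re.compile(r"(?:\r?\n){2,}")


def split_documents(text: str) -> list[str]:
    """Split the contents of one corpus file into individual documents."""
    return [doc.strip() for doc in _DOC_SEPARATOR.split(text) if doc.strip()]


def iter_corpus_documents(root: Path) -> Iterator[tuple[Path, str]]:
    """Yield (file path, document text) for every .txt file under root.

    The folder layout (subcorpus folder, then genre folder) is not relied
    upon here; files are simply discovered recursively.
    """
    for path in sorted(root.rglob("*.txt")):
        # Assumption: files are UTF-8 encoded.
        text = path.read_text(encoding="utf-8")
        for doc in split_documents(text):
            yield path, doc
```

Each file is read into memory in full, which is simple but may not suit the largest files (for example the web crawls); a line-by-line reader that starts a new document at each blank line would be the natural alternative.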
