Nos_CorpusNOS-GL: Galician Macrocorpus for LLM training

  1. de-Dios-Flores, Iria 1
  2. Paniagua Suárez, Silvia
  3. Bardanca, Daniel
  4. Gamallo, Pablo
  5. García, Marcos
  6. Ramom Pichel Campos, José
  7. Carbajal Pérez, Cristina
  8. Moscoso Sánchez, Antonio
  9. Francisco Marini, Jose Javier
  10. Canosa Pérez, Cristian
  1. Universidade de Santiago de Compostela, Santiago de Compostela, Spain
     ROR: https://ror.org/030eybx10

Publisher: Zenodo

Year of publication: 2024

Type: Dataset

License: CC BY 4.0

Abstract

CorpusNÓS is a massive Galician corpus made up of 2.1B words, primarily devised for training large language models. The corpus sources are varied and represent a relatively wide range of genres. The corpus is structured as follows:

Subcorpus: Data obtained via transfer agreement

  Genre                  Nº tokens      Nº documents
  Books                      7,255,784          104
  Research articles          2,665,351          664
  Press                    124,253,084      224,419
  Governmental             245,897,880      654,505
  Web contents              15,946,686       44,165
  Encyclopedic               4,799,214       47,396
  Subtotal                 400,817,999      971,253

Subcorpus: Public data

  Genre                  Nº tokens      Nº documents
  Press and blogs          153,497,883      665,265
  Encyclopedic              57,164,848      184,628
  Web crawls             1,384,015,664    3,366,449
  Translation corpora      133,726,004    4,745,799
  Subtotal               1,728,404,399    8,777,514

  Total                  2,129,222,398    9,748,767

Following this structure, the corpus contains one top-level folder for each subcorpus and, within each subcorpus folder, one folder per genre. Files are in plain text format (*.txt), and individual documents inside each file are separated by two line breaks.

Note: Some of the files referenced may be missing from this version of the corpus due to pending transfer agreements; they will be included in a future version of the corpus as soon as they are available for publishing.

Note: The following subcorpora are distributed under different licenses, corresponding to their original licenses as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).

For more details, please refer to our paper: CorpusNÓS: A massive Galician corpus for training large language models.

If you use this data in your work, please cite:

de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese (Volume 1), 593-599. ACL Anthology.
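As an illustration, the sketch below shows one way to iterate over the corpus given the layout described above (one folder per subcorpus, one folder per genre, *.txt files with documents separated by two line breaks). The root path and the folder traversal are assumptions for demonstration, not part of the official release.

```python
import os

def iter_documents(corpus_root):
    """Yield (subcorpus, genre, document_text) triples from a local copy of CorpusNÓS.

    Assumes the structure described above: subcorpus folders containing genre
    folders, which contain plain-text files whose documents are separated by
    two line breaks. Folder and file names are illustrative assumptions.
    """
    for subcorpus in sorted(os.listdir(corpus_root)):
        sub_path = os.path.join(corpus_root, subcorpus)
        if not os.path.isdir(sub_path):
            continue
        for genre in sorted(os.listdir(sub_path)):
            genre_path = os.path.join(sub_path, genre)
            if not os.path.isdir(genre_path):
                continue
            for filename in sorted(os.listdir(genre_path)):
                if not filename.endswith(".txt"):
                    continue
                with open(os.path.join(genre_path, filename), encoding="utf-8") as f:
                    # Individual documents inside each file are separated by two line breaks.
                    for doc in f.read().split("\n\n"):
                        if doc.strip():
                            yield subcorpus, genre, doc.strip()

# Example usage (the path "CorpusNOS" is hypothetical):
# for subcorpus, genre, doc in iter_documents("CorpusNOS"):
#     print(subcorpus, genre, len(doc.split()))
```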