Nos_Celtia-GL: Galician TTS corpus

  1. Vázquez Abuín, Marta (1)
  2. García Díaz, Noelia (1)
  3. Vladu, Adina Ioana (1)
  4. Magariños, Carmen (1)
  5. Vidal Miguéns, Adrián (1)
  6. Fernández Rei, Elisa (1)

  (1) Universidade de Santiago de Compostela, Santiago de Compostela, Spain. ROR: https://ror.org/030eybx10

Publisher: Zenodo

Year of publication: 2023

Type: Dataset

CC BY 4.0

Abstract

This corpus is publicly accessible upon accepting the terms and conditions and requesting access.

Nos_Celtia-GL is a single-speaker Galician TTS corpus of approximately 25 hours of speech. It is a phonetically and morphosyntactically rich corpus of 20,000 phrases (approximately 200,000 words) comprising two subcorpora: a previously compiled corpus created by the Grupo de Tecnoloxías Multimedia (GTM) together with the Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH), and a corpus compiled by the Nós Project from multi-domain texts.

The text corpus statistics are detailed below, by subcorpus:

GTM
  Number of sentences: 10,000
  Number of words: 121,726
  Sentence length (words): 1-44
  Sentence domain / type:
    Journalistic (written) text
    Manually designed sentences (interrogative, exclamative, imperative, lists of numbers…)

Nós
  Number of sentences: 10,000
  Number of words: 99,622
  Sentence length (words): 1-36
  Sentence domain / type:
    21.8% transcripts of oral discourse
    17.5% dictionary definitions
    12.7% transcripts of parliamentary speeches
    20% transcripts of news broadcasts
    28% short (<4 words), interrogative, exclamative, imperative, and elliptical sentences

While the Nós subcorpus has undergone a thorough linguistic review, we decided not to adapt the GTM corpus to the current grammatical norms of the Galician language, so that it remains parallel to the previously recorded CRPIH_UVigo-GL-Voices corpus.

Nos_Celtia-GL was recorded in a controlled environment (recording studio) by a professional female voice talent. She was selected from among four speakers through a perceptual listening test in which more than 50 participants assessed the speakers' clarity, prosody, likeability, and language proficiency.

The audio file names consist of a series of lowercase elements indicating the type of audio (raw), the creators of the corpus (nos), the name of the voice (celtia), and the ISO code for the Galician language (gl), followed by a 5-digit number identifying the utterance. All components are separated by underscores (e.g., raw_nos_celtia_gl_00001.wav).
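The naming scheme described above can be checked programmatically. The following is a minimal sketch, not part of the corpus distribution: the helper names `build_filename` and `parse_filename` are hypothetical, and the pattern is inferred only from the description and the single example file name given here.

```python
import re

# Pattern inferred from the description: audio type, creators, voice name,
# language code, and a 5-digit utterance number, joined by underscores.
FILENAME_RE = re.compile(r"(raw)_(nos)_(celtia)_(gl)_(\d{5})\.wav")


def build_filename(utterance_id: int) -> str:
    """Return the expected file name for an utterance number (hypothetical helper)."""
    if not 0 <= utterance_id <= 99999:
        raise ValueError("utterance number must fit in 5 digits")
    return f"raw_nos_celtia_gl_{utterance_id:05d}.wav"


def parse_filename(name: str) -> dict:
    """Split a file name into its components, or raise ValueError if it does not match."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    audio_type, creators, voice, language, number = match.groups()
    return {
        "type": audio_type,
        "creators": creators,
        "voice": voice,
        "language": language,
        "utterance": int(number),
    }


parts = parse_filename("raw_nos_celtia_gl_00001.wav")
print(parts["voice"], parts["utterance"])
```

A strict pattern like this is useful mainly as a sanity check after download; a later version of the corpus with a different audio type prefix would need the pattern relaxed.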
Metadata is provided in "metadata.csv". This file consists of one record per line, with fields delimited by the vertical bar character (0x7c). The fields are:

  1. Audio file: name of the corresponding .wav file
  2. Transcription: non-normalized text read by the speaker (UTF-8)

The audio files are available in the format in which they were originally recorded (48 kHz, 16-bit WAV) and amount to approximately 25 hours. Version 1.0.0 contains the raw sound files with no editing or normalization, together with the corresponding text.

For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.

Terms and conditions

Ownership of the speech data contained in this dataset has been transferred to the University of Santiago de Compostela (USC) for a period of 15 years. Starting 30/11/2037, this data will be removed. After this date, the USC is not liable for any use by third parties who might have downloaded the dataset.

Citing

Please refer to our paper for more details: Nos_Celtia-GL: an Open High-Quality Speech Synthesis Resource for Galician

If you use this data in your work, please cite:

García Díaz, N., Vázquez Abuín, M., Magariños, C., Vladu, A.I., Moscoso Sánchez, A., Fernández Rei, E. (2024) Nos_Celtia-GL: an Open High-Quality Speech Synthesis Resource for Galician. Proc. IberSPEECH 2024, 91-95, doi: 10.21437/IberSPEECH.2024-19

Funding and acknowledgements

"The Nós project: Galician in the society and economy of Artificial Intelligence" is possible thanks to the funding resulting from the agreement 2021-CP080 between the Xunta de Galicia and the University of Santiago de Compostela, and thanks to the Investigo program, within the National Recovery, Transformation and Resilience Plan, within the framework of the European Recovery Fund (NextGenerationEU).

We would like to thank the speaker, Consuelo Díaz Isorna, for kindly providing her voice to this project.
We would also like to thank the following entities for their kind collaboration in providing the data for the text corpus: Grupo de Tecnoloxías Multimedia (GTM), Centro Ramón Piñeiro para a Investigación en Humanidades (CRPIH), Real Academia Galega, Corporación Radio Televisión de Galicia S.A., Parlamento de Galicia, and the Arquivo do Galego Oral (ILG) project. Our gratitude also goes to Xoán Carlos Goris García, Elia Lago Pereira and Alicia López Besteiro for reviewing part of the audio corpus.
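Usage example

The layout of "metadata.csv" described earlier in this record (one record per line, two fields separated by the vertical bar) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `read_metadata` is hypothetical, the file is assumed to have no header row, and each line is split only on the first vertical bar in case a transcription itself contains one. The two sample records are invented placeholders, not actual corpus content.

```python
import io


def read_metadata(stream):
    """Yield (audio_file, transcription) pairs from a bar-delimited text stream."""
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first bar only, so any later bar stays in the transcription.
        audio_file, transcription = line.split("|", 1)
        yield audio_file, transcription


# Invented sample records, for illustration only.
sample = io.StringIO(
    "raw_nos_celtia_gl_00001.wav|Bos días.\n"
    "raw_nos_celtia_gl_00002.wav|Que hora é?\n"
)
rows = list(read_metadata(sample))
for audio_file, transcription in rows:
    print(audio_file, "->", transcription)
```

With the real file, the same function could be called on `open("metadata.csv", encoding="utf-8")`, since the transcriptions are stated to be UTF-8.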