Open Generative Large Language Models for Galician

Gamallo, Pablo; Rodríguez, Pablo; de-Dios-Flores, Iria; Sotelo, Susana; Paniagua, Silvia; Bardanca, Daniel; Pichel, José Ramom; Garcia, Marcos

Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 259-270

Type: Article

Abstract

Large language models (LLMs) have transformed natural language processing. Yet their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for lower-resource languages such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt two existing LLMs trained on larger corpora to Galician, thus mitigating the data constraints that would arise if the training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal promising performance, underscoring the importance of linguistic diversity in generative models.

Data source: Dialnet