Towards accurate dependency parsing for Galician with limited resources

  1. Sarymsakova, Albina
  2. Sánchez-Rodríguez, Xulia
  3. Garcia, Marcos
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 247-257

Type: Article

Other publications in: Procesamiento del lenguaje natural

Abstract

Automatic syntactic parsing is fundamental to NLP. However, effective tools require large, high-quality treebanks to be trained successfully. Consequently, parsing quality remains inadequate for low-resource languages such as Galician. In this context, the present study explores several approaches to improving Galician parsing within the Universal Dependencies (UD) framework. Our experiments analyze model quality when the initial training corpus is enlarged with data from the Galician PUD. In addition, we explore the benefits of incorporating contextualized vector representations and of using various BERT models. Finally, we evaluate the impact of integrating cross-lingual data from similar varieties into training, analyzing model performance on the treebanks used. Our findings highlight (1) the positive correlation between increased training data and improved model performance; (2) the superior performance of monolingual BERT models compared with their multilingual counterparts; (3) the overall improvement in model performance on the treebanks after incorporating cross-lingual data.

Bibliographic references

  • Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.
  • Gamallo, P. and I. González. 2012. DepPattern: a multilingual dependency parser. In Demo Session of the International Conference on Computational Processing of the Portuguese Language (PROPOR 2012), pages 17–20. Citeseer.
  • Garcia, M. 2021. Exploring the representation of word meanings in context: A case study on homonymy and synonymy. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3625–3640, Online, August. Association for Computational Linguistics.
  • Garcia, M., C. Gómez-Rodríguez, and M. A. Alonso. 2018. New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies. Natural Language Engineering, 24(1):91–122.
  • Glavaš, G. and I. Vulić. 2021. Climbing the tower of treebanks: Improving low-resource dependency parsing via hierarchical source selection. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4878–4888, Online, August. Association for Computational Linguistics.
  • Kann, K., K. Cho, and S. R. Bowman. 2019. Towards realistic practices in low-resource natural language processing: The development set. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3342–3349, Hong Kong, China, November. Association for Computational Linguistics.
  • Kondratyuk, D. and M. Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.
  • Lopes, L. and T. Pardo. 2024. Towards portparser - a highly accurate parsing system for Brazilian Portuguese following the Universal Dependencies framework. In P. Gamallo, D. Claro, A. Teixeira, L. Real, M. Garcia, H. G. Oliveira, and R. Amaro, editors, Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 401–410, Santiago de Compostela, Galicia/Spain, March. Association for Computational Linguistics.
  • Müller-Eberstein, M., R. van der Goot, and B. Plank. 2021. Genre as weak supervision for cross-lingual dependency parsing. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4786–4802, Online and Punta Cana, Dominican Republic, November. Association for Computational Linguistics.
  • Sánchez-Rodríguez, X., A. Sarymsakova, L. Castro, and M. Garcia. 2024. Increasing manually annotated resources for Galician: the parallel Universal Dependencies treebank. In P. Gamallo, D. Claro, A. Teixeira, L. Real, M. Garcia, H. G. Oliveira, and R. Amaro, editors, Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 587–592, Santiago de Compostela, Galicia/Spain, March. Association for Computational Linguistics.
  • Vania, C., Y. Kementchedjhieva, A. Søgaard, and A. Lopez. 2019. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1105–1116, Hong Kong, China, November. Association for Computational Linguistics.
  • Vilares, D., M. Garcia, and C. Gómez-Rodríguez. 2021. Bertinho: Galician BERT representations. arXiv preprint arXiv:2103.13799.
  • Zeman, D., J. Hajič, M. Popel, M. Potthast, M. Straka, F. Ginter, J. Nivre, and S. Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In D. Zeman and J. Hajič, editors, Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, Brussels, Belgium, October. Association for Computational Linguistics.
  • Zeman, D., M. Popel, M. Straka, [et al.]. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19, Vancouver, Canada, August. Association for Computational Linguistics.