Bertinho: Galician BERT Representations

  1. David Vilares Calvo
  2. Marcos García González
  3. Carlos Gómez Rodríguez
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2021

Issue: 66

Pages: 13-26

Type: Article


Abstract

This article presents a monolingual BERT model for Galician. We build on the recent trend showing that it is possible to train robust monolingual BERT models even for languages with relatively scarce resources, and that such models outperform the official multilingual BERT model (mBERT). Specifically, we release two monolingual models for Galician, with 6 and 12 transformer layers, respectively, trained with a limited amount of resources (~45 million words on a single 24GB GPU). To evaluate them, we run an extensive set of experiments on tasks such as part-of-speech tagging, dependency parsing, and named entity recognition. These tasks are cast as sequence labeling, so that the BERT models can be run without adding any extra layers (only the output layer that maps the contextualized representations to the predicted tag is added); a sketch of this setup follows below. The experiments show that our models, especially the 12-layer one, outperform mBERT on most tasks.
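The setup described in the abstract (a pre-trained encoder plus a single output layer mapping each contextualized token representation to a tag) corresponds to standard token classification as implemented in the HuggingFace Transformers library (Wolf et al., 2019), cited below. The following minimal sketch illustrates the idea; the model identifier and the tag set are assumptions used for illustration, not taken from the paper, and the classification head would still need fine-tuning on labeled Galician data before it produces meaningful tags.

```python
# Minimal sketch: a BERT encoder plus a single linear output layer that maps
# each token's contextualized representation to a label (e.g., a PoS or NER tag).
# The model identifier and tag set below are assumptions, not from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dvilares/bertinho-gl-base-cased"      # assumed Hub identifier
tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]    # illustrative tag set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(tags)
)  # adds only a linear classification head on top of the encoder outputs
   # (the head is randomly initialized here and would be learned by fine-tuning)

inputs = tokenizer(
    "Rosalía de Castro naceu en Santiago de Compostela.", return_tensors="pt"
)
with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)      # one tag index per subword token
print([tags[i] for i in predictions[0].tolist()])
```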

Funding information

This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No. 714150), from MINECO (ANSWER-ASAP, TIN2017-85160-C2-1-R), from Xunta de Galicia (ED431C 2020/11), from Centro de Investigación de Galicia 'CITIC', funded by Xunta de Galicia and the European Union (European Regional Development Fund, Galicia 2014-2020 Program) through grant ED431G 2019/01, and from Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), ERDF 2014-2020: Call ED431G 2019/04. DV is supported by a 2020 Leonardo Grant for Researchers and Cultural Creators from the BBVA Foundation. MG is supported by a Ramón y Cajal grant (RYC2019-028473-I).

Bibliographic References

  • Agerri, R., X. Gómez Guinovart, G. Rigau, and M. A. Solla Portela. 2018. Developing new linguistic resources and tools for the Galician language. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).
  • Agerri, R., I. San Vicente, J. A. Campos, A. Barrena, X. Saralegi, A. Soroa, and E. Agirre. 2020. Give your text representation models some love: the case for Basque. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4781–4788, Marseille, France, May. European Language Resources Association.
  • Bengio, Y., R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.
  • Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Cañete, J., G. Chaperon, R. Fuentes, and J. Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In Practical ML for Developing Countries Workshop (PML4DC) at ICLR. Learning under limited/low resource scenarios.
  • Collobert, R. and J. Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167.
  • Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural Language Processing (Almost) From Scratch. Journal of Machine Learning Research, 12:2493–2537.
  • Dai, A. M. and Q. V. Le. 2015. Semisupervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.
  • Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June.
  • Ettinger, A. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.
  • Freixeiro Mato, X. R. 2003. Gramática da Lingua Galega IV. Gramática do texto. A Nosa Terra, Vigo.
  • Garcia, M. and P. Gamallo. 2010. Análise Morfossintáctica para Português Europeu e Galego: Problemas, Soluções e Avaliação. Linguamática, 2(2):59–67.
  • Garcia, M., C. Gómez-Rodríguez, and M. A. Alonso. 2016. Creación de un treebank de dependencias universales mediante recursos existentes para lenguas próximas: el caso del gallego. Procesamiento del Lenguaje Natural, 57:33–40.
  • Garcia, M., C. Gómez-Rodríguez, and M. A. Alonso. 2018. New treebank or repurposed? on the feasibility of cross-lingual parsing of romance languages with universal dependencies. Natural Language Engineering, 24(1):91–122.
  • Guinovart, X. G. and S. L. Fernández. 2009. Anotación morfosintáctica do Corpus Técnico do Galego. Linguamática, 1(1):61–70.
  • Guinovart, X. 2017. Recursos integrados da lingua galega para a investigación lingüística. Gallaecia. Estudos de lingüística portuguesa e galega, pages 1045–1056.
  • IGE. 2018. Coñecemento e uso do galego. Instituto Galego de Estatística, http://www.ige.eu/web/mostrar_actividade_estatistica.jsp?idioma=gl&codigo=0206004.
  • Jiang, N. and M.-C. de Marneffe. 2019. Evaluating BERT for natural language inference: A case study on the CommitmentBank. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6086–6091, Hong Kong, China, November.
  • Karthikeyan, K., Z. Wang, S. Mayhew, and D. Roth. 2020. Cross-Lingual Ability of Multilingual BERT: An Empirical Study. In International Conference on Learning Representations (ICLR 2020).
  • Kingma, D. P. and J. Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015). arXiv preprint arXiv:1412.6980.
  • Kitaev, N. and D. Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686, Melbourne, Australia, July.
  • Koutsikakis, J., I. Chalkidis, P. Malakasiotis, and I. Androutsopoulos. 2020. GREEK-BERT: The Greeks visiting Sesame Street. In 11th Hellenic Conference on Artificial Intelligence. ACM.
  • Kuratov, Y. and M. Arkhipov. 2019. Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language. Computational Linguistics and Intellectual Technologies, 18:333–339.
  • Lan, Z., M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In Proceedings of the International Conference of Learning Representations (ICLR 2020).
  • Landauer, T. K. and S. T. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2):211.
  • Lee, S., H. Jang, Y. Baik, S. Park, and H. Shin. 2020. KR-BERT: A Small-Scale Korean-Specific Language Model. arXiv preprint arXiv:2008.03979.
  • Lin, Y., Y. C. Tan, and R. Frank. 2019. Open sesame: Getting inside BERT’s linguistic knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253, Florence, Italy, August. Association for Computational Linguistics.
  • Lindley Cintra, L. F. and C. Cunha. 1984. Nova Gramática do Português Contemporâneo. Livraria Sá da Costa, Lisbon.
  • Liu, Y., M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
  • Malvar, P., J. R. Pichel, O. Senra, P. Gamallo, and A. García. 2010. Vencendo a escassez de recursos computacionais. Carvalho: Tradutor automático estatístico inglês-galego a partir do corpus paralelo Europarl inglês-português. Linguamática, 2(2):31–38.
  • McDonald, S. and M. Ramscar. 2001. Testing the distributional hypothesis: The influence of context on judgements of semantic similarity. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 23.
  • Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. In Workshop Proceedings of the International Conference on Learning Representations (ICLR) 2013. arXiv preprint arXiv:1301.3781.
  • Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Nivre, J., M.-C. de Marneffe, F. Ginter, J. Hajič, C. D. Manning, S. Pyysalo, S. Schuster, F. Tyers, and D. Zeman. 2020. Universal Dependencies v2: An ever-growing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France, May. European Language Resources Association.
  • Ortiz Suárez, P. J., L. Romary, and B. Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online, July. Association for Computational Linguistics.
  • Padró, L. 2011. Analizadores Multilingües en FreeLing. Linguamática, 3(2):13–20.
  • Pennington, J., R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October.
  • Peters, M., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June.
  • Pires, T., E. Schlinger, and D. Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy, July.
  • Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Rojo, G., M. López Martínez, E. Domínguez Noya, and F. Barcala. 2019. Corpus de adestramento do Etiquetador/Lematizador do Galego Actual (XIADA), versión 2.7. Centro Ramón Piñeiro para a investigación en humanidades.
  • Salant, S. and J. Berant. 2018. Contextualized word representations for reading comprehension. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 554–559, New Orleans, Louisiana, June.
  • Samartim, R. 2012. Língua somos: A construção da ideia de língua e da identidade coletiva na Galiza (pré-)constitucional. In Novas achegas ao estudo da cultura galega II: enfoques socio-históricos e lingüístico-literarios, pages 27–36.
  • Sanh, V., L. Debut, J. Chaumond, and T. Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Proceedings of The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing at NeurIPS2019. arXiv preprint arXiv:1910.01108.
  • Schnabel, T., I. Labutov, D. Mimno, and T. Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298–307, Lisbon, Portugal, September.
  • Souza, F., R. Nogueira, and R. Lotufo. 2019. Portuguese named entity recognition using BERT-CRF. arXiv preprint arXiv:1909.10649.
  • Strzyz, M., D. Vilares, and C. Gómez-Rodríguez. 2019. Viable dependency parsing as sequence labeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 717–723, Minneapolis, Minnesota, June.
  • TALG. 2016. CTG Corpus (Galician Technical Corpus). TALG Research Group. SLI resources, 1.0, ISLRN 437-045-879-366-6.
  • TALG. 2018. SLI NERC Galician Gold CoNLL. TALG Research Group. SLI resources, 1.0, ISLRN 435-026-256-395-4.
  • Teyssier, P. 1987. História da Língua Portuguesa. Livraria Sá da Costa, Lisbon, 3 edition.
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention Is All You Need. arXiv preprint arXiv:1706.03762.
  • Vilares, D. and C. Gómez-Rodríguez. 2018. Transition-based parsing with lighter feedforward networks. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 162–172, Brussels, Belgium, November. Association for Computational Linguistics.
  • Vilares, D., M. Strzyz, A. Søgaard, and C. Gómez-Rodríguez. 2020. Parsing as pretraining. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, pages 9114–9121.
  • Virtanen, A., J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076.
  • Vulić, I., E. M. Ponti, R. Litschko, G. Glavaš, and A. Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240, Online, November.
  • Wenzek, G., M.-A. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and E. Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France, May. European Language Resources Association.
  • Wolf, T., L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.
  • Wu, S. and M. Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online, July. Association for Computational Linguistics.