Distancia diacrónica interlingüística: aplicación al portugués y el castellano

Gamallo Otero, Pablo; Alegría Loinaz, Iñaki; Pichel Campos, José Ramom

Journal:

Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2019

Issue: 63

Pages: 77-84

Type: Article

Abstract

The aim of this paper is to establish a corpus-based methodology for automatically measuring the cross-lingual distance between historical periods of two languages using perplexity. The corpus for each language was built ad hoc, keeping the spelling as close as possible to the original and representing fiction and non-fiction chronologically and in a balanced way. The methodology was applied to two related languages, Portuguese and Spanish, whose diachronic distances were measured both in the original orthography and in an automatically transcribed spelling.
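
As a rough illustration of what a perplexity-based distance can look like, the sketch below trains a character trigram model on one text and computes the perplexity of another text under it. The abstract does not specify the model, so everything here is an assumption made for illustration: the function names (train_char_ngrams, perplexity, distance), the trigram order, the add-one smoothing, and the symmetric averaging are not taken from the paper.

```python
import math
from collections import Counter

def train_char_ngrams(text, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    padded = " " * (n - 1) + text
    ngrams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    return ngrams, contexts, set(text)

def perplexity(model, text, n=3):
    """Perplexity of a non-empty text under an add-one smoothed model."""
    ngrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot so unseen characters get non-zero mass
    padded = " " * (n - 1) + text
    log_prob = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + v)
        log_prob += math.log2(p)
    # PP = 2 ** (-(1/N) * sum of log2 probabilities)
    return 2 ** (-log_prob / len(text))

def distance(text_a, text_b, n=3):
    """Hypothetical symmetric distance: mean of the two cross-perplexities."""
    pp_ab = perplexity(train_char_ngrams(text_a, n), text_b, n)
    pp_ba = perplexity(train_char_ngrams(text_b, n), text_a, n)
    return (pp_ab + pp_ba) / 2

# Toy usage: in practice each argument would be the corpus of one
# language in one historical period.
period_a = "abababababababab"
period_b = "cdcdcdcdcdcdcdcd"
print(distance(period_a, period_b))
```

A higher value means that each text is less predictable from a model of the other, which is the intuition behind reading perplexity as a distance between language periods.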

Funding information

The authors thank the referees for their thoughtful comments and helpful suggestions. We are very grateful to Fernando Venâncio from the University of Amsterdam, and to José António Souto Cabo and Carlos Quiroga from the University of Santiago de Compostela, for their expertise in Portuguese and Spanish language history. This work has received financial support from the DOMINO project (PGC2018-102041-B-I00, MCIU/AEI/FEDER, UE), the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08), and the European Regional Development Fund (ERDF).

Funders

- ED431G/08
Ministerio de Ciencia, Innovación y Universidades Spain
European Regional Development Fund European Union
Agencia Estatal de Investigación Spain

Bibliographic References

Asgari, E. and M. R. K. Mofrad. 2016. Comparing fifty natural languages and twelve genetic languages using word embedding language divergence (WELD) as a quantitative measure of language distance. In Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, pages 65–74, San Diego, California.
Bakker, D., A. Muller, V. Velupillai, S. Wichmann, C. H. Brown, P. Brown, D. Egorov, R. Mailhammer, A. Grant, and E. W. Holman. 2009. Adding typology to lexicostatistics: A combined approach to language classification. Linguistic Typology, 13(1):169–181.
Barbançon, F., S. Evans, L. Nakhleh, D. Ringe, and T. Warnow. 2013. An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica, 30:143–170.
Biber, D. 1993. Representativeness in corpus design. Literary and linguistic computing, 8(4):243–257.
Brown, C. H., E. W. Holman, S. Wichmann, and V. Velupillai. 2008. Automated classification of the world’s languages: a description of the method and preliminary results. Language Typology and Universals, 61(4).
Chen, S. F. and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL ’96, pages 310–318, Stroudsburg, PA, USA. Association for Computational Linguistics.
Chiswick, B. and P. Miller. 2004. Linguistic Distance: A Quantitative Measure of the Distance Between English and Other Languages. Discussion papers. IZA.
Corredoira, F. V. 1998. A construção da língua portuguesa frente ao castelhano: o galego como exemplo a contrario.
Curell, C. 2006. La influencia del francés en el español contemporáneo. In La cultura del otro: español en Francia, francés en España, pages 785–792. Universidad de Sevilla.
Degaetano-Ortlieb, S., H. Kermes, A. Khamis, and E. Teich. 2016. An information-theoretic approach to modeling diachronic change in scientific English. Selected Papers from Varieng-From Data to Evidence (d2e).
Ellison, T. M. and S. Kirby. 2006. Measuring language divergence by intra-lexical comparison. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, pages 273–280.
Galves, C. and P. Faria. 2010. Tycho Brahe parsed corpus of historical Portuguese. URL: http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html.
Gamallo, P., I. Alegria, J. R. Pichel, and M. Agirrezabal. 2016. Comparing two basic methods for discriminating between similar languages and varieties. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 170–177.
Gamallo, P., J. R. Pichel, and I. Alegria. 2017. From language identification to language distance. Physica A: Statistical Mechanics and its Applications, 484:152–162.
Gao, Y., W. Liang, Y. Shi, and Q. Huang. 2014. Comparison of directed and weighted co-occurrence networks of six languages. Physica A: Statistical Mechanics and its Applications, 393(C):579–589.
González, M. 2015. An analysis of twitter corpora and the differences between formal and colloquial tweets. In Proceedings of the Tweet Translation Workshop 2015, pages 1–7.
Holman, E., S. Wichmann, C. Brown, V. Velupillai, A. Muller, and D. Bakker. 2008. Explorations in automated lexicostatistics. Folia Linguistica, 42(2):331–354.
Liu, H. and J. Cong. 2013. Language clustering with word co-occurrence networks based on parallel texts. Chinese Science Bulletin, 58(10):1139–1144.
Malmasi, S., M. Zampieri, N. Ljubešić, P. Nakov, A. Ali, and J. Tiedemann. 2016. Discriminating between similar languages and Arabic dialect identification: A report on the third DSL Shared Task. In Proceedings of the 3rd Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (VarDial), pages 1–14, Osaka, Japan.
Millar, R. M. and L. Trask. 2015. Trask’s historical linguistics. Routledge.
Nakhleh, L., D. A. Ringe, and T. Warnow. 2005. Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language, 81(2):382–420.
Nerbonne, J. and W. Heeringa. 1997. Measuring dialect distance phonetically. In Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology, pages 11–18.
Petroni, F. and M. Serva. 2010. Measures of lexical distance between languages. Physica A: Statistical Mechanics and its Applications, 389(11):2280–2283.
Pichel, J. R., P. Gamallo, and I. Alegria. 2018. Measuring language distance among historical varieties using perplexity. Application to European Portuguese. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 145–155.
Rama, T., L. Borin, G. Mikros, and J. Macutek. 2015. Comparative evaluation of string similarity measures for automatic language classification.
Rama, T. and A. K. Singh. 2009. From bag of languages to family trees from noisy corpus. In Proceedings of the International Conference RANLP-2009, pages 355–359.
Rissanen, M. et al. 1993. The Helsinki corpus of English texts. Kytö et al., pages 73–81.
Satterthwaite-Phillips, D. 2011. Phylogenetic Inference of the Tibeto-Burman Languages Or on the Usefulness of Lexicostatistics (and “megalo”-comparison) for the Subgrouping of Tibeto-Burman. Stanford University.
Sennrich, R. 2012. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’12, pages 539–549, Stroudsburg, PA, USA. Association for Computational Linguistics.
Singh, A. K. and H. Surana. 2007. Can corpus based measures be used for comparative study of languages? In Proceedings of ninth meeting of the ACL special interest group in computational morphology and phonology, pages 40–47. Association for Computational Linguistics.
Swadesh, M. 1952. Lexicostatistic dating of prehistoric ethnic contacts. In Proceedings of the American Philosophical Society 96, pages 452–463.
Venâncio, F. 2014. O castelhano como vernáculo português. https://pgl.gal/o-castelhano-como-vernaculo-portugues/
Xavier, M. F., M. T. Brocardo, and M. Vincente. 1994. Cipm–um corpus informatizado do português medieval. Actas do X Encontro da Associação Portuguesa de Linguística, 2:599–612.
Yujian, L. and L. Bo. 2007. A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence, 29(6):1091–1095.
Zampieri, M. 2017. Compiling and processing historical and contemporary Portuguese corpora. arXiv preprint arXiv:1710.00803.

Data source: Dialnet