La construcción de un corpus paralelo bilingüe multifuncional
Universidade de Santiago de Compostela
ISSN: 1137-2346
Year of publication: 2017
Issue title: Morfosintaxis y semántica del verbo en español: historia y descripción
Issue: 23
Pages: 717-734
Type: Article
Other publications in: Moenia: Revista lucense de lingüística & literatura
This article describes the steps involved in building a bilingual parallel corpus intended for multiple purposes, and the issues to consider along the way, with a special focus on cross-linguistic research, translation, and foreign language teaching. The process is exemplified by the creation of PaGeS, a German/Spanish parallel corpus that can be searched online through a web interface. Although originally created for cross-linguistic research, the corpus aims to cover a wide range of uses. The paper describes the phases of corpus construction: compilation, preprocessing, corpus markup, linguistic annotation, and alignment of the data. Finally, it presents the web interface and the search options available to the different user groups.
Keywords: corpus linguistics, parallel corpus, cross-linguistics, translation.
References
- BAKER, M. (1996): “Corpus-based translation studies: The challenges that lie ahead”. In H. Somers (ed.): Terminology, LSP and Translation. Amsterdam: John Benjamins, 175-86.
- BERNARDINI, S. (2004): “Corpora in the Classroom: An Overview and Some Reflections on Future Developments”. In J. Sinclair (ed.): How to Use Corpora in Language Teaching. Amsterdam: John Benjamins, 15-36.
- BRAUNE, F. & A. FRASER (2010): “Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora”. COLING ’10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Beijing, China, 81-9.
- BROWN, P. F., J. C. LAI & R. L. MERCER (1991): “Aligning Sentences in Parallel Corpora”. Proceedings of the 29th Annual Meeting on Association for Computational Linguistics, ACL ’91. Stroudsburg, PA: ACL, 169-76.
- DOVAL, I. (2016): “Bilingual Parallel Corpora for Linguistic Research”. EPiC Series in Language and Linguistics 1, 88-96.
- GALE, W. A. & K. W. CHURCH (1993): “A Program for Aligning Sentences in Bilingual Corpora”. Computational Linguistics 19/1, 75-102.
- HAJLAOUI, N. et al. (2014): “DCEP-Digital Corpus of the European Parliament”. Proceedings LREC 2014 (Language Resources and Evaluation Conference). Reykjavik, Iceland. May 26-31, 2014, 3164-71. Online: <>.
- HEID, U. (2008): “Corpus linguistics and lexicography”. In Lüdeling & Kytö (2008: 131-53).
- LÜDELING, A. & M. KYTÖ (2008): Corpus Linguistics. An International Handbook. Vol. 1. Handbücher zur Sprach- und Kommunikationswissenschaft. Berlin: Walter de Gruyter.
- HILL, T. (2011): El verano de los juguetes muertos. Barcelona: Penguin Random House. [Der Sommer der toten Puppen. Berlin: Suhrkamp, 2013.]
- JOHANSSON, S. (2007a): Seeing through Multilingual Corpora: On the use of corpora in contrastive studies. Amsterdam: John Benjamins.
- JOHANSSON, S. (2007b): “Using Corpora: From Learning to Research”. In E. Hidalgo, L. Quereda & J. Santana (eds.): Corpora in the Foreign Language Classroom. Amsterdam: Rodopi, 17-30.
- KOEHN, P. (2005): “EuroParl: A parallel corpus for statistical machine translation”. Proceedings of the machine translation summit. Phuket: AAMT, 79-86. Online: <>.
- KAY, M. & M. RÖSCHEISEN (1993): “Text-translation Alignment”. Computational Linguistics 19/1, 121-42.
- MCENERY, T. & A. HARDIE (2012): Corpus Linguistics. Cambridge: Cambridge University Press.
- PADRÓ, L. (2011): “Analizadores Multilingües en FreeLing”. Linguamática 3/2, 13-20.
- RÖMER, U. (2008): “Corpora and language teaching”. In Lüdeling & Kytö (2008: 112-31).
- SCHMID, H. (1995): “Improvements in Part-of-Speech Tagging with an Application to German”. Proceedings of the ACL SIGDAT-Workshop. Dublin, 47-50. Online: <>.
- STEINBERGER, R. et al. (2014): “An overview of the European Union’s highly multilingual parallel corpora”. Language Resources and Evaluation Journal 48/4, 679-707.
- TIEDEMANN, J. (2011): Bitext Alignment. Toronto: Morgan & Claypool.
- TIEDEMANN, J. (2012): “Parallel Data, Tools and Interfaces in OPUS”. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012). Paris: ELRA, 2214-8. Online: <>.
- VARGA, D. et al. (2007): “Parallel corpora for medium density languages”. In N. Nicolov et al. (eds.): Recent Advances in Natural Language Processing IV. Amsterdam: John Benjamins, 590-6.
- VOLK, M., J. GRAËN & E. CALLEGARO (2014): “Innovations in Parallel Corpus Search Tools”. In N. Calzolari et al. (eds.): Proceedings LREC 2014, 3172-8. Online: <>.
- WETZEL, D. & F. BOND (2012): “Enriching parallel corpora for statistical machine translation with semantic negation rephrasing”. In M. Carpuat, L. Specia & D. Wu (eds.): Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation. Stroudsburg: ACL, 20-9. Online: <>.
- ZHEKOVA, D. et al. (2016): “Alignment and Application of Russian-German Multi-Target Parallel Corpora for Linguistic Analysis and Literary Studies”. MATLIT 4/1, 45-61. Online: <>.