iRead4Skills Dataset 1: corpora by complexity level for FR, PT and SP

Pintard, Alice; François, Thomas; Nagant de Deuxchaisnes, Justine; Barbosa, Sílvia; Reis, Maria Leonor; Moutinho, Michell; Monteiro, Ricardo; Amaro, Raquel; Correia, Susana; Rodríguez Rey, Sandra; Garcia González, Marcos; Mu, Keran; Blanco Escoda, Xavier

doi:10.5281/ZENODO.13768410

Pintard, Alice ¹²
François, Thomas ¹²
Nagant de Deuxchaisnes, Justine ¹²
Barbosa, Sílvia ³⁴
Reis, Maria Leonor ³⁴
Moutinho, Michell ³⁴
Monteiro, Ricardo ³⁴
Amaro, Raquel ³⁴
Correia, Susana ³⁴
Rodríguez Rey, Sandra ⁵⁶
Garcia González, Marcos ⁵⁶
Mu, Keran ⁷
Blanco Escoda, Xavier ⁷

1 Université Catholique de Louvain, Louvain-la-Neuve, Belgium (ROR: https://ror.org/02495e989)
2 CENTAL
3 Universidade Nova de Lisboa, Lisboa, Portugal (ROR: https://ror.org/02xankh89)
4 CLUNL
5 CITIUS
6 Universidade de Santiago de Compostela, Santiago de Compostela, Spain (ROR: https://ror.org/030eybx10)
7 Universitat Autònoma de Barcelona, Barcelona, Spain (ROR: https://ror.org/052g8jq94)

Publisher: Zenodo

Year of publication: 2024

Type: Dataset

License: CC BY-NC-ND 4.0

DOI: 10.5281/ZENODO.13768410 (open access)

Abstract

The iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP is a collection of written texts of several genres and levels of complexity, in txt format, compiled within the scope of the iRead4Skills project (Intelligent Reading Improvement System for Fundamental and Transversal Skills Development). The project, funded by the European Commission (grant number: 1010094837), aims to improve reading skills in the adult population by creating an intelligent system that assesses text complexity and suggests appropriate reading materials to adults with low literacy skills, thereby helping to reduce skills gaps and to provide access to information and culture (https://iread4skills.com/). The compilation of this first dataset was based on the complexity levels established as relevant for the project (Very Easy (approx. A1), Easy (approx. A2) and Plain (approx. B1)) and on the expected needs of learners and trainers. For some genres, there are also texts of a more complex level. The data will provide the basis for the training and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The dataset will be further enhanced, validated, and annotated by end-users, giving rise to forthcoming versions and a second, derived dataset. The resource is composed of three subcorpora: French, Portuguese and Spanish.
Each of the subcorpora considers different complexity levels and covers texts from the following communication domains:

01_personal communication
02_institutional/professional communication
03_social media
04_commercial communication/dissemination
05_non-fiction book
06_fiction book
07_didactic book
08_academic/school
09_political communication/dissemination
10_legal documentation
11_religious texts/dissemination

French corpus:
Number of texts: 2 199
Number of tokens: 530 298

Spanish corpus:
Number of texts: 2 533
Number of tokens: 960 644

Portuguese corpus:
Number of texts: 2 933
Number of tokens: 946 131

More information about the corpus constitution and text samples is available at https://iread4skills.com/tools-resources/ under D3.3 Data set 1: corpora by level of complexity FR, PT and SP.
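
For a quick sense of scale, the corpus sizes reported above can be combined into overall totals and an average text length per language. The following is only a minimal sketch: the figures are copied from this abstract, while the names `CORPUS_SIZES` and `summarize` are illustrative and not part of the dataset or the project's tooling.

```python
# Corpus sizes as reported in the abstract: (number of texts, number of tokens).
CORPUS_SIZES = {
    "French": (2199, 530298),
    "Spanish": (2533, 960644),
    "Portuguese": (2933, 946131),
}


def summarize(sizes):
    """Return total texts, total tokens, and mean tokens per text by language."""
    total_texts = sum(texts for texts, _ in sizes.values())
    total_tokens = sum(tokens for _, tokens in sizes.values())
    mean_tokens = {
        lang: tokens / texts for lang, (texts, tokens) in sizes.items()
    }
    return total_texts, total_tokens, mean_tokens


total_texts, total_tokens, mean_tokens = summarize(CORPUS_SIZES)
print(f"Total: {total_texts} texts, {total_tokens} tokens")
for lang, mean in mean_tokens.items():
    print(f"{lang}: {mean:.1f} tokens per text on average")
```

These averages are derived purely from the published counts; they say nothing about how text length is distributed across complexity levels or communication domains.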
