Beyond Readability: a corpus-based proposal for text difficulty analysis

Bibliographic details
Year of defense: 2018
Main author: Filipe Rubini Castano
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Master's thesis
Access type: Open access
Language: Portuguese
Defending institution: Universidade Federal de Minas Gerais (UFMG)
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/LETR-B8QFX3
Abstract: Since the first half of the twentieth century (Flesch, 1948), the task of assessing text difficulty has been tackled primarily through the design and use of readability formulas in many areas: selecting grade-level-appropriate books for schoolchildren (Spache, 1953), simplifying dense material such as medical and legal texts (L. M. Baker et al., 1997; Razek et al., 1982), and, in more recent years, helping writers make themselves more understandable (Readable.io, n.d.). However, there is little empirical demonstration of the validity of readability formulas, as shown, for instance, in Begeny and Greene (2014), Leroy and Kauchak (2014), Schriver (2000), and Sydes and Hartley (1997), and many of the tools currently available for assessing text difficulty, e.g. ATOS for Text, ATOS for Books (n.d.), Miltsakaki and Troutt (2007), and Readable.io (n.d.), depend on those formulas to function. In addition, these tools are quite limited, being meant for a specific language, text type, and intended audience. In this work, we develop a corpus-linguistics-based, lexicon-oriented approach to propose a Text Difficulty Scale (TDS) which, unlike previous efforts, can be adapted to texts in virtually any language, including those that use non-Latin writing systems. To that end, we have used sounder statistical measurements, such as the deviation of proportions (DP) (Gries, 2008, 2010); included 2-grams and 3-grams as sources of numerous yet often disregarded idioms and phrasemes (Bu et al., 2011, p. 3); and built a 60+ million-token collection of Wikipedia articles in English for demonstration purposes. Furthermore, we have made our work available, free and open source, as a set of Jupyter Notebooks in the Python programming language.
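The two techniques named in the abstract can be illustrated briefly. The sketch below is not taken from the thesis's notebooks; it is a minimal Python rendering of Gries's published formula for the deviation of proportions (DP = half the sum of absolute differences between each corpus part's observed share of a word's occurrences and that part's share of the corpus) and of plain n-gram extraction. Function names and the toy inputs are illustrative only.

```python
def deviation_of_proportions(part_freqs, part_sizes):
    """Gries's DP dispersion measure.

    part_freqs: occurrences of the item in each corpus part.
    part_sizes: token count of each corpus part.
    Returns a value from 0 (perfectly even dispersion)
    towards 1 (concentrated in one small part).
    """
    total_freq = sum(part_freqs)
    total_size = sum(part_sizes)
    observed = [f / total_freq for f in part_freqs]   # share of the item's hits
    expected = [s / total_size for s in part_sizes]   # share of the corpus
    return 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))


def ngrams(tokens, n):
    """Contiguous n-grams of a token list (n=2 and n=3 in the thesis)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# A word spread evenly over four equal parts disperses perfectly (DP = 0);
# one confined to a single part is maximally clumped for this corpus shape.
print(deviation_of_proportions([5, 5, 5, 5], [100, 100, 100, 100]))   # 0.0
print(deviation_of_proportions([20, 0, 0, 0], [100, 100, 100, 100]))  # 0.75
print(ngrams(["kick", "the", "bucket"], 2))  # [('kick', 'the'), ('the', 'bucket')]
```

Low-DP (well-dispersed) words and n-grams are the kind of vocabulary a difficulty scale can treat as broadly familiar, which is why a dispersion measure is a sounder basis than raw frequency alone.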
We argue not only that our proposal offers a much-needed flexible measurement of text difficulty, in particular for teachers and students of foreign languages, but also that it could be useful for researchers in cognitive linguistics and psycholinguistics, as well as for editors, writers, and children acquiring their first language.