Beyond Readability: a corpus-based proposal for text difficulty analysis
| Year of defense: | 2018 |
|---|---|
| Main author: | |
| Advisor: | |
| Defense committee: | |
| Document type: | Master's thesis |
| Access type: | Open access |
| Language: | Portuguese (por) |
| Defending institution: | Universidade Federal de Minas Gerais (UFMG) |
| Graduate program: | Not informed by the institution |
| Department: | Not informed by the institution |
| Country: | Not informed by the institution |
| Keywords in Portuguese: | |
| Access link: | http://hdl.handle.net/1843/LETR-B8QFX3 |
Abstract: Since the first half of the twentieth century (Flesch, 1948), the task of assessing text difficulty has been primarily tackled by the design and use of readability formulas in many areas: selecting grade-level-appropriate books for schoolchildren (Spache, 1953), simplifying dense subjects, such as medical and legal texts (L. M. Baker et al., 1997; Razek et al., 1982), and, in more recent years, assisting writers in making themselves more understandable (Readable.io, n.d.). However, there is little empirical demonstration of the validity of readability formulas, as shown for instance in Begeny and Greene (2014), Leroy and Kauchak (2014), Schriver (2000), and Sydes and Hartley (1997), and many of the tools currently available for assessing text difficulty, e.g. ATOS for Text, ATOS for Books (n.d.), Miltsakaki and Troutt (2007), and Readable.io (n.d.), depend on those formulas to function. In addition, these tools are quite limited, meant to be used for a specific language, text type, and intended audience. In this work, we develop a corpus-linguistics-based, lexicon-oriented approach to propose a Text Difficulty Scale (TDS) which, unlike previous efforts, can be adapted to texts of virtually any language, including those that use non-Latin writing systems. To that end, we have used sounder statistical measurements, such as deviation of proportions (DP) (Gries, 2008, 2010); included 2-grams and 3-grams as sources of numerous yet often disregarded idioms and phrasemes (Bu et al., 2011, p. 3); and built a 60+ million-token collection of Wikipedia articles in English for demonstration purposes. Furthermore, we have made our work available, free and open-source, as a set of Jupyter Notebooks in the Python programming language. We argue that our proposal not only offers a much-needed flexible measurement of text difficulty, in particular for teachers and students of foreign languages, but also that it could be useful for researchers in cognitive linguistics and psycholinguistics, editors, writers, and children acquiring their first language.
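The abstract names two concrete ingredients of the approach: Gries's deviation of proportions (DP) as a dispersion measure and 2-/3-grams as lexical units. The sketch below is only an illustration of those two ideas, not the dissertation's actual notebooks; it assumes a corpus already split into tokenized parts and applies the DP formula from Gries (2008), where a word's share of occurrences in each part is compared with that part's share of the corpus.

```python
from collections import Counter

def deviation_of_proportions(freqs_per_part, part_sizes):
    """Gries's DP: ~0 means an item is spread evenly across corpus parts,
    values near 1 mean it is concentrated in very few parts."""
    total_freq = sum(freqs_per_part)
    total_size = sum(part_sizes)
    if total_freq == 0:
        return 1.0  # item never occurs; treat as maximally uneven
    dp = 0.0
    for freq, size in zip(freqs_per_part, part_sizes):
        observed = freq / total_freq   # share of the item's occurrences in this part
        expected = size / total_size   # share of the corpus this part represents
        dp += abs(observed - expected)
    return dp / 2

def ngrams(tokens, n):
    """Yield contiguous n-grams (as tuples) from a token list."""
    return zip(*(tokens[i:] for i in range(n)))

# Toy three-part "corpus" for demonstration only
parts = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "a bird sang".split(),
]
part_sizes = [len(p) for p in parts]
freqs = [Counter(p)["the"] for p in parts]

print(deviation_of_proportions(freqs, part_sizes))   # dispersion of "the"
print(Counter(ngrams(parts[0], 2)).most_common(3))   # frequent bigrams in part 1
```

Read this way, DP rewards items that occur steadily across a corpus rather than in isolated clusters, which is one sense in which it is a sounder basis for a difficulty lexicon than raw frequency counts alone.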