Bibliographic details
Defense year: |
2014 |
Main author: |
SANTOS, Maelyson Rolim Fonseca dos
 |
Advisor: |
FIGUEIRÊDO, Pedro Hugo de |
Defense committee: |
Not provided by the institution |
Document type: |
Master's dissertation
|
Access type: |
Open access |
Language: |
Portuguese |
Defense institution: |
Universidade Federal Rural de Pernambuco
|
Graduate program: |
Programa de Pós-Graduação em Física Aplicada
|
Department: |
Departamento de Física
|
Country: |
Brazil
|
Keywords in Portuguese: |
|
CNPq knowledge area: |
|
Access link: |
http://www.tede2.ufrpe.br:8080/tede2/handle/tede2/6857
|
Abstract: |
The investigation of the evolution and characterization of different human languages has been one of the most active research fields in recent decades. Although the search for linguistic patterns that can establish a phylogeny of languages is much older, the statistical characterization of written language, commonly called quantitative linguistics, has a more recent tradition that builds on the work of Claude Shannon and George Zipf, written at the end of the 1940s. In this work we investigate some statistical aspects of the frequencies and positions of words in texts and the role of these quantities in the information contained in written language. First, we explore the scaling relationship between the vocabulary V and the text size T, called Heaps' Law, which according to our results is typical for each language. We establish, empirically, a functional relationship between the maximum frequency kmax and the total number of words in the text. Second, we analyze morphological features of symbols, obtaining the word-size distribution and its respective entropy. We conclude that this procedure allows us to categorize different linguistic groups. Finally, we introduce two models that provide universal limiting behaviors for the relationship between the standard deviation and the frequency k. The models were designed to describe the behavior of correlated and uncorrelated words, reproducing various properties of texts such as the fraction f of correlated words and the structural entropy H. All our theoretical results were compared with those obtained from 500 texts that include Wikipedia articles and literary works from various epochs in 10 languages distributed across three linguistic families: Germanic (German, Danish, Swedish and English), Romance (Spanish, Italian, French and Portuguese) and Uralic (Finnish and Hungarian). |
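As an illustration of two quantities named in the abstract, the sketch below measures vocabulary growth V(T), fits the usual power-law form of Heaps' Law, V = K · T^β, by least squares in log-log space, and computes the Shannon entropy of the word-size distribution. It is a minimal sketch, not the dissertation's method: the tokenizer, the symbols K and β, the log-log fit, and all function names are assumptions introduced here for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as words.

    Assumption: this simple rule stands in for whatever tokenization
    the original study used.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def vocabulary_growth(tokens):
    """Return (T, V) pairs: text size T and distinct words V seen so far."""
    seen = set()
    growth = []
    for t, word in enumerate(tokens, start=1):
        seen.add(word)
        growth.append((t, len(seen)))
    return growth


def fit_heaps(growth):
    """Fit log V = log K + beta * log T by ordinary least squares.

    Needs at least two points with different T. Returns (K, beta).
    """
    xs = [math.log(t) for t, _ in growth]
    ys = [math.log(v) for _, v in growth]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return math.exp(mean_y - beta * mean_x), beta


def word_length_entropy(tokens):
    """Shannon entropy, in bits, of the distribution of word lengths."""
    counts = Counter(len(word) for word in tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Applied to a real corpus, `fit_heaps(vocabulary_growth(tokenize(text)))` gives one (K, β) pair per text, and `word_length_entropy(tokenize(text))` gives one entropy value per text; comparing these across languages is the kind of analysis the abstract describes, though the exact procedure used there is not stated in this record.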